diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt index e9f5cec44d0..4cd9e6c08b4 100644 --- a/.ci/spellcheck/.pyspelling.wordlist.txt +++ b/.ci/spellcheck/.pyspelling.wordlist.txt @@ -297,6 +297,8 @@ instantiation InstructGPT InstructPix intel +InternLM +internlm invertible intervaling im diff --git a/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb b/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb index f1057665e12..f001628ae97 100644 --- a/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb +++ b/notebooks/254-llm-chatbot/254-llm-chatbot.ipynb @@ -53,7 +53,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 33, "id": "563ecf9f-346b-4f14-85ef-c66ff0c95f65", "metadata": { "tags": [] @@ -65,7 +65,13 @@ "text": [ "\u001b[33mWARNING: Skipping openvino-dev as it is not installed.\u001b[0m\u001b[33m\n", "\u001b[0m\u001b[33mWARNING: Skipping openvino as it is not installed.\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Skipping openvino-nightly as it is not installed.\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Skipping optimum as it is not installed.\u001b[0m\u001b[33m\n", + "\u001b[0m\u001b[33mWARNING: Skipping optimum-intel as it is not installed.\u001b[0m\u001b[33m\n", "\u001b[0mNote: you may need to restart the kernel to use updated packages.\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.1.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.0\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", "Note: you may need to restart the kernel to use updated packages.\n" ] } @@ -148,15 +154,17 @@ " except OSError:\n", " notebook_login()\n", "```\n", - "* **qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about model in the [model card](https://huggingface.co/Qwen/Qwen1.5-7B-Chat).\n", + "* **qwen1.5-0.5b-chat/qwen1.5-1.8b-chat/qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about model in the [model repository](https://huggingface.co/Qwen).\n", + "* **qwen-7b-chat** - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. For more details about Qwen, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.\n", "* **mpt-7b-chat** - MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. 
These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. It was built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3), [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets. More details about the model can be found in the [blog post](https://www.mosaicml.com/blog/mpt-7b), [repository](https://github.com/mosaicml/llm-foundry/) and [HuggingFace model card](https://huggingface.co/mosaicml/mpt-7b-chat).\n",
    "* **chatglm3-6b** - ChatGLM3-6B is the latest open-source model in the ChatGLM series. While retaining many excellent features such as smooth dialogue and low deployment threshold from the previous two generations, ChatGLM3-6B employs a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. ChatGLM3-6B adopts a newly designed [Prompt format](https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.md), in addition to the normal multi-turn dialogue. You can find more details about the model in the [model card](https://huggingface.co/THUDM/chatglm3-6b)\n",
    "* **mistral-7b** - The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about the model in the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1), [paper](https://arxiv.org/abs/2310.06825) and [release blog post](https://mistral.ai/news/announcing-mistral-7b/).\n",
    "* **zephyr-7b-beta** - Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-beta is the second model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). You can find more details about the model in the [technical report](https://arxiv.org/abs/2310.16944) and [HuggingFace model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).\n",
-    "* **neural-chat-7b-v3-1** - Mistral-7b model fine-tuned using Intel Gaudi. The model was fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with the [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in the [model card](https://huggingface.co/Intel/neural-chat-7b-v3-3) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).\n",
+    "* **neural-chat-7b-v3-1** - Mistral-7b model fine-tuned using Intel Gaudi. The model was fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with the [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in the [model card](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).\n",
    "* **notus-7b-v1** - Notus is a collection of fine-tuned models using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) and related [RLHF](https://huggingface.co/blog/rlhf) techniques. This model is the first version, fine-tuned with DPO over zephyr-7b-sft. Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. The proposed approach for dataset creation helps to effectively fine-tune Notus-7b, which surpasses Zephyr-7B-beta and Claude 2 on [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). More details about the model can be found in the [model card](https://huggingface.co/argilla/notus-7b-v1).\n",
    "* **youri-7b-chat** - Youri-7b-chat is a Llama2 based model. [Rinna Co., Ltd.](https://rinna.co.jp/) conducted further pre-training for the Llama2 model with a mixture of English and Japanese datasets to improve Japanese task capability. The model is publicly released on Hugging Face hub. You can find detailed information at the [rinna/youri-7b-chat project page](https://huggingface.co/rinna/youri-7b). \n",
-    "* **baichuan2-7b-chat** - Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc](https://www.baichuan-ai.com/home). It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size."
+    "* **baichuan2-7b-chat** - Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc](https://www.baichuan-ai.com/home). It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size.\n",
+    "* **internlm2-chat-1.8b** - InternLM2 is the second generation InternLM series. Compared to the previous generation model, it shows significant improvements in various capabilities, including reasoning, mathematics, and coding.
More details about model can be found in [model repository](https://huggingface.co/internlm).\n" ] }, { @@ -175,6 +183,41 @@ { "cell_type": "code", "execution_count": 2, + "id": "e02b34fb", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "1e0cf311ef1548abaf807081974c7306", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Dropdown(description='Model Language:', options=('English', 'Chinese', 'Japanese'), value='English')" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_languages = list(SUPPORTED_LLM_MODELS)\n", + "\n", + "model_language = widgets.Dropdown(\n", + " options=model_languages,\n", + " value=model_languages[0],\n", + " description=\"Model Language:\",\n", + " disabled=False,\n", + ")\n", + "\n", + "model_language" + ] + }, + { + "cell_type": "code", + "execution_count": 3, "id": "8d22fedb-d1f6-4306-b910-efac5b849c7c", "metadata": { "tags": [] @@ -183,21 +226,21 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "91cd66a3bb4244a888758faae943ce06", + "model_id": "ab989ac9234b457292153293e6885a7f", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "Dropdown(description='Model:', options=('tiny-llama-1b-chat', 'minicpm-2b-dpo', 'gemma-2b-it', 'red-pajama-3b-…" + "Dropdown(description='Model:', options=('qwen1.5-0.5b-chat', 'qwen1.5-1.8b-chat', 'qwen1.5-7b-chat', 'qwen-7b-…" ] }, - "execution_count": 2, + "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ - "model_ids = list(SUPPORTED_LLM_MODELS)\n", + "model_ids = list(SUPPORTED_LLM_MODELS[model_language.value])\n", "\n", "model_id = widgets.Dropdown(\n", " options=model_ids,\n", @@ -211,7 +254,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 4, "id": "906022ec-96bf-41a9-9447-789d2e875250", "metadata": { "tags": [] @@ -221,12 +264,12 @@ "name": "stdout", "output_type": "stream", "text": [ - "Selected model qwen1.5-7b-chat\n" + "Selected model qwen-7b-chat\n" ] } ], "source": [ - "model_configuration = SUPPORTED_LLM_MODELS[model_id.value]\n", + "model_configuration = SUPPORTED_LLM_MODELS[model_language.value][model_id.value]\n", "print(f\"Selected model {model_id.value}\")" ] }, @@ -263,10 +306,33 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 5, "id": "8cd910c2", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2024-03-07 02:52:02.115283: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2024-03-07 02:52:02.118993: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", + "2024-03-07 02:52:02.161204: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "2024-03-07 02:52:02.161239: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "2024-03-07 02:52:02.161273: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "2024-03-07 02:52:02.169740: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", + "2024-03-07 02:52:02.171079: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", + "2024-03-07 02:52:03.108737: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" + ] + } + ], "source": [ "from transformers import AutoModelForCausalLM, AutoConfig\n", "from optimum.intel.openvino import OVModelForCausalLM\n", @@ -313,7 +379,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": 6, "id": "91eb2ccf", "metadata": { "collapsed": false, @@ -325,7 +391,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "e936bcc9cd494e92a8e3c01f86308969", + "model_id": "e4a960f313294097a93f4f3186c6e8d0", "version_major": 2, "version_minor": 0 }, @@ -339,7 +405,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "9f24a6dd2d194867a2377ceaf55bf524", + "model_id": "bbe56e24c72346c68b1c76f7c61fb020", "version_major": 2, "version_minor": 0 }, @@ -353,7 +419,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "53d3df3423c34d3da0a9862f8f44b787", + "model_id": "c27f6cdbe5ac4ee2bfab907fd0456c4e", "version_major": 2, "version_minor": 0 }, @@ -400,7 +466,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": 7, "id": "c4ef9112", "metadata": { "collapsed": false, @@ -408,7 +474,101 @@ "outputs_hidden": false } }, - "outputs": [], + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "63e1686cf3304ed396e0f52d0e0c43b8", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "modeling_qwen.py: 0%| | 0.00/55.6k [00:00 self._seq_len_cached or ntk_alpha != self._ntk_alpha_cached:\n", + "/home/ethan/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/8d24619bab456ea5abe2823c1d05fc5edec19174/modeling_qwen.py:482: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. 
This means that the trace might not generalize to other inputs!\n", + " if key_size > self.seq_length and self.use_logn_attn and not self.training:\n", + "/home/ethan/.cache/huggingface/modules/transformers_modules/Qwen/Qwen-7B-Chat/8d24619bab456ea5abe2823c1d05fc5edec19174/modeling_qwen.py:502: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n", + " if query.size(1) == key_size:\n", + "/home/ethan/intel/openvino_notebooks/openvino_env/lib/python3.11/site-packages/torch/jit/_trace.py:160: UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:489.)\n", + " if a.grad is not None:\n" + ] + } + ], "source": [ "from optimum.intel import OVWeightQuantizationConfig\n", "\n", @@ -615,7 +775,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 8, "id": "281f1d07-998e-4e13-ba95-0264564ede82", "metadata": {}, "outputs": [ @@ -623,7 +783,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Size of FP16 model is 14743.26 MB\n" + "Size of FP16 model is 14729.26 MB\n" ] } ], @@ -659,7 +819,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 9, "id": "837b4a3b-ccc3-4004-9577-2b2c7b802dea", "metadata": { "tags": [] @@ -668,7 +828,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "39ee8605e5664d5a8fd4e339c7503e22", + "model_id": "aeee5a5edc354fb68e3ea2fb981b7ef8", "version_major": 2, "version_minor": 0 }, @@ -676,7 +836,7 @@ "Dropdown(description='Device:', options=('CPU', 'AUTO'), value='CPU')" ] }, - "execution_count": 27, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -699,12 +859,12 @@ "id": "bd55ade7-0445-47e1-90f0-72b82da016ca", "metadata": {}, "source": [ - "The cell below create `OVMPTModel`, `OVQWENModel` and `OVCHATGLM2Model` wrapper based on `OVModelForCausalLM` model." + "The cell below create `OVMPTModel` and `OVCHATGLM2Model` wrapper based on `OVModelForCausalLM` model." 
] }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 10, "id": "5333ab9b-ff5d-4a7f-bcdc-9cca5d56dc0a", "metadata": { "tags": [] @@ -727,7 +887,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 11, "id": "3536a1a7", "metadata": { "collapsed": false, @@ -739,7 +899,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "3d4f3c757a0a4b1eafdc593b7c42a77b", + "model_id": "424786d931fd4dfd96eb45208c42ca4c", "version_major": 2, "version_minor": 0 }, @@ -747,7 +907,7 @@ "Dropdown(description='Model to run:', options=('FP16',), value='FP16')" ] }, - "execution_count": 29, + "execution_count": 11, "metadata": {}, "output_type": "execute_result" } @@ -773,7 +933,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 12, "id": "7a041101-7336-40fd-96c9-cd298015a0f3", "metadata": { "tags": [] @@ -783,14 +943,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "Loading model from qwen1.5-7b-chat/FP16\n" + "Loading model from qwen-7b-chat/FP16\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ - "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n", "The argument `trust_remote_code` is to be used along with export=True. It will be ignored.\n", "Compiling the model to CPU ...\n" ] @@ -834,22 +993,15 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 13, "id": "8f6f7596-5677-4931-875b-aaabfa23cabc", "metadata": {}, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Setting `pad_token_id` to `eos_token_id`:151645 for open-end generation.\n" - ] - }, { "name": "stdout", "output_type": "stream", "text": [ - "2 + 2 = 4\n" + "2 + 2 = (1\n" ] } ], @@ -968,9 +1120,9 @@ "\n", "examples = (\n", " chinese_examples\n", - " if (\"qwen\" in model_id.value or \"chatglm\" in model_id.value or \"baichuan\" in model_id.value)\n", + " if (model_language.value == \"Chinese\")\n", " else japanese_examples\n", - " if (\"youri\" in model_id.value)\n", + " if (model_language.value == \"Japanese\")\n", " else english_examples\n", ")\n", "\n", @@ -1025,7 +1177,22 @@ " Returns:\n", " history in token format\n", " \"\"\"\n", - " if history_template is None:\n", + " if pt_model_name == \"baichuan2\":\n", + " system_tokens = tok.encode(start_message)\n", + " history_tokens = []\n", + " for (old_query, response) in history[:-1]:\n", + " round_tokens = []\n", + " round_tokens.append(195)\n", + " round_tokens.extend(tok.encode(old_query))\n", + " round_tokens.append(196)\n", + " round_tokens.extend(tok.encode(response))\n", + " history_tokens = round_tokens + history_tokens\n", + " input_tokens = system_tokens + history_tokens\n", + " input_tokens.append(195)\n", + " input_tokens.extend(tok.encode(history[-1][0]))\n", + " input_tokens.append(196)\n", + " input_token = torch.LongTensor([input_tokens])\n", + " elif history_template is None:\n", " messages = [{\"role\": \"system\", \"content\": start_message}]\n", " for idx, (user_msg, model_msg) in enumerate(history):\n", " if idx == len(history) - 1 and not model_msg:\n", @@ -1035,6 +1202,7 @@ " messages.append({\"role\": \"user\", \"content\": user_msg})\n", " if model_msg:\n", " messages.append({\"role\": \"assistant\", \"content\": model_msg})\n", + " \n", " input_token = tok.apply_chat_template(messages,\n", " add_generation_prompt=True,\n", " tokenize=True,\n", @@ -1142,6 +1310,10 @@ " yield history\n", "\n", "\n", + "def request_cancel():\n", + " 
ov_model.request.cancel()\n", + "\n", + "\n", "def get_uuid():\n", " \"\"\"\n", " universal unique identifier for thread\n", @@ -1261,7 +1433,7 @@ " queue=True,\n", " )\n", " stop.click(\n", - " fn=None,\n", + " fn=request_cancel,\n", " inputs=None,\n", " outputs=None,\n", " cancels=[submit_event, submit_click_event],\n", @@ -1274,15 +1446,23 @@ "# if you have any issue to launch on your platform, you can pass share=True to launch method:\n", "# demo.launch(share=True)\n", "# it creates a publicly shareable link for the interface. Read more in the docs: https://gradio.app/docs/\n", - "demo.launch()" + "demo.launch(server_name='10.3.233.99', server_port=5677)" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 17, "id": "7b837f9e-4152-4a5c-880a-ed874aa64a74", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Closing server running on port: 5467\n" + ] + } + ], "source": [ "# please uncomment and run this cell for stopping gradio interface\n", "# demo.close()" diff --git a/notebooks/254-llm-chatbot/254-rag-chatbot.ipynb b/notebooks/254-llm-chatbot/254-rag-chatbot.ipynb index 7364593c45b..f1156ffe4e9 100644 --- a/notebooks/254-llm-chatbot/254-rag-chatbot.ipynb +++ b/notebooks/254-llm-chatbot/254-rag-chatbot.ipynb @@ -156,14 +156,17 @@ " except OSError:\n", " notebook_login()\n", "```\n", - "* **qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about model in the [model card](https://huggingface.co/Qwen/Qwen1.5-7B-Chat).\n", + "* **qwen1.5-0.5b-chat/qwen1.5-1.8b-chat/qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about model in the [model repository](https://huggingface.co/Qwen).\n", + "* **qwen-7b-chat** - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, codes, etc. For more details about Qwen, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.\n", "* **mpt-7b-chat** - MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. 
It was built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3), [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets. More details about the model can be found in [blog post](https://www.mosaicml.com/blog/mpt-7b), [repository](https://github.com/mosaicml/llm-foundry/) and [HuggingFace model card](https://huggingface.co/mosaicml/mpt-7b-chat).\n", "* **chatglm3-6b** - ChatGLM3-6B is the latest open-source model in the ChatGLM series. While retaining many excellent features such as smooth dialogue and low deployment threshold from the previous two generations, ChatGLM3-6B employs a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. ChatGLM3-6B adopts a newly designed [Prompt format](https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.md), in addition to the normal multi-turn dialogue. You can find more details about model in the [model card](https://huggingface.co/THUDM/chatglm3-6b)\n", "* **mistral-7b** - The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about model in the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1), [paper](https://arxiv.org/abs/2310.06825) and [release blog post](https://mistral.ai/news/announcing-mistral-7b/).\n", "* **zephyr-7b-beta** - Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-beta is the second model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). You can find more details about model in [technical report](https://arxiv.org/abs/2310.16944) and [HuggingFace model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).\n", "* **neural-chat-7b-v3-1** - Mistral-7b model fine-tuned using Intel Gaudi. The model fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in [model card](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).\n", "* **notus-7b-v1** - Notus is a collection of fine-tuned models using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). and related [RLHF](https://huggingface.co/blog/rlhf) techniques. This model is the first version, fine-tuned with DPO over zephyr-7b-sft. Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. Proposed approach for dataset creation helps to effectively fine-tune Notus-7b that surpasses Zephyr-7B-beta and Claude 2 on [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). 
More details about model can be found in [model card](https://huggingface.co/argilla/notus-7b-v1).\n", - "* **baichuan2-7b-chat** - Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc](https://www.baichuan-ai.com/home). It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size." + "* **youri-7b-chat** - Youri-7b-chat is a Llama2 based model. [Rinna Co., Ltd.](https://rinna.co.jp/) conducted further pre-training for the Llama2 model with a mixture of English and Japanese datasets to improve Japanese task capability. The model is publicly released on Hugging Face hub. You can find detailed information at the [rinna/youri-7b-chat project page](https://huggingface.co/rinna/youri-7b). \n", + "* **baichuan2-7b-chat** - Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc](https://www.baichuan-ai.com/home). It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size.\n", + "* **internlm2-chat-1.8b** - InternLM2 is the second generation InternLM series. Compared to the previous generation model, it shows significant improvements in various capabilities, including reasoning, mathematics, and coding. More details about model can be found in [model repository](https://huggingface.co/internlm)." ] }, { @@ -183,15 +186,15 @@ "name": "stderr", "output_type": "stream", "text": [ - "2024-03-03 05:37:41.176057: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", - "2024-03-03 05:37:41.179549: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", - "2024-03-03 05:37:41.221693: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", - "2024-03-03 05:37:41.221725: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", - "2024-03-03 05:37:41.221757: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", - "2024-03-03 05:37:41.231708: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", - "2024-03-03 05:37:41.233781: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", + "2024-03-06 07:05:19.617312: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. 
To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", + "2024-03-06 07:05:19.620814: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", + "2024-03-06 07:05:19.663621: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", + "2024-03-06 07:05:19.663653: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", + "2024-03-06 07:05:19.663683: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", + "2024-03-06 07:05:19.671963: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", + "2024-03-06 07:05:19.673938: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", - "2024-03-03 05:37:42.211339: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" + "2024-03-06 07:05:20.726709: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" ] } ], @@ -239,12 +242,12 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "cdf2125cd615457790e51e1e48272035", + "model_id": "5875d10008c442c38ff1d90da874b8dc", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "Dropdown(description='LLM Model:', options=('tiny-llama-1b-chat', 'minicpm-2b-dpo', 'gemma-2b-it', 'red-pajama…" + "Dropdown(description='Model Language:', options=('English', 'Chinese', 'Japanese'), value='English')" ] }, "execution_count": 2, @@ -255,12 +258,47 @@ "source": [ "from config import SUPPORTED_EMBEDDING_MODELS, SUPPORTED_LLM_MODELS\n", "\n", - "llm_model_id = list(SUPPORTED_LLM_MODELS)\n", + "model_languages = list(SUPPORTED_LLM_MODELS)\n", + "\n", + "model_language = widgets.Dropdown(\n", + " options=model_languages,\n", + " value=model_languages[0],\n", + " description=\"Model Language:\",\n", + " disabled=False,\n", + ")\n", + "\n", + "model_language" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "184d1678-0e73-4f35-8af5-1a7d291c2e6e", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "c8d393ddf227409d84313cde097d9896", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Dropdown(description='Model:', options=('tiny-llama-1b-chat', 'gemma-2b-it', 'red-pajama-3b-chat', 'gemma-7b-i…" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "llm_model_ids = list(SUPPORTED_LLM_MODELS[model_language.value])\n", "\n", "llm_model_id = widgets.Dropdown(\n", - " options=llm_model_id,\n", - " value=llm_model_id[0],\n", - " description=\"LLM Model:\",\n", + " options=llm_model_ids,\n", + " value=llm_model_ids[0],\n", + " description=\"Model:\",\n", " disabled=False,\n", ")\n", "\n", @@ -269,7 +307,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 16, "id": "49ea95f8", 
"metadata": {}, "outputs": [ @@ -277,12 +315,12 @@ "name": "stdout", "output_type": "stream", "text": [ - "Selected LLM model chatglm3-6b\n" + "Selected LLM model tiny-llama-1b-chat\n" ] } ], "source": [ - "llm_model_configuration = SUPPORTED_LLM_MODELS[llm_model_id.value]\n", + "llm_model_configuration = SUPPORTED_LLM_MODELS[model_language.value][llm_model_id.value]\n", "print(f\"Selected LLM model {llm_model_id.value}\")" ] }, @@ -344,14 +382,14 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 10, "id": "c6a38153", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "a75a56b8dcf14fcd83ac08e298311b1a", + "model_id": "10a3596a41864effbe8fb9d81723f3ed", "version_major": 2, "version_minor": 0 }, @@ -365,7 +403,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "50d1d22c3d104f71aa7a8b323d829ea4", + "model_id": "da04e6b87e41474194e2de8219da7303", "version_major": 2, "version_minor": 0 }, @@ -379,7 +417,7 @@ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "9e44deed771a4bc0b1ee5fb96e5bc274", + "model_id": "0532ba4230d440aeb3f10cd7becf9156", "version_major": 2, "version_minor": 0 }, @@ -417,7 +455,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": 11, "id": "2020d522", "metadata": {}, "outputs": [], @@ -616,7 +654,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 12, "id": "8e127215", "metadata": {}, "outputs": [ @@ -624,7 +662,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Size of FP16 model is 11909.69 MB\n" + "Size of model with INT4 compressed weights is 1837.58 MB\n" ] } ], @@ -660,22 +698,22 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": 17, "id": "ff80e6eb-7923-40ef-93d8-5e6c56e50667", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { - "model_id": "a1bef0be382747a68399e898afc00112", + "model_id": "d7e6f5925ad0446ca94e882a8c6503fc", "version_major": 2, "version_minor": 0 }, "text/plain": [ - "Dropdown(description='Embedding Model:', options=('all-mpnet-base-v2', 'text2vec-large-chinese'), value='all-m…" + "Dropdown(description='Embedding Model:', options=('all-mpnet-base-v2',), value='all-mpnet-base-v2')" ] }, - "execution_count": 8, + "execution_count": 17, "metadata": {}, "output_type": "execute_result" } @@ -683,7 +721,7 @@ "source": [ "embedding_model_id = list(SUPPORTED_EMBEDDING_MODELS)\n", "\n", - "if \"qwen\" not in llm_model_id.value and \"chatglm\" not in llm_model_id.value:\n", + "if model_language.value != \"Chinese\":\n", " embedding_model_id = [x for x in embedding_model_id if \"chinese\" not in x]\n", "\n", "embedding_model_id = widgets.Dropdown(\n", @@ -922,7 +960,7 @@ "id": "e2610f4b", "metadata": {}, "source": [ - "The cell below create `OVMPTModel`, `OVQWENModel` and `OVCHATGLM2Model` wrapper based on `OVModelForCausalLM` model." + "The cell below create `OVMPTModel` and `OVCHATGLM2Model` wrapper based on `OVModelForCausalLM` model." ] }, { diff --git a/notebooks/254-llm-chatbot/README.md b/notebooks/254-llm-chatbot/README.md index d9febb2288d..9f0914f6b17 100644 --- a/notebooks/254-llm-chatbot/README.md +++ b/notebooks/254-llm-chatbot/README.md @@ -24,15 +24,17 @@ The available options are: * **llama-2-7b-chat** - LLama 2 is the second generation of LLama models developed by Meta. Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. 
llama-2-7b-chat is the 7 billion parameter version of LLama 2, finetuned and optimized for dialogue use cases. More details about the model can be found in the [paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/), [repository](https://github.com/facebookresearch/llama) and [HuggingFace model card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
>**Note**: To run the model with this demo, you will need to accept the license agreement.
>You must be a registered user in 🤗 Hugging Face Hub. Please visit the [HuggingFace model card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), carefully read the terms of usage and click the accept button. You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
-* **qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about model in the [model card](https://huggingface.co/Qwen/Qwen1.5-7B-Chat).
-* **mpt-7b-chat** - MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. It was built by finetuning MPT-7B on the ShareGPT-Vicuna, [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3), [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets. More details about the model can be found in [blog post](https://www.mosaicml.com/blog/mpt-7b), [repository](https://github.com/mosaicml/llm-foundry/) and [HuggingFace model card](https://huggingface.co/mosaicml/mpt-7b-chat).
+* **qwen1.5-0.5b-chat/qwen1.5-1.8b-chat/qwen1.5-7b-chat** - Qwen1.5 is the beta version of Qwen2, a transformer-based decoder-only language model pretrained on a large amount of data. Qwen1.5 is a language model series including decoder language models of different model sizes. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, mixture of sliding window attention and full attention. You can find more details about the model in the [model repository](https://huggingface.co/Qwen).
+* **qwen-7b-chat** - Qwen-7B is the 7B-parameter version of the large language model series, Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model, which is pretrained on a large volume of data, including web texts, books, code, etc. For more details about Qwen, please refer to the [GitHub](https://github.com/QwenLM/Qwen) code repository.
+* **mpt-7b-chat** - MPT-7B is part of the family of MosaicPretrainedTransformer (MPT) models, which use a modified transformer architecture optimized for efficient training and inference. These architectural changes include performance-optimized layer implementations and the elimination of context length limits by replacing positional embeddings with Attention with Linear Biases ([ALiBi](https://arxiv.org/abs/2108.12409)). Thanks to these modifications, MPT models can be trained with high throughput efficiency and stable convergence. MPT-7B-chat is a chatbot-like model for dialogue generation. It was built by finetuning MPT-7B on the [ShareGPT-Vicuna](https://huggingface.co/datasets/jeffwan/sharegpt_vicuna), [HC3](https://huggingface.co/datasets/Hello-SimpleAI/HC3), [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca), [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf), and [Evol-Instruct](https://huggingface.co/datasets/victor123/evol_instruct_70k) datasets. More details about the model can be found in the [blog post](https://www.mosaicml.com/blog/mpt-7b), [repository](https://github.com/mosaicml/llm-foundry/) and [HuggingFace model card](https://huggingface.co/mosaicml/mpt-7b-chat).
* **chatglm3-6b** - ChatGLM3-6B is the latest open-source model in the ChatGLM series. While retaining many excellent features such as smooth dialogue and low deployment threshold from the previous two generations, ChatGLM3-6B employs a more diverse training dataset, more sufficient training steps, and a more reasonable training strategy. ChatGLM3-6B adopts a newly designed [Prompt format](https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.md), in addition to the normal multi-turn dialogue. You can find more details about the model in the [model card](https://huggingface.co/THUDM/chatglm3-6b)
* **mistral-7b** - The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about the model in the [model card](https://huggingface.co/mistralai/Mistral-7B-v0.1), [paper](https://arxiv.org/abs/2310.06825) and [release blog post](https://mistral.ai/news/announcing-mistral-7b/).
* **zephyr-7b-beta** - Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-beta is the second model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). You can find more details about the model in the [technical report](https://arxiv.org/abs/2310.16944) and [HuggingFace model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).
-* **neural-chat-7b-v3-1** - Mistral-7b model fine-tuned using Intel Gaudi. The model was fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with the [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in the [model card](https://huggingface.co/Intel/neural-chat-7b-v3-3) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).
+* **neural-chat-7b-v3-1** - Mistral-7b model fine-tuned using Intel Gaudi. The model was fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with the [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in the [model card](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).
* **notus-7b-v1** - Notus is a collection of fine-tuned models using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290) and related [RLHF](https://huggingface.co/blog/rlhf) techniques. This model is the first version, fine-tuned with DPO over zephyr-7b-sft. Following a data-first approach, the only difference between Notus-7B-v1 and Zephyr-7B-beta is the preference dataset used for dDPO. The proposed approach for dataset creation helps to effectively fine-tune Notus-7b, which surpasses Zephyr-7B-beta and Claude 2 on [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). More details about the model can be found in the [model card](https://huggingface.co/argilla/notus-7b-v1).
* **youri-7b-chat** - Youri-7b-chat is a Llama2 based model. [Rinna Co., Ltd.](https://rinna.co.jp/) conducted further pre-training for the Llama2 model with a mixture of English and Japanese datasets to improve Japanese task capability. The model is publicly released on Hugging Face hub. You can find detailed information at the [rinna/youri-7b-chat project page](https://huggingface.co/rinna/youri-7b).
* **baichuan2-7b-chat** - Baichuan 2 is the new generation of large-scale open-source language models launched by [Baichuan Intelligence inc](https://www.baichuan-ai.com/home). It is trained on a high-quality corpus with 2.6 trillion tokens and has achieved the best performance in authoritative Chinese and English benchmarks of the same size.
+* **internlm2-chat-1.8b** - InternLM2 is the second generation InternLM series. Compared to the previous generation model, it shows significant improvements in various capabilities, including reasoning, mathematics, and coding. More details about the model can be found in the [model repository](https://huggingface.co/internlm).

The image below illustrates the provided user instruction and model answer examples.

diff --git a/notebooks/254-llm-chatbot/config.py b/notebooks/254-llm-chatbot/config.py
index 229489ad5a3..87eb36ffb66 100644
--- a/notebooks/254-llm-chatbot/config.py
+++ b/notebooks/254-llm-chatbot/config.py
@@ -1,9 +1,3 @@
-from transformers import (
-    StoppingCriteria,
-    StoppingCriteriaList,
-)
-import torch
-
 DEFAULT_SYSTEM_PROMPT = """\
 You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.\ @@ -55,201 +49,265 @@ def youri_partial_text_processor(partial_text, new_text): return partial_text +def internlm_partial_text_processor(partial_text, new_text): + partial_text += new_text + return partial_text.split("<|im_end|>")[0] + + SUPPORTED_LLM_MODELS = { - "tiny-llama-1b-chat": { - "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", - "remote": False, - "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", - "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", - "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", - "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" - + """ - <|user|> - Question: {question} - Context: {context} - Answer: - <|assistant|>""", - }, - "minicpm-2b-dpo": { - "model_id": "openbmb/MiniCPM-2B-dpo-fp16", - "remote_code": True, - "remote": False, - "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", - "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", - "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", - "stop_tokens": ["<|user|>", "<|assistant|>"], - "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" - + """ - <|user|> - Question: {question} - Context: {context} - Answer: - <|assistant|>""", - }, - "gemma-2b-it": { - "model_id": "google/gemma-2b-it", - "remote": True, - "start_message": DEFAULT_SYSTEM_PROMPT + ", ", - "history_template": "user{user}model{assistant}", - "current_message_template": "user{user}model{assistant}", - "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT},"""+"""user{question}context{context}model""" - }, - "red-pajama-3b-chat": { - "model_id": "togethercomputer/RedPajama-INCITE-Chat-3B-v1", - "remote": False, - "start_message": "", - "history_template": "\n:{user}\n:{assistant}", - "stop_tokens": [29, 0], - "partial_text_processor": red_pijama_partial_text_processor, - "current_message_template": "\n:{user}\n:{assistant}", - "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT }""" - + """ - : Question: {question} - Context: {context} - Answer: """, - }, - "gemma-7b-it": { - "model_id": "google/gemma-7b-it", - "remote": True, - "start_message": DEFAULT_SYSTEM_PROMPT + ", ", - "history_template": "user{user}model{assistant}", - "current_message_template": "user{user}model{assistant}", - "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT},"""+"""user{question}context{context}model""" - }, - "llama-2-chat-7b": { - "model_id": "meta-llama/Llama-2-7b-chat-hf", - "remote": False, - "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", - "history_template": "{user}[/INST]{assistant}[INST]", - "current_message_template": "{user} [/INST]{assistant}", - "tokenizer_kwargs": {"add_special_tokens": False}, - "partial_text_processor": llama_partial_text_processor, - "rag_prompt_template": f"""[INST]Human: <> {DEFAULT_RAG_PROMPT }<>""" - + """ - Question: {question} - Context: {context} - Answer: [/INST]""", - }, - "mpt-7b-chat": { - "model_id": "mosaicml/mpt-7b-chat", - "remote": True, - "start_message": f"<|im_start|>system\n {DEFAULT_SYSTEM_PROMPT }<|im_end|>", - "history_template": "<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}<|im_end|>", - "current_message_template": '"<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}', - "stop_tokens": ["<|im_end|>", "<|endoftext|>"], - "rag_prompt_template": f"""<|im_start|>system - {DEFAULT_RAG_PROMPT }<|im_end|>""" - + """ - <|im_start|>user - Question: {question} - Context: 
{context} - Answer: <|im_start|>assistant""", - }, - "qwen1.5-7b-chat": { - "model_id": "Qwen/Qwen1.5-7B-Chat", - "remote": False, - "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, - "stop_tokens": ["<|im_end|>", "<|endoftext|>"], - "rag_prompt_template": f"""<|im_start|>system - {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" - + """ - <|im_start|>user - 问题: {question} - 已知内容: {context} - 回答: <|im_end|><|im_start|>assistant""", - }, - "chatglm3-6b": { - "model_id": "THUDM/chatglm3-6b", - "remote": True, - "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, - "tokenizer_kwargs": {"add_special_tokens": False}, - "stop_tokens": [0, 2], - "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT_CHINESE }""" - + """ - 问题: {question} - 已知内容: {context} - 回答: - """, - }, - "mistral-7b": { - "model_id": "mistralai/Mistral-7B-v0.1", - "remote": False, - "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", - "history_template": "{user}[/INST]{assistant}[INST]", - "current_message_template": "{user} [/INST]{assistant}", - "tokenizer_kwargs": {"add_special_tokens": False}, - "partial_text_processor": llama_partial_text_processor, - "rag_prompt_template": f""" [INST] {DEFAULT_RAG_PROMPT } [/INST] """ - + """ - [INST] Question: {question} - Context: {context} - Answer: [/INST]""", - }, - "zephyr-7b-beta": { - "model_id": "HuggingFaceH4/zephyr-7b-beta", - "remote": False, - "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", - "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", - "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", - "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" - + """ - <|user|> - Question: {question} - Context: {context} - Answer: - <|assistant|>""", - }, - "neural-chat-7b-v3-1": { - "model_id": "Intel/neural-chat-7b-v3-3", - "remote": False, - "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", - "history_template": "{user}[/INST]{assistant}[INST]", - "current_message_template": "{user} [/INST]{assistant}", - "tokenizer_kwargs": {"add_special_tokens": False}, - "partial_text_processor": llama_partial_text_processor, - "rag_prompt_template": f""" [INST] {DEFAULT_RAG_PROMPT } [/INST] """ - + """ - [INST] Question: {question} - Context: {context} - Answer: [/INST]""", - }, - "notus-7b-v1": { - "model_id": "argilla/notus-7b-v1", - "remote": False, - "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", - "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", - "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", - "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" - + """ - <|user|> - Question: {question} - Context: {context} - Answer: - <|assistant|>""", - }, - "youri-7b-chat": { - "model_id": "rinna/youri-7b-chat", - "remote": False, - "start_message": f"設定: {DEFAULT_SYSTEM_PROMPT_JAPANESE}\n", - "history_template": "ユーザー: {user}\nシステム: {assistant}\n", - "current_message_template": "ユーザー: {user}\nシステム: {assistant}", - "tokenizer_kwargs": {"add_special_tokens": False}, - "partial_text_processor": youri_partial_text_processor, + "English":{ + "tiny-llama-1b-chat": { + "model_id": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", + "remote": False, + "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", + "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", + "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", + "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" + + """ + <|user|> + Question: {question} + 
Context: {context} + Answer: + <|assistant|>""", + }, + "gemma-2b-it": { + "model_id": "google/gemma-2b-it", + "remote": True, + "start_message": DEFAULT_SYSTEM_PROMPT + ", ", + "history_template": "user{user}model{assistant}", + "current_message_template": "user{user}model{assistant}", + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT},"""+"""user{question}context{context}model""" + }, + "red-pajama-3b-chat": { + "model_id": "togethercomputer/RedPajama-INCITE-Chat-3B-v1", + "remote": False, + "start_message": "", + "history_template": "\n:{user}\n:{assistant}", + "stop_tokens": [29, 0], + "partial_text_processor": red_pijama_partial_text_processor, + "current_message_template": "\n:{user}\n:{assistant}", + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT }""" + + """ + : Question: {question} + Context: {context} + Answer: """, + }, + "gemma-7b-it": { + "model_id": "google/gemma-7b-it", + "remote": True, + "start_message": DEFAULT_SYSTEM_PROMPT + ", ", + "history_template": "user{user}model{assistant}", + "current_message_template": "user{user}model{assistant}", + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT},"""+"""user{question}context{context}model""" + }, + "llama-2-chat-7b": { + "model_id": "meta-llama/Llama-2-7b-chat-hf", + "remote": False, + "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", + "history_template": "{user}[/INST]{assistant}[INST]", + "current_message_template": "{user} [/INST]{assistant}", + "tokenizer_kwargs": {"add_special_tokens": False}, + "partial_text_processor": llama_partial_text_processor, + "rag_prompt_template": f"""[INST]Human: <> {DEFAULT_RAG_PROMPT }<>""" + + """ + Question: {question} + Context: {context} + Answer: [/INST]""", + }, + "mpt-7b-chat": { + "model_id": "mosaicml/mpt-7b-chat", + "remote": True, + "start_message": f"<|im_start|>system\n {DEFAULT_SYSTEM_PROMPT }<|im_end|>", + "history_template": "<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}<|im_end|>", + "current_message_template": '"<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}', + "stop_tokens": ["<|im_end|>", "<|endoftext|>"], + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT }<|im_end|>""" + + """ + <|im_start|>user + Question: {question} + Context: {context} + Answer: <|im_start|>assistant""", + }, + "mistral-7b": { + "model_id": "mistralai/Mistral-7B-v0.1", + "remote": False, + "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", + "history_template": "{user}[/INST]{assistant}[INST]", + "current_message_template": "{user} [/INST]{assistant}", + "tokenizer_kwargs": {"add_special_tokens": False}, + "partial_text_processor": llama_partial_text_processor, + "rag_prompt_template": f""" [INST] {DEFAULT_RAG_PROMPT } [/INST] """ + + """ + [INST] Question: {question} + Context: {context} + Answer: [/INST]""", + }, + "zephyr-7b-beta": { + "model_id": "HuggingFaceH4/zephyr-7b-beta", + "remote": False, + "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", + "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", + "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", + "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" + + """ + <|user|> + Question: {question} + Context: {context} + Answer: + <|assistant|>""", + }, + "neural-chat-7b-v3-1": { + "model_id": "Intel/neural-chat-7b-v3-3", + "remote": False, + "start_message": f"[INST] <>\n{DEFAULT_SYSTEM_PROMPT }\n<>\n\n", + "history_template": "{user}[/INST]{assistant}[INST]", + "current_message_template": "{user} 
[/INST]{assistant}", + "tokenizer_kwargs": {"add_special_tokens": False}, + "partial_text_processor": llama_partial_text_processor, + "rag_prompt_template": f""" [INST] {DEFAULT_RAG_PROMPT } [/INST] """ + + """ + [INST] Question: {question} + Context: {context} + Answer: [/INST]""", + }, + "notus-7b-v1": { + "model_id": "argilla/notus-7b-v1", + "remote": False, + "start_message": f"<|system|>\n{DEFAULT_SYSTEM_PROMPT}\n", + "history_template": "<|user|>\n{user} \n<|assistant|>\n{assistant} \n", + "current_message_template": "<|user|>\n{user} \n<|assistant|>\n{assistant}", + "rag_prompt_template": f"""<|system|> {DEFAULT_RAG_PROMPT }""" + + """ + <|user|> + Question: {question} + Context: {context} + Answer: + <|assistant|>""", + }, }, - "baichuan2-7b-chat": { - "model_id": "baichuan-inc/Baichuan2-7B-Chat", - "remote": True, - "start_message": f"{DEFAULT_SYSTEM_PROMPT_CHINESE }", - "roles": [195, 196], - "tokenizer_kwargs": {"add_special_tokens": False}, - "stop_tokens": [2], - "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT_CHINESE }""" - + """ - 问题: {question} - 已知内容: {context} - 回答: - """, + "Chinese":{ + "qwen1.5-0.5b-chat": { + "model_id": "Qwen/Qwen1.5-0.5B-Chat", + "remote": False, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "stop_tokens": ["<|im_end|>", "<|endoftext|>"], + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" + + """ + <|im_start|>user + 问题: {question} + 已知内容: {context} + 回答: <|im_end|><|im_start|>assistant""", + }, + "qwen1.5-1.8b-chat": { + "model_id": "Qwen/Qwen-1_8B-Chat", + "remote": False, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "stop_tokens": ["<|im_end|>", "<|endoftext|>"], + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" + + """ + <|im_start|>user + 问题: {question} + 已知内容: {context} + 回答: <|im_end|><|im_start|>assistant""", + }, + "qwen1.5-7b-chat": { + "model_id": "Qwen/Qwen1.5-7B-Chat", + "remote": False, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "stop_tokens": ["<|im_end|>", "<|endoftext|>"], + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" + + """ + <|im_start|>user + 问题: {question} + 已知内容: {context} + 回答: <|im_end|><|im_start|>assistant""", + }, + "qwen-7b-chat": { + "model_id": "Qwen/Qwen-7B-Chat", + "remote": True, + "start_message": f"<|im_start|>system\n {DEFAULT_SYSTEM_PROMPT_CHINESE }<|im_end|>", + "history_template": "<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}<|im_end|>", + "current_message_template": '"<|im_start|>user\n{user}<|im_start|>assistant\n{assistant}', + "stop_tokens": ["<|im_end|>", "<|endoftext|>"], + "revision": "2abd8e5777bb4ce9c8ab4be7dbbd0fe4526db78d", + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" + + """ + <|im_start|>user + 问题: {question} + 已知内容: {context} + 回答: <|im_end|><|im_start|>assistant""", + }, + "chatglm3-6b": { + "model_id": "THUDM/chatglm3-6b", + "remote": True, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "tokenizer_kwargs": {"add_special_tokens": False}, + "stop_tokens": [0, 2], + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT_CHINESE }""" + + """ + 问题: {question} + 已知内容: {context} + 回答: + """, + }, + "baichuan2-7b-chat": { + "model_id": "baichuan-inc/Baichuan2-7B-Chat", + "remote": True, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "tokenizer_kwargs": {"add_special_tokens": False}, + "stop_tokens": [0, 2], + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT_CHINESE }""" + + """ + 问题: 
{question} + 已知内容: {context} + 回答: + """, + }, + "minicpm-2b-dpo": { + "model_id": "openbmb/MiniCPM-2B-dpo-fp16", + "remote_code": True, + "remote": False, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "stop_tokens": [2], + "rag_prompt_template": f"""{DEFAULT_RAG_PROMPT_CHINESE }""" + + """ + 问题: {question} + 已知内容: {context} + 回答: + """, + }, + "internlm2-chat-1.8b": { + "model_id": "internlm/internlm2-chat-1_8b", + "remote_code": True, + "remote": False, + "start_message": DEFAULT_SYSTEM_PROMPT_CHINESE, + "stop_tokens": [2, 92542], + "partial_text_processor": internlm_partial_text_processor, + "rag_prompt_template": f"""<|im_start|>system + {DEFAULT_RAG_PROMPT_CHINESE }<|im_end|>""" + + """ + <|im_start|>user + 问题: {question} + 已知内容: {context} + 回答: <|im_end|><|im_start|>assistant""", + }, }, + "Japanese":{ + "youri-7b-chat": { + "model_id": "rinna/youri-7b-chat", + "remote": False, + "start_message": f"設定: {DEFAULT_SYSTEM_PROMPT_JAPANESE}\n", + "history_template": "ユーザー: {user}\nシステム: {assistant}\n", + "current_message_template": "ユーザー: {user}\nシステム: {assistant}", + "tokenizer_kwargs": {"add_special_tokens": False}, + "partial_text_processor": youri_partial_text_processor, + }, + } } SUPPORTED_EMBEDDING_MODELS = { diff --git a/notebooks/254-llm-chatbot/converter.py b/notebooks/254-llm-chatbot/converter.py index 632c11797ed..dcae26dbac9 100644 --- a/notebooks/254-llm-chatbot/converter.py +++ b/notebooks/254-llm-chatbot/converter.py @@ -20,6 +20,8 @@ def register_configs(): from optimum.exporters.tasks import TasksManager TasksManager._SUPPORTED_MODEL_TYPE["minicpm"] = TasksManager._SUPPORTED_MODEL_TYPE["llama"] TasksManager._SUPPORTED_MODEL_TYPE["qwen2"] = TasksManager._SUPPORTED_MODEL_TYPE["llama"] + TasksManager._SUPPORTED_MODEL_TYPE["internlm2"] = TasksManager._SUPPORTED_MODEL_TYPE["llama"] + def patch_stateful(ov_model, model_type): key_value_input_names = [ @@ -42,7 +44,6 @@ def patch_stateful(ov_model, model_type): ) - def flattenize_inputs(inputs): """ Helper function for making nested inputs flattens @@ -143,67 +144,6 @@ def ts_patched_forward( del pt_model -def convert_baichuan(pt_model: torch.nn.Module, model_path: Path): - """ - Baichuan model conversion function - Params: - pt_model: PyTorch model - model_path: path for saving model - Returns: - None - """ - ov_out_path = Path(model_path) / "openvino_model.xml" - pt_model.config.save_pretrained(ov_out_path.parent) - pt_model.config.use_cache = True - outs = pt_model( - input_ids=torch.ones((1, 10), dtype=torch.long), - attention_mask=torch.ones((1, 10), dtype=torch.long), - ) - inputs = ["input_ids", "attention_mask"] - outputs = ["logits"] - - dynamic_shapes = { - "input_ids": {0: "batch_size", 1: "seq_len"}, - "attention_mask": {0: "batch_size", 1: "seq_len"}, - } - for idx in range(len(outs.past_key_values)): - inputs.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"]) - dynamic_shapes[inputs[-1]] = {0: "batch_size", 2: "past_sequence + sequence"} - dynamic_shapes[inputs[-2]] = {0: "batch_size", 2: "past_sequence + sequence"} - outputs.extend([f"present.{idx}.key", f"present.{idx}.value"]) - - dummy_inputs = { - "input_ids": torch.ones((1, 2), dtype=torch.long), - "attention_mask": torch.ones((1, 12), dtype=torch.long), - "past_key_values": outs.past_key_values, - } - pt_model.config.torchscript = True - ov_model = ov.convert_model(pt_model, example_input=dummy_inputs) - for inp_name, m_input, input_data in zip( - inputs, ov_model.inputs, flattenize_inputs(dummy_inputs.values()) - ): - input_node = 
m_input.get_node() - if input_node.element_type == ov.Type.dynamic: - m_input.get_node().set_element_type(ov.Type.f32) - shape = list(input_data.shape) - if inp_name in dynamic_shapes: - for k in dynamic_shapes[inp_name]: - shape[k] = -1 - input_node.set_partial_shape(ov.PartialShape(shape)) - m_input.get_tensor().set_names({inp_name}) - - for out, out_name in zip(ov_model.outputs, outputs): - out.get_tensor().set_names({out_name}) - - ov_model.validate_nodes_and_infer_types() - if make_stateful is not None: - patch_stateful(ov_model, "baichuan") - ov.save_model(ov_model, ov_out_path) - del ov_model - cleanup_torchscript_cache() - del pt_model - - @torch.jit.script_if_tracing def _chatglm2_get_context_layer(query_layer: torch.Tensor, key_layer: torch.Tensor, value_layer: torch.Tensor): mask = torch.zeros((query_layer.shape[-2], key_layer.shape[-2]), dtype=query_layer.dtype) @@ -383,9 +323,78 @@ def convert_chatglm(pt_model: torch.nn.Module, model_path: Path): cleanup_torchscript_cache() del pt_model -def convert_gemma(pt_model: torch.nn.Module, model_path: Path): + +def _update_qwen_rotary_embedding_cache(model): + model.transformer.rotary_emb(2048) + + +def convert_qwen(pt_model: torch.nn.Module, model_path: Path): + """ + Qwen model conversion function + Params: + pt_model: PyTorch model + model_path: path for saving model + Returns: + None + """ + _update_qwen_rotary_embedding_cache(pt_model) + ov_out_path = Path(model_path) / "openvino_model.xml" + pt_model.config.save_pretrained(ov_out_path.parent) + pt_model.config.use_cache = True + outs = pt_model( + input_ids=torch.ones((1, 10), dtype=torch.long), + attention_mask=torch.ones((1, 10), dtype=torch.long), + ) + inputs = ["input_ids"] + outputs = ["logits"] + + dynamic_shapes = { + "input_ids": {0: "batch_size", 1: "seq_len"}, + "attention_mask": {0: "batch_size", 1: "seq_len"}, + "token_type_ids": {0: "batch_size", 1: "seq_len"}, + } + for idx in range(len(outs.past_key_values)): + inputs.extend([f"past_key_values.{idx}.key", f"past_key_values.{idx}.value"]) + dynamic_shapes[inputs[-1]] = {0: "batch_size", 1: "past_sequence + sequence"} + dynamic_shapes[inputs[-2]] = {0: "batch_size", 1: "past_sequence + sequence"} + outputs.extend([f"present.{idx}.key", f"present.{idx}.value"]) + + inputs += ["attention_mask", "token_type_ids"] + dummy_inputs = { + "input_ids": torch.ones((1, 2), dtype=torch.long), + "past_key_values": outs.past_key_values, + "attention_mask": torch.ones((1, 12), dtype=torch.long), + "token_type_ids": torch.zeros((1, 2), dtype=torch.long), + } + pt_model.config.torchscript = True + ov_model = ov.convert_model(pt_model, example_input=dummy_inputs) + for inp_name, m_input, input_data in zip( + inputs, ov_model.inputs, flattenize_inputs(dummy_inputs.values()) + ): + input_node = m_input.get_node() + if input_node.element_type == ov.Type.dynamic: + m_input.get_node().set_element_type(ov.Type.f32) + shape = list(input_data.shape) + if inp_name in dynamic_shapes: + for k in dynamic_shapes[inp_name]: + shape[k] = -1 + input_node.set_partial_shape(ov.PartialShape(shape)) + m_input.get_tensor().set_names({inp_name}) + for out, out_name in zip(ov_model.outputs, outputs): + out.get_tensor().set_names({out_name}) + + ov_model.validate_nodes_and_infer_types() + if make_stateful is not None: + patch_stateful(ov_model, "qwen") + ov.save_model(ov_model, ov_out_path) + del ov_model + cleanup_torchscript_cache() + del pt_model + + +def convert_default(pt_model: torch.nn.Module, model_path: Path): """ - Gamma model conversion function 
+ model conversion function Params: pt_model: PyTorch model @@ -394,6 +403,7 @@ def convert_gemma(pt_model: torch.nn.Module, model_path: Path): None """ ov_out_path = Path(model_path) / "openvino_model.xml" + model_name = str(model_path.parent).split("-")[0] pt_model.config.save_pretrained(ov_out_path.parent) pt_model.config.use_cache = True outs = pt_model(input_ids=torch.ones((2, 10), dtype=torch.long)) @@ -438,7 +448,7 @@ def convert_gemma(pt_model: torch.nn.Module, model_path: Path): ov_model.validate_nodes_and_infer_types() if make_stateful is not None: - patch_stateful(ov_model, "gemma") + patch_stateful(ov_model, model_name) ov.save_model(ov_model, ov_out_path) del ov_model cleanup_torchscript_cache() @@ -465,8 +475,9 @@ def convert_bert(pt_model: torch.nn.Module, model_path: Path): # LLM models "mpt": convert_mpt, "chatglm3": convert_chatglm, - "baichuan2": convert_baichuan, - "gemma": convert_gemma, + "qwen": convert_qwen, + "baichuan2": convert_default, + "gemma": convert_default, # embedding models "all-mpnet-base-v2": convert_mpnet, "text2vec-large-chinese": convert_bert, diff --git a/notebooks/254-llm-chatbot/ov_llm_model.py b/notebooks/254-llm-chatbot/ov_llm_model.py index e7c0d332c83..9b7444edc5b 100644 --- a/notebooks/254-llm-chatbot/ov_llm_model.py +++ b/notebooks/254-llm-chatbot/ov_llm_model.py @@ -204,9 +204,9 @@ def _from_pretrained( ) -class OVBAICHUANModel(OVModelForCausalLM): +class OVCHATGLMModel(OVModelForCausalLM): """ - Optimum intel compatible model wrapper for QWEN + Optimum intel compatible model wrapper for CHATGLM2 """ def __init__( @@ -219,7 +219,7 @@ def __init__( model_save_dir: Optional[Union[str, Path]] = None, **kwargs, ): - NormalizedConfigManager._conf["baichuan"] = NormalizedTextConfig.with_args( + NormalizedConfigManager._conf["chatglm"] = NormalizedTextConfig.with_args( num_layers="num_hidden_layers", num_attention_heads="num_attention_heads", hidden_size="hidden_size", @@ -227,14 +227,20 @@ def __init__( super().__init__( model, config, device, dynamic_shapes, ov_config, model_save_dir, **kwargs ) - + def _reshape(self, model: "Model", *args, **kwargs): shapes = {} for inputs in model.inputs: shapes[inputs] = inputs.get_partial_shape() - if inputs.get_any_name().startswith('beam_idx'): + shapes[inputs][0] = -1 + input_name = inputs.get_any_name() + if input_name.startswith('beam_idx'): continue - shapes[inputs][1] = -1 + if input_name.startswith('past_key_values'): + shapes[inputs][1] = -1 + shapes[inputs][2] = 2 + elif shapes[inputs].rank.get_length() > 1: + shapes[inputs][1] = -1 model.reshape(shapes) return model @@ -270,18 +276,17 @@ def _from_pretrained( ) model = cls.load_model(model_cache_path, load_in_8bit=load_in_8bit) - init_cls = OVBAICHUANModel + init_cls = OVCHATGLMModel return init_cls( model=model, config=config, model_save_dir=model_cache_path.parent, **kwargs ) - - -class OVCHATGLMModel(OVModelForCausalLM): + + +class OVQWENModel(OVModelForCausalLM): """ - Optimum intel compatible model wrapper for CHATGLM2 + Optimum intel compatible model wrapper for QWEN """ - def __init__( self, model: "Model", @@ -292,7 +297,7 @@ def __init__( model_save_dir: Optional[Union[str, Path]] = None, **kwargs, ): - NormalizedConfigManager._conf["chatglm"] = NormalizedTextConfig.with_args( + NormalizedConfigManager._conf["qwen"] = NormalizedTextConfig.with_args( num_layers="num_hidden_layers", num_attention_heads="num_attention_heads", hidden_size="hidden_size", @@ -300,65 +305,134 @@ def __init__( super().__init__( model, config, device, 
dynamic_shapes, ov_config, model_save_dir, **kwargs ) - + def _reshape(self, model: "Model", *args, **kwargs): shapes = {} for inputs in model.inputs: shapes[inputs] = inputs.get_partial_shape() - shapes[inputs][0] = -1 - input_name = inputs.get_any_name() - if input_name.startswith('beam_idx'): + if inputs.get_any_name().startswith('beam_idx'): continue - if input_name.startswith('past_key_values'): - shapes[inputs][1] = -1 - shapes[inputs][2] = 2 - elif shapes[inputs].rank.get_length() > 1: - shapes[inputs][1] = -1 + shapes[inputs][1] = -1 model.reshape(shapes) return model - @classmethod - def _from_pretrained( - cls, - model_id: Union[str, Path], - config: PretrainedConfig, - use_auth_token: Optional[Union[bool, str, None]] = None, - revision: Optional[Union[str, None]] = None, - force_download: bool = False, - cache_dir: Optional[str] = None, - file_name: Optional[str] = None, - subfolder: str = "", - from_onnx: bool = False, - local_files_only: bool = False, - load_in_8bit: bool = False, + def forward( + self, + input_ids: torch.LongTensor, + past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, + attention_mask: Optional[torch.LongTensor] = None, + token_type_ids: Optional[torch.LongTensor] = None, **kwargs, - ): - model_path = Path(model_id) - default_file_name = OV_XML_FILE_NAME - file_name = file_name or default_file_name + ) -> CausalLMOutputWithPast: + self.compile() - model_cache_path = cls._cached_file( - model_path=model_path, - use_auth_token=use_auth_token, - revision=revision, - force_download=force_download, - cache_dir=cache_dir, - file_name=file_name, - subfolder=subfolder, - local_files_only=local_files_only, - ) + if self.use_cache and past_key_values is not None: + input_ids = input_ids[:, -1:] - model = cls.load_model(model_cache_path, load_in_8bit=load_in_8bit) - init_cls = OVCHATGLMModel + batch_size = input_ids.shape[0] - return init_cls( - model=model, config=config, model_save_dir=model_cache_path.parent, **kwargs - ) + inputs = {} + past_len = 0 + if not self.stateful: + if past_key_values is not None: + past_len = past_key_values[0][1].shape[-2] + if self._pkv_precision == Type.bf16: + # numpy does not support bf16, pretending f16, should change to bf16 + past_key_values = tuple( + Tensor(past_key_value, past_key_value.shape, Type.bf16) + for pkv_per_layer in past_key_values + for past_key_value in pkv_per_layer + ) + else: + # Flatten the past_key_values + past_key_values = tuple( + past_key_value for pkv_per_layer in past_key_values for past_key_value in pkv_per_layer + ) + + + # Add the past_key_values to the decoder inputs + inputs = dict(zip(self.key_value_input_names, past_key_values)) + + # Create empty past_key_values for decoder_with_past first generation step + elif self.use_cache: + for input_name in self.key_value_input_names: + model_inputs = self.model.input(input_name) + shape = model_inputs.get_partial_shape() + if self.config.model_type == 'chatglm': + shape[0] = 0 + shape[1] = batch_size + else: + shape[0] = batch_size + if shape[2].is_dynamic: + shape[2] = 0 + elif shape.rank.get_length() == 4 and shape[3].is_dynamic: + shape[3] = 0 + else: + shape[1] = 0 + inputs[input_name] = Tensor(model_inputs.get_element_type(), shape.get_shape()) + else: + # past_key_values are not used explicitly, instead they are handled inside the model + if past_key_values is None: + # Need a marker to differentiate the first generate iteration from the others in + # the first condition at the function beginning above. 
+ # It should be something that is not None and it should be True when converted to Boolean. + past_key_values = ((),) + # This is the first iteration in a sequence, reset all states + for state in self.request.query_state(): + state.reset() + # Set initial value for the next beam_idx input that will be used at the current iteration + # and will be optionally updated by _reorder_cache at the next iterations if beam_search is used + self.next_beam_idx = np.array(range(batch_size), dtype=int) + + inputs["input_ids"] = np.array(input_ids) + # Add the attention_mask inputs when needed + if "attention_mask" in self.input_names or "token_type_ids" in self.input_names: + if attention_mask is not None: + attention_mask = np.array(attention_mask) + else: + attention_mask = np.ones( + (input_ids.shape[0], input_ids.shape[1] + past_len), dtype=inputs["input_ids"].dtype + ) + + if "attention_mask" in self.input_names: + inputs["attention_mask"] = attention_mask + + if "token_type_ids" in self.input_names: + if token_type_ids is not None: + token_type_ids = np.array(token_type_ids) + else: + token_type_ids = np.zeros( + (input_ids.shape[0], input_ids.shape[1]), dtype=inputs["input_ids"].dtype + ) + + inputs["token_type_ids"] = token_type_ids + + if hasattr(self, 'next_beam_idx'): + inputs['beam_idx'] = self.next_beam_idx + + # Run inference + self.request.start_async(inputs, share_inputs=True) + self.request.wait() + logits = torch.from_numpy(self.request.get_tensor("logits").data).to(self.device) + + if not self.stateful: + if self.use_cache: + # Tuple of length equal to : number of layer * number of past_key_value per decoder layer (2 corresponds to the self-attention layer) + past_key_values = tuple(self.request.get_tensor(key).data for key in self.key_value_output_names) + # Tuple of tuple of length `n_layers`, with each tuple of length equal to 2 (k/v of self-attention) + past_key_values = tuple( + past_key_values[i : i + self.num_pkv] for i in range(0, len(past_key_values), self.num_pkv) + ) + else: + past_key_values = None + + return CausalLMOutputWithPast(logits=logits, past_key_values=past_key_values) model_classes = { "mpt": OVMPTModel, - "baichuan2": OVBAICHUANModel, "chatglm3": OVCHATGLMModel, - "gemma": OVModelForCausalLM + "gemma": OVModelForCausalLM, + "qwen": OVQWENModel, + "baichuan2": OVModelForCausalLM, }
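To make the intended wiring of these changes easier to follow, here is a minimal sketch of how the nested `SUPPORTED_LLM_MODELS` config and the `model_classes` dispatch table from this diff could be consumed together from the notebook. The `INT4_compressed_weights` path, the `wrapper_cls` name, and the prefix-based lookup are assumptions for illustration only, not code introduced by this diff.

```python
from pathlib import Path

from transformers import AutoConfig
from optimum.intel.openvino import OVModelForCausalLM

# Both dicts are defined in the notebooks/254-llm-chatbot files touched above.
from config import SUPPORTED_LLM_MODELS
from ov_llm_model import model_classes

# Language is now the top-level key; the model name is the second-level key.
model_language = "Chinese"
model_key = "qwen-7b-chat"
model_configuration = SUPPORTED_LLM_MODELS[model_language][model_key]

# In this sketch, entries flagged "remote" are routed through the model_classes
# table (which may still resolve to the generic class); the prefix match below
# is an assumption rather than logic taken from this diff.
if model_configuration["remote"]:
    wrapper_cls = next(
        (cls for prefix, cls in model_classes.items() if model_key.startswith(prefix)),
        OVModelForCausalLM,
    )
else:
    wrapper_cls = OVModelForCausalLM

ov_model = wrapper_cls.from_pretrained(
    Path(model_key) / "INT4_compressed_weights",  # assumed local export directory
    device="CPU",
    config=AutoConfig.from_pretrained(
        model_configuration["model_id"], trust_remote_code=True
    ),
    trust_remote_code=True,
)
```

This mirrors the converter change above, where the per-architecture `convert_baichuan` and `convert_gemma` functions collapse into a single `convert_default` that derives a model-family prefix from the save path and forwards it to `patch_stateful`.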