add llama3 in qa notebook (#1947)
eaidova authored Apr 19, 2024
1 parent ff7adcc commit 6d09efb
Showing 2 changed files with 150 additions and 20 deletions.
5 changes: 5 additions & 0 deletions notebooks/llm-question-answering/config.py
@@ -30,4 +30,9 @@
"prompt_template": "<s> [INST] {instruction} [/INST] </s>",
"tokenizer_kwargs": {"add_special_tokens": False},
},
"llama-3-8b-instruct": {
"model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
"end_key": "<|eot_id|>",
"prompt_template": "<|start_header_id|>system<|end_header_id|>\n\nBelow is an instruction that describes a task. Write a response that appropriately completes the request.<|eot_id|><|start_header_id|>user<|end_header_id|>Instruction: {instruction} Answer:<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
},
}
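The new config entry above gives the notebook everything it needs to build a Llama 3 prompt: `prompt_template` carries the `{instruction}` placeholder, and `end_key` is the stop string that terminates generation. A minimal sketch of how such an entry might be consumed — the dictionary name and surrounding wiring are assumptions for illustration, not code from this commit; only the field values come from config.py:

```python
# Hypothetical usage of the config entry added above; field values are taken
# from config.py, the variable names are illustrative.
llama3_cfg = {
    "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",
    "end_key": "<|eot_id|>",
    "prompt_template": (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>"
        "Instruction: {instruction} Answer:<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>"
    ),
}

# Fill the placeholder with a user question...
prompt = llama3_cfg["prompt_template"].format(instruction="What is OpenVINO?")
# ...and stop decoding once the model emits the end_key token.
stop_string = llama3_cfg["end_key"]
```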
165 changes: 145 additions & 20 deletions notebooks/llm-question-answering/llm-question-answering.ipynb
@@ -56,8 +56,8 @@
"metadata": {},
"outputs": [],
"source": [
"%pip uninstall -q -y openvino openvino-dev openvino-nightly optimum optimum-intel\n",
"%pip install -q \"torch>=2.1\" openvino-nightly \"nncf>=2.7\" \"transformers>=4.36.0\" onnx \"optimum>=1.16.1\" \"accelerate\" \"datasets>=2.14.6\" \"gradio>=4.19\" \"git+https://github.com/huggingface/optimum-intel.git\" --extra-index-url https://download.pytorch.org/whl/cpu"
"# %pip uninstall -q -y openvino openvino-dev openvino-nightly optimum optimum-intel\n",
"# %pip install -q \"torch>=2.1\" openvino-nightly \"nncf>=2.7\" \"transformers>=4.36.0\" onnx \"optimum>=1.16.1\" \"accelerate\" \"datasets>=2.14.6\" \"gradio>=4.19\" \"git+https://github.com/huggingface/optimum-intel.git\" --extra-index-url https://download.pytorch.org/whl/cpu"
]
},
{
@@ -78,7 +78,23 @@
"* **phi-2** - Phi-2 is a Transformer with 2.7 billion parameters. It was trained using the same data sources as [Phi-1.5](https://huggingface.co/microsoft/phi-1_5), augmented with a new data source that consists of various NLP synthetic texts and filtered websites (for safety and educational value). When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-2 showcased a nearly state-of-the-art performance among models with less than 13 billion parameters. More details about model can be found in [model card](https://huggingface.co/microsoft/phi-2#limitations-of-phi-2).\n",
"* **dolly-v2-3b** - Dolly 2.0 is an instruction-following large language model trained on the Databricks machine-learning platform that is licensed for commercial use. It is based on [Pythia](https://github.com/EleutherAI/pythia) and is trained on ~15k instruction/response fine-tuning records generated by Databricks employees in various capability domains, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Dolly 2.0 works by processing natural language instructions and generating responses that follow the given instructions. It can be used for a wide range of applications, including closed question-answering, summarization, and generation. More details about model can be found in [model card](https://huggingface.co/databricks/dolly-v2-3b).\n",
"* **red-pajama-3b-instruct** - A 2.8B parameter pre-trained language model based on GPT-NEOX architecture. The model was fine-tuned for few-shot applications on the data of [GPT-JT](https://huggingface.co/togethercomputer/GPT-JT-6B-v1), with exclusion of tasks that overlap with the HELM core scenarios.More details about model can be found in [model card](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1).\n",
"* **mistral-7b** - The Mistral-7B-v0.2 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about model in the [model card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), [paper](https://arxiv.org/abs/2310.06825) and [release blog post](https://mistral.ai/news/announcing-mistral-7b/)."
"* **mistral-7b** - The Mistral-7B-v0.2 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. You can find more details about model in the [model card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), [paper](https://arxiv.org/abs/2310.06825) and [release blog post](https://mistral.ai/news/announcing-mistral-7b/).\n",
"* **llama-3-8b-instruct** - Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Llama 3 instruction tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. More details about model can be found in [Meta blog post](https://ai.meta.com/blog/meta-llama-3/), [model website](https://llama.meta.com/llama3) and [model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).\n",
">**Note**: run model with demo, you will need to accept license agreement. \n",
">You must be a registered user in 🤗 Hugging Face Hub. Please visit [HuggingFace model card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), carefully read terms of usage and click accept button. You will need to use an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).\n",
">You can login on Hugging Face Hub in notebook environment, using following code:\n",
" \n",
"```python\n",
" ## login to huggingfacehub to get access to pretrained model \n",
"\n",
" from huggingface_hub import notebook_login, whoami\n",
"\n",
" try:\n",
" whoami()\n",
" print('Authorization token already provided')\n",
" except OSError:\n",
" notebook_login()\n",
"```"
]
},
{
@@ -101,7 +117,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2419310978924af096fd1c8f68cd7a64",
"model_id": "bae88788e6c34d919af8ec4660684580",
"version_major": 2,
"version_minor": 0
},
@@ -137,7 +153,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Selected model phi-2\n"
"Selected model llama-3-8b-instruct\n"
]
}
],
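The `Selected model` message above is printed after choosing from a selection widget whose code is collapsed in this diff. A rough sketch of what such a dropdown could look like, assuming `ipywidgets` — the names and option list here are a hypothetical reconstruction, not code from the notebook:

```python
import ipywidgets as widgets

# Hypothetical reconstruction of the model selector; the real notebook
# presumably builds its options from the config.py entries shown earlier.
model_id = widgets.Dropdown(
    options=["phi-2", "dolly-v2-3b", "red-pajama-3b-instruct", "mistral-7b", "llama-3-8b-instruct"],
    value="llama-3-8b-instruct",
    description="Model:",
)
model_id
```

After selection, something like `print(f"Selected model {model_id.value}")` would produce the output shown.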
@@ -203,7 +219,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "418ba56f828f45aaab3df253cda53cf8",
"model_id": "bf01bf342e814557a718ac8946822062",
"version_major": 2,
"version_minor": 0
},
@@ -217,7 +233,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "70ee0bbd42b74b5aa746e7560c851a35",
"model_id": "4ad18c76a71041fd9e7346cb2714b913",
"version_major": 2,
"version_minor": 0
},
@@ -231,7 +247,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "96bc22efd3064bbe944fce42084e340a",
"model_id": "87f253aac28a445bafee3b412a39b27b",
"version_major": 2,
"version_minor": 0
},
@@ -277,16 +293,101 @@
"name": "stdout",
"output_type": "stream",
"text": [
"INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino\n"
"INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2024-04-19 10:35:50.012050: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n",
"2024-04-19 10:35:50.025002: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n",
"2024-04-19 10:35:50.060073: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"2024-04-19 10:35:50.060108: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"2024-04-19 10:35:50.060134: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
"2024-04-19 10:35:50.068691: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n",
"2024-04-19 10:35:50.069448: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n",
"To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n",
"2024-04-19 10:35:51.045741: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
"The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.\n",
"Framework not specified. Using pt to export the model.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "914afbcfeb414e0cb1d5be8408334ace",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/ea/work/genai_env/lib/python3.8/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11080). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)\n",
" return torch._C._cuda_getDeviceCount() > 0\n",
"No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'\n"
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
"Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n",
"Using framework PyTorch: 2.2.2+cpu\n",
"Overriding 1 configuration item(s)\n",
"\t- use_cache -> True\n",
"/home/ea/miniconda3/lib/python3.11/site-packages/transformers/modeling_utils.py:4225: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead\n",
" warnings.warn(\n",
"The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class\n",
"The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class\n",
"/home/ea/miniconda3/lib/python3.11/site-packages/optimum/exporters/openvino/model_patcher.py:311: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!\n",
" if sequence_length != 1:\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ac846020ffb1429187ba5c4972e99758",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Output()"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
],
"text/plain": []
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
"</pre>\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Configuration saved in llama-3-8b-instruct/INT4_compressed_weights/openvino_config.json\n"
]
}
],
Expand All @@ -299,6 +400,7 @@
"from optimum.utils import NormalizedTextConfig, NormalizedConfigManager\n",
"import gc\n",
"\n",
"the\n",
"NormalizedConfigManager._conf[\"phi\"] = NormalizedTextConfig\n",
"\n",
"nncf.set_log_level(logging.ERROR)\n",
@@ -343,6 +445,7 @@
" \"ratio\": 0.5,\n",
" },\n",
" \"dolly-v2-3b\": {\"sym\": False, \"group_size\": 32, \"ratio\": 0.5},\n",
" \"llama-3-8b-instruct\": {\"sym\": True, \"group_size\": 128, \"ratio\": 1.0},\n",
" \"default\": {\n",
" \"sym\": False,\n",
" \"group_size\": 128,\n",
@@ -391,7 +494,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Size of model with INT4 compressed weights is 1734.02 MB\n"
"Size of model with INT4 compressed weights is 4435.75 MB\n"
]
}
],
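The printed size (~4.4 GB for the 8B model, down from ~16 GB at FP16) can be reproduced by checking the compressed weights on disk; a small sketch, assuming the standard `openvino_model.bin` layout of an exported IR:

```python
from pathlib import Path

# Assumes the default file layout produced by an OpenVINO IR export.
weights = Path("llama-3-8b-instruct/INT4_compressed_weights/openvino_model.bin")
print(f"Size of model with INT4 compressed weights is {weights.stat().st_size / 1024 ** 2:.2f} MB")
```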
@@ -430,7 +533,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6138478c099f474e8ec9125f54e75d76",
"model_id": "4dbc2b23f51f4fb499c3d91bc9cf4a1a",
"version_major": 2,
"version_minor": 0
},
@@ -464,7 +567,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "161aa343733841b3b5beca097adaccff",
"model_id": "e02bb6d7af734d15bdbeb532430dee31",
"version_major": 2,
"version_minor": 0
},
@@ -506,7 +609,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Loading model from phi-2/INT4_compressed_weights\n"
"Loading model from llama-3-8b-instruct/INT4_compressed_weights\n"
]
},
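Reloading the compressed model is a plain `from_pretrained` on the local directory. A minimal sketch, assuming the tokenizer was saved alongside the IR and that the target device is CPU (the notebook normally takes the device from a selection widget):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_dir = "llama-3-8b-instruct/INT4_compressed_weights"

# Assumption: tokenizer files were saved next to the IR during export.
tokenizer = AutoTokenizer.from_pretrained(model_dir)
ov_model = OVModelForCausalLM.from_pretrained(model_dir, device="CPU")
```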
{
@@ -850,10 +953,32 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 15,
"id": "9f222d02-847a-490f-8d66-02608a53259b",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Running on local URL: http://127.0.0.1:7860\n",
"\n",
"To create a public link, set `share=True` in `launch()`.\n"
]
},
{
"data": {
"text/html": [
"<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"800\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"examples = [\n",
" \"Give me a recipe for pizza with pineapple\",\n",
@@ -952,7 +1077,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
@@ -966,7 +1091,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
"version": "3.11.4"
},
"openvino_notebooks": {
"imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/daafd702-5a42-4f54-ae72-2e4480d73501",
