LLAMA_CPP notebook with Qwen-1.5-7B-Chat #901

Merged
Changes from 2 commits
281 changes: 281 additions & 0 deletions modules/llama_cpp_plugin/notebooks/qwen.ipynb
@@ -0,0 +1,281 @@
{

@vshampor , I suggest specifying the high-level requirements for this notebook at the beginning. Is an NVIDIA GPU + CUDA required? Will it work on Windows?

Contributor Author
@MaximProshin CUDA and a GPU are not necessary for the basic version of the notebook execution, although the user may uncomment a CMake option to compile with CUDA support, in which case they are naturally expected to have CUDA and GPUs on board. Windows is not supported because the notebook uses Linux paths and direct shell calls to build the plugin from source.

"cells": [
{
"cell_type": "markdown",
"id": "b2cb321e-2c20-45ea-93e0-a3bd16e5a120",
"metadata": {},
"source": [
"## QWEN model inference w/ OpenVINO's LLAMA_CPP plugin"
]
},
{
"cell_type": "markdown",
"id": "2bb9b46a-d9c5-42dc-8e50-6700180aad0c",
"metadata": {},
"source": [
"This notebook illustrates the usage of the `LLAMA_CPP` plugin for OpenVINO, which enables the `llama.cpp`-powered inference of corresponding GGUF-format model files via OpenVINO API. The user flow will be demonstrated on an LLM inferencing task with the Qwen-7B-Chat model in Chinese."
]
},
{
"cell_type": "markdown",
"id": "77a75ee1-6e0b-4c75-9af0-24989658ea81",
"metadata": {},
"source": [
"The first step is to procure a GGUF file that you would like to execute. Below the required dependencies and repositories are checked out and the Qwen-7B-Chat model is exported from HuggingFace's `transformers` into a GGUF file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "64317459-fe47-4835-b1ae-e239301cde17",
"metadata": {},
"outputs": [],
"source": [
"!pip install transformers[torch]\n",
"!pip install tiktoken\n",
"!huggingface-cli download Qwen/Qwen1.5-7B-Chat-GGUF qwen1_5-7b-chat-q5_k_m.gguf --local-dir . --local-dir-use-symlinks False"
]
},
{
"cell_type": "markdown",
"id": "ed2e6a45-a4e1-4478-b251-a1bb52a845da",
"metadata": {
"scrolled": true
},
"source": [
"The `LLAMA_CPP` plugin must be built from [sources](https://github.com/openvinotoolkit/openvino_contrib/modules/llama_cpp_plugin) in a standard [OpenVINO extra module user flow](https://github.com/openvinotoolkit/openvino_contrib/#how-to-build-openvino-with-extra-modules). "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff88d765-5257-4aa4-9373-4e1207197bf9",
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/openvinotoolkit/openvino_contrib\n",
"!git clone --recurse-submodules https://github.com/openvinotoolkit/openvino\n",
"\n",
"# Add -DLLAMA_CUBLAS=1 to the cmake line below to build the plugin with the CUDA backend.\n",
"# The underlying llama.cpp inference code will be executed on CUDA-powered GPUs on your host.\n",
"!cmake -B build -DCMAKE_BUILD_TYPE=Release -DOPENVINO_EXTRA_MODULES=../openvino_contrib/modules/llama_cpp_plugin -DENABLE_PLUGINS_XML=ON -DENABLE_LLAMA_CPP_PLUGIN_REGISTRATION=ON -DENABLE_PYTHON=1 -DENABLE_WHEEL=ON openvino #-DLLAMA_CUBLAS=1\n",
"\n",
"!cmake --build build --parallel -- llama_cpp_plugin pyopenvino ie_wheel"
]
},
{
"cell_type": "markdown",
"id": "458258b7-2e00-44b3-b3ee-e73238a47019",
"metadata": {},
"source": [
"After the build, the plugin binaries should be installed into the same directory as the rest of the OpenVINO plugin binaries, and the plugin itself should be registered in the `plugins.xml` file in the same directory. In our case, we configured the build above to generate a Python `.whl`, which will contain the necessary binaries and files automatically after installation into a virtual environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea92d639-4f1e-414b-af95-7b82b98f117a",
"metadata": {},
"outputs": [],
"source": [
"# You may want to restart your kernel after this cell.\n",
"!pip install --force-reinstall ./build/wheels/openvino-2024.1.0-cp39-cp39-linux_x86_64.whl"
]
},
{
"cell_type": "markdown",
"id": "f8f9a117-6302-4e7b-8570-a6fecffcf104",
"metadata": {},
"source": [
"Now the actual inferencing pipeline in Python can be executed. Load the GGUF file as if it were an OpenVINO-supported serialized file format by passing the path to it into the `.compile_model` call and explicitly specify the `LLAMA_CPP` as the target pseudo-\"device\" so that the `LLAMA_CPP` plugin code would be used instead of the regular OpenVINO code paths:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77d741bc-9674-4da0-b5c6-01ee95693128",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import openvino as ov\n",
"ov_model = ov.Core().compile_model(\"qwen1_5-7b-chat-q5_k_m.gguf\", \"LLAMA_CPP\")"
]
},
{
"cell_type": "markdown",
"id": "76785d5e-e6f5-46b8-9ff8-4436ac5e67c0",
"metadata": {},
"source": [
"The models loaded through the `LLAMA_CPP` plugin flow from GGUF expose two primary inputs - `input_ids` and `position_ids`, with the same semantics as the corresponding model inputs in the original PyTorch representation of the models in the HuggingFace repository. Additionally, `attention_mask` and `beam_idx` inputs are exposed for drop-in compatibility with existing OpenVINO example pipelines, but these inputs are left unused since the `llama.cpp` execution model either does not require or does not expose these inputs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e5531e8-84c7-4730-8036-3f35eb49ab84",
"metadata": {},
"outputs": [],
"source": [
"ov_model.inputs"
]
},
{
"cell_type": "markdown",
"id": "356a09b3-d8b1-4bf7-b74e-a3df922915d5",
"metadata": {},
"source": [
"Since this a chat version of the Qwen model, the user input should be formatted in a model and language-specific way. Below is some utility code that will be used to format the input prompt to the model."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ed042d99-8e22-42c8-ae7e-53719b7ca222",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"DEFAULT_SYSTEM_PROMPT_CHINESE = \"\"\"\\\n",
"你是一个乐于助人、尊重他人以及诚实可靠的助手。在安全的情况下,始终尽可能有帮助地回答。 您的回答不应包含任何有害、不道德、种族主义、性别歧视、有毒、危险或非法的内容。请确保您的回答在社会上是公正的和积极的。\n",
"如果一个问题没有任何意义或与事实不符,请解释原因,而不是回答错误的问题。如果您不知道问题的答案,请不要分享虚假信息。另外,答案请使用中文。\\\n",
"\"\"\"\n",
"\n",
"model_configuration = {\n",
" \"model_id\": \"Qwen/Qwen-7B-Chat\",\n",
" \"remote\": True,\n",
" \"start_message\": f\"<|im_start|>system\\n {DEFAULT_SYSTEM_PROMPT_CHINESE }<|im_end|>\",\n",
" \"history_template\": \"<|im_start|>user\\n{user}<im_end><|im_start|>assistant\\n{assistant}<|im_end|>\",\n",
" \"current_message_template\": '\"<|im_start|>user\\n{user}<im_end><|im_start|>assistant\\n{assistant}',\n",
" \"stop_tokens\": [\"<|im_end|>\", \"<|endoftext|>\"]\n",
"}\n",
"\n",
"model_name = model_configuration[\"model_id\"]\n",
"start_message = model_configuration[\"start_message\"]\n",
"history_template = model_configuration.get(\"history_template\")\n",
"current_message_template = model_configuration.get(\"current_message_template\")\n",
"stop_tokens = model_configuration.get(\"stop_tokens\")\n",
"tokenizer_kwargs = model_configuration.get(\"tokenizer_kwargs\", {})\n",
"\n",
"def convert_history(history):\n",
" text = start_message + \"\".join([\"\".join([history_template.format(num=round, user=item[0], assistant=item[1])]) \n",
" for round, item in enumerate(history[:-1])])\n",
" text += \"\".join([\"\".join([current_message_template.format(num=len(history) + 1, user=history[-1][0], assistant=history[-1][1])])])\n",
" return text"
]
},
{
"cell_type": "markdown",
"id": "2b543737-eb59-453c-99f5-236fa162ada3",
"metadata": {},
"source": [
"Now we can build the input tokens representing the user prompt to the model. The prompt is adjusted to conform to the expected chatbot template using the utility functions defined earlier and then tokenized using the corresponding tokenizer.\n",
"\n",
"**Note:** Although the GGUF file contains the vocabulary and the tokenizer information, the tokenization and detokenization steps are currently not part of the `LLAMA_CPP` flow. These steps of the LLM processing pipeline should be executed using other means (e.g. in Python you can instantiate the required tokenizers from the `transformers` library if these are available for your model, and optionally convert these to OpenVINO representation separately to be executed with other plugins such as 'CPU')."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "41f70a67-729c-4f3b-810c-d273d455e49e",
"metadata": {},
"outputs": [],
"source": [
"user_prompt = \"孙悟空是谁?\"\n",
"# user_prompt = \"太阳为什么是黄色的?\"\n",
"\n",
"formatted_input_prompt = convert_history([[user_prompt, \"\"]])\n",
"\n",
"from transformers import AutoTokenizer\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n",
"\n",
"initial_prompt_tokens = tokenizer(formatted_input_prompt, return_tensors=\"np\", **tokenizer_kwargs).input_ids"
]
},
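{
"cell_type": "markdown",
"id": "a1b2c3d4-0001-4aaa-8bbb-ccccdddd0001",
"metadata": {},
"source": [
"Optionally, the HuggingFace tokenizer can be converted to an OpenVINO model to be executed with another plugin such as `CPU`. Below is a minimal sketch, assuming the separate `openvino-tokenizers` package is installed and supports this tokenizer; it is kept commented out since it is not required for the rest of this notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d4-0002-4aaa-8bbb-ccccdddd0002",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only; requires the openvino-tokenizers package.\n",
"# from openvino_tokenizers import convert_tokenizer\n",
"# ov_tokenizer_model = convert_tokenizer(tokenizer)\n",
"# compiled_tokenizer = ov.Core().compile_model(ov_tokenizer_model, \"CPU\")"
]
},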
{
"cell_type": "markdown",
"id": "3c523b75-4acf-46cc-abb3-ff312321ff3d",
"metadata": {},
"source": [
"The initial prompt tokens are fed through the model to provide the context for the response generation. Note that the regular OpenVINO Python API is used to supply inputs and receive outputs:"
]
},
{
"cell_type": "markdown",
"id": "fa601c84-33d4-44a3-a27f-83f08eebe407",
"metadata": {},
"source": [
"The response tokens are generated one-by-one, conditioned on all previous tokens (within a certain window) by the KV-cache internally maintained by `llama.cpp`, until an end-of-sequence token is encountered as output or the maximum generated token count is reached."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cdbf4d60-519d-4c1d-862f-3796c362154c",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"sequence_length = len(initial_prompt_tokens[0])\n",
"position_ids = np.arange(0, sequence_length).reshape(initial_prompt_tokens.shape)\n",
"\n",
"output = ov_model({\"input_ids\": initial_prompt_tokens, \"position_ids\": position_ids})\n",
"logits = output[\"logits\"]\n",
"curr_token_ids = np.argmax(logits[:, -1, :], axis=1).reshape([1, 1])\n",
"\n",
"MAX_TOKENS_GENERATED = 256\n",
"STOP_TOKENS = [tokenizer(st, return_tensors=\"np\").input_ids[0][0] for st in model_configuration[\"stop_tokens\"]]\n",
"\n",
"curr_tokens_generated = 0\n",
"last_token_id = curr_token_ids[0][0]\n",
"\n",
"response_tokens = []\n",
"next_position_id = sequence_length - 1\n",
"\n",
"while (last_token_id not in STOP_TOKENS) and (curr_tokens_generated < MAX_TOKENS_GENERATED): \n",
" print(tokenizer.decode(last_token_id), end='')\n",
" curr_tokens_generated += 1\n",
" curr_position_ids = np.ndarray([1, 1], dtype=np.int64)\n",
" curr_position_ids[0][0] = next_position_id \n",
" next_position_id += 1\n",
" curr_generated_output = ov_model({\"input_ids\": curr_token_ids, \"position_ids\": curr_position_ids})\n",
" curr_logits = curr_generated_output[\"logits\"]\n",
" curr_token_ids = np.argmax(curr_logits[:, -1, :], axis=1).reshape([1, 1])\n",
" last_token_id = curr_token_ids[0][0]\n",
"\n",
"ov_model.create_infer_request().reset_state()"
Contributor

Do all InferRequests share state, such that one can reset it like this, or does this line create a new request and leave the original unaffected?

Contributor Author

The state is shared across all infer requests to the same model object.
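
For illustration, that behavior would mean the following (a hypothetical sketch, not code from the diff):

    req_a = ov_model.create_infer_request()
    req_b = ov_model.create_infer_request()
    req_a.infer({"input_ids": initial_prompt_tokens, "position_ids": position_ids})  # advances the shared KV cache
    req_b.reset_state()  # clears that same shared state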

Contributor

Is it a LLAMA_CPP plugin feature?

Maybe the code sample should create an explicit InferRequest object anyway, because this behavior does not align with the rest of OpenVINO (where each request has its own state) and might cause confusion in the future.

Contributor Author
@vshampor vshampor Apr 15, 2024

The InferRequest object is explicitly created on the line you highlighted, during the evaluation of the ov_model.create_infer_request() subexpression. On second look, it might be possible to have separate KV caches associated with separate infer requests, with memory consumption scaling accordingly with the number of live infer requests, but then I don't see how the syntax on line 248 could work in a stateful fashion.

Contributor

The first (implicit) infer request contains the KV cache state created during the first inference on line 229. It populates the CompiledModel._infer_request attribute and, as I understand, is used by every __call__ invocation (line 248), so this state lives as long as the CompiledModel object lives.

ov_model.create_infer_request() creates a new request with a blank state, so the reset_state() call on line 253 does not affect the CompiledModel._infer_request state.

Contributor Author

@akuporos does this mean that the __call__ API for OV inference is incompatible with stateful execution, and that all OV docs and notebooks should be adjusted to state this explicitly?

This could be easily fixed by providing a getter for the internal infer request in the CompiledModel object, I think.

Contributor

This could be easily fixed by providing a getter for the internal infer request in the CompiledModel object, I think.

CompiledModel.reset_state might be more consistent with the existing API. Otherwise, there is not much sense in a hidden InferRequest in the first place.

@jiwaszki jiwaszki Apr 26, 2024

@vshampor @apaniukov Yes, CompiledModel.reset_state can be more consistent, and you can create a request for such a feature. CompiledModel.__call__ was added months ago as a simplified way to work with OV, not as a functional replacement for all the APIs from InferRequest; in response to voices that the APIs "should align", that __call__ was the most that could be done. Let's keep that in mind.

there is not much sense in a hidden InferRequest in the first place.

It is/was, because a simplified API is what the name states.

providing a getter for the internal infer request

If you need access right now, CompiledModel._infer_request will work as a workaround. If it is None (i.e. no inference was called yet), just skip the call, or create a placeholder first:

  if compiled_model._infer_request is None:
      compiled_model._infer_request = compiled_model.create_infer_request()

So the Python API team would be happy to extend the CompiledModel interface if there is a direct need, but requires a story in JIRA to back it up.

@akuporos giving the floor back to you.

Contributor Author
@vshampor vshampor Apr 26, 2024

@jiwaszki in the context of stateful execution this is not a feature request but a bug report against the new API; I will submit the corresponding ticket. The user does not care about the reasons for introducing this exact version of inference, they care about fulfilling their use case. I think I have illustrated the direct need in this notebook. Also, the semi-casual user, familiar with the semantics of other inferencing frameworks, will try the __call__ API first since it is, as you've said, "aligned" with what those frameworks provide. Having CompiledModel.reset_state is fine by me.

If you need access right now, CompiledModel._infer_request will work as WA.

That is understood, but surely you can't be recommending coding against internal APIs in a user-facing example.

Contributor Author
@vshampor vshampor Apr 26, 2024

Meanwhile I rewrote the code to use InferRequests explicitly. The separate states for different infer request objects are introduced in #908.
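
For reference, the explicit flow could look roughly like this (an illustrative sketch, not the exact diff):

    infer_request = ov_model.create_infer_request()
    results = infer_request.infer({"input_ids": initial_prompt_tokens, "position_ids": position_ids})
    logits = results["logits"]
    # ...generation loop reusing the same infer_request...
    infer_request.reset_state()  # reset this request's state when the chat is over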

]
},
{
"cell_type": "markdown",
"id": "7ba9e1a2-beda-426a-9979-0e1ab2970430",
"metadata": {},
"source": [
"Note the last line in the cell above - since the model inference is stateful, we should reset the model's internal state if we want to process new text inputs that are unrelated to the current chatbot interaction. The same OpenVINO API call may be used to do this, just like with other OpenVINO \"stateful\" models:"
]
}
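,
{
"cell_type": "markdown",
"id": "a1b2c3d4-0003-4aaa-8bbb-ccccdddd0003",
"metadata": {},
"source": [
"A minimal sketch of such a reset before starting an unrelated conversation (assuming, per the author's comments in this review, that the internal KV-cache state is shared across infer requests to the same compiled model):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d4-0004-4aaa-8bbb-ccccdddd0004",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: clear the accumulated chat context, then feed a new, unrelated prompt.\n",
"ov_model.create_infer_request().reset_state()\n",
"\n",
"new_prompt_tokens = tokenizer(convert_history([[\"太阳为什么是黄色的?\", \"\"]]), return_tensors=\"np\").input_ids\n",
"new_position_ids = np.arange(new_prompt_tokens.shape[1]).reshape(new_prompt_tokens.shape)\n",
"new_output = ov_model({\"input_ids\": new_prompt_tokens, \"position_ids\": new_position_ids})"
]
}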
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}