This example demonstrates how to serve a LLaMA2-7B model using vLLM continuous batching on an Intel GPU (with BigDL-LLM low-bit optimizations).
The code shown in the following example is ported from vLLM.
In this example, we run the Llama2-7b model on an Arc A770 and provide an OpenAI-compatible
interface for users.
To use Intel GPUs for deep-learning tasks, you should install the XPU driver and the oneAPI Base Toolkit. Please check the requirements here.
After installing the toolkit, run the following commands in your environment before starting vLLM on the GPU:
source /opt/intel/oneapi/setvars.sh
# sycl-ls will list all the compatible Intel GPUs in your environment
sycl-ls
# Example output with one Arc A770:
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
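If you want to pick out the GPU entries programmatically, the listing above can be parsed with a few lines of Python. This is only a minimal sketch and not part of BigDL-LLM: SyclDevice, parse_sycl_ls and level_zero_gpus are hypothetical helper names, and the line format is assumed to match the example output shown above.

```python
import re
from typing import List, NamedTuple


class SyclDevice(NamedTuple):
    backend: str       # e.g. "opencl" or "ext_oneapi_level_zero"
    device_type: str   # "gpu", "cpu" or "acc"
    index: int
    description: str


# Matches lines such as "[opencl:gpu:2] Intel(R) OpenCL Graphics, ..."
_LINE_RE = re.compile(
    r"^\[(?P<backend>\w+):(?P<type>\w+):(?P<index>\d+)\]\s*(?P<desc>.*)$"
)


def parse_sycl_ls(output: str) -> List[SyclDevice]:
    """Turn sycl-ls text into a list of devices, skipping unrecognized lines."""
    devices = []
    for line in output.splitlines():
        match = _LINE_RE.match(line.strip())
        if match:
            devices.append(
                SyclDevice(
                    match["backend"],
                    match["type"],
                    int(match["index"]),
                    match["desc"],
                )
            )
    return devices


def level_zero_gpus(devices: List[SyclDevice]) -> List[SyclDevice]:
    """Keep only GPU entries reported through the Level-Zero backend."""
    return [
        d for d in devices
        if d.device_type == "gpu" and "level_zero" in d.backend
    ]


# Sample text copied from the example output above.
SAMPLE_OUTPUT = """\
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.7.0.21_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241]
"""

for device in level_zero_gpus(parse_sycl_ls(SAMPLE_OUTPUT)):
    print(device.index, device.description)
```

On a real machine, the text could instead come from subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout, run in a shell where setvars.sh has already been sourced.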
To run vLLM continuous batching on Intel GPUs, install the dependencies as follows:
# First create a conda environment
conda create -n bigdl-vllm python==3.9
conda activate bigdl-vllm
# Install dependencies
pip3 install psutil
pip3 install sentencepiece # Required for LLaMA tokenizer.
pip3 install numpy
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade "bigdl-llm[xpu]" -f https://developer.intel.com/ipex-whl-stable-xpu
pip3 install fastapi
pip3 install "uvicorn[standard]"
pip3 install "pydantic<2" # Required for OpenAI server.
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
To run offline inference using vLLM for a quick impression, use the following example:
#!/bin/bash
# Please first modify the MODEL_PATH in offline_inference.py
# Modify load_in_low_bit to use a different quantization dtype
python offline_inference.py
To fully utilize the continuous batching feature of vLLM, you can send requests to the service using curl or similar tools. Requests sent to the engine are batched at the token level: queries are executed in the same forward step of the LLM and are removed as soon as they finish, instead of waiting for all sequences to finish.
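The scheduling idea can be illustrated with a toy simulation. This is a simplified sketch, not the scheduler used by vLLM or BigDL-LLM: the function names are hypothetical, every request is assumed to need a known, positive number of decoding steps, and the only limit modeled is a maximum number of concurrently running requests.

```python
from collections import deque
from typing import Dict, List


def continuous_batching_steps(lengths: Dict[str, int], max_batch: int) -> List[List[str]]:
    """Return the request ids that run in each forward step.

    `lengths` maps a request id to the number of tokens it still has to generate.
    """
    waiting = deque(lengths.items())
    running: Dict[str, int] = {}
    schedule: List[List[str]] = []
    while waiting or running:
        # Admit waiting requests as soon as a slot is free.
        while waiting and len(running) < max_batch:
            req_id, remaining = waiting.popleft()
            running[req_id] = remaining
        schedule.append(sorted(running))
        # One forward step produces one token for every running request.
        for req_id in list(running):
            running[req_id] -= 1
            if running[req_id] <= 0:
                del running[req_id]  # finished requests leave immediately
    return schedule


def static_batching_steps(lengths: Dict[str, int], max_batch: int) -> int:
    """Forward steps needed when each batch waits for its longest sequence."""
    items = list(lengths.values())
    return sum(max(items[i:i + max_batch]) for i in range(0, len(items), max_batch))


requests = {"a": 1, "b": 3, "c": 2}
for step, batch in enumerate(continuous_batching_steps(requests, max_batch=2), start=1):
    print(f"step {step}: {batch}")
print("static batching steps:", static_batching_steps(requests, max_batch=2))
```

In this toy run, request c takes the slot freed by a after the first step, so all three requests finish in 3 forward steps, whereas waiting for each full batch to complete would take 5. The real server is launched as follows: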
#!/bin/bash
# You may also want to adjust the `--max-num-batched-tokens` argument, which sets the hard limit
# on the batched prompt length the server will accept
python -m bigdl.llm.vllm.entrypoints.openai.api_server \
--model /MODEL_PATH/Llama-2-7b-chat-hf/ --port 8000 \
--load-format 'auto' --device xpu --dtype bfloat16 \
--load-in-low-bit sym_int4 \
--max-num-batched-tokens 4096
Then you can access the API server as follows:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/MODEL_PATH/Llama-2-7b-chat-hf/",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}' &
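The same request can also be sent from Python. The following is a minimal sketch using only the standard library; build_completion_request, complete and complete_many are hypothetical helper names, the second prompt is made up for illustration, and the response is assumed to follow the OpenAI completions layout (generated text under choices[0].text). Sending several requests concurrently gives the engine the opportunity to batch them.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 128,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send one request and return the generated text (assumed response layout)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


def complete_many(requests: List[urllib.request.Request]) -> List[str]:
    """Send all requests concurrently so the server can batch them together."""
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(pool.map(complete, requests))


# Placeholder path: pass the same "model" value you would send with curl.
MODEL = "/MODEL_PATH/Llama-2-7b-chat-hf/"
PROMPTS = ["San Francisco is a", "The capital of France is"]
REQUESTS = [
    build_completion_request("http://localhost:8000", MODEL, prompt)
    for prompt in PROMPTS
]

# Once the server is running, send them with:
#     for text in complete_many(REQUESTS):
#         print(text)
```

Building the requests is kept separate from sending them so that the payload can be inspected without a running server.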
Currently, only LLaMA-family models are supported (including llama, vicuna, llama-2, etc.). To use another model, you may need to add some adaptations.
Create or clone the PyTorch model code to BigDL/python/llm/src/bigdl/llm/vllm/model_executor/models.
Referring to BigDL/python/llm/src/bigdl/llm/vllm/model_executor/models/bigdl_llama.py, you need to maintain a kv_cache, which is a nested list of dictionaries that map each req_id to a three-dimensional tensor (the structure may vary between models). Before calling the model's actual forward method, prepare past_key_values for the current req_id; afterwards, update the kv_cache with output.past_key_values. Cache entries are cleared when the request is finished.
Finally, register your *ForCausalLM class to the _MODEL_REGISTRY in BigDL/python/llm/src/bigdl/llm/vllm/model_executor/model_loader.py.
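The kv_cache bookkeeping described in the steps above can be sketched roughly as follows. This is an illustrative outline only, not the code in bigdl_llama.py: all function names are hypothetical, plain Python lists stand in for tensors so the sketch runs without PyTorch, and it assumes that all requests in a step are at the same stage (either all new or all already cached).

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Tensor = Any  # stand-in for torch.Tensor
# kv_cache[layer][0 = keys, 1 = values][req_id] -> cached tensor
KVCache = List[List[Dict[int, Tensor]]]
PastKeyValues = List[Tuple[List[Tensor], List[Tensor]]]


def new_kv_cache(num_layers: int) -> KVCache:
    return [[{}, {}] for _ in range(num_layers)]


def gather_past_key_values(kv_cache: KVCache,
                           req_ids: Sequence[int]) -> Optional[PastKeyValues]:
    """Collect cached keys/values for the requests in the current step.

    Returns None when any request has no cache yet (its first forward pass).
    """
    if any(req_id not in kv_cache[0][0] for req_id in req_ids):
        return None
    return [
        ([layer[0][r] for r in req_ids], [layer[1][r] for r in req_ids])
        for layer in kv_cache
    ]


def update_kv_cache(kv_cache: KVCache, req_ids: Sequence[int],
                    past_key_values: PastKeyValues) -> None:
    """Store the forward pass output back under each req_id."""
    for layer, (keys, values) in zip(kv_cache, past_key_values):
        for i, req_id in enumerate(req_ids):
            layer[0][req_id] = keys[i]
            layer[1][req_id] = values[i]


def clear_finished(kv_cache: KVCache, finished_req_ids: Sequence[int]) -> None:
    """Drop the entries of requests that have finished."""
    for layer in kv_cache:
        for req_id in finished_req_ids:
            layer[0].pop(req_id, None)
            layer[1].pop(req_id, None)


def fake_forward(req_ids: Sequence[int], past: Optional[PastKeyValues],
                 num_layers: int) -> PastKeyValues:
    """Hypothetical stand-in for the model: grows each cache entry by one step."""
    output = []
    for layer_idx in range(num_layers):
        old_keys = past[layer_idx][0] if past else [[] for _ in req_ids]
        old_values = past[layer_idx][1] if past else [[] for _ in req_ids]
        output.append(([k + ["k"] for k in old_keys],
                       [v + ["v"] for v in old_values]))
    return output


NUM_LAYERS = 2
cache = new_kv_cache(NUM_LAYERS)
batch = [7, 9]  # req_ids scheduled together
for _ in range(3):
    past = gather_past_key_values(cache, batch)
    output_past = fake_forward(batch, past, NUM_LAYERS)
    update_kv_cache(cache, batch, output_past)
print(cache[0][0][7])
```

In a real model, the per-request entries would be tensors that are stacked or padded into a batch before the forward call; the dictionary keyed by req_id is what lets requests join and leave the batch independently.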