In this directory, you will find the examples on how IPEX-LLM accelerate inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed) on Intel GPUs. See here to view the paper and here for more info on EAGLE code.
To apply Intel GPU acceleration, there’re several steps for tools installation and environment preparation. See the GPU installation guide for more details.
Step 1, only Linux system is supported now, Ubuntu 22.04 is prefered.
Step 2, please refer to our driver installation for general purpose GPU capabilities.
Note: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219.
Step 3, you also need to download and install Intel® oneAPI Base Toolkit. OneMKL and DPC++ compiler are needed, others are optional.
Note: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit's version == 2024.0.
- Intel Data Center GPU Max Series
- Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel GPUs. We use EAGLE-2 which have better performance than EAGLE-1
We suggest using conda to manage environment:
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install -e .
We suggest using conda to manage environment:
conda create -n llm python=3.11 libuv
conda activate llm
# below command will use pip to install the Intel oneAPI Base Toolkit 2024.0
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install -e .
Note
Skip this step if you are running on Windows.
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
source /opt/intel/oneapi/setvars.sh
For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
For Intel Data Center GPU Max Series
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1
Note: Please note that
libtcmalloc.so
can be installed byconda install -c conda-forge -y gperftools=2.10
.
You can test the speed of EAGLE speculative sampling with ipex-llm on MT-bench using the following command.
python -m evaluation.gen_ea_answer_llama2chat_e2_ipex_optimize\
--ea-model-path [path of EAGLE weight]\
--base-model-path [path of the original model]\
--enable-ipex-llm\
Please refer to here for the complete list of available EAGLE weights.
The above command will generate a .jsonl file that records the generation results and wall time. Then, you can use evaluation/speed.py to calculate the speed.
python -m evaluation.speed\
--base-model-path [path of the original model]\
--jsonl-file [pathname of the .jsonl file]\