Llama 3 is the latest Large Language Models released by Meta which provides state-of-the-art performance and excels at language nuances, contextual understanding, and complex tasks like translation and dialogue generation.
Now, you can easily run Llama 3 on Intel GPU using llama.cpp
and Ollama
with IPEX-LLM.
See the demo of running Llama-3-8B-Instruct on Intel Arc GPU using Ollama
below.
You could also click here to watch the demo video. |
This quickstart guide walks you through how to run Llama 3 on Intel GPU using llama.cpp
/ Ollama
with IPEX-LLM.
Visit Run llama.cpp with IPEX-LLM on Intel GPU Guide, and follow the instructions in section Prerequisites to setup and section Install IPEX-LLM for llama.cpp to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section Initialize llama.cpp with IPEX-LLM to initialize.
After above steps, you should have created a conda environment, named llm-cpp
for instance and have llama.cpp binaries in your current directory.
Now you can use these executable files by standard llama.cpp usage.
There already are some GGUF models of Llama3 in community, here we take Meta-Llama-3-8B-Instruct-GGUF for example.
Suppose you have downloaded a Meta-Llama-3-8B-Instruct-Q4_K_M.gguf model from Meta-Llama-3-8B-Instruct-GGUF and put it under <model_dir>
.
To use GPU acceleration, several environment variables are required or recommended before running llama.cpp
.
-
For Linux users:
source /opt/intel/oneapi/setvars.sh export SYCL_CACHE_PERSISTENT=1 # [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 # [optional] if you want to run on single GPU, use below command to limit GPU may improve performance export ONEAPI_DEVICE_SELECTOR=level_zero:0
-
For Windows users:
Please run the following command in Miniforge Prompt.
set SYCL_CACHE_PERSISTENT=1 rem under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
Tip
When your machine has multi GPUs and you want to run on one of them, you need to set ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]
, here [gpu_id]
varies based on your requirement. For more details, you can refer to this section.
Note
The environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to this article.
Under your current directory, exceuting below command to do inference with Llama3:
-
For Linux users:
./llama-cli -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -c 1024 -t 8 -e -ngl 33 --color --no-mmap
-
For Windows users:
Please run the following command in Miniforge Prompt.
llama-cli -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -c 1024 -e -ngl 33 --color --no-mmap
Under your current directory, you can also execute below command to have interactive chat with Llama3:
-
For Linux users:
./llama-cli -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -c 1024
-
For Windows users:
Please run the following command in Miniforge Prompt.
llama-cli -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m <model_dir>/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -c 1024
Below is a sample output on Intel Arc GPU:
Visit Run Ollama with IPEX-LLM on Intel GPU, and follow the instructions in section Install IPEX-LLM for llama.cpp to install the IPEX-LLM with Ollama binary, then follow the instructions in section Initialize Ollama to initialize.
After above steps, you should have created a conda environment, named llm-cpp
for instance and have ollama binary file in your current directory.
Now you can use this executable file by standard Ollama usage.
ollama/ollama has alreadly added Llama3 into its library, so it's really easy to run Llama3 using ollama now.
Launch the Ollama service:
-
For Linux users:
export no_proxy=localhost,127.0.0.1 export ZES_ENABLE_SYSMAN=1 export OLLAMA_NUM_GPU=999 source /opt/intel/oneapi/setvars.sh export SYCL_CACHE_PERSISTENT=1 # [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 # [optional] if you want to run on single GPU, use below command to limit GPU may improve performance export ONEAPI_DEVICE_SELECTOR=level_zero:0 ./ollama serve
-
For Windows users:
Please run the following command in Miniforge Prompt.
set no_proxy=localhost,127.0.0.1 set ZES_ENABLE_SYSMAN=1 set OLLAMA_NUM_GPU=999 set SYCL_CACHE_PERSISTENT=1 rem under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation set SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 ollama serve
Note
To allow the service to accept connections from all IP addresses, use OLLAMA_HOST=0.0.0.0 ./ollama serve
instead of just ./ollama serve
.
Tip
When your machine has multi GPUs and you want to run on one of them, you need to set ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]
, here [gpu_id]
varies based on your requirement. For more details, you can refer to this section.
Note
The environment variable SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS
determines the usage of immediate command lists for task submission to the GPU. While this mode typically enhances performance, exceptions may occur. Please consider experimenting with and without this environment variable for best performance. For more details, you can refer to this article.
Keep the Ollama service on and open another terminal and run llama3 with ollama run
:
-
For Linux users:
export no_proxy=localhost,127.0.0.1 ./ollama run llama3:8b-instruct-q4_K_M
-
For Windows users:
Please run the following command in Miniforge Prompt.
set no_proxy=localhost,127.0.0.1 ollama run llama3:8b-instruct-q4_K_M
Note
Here we just take llama3:8b-instruct-q4_K_M
for example, you can replace it with any other Llama3 model you want.