Add more examples for pipeline parallel inference #11372

Merged: 7 commits, Jun 21, 2024
Changes from 5 commits
python/llm/example/GPU/Pipeline-Parallel-Inference/README.md (93 changes: 92 additions & 1 deletion)
@@ -16,7 +16,17 @@ To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements
- [baichuan-inc/Baichuan2-13B-Chat](./run_baichuan2_arc_2_card.sh)
- [microsoft/Phi-3-mini-4k-instruct](./run_phi3_arc_2_card.sh)
- [microsoft/Phi-3-medium-4k-instruct](./run_phi3_arc_2_card.sh)

- [mistralai/Mistral-7B-v0.1](./run_mistral_arc_2_card.sh)
- [mistralai/Mixtral-8x7B-Instruct-v0.1](./run_mistral_arc_2_card.sh)
- [01-ai/Yi-6B-Chat](./run_yi_arc_2_card.sh)
- [01-ai/Yi-34B-Chat](./run_yi_arc_2_card.sh)
- [codellama/CodeLlama-7b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [codellama/CodeLlama-13b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [codellama/CodeLlama-34b-Instruct-hf](./run_codellama_arc_2_card.sh)
- [upstage/SOLAR-10.7B-Instruct-v1.0](./run_solar_arc_2_card.sh)
- [lmsys/vicuna-7b-v1.3](./run_vicuna_arc_2_card.sh)
- [lmsys/vicuna-13b-v1.3](./run_vicuna_arc_2_card.sh)
- [lmsys/vicuna-33b-v1.3](./run_vicuna_arc_2_card.sh)

## Example: Run pipeline parallel inference on multiple GPUs

@@ -101,6 +111,87 @@ bash run_phi3_arc_2_card.sh

</details>

<details>
<summary> Show Mistral/Mixtral example </summary>

#### Run Mistral-7B-v0.1 / Mixtral-8x7B-Instruct-v0.1 on two Intel Arc A770

You can specify `--repo-id-or-model-path` in the test script to be the Hugging Face repo id of the Mistral / Mixtral model to be downloaded, or the path to a local Hugging Face checkpoint folder; a local-path sketch follows the commands below. You can also change `NUM_GPUS` to the number of GPUs available on your machine.

```bash
pip install transformers==4.37.0
bash run_mistral_arc_2_card.sh
```
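
For instance, to load from a local checkpoint instead of downloading from Hugging Face, you could point the torchrun line in `run_mistral_arc_2_card.sh` at a folder path. A minimal sketch, assuming the environment setup at the top of the script is kept as shipped (`/models/Mistral-7B-v0.1` below is a placeholder path, not a real location):

```bash
NUM_GPUS=2 # number of GPUs to use

# Hypothetical local folder; replace it with wherever your checkpoint actually lives.
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path '/models/Mistral-7B-v0.1' --gpu-num $NUM_GPUS
```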

</details>

<details>
<summary> Show Yi example </summary>

#### Run Yi-6B-Chat / Yi-34B-Chat on two Intel Arc A770

You can specify `--repo-id-or-model-path` in the test script to be the Hugging Face repo id of the Yi model to be downloaded, or the path to a local Hugging Face checkpoint folder. You can also change `NUM_GPUS` to the number of GPUs available on your machine.

```bash
pip install transformers==4.37.0
bash run_yi_arc_2_card.sh
```

</details>

<details>
<summary> Show Codellama example </summary>

#### Run CodeLlama-7b-Instruct-hf / CodeLlama-13b-Instruct-hf / CodeLlama-34b-Instruct-hf on two Intel Arc A770

You can specify `--repo-id-or-model-path` in the test script to be the Hugging Face repo id of the CodeLlama model to be downloaded, or the path to a local Hugging Face checkpoint folder. You can also change `NUM_GPUS` to the number of GPUs available on your machine, as sketched after the commands below.

```bash
pip install transformers==4.37.0
bash run_codellama_arc_2_card.sh
```
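
If your machine has more than two GPUs, you could also raise `NUM_GPUS` and switch to one of the larger checkpoints inside `run_codellama_arc_2_card.sh`. A sketch under the assumption that four GPUs are available and the rest of the script is kept as shipped:

```bash
NUM_GPUS=4 # number of GPUs to use (assumes four GPUs on this machine)

# Run the 34b variant instead of the default 7b command:
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
  generate.py --repo-id-or-model-path 'codellama/CodeLlama-34b-Instruct-hf' --gpu-num $NUM_GPUS
```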

</details>

<details>
<summary> Show Solar example </summary>

#### Run SOLAR-10.7B-Instruct-v1.0 on two Intel Arc A770

You can specify `--repo-id-or-model-path` in the test script to be the Hugging Face repo id of the SOLAR model to be downloaded, or the path to a local Hugging Face checkpoint folder. You can also change `NUM_GPUS` to the number of GPUs available on your machine.

```bash
pip install transformers==4.37.0
bash run_solar_arc_2_card.sh
```

</details>

<details>
<summary> Show Vicuna example </summary>

#### Run vicuna-7b-v1.3 / vicuna-13b-v1.3 / vicuna-33b-v1.3 on two Intel Arc A770

You can specify `--repo-id-or-model-path` in the test script to be the Hugging Face repo id of the Vicuna model to be downloaded, or the path to a local Hugging Face checkpoint folder. You can also change `NUM_GPUS` to the number of GPUs available on your machine.

```bash
pip install transformers==4.37.0
bash run_vicuna_arc_2_card.sh
```

</details>
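
Whichever script you run, it can help to first confirm that the Intel GPUs you plan to use are visible to the SYCL runtime. One possible check, assuming the same oneAPI environment that the scripts already source:

```bash
# List the devices the SYCL runtime can see; each Intel Arc A770 you intend to
# use should appear among them.
source /opt/intel/oneapi/setvars.sh
sycl-ls
```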


### 3. Sample Output
#### [meta-llama/Llama-2-13b-chat-hf](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf)
python/llm/example/GPU/Pipeline-Parallel-Inference/run_codellama_arc_2_card.sh (25 additions)
@@ -0,0 +1,25 @@
source /opt/intel/oneapi/setvars.sh
> Reviewer (Contributor): Please add license for each script.

export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run CodeLlama-7b-Instruct-hf
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
generate.py --repo-id-or-model-path 'codellama/CodeLlama-7b-Instruct-hf' --gpu-num $NUM_GPUS

# To run CodeLlama-13b-Instruct-hf
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path 'codellama/CodeLlama-13b-Instruct-hf' --gpu-num $NUM_GPUS
> Reviewer (Contributor): 'codellama/CodeLlama-7b-Instruct-hf' -> 'codellama/CodeLlama-13b-Instruct-hf'


# To run CodeLlama-34b-Instruct-hf
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path 'codellama/CodeLlama-34b-Instruct-hf' --gpu-num $NUM_GPUS
python/llm/example/GPU/Pipeline-Parallel-Inference/run_mistral_arc_2_card.sh (21 additions)
@@ -0,0 +1,21 @@
source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run Mistral-7B-v0.1
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
generate.py --repo-id-or-model-path 'mistralai/Mistral-7B-v0.1' --gpu-num $NUM_GPUS

# To run Mixtral-8x7B-Instruct-v0.1
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path 'mistralai/Mixtral-8x7B-Instruct-v0.1' --gpu-num $NUM_GPUS
python/llm/example/GPU/Pipeline-Parallel-Inference/run_solar_arc_2_card.sh (17 additions)
@@ -0,0 +1,17 @@
source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run SOLAR-10.7B-Instruct-v1.0
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
generate.py --repo-id-or-model-path 'upstage/SOLAR-10.7B-Instruct-v1.0' --gpu-num $NUM_GPUS
python/llm/example/GPU/Pipeline-Parallel-Inference/run_vicuna_arc_2_card.sh (25 additions)
@@ -0,0 +1,25 @@
source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run vicuna-7b-v1.3
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
generate.py --repo-id-or-model-path 'lmsys/vicuna-7b-v1.3' --gpu-num $NUM_GPUS

# To run vicuna-13b-v1.3
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path 'lmsys/vicuna-13b-v1.3' --gpu-num $NUM_GPUS

# To run vicuna-33b-v1.3
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path 'lmsys/vicuna-33b-v1.3' --gpu-num $NUM_GPUS
python/llm/example/GPU/Pipeline-Parallel-Inference/run_yi_arc_2_card.sh (21 additions)
@@ -0,0 +1,21 @@
source /opt/intel/oneapi/setvars.sh
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=9090
export FI_PROVIDER=tcp
export USE_XETLA=OFF
export OMP_NUM_THREADS=6
export IPEX_LLM_QUANTIZE_KV_CACHE=1
if [[ $KERNEL_VERSION != *"6.5"* ]]; then
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
fi
export TORCH_LLM_ALLREDUCE=0

NUM_GPUS=2 # number of GPUs to use

# To run Yi-6B-Chat
CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
generate.py --repo-id-or-model-path '01-ai/Yi-6B-Chat' --gpu-num $NUM_GPUS

# To run Yi-34B-Chat
# CCL_ZE_IPC_EXCHANGE=sockets torchrun --standalone --nnodes=1 --nproc-per-node $NUM_GPUS \
# generate.py --repo-id-or-model-path '01-ai/Yi-34B-Chat' --gpu-num $NUM_GPUS
python/llm/src/ipex_llm/transformers/pipeline_parallel.py (3 changes: 2 additions & 1 deletion)
@@ -234,7 +234,8 @@ def pipeline_parallel_generate(self,
"make sure that `pad_token_id` is defined.")
next_ids = next_ids * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)

- if isinstance(outputs.past_key_values, tuple) and local_rank != 0:
+ # Temporarily specify as Baichuan and ChatGLM
+ if self.config.model_type in ["baichuan", "chatglm"] and local_rank != 0:
value_placeholder = torch.empty_like((outputs.past_key_values)[-1][0])
past_key_values_placeholder = tuple(
(value_placeholder, value_placeholder) for _ in range(layer_start)
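
The hunk is collapsed after this point. For reference, here is a standalone sketch of what the placeholder construction does for non-zero ranks; the helper name and the final concatenation with this rank's own cache entries are illustrative assumptions, not the module's actual structure:

```python
import torch

def build_kv_placeholder(past_key_values, layer_start):
    # Borrow the shape and dtype of an existing cache tensor so the dummy
    # entries look like the (key, value) pairs held by earlier pipeline stages.
    value_placeholder = torch.empty_like(past_key_values[-1][0])
    placeholder = tuple(
        (value_placeholder, value_placeholder) for _ in range(layer_start)
    )
    # Assumption: the dummy entries are prepended so that indexing the cache
    # by absolute layer id still works on this rank.
    return placeholder + tuple(past_key_values[layer_start:])
```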