Welcome to the 'Simpson's LLM XPU' repository, where we finetune a large language model (LLM) on Intel discrete GPUs to generate dialogues based on the 'Simpsons' dataset.
The implementation builds on the original idea and excellent dataset-preparation work done by Replicate. In case the Replicate link is unavailable, please refer to my forked version for guidelines on preparing the dataset. The preparation steps are laid out simply in a Jupyter notebook.
After the data is generated, copy the data.json file to the utils directory in the repo and run python rename_data_keys.py to get isdata.json, which is the dataset the finetuning will be done on.
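For reference, the renaming step is a small transformation over the JSON records. The sketch below is only illustrative: the key names used here (mapping "prompt" and "completion" to "instruction" and "output") are assumptions for the example, so check utils/rename_data_keys.py for the actual mapping.

import json

# Illustrative sketch of the kind of renaming utils/rename_data_keys.py performs.
# The key names below are assumptions for this example, not the script's actual mapping.
with open("data.json") as f:
    records = json.load(f)

renamed = [{"instruction": r["prompt"], "output": r["completion"]} for r in records]

with open("isdata.json", "w") as f:
    json.dump(renamed, f, indent=2)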
To use this code, start by preparing the dataset as suggested in the Replicate blog, then run one of the finetuning scripts below:
python finetune_no_trainer.py
python finetune_no_trainer_lora.py
Note - For this, you will have to clone a patched version of Transformers from this repo and install it manually using:
git clone https://github.com/rahulunair/transformers_xpu
cd transformers_xpu
git checkout xpu_trainer
python setup.py install
cd .. && rm -rf transformers_xpu
python finetune.py
Regarding oneCCL: the Intel oneAPI Collective Communications Library (oneCCL) provides the routines needed for communication between devices in distributed systems. These routines are built with a focus on performance and provide efficient inter-node and intra-node communication, making them suitable for multi-node setups with multi-core CPUs and accelerators. We use the PyTorch bindings for oneCCL (torch_ccl) to do distributed training; torch_ccl can be installed using prebuilt wheels from here. As we are using the HuggingFace Trainer* object, we don't have to change the code in any way, we just execute it using mpi.
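For context, here is a minimal sketch of what happens under the hood when a script is launched with mpirun: importing the oneCCL bindings registers a "ccl" backend with torch.distributed, and the MPI-provided rank information (assumed here to live in PMI_RANK and PMI_SIZE, as set by Intel MPI) is used to initialize the process group. The patched Trainer handles this for you; the snippet only shows the mechanism.

# Sketch of ccl backend initialization when launching with mpirun.
# PMI_RANK / PMI_SIZE are what Intel MPI typically sets; adjust for other MPI launchers.
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401, registers the "ccl" backend

os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with ccl")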
First, set up the oneCCL environment variables by executing:
oneccl_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_path/env/setvars.sh
Then set these environment variables for MPI:
export MASTER_ADDR=127.0.0.1
export CCL_ZE_IPC_EXCHANGE=sockets
export FI_PROVIDER=sockets
Then, execute the following command to initiate the finetuning process across multiple XPUs:
mpirun -n 4 python finetune.py # uses 4 Intel Data Center GPU Max 1550
To debug the oneCCL backend, set this environment variable before executing mpirun:
export CCL_LOG_LEVEL=debug
I have also provided a small standalone program in the utils directory to check whether your setup for distributed communication works correctly. To run it, use:
mpirun -n 2 python utils/oneccl_test.py # using 2 processes
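If you are curious what such a check typically does, the common pattern is an all_reduce across ranks followed by a comparison against the expected sum. The sketch below is illustrative only, assumes one XPU device per rank and Intel MPI environment variables, and the actual utils/oneccl_test.py may differ.

# Illustrative distributed sanity check; the real utils/oneccl_test.py may differ.
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401, enables the "xpu" device
import oneccl_bindings_for_pytorch  # noqa: F401, registers the "ccl" backend

# map MPI-provided rank info (assumed Intel MPI variables) for init_process_group
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl")
rank, world = dist.get_rank(), dist.get_world_size()

# each rank contributes its rank id; after all_reduce every rank holds 0 + 1 + ... + (world - 1)
t = torch.full((4,), float(rank), device=f"xpu:{rank}")  # assumes one XPU per rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)
assert torch.allclose(t.cpu(), torch.full((4,), float(sum(range(world)))))
print(f"rank {rank}: all_reduce OK")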
Once the finetuning is complete, you can test the model with the following command:
python inference.py --infer
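For reference, inference with a LoRA-finetuned model usually means loading the base model, attaching the saved adapter weights, and calling generate. The sketch below is only an illustration with placeholder model names, paths, and prompt; the repo's inference.py may load the model differently.

# Illustrative inference sketch; names and paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "gpt2"                   # placeholder base model
adapter_dir = "output/lora_adapter"  # placeholder path to the finetuned adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the LoRA weights
model.eval()

prompt = "Homer: Marge, have you seen my"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))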
To get a better understanding of Low-Rank Adaptation (LoRA) and the finetuning approach, I have added a literate version of the finetune.py file as a Jupyter notebook - literate_finetune.ipynb. This version explains each step in detail and includes code snippets to give a comprehensive picture of the finetuning process.
By going through this literate version, I hope you can gain insight into how LoRA works, how it interacts with the training process, and how you can use Intel GPUs for efficient finetuning. This is especially useful for practitioners new to language model finetuning, or for those looking to gain a deeper understanding of the process.
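To make the idea concrete, here is a minimal, illustrative sketch of the usual LoRA pattern using the peft library; the base model name and hyperparameters are placeholders, and the repo's scripts may set the adapters up differently.

# Minimal LoRA sketch using the peft library; model name and hyperparameters are
# illustrative placeholders, not the repo's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor applied to the adapter output
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable

# to train on an Intel GPU, move the model to the xpu device
# (requires intel_extension_for_pytorch to be installed):
# import intel_extension_for_pytorch as ipex
# model = model.to("xpu")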
Happy Finetuning!
* We use a forked version of HuggingFace Transformers; it can be found here.