Welcome to the 'Simpson's LLM XPU' repository, where we finetune a large language model (LLM) on Intel discrete GPUs to generate dialogues based on the 'Simpsons' dataset.
The implementation builds on the original idea and excellent dataset-preparation work done by Replicate. In case the Replicate link is unavailable, please refer to my forked version for guidelines on preparing the dataset. The preparation steps are laid out simply in a Jupyter notebook.
After the data is generated, copy the data.json file to the utils directory in the repo and run python rename_data_keys.py to get isdata.json, which is the dataset the finetuning will be done on.
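For reference, the renaming step is a small transformation over the JSON records. The sketch below is only illustrative: the key names used here (mapping "prompt" and "completion" to "instruction" and "output") are assumptions for the example, so check utils/rename_data_keys.py for the actual mapping.

import json

# Illustrative sketch of the kind of renaming utils/rename_data_keys.py performs.
# The key names below are assumptions for this example, not the script's actual mapping.
with open("data.json") as f:
    records = json.load(f)

renamed = [{"instruction": r["prompt"], "output": r["completion"]} for r in records]

with open("isdata.json", "w") as f:
    json.dump(renamed, f, indent=2)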
To use this code, start by preparing the dataset as suggested in the Replicate blog, then run one of the finetuning scripts below:
python finetune_no_trainer.py
python finetune_no_trainer_lora.py
Note - For this, you will have to clone a patched version of Transformers from this repo and install it manually using:
git clone https://github.com/rahulunair/transformers_xpu
cd transformers_xpu
git checkout xpu_trainer
python setup.py install
cd .. && rm -rf transformers_xpu
python finetune.py
Regarding oneCCL: the Intel oneAPI Collective Communications Library (oneCCL) provides the routines needed for communication between devices in distributed systems. These routines are built with a focus on performance and provide efficient inter-node and intra-node communication, making them suitable for multi-node setups with multi-core CPUs and accelerators. We use the PyTorch bindings for oneCCL (torch_ccl) to do distributed training; torch_ccl can be installed using prebuilt wheels from here. As we are using the HuggingFace Trainer* object, we don't have to change the code in any way, we just execute it using mpi.
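For context, here is a minimal sketch of what happens under the hood when a script is launched with mpirun: importing the oneCCL bindings registers a "ccl" backend with torch.distributed, and the MPI-provided rank information (assumed here to live in PMI_RANK and PMI_SIZE, as set by Intel MPI) is used to initialize the process group. The patched Trainer handles this for you; the snippet only shows the mechanism.

# Sketch of ccl backend initialization when launching with mpirun.
# PMI_RANK / PMI_SIZE are what Intel MPI typically sets; adjust for other MPI launchers.
import os
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401, registers the "ccl" backend

os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl")
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized with ccl")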
First, set up the oneCCL environment variables by executing:
oneccl_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_path/env/setvars.sh
Then set these environment variables for MPI:
export MASTER_ADDR=127.0.0.1
export CCL_ZE_IPC_EXCHANGE=sockets
export FI_PROVIDER=sockets
Then, execute the following command to initiate the finetuning process across multiple XPUs:
mpirun -n 4 python finetune.py # uses 4 Intel Data Center GPU Max 1550
To debug the oneCCL backend, set this environment variable before executing mpirun:
export CCL_LOG_LEVEL=debug
I have also provided a small standalone program in the utils directory to check whether your setup for distributed communication works correctly. To run it, use:
mpirun -n 2 python utils/oneccl_test.py # using 2 processes
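If you are curious what such a check typically does, the common pattern is an all_reduce across ranks followed by a comparison against the expected sum. The sketch below is illustrative only, assumes one XPU device per rank and Intel MPI environment variables, and the actual utils/oneccl_test.py may differ.

# Illustrative distributed sanity check; the real utils/oneccl_test.py may differ.
import os
import torch
import torch.distributed as dist
import intel_extension_for_pytorch  # noqa: F401, enables the "xpu" device
import oneccl_bindings_for_pytorch  # noqa: F401, registers the "ccl" backend

# map MPI-provided rank info (assumed Intel MPI variables) for init_process_group
os.environ.setdefault("RANK", os.environ.get("PMI_RANK", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("PMI_SIZE", "1"))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl")
rank, world = dist.get_rank(), dist.get_world_size()

# each rank contributes its rank id; after all_reduce every rank holds 0 + 1 + ... + (world - 1)
t = torch.full((4,), float(rank), device=f"xpu:{rank}")  # assumes one XPU per rank
dist.all_reduce(t, op=dist.ReduceOp.SUM)
assert torch.allclose(t.cpu(), torch.full((4,), float(sum(range(world)))))
print(f"rank {rank}: all_reduce OK")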
Once the finetuning is complete, you can test the model with the following command:
python inference.py --infer
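For reference, inference with a LoRA-finetuned model usually means loading the base model, attaching the saved adapter weights, and calling generate. The sketch below is only an illustration with placeholder model names, paths, and prompt; the repo's inference.py may load the model differently.

# Illustrative inference sketch; names and paths below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "gpt2"                   # placeholder base model
adapter_dir = "output/lora_adapter"  # placeholder path to the finetuned adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(model, adapter_dir)  # attach the LoRA weights
model.eval()

prompt = "Homer: Marge, have you seen my"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))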
To get a better understanding of Low-Rank Adaptation (LoRA) and the finetuning approach, I have added a literate version of the finetune.py file as a Jupyter notebook - literate_finetune.ipynb. This version explains each step in detail and includes code snippets to give a comprehensive picture of the finetuning process.
By going through this literate version, I hope you can gain insight into how LoRA works, how it interacts with the training process, and how you can use Intel GPUs for efficient finetuning. This is especially useful for practitioners new to language model finetuning, or for those looking to gain a deeper understanding of the process.
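To make the idea concrete, here is a minimal, illustrative sketch of the usual LoRA pattern using the peft library; the base model name and hyperparameters are placeholders, and the repo's scripts may set the adapters up differently.

# Minimal LoRA sketch using the peft library; model name and hyperparameters are
# illustrative placeholders, not the repo's actual configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "gpt2"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,              # rank of the low-rank update matrices
    lora_alpha=16,    # scaling factor applied to the adapter output
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights are trainable

# to train on an Intel GPU, move the model to the xpu device
# (requires intel_extension_for_pytorch to be installed):
# import intel_extension_for_pytorch as ipex
# model = model.to("xpu")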
Happy Finetuning!
* We use a forked version of HuggingFace Transformers; it can be found here.