This repository contains the implementation of a trainable layer skipping mechanism using the Gumbel Softmax function. The code is tailored for Llama2.
Contact person: Ji-Ung Lee
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions. Some of the code is based on the original Llama code, so you may find that repository helpful as well.
The repository is structured as follows:

* `configs` - contains five different configuration files:
    * `datasets.py` - available datasets, names, and splits
    * `fsdp.py` - settings for training on multiple GPUs and for quantization
    * `inference.py` - inference configuration, e.g., `max_new_tokens`, `temperature`, etc.
    * `peft.py` - settings for parameter-efficient training methods (e.g., LoRA)
    * `training.py` - settings for training, e.g., learning rate, batch size, etc.
* `llama_datasets` - data loading and preparation routines
* `model_checkpointing` - model checkpointing utilities (self-explanatory)
* `neuralnets` - implementation of the Gumbel Softmax for Llama2 (a minimal sketch of the idea follows this list)
* `utils` - various utilities for training, saving/loading models, etc.
* `policies` - utilities for FSDP (do not touch)
* `results` - result folder (needs to be created)
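The actual implementation lives in `neuralnets`; purely as an illustrative sketch of the underlying idea (all class and variable names below are hypothetical, not the repository's API), a trainable layer-skipping gate based on the straight-through Gumbel Softmax could look like this:

```python
# Illustrative sketch only: a per-layer gate that uses the straight-through
# Gumbel-Softmax trick to make a binary execute/skip decision for a
# transformer layer. Names are hypothetical, not the repository's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelLayerGate(nn.Module):
    def __init__(self, tau: float = 1.0):
        super().__init__()
        # Two logits: index 0 = "skip the layer", index 1 = "execute the layer".
        self.logits = nn.Parameter(torch.zeros(2))
        self.tau = tau

    def forward(self, hidden_states: torch.Tensor, layer: nn.Module) -> torch.Tensor:
        # hard=True samples a one-hot decision in the forward pass while the
        # backward pass uses the soft sample (straight-through estimator),
        # so the gate logits remain trainable.
        gate = F.gumbel_softmax(self.logits, tau=self.tau, hard=True)
        # Either the layer output (execute) or the unchanged hidden states (skip).
        return gate[1] * layer(hidden_states) + gate[0] * hidden_states
```

The gate logits are learned jointly with the rest of the model; once a gate consistently samples "skip", the corresponding layer can be bypassed at inference time, which is what makes this a layer-skipping mechanism.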
To run the experiments, first create a virtual environment (using, e.g., conda):
conda create --name=<envname> python=3.9
conda activate <envname>
pip install -r requirements.txt
We ran all our experiments with Python 3.9.
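Before launching any multi-GPU runs, it can be worth checking that the installed PyTorch actually sees your GPUs; a minimal sanity check (assuming `torch` is pulled in by `requirements.txt`) is:

```python
# Quick sanity check that PyTorch is installed and CUDA devices are visible
# (assumes torch is installed via requirements.txt).
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
```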
To finetune a base model, you can use `torchrun` with `finetune_llama2.py`; the example below trains on two GPUs (`--nproc_per_node 2`):

torchrun --standalone \
--nnodes 1 --nproc_per_node 2 finetune_llama2.py \
--batch_size_training 1 \
--model_name "<path-to-model>" \
--pure_bf16 \
--num_epochs 3 \
--output_dir "model_output/llama-7b-trained-gumbel" \
--gumbel 1 \
--dataset "samsum_dataset"
A detailed description of all parameters is provided in `configs/training.py`.
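The exact set of options is defined in that file; as a hedged sketch of the kind of configuration dataclass you can expect to find there (only the fields appearing in the command above are taken from this README, the remaining names and defaults, e.g. `lr`, are assumptions):

```python
# Hypothetical sketch of a training configuration as it might appear in
# configs/training.py; field names not used in the torchrun example above
# are illustrative assumptions, not the repository's actual defaults.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_name: str = "<path-to-model>"      # base model to finetune
    dataset: str = "samsum_dataset"          # one of the datasets in configs/datasets.py
    batch_size_training: int = 1
    num_epochs: int = 3
    pure_bf16: bool = False                  # train purely in bfloat16
    gumbel: int = 0                          # 1 enables the trainable layer-skipping gates
    output_dir: str = "model_output/llama-7b-trained-gumbel"
    lr: float = 1e-4                         # assumed default learning rate
```

Fields of this kind are typically overridden on the command line, as the `torchrun` call above does for several of them.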
To perform inference, you can use `torchrun` with `evaluate_llama2.py`:
torchrun --standalone \
--nnodes 1 --nproc_per_node 1 evaluate_llama2.py \
--model_name "model_output/llama-7b-trained-gumbel" \
--use_gumbel 1 \
--output_dir "results"
This will also measure the overall time taken for inference (normalized by the number of generated tokens) and keep track of the layers that were activated by the Gumbel Softmax (written to `activations.json`). The scores and the generated texts will be written to `.json` and `.tsv` files.
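The exact layout of `activations.json` is not documented here; assuming it maps each evaluated example to the list of layer indices that were active, a small post-processing script to see how often each layer is used could look like this:

```python
# Hypothetical post-processing of results/activations.json. The assumed layout
# (example id -> list of active layer indices) is an assumption, not a
# documented format of this repository.
import json
from collections import Counter

with open("results/activations.json") as f:
    activations = json.load(f)

layer_counts = Counter()
for example_id, active_layers in activations.items():
    layer_counts.update(active_layers)

num_examples = len(activations)
for layer, count in sorted(layer_counts.items()):
    print(f"layer {layer}: active in {count}/{num_examples} examples")
```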
This work has no accompanying paper. However, you may cite the preliminary work on adaptable adapters that served as the basis for this work:
@inproceedings{moosavi-etal-2022-adaptable,
title = "Adaptable Adapters",
author = "Moosavi, Nafise and
Delfosse, Quentin and
Kersting, Kristian and
Gurevych, Iryna",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jul,
year = "2022",
address = "Seattle, United States",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-main.274",
pages = "3742--3753",
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.