Note: Tulu 1/2 results used an earlier version of Open Instruct with a pinned version of Transformers. If you are looking to replicate these results, refer to this commit or older.
Our checkpoints can be found:
- Here for all Tulu v1 models.
- Here for all Tulu v2 models.
- OLMo 7B SFT and Instruct, along with a 2048 sequence length version of Tulu 2.
Our Tulu V1 models were released as weight diffs (due to the LLaMa 1 license). We use a slightly modified form of the Alpaca weight diff script, which is run the same way.
To merge a model:
- Download the relevant LLaMa model and convert it to Hugging Face format (see above).
- Download our repository and install the right dependencies (see above).
- Download the model diff you want.
- Run the command below:
python scripts/weights/weight_diff.py recover --path_raw ${hf_llama_path} --path_tuned ${output_path} --path_diff ${diff_location}
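As a rough illustration of what recovering a model from a weight diff involves, the sketch below assumes the diff is a plain element-wise difference between the tuned and raw parameters. The function names and the toy dictionaries are hypothetical stand-ins for real checkpoints and are not taken from scripts/weights/weight_diff.py, which may perform additional checks.

```python
# Toy sketch of weight-diff recovery: tuned = raw + diff, parameter by parameter.
# Plain dicts of float lists stand in for model state dicts (an assumption made
# for illustration; the real script operates on Hugging Face checkpoints).

def make_diff(raw, tuned):
    """Return the per-parameter difference tuned - raw."""
    return {name: [t - r for t, r in zip(tuned[name], raw[name])] for name in tuned}


def recover(raw, diff):
    """Rebuild the tuned weights by adding the diff back onto the raw weights."""
    if raw.keys() != diff.keys():
        raise ValueError("raw and diff must contain the same parameter names")
    return {name: [r + d for r, d in zip(raw[name], diff[name])] for name in diff}


raw = {"layer.weight": [0.5, -1.0, 2.0], "layer.bias": [0.0, 0.25]}
tuned = {"layer.weight": [0.75, -1.5, 2.0], "layer.bias": [0.5, 0.25]}

diff = make_diff(raw, tuned)      # what would be distributed
recovered = recover(raw, diff)    # what the user reconstructs locally
print(recovered == tuned)
```

The point of the design is that the diff alone is not usable without the original LLaMa weights, which each user must obtain separately.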
We provide scripts for evaluating Huggingface/OpenAI models on a list of standard benchmarks targeting the core capabilities of large language models. These benchmarks include:
- MMLU
- Grade School Math (GSM)
- MATH
- Big-Bench Hard (BBH)
- TydiQA
- Codex HumanEval
- HumanEval+ and MBPP+
- IFEval
- ToxiGen
- XSTest
- TruthfulQA
- AlpacaEval 1 and 2
We are working on including more promising benchmarks into this list. Please stay tuned!
You can use the following script to download all the evaluation data:
./scripts/data/prepare_eval_data.sh
Evaluation scripts for different datasets are located under ./scripts. For example, you can use the following command to run the MMLU evaluation script:
./scripts/eval/mmlu.sh
We release our human evaluation interface and collected annotations in the ./human_eval folder. Please see the corresponding README for more details.
We include a collection of representative instruction datasets in our exploration and are adding new ones to our list. We unify them into the same chat format. To download and prepare these datasets, simply run the following command:
./scripts/data/prepare_train_data.sh
Please check these datasets for licenses and restrictions around their use!
You can also find the processed Tulu v1 and Tulu v2 SFT datasets on HuggingFace. Note that the train data preparation script will not precisely recreate the Tulu v2 mixture due to randomness in the generation and shifts in data availability (see this PR for some details). If you need exactly the training data used, use the HuggingFace mixture, which is the exact data used during model training.
Generally, most huggingface-compatible causal language models should work fine with our codebase, potentially with some adjustments for different tokenizers etc. Some models may require additional requests to download. E.g., for LLaMa 1 and 2, please consult the Hugging Face documentation for requesting access and converting them to a huggingface-compatible format.
You can use the following command to run instruction tuning (finetuning a pretrained model to follow instructions):
./scripts/finetune_with_accelerate.sh
Make sure to adjust model_name_or_path, tokenizer_name, train_file, and output_dir to your models / data / setting. By default, this uses deepspeed with accelerate.
Note: If you are looking to replicate the released Tulu 2 models, it may be useful to swap the loss calculation to --reduce_loss sum. This uses a sum reduction instead of a mean reduction for loss calculations, and means we weight all tokens evenly when training, better mimicking the larger batch sizes used to train Tulu 2 models. See huggingface/transformers#24725 for more discussion and details. Generally, you may get better results using the sum reduction if you need to use a lot of gradient accumulation (including for training Llama 3 models).
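To see why the reduction matters under gradient accumulation, the sketch below computes the relative weight each token receives in the accumulated gradient. It assumes each micro-batch loss is reduced on its own and the micro-batch losses are then averaged across accumulation steps; the function names and micro-batch sizes are hypothetical, and constant scale factors (which can be absorbed into the learning rate) are ignored.

```python
# Relative per-token weight under gradient accumulation (illustrative only).
# `microbatch_sizes` holds the number of loss-bearing tokens in each micro-batch.

def token_weights_mean(microbatch_sizes):
    """Mean reduction per micro-batch, then an average over micro-batches."""
    n = len(microbatch_sizes)
    return [1.0 / (n * size) for size in microbatch_sizes]


def token_weights_sum(microbatch_sizes):
    """Sum reduction: every token contributes equally, whatever its micro-batch."""
    return [1.0 for _ in microbatch_sizes]


sizes = [2, 8]  # a short and a long micro-batch
mean_weights = token_weights_mean(sizes)
sum_weights = token_weights_sum(sizes)

print(mean_weights)  # [0.25, 0.0625]
print(sum_weights)   # [1.0, 1.0]
```

With the mean reduction, a token in the short micro-batch counts four times as much as a token in the long one; with the sum reduction, all tokens count equally, which is the behaviour a single large batch would give.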
We support LoRA finetuning, wherein only a small number of parameters are updated, resulting in faster and cheaper training. For even more efficiency, we also support QLoRA finetuning, wherein the non-trained (underlying) model parameters are quantised to 4 bits during training. This means you can train a 70b Llama model on a single 80GB A100! Please refer to the respective papers for more details.
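A back-of-the-envelope estimate shows why 4-bit quantisation makes a 70b model fit on one 80GB GPU. The sketch below counts only the memory for the frozen base weights, assuming a nominal 70 billion parameters; it deliberately ignores quantisation constants, adapter weights, optimizer state, and activations, so real usage will be higher. The helper name is hypothetical.

```python
# Rough memory needed to hold the base model weights alone (illustrative only).

def weight_memory_gb(num_params, bits_per_param):
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70_000_000_000  # nominal parameter count, an assumption

fp16_gb = weight_memory_gb(params_70b, 16)
int4_gb = weight_memory_gb(params_70b, 4)

print(fp16_gb)  # 140.0
print(int4_gb)  # 35.0
```

At 16 bits the weights alone exceed 80GB, while at 4 bits they leave headroom for the adapters and activations.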
Please also note that you cannot currently run QLoRA with model parallelism. Only data-parallel training is supported, so you cannot train a model that does not fit on one GPU. For LoRA, you can use deepspeed + zero-3 to achieve model parallelism (FSDP is not currently supported).
Please see ./scripts/finetune_lora_with_accelerate.sh and ./scripts/finetune_qlora_with_accelerate.sh for example hyperparameters. We found a larger rank (e.g. 256) and higher learning rate (e.g. 2e-4) worked best. Additionally, we found that QLoRA consistently achieved similar results to LoRA, while LoRA itself sometimes fell behind full-finetuning, especially in long, complex generation tasks. However, for most purposes, LoRA training essentially matches full-finetuning performance. We recommend merging modules learnt with QLoRA into a dequantised model (run our merge script with the --qlora flag).
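For intuition about rank and merging, the sketch below counts the trainable parameters a LoRA adapter adds to one weight matrix and then folds a toy adapter back into its base weight as W + B·A. The 4096×4096 layer size, the function names, and the scale argument are assumptions for illustration; this is not the repository's merge script.

```python
# LoRA on a single linear layer: W' = W + scale * (B @ A), with B of shape
# (d_out, r) and A of shape (r, d_in). Nested lists stand in for tensors.

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one adapter pair (A and B)."""
    return rank * (d_in + d_out)


def merge_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) as a new nested list."""
    rows, cols, rank = len(weight), len(weight[0]), len(lora_a)
    return [
        [
            weight[i][j] + scale * sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


full = 4096 * 4096                                  # hypothetical layer size
adapter = lora_param_count(4096, 4096, rank=256)
print(adapter, full, adapter / full)                # 2097152 16777216 0.125

weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # shape (2, 1)
lora_a = [[0.5, -0.5]]     # shape (1, 2)
merged = merge_lora(weight, lora_b, lora_a)
print(merged)                                       # [[1.5, -0.5], [1.0, 0.0]]
```

Even at rank 256, the adapter for this hypothetical layer holds one eighth of the layer's parameters, and after merging no extra parameters remain at inference time.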
For an example of how to fully finetune a model with DPO, see scripts/dpo_train_with_accelerate.sh. Note that you will need at least eight 80GB A100s to train a 7b model, and more compute for anything larger. We have not tested multi-node training with this script, but it should work.
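For readers unfamiliar with the objective, the sketch below computes the standard DPO loss for a single preference pair from summed sequence log-probabilities. It is a minimal illustration of the published formula, not the code in the training script; the function name, the example log-probabilities, and the beta value of 0.1 are assumptions.

```python
import math

# DPO loss for one (chosen, rejected) pair:
#   loss = -log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r)))
# where each term is the log-probability of a full response under the policy
# being trained (pi) or the frozen reference model (ref).

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy prefers the chosen response more strongly than the reference does.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(baseline, improved)
```

Because the loss needs log-probabilities from both the policy and a reference model, full DPO finetuning holds two copies of the model, which helps explain the higher hardware requirement noted above.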
Our script also supports PEFT training with QLoRA. See scripts/dpo_train_with_qlora.sh for an example. We have not trained models with this, so it may require additional hyperparameter tuning to achieve reasonable results.