Code for "Vocabulary Expansion of Chat Models with Unlabeled Target Language Data"

gucci-j/chat-cve

Vocabulary Expansion of Chat Models with Unlabeled Target Language Data

This repository contains the code for the paper "Vocabulary Expansion of Chat Models with Unlabeled Target Language Data".

[Figure: motivation]

Requirements

The required packages are listed in requirements.txt. PyTorch v2 or higher is also required.

If you are using the conda package manager, you can create a new environment with the required packages by running:

# Create a new env for training and evaluation
conda create --name dec2024 python=3.12
conda activate dec2024
conda install conda-forge::pytorch
mkdir -m 700 src
cd src && git clone https://github.com/huggingface/transformers.git
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken lighteval

# Create a new env for IFEval evaluation
conda create --name dec2024_eval python=3.12
conda activate dec2024_eval
conda install conda-forge::pytorch
# Assumes the current directory is src/ (the parent of the transformers checkout cloned above)
cd transformers
pip3 install -e .
pip3 install peft datasets evaluate bitsandbytes scikit-learn sentencepiece huggingface-hub tqdm pyarrow protobuf tiktoken
cd ..
git clone git@github.com:huggingface/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout main
pip3 install -e .
pip3 install langdetect immutabledict nltk
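After either environment is set up, a quick sanity check can confirm that the key packages are installed and that PyTorch meets the v2 requirement. The script below is a minimal sketch and is not part of this repository: the helper names `major_version` and `check_environment` are hypothetical, and it assumes PyTorch is registered under the distribution name `torch`, which may differ for some installation methods.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '2.5.1+cu121'."""
    return int(version.split(".")[0])


def check_environment(packages, min_torch_major=2):
    """Look up installed versions and check the PyTorch major version.

    Returns a dict mapping each package name to its installed version
    (None if it is not installed) and a bool that is True only when
    'torch' is installed with a major version >= min_torch_major.
    """
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    torch_version = found.get("torch")
    torch_ok = (
        torch_version is not None
        and major_version(torch_version) >= min_torch_major
    )
    return found, torch_ok


if __name__ == "__main__":
    # Assumed distribution names; adjust the list to match your environment.
    names = ["torch", "transformers", "peft", "datasets", "evaluate"]
    versions, torch_ok = check_environment(names)
    for name, version in versions.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
    print("PyTorch v2 or higher:", "yes" if torch_ok else "no")
```

Run it inside the activated conda environment (for example, `conda activate dec2024` first) so that the lookup reflects the packages installed there.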

Reproducing the results

1. Preprocessing

Please visit the preprocessing directory.

2. Initializing the model

Please visit the instantiation directory.

3. Training the model

Please visit the training directory.

4. Model merging

Please visit the merging directory.

5. Evaluation

Please visit the evaluation directory.

Models

The models will be available soon on the Hugging Face model hub.

Citation

If you use this code or the models in your research, please cite the following paper:

@misc{yamaguchi2024vocabularyexpansionchatmodels,
      title={Vocabulary Expansion of Chat Models with Unlabeled Target Language Data}, 
      author={Atsuki Yamaguchi and Terufumi Morishita and Aline Villavicencio and Nikolaos Aletras},
      year={2024},
      eprint={2412.11704},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.11704}, 
}

License

This code is licensed under the MIT License unless otherwise stated in the file.
