This is the repository for the pipeline of the One For All (OFA) framework, which aims to find a good initialization of the subword embeddings when adapting a monolingual or multilingual PLM to many new languages. The framework optionally applies matrix factorization to the original PLM subword embeddings and replaces the embedding matrix with two smaller matrices, which largely reduces the number of embedding parameters (see the sketch below). The OFA framework therefore enables efficient large-scale multilingual continued pretraining, which is especially helpful under a limited computation budget. Some of the code is based on Glot500, WECHSEL and FOCUS.
Paper on arXiv: https://arxiv.org/abs/2311.08849
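As a back-of-the-envelope illustration of the parameter saving from this factorization (the vocabulary size and dimensions below are illustrative assumptions, not the exact numbers used in OFA):
import numpy as np

# Illustrative sizes: a large multilingual vocabulary with 768-dimensional embeddings,
# factorized through a 200-dimensional latent space of "primitive" embeddings.
vocab_size, hidden_dim, latent_dim = 400_000, 768, 200

full_params = vocab_size * hidden_dim                                   # 307,200,000
factorized_params = vocab_size * latent_dim + latent_dim * hidden_dim   # 80,153,600
print(f"full: {full_params:,}  factorized: {factorized_params:,}")

# The product of the two smaller matrices plays the role of the original
# embedding table (toy shapes here, just to show the dimensions involved).
coordinates = np.random.randn(50, latent_dim)    # one row per subword (toy vocab of 50)
primitives = np.random.randn(latent_dim, hidden_dim)
print((coordinates @ primitives).shape)          # (50, 768)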
.
├── README.md
├── evaluation
│ ├── retrieval
│ │ ├── bible_lang_list.txt
│ │ ├── evaluate_retrieval_bible.py
│ │ ├── evaluate_retrieval_bible_roberta.sh
│ │ ├── evaluate_retrieval_bible_xlm.sh
│ │ ├── evaluate_retrieval_tatoeba.py
│ │ ├── evaluate_retrieval_tatoeba_roberta.sh
│ │ ├── evaluate_retrieval_tatoeba_xlm.sh
│ │ └── tatoeba_lang_list.txt
│ ├── tagging
│ │ ├── evaluate_ner.py
│ │ ├── evaluate_ner.sh
│ │ ├── evaluate_ner_xlmr.sh
│ │ ├── evaluate_pos.py
│ │ ├── evaluate_pos.sh
│ │ ├── evaluate_pos_xlmr.sh
│ │ ├── ner_lang_list.txt
│ │ ├── pos_lang_list.txt
│ │ ├── run_tag.py
│ │ └── utils_tag.py
│ └── taxi1500
│ ├── evaluate.py
│ ├── evaluate.sh
│ ├── evaluate_xlmr.sh
│ └── texi1500_lang_list.txt
├── model_loader_extra.py
├── modeling_roberta_extra.py
├── modeling_xlmr_extra.py
├── ofa
│ ├── __init__.py
│ ├── ofa.py
│ ├── random_init.py
│ ├── run_ofa.bash
│ └── utils.py
├── requirements.txt
├── run_extra.py
├── train_bash_roberta.sh
└── train_bash_xlm_roberta.sh
Initializing the subword embeddings using the OFA framework:
cd ofa
bash run_ofa.bash
This will create embedding matrices for the subwords in the target tokenizer under four different dimensions: [100, 200, 400, 768]. The embedding initialization is based on the vocabularies of the source and target tokenizers, the embedding layer of the source model, and external multilingual word embeddings. The multilingual word embeddings used in OFA can be downloaded here.
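Conceptually, the initialization works roughly as follows: subwords shared with the source vocabulary keep their source embeddings, while unseen subwords are initialized from similar source subwords, with similarity measured in the external multilingual embedding space. The sketch below only illustrates this idea; it is not the actual implementation in ofa/ofa.py, the names (init_target_embeddings, ext_emb, top_k) are hypothetical, and it assumes every subword is covered by the external embeddings, which the real pipeline does not require.
import numpy as np

def init_target_embeddings(source_vocab, target_vocab, source_emb, ext_emb, top_k=10):
    """Conceptual sketch: copy embeddings of shared subwords; initialize unseen
    subwords as a similarity-weighted average of source embeddings, with
    similarities taken from an external multilingual embedding space."""
    target_emb = np.zeros((len(target_vocab), source_emb.shape[1]), dtype=source_emb.dtype)
    src_index = {tok: i for i, tok in enumerate(source_vocab)}

    # external vectors of all source subwords, L2-normalized for cosine similarity
    src_ext = np.stack([ext_emb[tok] for tok in source_vocab])
    src_ext /= np.linalg.norm(src_ext, axis=1, keepdims=True)

    for j, tok in enumerate(target_vocab):
        if tok in src_index:                        # shared subword: copy directly
            target_emb[j] = source_emb[src_index[tok]]
            continue
        query = ext_emb[tok] / np.linalg.norm(ext_emb[tok])
        sims = src_ext @ query                      # cosine similarity to every source subword
        nearest = np.argsort(-sims)[:top_k]         # keep the top_k most similar source subwords
        weights = np.exp(sims[nearest])
        weights /= weights.sum()
        target_emb[j] = weights @ source_emb[nearest]   # weighted average of their embeddings
    return target_emb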
To randomly initialize the unseen subword embeddings, run the following code:
cd ofa
python random_init.py
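As a rough picture only (not necessarily what ofa/random_init.py does; the function name and the Gaussian heuristic are assumptions), random initialization can be thought of as keeping the source vectors for shared subwords and drawing the unseen ones from a normal distribution fitted to the source embedding statistics:
import numpy as np

def random_init_target_embeddings(source_vocab, target_vocab, source_emb, seed=42):
    """Sketch: keep source vectors for shared subwords and sample the remaining
    ones from a normal distribution matched to the source embedding statistics."""
    rng = np.random.default_rng(seed)
    mean, std = source_emb.mean(axis=0), source_emb.std(axis=0)
    src_index = {tok: i for i, tok in enumerate(source_vocab)}

    target_emb = rng.normal(mean, std, size=(len(target_vocab), source_emb.shape[1]))
    for j, tok in enumerate(target_vocab):
        if tok in src_index:
            target_emb[j] = source_emb[src_index[tok]]
    return target_emb.astype(source_emb.dtype)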
We use the Glot500-c corpus, which covers more than 500 languages, for the continued pretraining of our models.
For continued pretraining of the model initialized with OFA (RoBERTa as the source model, i.e., a monolingual source), run:
bash train_bash_roberta.sh
For continued pretraining of the model initialized with OFA (XLM-R as the source model, i.e., a multilingual source), run:
bash train_bash_xlm_roberta.sh
You can edit the .sh files to set --num_primitive to the latent embedding dimension you want to use (one of [100, 200, 400, 768]). Set --use_initialization to True, and set --random_initialization to False if you use the OFA framework for initialization or to True if you use random initialization. For example, to train with 200-dimensional latent embeddings initialized by OFA, pass --num_primitive 200 --use_initialization True --random_initialization False.
We release our models on HuggingFace; you can download ofa-multi-100, ofa-multi-200, ofa-multi-400 and ofa-multi-768. Note that the current HuggingFace Transformers library does not support the model architecture of these models, except for ofa-multi-768.
To use ofa-multi-768, you can do something like the following, since its architecture is XLMRobertaForMaskedLM, which HuggingFace supports:
>>> from transformers import pipeline
>>> MODEL_PATH = 'your_saved_model_path'
>>> mask_filler = pipeline('fill-mask', model=MODEL_PATH)
>>> mask_filler("Hello I'm a <mask> model.", top_k=3)
or
import torch
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

MODEL_PATH = 'your_saved_model_path'
model = XLMRobertaForMaskedLM.from_pretrained(MODEL_PATH)
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_PATH)

text = "Hello I'm a <mask> model."
inputs = tokenizer(text, return_tensors="pt")
# position of the <mask> token in the input
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]
# three most likely fillers for the masked position
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
To use models with smaller embedding dimensions, you could do something like the following:
# you have to import the assembled architecture
import torch
from modeling_xlmr_extra import XLMRobertaAssembledForMaskedLM
from transformers import XLMRobertaTokenizer

MODEL_PATH = 'your_saved_model_path'
model = XLMRobertaAssembledForMaskedLM.from_pretrained(MODEL_PATH)
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_PATH)

text = "Hello I'm a <mask> model."
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]
top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()
for token in top_3_tokens:
    print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
Please refer to Glot500 for instructions on downloading the datasets used for evaluation.
For SR-B, first go to evaluation/retrieval.
If you want to evaluate the ofa-mono-xxx models, run:
bash evaluate_retrieval_bible_roberta.sh
If you want to evaluate the ofa-multi-xxx models, run:
bash evaluate_retrieval_bible_xlm.sh
For SR-T, first go to evaluation/retrieval.
If you want to evaluate the ofa-mono-xxx models, run:
bash evaluate_retrieval_tatoeba_roberta.sh
If you want to evaluate the ofa-multi-xxx models, run:
bash evaluate_retrieval_tatoeba_xlm.sh
For Taxi1500, first go to evaluation/taxi1500.
If you want to evaluate the ofa-mono-xxx models, run:
bash evaluate.sh
If you want to evaluate the ofa-multi-xxx models, run:
bash evaluate_xlmr.sh
For NER, first go to evaluation/tagging.
If you want to evaluate the ofa-mono-xxx models, run:
bash evaluate_ner.sh
If you want to evaluate the ofa-multi-xxx models, run:
bash evaluate_ner_xlmr.sh
For POS, first go to evaluation/tagging.
If you want to evaluate the ofa-mono-xxx models, run:
bash evaluate_pos.sh
If you want to evaluate the ofa-multi-xxx models, run:
bash evaluate_pos_xlmr.sh
If you find our code, models, or data useful for your research, please consider citing:
@article{liu2023ofa,
title={OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining},
author={Liu, Yihong and Lin, Peiqin and Wang, Mingyang and Sch{\"u}tze, Hinrich},
journal={arXiv preprint arXiv:2311.08849},
year={2023}
}
or
@inproceedings{imanigooghari-etal-2023-glot500,
title = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
author = {ImaniGooghari, Ayyoob and Lin, Peiqin and Kargaran, Amir Hossein and Severini, Silvia and Jalili Sabet, Masoud and Kassner, Nora and Ma, Chunlan and Schmid, Helmut and Martins, Andr{\'e} and Yvon, Fran{\c{c}}ois and Sch{\"u}tze, Hinrich},
year = 2023,
month = jul,
booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
publisher = {Association for Computational Linguistics},
address = {Toronto, Canada},
pages = {1082--1117},
url = {https://aclanthology.org/2023.acl-long.61}
}
This repository is built on top of transformers, xtreme, Glot500, WECHSEL and FOCUS.