Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" accepted at the EMNLP 2023 main conference.
Paper on arXiv: https://arxiv.org/abs/2305.14481.
We provide the package via `pip install deepfocus`. Alternatively, you can simply copy the `deepfocus` folder and drop it into your project. The necessary dependencies are listed in `requirements.txt` (`pip install -r requirements.txt`).
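After installation, a quick sanity check (a minimal sketch, assuming a standard Python environment) is to import the main entry point used in the example below:

# Verify that deepfocus is importable; FOCUS is the embedding initialization entry point.
from deepfocus import FOCUS

print(FOCUS)  # the import above raises an ImportError if the installation did not succeed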
The following example shows how to use FOCUS to specialize `xlm-roberta-base` on German with a custom, language-specific tokenizer. The code is also available in `focus_example.py`.
from transformers import AutoModelForMaskedLM, AutoTokenizer
from deepfocus import FOCUS

source_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
source_model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

target_tokenizer = AutoTokenizer.from_pretrained(
    "./tokenizers/de/xlmr-unigram-50k"
)
# Example for training a new tokenizer (requires `from datasets import load_dataset`):
# target_tokenizer = source_tokenizer.train_new_from_iterator(
#     load_dataset("cc100", lang="de", split="train")["text"],
#     vocab_size=50_048
# )
# target_tokenizer.save_pretrained("./target_tokenizer_test")

target_embeddings = FOCUS(
    source_embeddings=source_model.get_input_embeddings().weight,
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    target_training_data_path="/path/to/data.txt",
    # fasttext_model_path="/path/to/fasttext.bin",  # or directly provide a path to a token-level fastText model
    # In the paper, we use `target_training_data_path`, but we also implement
    # WECHSEL's word-to-subword mapping if the language has pretrained fastText word embeddings available online.
    # To use it, supply a two-letter `language_identifier` (e.g. "de" for German) instead of `target_training_data_path` and set:
    # auxiliary_embedding_mode="fasttext-wordlevel",
    # language_identifier="de",
)

source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings

# If the model has separate (untied) output embeddings, apply FOCUS to them separately.
if not source_model.config.tie_word_embeddings:
    target_output_embeddings = FOCUS(
        source_embeddings=source_model.get_output_embeddings().weight,
        source_tokenizer=source_tokenizer,
        target_tokenizer=target_tokenizer,
        target_training_data_path="/path/to/data.txt",
        # same argument options as above; fastText models are cached!
    )
    source_model.get_output_embeddings().weight.data = target_output_embeddings

# Continue training the model on the target language with `target_tokenizer`.
# ...
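As noted in the comments above, instead of providing `target_training_data_path` you can use WECHSEL-style word-to-subword mapping based on pretrained word-level fastText embeddings. The following is a minimal sketch of that variant, assuming the same `source_model`, `source_tokenizer`, and `target_tokenizer` as in the example above:

# Variant: WECHSEL-style word-to-subword mapping with pretrained word-level fastText embeddings.
target_embeddings = FOCUS(
    source_embeddings=source_model.get_input_embeddings().weight,
    source_tokenizer=source_tokenizer,
    target_tokenizer=target_tokenizer,
    auxiliary_embedding_mode="fasttext-wordlevel",
    language_identifier="de",  # two-letter identifier of the target language
)
source_model.resize_token_embeddings(len(target_tokenizer))
source_model.get_input_embeddings().weight.data = target_embeddings

Note that this variant only works if pretrained fastText word embeddings are available online for the target language.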
We publish the checkpoints trained with FOCUS on HuggingFace.
In our experiments, full vocabulary replacement coupled with FOCUS outperformed extending XLM-R's original vocabulary, while also resulting in a smaller model and being faster to train.
You can cite FOCUS like this:
@inproceedings{dobler-de-melo-2023-focus,
    title = "{FOCUS}: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models",
    author = "Dobler, Konstantin  and
      de Melo, Gerard",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.829",
    doi = "10.18653/v1/2023.emnlp-main.829",
    pages = "13440--13454",
}
If you use the "WECHSEL-style" word-to-subword mapping, please consider also citing the original WECHSEL work.