This repository contains the source code submitted by LMU Munich to the WMT 2020 Unsupervised MT Shared Task. For a detailed description, check our paper.
Our system ranked first in both translation directions (German→Upper Sorbian and Upper Sorbian→German). This code base is largely built on MASS, XLM, and RE-LM.
The goal of the task was to translate between German and Upper Sorbian, a minority language spoken in Eastern Germany that is closely related to Czech. Our system combines Unsupervised Neural MT and Unsupervised Statistical MT.
- For the Neural MT part, we use MASS. However, instead of pretraining on both German and Sorbian, we pretrain only on German. Upon convergence, we extend the vocabulary of the pretrained model and fine-tune it on Sorbian and German. This follows RE-LM, a competitive method for low-resource unsupervised NMT. We then train for NMT in an unsupervised way (online back-translation).
- For the Statistical MT part, we use Monoses. Specifically, we map fastText embeddings using VecMap with identical word pairs as the seed dictionary (a small sketch of this idea follows below this list). We then back-translate to obtain a pseudo-parallel corpus for both directions and train our NMT system using online BT and a supervised loss on the pseudo-parallel corpus from USMT.
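To illustrate the identical-pairs idea (this is a minimal sketch, not the script used in this repository; the vocabulary file names are hypothetical), the seed dictionary for VecMap simply pairs every word that is spelled identically in both languages:

```python
# Sketch: build a seed dictionary from words spelled identically in De and Hsb.
# "vocab.de" / "vocab.hsb" are hypothetical word lists, one word per line.

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        # take the first column, so frequency counts (if present) are ignored
        return {line.split()[0] for line in f if line.strip()}

de_vocab = load_vocab("vocab.de")
hsb_vocab = load_vocab("vocab.hsb")

# Identically spelled words act as "free" supervision for the mapping.
with open("identical.dict", "w", encoding="utf-8") as out:
    for word in sorted(de_vocab & hsb_vocab):
        out.write(f"{word}\t{word}\n")
```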
Also useful:
- Sampling instead of greedy decoding when generating translations during online BT. See the flags `--sampling_frequency` and `--sample_temperature` in the code; a conceptual sketch follows this list.
- Oversampling the Sorbian corpus and applying BPE-Dropout. We preprocess the data using subword-nmt with the flag `--dropout 0.1`.
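As a rough illustration of what these two flags control (a minimal sketch, not the repository's decoding code; in the actual implementation the sampling decision may be made at a different granularity, e.g. per batch), with probability `sampling_frequency` the next token is drawn from the temperature-scaled distribution instead of being picked greedily:

```python
import random
import torch

def pick_next_token(logits, sampling_frequency=0.5, sample_temperature=0.95):
    """Choose the next token id from a [vocab_size] logits vector.

    With probability `sampling_frequency`, sample from the softmax of the
    temperature-scaled logits; otherwise fall back to greedy argmax.
    """
    if random.random() < sampling_frequency:
        probs = torch.softmax(logits / sample_temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1).item()
    return logits.argmax(dim=-1).item()

# Toy usage with random logits over a vocabulary of 10 tokens.
logits = torch.randn(10)
print(pick_next_token(logits))
```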
[Figure: our proposed pipeline. Right arrows indicate transfer of weights.]

Requirements:
- Python 3.6.9
- NumPy (tested on version 1.15.4)
- PyTorch (tested on version 1.2.0)
- Apex (for fp16 training)
Create Environment (Optional): Ideally, you should create a conda environment for the project.
conda create -n wmt python=3.6.9
conda activate wmt
Install PyTorch 1.2.0 with the desired CUDA version to use the GPU:
conda install pytorch==1.2.0 torchvision -c pytorch
Clone the project:
git clone https://github.com/alexandra-chron/umt-lmu-wmt2020.git
cd umt-lmu-wmt2020
Then install the rest of the requirements:
pip install -r ./requirements.txt
To train with multiple GPUs use:
export NGPU=8; python3 -m torch.distributed.launch --nproc_per_node=$NGPU train.py
You can download all the German Newscrawl data, all the Sorbian monolingual data, and the evaluation/test sets from the WMT official website.
To preprocess your data using BPE tokenization, make sure you have placed it in `./data/de-wmt`. Then run:
./get_data_mass_pretraining.sh --src de
Then, train the De MASS model:
python3 train.py --exp_name de_mass --dump_path './models' --data_path './data/de-wmt' --lgs de --mass_steps de --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 2000 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 200000 --max_epoch 100000 --word_mass '0.5' --min_len 5
Before the fine-tuning step, you need to extend the vocabulary to account for the new Sorbian BPE vocabulary items. Specifically, the embedding layer (and the output layer) of the MASS model needs to be enlarged by the number of new items added to the existing vocabulary. To do that, use the following command:
./get_data_and_preprocess.sh --src de --tgt hsb
In the directory `./data/de-hsb-wmt/`, a file named `vocab.hsb-de-ext-by-$NUMBER` has been created. This number indicates by how many items the initial vocabulary, and consequently the embedding and output layers, needs to be extended to account for the Hsb language. Pass this value to the `--increase_vocab_by` argument so that the fine-tuning step of MASS runs successfully.
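Conceptually, the extension grows the embedding (and tied output) matrix while keeping the pretrained German rows intact; the rows for the new Hsb items are freshly initialized. A minimal PyTorch sketch (the names are illustrative, not the ones used in this code base):

```python
import torch
import torch.nn as nn

def extend_embeddings(old_emb: nn.Embedding, increase_by: int) -> nn.Embedding:
    """Return a larger embedding whose first rows copy the pretrained weights.

    The `increase_by` new rows (for the added Hsb BPE items) are randomly
    initialized; this count is what --increase_vocab_by corresponds to.
    """
    old_vocab, dim = old_emb.weight.shape
    new_emb = nn.Embedding(old_vocab + increase_by, dim)
    with torch.no_grad():
        new_emb.weight[:old_vocab] = old_emb.weight
    return new_emb

# Toy usage: a pretrained 1000-item embedding extended by 200 new items.
pretrained = nn.Embedding(1000, 1024)
extended = extend_embeddings(pretrained, increase_by=200)
print(extended.weight.shape)  # torch.Size([1200, 1024])
```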
Then, fine-tune the model:
python3 train.py --exp_name de_mass_ft_hsb --dump_path './models' --data_path './data/de-hsb-wmt/' --lgs 'de-hsb' --mass_steps 'de,hsb' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 2000 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 50000 --max_epoch 100000 --word_mass '0.5' --min_len 5 --reload_model './models/de_mass/3w8dqrykpd/checkpoint.pth' --increase_vocab_for_lang de --increase_vocab_from_lang hsb --increase_vocab_by $NUMBER
Then, train the UNMT model with online back-translation:
python3 train.py --exp_name 'unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5' --dump_path './models' --data_path './data/de-hsb-wmt/' --lgs 'de-hsb' --bt_steps 'de-hsb-de,hsb-de-hsb' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 1000 --batch_size 32 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 50000 --max_epoch 100000 --eval_bleu true --sample_temperature '0.95' --reload_model './models/de_mass_ft_hsb/8fark50w1p/checkpoint.pth,./models/de_mass_ft_hsb/8fark50w1p/checkpoint.pth' --increase_vocab_for_lang de --increase_vocab_from_lang hsb --sampling_frequency '0.5'
We use Monoses to build a USMT system. Due to the small amount of Sorbian data, the unsupervised embedding mapping used in the off-the-shelf tool does not perform well, so we replace this step with fastText embeddings mapped by VecMap using identical words.
The necessary bilingual word embeddings can be built by running the following script. The data preprocessing steps in 1. and 2. are required for this step:
./build_embeddings.sh --src de --tgt hsb
The output embeddings can be found in `models/SMT/step3` and `models/SMT/step4`. To create pseudo-parallel data, run the remaining steps of Monoses accordingly.
The pseudo-parallel data has to be BPE tokenized, e.g.:
./BPE_split.sh --input <original.de> --output ./data/de-hsb-wmt/train.hsb-de.de --lang de --src de --tgt hsb
./BPE_split.sh --input <back-translation.hsb> --output ./data/de-hsb-wmt/train.hsb-de.hsb --lang hsb --src de --tgt hsb
5. Fine-tune the UNMT model, using both a BT loss on the monolingual data and a supervised loss on the pseudo-parallel data from USMT
Assuming you have created pseudo-parallel data from USMT and placed it in `./data/de-hsb-wmt` in the following form:
- `train.hsb-de.{de,hsb}`: original De monolingual data, Hsb back-translations
- `train.de-hsb.{de,hsb}`: original Hsb monolingual data, De back-translations
As a final step, you need to binarize the back-translated data using:
./preprocess.py ./data/de-hsb-wmt/$VOCAB_FINAL train.hsb-de.{de,hsb}
./preprocess.py ./data/de-hsb-wmt/$VOCAB_FINAL train.de-hsb.{de,hsb}
Check the name of `VOCAB_FINAL` in the `de-hsb-wmt/` directory. It will have the form `vocab.de-hsb-ext-by-$N`, where `$N` depends on the number of extra vocabulary items added.
This data will be used as a pseudo-parallel corpus (`--mt_steps` flag):
python3 train.py --exp_name 'unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir' --dump_path './models' --data_path './data/de-hsb-wmt' --lgs 'de-hsb' --ae_steps 'de,hsb' --bt_steps 'de-hsb-de,hsb-de-hsb' --mt_steps 'de-hsb,hsb-de' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 1000 --batch_size 32 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 50000 --max_epoch 100000 --eval_bleu true --increase_vocab_for_lang de --increase_vocab_from_lang hsb --reload_model './models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5/fsp0smjzgu/checkpoint.pth,./models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5/fsp0smjzgu/checkpoint.pth' --sampling_frequency '0.5' --sample_temperature '0.95' --load_diff_mt_direction_data true
6. Back-translate the monolingual data with the fine-tuned NMT model

It is better to pick a subset of train.de, as it will probably be very large (we downloaded 327M sentences from NewsCrawl). Then link the monolingual files under the names the translation script expects:
ln -s ./de-hsb-wmt/train.de ./temp/train.hsb-de.de
ln -s ./de-hsb-wmt/train.hsb ./temp/train.de-hsb.hsb
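Instead of symlinking the full corpus, you can write a random subset to the same target path. A memory-friendly sketch using reservoir sampling (the 750k size is only suggestive, echoing the experiment names used above):

```python
import random

def reservoir_sample(path, k, seed=0):
    """Return k uniformly sampled lines from a (possibly huge) text file."""
    random.seed(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = random.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

# Hypothetical subset size; write the subset where the symlink above would point.
with open("./temp/train.hsb-de.de", "w", encoding="utf-8") as out:
    out.writelines(reservoir_sample("./de-hsb-wmt/train.de", k=750_000))
```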
Then, translate the De data into Hsb with the NMT model:
python3 translate.py --src_lang de --tgt_lang hsb --model_path ./models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir/tca9s0sr08/checkpoint.pth --exp_name translate_de_hsb_750k --dump_path './models' --output_path ./data/temp/train.hsb-de.hsb --batch_size 64 --input_path ./data/temp/train.hsb-de.de --beam 5
- `train.hsb-de.de` will contain the original De data
- `train.hsb-de.hsb` will contain the back-translated Hsb data
This will be used as a pseudo-parallel corpus in the next step.
Translate in the other direction (Hsb→De) as well:
python3 translate.py --src_lang hsb --tgt_lang de --model_path ./models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir/tca9s0sr08/checkpoint.pth --exp_name translate_hsb_de_750k --dump_path './models' --output_path ./data/temp/train.de-hsb.de --batch_size 64 --input_path ./data/temp/train.de-hsb.hsb --beam 5
Accordingly:
- `train.de-hsb.de` will contain the back-translated De data
- `train.de-hsb.hsb` will contain the original Hsb data
After you store the USMT pseudo-parallel corpus (see step 5) in a different directory so that you do not overwrite it, move the `./data/temp/train.{hsb-de,de-hsb}.{hsb,de}` files into the `./data/de-hsb-wmt` directory in order to use them in step 7.
7. Fine-tune the NMT model on the new pseudo-parallel data

python3 train.py --exp_name 'unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir' --dump_path './models' --data_path './data/de-hsb-wmt/' --lgs 'de-hsb' --ae_steps 'de,hsb' --bt_steps 'de-hsb-de,hsb-de-hsb' --mt_steps 'de-hsb,hsb-de' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 1000 --batch_size 32 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 50000 --max_epoch 100000 --eval_bleu true --increase_vocab_for_lang de --increase_vocab_from_lang hsb --reload_model './models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5/fsp0smjzgu/checkpoint.pth,./models/unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5/fsp0smjzgu/checkpoint.pth' --sampling_frequency '0.5' --sample_temperature '0.95' --load_diff_mt_direction_data true
After you oversample the Hsb corpus, apply BPE-Dropout to it using `apply-bpe` from subword-nmt with the flag `--dropout 0.1`.
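To make the oversampling concrete (a sketch under the assumption that oversampling here means repeating the raw, pre-BPE Hsb corpus; the copy count and file names are illustrative), each repeated copy then receives a different segmentation once BPE with dropout is applied:

```python
# Sketch: repeat the raw (pre-BPE) Hsb corpus N_COPIES times. Afterwards apply
#   subword-nmt apply-bpe -c <hsb codes> --dropout 0.1 < train.hsb.oversampled > train.hsb.bpe
# so that every copy is segmented slightly differently.
N_COPIES = 10  # illustrative; pick it based on the De/Hsb corpus size ratio

with open("train.hsb", encoding="utf-8") as src:
    lines = src.readlines()

with open("train.hsb.oversampled", "w", encoding="utf-8") as out:
    for _ in range(N_COPIES):
        out.writelines(lines)
```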
Then, place it in the directory `./data/de-hsb-wmt-bpe-dropout`, together with the De data, and run the following command:
python3 train.py --exp_name cont_from_best_unmt_bpe_drop --dump_path './models' --data_path './data/de-hsb-wmt-bpe-dropout' --lgs 'de-hsb' --bt_steps 'de-hsb-de,hsb-de-hsb' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout '0.1' --attention_dropout '0.1' --gelu_activation true --tokens_per_batch 1000 --batch_size 32 --optimizer 'adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001' --epoch_size 50000 --max_epoch 100000 --eval_bleu true --increase_vocab_for_lang de --increase_vocab_from_lang hsb --reload_model 'unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir/saa386ltp2/checkpoint.pth,unsup_nmt_de_mass_ft_hsb_ft_nmt_sampling_th-0.95_spl-0.5_ft_smt_both_dir/saa386ltp2/checkpoint.pth' --sampling_frequency '0.5' --sample_temperature '0.95'
If you use our work, please cite:
@InProceedings{chronopoulou-EtAl:2020:WMT,
author = {Chronopoulou, Alexandra and Stojanovski, Dario and Hangya, Viktor and Fraser, Alexander},
title = {{T}he {LMU} {M}unich {S}ystem for the {WMT} 2020 {U}nsupervised {M}achine {T}ranslation {S}hared {T}ask},
booktitle = {Proceedings of the Fifth Conference on Machine Translation},
month = {November},
year = {2020},
address = {Online},
publisher = {Association for Computational Linguistics},
pages = {1082--1089},
abstract = {This paper describes the submission of LMU Munich to the WMT 2020 unsupervised shared task, in two language directions, German↔Upper Sorbian. Our core unsupervised neural machine translation (UNMT) system follows the strategy of Chronopoulou et al. (2020), using a monolingual pretrained language generation model (on German) and fine-tuning it on both German and Upper Sorbian, before initializing a UNMT model, which is trained with online backtranslation. Pseudo-parallel data obtained from an unsupervised statistical machine translation (USMT) system is used to fine-tune the UNMT model. We also apply BPE-Dropout to the low resource (Upper Sorbian) data to obtain a more robust system. We additionally experiment with residual adapters and find them useful in the Upper Sorbian→German direction. We explore sampling during backtranslation and curriculum learning to use SMT translations in a more principled way. Finally, we ensemble our best-performing systems and reach a BLEU score of 32.4 on German→Upper Sorbian and 35.2 on Upper Sorbian→German.},
url = {https://www.aclweb.org/anthology/2020.wmt-1.128}
}