Code for the NAACL 2019 paper "Improving Robustness of Machine Translation with Synthetic Noise"
- pytorch (tested on version 0.3.0.post4)
- docopt
- tqdm
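A minimal setup sketch (assuming pip can still resolve this old PyTorch build for your platform; otherwise install PyTorch 0.3.0 following the instructions at pytorch.org):
pip install torch==0.3.0.post4 docopt tqdm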
- Fetch the data
mkdir data
mkdir work_dir
cd data/
wget http://www.statmt.org/europarl/v7/fr-en.tgz
wget http://www.statmt.org/wmt15/dev-v2.tgz
wget http://www.cs.cmu.edu/~pmichel1/hosting/MTNT.1.0.tar.gz
The training data consists of fr-en parallel data from Europarl. The noisy data (MTNT) is collected from Reddit (https://arxiv.org/abs/1809.00388).
- Untar the files
tar -xvzf fr-en.tgz
tar -xvzf dev-v2.tgz
tar -xvzf MTNT.1.0.tar.gz
- Preparing the data for training and testing involves selecting sentences shorter than 50 tokens, selecting a subset of the data, and removing SGML/HTML tags
cd ../
sed '/<seg/!d' data/dev/newsdiscussdev2015-fren-ref.en.sgm | sed -e 's/\s*<[^>]*>\s*//g' > data/dev/dev.en
sed '/<seg/!d' data/dev/newsdiscussdev2015-fren-src.fr.sgm | sed -e 's/\s*<[^>]*>\s*//g' > data/dev/dev.fr
python prune_sentences.py data/europarl-v7.fr-en.fr data/europarl-v7-50.fr-en.fr data/europarl-v7.fr-en.en data/europarl-v7-50.fr-en.en 50
head -n1000000 data/europarl-v7-50.fr-en.fr > data/train.fr
head -n1000000 data/europarl-v7-50.fr-en.en > data/train.en
cut -f2 -d$'\t' data/MTNT/train/train.fr-en.tsv > data/train.mtnt.fr
cut -f3 -d$'\t' data/MTNT/train/train.fr-en.tsv > data/train.mtnt.en
cut -f2 -d$'\t' data/MTNT/valid/valid.fr-en.tsv > data/dev.mtnt.fr
cut -f3 -d$'\t' data/MTNT/valid/valid.fr-en.tsv > data/dev.mtnt.en
cut -f3 -d$'\t' data/MTNT/test/test.fr-en.tsv > data/test.mtnt.en
cut -f2 -d$'\t' data/MTNT/test/test.fr-en.tsv > data/test.mtnt.fr
- Use SentencePiece (SPM) encoding to create sub-word data. Highly recommended! The commands below assume the SentencePiece models are placed in sp_models/, specifically europarl-v7.fr-en.en.model and europarl-v7.fr-en.fr.model. These models (vocab size 16k) are provided in the repository; if you would rather train your own, see the sketch after these commands.
python encode_spm.py -m sp_models/europarl-v7.fr-en.fr.model -i data/train.fr -o data/train.tok.fr
python encode_spm.py -m sp_models/europarl-v7.fr-en.en.model -i data/train.en -o data/train.tok.en
python encode_spm.py -m sp_models/europarl-v7.fr-en.fr.model -i data/dev/dev.fr -o data/dev/dev.tok.fr
python encode_spm.py -m sp_models/europarl-v7.fr-en.en.model -i data/dev/dev.en -o data/dev/dev.tok.en
python encode_spm.py -m sp_models/europarl-v7.fr-en.fr.model -i data/test.mtnt.fr -o data/test.mtnt.tok.fr
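If you prefer to train your own SentencePiece models instead of using the provided ones, a sketch with the standard spm_train CLI (the 16k vocabulary size matches the provided models; the --model_prefix values are chosen to reproduce the expected file names):
spm_train --input=data/train.fr --model_prefix=sp_models/europarl-v7.fr-en.fr --vocab_size=16000
spm_train --input=data/train.en --model_prefix=sp_models/europarl-v7.fr-en.en --vocab_size=16000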
- Training the model.
python vocab.py --train-src=data/train.tok.fr --train-tgt=data/train.tok.en data/vocab-bpe.bin --freq-cutoff 0
python nmt.py train --train-src="data/train.tok.fr" --train-tgt="data/train.tok.en" --dev-src="data/dev/dev.tok.fr" --dev-tgt="data/dev/dev.tok.en" --vocab="data/vocab-bpe.bin" --save-to="work_dir/" --valid-niter=1000 --batch-size=32 --hidden-size=256 --embed-size=512 --optim=1 --max-epoch=30 --uniform-init=0.1 --dropout=0.3 --lr=0.01 --clip-grad=20 --lr-decay=0.5 --patience=3 --tie-weights=1 --n_layers=2
- Decode on the dev set once the model is trained. Replace the model file name below with that of your trained checkpoint.
python nmt.py decode --beam-size=5 --max-decoding-time-step=100 --embed-size=512 --tie-weights=1 --n_layers=2 --vocab="data/vocab-bpe.bin" "work_dir/model_epoch.t7" "data/dev/dev.tok.fr" "work_dir/decode-fr-en.tok.txt"
python decode_spm.py -m sp_models/europarl-v7.fr-en.en.model -i work_dir/decode-fr-en.tok.txt -o work_dir/decode-fr-en.txt
- Compute the BLEU score using the decoded file decode-fr-en.txt.
perl multi-bleu.perl "data/dev/dev.en" < "work_dir/decode-fr-en.txt"
- Randomly sample 100k parallel examples from train.en and train.fr to create train-100k.en and train-100k.fr. An easy way is to just take the first 100k examples (though this might not be the most representative sample of the entire dataset; a sketch of a truly random sample follows these commands).
head -n100000 data/train.fr > data/train-100k.fr
head -n100000 data/train.en > data/train-100k.en
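For a truly random sample instead, one option is to shuffle the line-aligned pairs jointly (a sketch; it assumes the text contains no tab characters):
paste data/train.fr data/train.en | shuf -n 100000 > data/train-100k.tsv
cut -f1 data/train-100k.tsv > data/train-100k.fr
cut -f2 data/train-100k.tsv > data/train-100k.en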
- Add synthetic noise to this smaller dataset. We used the same 100k subset across all the methods proposed in the paper.
python artificial_noise.py data/train-100k.fr data/train-100k.en data/train.sni.fr data/train.sni.en "0.04,0.007,0.002,0.015"
python encode_spm.py -m sp_models/europarl-v7.fr-en.fr.model -i data/train.sni.fr -o data/train.sni.tok.fr
python encode_spm.py -m sp_models/europarl-v7.fr-en.en.model -i data/train.sni.en -o data/train.sni.tok.en
- Download the TED talks data (used to train the two intermediate models) from http://phontron.com/data/ted_talks.tar.gz
- Use ted_reader.py from the repository https://github.com/neulab/word-embeddings-for-nmt to extract the en-fr parallel corpora.
- Make sure the following files exist in ted_data/: train.fr, train.en, dev.fr, and dev.en (these are obtained in the two steps above).
- Prune out sentences longer than 50 tokens.
python prune_sentences.py ted_data/train.fr ted_data/train-50.fr ted_data/train.en ted_data/train-50.en 50
- Encode the data using the SentencePiece models trained on the TED data (provided in sp_models/).
python encode_spm.py -m sp_models/ted.fr-en.en.model -i ted_data/train-50.en -o ted_data/train.tok.en
python encode_spm.py -m sp_models/ted.fr-en.fr.model -i ted_data/train-50.fr -o ted_data/train.tok.fr
python encode_spm.py -m sp_models/ted.fr-en.en.model -i ted_data/dev.en -o ted_data/dev.tok.en
python encode_spm.py -m sp_models/ted.fr-en.fr.model -i ted_data/dev.fr -o ted_data/dev.tok.fr
python encode_spm.py -m sp_models/ted.fr-en.fr.model -i data/test.mtnt.fr -o ted_data/test.mtnt.tok.fr
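The concatenation in the next step uses train-50.mtnt.tok.fr/.en, which no earlier command creates. Presumably these are the MTNT training sentences pruned to length 50 and encoded with the TED SentencePiece models; a sketch (the intermediate file names are assumptions):
python prune_sentences.py data/train.mtnt.fr data/MTNT/train/train-50.mtnt.fr data/train.mtnt.en data/MTNT/train/train-50.mtnt.en 50
python encode_spm.py -m sp_models/ted.fr-en.fr.model -i data/MTNT/train/train-50.mtnt.fr -o data/MTNT/train/train-50.mtnt.tok.fr
python encode_spm.py -m sp_models/ted.fr-en.en.model -i data/MTNT/train/train-50.mtnt.en -o data/MTNT/train/train-50.mtnt.tok.en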
- Concatenate the TED and MTNT training data.
cat ted_data/train.tok.fr data/MTNT/train/train-50.mtnt.tok.fr > ted_data/train_ted_mtnt.tok.fr
cat ted_data/train.tok.en data/MTNT/train/train-50.mtnt.tok.en > ted_data/train_ted_mtnt.tok.en
- Train models in the forward (fr-en) and backward (en-fr) directions on the concatenated data, using the same commands as for the baseline model (a sketch follows).
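A sketch of the forward run (the vocab file name vocab-ted-fr-en.bin comes from the decode command below; hyperparameters mirror the baseline command, so treat the exact invocation as an assumption):
python vocab.py --train-src=ted_data/train_ted_mtnt.tok.fr --train-tgt=ted_data/train_ted_mtnt.tok.en vocab-ted-fr-en.bin --freq-cutoff 0
python nmt.py train --train-src="ted_data/train_ted_mtnt.tok.fr" --train-tgt="ted_data/train_ted_mtnt.tok.en" --dev-src="ted_data/dev.tok.fr" --dev-tgt="ted_data/dev.tok.en" --vocab="vocab-ted-fr-en.bin" --save-to="work_dir/" --valid-niter=1000 --batch-size=32 --hidden-size=256 --embed-size=512 --optim=1 --max-epoch=30 --uniform-init=0.1 --dropout=0.3 --lr=0.01 --clip-grad=20 --lr-decay=0.5 --patience=3 --tie-weights=1 --n_layers=2
For the backward (en-fr) model, swap the --train-src/--train-tgt and --dev-src/--dev-tgt arguments and build the reverse vocabulary (ted_data/vocab-spe-reverse.bin in the decode commands below) the same way.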
- To get the noisy data using UBT (untagged back-translation), i.e. EP-100k-UBT, decode with the best forward and backward models obtained in the previous step.
python encode_spm.py -m sp_models/ted.fr-en.fr.model -i data/train-100k.fr -o data/train-100k.ted.tok.fr
python encode_spm.py -m sp_models/ted.fr-en.en.model -i data/train-100k.en -o data/train-100k.ted.tok.en
python nmt.py decode --beam-size=5 --max-decoding-time-step=100 --embed-size=512 --tie-weights=1 --n_layers=2 --vocab="vocab-ted-fr-en.bin" "work_dir/model_ted_best_forward.t7" "data/train-100k.ted.tok.fr" "work_dir/ted.decode-fr-en.tok.en"
python nmt.py decode --beam-size=5 --max-decoding-time-step=100 --embed-size=512 --tie-weights=1 --n_layers=2 --vocab="ted_data/vocab-spe-reverse.bin" "work_dir/model_ted_best_backward.t7" "work_dir/ted.decode-fr-en.tok.en" "work_dir/ted.decode-fr-en.tok.fr"
python decode_spm.py -m sp_models/ted.fr-en.en.model -i work_dir/ted.decode-fr-en.tok.en -o work_dir/ted.train.decode-fr-en.en
python decode_spm.py -m sp_models/ted.fr-en.fr.model -i work_dir/ted.decode-fr-en.tok.fr -o work_dir/ted.train.decode-fr-en.fr
- The files ted.train.decode-fr-en.en and ted.train.decode-fr-en.fr are used in finetuning for the UBT method.
- Follow the UBT data-preparation steps above (TED data download, extraction, pruning, and encoding) to get the required data.
- Prior to concatenating the data, prepend the data-source tag to each sentence.
sed 's/^/_TED_ /g' ted_data/train.tok.fr > ted_data/train.tag.tok.fr
sed 's/^/_MTNT_ /g' data/MTNT/train/train-50.mtnt.tok.fr > data/MTNT/train/train-50.mtnt.tag.tok.fr
- Concatenate the above two tagged files; the en side is handled analogously (see the sketch after the next command).
cat ted_data/train.tag.tok.fr data/MTNT/train/train-50.mtnt.tag.tok.fr > ted_data/train_ted_mtnt.tag.tok.fr
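The en side presumably gets the same treatment, since it is the source side for the backward model (the backward decoding step below is fed a tagged en input); a sketch:
sed 's/^/_TED_ /g' ted_data/train.tok.en > ted_data/train.tag.tok.en
sed 's/^/_MTNT_ /g' data/MTNT/train/train-50.mtnt.tok.en > data/MTNT/train/train-50.mtnt.tag.tok.en
cat ted_data/train.tag.tok.en data/MTNT/train/train-50.mtnt.tag.tok.en > ted_data/train_ted_mtnt.tag.tok.en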
- Use the files obtained in the above step to train the forward and backward models.
- To get the noisy data using TBT (tagged back-translation), i.e. EP-100k-TBT, decode with the best models obtained in the previous step, but prepend the noisy-domain tag (_MTNT_) before decoding.
sed 's/^/_MTNT_ /g' data/train-100k.ted.tok.fr > data/train-100k.ted.tag.tok.fr
sed 's/^/_MTNT_ /g' data/train-100k.ted.tok.en > data/train-100k.ted.tag.tok.en
python nmt.py decode --beam-size=5 --max-decoding-time-step=100 --embed-size=512 --tie-weights=1 --n_layers=2 --vocab="vocab-ted-fr-en.bin" "work_dir/model_ted_best_forward.t7" "data/train-100k.ted.tag.tok.fr" "work_dir/ted.decode-fr-en.tag.tok.en"
python nmt.py decode --beam-size=5 --max-decoding-time-step=100 --embed-size=512 --tie-weights=1 --n_layers=2 --vocab="ted_data/vocab-spe-reverse.bin" "work_dir/model_ted_best_backward.t7" "work_dir/ted.decode-fr-en.tag.tok.en" "work_dir/ted.decode-fr-en.tag.tok.fr"
python decode_spm.py -m sp_models/ted.fr-en.en.model -i work_dir/ted.decode-fr-en.tag.tok.en -o work_dir/ted.train.decode-fr-en.en
python decode_spm.py -m sp_models/ted.fr-en.fr.model -i work_dir/ted.decode-fr-en.tag.tok.fr -o work_dir/ted.train.decode-fr-en.fr
- The files ted.train.decode-fr-en.en and ted.train.decode-fr-en.fr are used in finetuning for the TBT method.
Fine-tuning is the same as training a new model on the additional data, but with the weights loaded from a pre-trained model.
python nmt.py train --train-src="data/train.sni.tok.fr" --train-tgt="data/train.sni.tok.en" --dev-src="data/dev/dev.tok.fr" --dev-tgt="data/dev/dev.tok.en" --vocab="data/vocab-bpe.bin" --save-to="work_dir/" --valid-niter=1000 --batch-size=32 --hidden-size=256 --embed-size=512 --optim=1 --max-epoch=30 --uniform-init=0.1 --dropout=0.3 --lr=0.01 --clip-grad=20 --lr-decay=0.5 --patience=3 --tie-weights=1 --n_layers=2 --load-weights-from "work_dir/model_baseline.t7"
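For UBT/TBT, fine-tuning presumably follows the same pattern: re-encode the back-translated pair files with the Europarl SentencePiece models and point --train-src/--train-tgt in the command above at the results (a sketch; the .tok output names are assumptions):
python encode_spm.py -m sp_models/europarl-v7.fr-en.fr.model -i work_dir/ted.train.decode-fr-en.fr -o work_dir/ted.train.decode-fr-en.tok.fr
python encode_spm.py -m sp_models/europarl-v7.fr-en.en.model -i work_dir/ted.train.decode-fr-en.en -o work_dir/ted.train.decode-fr-en.tok.en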
If you use the code, please consider citing the paper using the following BibTeX:
@inproceedings{vaibhav19naacl,
title = {Improving Robustness of Machine Translation with Synthetic Noise},
author = {Vaibhav and Sumeet Singh and Craig Stewart and Graham Neubig},
booktitle = {Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL)},
address = {Minneapolis, USA},
month = {June},
year = {2019}
}