This page includes instructions for reproducing results from the paper Scaling Neural Machine Translation (Ott et al., 2018).
Model | Description | Dataset | Download
---|---|---|---
transformer.wmt14.en-fr | Transformer (Ott et al., 2018) | WMT14 English-French | model: download (.tar.bz2); newstest2014: download (.tar.bz2)
transformer.wmt16.en-de | Transformer (Ott et al., 2018) | WMT16 English-German | model: download (.tar.bz2); newstest2014: download (.tar.bz2)
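If you have fairseq installed, the pre-trained models can also be loaded through PyTorch Hub instead of downloading the archives by hand. A minimal sketch, assuming the sacremoses and subword-nmt tokenizer dependencies are available (the hub name follows the table above):

```python
import torch

# Download and load the pre-trained WMT'16 En-De transformer via PyTorch Hub
en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt16.en-de',
                       tokenizer='moses', bpe='subword_nmt')
en2de.eval()  # disable dropout for inference

print(en2de.translate('Machine learning is great!'))
```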
First download the preprocessed WMT'16 En-De data provided by Google.
Then:
TEXT=wmt16_en_de_bpe32k
mkdir -p $TEXT
tar -xzvf wmt16_en_de.tar.gz -C $TEXT
fairseq-preprocess \
--source-lang en --target-lang de \
--trainpref $TEXT/train.tok.clean.bpe.32000 \
--validpref $TEXT/newstest2013.tok.bpe.32000 \
--testpref $TEXT/newstest2014.tok.bpe.32000 \
--destdir data-bin/wmt16_en_de_bpe32k \
--nwordssrc 32768 --nwordstgt 32768 \
--joined-dictionary \
--workers 20
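The --joined-dictionary flag builds a single vocabulary over both languages, which is what later allows --share-all-embeddings during training. The following is only a conceptual sketch of that idea, not fairseq's implementation; the file names follow from the --trainpref above and the cutoff mirrors --nwordssrc/--nwordstgt:

```python
from collections import Counter

# Conceptual sketch of a joined dictionary: count BPE tokens over BOTH the
# English and German sides, then keep one shared vocabulary of 32,768 types,
# so source and target use the same token IDs.
counts = Counter()
for path in ("wmt16_en_de_bpe32k/train.tok.clean.bpe.32000.en",
             "wmt16_en_de_bpe32k/train.tok.clean.bpe.32000.de"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

joined_vocab = [tok for tok, _ in counts.most_common(32768)]
print(len(joined_vocab), joined_vocab[:10])
```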
fairseq-train \
data-bin/wmt16_en_de_bpe32k \
--arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--dropout 0.3 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--max-tokens 3584 \
--fp16
Note that the --fp16 flag requires CUDA 9.1 or greater and a Volta GPU or newer.
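For reference, the inverse_sqrt scheduler selected above warms the learning rate up linearly from --warmup-init-lr to --lr over --warmup-updates steps and then decays it proportionally to the inverse square root of the update number. A small illustrative sketch of that schedule (not fairseq's own code):

```python
def inverse_sqrt_lr(update, lr=5e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    """Sketch of --lr-scheduler inverse_sqrt with the values used above."""
    if update < warmup_updates:
        # linear warmup from warmup_init_lr up to lr
        return warmup_init_lr + (lr - warmup_init_lr) * update / warmup_updates
    # after warmup: decay proportional to 1/sqrt(update)
    return lr * (warmup_updates ** 0.5) * (update ** -0.5)

for step in (1000, 4000, 16000, 64000):
    print(step, inverse_sqrt_lr(step))
```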
IMPORTANT: You will get better performance by training with big batches and increasing the learning rate. If you want to train the above model with big batches (assuming your machine has 8 GPUs):
- add --update-freq 16 to simulate training on 8x16=128 GPUs (see the sketch after this list)
- increase the learning rate; 0.001 works well for big batches
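The --update-freq option works by accumulating gradients over several mini-batches before each optimizer step, so with --max-tokens 3584 on 8 GPUs and --update-freq 16 each update sees roughly 3584 × 8 × 16 ≈ 459k tokens. A self-contained PyTorch sketch of the accumulation idea, using a dummy model and data rather than fairseq code:

```python
import torch
from torch import nn

# Illustration of gradient accumulation: gradients from 16 mini-batches are
# summed before a single optimizer step, so one GPU behaves like 16.
torch.manual_seed(0)
model = nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.98))
update_freq = 16

optimizer.zero_grad()
for _ in range(update_freq):
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    # scale each loss so the accumulated gradient matches one big batch
    loss = nn.functional.mse_loss(model(x), y) / update_freq
    loss.backward()
optimizer.step()  # one "big batch" update worth 16 mini-batches
```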
Now we can evaluate our trained model.
Note that the original Attention Is All You Need paper used a couple tricks to achieve better BLEU scores. We use these same tricks in the Scaling NMT paper, so it's important to apply them when reproducing our results.
First, use the average_checkpoints.py script to average the last few checkpoints. Averaging the last 5-10 checkpoints is usually good, but you may need to adjust this depending on how long you've trained:
python scripts/average_checkpoints.py \
--inputs /path/to/checkpoints \
--num-epoch-checkpoints 5 \
--output checkpoint.avg5.pt
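Conceptually, checkpoint averaging just takes an element-wise mean of the model parameters across the selected checkpoints. A rough sketch of the idea, assuming fairseq-style checkpoints that store their weights under a 'model' key; for real runs, use the provided script:

```python
import torch

def average_checkpoints(paths):
    """Conceptual sketch of checkpoint averaging: element-wise mean of the
    parameters stored under the 'model' key of each checkpoint."""
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}
```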
Next, generate translations using a beam width of 4 and length penalty of 0.6:
fairseq-generate \
data-bin/wmt16_en_de_bpe32k \
--path checkpoint.avg5.pt \
--beam 4 --lenpen 0.6 --remove-bpe
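For context, --lenpen 0.6 rescores finished beam hypotheses by dividing the summed log-probability by length**0.6, which penalizes long outputs less aggressively than plain per-token averaging. A small sketch of that scoring rule (illustrative only):

```python
import math

def length_normalized_score(log_prob_sum, length, lenpen=0.6):
    """Length-penalized hypothesis score: total log-prob / length**lenpen."""
    return log_prob_sum / math.pow(length, lenpen)

# A longer hypothesis can win even with a lower total log-probability.
print(length_normalized_score(-12.0, 10))  # ~ -3.01
print(length_normalized_score(-13.0, 14))  # ~ -2.67, ranked higher
```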
@inproceedings{ott2018scaling,
title = {Scaling Neural Machine Translation},
author = {Ott, Myle and Edunov, Sergey and Grangier, David and Auli, Michael},
booktitle = {Proceedings of the Third Conference on Machine Translation (WMT)},
year = 2018,
}