This repo provides data and experimental details for the paper "Low Resource Neural Machine Translation: A Benchmark for Five African Languages".
Updates:
- [July 2020] Data and scripts are available (see ./data, ./scripts directories)
- [March 2020] Data, scripts, and pre-trained models will be made available as soon as possible.
...benchmark NMT between English and five African LRL pairs (Swahili, Amharic, Tigrigna, Oromo, Somali [SATOS]). We collected the available resources on the SATOS languages to evaluate the current state of NMT for LRLs. Our evaluation, comparing a baseline single language pair supervised NMT model against semi-supervised learning, transfer-learning, and multilingual modeling, shows significant performance improvements both in the En → LRL and LRL → En directions.
Baseline Supervised NMT
- Benchmarks single language-pair NMT models between En and the SATOS languages.
Semi-Supervised NMT
- Utilizes back-translation that leverages monolingual data to improve the supervised models.
Transfer-Learning NMT
- Utilizes a dynamic transfer-learning approach from a parent multilingual model to initialize single language-pair child models.
Multilingual NMT
- Trains a multilingual model (10 directions), aggregating data from all the pairs.
Additional summaries of each of these approaches can be found in the paper, along with further readings on semi-supervised, transfer-learning, and multilingual NMT.
To install requirements and complete the initial setup, run: ./env-setup.sh
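As a rough idea of what the setup involves, a minimal manual equivalent might look like the sketch below (assuming a pip-based Python environment; the requirements file name is hypothetical):

```bash
# Hypothetical manual setup: create an isolated Python environment
# and install the project dependencies (file name assumed).
python3 -m venv nmt-env
source nmt-env/bin/activate
pip install -r requirements.txt
```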
- Monolingual Data (Wikipedia articles); see the example invocations after this list
./scripts/get-monolingual-data.sh [lang-id]
- Parallel Data (OPUS data from different corpora)
./scripts/get-opus-data.sh [src-lang-id] [tgt-lang-id] ['corpus-1 corpus-2 corpus-n']
- For evaluation (out-of-domain), we use TED Talks data:
./scripts/get-ted-data.sh [src-lang-id] [tgt-lang-id]
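For example, assuming ISO 639-1 language codes (am, om, so, sw, ti, en) and illustrative OPUS corpus names, the three download scripts above can be invoked as:

./scripts/get-monolingual-data.sh am

./scripts/get-opus-data.sh am en 'JW300 Tanzil bible-uedin'

./scripts/get-ted-data.sh am en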
To skip to data processing, download the prepared data.
- The monolingual data provided in this repo includes segments extracted from Wikipedia. However, in the paper we also used monolingual data (specifically for Amharic, Oromo, Somali, and Tigrigna) from the HaBiT corpus. If you would like to access and include this data, please refer to HaBiT, and make sure to cite their work.
Before building the training data, a one-time step is to split the collected data into train, dev, and test portions: ./get-nonoverlap-split.sh
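The idea behind the non-overlapping split is simply that no dev or test segment should also appear in the training portion; a minimal sketch of that filtering step, assuming one sentence per line (file names are hypothetical):

```bash
# Collect all held-out source sentences, then drop any training line
# that matches one of them verbatim (illustrative only; the actual
# script keeps the source and target sides of each pair aligned).
cat dev.src test.src | sort -u > held-out.src
grep -vxF -f held-out.src train.src > train.filtered.src
```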
Build Training Data:
./scripts/build-training-data.sh ['src-tgt tgt-src src2-tgt tgt-src2'] [flag] [exp-dir]
For instance, to train a bidirectional am<>en model with a language flag, build the data as:
./scripts/build-training-data.sh 'am-en en-am' flag 'experiments/am-en'
- If training only a single-pair src-tgt model, set flag=false.
- For model training using data from a specific domain, update the script.
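The language flag follows the usual multilingual-NMT convention of prepending a target-language token to every source sentence, so a single model can serve multiple directions; a minimal sketch of that tagging step, assuming plain-text files with one sentence per line (the token format and file names are assumptions):

```bash
# Mark each Amharic source line as 'translate to English', and each
# English source line as 'translate to Amharic' (illustrative only).
sed 's/^/<2en> /' train.am-en.am > train.am-en.tagged.am
sed 's/^/<2am> /' train.en-am.en > train.en-am.tagged.en
```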
Preprocess Data:
./scripts/preprocess.sh [exp-dir]
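Preprocessing typically covers tokenization and subword segmentation of the built data; a minimal sketch of such a step, assuming SentencePiece (the vocabulary size, model prefix, and file names are hypothetical):

```bash
# Learn a shared subword model on the training data and apply it
# (illustrative; the repo's preprocessing is driven by the script above).
spm_train --input=train.am,train.en --model_prefix=spm.am-en --vocab_size=8000
spm_encode --model=spm.am-en.model < train.am > train.sp.am
spm_encode --model=spm.am-en.model < train.en > train.sp.en
```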
Train Model:
./train.sh [exp-dir] [exp-id] [gpu/device-id]
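For example, to train the bidirectional am<>en model prepared above on GPU 0 (the experiment id is just an arbitrary label):

./train.sh experiments/am-en am-en-baseline 0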
To train a multilingual model, simply change the number of provided pairs in the Build Training Data step, then follow the same training steps as in the baseline. For further details on training a transfer-learning model, see the dynamic transfer-learning repo.
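For instance, a multilingual run covering all ten En<>SATOS directions could build its data as follows (language codes as above; the experiment directory name is illustrative):

./scripts/build-training-data.sh 'am-en en-am om-en en-om so-en en-so sw-en en-sw ti-en en-ti' flag 'experiments/multilingual'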
Translate:
./translate.sh [exp-dir] [exp-id] [src-tgt tgt-src ...] [flag] [gpu/device-id]
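For example, to translate the am<>en test sets in both directions with the model trained above on GPU 0:

./translate.sh experiments/am-en am-en-baseline 'am-en en-am' flag 0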
Citation:
@article{lakew2020low,
title={Low Resource Neural Machine Translation: A Benchmark for Five African Languages},
author={Lakew, Surafel M and Negri, Matteo and Turchi, Marco},
journal={arXiv preprint arXiv:2003.14402},
year={2020}
}
- If you are working on one of the five languages, or on low-resource languages in general, and you have a question, want to start a discussion, or are looking for a collaboration, don't hesitate to reach out.