The English-Turkish subset of the WMT16 translation task: a set of manually translated sentence pairs drawn from news articles.
The training split of this dataset is based on SETIMES2 from OPUS (Tiedemann, 2009). The development set is released by WMT16, and the test set was donated by Yandex. The content of the dataset is mostly sampled from news articles.
The dataset is provided in JSON Lines format with UTF-8 encoding. Each line is a JSON object whose `translation` field holds the `en` and `tr` sentences:
{"translation": {"en": "On paper at least, it looks like a great idea.", "tr": "En azından kağıt üzerinde, harika bir fikir gibi görünüyor."}}
{"translation": {"en": "Esat Berisha is one such example.", "tr": "Esat Berişa buna bir örnek."}}
..
..
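As an illustration of the record layout shown above, here is a minimal Python sketch that parses JSON Lines records into (English, Turkish) pairs. The helper name `read_pairs` is hypothetical and not part of the dataset release. The two sample records are copied from the excerpt above; a real split would be read from wherever the file was saved, opened with `encoding="utf-8"`.

```python
import io
import json


def read_pairs(lines):
    """Yield (english, turkish) tuples from an iterable of JSON Lines strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pair = record["translation"]
        yield pair["en"], pair["tr"]


# In-memory stand-in for a file; with a real split, pass the open file object instead.
sample = io.StringIO(
    '{"translation": {"en": "On paper at least, it looks like a great idea.", '
    '"tr": "En azından kağıt üzerinde, harika bir fikir gibi görünüyor."}}\n'
    '{"translation": {"en": "Esat Berisha is one such example.", '
    '"tr": "Esat Berişa buna bir örnek."}}\n'
)

pairs = list(read_pairs(sample))
for en, tr in pairs:
    print(en, "|", tr)
```

Reading line by line keeps memory use flat, which may matter for the training split.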
Statistics are provided for training/validation/test splits in the following table:
Language | #Sentences | #Words
---|---|---
Turkish | 205K / 1K / 3K | 3.6M / 14K / 44K
English | 205K / 1K / 3K | 4.4M / 19K / 58K
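Counts like those in the table could be recomputed per split with a sketch such as the following. The function name `split_stats` is hypothetical, and whitespace splitting is an assumption: the card does not say how words were counted, so the results may not match the reported figures exactly.

```python
import json


def split_stats(lines):
    """Return sentence and word counts per language for one split."""
    stats = {
        "en": {"sentences": 0, "words": 0},
        "tr": {"sentences": 0, "words": 0},
    }
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        pair = json.loads(line)["translation"]
        for lang in ("en", "tr"):
            stats[lang]["sentences"] += 1
            # Assumption: words are whitespace-separated tokens.
            stats[lang]["words"] += len(pair[lang].split())
    return stats


# One record from the sample excerpt; a real run would iterate over a split file.
sample = [
    '{"translation": {"en": "Esat Berisha is one such example.", '
    '"tr": "Esat Berişa buna bir örnek."}}'
]
stats = split_stats(sample)
print(stats)
```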
@InProceedings{bojar-etal-2016-wmt16,
author = {Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Graham, Yvette and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Neveol, Aurelie and Neves, Mariana and Popel, Martin and Post, Matt and Rubino, Raphael and Scarton, Carolina and Specia, Lucia and Turchi, Marco and Verspoor, Karin and Zampieri, Marcos},
title = {Findings of the 2016 Conference on Machine Translation},
booktitle = {Proceedings of the First Conference on Machine Translation},
month = {August},
year = {2016},
address = {Berlin, Germany},
publisher = {Association for Computational Linguistics},
pages = {131--198},
url = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}
Uploaded and documented by Ali Safaya: alisafaya gmail com