The English-Turkish subset of the WMT16 task. A set of manually translated sentence pairs from news articles.

Dataset Curation

The training split of this dataset is based on SETIMES2 from OPUS (Tiedemann, 2009). The development set is released by WMT16, and the test set was donated by Yandex. The content of the dataset is mostly sampled from news articles.

Dataset Format

The dataset is provided in jsonlines format using UTF-8 encoding.

{"translation": {"en": "On paper at least, it looks like a great idea.", "tr": "En azından kağıt üzerinde, harika bir fikir gibi görünüyor."}}
{"translation": {"en": "Esat Berisha is one such example.", "tr": "Esat Berişa buna bir örnek."}}
..
..

Dataset Statistics

Statistics are provided for training/validation/test splits in the following table:

	#Sentences	#Words
Turkish	205K / 1K / 3K	3.6M / 14K / 44K
English	205K / 1K / 3K	4.4M / 19K / 58K

Citation

@InProceedings{bojar-etal-2016-wmt16,
  author    = {Bojar, Ond{r}ej  and  Chatterjee, Rajen  and  Federmann, Christian  and  Graham, Yvette  and  Haddow, Barry  and  Huck, Matthias  and  Jimeno Yepes, Antonio  and  Koehn, Philipp  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Neveol, Aurelie  and  Neves, Mariana  and  Popel, Martin  and  Post, Matt  and  Rubino, Raphael  and  Scarton, Carolina  and  Specia, Lucia  and  Turchi, Marco  and  Verspoor, Karin  and  Zampieri, Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TDD-PC-202112-CC-001.md

TDD-PC-202112-CC-001.md

Dataset Curation

Dataset Format

Dataset Statistics

Citation

Contact

Files

TDD-PC-202112-CC-001.md

Latest commit

History

TDD-PC-202112-CC-001.md

File metadata and controls

Dataset Curation

Dataset Format

Dataset Statistics

Citation

Contact