Skip to content

Latest commit

 

History

History
45 lines (33 loc) · 2.07 KB

TDD-PC-202112-CC-001.md

File metadata and controls

45 lines (33 loc) · 2.07 KB

The English-Turkish subset of the WMT16 task. A set of manually translated sentence pairs from news articles.

Dataset Curation

The training split of this dataset is based on SETIMES2 from OPUS (Tiedemann, 2009). The development set is released by WMT16, and the test set was donated by Yandex. The content of the dataset is mostly sampled from news articles.

Dataset Format

The dataset is provided in jsonlines format using UTF-8 encoding.

{"translation": {"en": "On paper at least, it looks like a great idea.", "tr": "En azından kağıt üzerinde, harika bir fikir gibi görünüyor."}}
{"translation": {"en": "Esat Berisha is one such example.", "tr": "Esat Berişa buna bir örnek."}}
..
..

Dataset Statistics

Statistics are provided for training/validation/test splits in the following table:

#Sentences #Words
Turkish 205K / 1K / 3K 3.6M / 14K / 44K
English 205K / 1K / 3K 4.4M / 19K / 58K

Citation

@InProceedings{bojar-etal-2016-wmt16,
  author    = {Bojar, Ond{r}ej  and  Chatterjee, Rajen  and  Federmann, Christian  and  Graham, Yvette  and  Haddow, Barry  and  Huck, Matthias  and  Jimeno Yepes, Antonio  and  Koehn, Philipp  and  Logacheva, Varvara  and  Monz, Christof  and  Negri, Matteo  and  Neveol, Aurelie  and  Neves, Mariana  and  Popel, Martin  and  Post, Matt  and  Rubino, Raphael  and  Scarton, Carolina  and  Specia, Lucia  and  Turchi, Marco  and  Verspoor, Karin  and  Zampieri, Marcos},
  title     = {Findings of the 2016 Conference on Machine Translation},
  booktitle = {Proceedings of the First Conference on Machine Translation},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {131--198},
  url       = {http://www.aclweb.org/anthology/W/W16/W16-2301}
}

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com