The Semantic Textual Similarity benchmark Turkish (STSb-TR) dataset is a machine-translated version of the English STS benchmark dataset, produced with the Google Cloud Translation API. No human corrections have been made to the translations. The official website for the STSb Turkish dataset can be found at STSb-TR.
The dataset consists of 8628 machine-translated Turkish sentence pairs. In this dataset, each sentence pair was annotated by crowdsourcing and assigned a semantic similarity score. Scores range from 0 (no semantic similarity) to 5 (semantically equivalent).
An example from the STSb-TR dataset is given below:
{'genre' : 'main-captions',
'dataset' : 'MSRvid',
'year' : '2012test',
'sid' : '0062',
'score': 1.6,
'sentence1': 'Bir bebek kaplan bir topla oynuyor.',
'sentence2': 'Bir bebek bir oyuncak bebekle oynuyor.',
}
field | dtype | Explanation |
---|---|---|
genre | string | Genre of the text |
dataset | string | Original name of the dataset |
year | string | Year of the original dataset |
sid | string | Sentence id |
score | float | Semantic similarity score. 0.0 is the lowest and 5.0 is the highest score. |
sentence1 | string | First sentence |
sentence2 | string | Second sentence |
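
As a minimal sketch of how the field table above could be checked in code, the snippet below validates a single record against the listed dtypes and the 0.0 to 5.0 score range. The names `SCHEMA` and `validate_record` are hypothetical helpers introduced for illustration; they are not part of the dataset or of any loader.

```python
# Expected Python type for each field, following the field table above.
# 'score' is a float; every other field is a string.
# SCHEMA and validate_record are hypothetical names used only in this sketch.
SCHEMA = {
    "genre": str,
    "dataset": str,
    "year": str,
    "sid": str,
    "score": float,
    "sentence1": str,
    "sentence2": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    score = record.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 5.0:
        problems.append("score outside [0.0, 5.0]")
    return problems


# The example record from this card, with 'year' stored as a string.
example = {
    "genre": "main-captions",
    "dataset": "MSRvid",
    "year": "2012test",
    "sid": "0062",
    "score": 1.6,
    "sentence1": "Bir bebek kaplan bir topla oynuyor.",
    "sentence2": "Bir bebek bir oyuncak bebekle oynuyor.",
}

print(validate_record(example))  # prints []
```

A check like this is mainly useful after loading the data from a custom file format, where a score might arrive as a string or a field might be dropped.
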
Training | Validation | Test | Total |
---|---|---|---|
5749 | 1500 | 1379 | 8628 |
Semantic textual similarity studies are common in English. These studies are based on datasets that have been annotated by humans. However, annotation is a costly and time-consuming task. With recent advances in machine translation, it has become feasible to reuse such datasets by translating them from one language to another.
The original STSb dataset is a selection of the English datasets used in SemEval STS studies between 2012 and 2017. It includes text from image captions, news headlines, and user forums.
Each sentence pair was annotated by crowdsourcing via Amazon Mechanical Turk. Five scores were collected for each pair and gold scores were obtained by averaging them.
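The averaging step described above can be sketched as follows. The five ratings are hypothetical values chosen for illustration, not actual annotations from the dataset, and `gold_score` is a name introduced here.

```python
from statistics import mean


def gold_score(ratings):
    """Average five crowd ratings (each between 0 and 5) into one gold score."""
    if len(ratings) != 5:
        raise ValueError("expected exactly five ratings per sentence pair")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 0 and 5")
    return mean(ratings)


# Hypothetical ratings from five annotators for one sentence pair.
ratings = [2, 1, 2, 2, 1]
print(gold_score(ratings))  # prints 1.6
```

Because the gold score is a mean of several ratings, it is generally a non-integer value, which is why the `score` field is stored as a float.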
The quality of the translations was tested as follows. We randomly selected 50 sentence pairs (100 sentences), keeping the proportion of each category in the dataset, so 6, 19 and 25 pairs were chosen from the forum, caption and news categories respectively. These sentences were translated by three native Turkish speakers who are fluent in English. We evaluated the system translations against the three reference translations and obtained a BLEU score of 60.21, which indicates that the system translations can be considered very high quality. Therefore, no changes have been made to the translations.
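The proportional selection of the 50 evaluation pairs can be illustrated with a short sketch. The per-category pair counts below are not stated in this card; they are assumed from the English STS benchmark documentation and should be verified before being relied on. The function name `allocate` is hypothetical, and simple rounding is used here, which happens to sum to the sample size for these inputs but is not guaranteed to in general.

```python
def allocate(category_counts, sample_size):
    """Split sample_size across categories in proportion to their counts."""
    total = sum(category_counts.values())
    return {
        name: round(sample_size * count / total)
        for name, count in category_counts.items()
    }


# ASSUMPTION: category sizes taken from the English STS benchmark,
# not from this dataset card. They add up to the 8628 pairs reported here.
category_counts = {"forum": 1079, "caption": 3250, "news": 4299}

print(allocate(category_counts, 50))
# prints {'forum': 6, 'caption': 19, 'news': 25}
```

Under these assumed counts the result matches the 6, 19 and 25 pairs reported above.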
This version of the dataset was retrieved from the original GitHub repository on 2022.04.30, at commit id 5546780.
“Published by Figen Beken Fikri, Kemal Oflazer, Berrin Yanıkoğlu.”
Please cite the following paper if you find this dataset useful:
Figen Beken Fikri, Kemal Oflazer, Berrin Yanıkoğlu. Semantic Similarity Based Evaluation for Abstractive News Summarization. In Proceedings of the ACL-IJCNLP Workshop GEM: Natural Language Generation, Evaluation, and Metrics, August 6, 2021.
@inproceedings{beken-fikri-etal-2021-semantic,
title = "Semantic Similarity Based Evaluation for Abstractive News Summarization",
author = "Beken Fikri, Figen and Oflazer, Kemal and Yanikoglu, Berrin",
booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.gem-1.3",
doi = "10.18653/v1/2021.gem-1.3",
pages = "24--33",
abstract = "ROUGE is a widely used evaluation metric in text summarization. However, it is not suitable for the evaluation of abstractive summarization systems as it relies on lexical overlap between the gold standard and the generated summaries. This limitation becomes more apparent for agglutinative languages with very large vocabularies and high type/token ratios. In this paper, we present semantic similarity models for Turkish and apply them as evaluation metrics for an abstractive summarization task. To achieve this, we translated the English STSb dataset into Turkish and presented the first semantic textual similarity dataset for Turkish. We showed that our best similarity models have better alignment with average human judgments compared to ROUGE in both Pearson and Spearman correlations.",
}
Documented by fbekenfikri sabanciuniv edu
and reviewed/uploaded by ardagoktogan gmail com