Semantic Textual Similarity benchmark Turkish (STSb-TR)

The Semantic Textual Similarity (STS) benchmark Turkish (STSb-TR) dataset is a machine-translated version of the English STS benchmark dataset, produced with the Google Cloud Translation API. No human corrections have been made to the translations. The official website for the STSb Turkish dataset can be found at STSb-TR.

Dataset Details

The dataset consists of 8628 machine-translated Turkish sentence pairs. In this dataset, each sentence pair was annotated by crowdsourcing and assigned a semantic similarity score. Scores range from 0 (no semantic similarity) to 5 (semantically equivalent).

Samples

An example from the STSb-TR dataset is given below.

{'genre': 'main-captions',
 'dataset': 'MSRvid',
 'year': '2012test',
 'sid': '0062',
 'score': 1.6,
 'sentence1': 'Bir bebek kaplan bir topla oynuyor.',
 'sentence2': 'Bir bebek bir oyuncak bebekle oynuyor.'}

Fields

| Field | dtype | Explanation |
|---|---|---|
| genre | string | Genre of the text |
| dataset | string | Original name of the dataset |
| year | string | Year of the original dataset |
| sid | string | Sentence id |
| score | float | Semantic similarity score. 0.0 is the lowest and 5.0 is the highest score. |
| sentence1 | string | First sentence |
| sentence2 | string | Second sentence |
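
As a rough illustration of how the field list can be used, the sketch below checks one record against the listed names and types. The mapping of `string` to `str` and `float` to `float`, and the helper name `validate_record`, are assumptions made for this example; the dataset itself does not ship such a function.

```python
from typing import Any

# Field names and dtypes as listed in the Fields section.
# Mapping 'string' -> str and 'float' -> float is an assumption about
# how records look once they are loaded into Python.
FIELD_TYPES = {
    "genre": str,
    "dataset": str,
    "year": str,
    "sid": str,
    "score": float,
    "sentence1": str,
    "sentence2": str,
}


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    # Scores are documented to lie between 0.0 and 5.0.
    score: Any = record.get("score")
    if isinstance(score, float) and not 0.0 <= score <= 5.0:
        problems.append("score outside [0.0, 5.0]")
    return problems


# The sample record from the Samples section, with 'year' as a string.
sample = {
    "genre": "main-captions",
    "dataset": "MSRvid",
    "year": "2012test",
    "sid": "0062",
    "score": 1.6,
    "sentence1": "Bir bebek kaplan bir topla oynuyor.",
    "sentence2": "Bir bebek bir oyuncak bebekle oynuyor.",
}

print(validate_record(sample))
```

An empty list means the record matches the documented fields; each string in a non-empty list names one mismatch.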

Splits

| Training | Validation | Test | Total |
|---|---|---|---|
| 5749 | 1500 | 1379 | 8628 |
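
The split sizes can be sanity-checked with a few lines of arithmetic. The snippet below is only a sketch: the dictionary keys and the helper `split_shares` are names chosen for this example, while the numbers come from the table above.

```python
# Split sizes and total as reported in the Splits section.
SPLIT_SIZES = {"training": 5749, "validation": 1500, "test": 1379}
REPORTED_TOTAL = 8628


def split_shares(sizes):
    """Return each split's share of all sentence pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The three splits should add up to the reported total.
assert sum(SPLIT_SIZES.values()) == REPORTED_TOTAL

for name, share in split_shares(SPLIT_SIZES).items():
    print(f"{name}: {share:.1%}")
```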

Dataset Creation

Curation Rationale

Semantic textual similarity studies are common in English. These studies are based on datasets that have been annotated by humans. However, annotation is a costly and time-consuming task. With recent advances in machine translation, it has become feasible to reuse such datasets by translating them from one language to another.

Data Source

The original STSb dataset is a selection of the English datasets used in SemEval STS studies between 2012 and 2017. It includes text from image captions, news headlines, and user forums.

Annotations

Each sentence pair was annotated by crowdsourcing via Amazon Mechanical Turk. Five scores were collected for each pair and gold scores were obtained by averaging them.
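
The averaging step can be sketched as follows. The five individual ratings are hypothetical values invented for illustration (the per-annotator scores are not part of this dataset), and `gold_score` is a name chosen for this example.

```python
from statistics import mean


def gold_score(ratings):
    """Average five crowdsourced ratings into one gold score."""
    if len(ratings) != 5:
        raise ValueError("expected exactly five ratings per pair")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 0 and 5")
    return mean(ratings)


# Hypothetical ratings, NOT taken from the dataset.
hypothetical_ratings = [1, 2, 2, 1, 2]
print(gold_score(hypothetical_ratings))
```

Because gold scores are averages, they are generally fractional even when the individual ratings are whole numbers.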

Quality

The quality of the translations was tested as follows. We randomly selected 50 sentence pairs (100 sentences), preserving the proportion of each category in the dataset: 6, 19, and 25 pairs were chosen from the forum, caption, and news categories, respectively. These sentences were translated by three native Turkish speakers who are fluent in English. We evaluated the system translations against the three reference translations and obtained a BLEU score of 60.21, which indicates that the system translations can be considered very high quality. Therefore, no changes were made to the translations.
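
A per-category draw of this kind could look like the sketch below. Only the quotas (6, 19, 25) come from the text; the category keys, the placeholder pools, the fixed seed, and the helper `stratified_sample` are assumptions for illustration, and the authors' actual sampling code is not described here.

```python
import random

# Per-category quotas as stated in the text (50 pairs in total).
QUOTAS = {"forum": 6, "caption": 19, "news": 25}


def stratified_sample(pairs_by_category, quotas, seed=0):
    """Draw quotas[c] pairs without replacement from each category c."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return {
        category: rng.sample(pairs_by_category[category], k)
        for category, k in quotas.items()
    }


# Placeholder pools standing in for the real sentence pairs.
pools = {c: [f"{c}-pair-{i}" for i in range(100)] for c in QUOTAS}

picked = stratified_sample(pools, QUOTAS)
total = sum(QUOTAS.values())
for category, pairs in picked.items():
    print(f"{category}: {len(pairs)} pairs ({len(pairs) / total:.0%} of sample)")
```

Sampling within each category keeps the evaluation subset's genre mix close to that of the full dataset, which a single uniform draw of 50 pairs would not guarantee.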

Additional Information

Version

This version of the dataset was retrieved from the original GitHub repository on 2022.04.30 (commit id: 5546780).

Dataset Curators

Published by Figen Beken Fikri, Kemal Oflazer, and Berrin Yanıkoğlu.

Citation Information

Please cite the following paper if you find this dataset useful:

Figen Beken Fikri, Kemal Oflazer, Berrin Yanıkoğlu. Semantic Similarity Based Evaluation for Abstractive News Summarization. In Proceedings of ACL-IJCNLP Workshop GEM: Natural Language Generation, Evaluation, and Metrics, August 6, 2021.

@inproceedings{beken-fikri-etal-2021-semantic,
    title = "Semantic Similarity Based Evaluation for Abstractive News Summarization",
    author = "Beken Fikri, Figen  and Oflazer, Kemal and Yanikoglu, Berrin",
    booktitle = "Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.gem-1.3",
    doi = "10.18653/v1/2021.gem-1.3",
    pages = "24--33",
    abstract = "ROUGE is a widely used evaluation metric in text summarization. However, it is not suitable for the evaluation of abstractive summarization systems as it relies on lexical overlap between the gold standard and the generated summaries. This limitation becomes more apparent for agglutinative languages with very large vocabularies and high type/token ratios. In this paper, we present semantic similarity models for Turkish and apply them as evaluation metrics for an abstractive summarization task. To achieve this, we translated the English STSb dataset into Turkish and presented the first semantic textual similarity dataset for Turkish. We showed that our best similarity models have better alignment with average human judgments compared to ROUGE in both Pearson and Spearman correlations.",
}

Contact

Documented by fbekenfikri sabanciuniv edu and reviewed/uploaded by ardagoktogan gmail com