Sentence segmentation is a challenging task, especially for non-editorial texts. This dataset is intended to benchmark the performance of sentence segmentation methods for the Turkish language.

Dataset Curation

This dataset consists of 300 scientific abstracts from (Özturk et al., 2014), 300 curated news articles from trnews-64, and a set of 10K tweets.

For the scientific abstracts, documents were sampled to maximize the number of abbreviations, which reduce the accuracy of rule-based approaches. For the news set, the length of the documents and the number of proper names are maximized. In the Twitter subset, the numbers of multi-sentence and single-sentence tweets are balanced, and the tweets were preprocessed by replacing all URLs with http://some.url and all user mentions with @user. The reference output of the test splits is kept private in order to preserve the benchmark quality.
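The tweet preprocessing described above can be sketched as follows. This is a minimal illustration, not the script used to build the dataset: the two regular expressions and the function name `normalize_tweet` are assumptions, since the exact matching rules are not given here.

```python
import re

# Assumed patterns: the exact rules used for the dataset are not documented.
URL_RE = re.compile(r"https?://\S+")   # any http(s) URL up to the next whitespace
MENTION_RE = re.compile(r"@\w+")       # "@" followed by word characters


def normalize_tweet(text: str) -> str:
    """Replace URLs with http://some.url and user mentions with @user."""
    text = URL_RE.sub("http://some.url", text)
    text = MENTION_RE.sub("@user", text)
    return text
```

Applying the same normalization to new tweets before segmentation keeps the input consistent with the Twitter subset.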

Applying sentence segmentation to user-generated content such as social media posts or comments can be quite challenging. To simulate such difficult cases and expose the weaknesses of rule-based methods, another version of trseg-41 is created in which the boundaries of sentences are artificially corrupted. This is done by randomly converting sentences to lowercase or uppercase with 50% probability, or by removing all punctuation marks with 50% probability.
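One possible reading of this corruption procedure is sketched below. It assumes that each sentence is case-converted with probability 0.5 (lowercase or uppercase chosen at random) and, independently, stripped of punctuation with probability 0.5. The function names, the per-sentence granularity, and the use of ASCII punctuation only are all assumptions, so the released corrupted set may have been produced differently.

```python
import random
import string

# Assumption: only ASCII punctuation is removed.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def corrupt_sentence(sentence, rng):
    """Apply case conversion and punctuation removal, each with probability 0.5."""
    if rng.random() < 0.5:
        # Python's lower()/upper() are locale-independent, so the Turkish
        # dotted/dotless i pairs are not handled in a Turkish-specific way.
        sentence = sentence.lower() if rng.random() < 0.5 else sentence.upper()
    if rng.random() < 0.5:
        sentence = sentence.translate(PUNCT_TABLE)
    return sentence


def corrupt_document(sentences, seed=0):
    """Corrupt every sentence and rebuild a record in the dataset's format."""
    rng = random.Random(seed)  # seeded for reproducibility
    corrupted = [corrupt_sentence(s, rng) for s in sentences]
    return {"text": " ".join(corrupted), "sentences": corrupted}
```

Because the gold sentence list is corrupted together with the text, the boundaries stay aligned while the surface cues that rule-based segmenters rely on (capitalization and final punctuation) disappear.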

Dataset Format

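We share the dataset in .jsonl format (one JSON object per line) with UTF-8 encoding. Each object has a "text" field holding the raw document and a "sentences" field holding the list of its gold sentences.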

{
    "text": "Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı. Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı. Özdil köşesinde....",
    "sentences": ["Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı.", "Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı.", "Özdil köşesinde \"Özgürlük...."]
}
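A file in this format can be read with the Python standard library alone. The sketch below is illustrative: the helper names `parse_jsonl` and `load_split` are hypothetical and are not part of any released tooling for this dataset.

```python
import json


def parse_jsonl(lines):
    """Yield one dict per non-empty line; each line holds one JSON object."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)


def load_split(path):
    """Read a whole .jsonl file (UTF-8) into a list of records."""
    with open(path, encoding="utf-8") as f:
        return list(parse_jsonl(f))
```

For example, `load_split("news_train.jsonl")` (a hypothetical file name) would return a list of records, each with `record["text"]` as the raw document and `record["sentences"]` as its gold segmentation.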

Dataset Statistics

Subset      #Articles  #Sentences  #Words
News        300        6K          102K
Tweets      10K        28K         242K
Abstracts   300        6K          112K
Total       10.6K      40K         456K

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com