Sentence Segmentation task is a challenging task, especially when it comes to non-editorial texts. This dataset is intended to benchmark the performance of sentence-segmentation methods for the Turkish language.
This dataset consists of 300 scientific abstracts from (Özturk et al., 2014), 300 curated news articles from trnews-64, and a set of 10K tweets.
For the scientific abstracts, the sampling rationale is to maximize the number of abbreviations that reduce the accuracy of the rule-based approaches. As for the news set, the length of documents and the number of proper names is maximized. In the Twitter subset, the number of multi/single sentence tweets is balanced, and the tweets were preprocessed by replacing all URLs with http://some.url
, and all user mentions with @user
. The output of the test splits is kept private in order to preserve the benchmark quality.
Applying sentence segmentation to user-generated content such as social media posts or comments can be quite challenging. To simulate such difficult cases and expose the weaknesses of rule-based methods, another version of trseg-41 is created. Where the boundaries of sentences are artificially corrupted. This is done by randomly converting them to lowercase or uppercase with 50% probability, or by removing all punctuation marks with 50% probability.
We share the dataset using .jsonl
format (one json object per line) with UTF-8 encoding.
{
"text": "Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı. Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı. Özdil köşesinde....",
"sentences": ["Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı.", "Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı.", ", "Özdil köşesinde \"Özgürlük\.... ]
}
#Articles | #Sentences | #Words | |
---|---|---|---|
News | 300 | 6K | 102K |
Tweets | 10K | 28K | 242K |
Abstracts | 300 | 6K | 112K |
Total | 10.6K | 40K | 456K |
Uploaded and documented by Ali Safaya: alisafaya gmail com