Sentence Segmentation task is a challenging task, especially when it comes to non-editorial texts. This dataset is intended to benchmark the performance of sentence-segmentation methods for the Turkish language.

Dataset Curation

This dataset consists of 300 scientific abstracts from (Özturk et al., 2014), 300 curated news articles from trnews-64, and a set of 10K tweets.

For the scientific abstracts, the sampling rationale is to maximize the number of abbreviations that reduce the accuracy of the rule-based approaches. As for the news set, the length of documents and the number of proper names is maximized. In the Twitter subset, the number of multi/single sentence tweets is balanced, and the tweets were preprocessed by replacing all URLs with http://some.url, and all user mentions with @user. The output of the test splits is kept private in order to preserve the benchmark quality.

Applying sentence segmentation to user-generated content such as social media posts or comments can be quite challenging. To simulate such difficult cases and expose the weaknesses of rule-based methods, another version of trseg-41 is created. Where the boundaries of sentences are artificially corrupted. This is done by randomly converting them to lowercase or uppercase with 50% probability, or by removing all punctuation marks with 50% probability.

Dataset format

We share the dataset using .jsonl format (one json object per line) with UTF-8 encoding.

{
    "text": "Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı. Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı. Özdil köşesinde....",
    "sentences": ["Hürriyet gazetesi köşe yazarı Yılmaz Özdil , sanatçı Sezen Aksu'nun kendisi hakkında suç duyurusunda bulunduğunu yazdı.", "Özdil, suç duyurusunun reddedildiğini ve Sezen Aksu'nun şikayet dilekçesinde \"tarihi bir yalanlama\"da bulunduğunu yazdı.", ", "Özdil köşesinde \"Özgürlük\.... ]
}

Dataset Statistics

	#Articles	#Sentences	#Words
News	300	6K	102K
Tweets	10K	28K	242K
Abstracts	300	6K	112K
Total	10.6K	40K	456K

Contact

Uploaded and documented by Ali Safaya: alisafaya gmail com

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TDD-C-202109-CC-004.md

TDD-C-202109-CC-004.md

Dataset Curation

Dataset format

Dataset Statistics

Contact

Files

TDD-C-202109-CC-004.md

Latest commit

History

TDD-C-202109-CC-004.md

File metadata and controls

Dataset Curation

Dataset format

Dataset Statistics

Contact