The dataset is a collection of analogical reasoning task pairs prepared by O. Gungor, and E. Yildiz. It contains 4 sets of pairs with different types of affixes, and word vectors obtained by skip-gram algorithm in their studies. GitHub Repository of the dataset and the paper can be found by given links.
The pairs contains two words with their simple versions followed by affixed states. This dataset contains 4 subsets obtained by different pairings according to their affixes.
File name | Description | Number of pairs |
---|---|---|
noun_inflection_quads.txt | Pairs of nouns with inflections | 15133 |
verb_inflection_quads.txt | Pairs of verbs with inflections | 14231 |
derivation_quads.txt | Pairs of words with derivational affixes | 4836 |
both_inflection_derivation.quads.txt | Pairs of words with both inflectional and derivational affixes | 303 |
Sample pairs that take inflection:
konu konuları yer yerleri
konu konuları gün günleri
konu konuları ülke ülkeleri
konu konuları devlet devletleri
konu konuları iş işleri
konu konuları hak hakları
The dataset is motivated by the desire to advance the word vector representations in the Turkish language to be used in many areas of NLP.
The word vectors are obtained from another dataset created by Yildiz et al. by using skip-gram algorithm. The test pairs are obtained from the dataset by D. Yuret and F. Ture, by using several steps explained with details in the original paper.
Annotation procedure contains 4 steps.
- 60.745 morpholigcal tagging obtained from the morphological disambiguity dataset by D. Yuret and F. Ture.
- Taggings are seperated into 3 groups by words with only inflection, only derivation, and both affixes.
- 25 nouns and 25 words are randomly selected from commonly used Turkish words.
- The analysis of these words was found and was used to obtain superficial words from the morphological analyzes produced as a result of the above steps.
This dataset is part of an effort to advance word representations to be used in Turkish NLP studies. For agglutinative languages such as Turkish, word representations are important to represent morphological and semantical relations between words. Despite the limited studies of this subject in the Turkish language compared to English, this dataset can be used in future researches to advance word representations in the Turkish language.
This documentation is written based on the dataset version with commitment with id 9e14852.
Published by Onur Gungor and Eray Yildiz.
Please cite the following paper if you found this dataset useful:
Onur Gungor, Eray Yildiz, "Linguistic Features in Turkish Word Representations - Türkçe Sözcük Temsillerinde Dilbilimsel Özellikler", 2017 25th Signal Processing and Communications Applications Conference (SIU), Antalya, 2017.
@INPROCEEDINGS{7960223,
author={Güngör, Onur and Yıldız, Eray},
booktitle={2017 25th Signal Processing and Communications Applications Conference (SIU)},
title={Linguistic features in Turkish word representations},
year={2017},
volume={},
number={},
pages={1-4},
doi={10.1109/SIU.2017.7960223}}