(last updated on 04/03/2024)
A curated list of awesome resources, tools and scientific papers for Kurdish language technology
Although I do my best to keep this page comprehensive, it may not cover every project, small or large, on Kurdish language processing. If something is missing, please let me know by email or through our community on Gitter.
Are you interested in contributing to Kurdish language processing? Check out this post to see how you can do so.
- A few datasets are added for automatic speech recognition and Central Kurdish dialect identification and translation
- A few datasets are added for emotion analysis, summarization and news headline classification
- Two projects are released for language identification of Zaza-Gorani and Kurdish languages.
- A benchmark is released for sentiment analysis of Central Kurdish.
- Kurdish Llama (Fine-tuned Llama model for Sorani)
- CORDI (Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad, Erbil, Sardasht and Kalar)
- Open Super-large Crawled ALMAnaCH coRpus (OSCAR) (Sorani and Kurmanji)
- Pewan (Sorani and Kurmanji)
- Kurdish folkloric lyrics corpus (Sorani)
- AsoSoft corpus (Sorani)
- Kurdish Textbooks Corpus (Sorani)
- Zaza-Gorani corpus (Zazaki and Gorani)
- Southern Kurdish and Laki corpora (Southern Kurdish and Laki)
- Kurdish resources on Clarin
- University of Bamberg's corpora (Kurmanji and Laki)
- CORDI (Parallel corpus of Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad and Erbil along with Standard Central Kurdish and English)
- Ataman's Bianet corpus containing Turkish-English-Kurmanji aligned texts
- Ahmadi et al.'s corpus containing English-Kurmanji-Sorani aligned texts
- Tanzil: a Quran translation that can be aligned with translations in many other languages, including 11 in English (see this project)
- Bible translations in Kurmanji-Latin and Kurmanji-Cyrillic
- TED Talks subtitles
- HLP Colloquial Corpus #1 (Sorani and Kurmanji (Latin and Arabic)) (not free)
- A parallel corpus of Sorani-English text
- FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation (Sorani)
- AsoSoft Speech Corpus for Central-Kurdish Text-To-Speech (Sorani)
Check out a comprehensive list of Kurdish dictionaries and beware of copyright issues in the following projects:
- Kurdî Wikibase (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Kurdish lexicographical resources in Ontolex-Lemon (Sorani, Kurmanji, Gorani and Southern Kurdish)
- Check Dolan Hêriş's repositories for a list of Kurdish dictionaries and tools to extract words
- KurdNet-the Kurdish wordNet (Sorani)
- Kurdish annotated lexicon (Sorani)
- Freedict word lists (Sorani and Kurmanji)
- Translation Initiative for COVID-19 including Sorani and Kurmanji
- MyMemory dictionaries with an open-access API (Sorani)
- Manchester Database of Kurdish Dialects
- Dataset of Kurdish poems with meter and form tags
- A Twitter dataset (Sorani and Kurmanji)
- Datasets for text to Kurdish Sign Language (Sorani)
- A dataset for speech recognition (Sorani)
- Universal dependency (Kurmanji)
- Web Inventory of Transcribed and Translated Talks (WIT3) (Sorani)
- Sorani and Kurmanji morphological datasets in UniMorph
- FakeKurdNews, an annotated dataset for Sorani Kurdish fake news detection
- Profanity language (Sorani)
- Cyberbullying dataset (Sorani)
- Summarization dataset (Sorani)
- Sentiment analysis (Sorani)
- Emotion analysis (Sorani)
- News headline classification (Sorani)
- CORDI (Central Kurdish varieties)
- KASET - Kurmanji and Sorani Kurdish Speech and Transcripts
- Whisper model on Central Kurdish
- Kurdish spoken dialect recognition using x-vector speaker embedding (Northern, Central, Southern Kurdish, Hawrami & Zazaki)
- Morphological analysis:
- KurdishHunspell evaluation datasets (Sorani)
- Tokenization:
- KurdishTokenization (Sorani, Kurmanji)
- A sentence-segmented dataset (Sorani)
- Transliteration
- Spelling error correction
- Sentiment Analysis
- Sentiment Analysis (Sorani)
- Unconventional writing normalization
- fastText word vectors (Sorani and Kurmanji)
- Polyglot's word embeddings
- Kurd-Spell
- Wergor for transliteration (Sorani and Kurmanji)
- Kurdish Tokenization
- Jedar stemmer
- Apertium project for Kurmanji and Sorani morphological analysis
- Kurdish Hunspell for Sorani morphological analysis, spell checking, stemming and lemmatization
- A finite-state morphological analyzer for Central Kurdish (Sorani)
- Part-of-speech tagger (Sorani)
- Alexina Framework: morphological analysis and POS-tagger for Sorani (soralex) and Kurmanji (kurlex)
- Kurdspell for Sorani spell checking
- Apertium rule-based Sorani spell-checker
- Gende Stemmer (Sorani)
- Conversion of numbers into words (Sorani and Kurmanji)
- Conversion of words into IPA (Kurmanji)
- Apertium (Sorani and Kurmanji)
- Kurdish MT (Sorani)
- Autoregressive Entity Retrieval (Kurmanji)
- Kurdish Handwritten Words (Sorani)
- Kurdish Language Processing Toolkit: a natural language processing toolkit in Python
- Kurdînûs: pure JavaScript tools for transliteration, text conversion and normalization
- Kurdish Language Library: converting characters and digits in Persian, English and Arabic to Kurdish and vice versa
- AsoSoft's Library for Kurdish: normalizer, numeral converter, grapheme-to-phoneme converter in C#
- CORDI (Central Kurdish varieties of Sulaymaniyah, Sanandaj, Mahabad, Erbil, Sardasht and Kalar)
- Language identification of Kurdish and Zaza-Gorani languages
- Perso-Arabic and KurdishLID projects covering many languages, including Kurmanji, Sorani, Southern Kurdish, Gorani and Zazaki
- Language identifier (Sorani and Kurmanji)
In addition to these, you can find further information in other repositories and pages as follows:
These references are provided based on the data collected in the paper entitled KLPT – Kurdish Language Processing Toolkit. Note that references are provided in the bibliography file.
| Reference | Year | Field | Dialects |
|---|---|---|---|
| esmaili2013sorani | 2013 | Dialectology | Sorani, Kurmanji |
| hassani2016automatic | 2016 | Dialectology | Sorani, Kurmanji |
| malmasi2016subdialectal | 2016 | Dialectology | Sorani |
| al2017kurdish | 2017 | Dialectology | Sorani, Kurmanji, Gorani |
| amani:hal-03262435 | 2021 | Dialectology | Kurdish, Zazaki & Gorani |
| ahmadi2024cordi | 2024 | Dialectology | Sorani varieties |
| mohammed2012automatic | 2012 | Information retrieval and Text mining | Sorani |
| esmaili2012challenges | 2012 | Information retrieval and Text mining | Sorani |
| littell2016named | 2016 | Information retrieval and Text mining | Sorani |
| hassani2017method | 2017 | Information retrieval and Text mining | Sorani, Kurmanji |
| esmaAl-Talabaniili2014towards | 2014 | Information retrieval and Text mining | Sorani, Kurmanji |
| jaf2016simple | 2016 | Information retrieval and Text mining | Sorani |
| rashid2017robust | 2017 | Information retrieval and Text mining | Sorani |
| rashid2017automatic | 2017 | Information retrieval and Text mining | Sorani |
| saeed2018improving | 2018 | Information retrieval and Text mining | Sorani |
| mustafa2018kurdish | 2018 | Information retrieval and Text mining | Sorani |
| saeed2018evaluation | 2018 | Information retrieval and Text mining | Sorani |
| ahmadi2019wergor | 2019 | Information retrieval and Text mining | Sorani |
| mahmudi2021automated | 2021 | Information retrieval and Text mining | Sorani |
| abdulrahman2022lmspell | 2022 | Information retrieval and Text mining | Sorani |
| esmaili2013building | 2013 | Lexical resources | Sorani |
| aliabadi2014towards | 2014 | Lexical resources | Sorani |
| aliabadi2014semi | 2014 | Lexical resources | Sorani |
| ataman2018bianet | 2018 | Lexical resources | Kurmanji |
| ahmadi2019towards | 2019 | Lexical resources | Sorani, Kurmanji, Gorani |
| abdulrahman2019developing | 2019 | Lexical resources | Sorani |
| abdulrahman2020using | 2020 | Lexical resources | Sorani |
| veisi2020toward | 2020 | Lexical resources | Sorani |
| ahmadi2020corpus | 2020 | Lexical resources | Sorani |
| ahmadi-2020-building | 2020 | Lexical resources | Zaza, Gorani |
| veisi2021jira | 2021 | Lexical resources | Sorani |
| azin2021sk | 2021 | Lexical resources | Southern Kurdish |
| hassani2017kurdish | 2017 | Machine Translation | Sorani, Kurmanji |
| kaka2018english | 2018 | Machine Translation | Sorani |
| ahmadi2020machine | 2020 | Machine Translation | Sorani |
| goyal2021flores | 2021 | Machine Translation | 101 languages incl. Sorani |
| amini2021central | 2021 | Machine Translation | Sorani |
| ahmadi2022leveraging | 2022 | Machine Translation | Sorani |
| ahmadi2024cordi | 2024 | Machine Translation | Sorani |
| baban1995programmable | 1995 | Morphological and syntactic analysis | Sorani |
| walther2010developing | 2010 | Morphological and syntactic analysis | Sorani |
| walther2010fast | 2010 | Morphological and syntactic analysis | Kurmanji |
| salavati2013stemming | 2013 | Morphological and syntactic analysis | Sorani |
| jaf2014stemmer | 2014 | Morphological and syntactic analysis | Sorani |
| jaf2016chapter | 2016 | Morphological and syntactic analysis | Sorani |
| gokirmak2017dependency | 2017 | Morphological and syntactic analysis | Kurmanji |
| salavati2018building | 2018 | Morphological and syntactic analysis | Sorani |
| mustafa2018kurdish | 2018 | Morphological and syntactic analysis | Sorani |
| ahmadi2020towards | 2020 | Morphological and syntactic analysis | Sorani |
| ahmadi-2020-tokenization | 2020 | Morphological and syntactic analysis | Sorani, Kurmanji |
| ahmadi2021modelling | 2021 | Morphological and syntactic analysis | Sorani |
| ahmadi2020Hunspell | 2021 | Morphological and syntactic analysis | Sorani |
| naserzade2021ckmorph | 2021 | Morphological and syntactic analysis | Sorani |
| ahmadi2023revisiting | 2023 | Morphological and syntactic analysis | Sorani |
| mohammed2012uniqueness | 2012 | Optical character recognition | Sorani |
| mohammed2013handwritten | 2013 | Optical character recognition | Sorani |
| shaltookisentiment | 2016 | Optical character recognition | Sorani |
| zarro2017recognition | 2017 | Optical character recognition | Sorani |
| yaseen2018kurdish | 2018 | Optical character recognition | Sorani |
| dinler2018kurdish | 2018 | Optical character recognition | Sorani |
| app11209752 | 2021 | Optical character recognition | Sorani |
| kaka2017building | 2017 | Other | Sorani |
| mahmudi2021automatic | 2021 | Other | Sorani |
| ahmadi2021ickl | 2021 | Other | Sorani |
| ahmadi2023script | 2023 | Other | Sorani, Kurmanji, Gorani |
| hashim2018kurdish | 2018 | Sign language recognition | Sorani |
| kamal-hassani-2020-towards | 2020 | Sign language recognition | Sorani |
| daneshfar2009implementation | 2009 | Speech recognition | Sorani |
| barkhoda2009comparison | 2009 | Speech recognition | Sorani |
| bahrampour2009implementation | 2009 | Speech recognition | Sorani |
| hassani2011kurdish | 2011 | Speech recognition | Sorani |
| dinler2017formant | 2017 | Speech recognition | Kurmanji |
| dinler2018extraction | 2018 | Speech recognition | Sorani, Kurmanji |
| qader2019kurdish | 2019 | Speech recognition | Sorani |
| delgado2024kaset | 2024 | Speech recognition | Sorani, Kurmanji |
| ahmadi2024cordi | 2024 | Speech recognition | Sorani varieties |
| ahmadi-2020-klpt | 2020 | Toolkits | Sorani, Kurmanji |
| de2021multilingual | 2021 | Named-entity recognition | Kurmanji |
| abdullah2022 | 2022 | Sentiment analysis | Sorani |
| awlla2022 | 2022 | Sentiment analysis | Sorani |
| amin2022kurdish | 2022 | Sentiment analysis | Sorani |
| hameed2023sentiment | 2023 | Sentiment analysis | Sorani |
| zuhair2021 | 2021 | Other | Sorani |
| kamala2022kurdish | 2022 | Other | Sorani |
| ahmadi2023fieldmatters | 2023 | Language identification | Sorani, Kurmanji, Southern Kurdish, Zazaki, Gorani |
| ahmadi2023pali | 2023 | Language identification | Sorani, Kurmanji, Southern Kurdish, Gorani |
If you find the provided data useful for your project, feel free to use it, and please cite the following paper:
@inproceedings{ahmadi-2020-klpt,
title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
author = "Ahmadi, Sina",
booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.nlposs-1.11",
doi = "10.18653/v1/2020.nlposs-1.11",
pages = "72--84"
}
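
The citation keys in the reference table (for example `ahmadi-2020-klpt`) are meant to be looked up in the bibliography file. As a minimal sketch of how that lookup might be scripted, the snippet below pulls one raw entry out of BibTeX text by its key, using only the Python standard library. The function name `extract_bibtex_entry` and the `sample` string are illustrative assumptions and not part of this repository; the second entry in `sample` is a placeholder, not a real reference.

```python
import re
from typing import Optional


def extract_bibtex_entry(bibtex_text: str, key: str) -> Optional[str]:
    """Return the raw BibTeX entry whose citation key equals `key`, or None."""
    # Find the entry header, e.g. "@inproceedings{ahmadi-2020-klpt,"
    header = re.search(r"@\w+\s*\{\s*" + re.escape(key) + r"\s*,", bibtex_text)
    if header is None:
        return None
    start = header.start()
    depth = 0
    # Walk forward from the entry's opening brace, tracking nesting depth,
    # and stop at the brace that closes the entry.
    for i in range(bibtex_text.index("{", start), len(bibtex_text)):
        ch = bibtex_text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return bibtex_text[start:i + 1]
    return None  # braces never balanced


# Hypothetical input: a shortened copy of the entry above plus a placeholder.
sample = '''@inproceedings{ahmadi-2020-klpt,
  title = "{KLPT} {--} {K}urdish Language Processing Toolkit",
  year = "2020"
}
@article{placeholder-key,
  title = "Placeholder",
}'''

entry = extract_bibtex_entry(sample, "ahmadi-2020-klpt")
print(entry)
```

This sketch assumes braces are balanced and unescaped; for anything beyond a quick lookup, a dedicated BibTeX parser is likely a better fit.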