From 96ef921e1f031de588afe974b52667d9d2845e16 Mon Sep 17 00:00:00 2001
From: wannaphong
Date: Fri, 17 May 2024 17:20:02 +0000
Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20=20@=20cce00?=
 =?UTF-8?q?5df9526f942d5e102fd800e8b551a02b0e6=20=F0=9F=9A=80?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 index.html               |  2 +-
 search/search_index.json |  2 +-
 sitemap.xml.gz           | Bin 127 -> 127 bytes
 tasks/parser/index.html  | 16 ++++++++++++++++
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/index.html b/index.html
index b8f21e2..4a6ad1d 100644
--- a/index.html
+++ b/index.html
@@ -403,5 +403,5 @@

NLP For Thai

diff --git a/search/search_index.json b/search/search_index.json
index 403432f..6cab0b8 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"NLP For Thai It's Thai NLP homepage. All is Open Source. Website: NLPForThai.com maintained by PyThaiNLP Menu Tasks Other Contributors Thanks all the contributors . (Image made with contributors-img ) How to Contribute You can fork and send your pull request at https://github.com/PyThaiNLP/nlpforthai.com We build Thai NLP. PyThaiNLP","title":"NLP For Thai"},{"location":"#nlp-for-thai","text":"It's Thai NLP homepage. All is Open Source. Website: NLPForThai.com maintained by PyThaiNLP Menu Tasks Other Contributors Thanks all the contributors . (Image made with contributors-img ) How to Contribute You can fork and send your pull request at https://github.com/PyThaiNLP/nlpforthai.com We build Thai NLP. PyThaiNLP","title":"NLP For Thai"},{"location":"other/","text":"Other <- back to homepage Menu Dictionaries N-gram Word Similarity Name WordNet Word embeddings Sentence Embedding Glossary Dictionaries Name Description Size License Creator Download LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Yaitron Yaitron English-Thai and Thai-English dictionary based on LEXiTRON created since May 2006. An objective of Yaitron is to built a dictionary that is formatted in well formed XML and easy to be manipulated by machine. 
LEXiTRON License Vee Satayamas GitHub Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database sourceforge Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train 5000 word and test 1500 word Facebook Research GitHub Thai Wrong words dataset Wannaphong Phatthiyaphaibun GitHub up to menu N-gram Name Description Size License Creator Download Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook TTC N-gram from Thai text book 3,037,772 word Website Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Ti-gram) Faculty of Arts, Chulalongkorn University Website up to menu Word Similarity Name Description Size License Creator Download Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. Ponrudee Netisopakul\ufffc, Gerhard Wohlgenannt, Aleksei Pulich GitHub up to menu Thai Name Name Description Size License Creator Download Thai Male and Female Names Corpus The project contains Thai male, female, and family names, aimed for Thai language analysis. 22,058 Name CC BY-SA 4.0 Korkeat W. GitHub up to menu WordNet Name Description Size License Creator Download Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 
81% Website th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas sourceforge \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 2008 \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 Website \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Data 2008 \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Website up to menu Word embeddings Name Detail Download ConceptNet Numberbatch ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning. GitHub FastText Word vectors The pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. Website Thai2Fit (old Thai2Vec) Homepage Download word2vec: PyThaiNLP LTW2V: The Large Thai Word2Vec LTW2V is The large Thai Word2Vec. It built with oxidized-thainlp from OSCAR Corpus (Open Super-large Crawled Aggregated coRpus). GitHub Sentence Embedding Name Detail Paper Owner Download LASER LASER Language-Agnostic SEntence Representations Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Facebook GitHub MUSE Multilingual Universal Sentence Encoderfor Semantic Retrieval Multilingual Universal Sentence Encoder for Semantic Retrieval Google Tensorflow Hub LaBSE Language-Agnostic BERT Sentence Embedding by Google AI. 
Language-agnostic BERT Sentence Embedding Google Glossary Name Detail Website Thai Glossary Thai Glossary for Open Source Software by OpenTLE (backup) Website Glossary for Open Source Software by OpenTLE Web archive up to menu","title":"Other"},{"location":"other/#other","text":"<- back to homepage Menu Dictionaries N-gram Word Similarity Name WordNet Word embeddings Sentence Embedding Glossary","title":"Other"},{"location":"other/#dictionaries","text":"Name Description Size License Creator Download LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Yaitron Yaitron English-Thai and Thai-English dictionary based on LEXiTRON created since May 2006. An objective of Yaitron is to built a dictionary that is formatted in well formed XML and easy to be manipulated by machine. LEXiTRON License Vee Satayamas GitHub Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database sourceforge Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train 5000 word and test 1500 word Facebook Research GitHub Thai Wrong words dataset Wannaphong Phatthiyaphaibun GitHub up to menu","title":"Dictionaries"},{"location":"other/#n-gram","text":"Name Description Size License Creator Download Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook TTC N-gram from Thai text book 3,037,772 word Website Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Ti-gram) Faculty of Arts, Chulalongkorn University Website up to menu","title":"N-gram"},{"location":"other/#word-similarity","text":"Name Description Size License Creator Download Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. 
Ponrudee Netisopakul\ufffc, Gerhard Wohlgenannt, Aleksei Pulich GitHub up to menu","title":"Word Similarity"},{"location":"other/#thai-name","text":"Name Description Size License Creator Download Thai Male and Female Names Corpus The project contains Thai male, female, and family names, aimed for Thai language analysis. 22,058 Name CC BY-SA 4.0 Korkeat W. GitHub up to menu","title":"Thai Name"},{"location":"other/#wordnet","text":"Name Description Size License Creator Download Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 81% Website th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas sourceforge \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 2008 \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 Website \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Data 2008 \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Website up to menu","title":"WordNet"},{"location":"other/#word-embeddings","text":"Name Detail Download ConceptNet Numberbatch ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning. GitHub FastText Word vectors The pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. Website Thai2Fit (old Thai2Vec) Homepage Download word2vec: PyThaiNLP LTW2V: The Large Thai Word2Vec LTW2V is The large Thai Word2Vec. It built with oxidized-thainlp from OSCAR Corpus (Open Super-large Crawled Aggregated coRpus). 
GitHub","title":"Word embeddings"},{"location":"other/#sentence-embedding","text":"Name Detail Paper Owner Download LASER LASER Language-Agnostic SEntence Representations Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Facebook GitHub MUSE Multilingual Universal Sentence Encoderfor Semantic Retrieval Multilingual Universal Sentence Encoder for Semantic Retrieval Google Tensorflow Hub LaBSE Language-Agnostic BERT Sentence Embedding by Google AI. Language-agnostic BERT Sentence Embedding Google","title":"Sentence Embedding"},{"location":"other/#glossary","text":"Name Detail Website Thai Glossary Thai Glossary for Open Source Software by OpenTLE (backup) Website Glossary for Open Source Software by OpenTLE Web archive up to menu","title":"Glossary"},{"location":"tasks/","text":"Thai NLP Tasks Word Segmentation Sentence Segmentation Syllable Segmentation Part-of-speech tagging Named Entity Recognition Text Classification Text Generation Text Summarization Spell Correct Soundex Speech Recognition Speech Synthesis Speech Emotion Recognition Speech-to-text translation Optical Character Recognition Machine Translation Dependency Parser Grapheme to Phoneme Language model Question Answering Plagiarism Treebank Natural Language Inference Natural Language Understanding Image Captioning Spoken Language Understanding","title":"Thai NLP Tasks"},{"location":"tasks/#thai-nlp-tasks","text":"Word Segmentation Sentence Segmentation Syllable Segmentation Part-of-speech tagging Named Entity Recognition Text Classification Text Generation Text Summarization Spell Correct Soundex Speech Recognition Speech Synthesis Speech Emotion Recognition Speech-to-text translation Optical Character Recognition Machine Translation Dependency Parser Grapheme to Phoneme Language model Question Answering Plagiarism Treebank Natural Language Inference Natural Language Understanding Image Captioning Spoken Language Understanding","title":"Thai NLP 
Tasks"},{"location":"tasks/chatbot/","text":"Chatbot Model Name Description Size License Creator Download WangChanGLM WangChanGLM elephant\u200a-\u200aThe Multilingual Instruction-Following Model 7.5B CC BY-SA 4.0 VISTEC-depa AI Research Institute of Thailand & PyThaiNLP GitHub","title":"Chatbot"},{"location":"tasks/chatbot/#chatbot","text":"","title":"Chatbot"},{"location":"tasks/chatbot/#model","text":"Name Description Size License Creator Download WangChanGLM WangChanGLM elephant\u200a-\u200aThe Multilingual Instruction-Following Model 7.5B CC BY-SA 4.0 VISTEC-depa AI Research Institute of Thailand & PyThaiNLP GitHub","title":"Model"},{"location":"tasks/dependency_parser/","text":"Dependency Parser Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket or GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) is a project at Chulalongkorn University, Bangkok, Thailand. The annotation adopts the sense inventory from PDTB 3.0. 180 documents - Chulalongkorn University GitHub Software Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT license esupar Tokenizer, POS-tagger, and dependency-parser with Transformers and SuPar. 
active Python 3.X MIT license TowerParse TowerParse is a Python tool for multilingual dependency parsing, built on top of the HuggingFace Transformers library. Unlike other multilingual dependency parsers (e.g., UDify , UDapter), TowerParse offers a language-dedicated parsing model for each language (actually, for each test UD treebank, i.e., for languages with multiple treebanks, we offer multiple parsing models). ? Python 3.X CC0-1.0 license","title":"Dependency Parser"},{"location":"tasks/dependency_parser/#dependency-parser","text":"","title":"Dependency Parser"},{"location":"tasks/dependency_parser/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket or GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) is a project at Chulalongkorn University, Bangkok, Thailand. The annotation adopts the sense inventory from PDTB 3.0. 180 documents - Chulalongkorn University GitHub","title":"Corpus"},{"location":"tasks/dependency_parser/#software","text":"Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT license esupar Tokenizer, POS-tagger, and dependency-parser with Transformers and SuPar. active Python 3.X MIT license TowerParse TowerParse is a Python tool for multilingual dependency parsing, built on top of the HuggingFace Transformers library. 
Unlike other multilingual dependency parsers (e.g., UDify , UDapter), TowerParse offers a language-dedicated parsing model for each language (actually, for each test UD treebank, i.e., for languages with multiple treebanks, we offer multiple parsing models). ? Python 3.X CC0-1.0 license","title":"Software"},{"location":"tasks/g2p/","text":"Grapheme to Phoneme Corpus Name Description Size License Creator Download Grapheme to Phoneme Thai Grapheme to Phoneme from Wiktionary 14,483 word CC BY-SA 3.0 Wannaphong Phatthiyaphaibun GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) thpronun thpronun is a program for analyzing pronunciation of Thai words. The output can be in Thai pronunciation, Romanization, or in any other phonetic systems. It is designed to be extensible. active C/C++ GPL-3.0 License Thai G2P (grapheme to phoneme) dictionary-based conversion + BiLSTM seq2seq model (under construction) active Python 3.X CharsiuG2P CharsiuG2P is transformer based tool for grapheme-to-phoneme conversion in 100 languages. Given an orthographic word, CharsiuG2P predicts its pronunciation through a neural G2P model. active Python 3.X MIT license Software Name Detail Owner Download CharsiuG2P Multilingual G2P in 100 languages GitHub","title":"Grapheme to Phoneme"},{"location":"tasks/g2p/#grapheme-to-phoneme","text":"","title":"Grapheme to Phoneme"},{"location":"tasks/g2p/#corpus","text":"Name Description Size License Creator Download Grapheme to Phoneme Thai Grapheme to Phoneme from Wiktionary 14,483 word CC BY-SA 3.0 Wannaphong Phatthiyaphaibun GitHub","title":"Corpus"},{"location":"tasks/g2p/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) thpronun thpronun is a program for analyzing pronunciation of Thai words. The output can be in Thai pronunciation, Romanization, or in any other phonetic systems. It is designed to be extensible. active C/C++ GPL-3.0 License Thai G2P (grapheme to phoneme) dictionary-based conversion + BiLSTM seq2seq model (under construction) active Python 3.X CharsiuG2P CharsiuG2P is transformer based tool for grapheme-to-phoneme conversion in 100 languages. Given an orthographic word, CharsiuG2P predicts its pronunciation through a neural G2P model. active Python 3.X MIT license","title":"Software"},{"location":"tasks/g2p/#software_1","text":"Name Detail Owner Download CharsiuG2P Multilingual G2P in 100 languages GitHub","title":"Software"},{"location":"tasks/image-captioning/","text":"Image Captioning Software Name Description Status Language License Image Captioning in Thai: AI \u0e0a\u0e48\u0e27\u0e22\u0e1c\u0e39\u0e49\u0e1e\u0e34\u0e01\u0e32\u0e23\u0e17\u0e32\u0e07\u0e2a\u0e32\u0e22\u0e15\u0e32 Image Captioning in Thai from AI Builder https://www.facebook.com/aibuildersx/posts/175053151329799 Python 3.X ?","title":"Image Captioning"},{"location":"tasks/image-captioning/#image-captioning","text":"","title":"Image Captioning"},{"location":"tasks/image-captioning/#software","text":"Name Description Status Language License Image Captioning in Thai: AI \u0e0a\u0e48\u0e27\u0e22\u0e1c\u0e39\u0e49\u0e1e\u0e34\u0e01\u0e32\u0e23\u0e17\u0e32\u0e07\u0e2a\u0e32\u0e22\u0e15\u0e32 Image Captioning in Thai from AI Builder https://www.facebook.com/aibuildersx/posts/175053151329799 Python 3.X ?","title":"Software"},{"location":"tasks/language-model/","text":"Language model Text Corpus Name Description Size License Creator Download Thai Constitution Corpus The Constitution of Thailand Dataset Since 1932 Public Domain Wannaphong Phatthiyaphaibun GitHub Thai Law Thai Law Dataset (Act of 
Parliament) Public Domain Wannaphong Phatthiyaphaibun GitHub IO-LM Learn how to talk like an Information-Operation-er GitHub HC corpora HC corpora is a collection of corpora for various languages freely available to download. homepage : http://corpora.epizy.com/about.html MediaFire thai-joke-corpus Thai jokes scraped from 4 Thai jokes facebook pages collected by iApp Technology Co, Ltd. 449 Jokes GPL-3.0 License iApp Technology Co, Ltd GitHub Thai Literature Corpora (TLC) texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). a total of 34 documents, 292,270 lines, 31,790,734 characters Jitkapat Sawatphol Website HSE Thai Corpus A 35 Million Word Corpus of Thai Kaggle ThaiGov corpus Data from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub ThaiGov V2 Corpus Thai News Dataset from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub OSCAR Corpus OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 951,743,087 words public domain Homepage mC4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Hugging Face Multilingual Open Text 1.0: Public Domain News in 44 Languages This is a corpus of public domain news in 44 languages. public domain GitHub Thai depression detection dataset and baseline models Detecting Depression in Thai Blog Posts: a Dataset and a Baseline. Zenodo Enocder Preatrained Name Detail Owner Download Thai2Fit ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. 
Created as part of pyThaiNLP with ULMFit implementation from fast.ai Charin Polpanumas GitHub BERT-th BERT pre-training in Thai language ThAIKeras GitHub BERT-Base, Multilingual Cased 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters Google GitHub bert-base-th-cased We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. Geotrend Hugging Face WangchanBERTa Pretraining transformer-based Thai Language Models AI Research Institute of Thailand (AIResearch) GitHub & Hugging Face mLUKE A multilingual extension of LUKE. Hugging Face TwHIN-BERT TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations Twitter GitHub PhayaThaiBERT 278M P. Sriwirote Notebook Load BERT-th, BERT-Base, Multilingual Cased and bert-base-th-cased with Hugging Face in Python LLMs Name Parameters Detail Owner Download OpenThaiGPT 13B Kobkrit GitHub Typhoon 7B SCB10X Hugging Face SeaLLMs 13B DAMO GitHub Sea-Lion 7.5B AI Singapore GitHub WangChanGLM 7.5B VISTEC-PyThaiNLP GitHub","title":"Language model"},{"location":"tasks/language-model/#language-model","text":"","title":"Language model"},{"location":"tasks/language-model/#text-corpus","text":"Name Description Size License Creator Download Thai Constitution Corpus The Constitution of Thailand Dataset Since 1932 Public Domain Wannaphong Phatthiyaphaibun GitHub Thai Law Thai Law Dataset (Act of Parliament) Public Domain Wannaphong Phatthiyaphaibun GitHub IO-LM Learn how to talk like an Information-Operation-er GitHub HC corpora HC corpora is a collection of corpora for various languages freely available to download. homepage : http://corpora.epizy.com/about.html MediaFire thai-joke-corpus Thai jokes scraped from 4 Thai jokes facebook pages collected by iApp Technology Co, Ltd. 449 Jokes GPL-3.0 License iApp Technology Co, Ltd GitHub Thai Literature Corpora (TLC) texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). 
a total of 34 documents, 292,270 lines, 31,790,734 characters Jitkapat Sawatphol Website HSE Thai Corpus A 35 Million Word Corpus of Thai Kaggle ThaiGov corpus Data from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub ThaiGov V2 Corpus Thai News Dataset from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub OSCAR Corpus OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 951,743,087 words public domain Homepage mC4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Hugging Face Multilingual Open Text 1.0: Public Domain News in 44 Languages This is a corpus of public domain news in 44 languages. public domain GitHub Thai depression detection dataset and baseline models Detecting Depression in Thai Blog Posts: a Dataset and a Baseline. Zenodo","title":"Text Corpus"},{"location":"tasks/language-model/#enocder-preatrained","text":"Name Detail Owner Download Thai2Fit ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. Created as part of pyThaiNLP with ULMFit implementation from fast.ai Charin Polpanumas GitHub BERT-th BERT pre-training in Thai language ThAIKeras GitHub BERT-Base, Multilingual Cased 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters Google GitHub bert-base-th-cased We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. Geotrend Hugging Face WangchanBERTa Pretraining transformer-based Thai Language Models AI Research Institute of Thailand (AIResearch) GitHub & Hugging Face mLUKE A multilingual extension of LUKE. Hugging Face TwHIN-BERT TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations Twitter GitHub PhayaThaiBERT 278M P. 
Sriwirote","title":"Enocder Preatrained"},{"location":"tasks/language-model/#notebook","text":"Load BERT-th, BERT-Base, Multilingual Cased and bert-base-th-cased with Hugging Face in Python","title":"Notebook"},{"location":"tasks/language-model/#llms","text":"Name Parameters Detail Owner Download OpenThaiGPT 13B Kobkrit GitHub Typhoon 7B SCB10X Hugging Face SeaLLMs 13B DAMO GitHub Sea-Lion 7.5B AI Singapore GitHub WangChanGLM 7.5B VISTEC-PyThaiNLP GitHub","title":"LLMs"},{"location":"tasks/machine-translation/","text":"Machine Translation Corpus Name Description Size License Creator Download TALPCo TUFS Asian Language Parallel Corpus 1,327 sent CC BY 4.0 Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura GitHub scb-mt-en-th-2020 English-Thai Machine Translation Dataset with the collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and Digital Economy Promotion Agency (depa), publishes an open English-Thai machine translation dataset, with the sponsorship from Siam Commercial Bank (SCB) 1,001,752 segment pairs CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Documentation Data Set for Machine Translation A parallel evaluation data set of SAP software documentation with document structure annotation dev: 2048 segment pairs, test: 2050 segment pairs CC BY-NC 4.0 SAP GitHub Thai Lao Parallel corpus Thai Lao Parallel corpus CC0-1.0 License Wannaphong Phatthiyaphaibun GitHub Contradictory, My Dear Watson Translated text Non-English text converted to English language Kaggle Asian Language Treebank Parallel Corpus This is the Asian Language Treebank (ALT) Parallel Corpus. 
train: 1,698 articles, 18,088 sentences dev: 98 articles, 1,000 sentences test: 97 articles, 1,018 sentences CC BY 4.0 Website WikiLingua A Multilingual Abstractive Summarization Dataset 14,770 parallel (for thai) CC0-1.0 License Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown GitHub Web Inventory of Transcribed & Translated(WIT) Ted Talks The Web Inventory Talk is a collection of the original Ted talks and their translated version. The translations are available in more than 109+ languages, though the distribution is not uniform. Hugging Face generated_reviews_enth generated_reviews_enth is created as part of scb-mt-en-th-2020 for machine translation task. CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub FLORES-101 FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages. Facebook GitHub thai_usembassy This dataset collect all Thai & English news from U.S. Embassy Bangkok. CC-0 PyThaiNLP HuggingFace Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 Pretrained Name Description Status Language License Lalita Chinese-Thai Machine Translation Chinese-Thai Machine Translation by AI Builder active Python 3.X Apache License 2.0 English-Thai Machine Translation Models English-Thai Machine Translation Models by VISTEC-depa Thailand Artificial Intelligence Research Institute active Python 3.X Apache License 2.0","title":"Machine Translation"},{"location":"tasks/machine-translation/#machine-translation","text":"","title":"Machine Translation"},{"location":"tasks/machine-translation/#corpus","text":"Name Description Size License Creator Download TALPCo TUFS Asian Language Parallel Corpus 1,327 sent CC BY 4.0 Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura GitHub scb-mt-en-th-2020 English-Thai Machine Translation Dataset with the collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and Digital Economy Promotion Agency (depa), publishes an open English-Thai machine translation dataset, with the sponsorship from Siam Commercial Bank (SCB) 1,001,752 segment pairs CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Documentation Data Set for Machine Translation A parallel evaluation data set of SAP software documentation with document structure annotation dev: 2048 segment pairs, test: 2050 segment pairs CC BY-NC 4.0 SAP GitHub Thai Lao Parallel corpus Thai Lao Parallel corpus CC0-1.0 License Wannaphong Phatthiyaphaibun GitHub Contradictory, My Dear Watson Translated text Non-English text converted to English language Kaggle Asian Language Treebank Parallel Corpus This is the Asian Language Treebank (ALT) Parallel Corpus. 
train: 1,698 articles, 18,088 sentences dev: 98 articles, 1,000 sentences test: 97 articles, 1,018 sentences CC BY 4.0 Website WikiLingua A Multilingual Abstractive Summarization Dataset 14,770 parallel (for thai) CC0-1.0 License Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown GitHub Web Inventory of Transcribed & Translated(WIT) Ted Talks The Web Inventory Talk is a collection of the original Ted talks and their translated version. The translations are available in more than 109+ languages, though the distribution is not uniform. Hugging Face generated_reviews_enth generated_reviews_enth is created as part of scb-mt-en-th-2020 for machine translation task. CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub FLORES-101 FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages. Facebook GitHub thai_usembassy This dataset collect all Thai & English news from U.S. Embassy Bangkok. CC-0 PyThaiNLP HuggingFace","title":"Corpus"},{"location":"tasks/machine-translation/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/machine-translation/#pretrained","text":"Name Description Status Language License Lalita Chinese-Thai Machine Translation Chinese-Thai Machine Translation by AI Builder active Python 3.X Apache License 2.0 English-Thai Machine Translation Models English-Thai Machine Translation Models by VISTEC-depa Thailand Artificial Intelligence Research Institute active Python 3.X Apache License 2.0","title":"Pretrained"},{"location":"tasks/ner/","text":"Named Entity Recognition Corpus Name Description Size License Creator Download Thai-NNER (Thai Nested Named Entity Recognition Corpus) This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. 
Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from news articles and restaurant reviews, a total of 4894 documents. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. CC-BY-SA 3.0 IST, VISTEC GitHub \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a corpora by Wirote Aroonmanakun's students ? \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a Data \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 corpora by Wirote Aroonmanakun's students ? \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 Data \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 corpora by Wirote Aroonmanakun's students ? \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 Data Thai NER Thai NER project is part of PyThaiNLP. CC BY 3.0 Wannaphong Phatthiyaphaibun GitHub THAI-NEST Thai Named Entity tagging Corpus from NECTEC & Thammasat University CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) WikiANN WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. Rahimi, Afshin and Li, Yuan and Cohn, Trevor GitHub Crime Named Entity Recognition NER project with Thai crime news dataset GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Thai-NNER Thai Nested Named Entity Recognition active Python 3.X MIT License","title":"Named Entity Recognition"},{"location":"tasks/ner/#named-entity-recognition","text":"","title":"Named Entity Recognition"},{"location":"tasks/ner/#corpus","text":"Name Description Size License Creator Download Thai-NNER (Thai Nested Named Entity Recognition Corpus) This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from news articles and restaurant reviews, a total of 4894 documents. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. CC-BY-SA 3.0 IST, VISTEC GitHub \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a corpora by Wirote Aroonmanakun's students ? \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a Data \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 corpora by Wirote Aroonmanakun's students ? \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 Data \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 corpora by Wirote Aroonmanakun's students ? \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 Data Thai NER Thai NER project is part of PyThaiNLP. 
CC BY 3.0 Wannaphong Phatthiyaphaibun GitHub THAI-NEST Thai Named Entity tagging Corpus from NECTEC & Thammasat University CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) WikiANN WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. Rahimi, Afshin and Li, Yuan and Cohn, Trevor GitHub Crime Named Entity Recognition NER project with Thai crime news dataset GitHub","title":"Corpus"},{"location":"tasks/ner/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Thai-NNER Thai Nested Named Entity Recognition active Python 3.X MIT License","title":"Software"},{"location":"tasks/nli/","text":"Natural Language Inference Corpus Name Description Size License Creator Download XNLI The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 5,000 test and 2,500 dev pairs CC BY-NC 4.0 Facebook Research GitHub","title":"Natural Language Inference"},{"location":"tasks/nli/#natural-language-inference","text":"","title":"Natural Language Inference"},{"location":"tasks/nli/#corpus","text":"Name Description Size License Creator Download XNLI The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 
5,000 test and 2,500 dev pairs CC BY-NC 4.0 Facebook Research GitHub","title":"Corpus"},{"location":"tasks/nlu/","text":"Natural Language Understanding Corpus Name Description Size License Creator Download MASSIVE MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. ~1M utterances , 51 languages CC BY 4.0 Amazon GitHub Thai Winograd A collection of Winograd Schemas in the Thai language. These schemas are adapted from the original set of English Winograd Schemas proposed by Levesque et al., which was based on Ernest Davis's collection. A Winograd schema is a pair of sentences that differ by only a word or two. They include ambiguities that are resolved differently in each sentence and require world knowledge and reasoning to understand. This concept is named after Terry Winograd, who provided a well-known example. 285 questions CC BY 4.0 Phakphum Artkaew Hugging Face","title":"Natural Language Understanding"},{"location":"tasks/nlu/#natural-language-understanding","text":"","title":"Natural Language Understanding"},{"location":"tasks/nlu/#corpus","text":"Name Description Size License Creator Download MASSIVE MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. ~1M utterances , 51 languages CC BY 4.0 Amazon GitHub Thai Winograd A collection of Winograd Schemas in the Thai language. 
These schemas are adapted from the original set of English Winograd Schemas proposed by Levesque et al., which was based on Ernest Davis's collection. A Winograd schema is a pair of sentences that differ by only a word or two. They include ambiguities that are resolved differently in each sentence and require world knowledge and reasoning to understand. This concept is named after Terry Winograd, who provided a well-known example. 285 questions CC BY 4.0 Phakphum Artkaew Hugging Face","title":"Corpus"},{"location":"tasks/ocr/","text":"Optical Character Recognition Corpus Name Description Size License Creator Download KVIS Thai OCR Dataset Offline Thai Handwritten Character Dataset CC BY 4.0 John Joseph, Ferdin Joe Website Thai OCR Thai OCR dataset from NECTEC Training set: 81,100 images CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Thai handwriting number dataset Create Thai handwriting number dataset MIT @kittinan GitHub Software Name Description Status Language License Tesseract OCR Tesseract Open Source OCR Engine active C/C++ Apache License 2.0 Easy OCR Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai. active Python 3.X Apache License 2.0 Thai National Document Optical Character Recognition (THND OCR) Tesseract OCR tools for reading Thai National Documents, trained and fine-tuned using the TH Sarabun National Font. See README.md for details of the process.
active Python 3.X","title":"Optical Character Recognition"},{"location":"tasks/ocr/#optical-character-recognition","text":"","title":"Optical Character Recognition"},{"location":"tasks/ocr/#corpus","text":"Name Description Size License Creator Download KVIS Thai OCR Dataset Offline Thai Handwritten Character Dataset CC BY 4.0 John Joseph, Ferdin Joe Website Thai OCR Thai ocr dataset from NECTEC Training set: 81,100 image CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Thai handwriting number dataset Create Thai handwriting number dataset MIT @kittinan GitHub","title":"Corpus"},{"location":"tasks/ocr/#software","text":"Name Description Status Language License Tesseract OCR Tesseract Open Source OCR Engine active C/C++ Apache License 2.0 Easy OCR Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai. active Python 3.X Apache License 2.0 Thai National Document Optical Character Recognition (THND OCR) Tesseract OCR tools for read Thai National Document used TH Sarabun National Font trained and fine-tuned. Read README.md to see about my process. active Python 3.X","title":"Software"},{"location":"tasks/parser/","text":"Parser Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Software Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. 
active Python 3.X MIT License Link Grammar Parser A syntactic parser based on link grammar active Python 3.X LGPL","title":"Parser"},{"location":"tasks/parser/#parser","text":"","title":"Parser"},{"location":"tasks/parser/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub","title":"Corpus"},{"location":"tasks/parser/#software","text":"Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT License Link Grammar Parser A syntactic parser based on link grammar active Python 3.X LGPL","title":"Software"},{"location":"tasks/part-of-speech/","text":"Part-of-speech tagging Corpus Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub thai-political-tweets A small Thai political twitter dataset with UD POS tags 41 tweets, 965 words Unlicense License Can Udomcharoenchaikit GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Part-of-speech tagging"},{"location":"tasks/part-of-speech/#part-of-speech-tagging","text":"","title":"Part-of-speech tagging"},{"location":"tasks/part-of-speech/#corpus","text":"Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub thai-political-tweets A small Thai political twitter dataset with UD POS tags 41 tweets, 965 words Unlicense License Can Udomcharoenchaikit GitHub","title":"Corpus"},{"location":"tasks/part-of-speech/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Software"},{"location":"tasks/plagiarism/","text":"Plagiarism Corpus Name Description Size License Creator Download Thai Plagiarism Thai Plagiarism Detection http://copycatch.in.th/thai-plagiarism-task.html CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)","title":"Plagiarism"},{"location":"tasks/plagiarism/#plagiarism","text":"","title":"Plagiarism"},{"location":"tasks/plagiarism/#corpus","text":"Name Description Size License Creator Download Thai Plagiarism Thai Plagiarism Detection http://copycatch.in.th/thai-plagiarism-task.html CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)","title":"Corpus"},{"location":"tasks/question-answering/","text":"Question Answering Corpus Name Description Size License Creator Download XQuAD XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. 240 paragraphs and 1,190 question-answer pairs CC BY-SA 4.0 DeepMind GitHub Thai QA Question answering program from Thai Wikipedia. 4,000 question-answer pairs CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai (registration required), wiki: copycatch , Sample data set: copycatch TyDi QA A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages 200k human-annotated question-answer pairs Apache-2.0 License Google Research GitHub iapp-wiki-qa-dataset Open Thai Wikipedia QA Dataset made by iApp Technology 1,961 Documents 9,170 Questions MIT License iApp Technology GitHub MKQA MKQA: Multilingual Knowledge Questions & Answers. MKQA contains 10,000 queries sampled from the Google Natural Questions dataset. 
10,000 queries Apple GitHub Thai WIKI QA Dataset from National Software Contest (NSC) 2018 - 2019 Factoid 15,000 question-answer pairs, boolean 2,000 question CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai Software Name Detail Owner Download Zero-shot multilingual QA from DeepPavlov DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. Neural Networks and Deep Learning lab, MIPT GitHub Colab","title":"Question Answering"},{"location":"tasks/question-answering/#question-answering","text":"","title":"Question Answering"},{"location":"tasks/question-answering/#corpus","text":"Name Description Size License Creator Download XQuAD XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. 240 paragraphs and 1,190 question-answer pairs CC BY-SA 4.0 DeepMind GitHub Thai QA Question answering program from Thai Wikipedia. 4,000 question-answer pairs CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai (registration required), wiki: copycatch , Sample data set: copycatch TyDi QA A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages 200k human-annotated question-answer pairs Apache-2.0 License Google Research GitHub iapp-wiki-qa-dataset Open Thai Wikipedia QA Dataset made by iApp Technology 1,961 Documents 9,170 Questions MIT License iApp Technology GitHub MKQA MKQA: Multilingual Knowledge Questions & Answers. MKQA contains 10,000 queries sampled from the Google Natural Questions dataset. 10,000 queries Apple GitHub Thai WIKI QA Dataset from National Software Contest (NSC) 2018 - 2019 Factoid 15,000 question-answer pairs, boolean 2,000 question CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai","title":"Corpus"},{"location":"tasks/question-answering/#software","text":"Name Detail Owner Download Zero-shot multilingual QA from DeepPavlov DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. 
Neural Networks and Deep Learning lab, MIPT GitHub Colab","title":"Software"},{"location":"tasks/sentence-segmentation/","text":"Sentence Segmentation Corpus Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Fake review CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) BoydCut Bidirectional LSTM-CNN Model for Thai Sentence Segmenter active Python 3.X MIT License ThaiSum Simple Thai Sentence Segmentor active Python 3.X Apache Licence 2.0","title":"Sentence Segmentation"},{"location":"tasks/sentence-segmentation/#sentence-segmentation","text":"","title":"Sentence Segmentation"},{"location":"tasks/sentence-segmentation/#corpus","text":"Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 
122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Fake review CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub","title":"Corpus"},{"location":"tasks/sentence-segmentation/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) BoydCut Bidirectional LSTM-CNN Model for Thai Sentence Segmenter active Python 3.X MIT License ThaiSum Simple Thai Sentence Segmentor active Python 3.X Apache Licence 2.0","title":"Software"},{"location":"tasks/ser/","text":"Speech Emotion Recognition Corpus Name Description Size License Creator Download Thai Speech Emotion Dataset Thai Speech Emotion Recognition Dataset 36 hours CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) AIResearch Software Name Description Status Language License Vistec-AIS Speech Emotion Recognition Speech Emotion Recognition Model and Inferencing using Pytorch active Python 3.X Apache License 2.0","title":"Speech Emotion Recognition"},{"location":"tasks/ser/#speech-emotion-recognition","text":"","title":"Speech Emotion Recognition"},{"location":"tasks/ser/#corpus","text":"Name Description Size License Creator Download Thai Speech Emotion Dataset Thai Speech Emotion Recognition Dataset 36 hours CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) AIResearch","title":"Corpus"},{"location":"tasks/ser/#software","text":"Name Description Status Language License Vistec-AIS Speech Emotion Recognition Speech Emotion Recognition Model and Inferencing using Pytorch active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/slu/","text":"Spoken Language Understanding Corpus Name Description Size License Creator Download Facebook Multilingual SLU Dataset Dataset from Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog 5k annotated utterance in Thai Facebook Facebook MTOP: 
Multilingual TOP MTOP comprising 100k annotated utterances in 6 languages across 11 domains. Dataset from MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark Facebook Facebook","title":"Spoken Language Understanding"},{"location":"tasks/slu/#spoken-language-understanding","text":"","title":"Spoken Language Understanding"},{"location":"tasks/slu/#corpus","text":"Name Description Size License Creator Download Facebook Multilingual SLU Dataset Dataset from Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog 5k annotated utterances in Thai Facebook Facebook MTOP: Multilingual TOP MTOP comprising 100k annotated utterances in 6 languages across 11 domains. Dataset from MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark Facebook Facebook","title":"Corpus"},{"location":"tasks/soundex/","text":"Soundex Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 thpronun Use -s option to output soundex active C/C++ GPL-3.0 License","title":"Soundex"},{"location":"tasks/soundex/#soundex","text":"","title":"Soundex"},{"location":"tasks/soundex/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 thpronun Use -s option to output soundex active C/C++ GPL-3.0 License","title":"Software"},{"location":"tasks/speech-recognition/","text":"Speech Recognition Automatic Speech Recognition Corpus Name Description Size License Creator Download Lotus Thai Speech Recognition corpus from NECTEC (not full corpus) 12 hours CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot: GitHub Common Voice Corpus Common Voice Corpus from mozilla 171 hours (valid) CC0-1.0 License mozilla Common Voice Gowajee corpus The corpus was collected in the Automatic Speech Recognition class offered at Chulalongkorn University as a homework assignment.
11 hours MIT License Ekapol Chuangsuwanich, Atiwong Suchato, Korrawe Karunratanakul, Burin Naowarat, Chompakorn CChaichot and Penpicha Sangsa-nga GitHub Lotus BN Thai News Speech Recognition corpus from NECTEC (not full corpus) 28 minute CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Lotus Cell Thai Speech corpus over the phone. (not full corpus) 11 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Thai Elderly Speech dataset by Data Wow and VISAI Thai Elderly Speech dataset, consisting of 17 hours 11 minutes (19,200 files). The files are divided into 2 categories: Health care (health issues and services) and Smart Home (using Smart Home devices in household contexts). 17 hours 11 minutes CC BY-SA 4.0 VISAI AI Company Limited and Data Wow Company Limited VISAI AI Company Limited and Data Wow Company Limited FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY Google huggingface Thai Dialect Corpus Corpus of Central Thai dialect and three other Thai dialects (Khummuang, Korat, and Pattani). CC BY-SA 4.0 Chulalongkorn University Github Software Name Description Status Language License PyThaiASR PyThaiASR is a Python package for Automatic Speech Recognition with focus on Thai language. It have offline thai automatic speech recognition model from Artificial Intelligence Research Institute of Thailand (AIResearch.in.th). 
active Python 3.X Apache License 2.0 Pretrained Name Detail Owner Download wav2vec2-large-xlsr-53-th Finetuning wav2vec2-large-xlsr-53 on Thai Common Voice 7.0 Artificial Intelligence Research Institute of Thailand (AIResearch.in.th) Hugging Face Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer) + language model This model was trained on the CommonVoice V8 dataset, which extends the CommonVoice V7 data used for airesearch/wav2vec2-large-xlsr-53-th. It was fine-tuned from wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Thai Wav2Vec2 with CommonVoice V8 (deepcut tokenizer) + language model This model was trained on the CommonVoice V8 dataset, which extends the CommonVoice V7 data used for airesearch/wav2vec2-large-xlsr-53-th. It was fine-tuned from wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Whisper Whisper is a general-purpose speech recognition model. (includes S2T X->English) OpenAI GitHub Language Identification Corpus Name Description Size License Creator Download VoxLingua107 VoxLingua107 is a speech dataset for training spoken language identification models and contains data for 107 languages. (including Thai) 61 hours, 5.8G (Thai) CC-BY 4.0 License J\u00f6rgen Valk, Tanel Alum\u00e4e.
Website","title":"Speech Recognition"},{"location":"tasks/speech-recognition/#speech-recognition","text":"","title":"Speech Recognition"},{"location":"tasks/speech-recognition/#automatic-speech-recognition","text":"","title":"Automatic Speech Recognition"},{"location":"tasks/speech-recognition/#corpus","text":"Name Description Size License Creator Download Lotus Thai Speech Recognition corpus from NECTEC (not full corpus) 12 hours CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot: GitHub Common Voice Corpus Common Voice Corpus from mozilla 171 hours (valid) CC0-1.0 License mozilla Common Voice Gowajee corpus The corpus was collected in the Automatic Speech Recognition class offered at Chulalongkorn University as a homework assignment. 11 hours MIT License Ekapol Chuangsuwanich, Atiwong Suchato, Korrawe Karunratanakul, Burin Naowarat, Chompakorn CChaichot and Penpicha Sangsa-nga GitHub Lotus BN Thai News Speech Recognition corpus from NECTEC (not full corpus) 28 minute CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Lotus Cell Thai Speech corpus over the phone. (not full corpus) 11 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Thai Elderly Speech dataset by Data Wow and VISAI Thai Elderly Speech dataset, consisting of 17 hours 11 minutes (19,200 files). The files are divided into 2 categories: Health care (health issues and services) and Smart Home (using Smart Home devices in household contexts). 17 hours 11 minutes CC BY-SA 4.0 VISAI AI Company Limited and Data Wow Company Limited VISAI AI Company Limited and Data Wow Company Limited FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. 
CC BY Google huggingface XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY Google huggingface Thai Dialect Corpus Corpus of Central Thai dialect and three other Thai dialects (Khummuang, Korat, and Pattani). CC BY-SA 4.0 Chulalongkorn University Github","title":"Corpus"},{"location":"tasks/speech-recognition/#software","text":"Name Description Status Language License PyThaiASR PyThaiASR is a Python package for Automatic Speech Recognition with focus on Thai language. It have offline thai automatic speech recognition model from Artificial Intelligence Research Institute of Thailand (AIResearch.in.th). active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/speech-recognition/#preatrained","text":"Name Detail Owner Download wav2vec2-large-xlsr-53-th` Finetuning wav2vec2-large-xlsr-53 on Thai Common Voice 7.0 Artificial Intelligence Research Institute of Thailand (AIResearch.in.th) Hugging Face Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Thai Wav2Vec2 with CommonVoice V8 (deepcut tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Whisper Whisper is a general-purpose speech recognition model. 
(includes S2T X->English) OpenAI GitHub","title":"Pretrained"},{"location":"tasks/speech-recognition/#language-identification","text":"","title":"Language Identification"},{"location":"tasks/speech-recognition/#corpus_1","text":"Name Description Size License Creator Download VoxLingua107 VoxLingua107 is a speech dataset for training spoken language identification models and contains data for 107 languages. (including Thai) 61 hours, 5.8G (Thai) CC-BY 4.0 License J\u00f6rgen Valk, Tanel Alum\u00e4e. Website","title":"Corpus"},{"location":"tasks/speech-synthesis/","text":"Speech Synthesis Corpus Name Description Size License Creator Download TSync-1 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 6 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub TSync-2 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 5hr 25m CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot edited_common_voice This dataset is a Thai TTS dataset that uses voices from the Common Voice dataset and modifies them so that they do not sound like the original. - MIT taetiya taechamatavorn HuggingFace Software Name Description Status Language License Thai TTS Tacotron Thai_TTS is a project for training \"Text to Speech in Thai\" using Tacotron2 by NVIDIA.
active Python 3.X Apache License 2.0 PyThaiTTS Open Source Thai Text-to-speech library in Python active Python 3.X Apache License 2.0","title":"Speech Synthesis"},{"location":"tasks/speech-synthesis/#speech-synthesis","text":"","title":"Speech Synthesis"},{"location":"tasks/speech-synthesis/#corpus","text":"Name Description Size License Creator Download TSync-1 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 6 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub TSync-2 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 5hr 25m CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot edited_common_voice This dataset is a Thai TTS dataset that uses voices from the Common Voice dataset and modifies them so that they do not sound like the original. - MIT taetiya taechamatavorn HuggingFace","title":"Corpus"},{"location":"tasks/speech-synthesis/#software","text":"Name Description Status Language License Thai TTS Tacotron Thai_TTS is a project for training \"Text to Speech in Thai\" using Tacotron2 by NVIDIA. active Python 3.X Apache License 2.0 PyThaiTTS Open Source Thai Text-to-speech library in Python active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/speech2text-translation/","text":"Speech-to-text translation Speech-to-text translation (S2T) translates speech in one language into text in another language. Corpus Name Description Size License Creator Download FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface Model Name Detail Owner Download Whisper Whisper is a general-purpose speech recognition model.
(includes S2T X->English) OpenAI GitHub","title":"Speech-to-text translation"},{"location":"tasks/speech2text-translation/#speech-to-text-translation","text":"Speech-to-text translation (S2T) translates speech in one language into text in another language.","title":"Speech-to-text translation"},{"location":"tasks/speech2text-translation/#corpus","text":"Name Description Size License Creator Download FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface","title":"Corpus"},{"location":"tasks/speech2text-translation/#model","text":"Name Detail Owner Download Whisper Whisper is a general-purpose speech recognition model. (includes S2T X->English) OpenAI GitHub","title":"Model"},{"location":"tasks/speech_representations/","text":"Speech Representations Benchmark Name Description Size License Creator Download XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY 4.0 Google Hugging Face","title":"Speech Representations"},{"location":"tasks/speech_representations/#speech-representations","text":"","title":"Speech Representations"},{"location":"tasks/speech_representations/#benchmark","text":"Name Description Size License Creator Download XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval.
CC BY 4.0 Google Hugging Face","title":"Benchmark"},{"location":"tasks/spell-correct/","text":"Spell Correct Corpus Name Description Size License Creator Download VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub Software Name Description Status Language License Hunspell Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome. active C/C++ GNU Lesser General Public License and Mozilla Public License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 khanaa Khanaa is a tool to make spelling Thai more convenient. active Python 3.X MIT license","title":"Spell Correct"},{"location":"tasks/spell-correct/#spell-correct","text":"","title":"Spell Correct"},{"location":"tasks/spell-correct/#corpus","text":"Name Description Size License Creator Download VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub","title":"Corpus"},{"location":"tasks/spell-correct/#software","text":"Name Description Status Language License Hunspell Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome. active C/C++ GNU Lesser General Public License and Mozilla Public License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 khanaa Khanaa is a tool to make spelling Thai more convenient. 
active Python 3.X MIT license","title":"Software"},{"location":"tasks/syllable-segmentation/","text":"Syllable Segmentation Software Name Description Status Language License ssg CRF syllable segmenter for Thai active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Syllable Segmentation"},{"location":"tasks/syllable-segmentation/#syllable-segmentation","text":"","title":"Syllable Segmentation"},{"location":"tasks/syllable-segmentation/#software","text":"Name Description Status Language License ssg CRF syllable segmenter for Thai active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Software"},{"location":"tasks/text-classification/","text":"Text Classification Corpus Name Description Size Labels License Creator Download prachathai-67k News Article Corpus from Prachathai.com 67,889 articles wtih 51,797 tags 12 CC BY 4.0 @lukkiddd and @cstorm125 GitHub wisesight sentiment Social media messages in Thai language with sentiment label (positive, neutral, negative, question). 26,737 messages 4 CC0-1.0 License Arthit Suriyawongkul, Ekapol Chuangsuwanich GitHub wongnai corpus This project is a collection of Wongnai's datasets which are mostly in Thai language. 500K words labeled 5 LGPL-3.0 License wongnai GitHub Toxicity in Thai Tweet Corpus Toxicity in Thai Tweet Corpus 3,300 messages 2 CC BY-NC 4.0 Tokyo Metropolitan University Natural Language Processing Group GitHub Thai-Clickbait The dataset for Thai Clickbait classification train: 37,376 messages, test: 9,344 messages 1 MIT License @9meo at GitHub GitHub sentiment_analysis_thai Thai sentiment analysis from @JagerV3 2 ? @JagerV3 at GitHub GitHub thai-emojification Emojification of Thai Text, Using Deep Learning (LSTM). 
train: 128 messages, test: 55 messages 5 (\u2764\ufe0f\ud83d\ude04\ud83d\ude1e\ud83c\udf74\u26be) GPL-3.0 License iApp Technology Co, Ltd GitHub The 40 Thai Children Stories The dataset was collected from 40 Thai children stories. We manually split the text into sentences which leads to 1,964 sentences 1,964 sentences 3 ? Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya GitHub Thai sentiment analysis dataset Thai sentiment analysis dataset from PyThaiNLP 2 CC BY 3.0 PyThaiNLP GitHub LimeSoda: Dataset for Fake News Detection in Healthcare Domain Thai fake news dataset in the healthcare domain consisting of curate and manually annotated 7,191 documents annotated 7,191 documents 3 (fact, fake, or undefined) CC-BY-4.0 License Payoungkhamdee, Patomporn and Porkaew, Peerachet and Sinthunyathum, Atthasith and Songphum, Phattharaphon and Kawidam, Witsarut and Loha-Udom, Wichayut and Boonkwan, Prachya and Sutantayawalee, Vipas GitHub krathu-500 A dataset of post-comment on Pantip, a popular Thai web board. 3 (Positive, Negative, and Neutral) GitHub thai_cyberbullying_lgbt LGBT Cyberbullying Detection in Thai Language Utilizing Transformers-Based Algorithms GitHub Software Name Description Status Language License thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment active Python 3.X Apache License 2.0","title":"Text Classification"},{"location":"tasks/text-classification/#text-classification","text":"","title":"Text Classification"},{"location":"tasks/text-classification/#corpus","text":"Name Description Size Labels License Creator Download prachathai-67k News Article Corpus from Prachathai.com 67,889 articles wtih 51,797 tags 12 CC BY 4.0 @lukkiddd and @cstorm125 GitHub wisesight sentiment Social media messages in Thai language with sentiment label (positive, neutral, negative, question). 
26,737 messages 4 CC0-1.0 License Arthit Suriyawongkul, Ekapol Chuangsuwanich GitHub wongnai corpus This project is a collection of Wongnai's datasets which are mostly in Thai language. 500K words labeled 5 LGPL-3.0 License wongnai GitHub Toxicity in Thai Tweet Corpus Toxicity in Thai Tweet Corpus 3,300 messages 2 CC BY-NC 4.0 Tokyo Metropolitan University Natural Language Processing Group GitHub Thai-Clickbait The dataset for Thai Clickbait classification train: 37,376 messages, test: 9,344 messages 1 MIT License @9meo at GitHub GitHub sentiment_analysis_thai Thai sentiment analysis from @JagerV3 2 ? @JagerV3 at GitHub GitHub thai-emojification Emojification of Thai Text, Using Deep Learning (LSTM). train: 128 messages, test: 55 messages 5 (\u2764\ufe0f\ud83d\ude04\ud83d\ude1e\ud83c\udf74\u26be) GPL-3.0 License iApp Technology Co, Ltd GitHub The 40 Thai Children Stories The dataset was collected from 40 Thai children stories. We manually split the text into sentences which leads to 1,964 sentences 1,964 sentences 3 ? Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya GitHub Thai sentiment analysis dataset Thai sentiment analysis dataset from PyThaiNLP 2 CC BY 3.0 PyThaiNLP GitHub LimeSoda: Dataset for Fake News Detection in Healthcare Domain Thai fake news dataset in the healthcare domain consisting of curate and manually annotated 7,191 documents annotated 7,191 documents 3 (fact, fake, or undefined) CC-BY-4.0 License Payoungkhamdee, Patomporn and Porkaew, Peerachet and Sinthunyathum, Atthasith and Songphum, Phattharaphon and Kawidam, Witsarut and Loha-Udom, Wichayut and Boonkwan, Prachya and Sutantayawalee, Vipas GitHub krathu-500 A dataset of post-comment on Pantip, a popular Thai web board. 
3 (Positive, Negative, and Neutral) GitHub thai_cyberbullying_lgbt LGBT Cyberbullying Detection in Thai Language Utilizing Transformers-Based Algorithms GitHub","title":"Corpus"},{"location":"tasks/text-classification/#software","text":"Name Description Status Language License thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/text-generation/","text":"Text Generation Software Name Description Status Language License TTG Thai Text Generator active Python 3.X Apache License 2.0 Pretrained Name Detail Owner Download Flax's GPT-2 base GPT-2 Base Thai is a causal language model based on the OpenAI GPT-2 model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_th subset. The model was trained from scratch and achieved an evaluation loss of 1.708 and an evaluation perplexity of 5.516. Flax Community Hugging Face GPT-Neo GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model. (It is not training for Thai but It's can working with Thai) EleutherAI Hugging Face Thai GPT Next It is fine-tune the GPT-Neo model for Thai language. Wannaphong Phatthiyaphaibun GitHub","title":"Text Generation"},{"location":"tasks/text-generation/#text-generation","text":"","title":"Text Generation"},{"location":"tasks/text-generation/#software","text":"Name Description Status Language License TTG Thai Text Generator active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/text-generation/#pretrained","text":"Name Detail Owner Download Flax's GPT-2 base GPT-2 Base Thai is a causal language model based on the OpenAI GPT-2 model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_th subset. 
The model was trained from scratch and achieved an evaluation loss of 1.708 and an evaluation perplexity of 5.516. Flax Community Hugging Face GPT-Neo GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model. (It is not training for Thai but It's can working with Thai) EleutherAI Hugging Face Thai GPT Next It is fine-tune the GPT-Neo model for Thai language. Wannaphong Phatthiyaphaibun GitHub","title":"Pretrained"},{"location":"tasks/text-summarization/","text":"Text Summarization Corpus Name Description Size License Creator Download ThaiSum The largest dataset for Thai text summarization. 350,000 articles (2.9 GB) MIT Licence Nakhun Chumpolsathien GitHub TR-TPBS A dataset for Thai text summarization. 310K articles MIT License Nakhun Chumpolsathien GitHub XL-Sum This dataset annotated article-summary pairs from BBC News and covers 45 languages ranging from low to high-resource. 8,268 (for thai) CC BY-NC-SA 4.0 GitHub ThaiCrossSum Corpora Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization th-en 310,926 articles and th-zh 310,926 articles Nakhun Chumpolsathien GitHub Pretrained Model Detail Paper Download mT5: Multilingual T5 Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. 
mT5: A massively multilingual pre-trained text-to-text transformer GitHub BertSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub ARedSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub TNCLS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MT Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub XLS \u2013 RL-ROUGE Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub mt5-cpe-kmutt-thai-sentence-sum This repository contains the finetuned mT5-base model for Thai sentence summarization. huggingface","title":"Text Summarization"},{"location":"tasks/text-summarization/#text-summarization","text":"","title":"Text Summarization"},{"location":"tasks/text-summarization/#corpus","text":"Name Description Size License Creator Download ThaiSum The largest dataset for Thai text summarization. 350,000 articles (2.9 GB) MIT Licence Nakhun Chumpolsathien GitHub TR-TPBS A dataset for Thai text summarization. 310K articles MIT License Nakhun Chumpolsathien GitHub XL-Sum This dataset annotated article-summary pairs from BBC News and covers 45 languages ranging from low to high-resource. 
8,268 (for thai) CC BY-NC-SA 4.0 GitHub ThaiCrossSum Corpora Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization th-en 310,926 articles and th-zh 310,926 articles Nakhun Chumpolsathien GitHub","title":"Corpus"},{"location":"tasks/text-summarization/#pretrained","text":"Model Detail Paper Download mT5: Multilingual T5 Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. mT5: A massively multilingual pre-trained text-to-text transformer GitHub BertSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub ARedSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub TNCLS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MT Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub XLS \u2013 RL-ROUGE Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub mt5-cpe-kmutt-thai-sentence-sum This repository contains the finetuned mT5-base model for Thai sentence summarization. huggingface","title":"Pretrained"},{"location":"tasks/transliterate/","text":"Transliterate Corpus Name Description Size License Creator Download Thai2Rom Thai Romanization Dataset CC BY-SA 3.0 Wannaphong Phatthiyaphaibun kaggle Thai-English transliteration dictionary This project is Thai-English transliteration dictionary. It is store words for Thai-English transliteration pairs. Thai words are English words from English to Thai by transliteration in Thai. CC-BY 4.0 Wannaphong Phatthiyaphaibun GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Wunsen Wunsen transliterates/transcribes from other languages into Thai. active Python 3.X MIT License","title":"Transliterate"},{"location":"tasks/transliterate/#transliterate","text":"","title":"Transliterate"},{"location":"tasks/transliterate/#corpus","text":"Name Description Size License Creator Download Thai2Rom Thai Romanization Dataset CC BY-SA 3.0 Wannaphong Phatthiyaphaibun kaggle Thai-English transliteration dictionary This project is Thai-English transliteration dictionary. It is store words for Thai-English transliteration pairs. Thai words are English words from English to Thai by transliteration in Thai. CC-BY 4.0 Wannaphong Phatthiyaphaibun GitHub","title":"Corpus"},{"location":"tasks/transliterate/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Wunsen Wunsen transliterates/transcribes from other languages into Thai. active Python 3.X MIT License","title":"Software"},{"location":"tasks/treebank/","text":"Treebank Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Treebanks Dataset (thtb) To enable research oppotunities with very few Thai Computational Linguitic resources, we willingly introduce fundamental high-level language resouces built with passion, Thai Treebanks, build from scratch for researchers and enthusiasts. 5,200 sentences CC BY 4.0 Pechlada Seenual, Thodsaporn Chay-intr and Thanaruk Theeramunkong GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. 
It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket","title":"Treebank"},{"location":"tasks/treebank/#treebank","text":"","title":"Treebank"},{"location":"tasks/treebank/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Treebanks Dataset (thtb) To enable research oppotunities with very few Thai Computational Linguitic resources, we willingly introduce fundamental high-level language resouces built with passion, Thai Treebanks, build from scratch for researchers and enthusiasts. 5,200 sentences CC BY 4.0 Pechlada Seenual, Thodsaporn Chay-intr and Thanaruk Theeramunkong GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket","title":"Corpus"},{"location":"tasks/word-segmentation/","text":"Word Segmentation for Thai language, Word Segmentation is the first step for process Thai text for segment thai text to words. Corpus Name Description Size License Creator Download BEST I (BEST 2009) Benchmark for Enhancing the Standard of Thai language processing 5,000,000 word CC BY-SA-NC 4.0 NECTEC aiforthai (registration required) and Mirror from @korakot Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. 
It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Wisesight Samples with Word Tokenization Label This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. 160 sentences (wisesight-160) and 1,000 sentences (wiseight-1000) CC0-1.0 License Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford GitHub Thai National Historical Corpus (TNHC) texts from Thai National Historical Corpus, stored by lines (manually tokenized). 47 documents, 756,478 lines, 13,361,142 characters Jitkapat Sawatphol GitHub Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 3.0 NECTEC Mirror from @wannaphong Corpus Komped Poem (windy part) Pattarawat Chormai GitHub VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub BEST I BEST I is the Benchmark for Enhancing the Standard of Thai language processing. Number of words: 5,000,000 words Details Creator: NECTEC License: CC BY-SA-NC 4.0 Paper: Download: aiforthai (registration required) Benchmarks We are not benchmarks for this corpus because we have not an answer of testset. Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 
122,851 clauses (38,558 sentences) Details Creator: NECTEC License: CC-BY 3.0 Download: bitbucket Benchmarks [WIP] Orchid Corpus Orchid Corpus is Thai part of speech (POS) tagged corpus with word segmentation corpus. Number of words: words Details Creator: NECTEC License: CC BY-SA-NC 3.0 Paper: Thai Part-of-speech Tagged Corpus: ORCHID Download: Mirror from @wannaphong Benchmarks Orchid Corpus is not have the testset. Wisesight Corpus This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. wisesight-160 has 160 sentences. Number of words: 3,833 words wiseight-1000 has 1,000 sentences. Number of words: 21,745 words Benchmarks [WIP] Thai National Historical Corpus Thai National Historical Corpus or TNHC tokenized by humans. Number of words: ? words 47 documents, 756,478 lines, 13,361,142 characters Details Creator: Jitkapat Sawatphol Download: GitHub Corpus Komped Poem (windy part) Number of words: 317 words Details Creator: Pattarawat Chormai License: CC-BY-SA 3.0 Paper: - Download: GitHub Benchmarks [WIP] VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. Number of words: 3.39M words Details Creator: VISTEC & Chiang Mai University License: CC-BY-SA 3.0 Paper: - Download: GitHub Software Name Description Status Language License ICU ICU - International Components for Unicode active C/C++/Java Unicode License libthai is a set of Thai language support routines aimed to ease developers' tasks to incorporate Thai language support in their applications. active C/C++ LGPL-2.1 License SWATH Smart Word Analysis for THai active C/C++ GPL-2.0 License AttaCut Fast and Reasonably Accurate Word Tokenizer for Thai. active Python 3.X MIT License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 PyWordCut wordcutpy is a simple Thai word breaker written in Python 3+ active Python 3.X LGPLv3 DeepCut A Thai word tokenization library using Deep Neural Network. active Python 3.X MIT License TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) KUCut Thai word segmentor that is difference from existing segmentor such as CTTEX or SWATH. deactive Python 2.4-2.5 GPL-2.0 License SEFR CUT Stacked Ensemble Filter and Refine for Word Segmentation active Python 3.X MIT License CutKum Thai Word-Segmentation with LSTM in Tensorflow - Python 3.X MIT License ThaiLMCut Word Tokenizer for Thai Language based on Transfer Learning and bidirectional-LSTM active Python 3.X MIT License LexTo Thai word segmentation ( Longest Matching ) - Java LGPLv2.1 sertiscorp /thai-word-segmentation Thai word segmentation with bi-directional RNN - Python 3.X MIT License Thai Analysis Plugin for Elasticsearch The Thaichub2 (thai-chub-chub) Analysis Plugin integrates the Thai word segmentation modules into Elasticsearch. active Java Apache-2.0 License Wordcut Thai word breaker for Node.js active JavaScript, Node.JS LGPLv3 V8 BreakIterator Chrome's V8 Engine, using ICU active JavaScript Apache License 2.0 icu-wordsplit Simple icu boundary analysis module bindings for node.js inactive JavaScript BSD newmm-tokenizer Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP. active Python 3.X Apache License 2.0 Stanza Official Stanford NLP Python Library for Many Human Languages active Python 3.X Apache License 2.0 Multi Candidate Thai Word Segmentation Most existing word segmentation methods output one single segmentation solution. active Python 3.X MIT License PhlongTaIam PHP Thai word breaker active PHP LGPL-2.1 License Chamkho Rust Thai word breaker active Rust LGPL-3 License oxidized-thainlp Thai Natural Language Processing in Rust, with Python-binding. 
active Python & Rust Apache License 2.0 OSKut Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings) Stacked Ensemble Framework and DeepCut as Baseline model active Python MIT License Tools Name Description License Creator Download MudYom MudYom is a module for pre/post-processing text. It combines, aka \u0e21\u0e31\u0e14, words that should be together into one token. This process is done according to a user-defined dictionary. Pattarawat Chormai GitHub","title":"Word Segmentation"},{"location":"tasks/word-segmentation/#word-segmentation","text":"for Thai language, Word Segmentation is the first step for process Thai text for segment thai text to words.","title":"Word Segmentation"},{"location":"tasks/word-segmentation/#corpus","text":"Name Description Size License Creator Download BEST I (BEST 2009) Benchmark for Enhancing the Standard of Thai language processing 5,000,000 word CC BY-SA-NC 4.0 NECTEC aiforthai (registration required) and Mirror from @korakot Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Wisesight Samples with Word Tokenization Label This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. 160 sentences (wisesight-160) and 1,000 sentences (wiseight-1000) CC0-1.0 License Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford GitHub Thai National Historical Corpus (TNHC) texts from Thai National Historical Corpus, stored by lines (manually tokenized). 
47 documents, 756,478 lines, 13,361,142 characters Jitkapat Sawatphol GitHub Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 3.0 NECTEC Mirror from @wannaphong Corpus Komped Poem (windy part) Pattarawat Chormai GitHub VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub","title":"Corpus"},{"location":"tasks/word-segmentation/#best-i","text":"BEST I is the Benchmark for Enhancing the Standard of Thai language processing. Number of words: 5,000,000 words Details Creator: NECTEC License: CC BY-SA-NC 4.0 Paper: Download: aiforthai (registration required)","title":"BEST I"},{"location":"tasks/word-segmentation/#benchmarks","text":"We are not benchmarks for this corpus because we have not an answer of testset.","title":"Benchmarks"},{"location":"tasks/word-segmentation/#blackboard-treebank","text":"Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) Details Creator: NECTEC License: CC-BY 3.0 Download: bitbucket","title":"Blackboard Treebank"},{"location":"tasks/word-segmentation/#benchmarks_1","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#orchid-corpus","text":"Orchid Corpus is Thai part of speech (POS) tagged corpus with word segmentation corpus. 
Number of words: words Details Creator: NECTEC License: CC BY-SA-NC 3.0 Paper: Thai Part-of-speech Tagged Corpus: ORCHID Download: Mirror from @wannaphong","title":"Orchid Corpus"},{"location":"tasks/word-segmentation/#benchmarks_2","text":"Orchid Corpus is not have the testset.","title":"Benchmarks"},{"location":"tasks/word-segmentation/#wisesight-corpus","text":"This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. wisesight-160 has 160 sentences. Number of words: 3,833 words wiseight-1000 has 1,000 sentences. Number of words: 21,745 words","title":"Wisesight Corpus"},{"location":"tasks/word-segmentation/#benchmarks_3","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#thai-national-historical-corpus","text":"Thai National Historical Corpus or TNHC tokenized by humans. Number of words: ? words 47 documents, 756,478 lines, 13,361,142 characters Details Creator: Jitkapat Sawatphol Download: GitHub","title":"Thai National Historical Corpus"},{"location":"tasks/word-segmentation/#corpus-komped-poem-windy-part","text":"Number of words: 317 words Details Creator: Pattarawat Chormai License: CC-BY-SA 3.0 Paper: - Download: GitHub","title":"Corpus Komped Poem (windy part)"},{"location":"tasks/word-segmentation/#benchmarks_4","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#vistec-tp-th-21","text":"The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 
Number of words: 3.39M words Details Creator: VISTEC & Chiang Mai University License: CC-BY-SA 3.0 Paper: - Download: GitHub","title":"VISTEC-TP-TH-21"},{"location":"tasks/word-segmentation/#software","text":"Name Description Status Language License ICU ICU - International Components for Unicode active C/C++/Java Unicode License libthai is a set of Thai language support routines aimed to ease developers' tasks to incorporate Thai language support in their applications. active C/C++ LGPL-2.1 License SWATH Smart Word Analysis for THai active C/C++ GPL-2.0 License AttaCut Fast and Reasonably Accurate Word Tokenizer for Thai. active Python 3.X MIT License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 PyWordCut wordcutpy is a simple Thai word breaker written in Python 3+ active Python 3.X LGPLv3 DeepCut A Thai word tokenization library using Deep Neural Network. active Python 3.X MIT License TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) KUCut Thai word segmentor that is difference from existing segmentor such as CTTEX or SWATH. deactive Python 2.4-2.5 GPL-2.0 License SEFR CUT Stacked Ensemble Filter and Refine for Word Segmentation active Python 3.X MIT License CutKum Thai Word-Segmentation with LSTM in Tensorflow - Python 3.X MIT License ThaiLMCut Word Tokenizer for Thai Language based on Transfer Learning and bidirectional-LSTM active Python 3.X MIT License LexTo Thai word segmentation ( Longest Matching ) - Java LGPLv2.1 sertiscorp /thai-word-segmentation Thai word segmentation with bi-directional RNN - Python 3.X MIT License Thai Analysis Plugin for Elasticsearch The Thaichub2 (thai-chub-chub) Analysis Plugin integrates the Thai word segmentation modules into Elasticsearch. 
active Java Apache-2.0 License Wordcut Thai word breaker for Node.js active JavaScript, Node.JS LGPLv3 V8 BreakIterator Chrome's V8 Engine, using ICU active JavaScript Apache License 2.0 icu-wordsplit Simple icu boundary analysis module bindings for node.js inactive JavaScript BSD newmm-tokenizer Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP. active Python 3.X Apache License 2.0 Stanza Official Stanford NLP Python Library for Many Human Languages active Python 3.X Apache License 2.0 Multi Candidate Thai Word Segmentation Most existing word segmentation methods output one single segmentation solution. active Python 3.X MIT License PhlongTaIam PHP Thai word breaker active PHP LGPL-2.1 License Chamkho Rust Thai word breaker active Rust LGPL-3 License oxidized-thainlp Thai Natural Language Processing in Rust, with Python-binding. active Python & Rust Apache License 2.0 OSKut Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings) Stacked Ensemble Framework and DeepCut as Baseline model active Python MIT License","title":"Software"},{"location":"tasks/word-segmentation/#tools","text":"Name Description License Creator Download MudYom MudYom is a module for pre/post-processing text. It combines, aka \u0e21\u0e31\u0e14, words that should be together into one token. This process is done according to a user-defined dictionary. Pattarawat Chormai GitHub","title":"Tools"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"NLP For Thai It's Thai NLP homepage. All is Open Source. Website: NLPForThai.com maintained by PyThaiNLP Menu Tasks Other Contributors Thanks all the contributors . (Image made with contributors-img ) How to Contribute You can fork and send your pull request at https://github.com/PyThaiNLP/nlpforthai.com We build Thai NLP. 
PyThaiNLP","title":"NLP For Thai"},{"location":"#nlp-for-thai","text":"It's Thai NLP homepage. All is Open Source. Website: NLPForThai.com maintained by PyThaiNLP Menu Tasks Other Contributors Thanks all the contributors . (Image made with contributors-img ) How to Contribute You can fork and send your pull request at https://github.com/PyThaiNLP/nlpforthai.com We build Thai NLP. PyThaiNLP","title":"NLP For Thai"},{"location":"other/","text":"Other <- back to homepage Menu Dictionaries N-gram Word Similarity Name WordNet Word embeddings Sentence Embedding Glossary Dictionaries Name Description Size License Creator Download LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Yaitron Yaitron English-Thai and Thai-English dictionary based on LEXiTRON created since May 2006. An objective of Yaitron is to built a dictionary that is formatted in well formed XML and easy to be manipulated by machine. LEXiTRON License Vee Satayamas GitHub Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database sourceforge Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train 5000 word and test 1500 word Facebook Research GitHub Thai Wrong words dataset Wannaphong Phatthiyaphaibun GitHub up to menu N-gram Name Description Size License Creator Download Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook TTC N-gram from Thai text book 3,037,772 word Website Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Ti-gram) Faculty of Arts, Chulalongkorn University Website up to menu Word Similarity Name Description Size License Creator Download Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. 
Ponrudee Netisopakul\ufffc, Gerhard Wohlgenannt, Aleksei Pulich GitHub up to menu Thai Name Name Description Size License Creator Download Thai Male and Female Names Corpus The project contains Thai male, female, and family names, aimed for Thai language analysis. 22,058 Name CC BY-SA 4.0 Korkeat W. GitHub up to menu WordNet Name Description Size License Creator Download Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 81% Website th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas sourceforge \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 2008 \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 Website \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Data 2008 \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Website up to menu Word embeddings Name Detail Download ConceptNet Numberbatch ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning. GitHub FastText Word vectors The pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. Website Thai2Fit (old Thai2Vec) Homepage Download word2vec: PyThaiNLP LTW2V: The Large Thai Word2Vec LTW2V is The large Thai Word2Vec. It built with oxidized-thainlp from OSCAR Corpus (Open Super-large Crawled Aggregated coRpus). GitHub Sentence Embedding Name Detail Paper Owner Download LASER LASER Language-Agnostic SEntence Representations Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Facebook GitHub MUSE Multilingual Universal Sentence Encoderfor Semantic Retrieval Multilingual Universal Sentence Encoder for Semantic Retrieval Google Tensorflow Hub LaBSE Language-Agnostic BERT Sentence Embedding by Google AI. 
Language-agnostic BERT Sentence Embedding Google Glossary Name Detail Website Thai Glossary Thai Glossary for Open Source Software by OpenTLE (backup) Website Glossary for Open Source Software by OpenTLE Web archive up to menu","title":"Other"},{"location":"other/#other","text":"<- back to homepage Menu Dictionaries N-gram Word Similarity Name WordNet Word embeddings Sentence Embedding Glossary","title":"Other"},{"location":"other/#dictionaries","text":"Name Description Size License Creator Download LEXiTRON Thai<->English Dictionary Thai-English 83,000 words CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Yaitron Yaitron English-Thai and Thai-English dictionary based on LEXiTRON created since May 2006. An objective of Yaitron is to built a dictionary that is formatted in well formed XML and easy to be manipulated by machine. LEXiTRON License Vee Satayamas GitHub Volubilis Dict - Thai-English-French VOLUBILIS - Thai English French Database sourceforge Ground-truth bilingual dictionaries 110 large-scale ground-truth bilingual dictionaries train 5000 word and test 1500 word Facebook Research GitHub Thai Wrong words dataset Wannaphong Phatthiyaphaibun GitHub up to menu","title":"Dictionaries"},{"location":"other/#n-gram","text":"Name Description Size License Creator Download Unigram from OSCAR Corpus Unigram from OSCAR Corpus Korakot Chaovavanich Facebook TTC N-gram from Thai text book 3,037,772 word Website Thai National Corpus Thai National Corpus (Unigram, Bi-gram, Ti-gram) Faculty of Arts, Chulalongkorn University Website up to menu","title":"N-gram"},{"location":"other/#word-similarity","text":"Name Description Size License Creator Download Word Similarity Datasets for Thai Language This repo contains translated and re-rated datasets for word similarity for Thai language. 
Ponrudee Netisopakul\ufffc, Gerhard Wohlgenannt, Aleksei Pulich GitHub up to menu","title":"Word Similarity"},{"location":"other/#thai-name","text":"Name Description Size License Creator Download Thai Male and Female Names Corpus The project contains Thai male, female, and family names, aimed for Thai language analysis. 22,058 Name CC BY-SA 4.0 Korkeat W. GitHub up to menu","title":"Thai Name"},{"location":"other/#wordnet","text":"Name Description Size License Creator Download Open Multilingual Wordnet The goal is to make it easy to use wordnets in multiple languages. 81% Website th-wn-sqlite Thai wordnet in SQLite - Vee Satayamas sourceforge \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 2008 \u0e18\u0e19\u0e19\u0e17\u0e4c \u0e2b\u0e25\u0e35\u0e19\u0e49\u0e2d\u0e22 Website \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Data 2008 \u0e1b\u0e23\u0e34\u0e28\u0e19\u0e32 \u0e2d\u0e31\u0e04\u0e23\u0e1e\u0e38\u0e17\u0e18\u0e34\u0e1e\u0e23 Website up to menu","title":"WordNet"},{"location":"other/#word-embeddings","text":"Name Detail Download ConceptNet Numberbatch ConceptNet Numberbatch is a set of semantic vectors (also known as word embeddings) than can be used directly as a representation of word meanings or as a starting point for further machine learning. GitHub FastText Word vectors The pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. Website Thai2Fit (old Thai2Vec) Homepage Download word2vec: PyThaiNLP LTW2V: The Large Thai Word2Vec LTW2V is The large Thai Word2Vec. It built with oxidized-thainlp from OSCAR Corpus (Open Super-large Crawled Aggregated coRpus). 
GitHub","title":"Word embeddings"},{"location":"other/#sentence-embedding","text":"Name Detail Paper Owner Download LASER LASER Language-Agnostic SEntence Representations Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Facebook GitHub MUSE Multilingual Universal Sentence Encoderfor Semantic Retrieval Multilingual Universal Sentence Encoder for Semantic Retrieval Google Tensorflow Hub LaBSE Language-Agnostic BERT Sentence Embedding by Google AI. Language-agnostic BERT Sentence Embedding Google","title":"Sentence Embedding"},{"location":"other/#glossary","text":"Name Detail Website Thai Glossary Thai Glossary for Open Source Software by OpenTLE (backup) Website Glossary for Open Source Software by OpenTLE Web archive up to menu","title":"Glossary"},{"location":"tasks/","text":"Thai NLP Tasks Word Segmentation Sentence Segmentation Syllable Segmentation Part-of-speech tagging Named Entity Recognition Text Classification Text Generation Text Summarization Spell Correct Soundex Speech Recognition Speech Synthesis Speech Emotion Recognition Speech-to-text translation Optical Character Recognition Machine Translation Dependency Parser Grapheme to Phoneme Language model Question Answering Plagiarism Treebank Natural Language Inference Natural Language Understanding Image Captioning Spoken Language Understanding","title":"Thai NLP Tasks"},{"location":"tasks/#thai-nlp-tasks","text":"Word Segmentation Sentence Segmentation Syllable Segmentation Part-of-speech tagging Named Entity Recognition Text Classification Text Generation Text Summarization Spell Correct Soundex Speech Recognition Speech Synthesis Speech Emotion Recognition Speech-to-text translation Optical Character Recognition Machine Translation Dependency Parser Grapheme to Phoneme Language model Question Answering Plagiarism Treebank Natural Language Inference Natural Language Understanding Image Captioning Spoken Language Understanding","title":"Thai NLP 
Tasks"},{"location":"tasks/chatbot/","text":"Chatbot Model Name Description Size License Creator Download WangChanGLM WangChanGLM elephant\u200a-\u200aThe Multilingual Instruction-Following Model 7.5B CC BY-SA 4.0 VISTEC-depa AI Research Institute of Thailand & PyThaiNLP GitHub","title":"Chatbot"},{"location":"tasks/chatbot/#chatbot","text":"","title":"Chatbot"},{"location":"tasks/chatbot/#model","text":"Name Description Size License Creator Download WangChanGLM WangChanGLM elephant\u200a-\u200aThe Multilingual Instruction-Following Model 7.5B CC BY-SA 4.0 VISTEC-depa AI Research Institute of Thailand & PyThaiNLP GitHub","title":"Model"},{"location":"tasks/dependency_parser/","text":"Dependency Parser Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket or GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) is a project at Chulalongkorn University, Bangkok, Thailand. The annotation adopts the sense inventory from PDTB 3.0. 180 documents - Chulalongkorn University GitHub Software Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT license esupar Tokenizer, POS-tagger, and dependency-parser with Transformers and SuPar. 
active Python 3.X MIT license TowerParse TowerParse is a Python tool for multilingual dependency parsing, built on top of the HuggingFace Transformers library. Unlike other multilingual dependency parsers (e.g., UDify , UDapter), TowerParse offers a language-dedicated parsing model for each language (actually, for each test UD treebank, i.e., for languages with multiple treebanks, we offer multiple parsing models). ? Python 3.X CC0-1.0 license","title":"Dependency Parser"},{"location":"tasks/dependency_parser/#dependency-parser","text":"","title":"Dependency Parser"},{"location":"tasks/dependency_parser/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket or GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) is a project at Chulalongkorn University, Bangkok, Thailand. The annotation adopts the sense inventory from PDTB 3.0. 180 documents - Chulalongkorn University GitHub","title":"Corpus"},{"location":"tasks/dependency_parser/#software","text":"Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT license esupar Tokenizer, POS-tagger, and dependency-parser with Transformers and SuPar. active Python 3.X MIT license TowerParse TowerParse is a Python tool for multilingual dependency parsing, built on top of the HuggingFace Transformers library. 
Unlike other multilingual dependency parsers (e.g., UDify , UDapter), TowerParse offers a language-dedicated parsing model for each language (actually, for each test UD treebank, i.e., for languages with multiple treebanks, we offer multiple parsing models). ? Python 3.X CC0-1.0 license","title":"Software"},{"location":"tasks/g2p/","text":"Grapheme to Phoneme Corpus Name Description Size License Creator Download Grapheme to Phoneme Thai Grapheme to Phoneme from Wiktionary 14,483 word CC BY-SA 3.0 Wannaphong Phatthiyaphaibun GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) thpronun thpronun is a program for analyzing pronunciation of Thai words. The output can be in Thai pronunciation, Romanization, or in any other phonetic systems. It is designed to be extensible. active C/C++ GPL-3.0 License Thai G2P (grapheme to phoneme) dictionary-based conversion + BiLSTM seq2seq model (under construction) active Python 3.X CharsiuG2P CharsiuG2P is transformer based tool for grapheme-to-phoneme conversion in 100 languages. Given an orthographic word, CharsiuG2P predicts its pronunciation through a neural G2P model. active Python 3.X MIT license Software Name Detail Owner Download CharsiuG2P Multilingual G2P in 100 languages GitHub","title":"Grapheme to Phoneme"},{"location":"tasks/g2p/#grapheme-to-phoneme","text":"","title":"Grapheme to Phoneme"},{"location":"tasks/g2p/#corpus","text":"Name Description Size License Creator Download Grapheme to Phoneme Thai Grapheme to Phoneme from Wiktionary 14,483 word CC BY-SA 3.0 Wannaphong Phatthiyaphaibun GitHub","title":"Corpus"},{"location":"tasks/g2p/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) thpronun thpronun is a program for analyzing pronunciation of Thai words. The output can be in Thai pronunciation, Romanization, or in any other phonetic systems. It is designed to be extensible. active C/C++ GPL-3.0 License Thai G2P (grapheme to phoneme) dictionary-based conversion + BiLSTM seq2seq model (under construction) active Python 3.X CharsiuG2P CharsiuG2P is transformer based tool for grapheme-to-phoneme conversion in 100 languages. Given an orthographic word, CharsiuG2P predicts its pronunciation through a neural G2P model. active Python 3.X MIT license","title":"Software"},{"location":"tasks/g2p/#software_1","text":"Name Detail Owner Download CharsiuG2P Multilingual G2P in 100 languages GitHub","title":"Software"},{"location":"tasks/image-captioning/","text":"Image Captioning Software Name Description Status Language License Image Captioning in Thai: AI \u0e0a\u0e48\u0e27\u0e22\u0e1c\u0e39\u0e49\u0e1e\u0e34\u0e01\u0e32\u0e23\u0e17\u0e32\u0e07\u0e2a\u0e32\u0e22\u0e15\u0e32 Image Captioning in Thai from AI Builder https://www.facebook.com/aibuildersx/posts/175053151329799 Python 3.X ?","title":"Image Captioning"},{"location":"tasks/image-captioning/#image-captioning","text":"","title":"Image Captioning"},{"location":"tasks/image-captioning/#software","text":"Name Description Status Language License Image Captioning in Thai: AI \u0e0a\u0e48\u0e27\u0e22\u0e1c\u0e39\u0e49\u0e1e\u0e34\u0e01\u0e32\u0e23\u0e17\u0e32\u0e07\u0e2a\u0e32\u0e22\u0e15\u0e32 Image Captioning in Thai from AI Builder https://www.facebook.com/aibuildersx/posts/175053151329799 Python 3.X ?","title":"Software"},{"location":"tasks/language-model/","text":"Language model Text Corpus Name Description Size License Creator Download Thai Constitution Corpus The Constitution of Thailand Dataset Since 1932 Public Domain Wannaphong Phatthiyaphaibun GitHub Thai Law Thai Law Dataset (Act of 
Parliament) Public Domain Wannaphong Phatthiyaphaibun GitHub IO-LM Learn how to talk like an Information-Operation-er GitHub HC corpora HC corpora is a collection of corpora for various languages freely available to download. homepage : http://corpora.epizy.com/about.html MediaFire thai-joke-corpus Thai jokes scraped from 4 Thai jokes facebook pages collected by iApp Technology Co, Ltd. 449 Jokes GPL-3.0 License iApp Technology Co, Ltd GitHub Thai Literature Corpora (TLC) texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). a total of 34 documents, 292,270 lines, 31,790,734 characters Jitkapat Sawatphol Website HSE Thai Corpus A 35 Million Word Corpus of Thai Kaggle ThaiGov corpus Data from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub ThaiGov V2 Corpus Thai News Dataset from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub OSCAR Corpus OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 951,743,087 words public domain Homepage mC4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Hugging Face Multilingual Open Text 1.0: Public Domain News in 44 Languages This is a corpus of public domain news in 44 languages. public domain GitHub Thai depression detection dataset and baseline models Detecting Depression in Thai Blog Posts: a Dataset and a Baseline. Zenodo Enocder Preatrained Name Detail Owner Download Thai2Fit ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. 
Created as part of pyThaiNLP with ULMFit implementation from fast.ai Charin Polpanumas GitHub BERT-th BERT pre-training in Thai language ThAIKeras GitHub BERT-Base, Multilingual Cased 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters Google GitHub bert-base-th-cased We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. Geotrend Hugging Face WangchanBERTa Pretraining transformer-based Thai Language Models AI Research Institute of Thailand (AIResearch) GitHub & Hugging Face mLUKE A multilingual extension of LUKE. Hugging Face TwHIN-BERT TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations Twitter GitHub PhayaThaiBERT 278M P. Sriwirote Notebook Load BERT-th, BERT-Base, Multilingual Cased and bert-base-th-cased with Hugging Face in Python LLMs Name Parameters Detail Owner Download OpenThaiGPT 13B Kobkrit GitHub Typhoon 7B SCB10X Hugging Face SeaLLMs 13B DAMO GitHub Sea-Lion 7.5B AI Singapore GitHub WangChanGLM 7.5B VISTEC-PyThaiNLP GitHub","title":"Language model"},{"location":"tasks/language-model/#language-model","text":"","title":"Language model"},{"location":"tasks/language-model/#text-corpus","text":"Name Description Size License Creator Download Thai Constitution Corpus The Constitution of Thailand Dataset Since 1932 Public Domain Wannaphong Phatthiyaphaibun GitHub Thai Law Thai Law Dataset (Act of Parliament) Public Domain Wannaphong Phatthiyaphaibun GitHub IO-LM Learn how to talk like an Information-Operation-er GitHub HC corpora HC corpora is a collection of corpora for various languages freely available to download. homepage : http://corpora.epizy.com/about.html MediaFire thai-joke-corpus Thai jokes scraped from 4 Thai jokes facebook pages collected by iApp Technology Co, Ltd. 449 Jokes GPL-3.0 License iApp Technology Co, Ltd GitHub Thai Literature Corpora (TLC) texts from Vajirayana Digital Library, stored by chapters and stanzas (non-tokenized). 
a total of 34 documents, 292,270 lines, 31,790,734 characters Jitkapat Sawatphol Website HSE Thai Corpus A 35 Million Word Corpus of Thai Kaggle ThaiGov corpus Data from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub ThaiGov V2 Corpus Thai News Dataset from Thai government website. public domain Wannaphong Phatthiyaphaibun GitHub OSCAR Corpus OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. 951,743,087 words public domain Homepage mC4 A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Hugging Face Multilingual Open Text 1.0: Public Domain News in 44 Languages This is a corpus of public domain news in 44 languages. public domain GitHub Thai depression detection dataset and baseline models Detecting Depression in Thai Blog Posts: a Dataset and a Baseline. Zenodo","title":"Text Corpus"},{"location":"tasks/language-model/#enocder-preatrained","text":"Name Detail Owner Download Thai2Fit ULMFit Language Modeling, Text Feature Extraction and Text Classification in Thai Language. Created as part of pyThaiNLP with ULMFit implementation from fast.ai Charin Polpanumas GitHub BERT-th BERT pre-training in Thai language ThAIKeras GitHub BERT-Base, Multilingual Cased 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters Google GitHub bert-base-th-cased We are sharing smaller versions of bert-base-multilingual-cased that handle a custom number of languages. Geotrend Hugging Face WangchanBERTa Pretraining transformer-based Thai Language Models AI Research Institute of Thailand (AIResearch) GitHub & Hugging Face mLUKE A multilingual extension of LUKE. Hugging Face TwHIN-BERT TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations Twitter GitHub PhayaThaiBERT 278M P. 
Sriwirote","title":"Enocder Preatrained"},{"location":"tasks/language-model/#notebook","text":"Load BERT-th, BERT-Base, Multilingual Cased and bert-base-th-cased with Hugging Face in Python","title":"Notebook"},{"location":"tasks/language-model/#llms","text":"Name Parameters Detail Owner Download OpenThaiGPT 13B Kobkrit GitHub Typhoon 7B SCB10X Hugging Face SeaLLMs 13B DAMO GitHub Sea-Lion 7.5B AI Singapore GitHub WangChanGLM 7.5B VISTEC-PyThaiNLP GitHub","title":"LLMs"},{"location":"tasks/machine-translation/","text":"Machine Translation Corpus Name Description Size License Creator Download TALPCo TUFS Asian Language Parallel Corpus 1,327 sent CC BY 4.0 Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura GitHub scb-mt-en-th-2020 English-Thai Machine Translation Dataset with the collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and Digital Economy Promotion Agency (depa), publishes an open English-Thai machine translation dataset, with the sponsorship from Siam Commercial Bank (SCB) 1,001,752 segment pairs CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Documentation Data Set for Machine Translation A parallel evaluation data set of SAP software documentation with document structure annotation dev: 2048 segment pairs, test: 2050 segment pairs CC BY-NC 4.0 SAP GitHub Thai Lao Parallel corpus Thai Lao Parallel corpus CC0-1.0 License Wannaphong Phatthiyaphaibun GitHub Contradictory, My Dear Watson Translated text Non-English text converted to English language Kaggle Asian Language Treebank Parallel Corpus This is the Asian Language Treebank (ALT) Parallel Corpus. 
train: 1,698 articles, 18,088 sentences dev: 98 articles, 1,000 sentences test: 97 articles, 1,018 sentences CC BY 4.0 Website WikiLingua A Multilingual Abstractive Summarization Dataset 14,770 parallel (for thai) CC0-1.0 License Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown GitHub Web Inventory of Transcribed & Translated(WIT) Ted Talks The Web Inventory Talk is a collection of the original Ted talks and their translated version. The translations are available in more than 109+ languages, though the distribution is not uniform. Hugging Face generated_reviews_enth generated_reviews_enth is created as part of scb-mt-en-th-2020 for machine translation task. CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub FLORES-101 FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages. Facebook GitHub thai_usembassy This dataset collect all Thai & English news from U.S. Embassy Bangkok. CC-0 PyThaiNLP HuggingFace Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 Pretrained Name Description Status Language License Lalita Chinese-Thai Machine Translation Chinese-Thai Machine Translation by AI Builder active Python 3.X Apache License 2.0 English-Thai Machine Translation Models English-Thai Machine Translation Models by VISTEC-depa Thailand Artificial Intelligence Research Institute active Python 3.X Apache License 2.0","title":"Machine Translation"},{"location":"tasks/machine-translation/#machine-translation","text":"","title":"Machine Translation"},{"location":"tasks/machine-translation/#corpus","text":"Name Description Size License Creator Download TALPCo TUFS Asian Language Parallel Corpus 1,327 sent CC BY 4.0 Nomoto, Hiroki, Kenji Okano, Sunisa Wittayapanyanon and Junta Nomura GitHub scb-mt-en-th-2020 English-Thai Machine Translation Dataset with the collaboration between Vidyasirimedhi Institute of Science and Technology (VISTEC) and Digital Economy Promotion Agency (depa), publishes an open English-Thai machine translation dataset, with the sponsorship from Siam Commercial Bank (SCB) 1,001,752 segment pairs CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Documentation Data Set for Machine Translation A parallel evaluation data set of SAP software documentation with document structure annotation dev: 2048 segment pairs, test: 2050 segment pairs CC BY-NC 4.0 SAP GitHub Thai Lao Parallel corpus Thai Lao Parallel corpus CC0-1.0 License Wannaphong Phatthiyaphaibun GitHub Contradictory, My Dear Watson Translated text Non-English text converted to English language Kaggle Asian Language Treebank Parallel Corpus This is the Asian Language Treebank (ALT) Parallel Corpus. 
train: 1,698 articles, 18,088 sentences dev: 98 articles, 1,000 sentences test: 97 articles, 1,018 sentences CC BY 4.0 Website WikiLingua A Multilingual Abstractive Summarization Dataset 14,770 parallel (for thai) CC0-1.0 License Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown GitHub Web Inventory of Transcribed & Translated(WIT) Ted Talks The Web Inventory Talk is a collection of the original Ted talks and their translated version. The translations are available in more than 109+ languages, though the distribution is not uniform. Hugging Face generated_reviews_enth generated_reviews_enth is created as part of scb-mt-en-th-2020 for machine translation task. CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub FLORES-101 FLORES-101 is a Many-to-Many multilingual translation benchmark dataset for 101 languages. Facebook GitHub thai_usembassy This dataset collect all Thai & English news from U.S. Embassy Bangkok. CC-0 PyThaiNLP HuggingFace","title":"Corpus"},{"location":"tasks/machine-translation/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/machine-translation/#pretrained","text":"Name Description Status Language License Lalita Chinese-Thai Machine Translation Chinese-Thai Machine Translation by AI Builder active Python 3.X Apache License 2.0 English-Thai Machine Translation Models English-Thai Machine Translation Models by VISTEC-depa Thailand Artificial Intelligence Research Institute active Python 3.X Apache License 2.0","title":"Pretrained"},{"location":"tasks/ner/","text":"Named Entity Recognition Corpus Name Description Size License Creator Download Thai-NNER (Thai Nested Named Entity Recognition Corpus) This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. 
Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from news articles and restaurant reviews, a total of 4894 documents. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. CC-BY-SA 3.0 IST, VISTEC GitHub \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a corpora by Wirote Aroonmanakun's students ? \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a Data \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 corpora by Wirote Aroonmanakun's students ? \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 Data \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 corpora by Wirote Aroonmanakun's students ? \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 Data Thai NER Thai NER project is part of PyThaiNLP. CC BY 3.0 Wannaphong Phatthiyaphaibun GitHub THAI-NEST Thai Named Entity tagging Corpus from NECTEC & Thammasat University CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) WikiANN WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. Rahimi, Afshin and Li, Yuan and Cohn, Trevor GitHub Crime Named Entity Recognition NER project with Thai crime news dataset GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Thai-NNER Thai Nested Named Entity Recognition active Python 3.X MIT License","title":"Named Entity Recognition"},{"location":"tasks/ner/#named-entity-recognition","text":"","title":"Named Entity Recognition"},{"location":"tasks/ner/#corpus","text":"Name Description Size License Creator Download Thai-NNER (Thai Nested Named Entity Recognition Corpus) This work presents the first Thai Nested Named Entity Recognition (N-NER) dataset. Thai N-NER consists of 264,798 mentions, 104 classes, and a maximum depth of 8 layers obtained from news articles and restaurant reviews, a total of 4894 documents. Our work, to the best of our knowledge, presents the largest non-English N-NER dataset and the first non-English one with fine-grained classes. CC-BY-SA 3.0 IST, VISTEC GitHub \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a corpora by Wirote Aroonmanakun's students ? \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a \u0e19\u0e31\u0e0a\u0e0a\u0e32 \u0e16\u0e34\u0e23\u0e30\u0e2a\u0e32\u0e42\u0e23\u0e0a Data \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 corpora by Wirote Aroonmanakun's students ? \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 \u0e28\u0e28\u0e34\u0e27\u0e34\u0e21\u0e25 \u0e01\u0e32\u0e25\u0e31\u0e19\u0e2a\u0e35\u0e21\u0e32 Data \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 corpora by Wirote Aroonmanakun's students ? \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 \u0e13\u0e31\u0e10\u0e14\u0e32\u0e1e\u0e23 \u0e40\u0e25\u0e34\u0e28\u0e0a\u0e35\u0e27\u0e30 Data Thai NER Thai NER project is part of PyThaiNLP. 
CC BY 3.0 Wannaphong Phatthiyaphaibun GitHub THAI-NEST Thai Named Entity tagging Corpus from NECTEC & Thammasat University CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) WikiANN WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. Rahimi, Afshin and Li, Yuan and Cohn, Trevor GitHub Crime Named Entity Recognition NER project with Thai crime news dataset GitHub","title":"Corpus"},{"location":"tasks/ner/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Thai-NNER Thai Nested Named Entity Recognition active Python 3.X MIT License","title":"Software"},{"location":"tasks/nli/","text":"Natural Language Inference Corpus Name Description Size License Creator Download XNLI The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 5,000 test and 2,500 dev pairs CC BY-NC 4.0 Facebook Research GitHub","title":"Natural Language Inference"},{"location":"tasks/nli/#natural-language-inference","text":"","title":"Natural Language Inference"},{"location":"tasks/nli/#corpus","text":"Name Description Size License Creator Download XNLI The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. 
5,000 test and 2,500 dev pairs CC BY-NC 4.0 Facebook Research GitHub","title":"Corpus"},{"location":"tasks/nlu/","text":"Natural Language Understanding Corpus Name Description Size License Creator Download MASSIVE MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. ~1M utterances , 51 languages CC BY 4.0 Amazon GitHub Thai Winograd A collection of Winograd Schemas in the Thai language. These schemas are adapted from the original set of English Winograd Schemas proposed by Levesque et al., which was based on Ernest Davis's collection. A Winograd schema is a pair of sentences that differ by only a word or two. They include ambiguities that are resolved differently in each sentence and require world knowledge and reasoning to understand. This concept is named after Terry Winograd, who provided a well-known example. 285 questions CC BY 4.0 Phakphum Artkaew Hugging Face","title":"Natural Language Understanding"},{"location":"tasks/nlu/#natural-language-understanding","text":"","title":"Natural Language Understanding"},{"location":"tasks/nlu/#corpus","text":"Name Description Size License Creator Download MASSIVE MASSIVE is a parallel dataset of > 1M utterances across 51 languages with annotations for the Natural Language Understanding tasks of intent prediction and slot annotation. Utterances span 60 intents and include 55 slot types. MASSIVE was created by localizing the SLURP dataset, composed of general Intelligent Voice Assistant single-shot interactions. ~1M utterances , 51 languages CC BY 4.0 Amazon GitHub Thai Winograd A collection of Winograd Schemas in the Thai language. 
These schemas are adapted from the original set of English Winograd Schemas proposed by Levesque et al., which was based on Ernest Davis's collection. A Winograd schema is a pair of sentences that differ by only a word or two. They include ambiguities that are resolved differently in each sentence and require world knowledge and reasoning to understand. This concept is named after Terry Winograd, who provided a well-known example. 285 questions CC BY 4.0 Phakphum Artkaew Hugging Face","title":"Corpus"},{"location":"tasks/ocr/","text":"Optical Character Recognition Corpus Name Description Size License Creator Download KVIS Thai OCR Dataset Offline Thai Handwritten Character Dataset CC BY 4.0 John Joseph, Ferdin Joe Website Thai OCR Thai ocr dataset from NECTEC Training set: 81,100 image CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Thai handwriting number dataset Create Thai handwriting number dataset MIT @kittinan GitHub Software Name Description Status Language License Tesseract OCR Tesseract Open Source OCR Engine active C/C++ Apache License 2.0 Easy OCR Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai. active Python 3.X Apache License 2.0 Thai National Document Optical Character Recognition (THND OCR) Tesseract OCR tools for read Thai National Document used TH Sarabun National Font trained and fine-tuned. Read README.md to see about my process. 
active Python 3.X","title":"Optical Character Recognition"},{"location":"tasks/ocr/#optical-character-recognition","text":"","title":"Optical Character Recognition"},{"location":"tasks/ocr/#corpus","text":"Name Description Size License Creator Download KVIS Thai OCR Dataset Offline Thai Handwritten Character Dataset CC BY 4.0 John Joseph, Ferdin Joe Website Thai OCR Thai ocr dataset from NECTEC Training set: 81,100 image CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) Thai handwriting number dataset Create Thai handwriting number dataset MIT @kittinan GitHub","title":"Corpus"},{"location":"tasks/ocr/#software","text":"Name Description Status Language License Tesseract OCR Tesseract Open Source OCR Engine active C/C++ Apache License 2.0 Easy OCR Ready-to-use OCR with 40+ languages supported including Chinese, Japanese, Korean and Thai. active Python 3.X Apache License 2.0 Thai National Document Optical Character Recognition (THND OCR) Tesseract OCR tools for read Thai National Document used TH Sarabun National Font trained and fine-tuned. Read README.md to see about my process. active Python 3.X","title":"Software"},{"location":"tasks/parser/","text":"Parser Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) at Chulalongkorn University annotates 180 documents from the LST20 corpus with 10,868 discourse relations. 6,534 sentences Prasertsom, P., Jaroonpol, A., & Rutherford, A. T. Github TUD Treebank Thai Universal Dependency Treebank, annotating TNC 3,627 sentences nlp-chula Github Software Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. 
active Python 3.X MIT License Link Grammar Parser A syntactic parser based on link grammar active Python 3.X LGPL","title":"Parser"},{"location":"tasks/parser/#parser","text":"","title":"Parser"},{"location":"tasks/parser/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Discourse Treebank The Thai Discourse Treebank (TDTB) at Chulalongkorn University annotates 180 documents from the LST20 corpus with 10,868 discourse relations. 6,534 sentences Prasertsom, P., Jaroonpol, A., & Rutherford, A. T. Github TUD Treebank Thai Universal Dependency Treebank, annotating TNC 3,627 sentences nlp-chula Github","title":"Corpus"},{"location":"tasks/parser/#software","text":"Name Description Status Language License spaCy-Thai Tokenizer, POS-tagger, and dependency-parser for Thai language, working on Universal Dependencies. active Python 3.X MIT License Link Grammar Parser A syntactic parser based on link grammar active Python 3.X LGPL","title":"Software"},{"location":"tasks/part-of-speech/","text":"Part-of-speech tagging Corpus Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 
1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub thai-political-tweets A small Thai political twitter dataset with UD POS tags 41 tweets, 965 words Unlicense License Can Udomcharoenchaikit GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Part-of-speech tagging"},{"location":"tasks/part-of-speech/#part-of-speech-tagging","text":"","title":"Part-of-speech tagging"},{"location":"tasks/part-of-speech/#corpus","text":"Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub thai-political-tweets A small Thai political twitter dataset with UD POS tags 41 tweets, 965 words Unlicense License Can Udomcharoenchaikit GitHub","title":"Corpus"},{"location":"tasks/part-of-speech/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Software"},{"location":"tasks/plagiarism/","text":"Plagiarism Corpus Name Description Size License Creator Download Thai Plagiarism Thai Plagiarism Detection http://copycatch.in.th/thai-plagiarism-task.html CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)","title":"Plagiarism"},{"location":"tasks/plagiarism/#plagiarism","text":"","title":"Plagiarism"},{"location":"tasks/plagiarism/#corpus","text":"Name Description Size License Creator Download Thai Plagiarism Thai Plagiarism Detection http://copycatch.in.th/thai-plagiarism-task.html CC BY-SA-NC 3.0 NECTEC aiforthai (registration required)","title":"Corpus"},{"location":"tasks/question-answering/","text":"Question Answering Corpus Name Description Size License Creator Download XQuAD XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. 240 paragraphs and 1,190 question-answer pairs CC BY-SA 4.0 DeepMind GitHub Thai QA Question answering program from Thai Wikipedia. 4,000 question-answer pairs CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai (registration required), wiki: copycatch , Sample data set: copycatch TyDi QA A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages 200k human-annotated question-answer pairs Apache-2.0 License Google Research GitHub iapp-wiki-qa-dataset Open Thai Wikipedia QA Dataset made by iApp Technology 1,961 Documents 9,170 Questions MIT License iApp Technology GitHub MKQA MKQA: Multilingual Knowledge Questions & Answers. MKQA contains 10,000 queries sampled from the Google Natural Questions dataset. 
10,000 queries Apple GitHub Thai WIKI QA Dataset from National Software Contest (NSC) 2018 - 2019 Factoid 15,000 question-answer pairs, boolean 2,000 question CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai Software Name Detail Owner Download Zero-shot multilingual QA from DeepPavlov DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. Neural Networks and Deep Learning lab, MIPT GitHub Colab","title":"Question Answering"},{"location":"tasks/question-answering/#question-answering","text":"","title":"Question Answering"},{"location":"tasks/question-answering/#corpus","text":"Name Description Size License Creator Download XQuAD XQuAD (Cross-lingual Question Answering Dataset) is a benchmark dataset for evaluating cross-lingual question answering performance. 240 paragraphs and 1,190 question-answer pairs CC BY-SA 4.0 DeepMind GitHub Thai QA Question answering program from Thai Wikipedia. 4,000 question-answer pairs CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai (registration required), wiki: copycatch , Sample data set: copycatch TyDi QA A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages 200k human-annotated question-answer pairs Apache-2.0 License Google Research GitHub iapp-wiki-qa-dataset Open Thai Wikipedia QA Dataset made by iApp Technology 1,961 Documents 9,170 Questions MIT License iApp Technology GitHub MKQA MKQA: Multilingual Knowledge Questions & Answers. MKQA contains 10,000 queries sampled from the Google Natural Questions dataset. 10,000 queries Apple GitHub Thai WIKI QA Dataset from National Software Contest (NSC) 2018 - 2019 Factoid 15,000 question-answer pairs, boolean 2,000 question CC BY-SA-NC 3.0 NECTEC Dataset: aiforthai","title":"Corpus"},{"location":"tasks/question-answering/#software","text":"Name Detail Owner Download Zero-shot multilingual QA from DeepPavlov DeepPavlov is an open-source conversational AI library built on TensorFlow, Keras and PyTorch. 
Neural Networks and Deep Learning lab, MIPT GitHub Colab","title":"Software"},{"location":"tasks/sentence-segmentation/","text":"Sentence Segmentation Corpus Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Fake review CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) BoydCut Bidirectional LSTM-CNN Model for Thai Sentence Segmenter active Python 3.X MIT License ThaiSum Simple Thai Sentence Segmentor active Python 3.X Apache Licence 2.0","title":"Sentence Segmentation"},{"location":"tasks/sentence-segmentation/#sentence-segmentation","text":"","title":"Sentence Segmentation"},{"location":"tasks/sentence-segmentation/#corpus","text":"Name Description Size License Creator Download Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 4.0 NECTEC Mirror from @wannaphong Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 
122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Fake review CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) GitHub","title":"Corpus"},{"location":"tasks/sentence-segmentation/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) BoydCut Bidirectional LSTM-CNN Model for Thai Sentence Segmenter active Python 3.X MIT License ThaiSum Simple Thai Sentence Segmentor active Python 3.X Apache Licence 2.0","title":"Software"},{"location":"tasks/ser/","text":"Speech Emotion Recognition Corpus Name Description Size License Creator Download Thai Speech Emotion Dataset Thai Speech Emotion Recognition Dataset 36 hours CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) AIResearch Software Name Description Status Language License Vistec-AIS Speech Emotion Recognition Speech Emotion Recognition Model and Inferencing using Pytorch active Python 3.X Apache License 2.0","title":"Speech Emotion Recognition"},{"location":"tasks/ser/#speech-emotion-recognition","text":"","title":"Speech Emotion Recognition"},{"location":"tasks/ser/#corpus","text":"Name Description Size License Creator Download Thai Speech Emotion Dataset Thai Speech Emotion Recognition Dataset 36 hours CC BY-SA 4.0 AI Research Institute of Thailand (AIResearch) AIResearch","title":"Corpus"},{"location":"tasks/ser/#software","text":"Name Description Status Language License Vistec-AIS Speech Emotion Recognition Speech Emotion Recognition Model and Inferencing using Pytorch active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/slu/","text":"Spoken Language Understanding Corpus Name Description Size License Creator Download Facebook Multilingual SLU Dataset Dataset from Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog 5k annotated utterance in Thai Facebook Facebook MTOP: 
Multilingual TOP MTOP comprisingof 100k annotated utterances in 6 languages across 11 domains. dataset from MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark Facebook Facebook","title":"Spoken Language Understanding"},{"location":"tasks/slu/#spoken-language-understanding","text":"","title":"Spoken Language Understanding"},{"location":"tasks/slu/#corpus","text":"Name Description Size License Creator Download Facebook Multilingual SLU Dataset Dataset from Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog 5k annotated utterance in Thai Facebook Facebook MTOP: Multilingual TOP MTOP comprisingof 100k annotated utterances in 6 languages across 11 domains. dataset from MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark Facebook Facebook","title":"Corpus"},{"location":"tasks/soundex/","text":"Soundex Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 thpronun Use -s option to output soundex active C/C++ GPL-3.0 License","title":"Soundex"},{"location":"tasks/soundex/#soundex","text":"","title":"Soundex"},{"location":"tasks/soundex/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 thpronun Use -s option to output soundex active C/C++ GPL-3.0 License","title":"Software"},{"location":"tasks/speech-recognition/","text":"Speech Recognition Automatic Speech Recognition Corpus Name Description Size License Creator Download Lotus Thai Speech Recognition corpus from NECTEC (not full corpus) 12 hours CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot: GitHub Common Voice Corpus Common Voice Corpus from mozilla 171 hours (valid) CC0-1.0 License mozilla Common Voice Gowajee corpus The corpus was collected in the Automatic Speech Recognition class offered at Chulalongkorn University as a homework assignment. 
11 hours MIT License Ekapol Chuangsuwanich, Atiwong Suchato, Korrawe Karunratanakul, Burin Naowarat, Chompakorn CChaichot and Penpicha Sangsa-nga GitHub Lotus BN Thai News Speech Recognition corpus from NECTEC (not full corpus) 28 minute CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Lotus Cell Thai Speech corpus over the phone. (not full corpus) 11 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Thai Elderly Speech dataset by Data Wow and VISAI Thai Elderly Speech dataset, consisting of 17 hours 11 minutes (19,200 files). The files are divided into 2 categories: Health care (health issues and services) and Smart Home (using Smart Home devices in household contexts). 17 hours 11 minutes CC BY-SA 4.0 VISAI AI Company Limited and Data Wow Company Limited VISAI AI Company Limited and Data Wow Company Limited FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY Google huggingface Thai Dialect Corpus Corpus of Central Thai dialect and three other Thai dialects (Khummuang, Korat, and Pattani). CC BY-SA 4.0 Chulalongkorn University Github Software Name Description Status Language License PyThaiASR PyThaiASR is a Python package for Automatic Speech Recognition with focus on Thai language. It have offline thai automatic speech recognition model from Artificial Intelligence Research Institute of Thailand (AIResearch.in.th). 
active Python 3.X Apache License 2.0 Preatrained Name Detail Owner Download wav2vec2-large-xlsr-53-th` Finetuning wav2vec2-large-xlsr-53 on Thai Common Voice 7.0 Artificial Intelligence Research Institute of Thailand (AIResearch.in.th) Hugging Face Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Thai Wav2Vec2 with CommonVoice V8 (deepcut tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Whisper Whisper is a general-purpose speech recognition model. (include S2T X->English) OpenAI GitHub Language Identification Corpus Name Description Size License Creator Download VoxLingua107 VoxLingua107 is a speech dataset for training spoken language identification models and contains data for 107 languages. (including Thai!!!) 61 hours, 5.8G (Thai) CC-BY 4.0 License J\u00f6rgen Valk, Tanel Alum\u00e4e. 
Website","title":"Speech Recognition"},{"location":"tasks/speech-recognition/#speech-recognition","text":"","title":"Speech Recognition"},{"location":"tasks/speech-recognition/#automatic-speech-recognition","text":"","title":"Automatic Speech Recognition"},{"location":"tasks/speech-recognition/#corpus","text":"Name Description Size License Creator Download Lotus Thai Speech Recognition corpus from NECTEC (not full corpus) 12 hours CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot: GitHub Common Voice Corpus Common Voice Corpus from mozilla 171 hours (valid) CC0-1.0 License mozilla Common Voice Gowajee corpus The corpus was collected in the Automatic Speech Recognition class offered at Chulalongkorn University as a homework assignment. 11 hours MIT License Ekapol Chuangsuwanich, Atiwong Suchato, Korrawe Karunratanakul, Burin Naowarat, Chompakorn CChaichot and Penpicha Sangsa-nga GitHub Lotus BN Thai News Speech Recognition corpus from NECTEC (not full corpus) 28 minute CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Lotus Cell Thai Speech corpus over the phone. (not full corpus) 11 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub Thai Elderly Speech dataset by Data Wow and VISAI Thai Elderly Speech dataset, consisting of 17 hours 11 minutes (19,200 files). The files are divided into 2 categories: Health care (health issues and services) and Smart Home (using Smart Home devices in household contexts). 17 hours 11 minutes CC BY-SA 4.0 VISAI AI Company Limited and Data Wow Company Limited VISAI AI Company Limited and Data Wow Company Limited FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. 
CC BY Google huggingface XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY Google huggingface Thai Dialect Corpus Corpus of Central Thai dialect and three other Thai dialects (Khummuang, Korat, and Pattani). CC BY-SA 4.0 Chulalongkorn University Github","title":"Corpus"},{"location":"tasks/speech-recognition/#software","text":"Name Description Status Language License PyThaiASR PyThaiASR is a Python package for Automatic Speech Recognition with focus on Thai language. It have offline thai automatic speech recognition model from Artificial Intelligence Research Institute of Thailand (AIResearch.in.th). active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/speech-recognition/#preatrained","text":"Name Detail Owner Download wav2vec2-large-xlsr-53-th` Finetuning wav2vec2-large-xlsr-53 on Thai Common Voice 7.0 Artificial Intelligence Research Institute of Thailand (AIResearch.in.th) Hugging Face Thai Wav2Vec2 with CommonVoice V8 (newmm tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Thai Wav2Vec2 with CommonVoice V8 (deepcut tokenizer) + language model This model trained with CommonVoice V8 dataset by increase data from CommonVoice V7 dataset that It was use in airesearch/wav2vec2-large-xlsr-53-th. It was finetune wav2vec2-large-xlsr-53. Wannaphong Phatthiyaphaibun Hugging Face Whisper Whisper is a general-purpose speech recognition model. 
(include S2T X->English) OpenAI GitHub","title":"Preatrained"},{"location":"tasks/speech-recognition/#language-identification","text":"","title":"Language Identification"},{"location":"tasks/speech-recognition/#corpus_1","text":"Name Description Size License Creator Download VoxLingua107 VoxLingua107 is a speech dataset for training spoken language identification models and contains data for 107 languages. (including Thai!!!) 61 hours, 5.8G (Thai) CC-BY 4.0 License J\u00f6rgen Valk, Tanel Alum\u00e4e. Website","title":"Corpus"},{"location":"tasks/speech-synthesis/","text":"Speech Synthesis Corpus Name Description Size License Creator Download TSync-1 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 6 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub TSync-2 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 5hr 25m CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot edited_common_voice This dataset is a Thai TTS dataset that use the voice from Common Voice dataset and modify the voice to not to sound like the original. - MIT taetiya taechamatavorn HuggingFace Software Name Description Status Language License Thai TTS Tacotron Thai_TTS is the project about training \"Text to Speech in Thai\" using Tacotron2 by NVIDIA. 
active Python 3.X Apache License 2.0 PyThaiTTS Open Source Thai Text-to-speech library in Python active Python 3.X Apache License 2.0","title":"Speech Synthesis"},{"location":"tasks/speech-synthesis/#speech-synthesis","text":"","title":"Speech Synthesis"},{"location":"tasks/speech-synthesis/#corpus","text":"Name Description Size License Creator Download TSync-1 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 6 hours CC BY-SA-NC 3.0 NECTEC Mirror from @korakot: GitHub TSync-2 Corpus Thai speech synthesis corpus from NECTEC (not full corpus) 5hr 25m CC BY-SA-NC 3.0 NECTEC aiforthai (registration required) and Mirror from @korakot edited_common_voice This dataset is a Thai TTS dataset that use the voice from Common Voice dataset and modify the voice to not to sound like the original. - MIT taetiya taechamatavorn HuggingFace","title":"Corpus"},{"location":"tasks/speech-synthesis/#software","text":"Name Description Status Language License Thai TTS Tacotron Thai_TTS is the project about training \"Text to Speech in Thai\" using Tacotron2 by NVIDIA. active Python 3.X Apache License 2.0 PyThaiTTS Open Source Thai Text-to-speech library in Python active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/speech2text-translation/","text":"Speech-to-text translation Speech-to-text translation or S2T are translate speech to text with different language. Corpus Name Description Size License Creator Download FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface Model Name Detail Owner Download Whisper Whisper is a general-purpose speech recognition model. 
(include S2T X->English) OpenAI GitHub","title":"Speech-to-text translation"},{"location":"tasks/speech2text-translation/#speech-to-text-translation","text":"Speech-to-text translation or S2T are translate speech to text with different language.","title":"Speech-to-text translation"},{"location":"tasks/speech2text-translation/#corpus","text":"Name Description Size License Creator Download FLEURS Fleurs is the speech version of the FLoRes machine translation benchmark. We use 2009 n-way parallel sentences from the FLoRes dev and devtest publicly available sets, in 102 languages. CC BY Google huggingface","title":"Corpus"},{"location":"tasks/speech2text-translation/#model","text":"Name Detail Owner Download Whisper Whisper is a general-purpose speech recognition model. (include S2T X->English) OpenAI GitHub","title":"Model"},{"location":"tasks/speech_representations/","text":"Speech Representations Benchmark Name Description Size License Creator Download XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. CC BY 4.0 Google Hugging Face","title":"Speech Representations"},{"location":"tasks/speech_representations/#speech-representations","text":"","title":"Speech Representations"},{"location":"tasks/speech_representations/#benchmark","text":"Name Description Size License Creator Download XTREME-S The Cross-lingual TRansfer Evaluation of Multilingual Encoders for Speech (XTREME-S) benchmark is a benchmark designed to evaluate speech representations across languages, tasks, domains and data regimes. It covers 102 languages from 10+ language families, 3 different domains and 4 task families: speech recognition, translation, classification and retrieval. 
CC BY 4.0 Google Hugging Face","title":"Benchmark"},{"location":"tasks/spell-correct/","text":"Spell Correct Corpus Name Description Size License Creator Download VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub Software Name Description Status Language License Hunspell Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome. active C/C++ GNU Lesser General Public License and Mozilla Public License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 khanaa Khanaa is a tool to make spelling Thai more convenient. active Python 3.X MIT license","title":"Spell Correct"},{"location":"tasks/spell-correct/#spell-correct","text":"","title":"Spell Correct"},{"location":"tasks/spell-correct/#corpus","text":"Name Description Size License Creator Download VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub","title":"Corpus"},{"location":"tasks/spell-correct/#software","text":"Name Description Status Language License Hunspell Hunspell is the spell checker of LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome. active C/C++ GNU Lesser General Public License and Mozilla Public License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 khanaa Khanaa is a tool to make spelling Thai more convenient. 
active Python 3.X MIT license","title":"Software"},{"location":"tasks/syllable-segmentation/","text":"Syllable Segmentation Software Name Description Status Language License ssg CRF syllable segmenter for Thai active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Syllable Segmentation"},{"location":"tasks/syllable-segmentation/#syllable-segmentation","text":"","title":"Syllable Segmentation"},{"location":"tasks/syllable-segmentation/#software","text":"Name Description Status Language License ssg CRF syllable segmenter for Thai active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause)","title":"Software"},{"location":"tasks/text-classification/","text":"Text Classification Corpus Name Description Size Labels License Creator Download prachathai-67k News Article Corpus from Prachathai.com 67,889 articles with 51,797 tags 12 CC BY 4.0 @lukkiddd and @cstorm125 GitHub wisesight sentiment Social media messages in Thai language with sentiment label (positive, neutral, negative, question). 26,737 messages 4 CC0-1.0 License Arthit Suriyawongkul, Ekapol Chuangsuwanich GitHub wongnai corpus This project is a collection of Wongnai's datasets which are mostly in Thai language. 500K words labeled 5 LGPL-3.0 License wongnai GitHub Toxicity in Thai Tweet Corpus Toxicity in Thai Tweet Corpus 3,300 messages 2 CC BY-NC 4.0 Tokyo Metropolitan University Natural Language Processing Group GitHub Thai-Clickbait The dataset for Thai Clickbait classification train: 37,376 messages, test: 9,344 messages 1 MIT License @9meo at GitHub GitHub sentiment_analysis_thai Thai sentiment analysis from @JagerV3 2 ? @JagerV3 at GitHub GitHub thai-emojification Emojification of Thai Text, Using Deep Learning (LSTM). 
train: 128 messages, test: 55 messages 5 (\u2764\ufe0f\ud83d\ude04\ud83d\ude1e\ud83c\udf74\u26be) GPL-3.0 License iApp Technology Co, Ltd GitHub The 40 Thai Children Stories The dataset was collected from 40 Thai children stories. We manually split the text into sentences which leads to 1,964 sentences 1,964 sentences 3 ? Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya GitHub Thai sentiment analysis dataset Thai sentiment analysis dataset from PyThaiNLP 2 CC BY 3.0 PyThaiNLP GitHub LimeSoda: Dataset for Fake News Detection in Healthcare Domain Thai fake news dataset in the healthcare domain consisting of curated and manually annotated 7,191 documents annotated 7,191 documents 3 (fact, fake, or undefined) CC-BY-4.0 License Payoungkhamdee, Patomporn and Porkaew, Peerachet and Sinthunyathum, Atthasith and Songphum, Phattharaphon and Kawidam, Witsarut and Loha-Udom, Wichayut and Boonkwan, Prachya and Sutantayawalee, Vipas GitHub krathu-500 A dataset of post-comment on Pantip, a popular Thai web board. 3 (Positive, Negative, and Neutral) GitHub thai_cyberbullying_lgbt LGBT Cyberbullying Detection in Thai Language Utilizing Transformers-Based Algorithms GitHub Software Name Description Status Language License thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment active Python 3.X Apache License 2.0","title":"Text Classification"},{"location":"tasks/text-classification/#text-classification","text":"","title":"Text Classification"},{"location":"tasks/text-classification/#corpus","text":"Name Description Size Labels License Creator Download prachathai-67k News Article Corpus from Prachathai.com 67,889 articles with 51,797 tags 12 CC BY 4.0 @lukkiddd and @cstorm125 GitHub wisesight sentiment Social media messages in Thai language with sentiment label (positive, neutral, negative, question). 
26,737 messages 4 CC0-1.0 License Arthit Suriyawongkul, Ekapol Chuangsuwanich GitHub wongnai corpus This project is a collection of Wongnai's datasets which are mostly in Thai language. 500K words labeled 5 LGPL-3.0 License wongnai GitHub Toxicity in Thai Tweet Corpus Toxicity in Thai Tweet Corpus 3,300 messages 2 CC BY-NC 4.0 Tokyo Metropolitan University Natural Language Processing Group GitHub Thai-Clickbait The dataset for Thai Clickbait classification train: 37,376 messages, test: 9,344 messages 1 MIT License @9meo at GitHub GitHub sentiment_analysis_thai Thai sentiment analysis from @JagerV3 2 ? @JagerV3 at GitHub GitHub thai-emojification Emojification of Thai Text, Using Deep Learning (LSTM). train: 128 messages, test: 55 messages 5 (\u2764\ufe0f\ud83d\ude04\ud83d\ude1e\ud83c\udf74\u26be) GPL-3.0 License iApp Technology Co, Ltd GitHub The 40 Thai Children Stories The dataset was collected from 40 Thai children stories. We manually split the text into sentences which leads to 1,964 sentences 1,964 sentences 3 ? Kitsuchart Pasupa, Thititorn Seneewong Na Ayutthaya GitHub Thai sentiment analysis dataset Thai sentiment analysis dataset from PyThaiNLP 2 CC BY 3.0 PyThaiNLP GitHub LimeSoda: Dataset for Fake News Detection in Healthcare Domain Thai fake news dataset in the healthcare domain consisting of curate and manually annotated 7,191 documents annotated 7,191 documents 3 (fact, fake, or undefined) CC-BY-4.0 License Payoungkhamdee, Patomporn and Porkaew, Peerachet and Sinthunyathum, Atthasith and Songphum, Phattharaphon and Kawidam, Witsarut and Loha-Udom, Wichayut and Boonkwan, Prachya and Sutantayawalee, Vipas GitHub krathu-500 A dataset of post-comment on Pantip, a popular Thai web board. 
3 (Positive, Negative, and Neutral) GitHub thai_cyberbullying_lgbt LGBT Cyberbullying Detection in Thai Language Utilizing Transformers-Based Algorithms GitHub","title":"Corpus"},{"location":"tasks/text-classification/#software","text":"Name Description Status Language License thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/text-generation/","text":"Text Generation Software Name Description Status Language License TTG Thai Text Generator active Python 3.X Apache License 2.0 Pretrained Name Detail Owner Download Flax's GPT-2 base GPT-2 Base Thai is a causal language model based on the OpenAI GPT-2 model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_th subset. The model was trained from scratch and achieved an evaluation loss of 1.708 and an evaluation perplexity of 5.516. Flax Community Hugging Face GPT-Neo GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model. (It was not trained for Thai, but it can work with Thai.) EleutherAI Hugging Face Thai GPT Next A GPT-Neo model fine-tuned for the Thai language. Wannaphong Phatthiyaphaibun GitHub","title":"Text Generation"},{"location":"tasks/text-generation/#text-generation","text":"","title":"Text Generation"},{"location":"tasks/text-generation/#software","text":"Name Description Status Language License TTG Thai Text Generator active Python 3.X Apache License 2.0","title":"Software"},{"location":"tasks/text-generation/#pretrained","text":"Name Detail Owner Download Flax's GPT-2 base GPT-2 Base Thai is a causal language model based on the OpenAI GPT-2 model. It was trained on the OSCAR dataset, specifically the unshuffled_deduplicated_th subset. 
The model was trained from scratch and achieved an evaluation loss of 1.708 and an evaluation perplexity of 5.516. Flax Community Hugging Face GPT-Neo GPT-Neo 1.3B is a transformer model designed using EleutherAI's replication of the GPT-3 architecture. GPT-Neo refers to the class of models, while 1.3B represents the number of parameters of this particular pre-trained model. (It was not trained for Thai, but it can work with Thai.) EleutherAI Hugging Face Thai GPT Next A GPT-Neo model fine-tuned for the Thai language. Wannaphong Phatthiyaphaibun GitHub","title":"Pretrained"},{"location":"tasks/text-summarization/","text":"Text Summarization Corpus Name Description Size License Creator Download ThaiSum The largest dataset for Thai text summarization. 350,000 articles (2.9 GB) MIT License Nakhun Chumpolsathien GitHub TR-TPBS A dataset for Thai text summarization. 310K articles MIT License Nakhun Chumpolsathien GitHub XL-Sum This dataset contains annotated article-summary pairs from BBC News and covers 45 languages ranging from low to high-resource. 8,268 (for Thai) CC BY-NC-SA 4.0 GitHub ThaiCrossSum Corpora Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization th-en 310,926 articles and th-zh 310,926 articles Nakhun Chumpolsathien GitHub Pretrained Model Detail Paper Download mT5: Multilingual T5 Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. 
mT5: A massively multilingual pre-trained text-to-text transformer GitHub BertSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub ARedSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub TNCLS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MT Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub XLS \u2013 RL-ROUGE Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub mt5-cpe-kmutt-thai-sentence-sum This repository contains the finetuned mT5-base model for Thai sentence summarization. huggingface","title":"Text Summarization"},{"location":"tasks/text-summarization/#text-summarization","text":"","title":"Text Summarization"},{"location":"tasks/text-summarization/#corpus","text":"Name Description Size License Creator Download ThaiSum The largest dataset for Thai text summarization. 350,000 articles (2.9 GB) MIT Licence Nakhun Chumpolsathien GitHub TR-TPBS A dataset for Thai text summarization. 310K articles MIT License Nakhun Chumpolsathien GitHub XL-Sum This dataset annotated article-summary pairs from BBC News and covers 45 languages ranging from low to high-resource. 
8,268 (for thai) CC BY-NC-SA 4.0 GitHub ThaiCrossSum Corpora Th2En & Th2Zh: The large-scale datasets for Thai text cross-lingual summarization th-en 310,926 articles and th-zh 310,926 articles Nakhun Chumpolsathien GitHub","title":"Corpus"},{"location":"tasks/text-summarization/#pretrained","text":"Model Detail Paper Download mT5: Multilingual T5 Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model, trained following a similar recipe as T5. mT5: A massively multilingual pre-trained text-to-text transformer GitHub BertSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub ARedSum Trained Model by Nakhun Chumpolsathien & Tanachat Arayachutinan Using Knowledge Distillation from Keyword Extraction to Improve the Informativeness of Neural Cross-lingual Summarization GitHub TNCLS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MS Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub CLS+MT Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub XLS \u2013 RL-ROUGE Trained Model from ThaiCrossSum Corpora by Nakhun Chumpolsathien GitHub mt5-cpe-kmutt-thai-sentence-sum This repository contains the finetuned mT5-base model for Thai sentence summarization. huggingface","title":"Pretrained"},{"location":"tasks/transliterate/","text":"Transliterate Corpus Name Description Size License Creator Download Thai2Rom Thai Romanization Dataset CC BY-SA 3.0 Wannaphong Phatthiyaphaibun kaggle Thai-English transliteration dictionary This project is Thai-English transliteration dictionary. It is store words for Thai-English transliteration pairs. Thai words are English words from English to Thai by transliteration in Thai. CC-BY 4.0 Wannaphong Phatthiyaphaibun GitHub Software Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Wunsen Wunsen transliterates/transcribes from other languages into Thai. active Python 3.X MIT License","title":"Transliterate"},{"location":"tasks/transliterate/#transliterate","text":"","title":"Transliterate"},{"location":"tasks/transliterate/#corpus","text":"Name Description Size License Creator Download Thai2Rom Thai Romanization Dataset CC BY-SA 3.0 Wannaphong Phatthiyaphaibun kaggle Thai-English transliteration dictionary This project is Thai-English transliteration dictionary. It is store words for Thai-English transliteration pairs. Thai words are English words from English to Thai by transliteration in Thai. CC-BY 4.0 Wannaphong Phatthiyaphaibun GitHub","title":"Corpus"},{"location":"tasks/transliterate/#software","text":"Name Description Status Language License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) Wunsen Wunsen transliterates/transcribes from other languages into Thai. active Python 3.X MIT License","title":"Software"},{"location":"tasks/treebank/","text":"Treebank Corpus Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Treebanks Dataset (thtb) To enable research oppotunities with very few Thai Computational Linguitic resources, we willingly introduce fundamental high-level language resouces built with passion, Thai Treebanks, build from scratch for researchers and enthusiasts. 5,200 sentences CC BY 4.0 Pechlada Seenual, Thodsaporn Chay-intr and Thanaruk Theeramunkong GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. 
It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket","title":"Treebank"},{"location":"tasks/treebank/#treebank","text":"","title":"Treebank"},{"location":"tasks/treebank/#corpus","text":"Name Description Size License Creator Download UD Thai PUD This is a part of the Parallel Universal Dependencies (PUD) treebanks created for the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. 1,000 sentences CC BY-SA 3.0 Universal Dependencies GitHub Thai Treebanks Dataset (thtb) To enable research opportunities with very few Thai Computational Linguistic resources, we willingly introduce fundamental high-level language resources built with passion, Thai Treebanks, built from scratch for researchers and enthusiasts. 5,200 sentences CC BY 4.0 Pechlada Seenual, Thodsaporn Chay-intr and Thanaruk Theeramunkong GitHub Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket","title":"Corpus"},{"location":"tasks/word-segmentation/","text":"Word Segmentation For the Thai language, word segmentation is the first step in processing text: it splits Thai text into words. Corpus Name Description Size License Creator Download BEST I (BEST 2009) Benchmark for Enhancing the Standard of Thai language processing 5,000,000 words CC BY-SA-NC 4.0 NECTEC aiforthai (registration required) and Mirror from @korakot Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. 
It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Wisesight Samples with Word Tokenization Label This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. 160 sentences (wisesight-160) and 1,000 sentences (wiseight-1000) CC0-1.0 License Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford GitHub Thai National Historical Corpus (TNHC) texts from Thai National Historical Corpus, stored by lines (manually tokenized). 47 documents, 756,478 lines, 13,361,142 characters Jitkapat Sawatphol GitHub Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 3.0 NECTEC Mirror from @wannaphong Corpus Komped Poem (windy part) Pattarawat Chormai GitHub VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub BEST I BEST I is the Benchmark for Enhancing the Standard of Thai language processing. Number of words: 5,000,000 words Details Creator: NECTEC License: CC BY-SA-NC 4.0 Paper: Download: aiforthai (registration required) Benchmarks We are not benchmarks for this corpus because we have not an answer of testset. Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 
122,851 clauses (38,558 sentences) Details Creator: NECTEC License: CC-BY 3.0 Download: bitbucket Benchmarks [WIP] Orchid Corpus Orchid Corpus is Thai part of speech (POS) tagged corpus with word segmentation corpus. Number of words: words Details Creator: NECTEC License: CC BY-SA-NC 3.0 Paper: Thai Part-of-speech Tagged Corpus: ORCHID Download: Mirror from @wannaphong Benchmarks Orchid Corpus is not have the testset. Wisesight Corpus This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. wisesight-160 has 160 sentences. Number of words: 3,833 words wiseight-1000 has 1,000 sentences. Number of words: 21,745 words Benchmarks [WIP] Thai National Historical Corpus Thai National Historical Corpus or TNHC tokenized by humans. Number of words: ? words 47 documents, 756,478 lines, 13,361,142 characters Details Creator: Jitkapat Sawatphol Download: GitHub Corpus Komped Poem (windy part) Number of words: 317 words Details Creator: Pattarawat Chormai License: CC-BY-SA 3.0 Paper: - Download: GitHub Benchmarks [WIP] VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. Number of words: 3.39M words Details Creator: VISTEC & Chiang Mai University License: CC-BY-SA 3.0 Paper: - Download: GitHub Software Name Description Status Language License ICU ICU - International Components for Unicode active C/C++/Java Unicode License libthai is a set of Thai language support routines aimed to ease developers' tasks to incorporate Thai language support in their applications. active C/C++ LGPL-2.1 License SWATH Smart Word Analysis for THai active C/C++ GPL-2.0 License AttaCut Fast and Reasonably Accurate Word Tokenizer for Thai. active Python 3.X MIT License PyThaiNLP It's part of PyThaiNLP. 
active Python 3.X Apache License 2.0 PyWordCut wordcutpy is a simple Thai word breaker written in Python 3+ active Python 3.X LGPLv3 DeepCut A Thai word tokenization library using Deep Neural Network. active Python 3.X MIT License TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) KUCut Thai word segmentor that is difference from existing segmentor such as CTTEX or SWATH. deactive Python 2.4-2.5 GPL-2.0 License SEFR CUT Stacked Ensemble Filter and Refine for Word Segmentation active Python 3.X MIT License CutKum Thai Word-Segmentation with LSTM in Tensorflow - Python 3.X MIT License ThaiLMCut Word Tokenizer for Thai Language based on Transfer Learning and bidirectional-LSTM active Python 3.X MIT License LexTo Thai word segmentation ( Longest Matching ) - Java LGPLv2.1 sertiscorp /thai-word-segmentation Thai word segmentation with bi-directional RNN - Python 3.X MIT License Thai Analysis Plugin for Elasticsearch The Thaichub2 (thai-chub-chub) Analysis Plugin integrates the Thai word segmentation modules into Elasticsearch. active Java Apache-2.0 License Wordcut Thai word breaker for Node.js active JavaScript, Node.JS LGPLv3 V8 BreakIterator Chrome's V8 Engine, using ICU active JavaScript Apache License 2.0 icu-wordsplit Simple icu boundary analysis module bindings for node.js inactive JavaScript BSD newmm-tokenizer Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP. active Python 3.X Apache License 2.0 Stanza Official Stanford NLP Python Library for Many Human Languages active Python 3.X Apache License 2.0 Multi Candidate Thai Word Segmentation Most existing word segmentation methods output one single segmentation solution. active Python 3.X MIT License PhlongTaIam PHP Thai word breaker active PHP LGPL-2.1 License Chamkho Rust Thai word breaker active Rust LGPL-3 License oxidized-thainlp Thai Natural Language Processing in Rust, with Python-binding. 
active Python & Rust Apache License 2.0 OSKut Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings) Stacked Ensemble Framework and DeepCut as Baseline model active Python MIT License Tools Name Description License Creator Download MudYom MudYom is a module for pre/post-processing text. It combines, aka \u0e21\u0e31\u0e14, words that should be together into one token. This process is done according to a user-defined dictionary. Pattarawat Chormai GitHub","title":"Word Segmentation"},{"location":"tasks/word-segmentation/#word-segmentation","text":"for Thai language, Word Segmentation is the first step for process Thai text for segment thai text to words.","title":"Word Segmentation"},{"location":"tasks/word-segmentation/#corpus","text":"Name Description Size License Creator Download BEST I (BEST 2009) Benchmark for Enhancing the Standard of Thai language processing 5,000,000 word CC BY-SA-NC 4.0 NECTEC aiforthai (registration required) and Mirror from @korakot Blackboard Treebank Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) CC BY 3.0 Prachya Boonkwan, NECTEC bitbucket Wisesight Samples with Word Tokenization Label This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. 160 sentences (wisesight-160) and 1,000 sentences (wiseight-1000) CC0-1.0 License Nitchakarn Chantarapratin, Pattarawat Chormai, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, and Attapol Rutherford GitHub Thai National Historical Corpus (TNHC) texts from Thai National Historical Corpus, stored by lines (manually tokenized). 
47 documents, 756,478 lines, 13,361,142 characters Jitkapat Sawatphol GitHub Orchid Corpus Thai part of speech (POS) tagged corpus 5,200 sentences CC BY-SA-NC 3.0 NECTEC Mirror from @wannaphong Corpus Komped Poem (windy part) Pattarawat Chormai GitHub VISTEC-TP-TH-21 The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 49,997 sentences with 3.39M words CC-BY-SA 3.0 VISTEC & Chiang Mai University GitHub","title":"Corpus"},{"location":"tasks/word-segmentation/#best-i","text":"BEST I is the Benchmark for Enhancing the Standard of Thai language processing. Number of words: 5,000,000 words Details Creator: NECTEC License: CC BY-SA-NC 4.0 Paper: Download: aiforthai (registration required)","title":"BEST I"},{"location":"tasks/word-segmentation/#benchmarks","text":"We do not report benchmarks for this corpus because we do not have the answers for the test set.","title":"Benchmarks"},{"location":"tasks/word-segmentation/#blackboard-treebank","text":"Blackboard Treebank is a Thai dependency corpus based on the LST20 Annotation Guideline. It features dependency structures, constituency structures, word boundaries, named entities, clause boundaries, and sentence boundaries. 122,851 clauses (38,558 sentences) Details Creator: NECTEC License: CC-BY 3.0 Download: bitbucket","title":"Blackboard Treebank"},{"location":"tasks/word-segmentation/#benchmarks_1","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#orchid-corpus","text":"Orchid Corpus is a Thai part-of-speech (POS) tagged corpus with word segmentation. 
Number of words: words Details Creator: NECTEC License: CC BY-SA-NC 3.0 Paper: Thai Part-of-speech Tagged Corpus: ORCHID Download: Mirror from @wannaphong","title":"Orchid Corpus"},{"location":"tasks/word-segmentation/#benchmarks_2","text":"Orchid Corpus does not have a test set.","title":"Benchmarks"},{"location":"tasks/word-segmentation/#wisesight-corpus","text":"This directory contains samples of Thai social media text, tokenized by humans. These samples are randomly drawn from the full Wisesight Sentiment Corpus. wisesight-160 has 160 sentences. Number of words: 3,833 words wisesight-1000 has 1,000 sentences. Number of words: 21,745 words","title":"Wisesight Corpus"},{"location":"tasks/word-segmentation/#benchmarks_3","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#thai-national-historical-corpus","text":"Thai National Historical Corpus (TNHC), tokenized by humans. Number of words: ? words 47 documents, 756,478 lines, 13,361,142 characters Details Creator: Jitkapat Sawatphol Download: GitHub","title":"Thai National Historical Corpus"},{"location":"tasks/word-segmentation/#corpus-komped-poem-windy-part","text":"Number of words: 317 words Details Creator: Pattarawat Chormai License: CC-BY-SA 3.0 Paper: - Download: GitHub","title":"Corpus Komped Poem (windy part)"},{"location":"tasks/word-segmentation/#benchmarks_4","text":"[WIP]","title":"Benchmarks"},{"location":"tasks/word-segmentation/#vistec-tp-th-21","text":"The largest social media domain datasets for Thai text processing (word segmentation, misspell correction and detection, and named-entity boundary) called \"VISTEC-TP-TH-2021\" or VISTEC-2021. 
Number of words: 3.39M words Details Creator: VISTEC & Chiang Mai University License: CC-BY-SA 3.0 Paper: - Download: GitHub","title":"VISTEC-TP-TH-21"},{"location":"tasks/word-segmentation/#software","text":"Name Description Status Language License ICU ICU - International Components for Unicode active C/C++/Java Unicode License libthai is a set of Thai language support routines aimed to ease developers' tasks to incorporate Thai language support in their applications. active C/C++ LGPL-2.1 License SWATH Smart Word Analysis for THai active C/C++ GPL-2.0 License AttaCut Fast and Reasonably Accurate Word Tokenizer for Thai. active Python 3.X MIT License PyThaiNLP It's part of PyThaiNLP. active Python 3.X Apache License 2.0 PyWordCut wordcutpy is a simple Thai word breaker written in Python 3+ active Python 3.X LGPLv3 DeepCut A Thai word tokenization library using Deep Neural Network. active Python 3.X MIT License TLTK Thai Language Toolkit active Python 3.X BSD License (BSD-3-Clause) KUCut Thai word segmentor that is difference from existing segmentor such as CTTEX or SWATH. deactive Python 2.4-2.5 GPL-2.0 License SEFR CUT Stacked Ensemble Filter and Refine for Word Segmentation active Python 3.X MIT License CutKum Thai Word-Segmentation with LSTM in Tensorflow - Python 3.X MIT License ThaiLMCut Word Tokenizer for Thai Language based on Transfer Learning and bidirectional-LSTM active Python 3.X MIT License LexTo Thai word segmentation ( Longest Matching ) - Java LGPLv2.1 sertiscorp /thai-word-segmentation Thai word segmentation with bi-directional RNN - Python 3.X MIT License Thai Analysis Plugin for Elasticsearch The Thaichub2 (thai-chub-chub) Analysis Plugin integrates the Thai word segmentation modules into Elasticsearch. 
active Java Apache-2.0 License Wordcut Thai word breaker for Node.js active JavaScript, Node.JS LGPLv3 V8 BreakIterator Chrome's V8 Engine, using ICU active JavaScript Apache License 2.0 icu-wordsplit Simple icu boundary analysis module bindings for node.js inactive JavaScript BSD newmm-tokenizer Standalone Dictionary-based, Maximum Matching + Thai Character Cluster (newmm) tokenizer extracted from PyThaiNLP. active Python 3.X Apache License 2.0 Stanza Official Stanford NLP Python Library for Many Human Languages active Python 3.X Apache License 2.0 Multi Candidate Thai Word Segmentation Most existing word segmentation methods output one single segmentation solution. active Python 3.X MIT License PhlongTaIam PHP Thai word breaker active PHP LGPL-2.1 License Chamkho Rust Thai word breaker active Rust LGPL-3 License oxidized-thainlp Thai Natural Language Processing in Rust, with Python-binding. active Python & Rust Apache License 2.0 OSKut Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation (ACL 2021 Findings) Stacked Ensemble Framework and DeepCut as Baseline model active Python MIT License","title":"Software"},{"location":"tasks/word-segmentation/#tools","text":"Name Description License Creator Download MudYom MudYom is a module for pre/post-processing text. It combines, aka \u0e21\u0e31\u0e14, words that should be together into one token. This process is done according to a user-defined dictionary. Pattarawat Chormai GitHub","title":"Tools"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 4989c21c7d5ccc26943057d1f2e02c824037f8da..baf5a97547683b40ec6db1331463a73d98ccdd2a 100644 GIT binary patch delta 13 Ucmb=gXP58h;9!{NHj%vo02#Rhv;Y7A delta 13 Ucmb=gXP58h;AmiTn8;oM02!$RmH+?% diff --git a/tasks/parser/index.html b/tasks/parser/index.html index def7b79..8203f38 100644 --- a/tasks/parser/index.html +++ b/tasks/parser/index.html @@ -381,6 +381,22 @@

Corpus

Universal Dependencies GitHub
+
+Thai Discourse Treebank
+The Thai Discourse Treebank (TDTB) at Chulalongkorn University annotates 180 documents from the LST20 corpus with 10,868 discourse relations.
+6,534 sentences
+
+Prasertsom, P., Jaroonpol, A., & Rutherford, A. T.
+Github
+
+
+TUD Treebank
+Thai Universal Dependency Treebank, annotating TNC
+3,627 sentences
+
+nlp-chula
+Github
+

Software