A repo for Tibetan corpora
Esukhia currently hosts several corpora and corpus-related projects. These include:
- The Children's Story Speech Corpus
- A collection of Frequency Lists (for use in Dakje, https://github.com/Esukhia/dakje)
- The Nanhai Corpus (Tibetan speech & text, ~1.2 million words)
- A Parallel Corpus (of English/Tibetan translations from 84000, see: http://84000.co)
- A simplified-scheme, POS-tagged version of SOAS's Digital Communication corpus (http://larkpie.net/tibetancorpus/)
- Tibetan speech transcripts pulled from a web crawl
The Esukhia Corpora are copyright Esukhia, and licensed under a Creative Commons Attribution 4.0 International License.