Corpora

A repo for Tibetan corpora

Esukhia currently hosts several corpora and corpus-related projects. These include:

The Children's Story Speech Corpus
A collection of Frequency Lists (for use in Dakje, https://github.com/Esukhia/dakje)
The Nanhai Corpus (Tibetan speech & text, ~1.2 million words)
A Parallel Corpus (of 84000's English/Tibetan translations; see http://84000.co)
A simplified-scheme, POS-tagged version of SOAS's Digital Communication corpus (http://larkpie.net/tibetancorpus/)
Spoken Tibetan transcripts pulled from a web crawl

License

The Esukhia Corpora are copyright Esukhia, and licensed under a Creative Commons Attribution 4.0 International License.