Esukhia/Corpora

Corpora

A repo for Tibetan corpora

Esukhia currently hosts several corpora and corpus-related projects. These include:

  1. The Children's Story Speech Corpus
  2. A collection of Frequency Lists (for use in Dakje, https://github.com/Esukhia/dakje)
  3. The Nanhai Corpus (Tibetan speech & text, ~1.2 million words)
  4. A Parallel Corpus (of English/Tibetan translations from 84000, see: http://84000.co)
  5. A simplified-scheme, POS-tagged version of SOAS's Digital Communication corpus (http://larkpie.net/tibetancorpus/)
  6. Spoken Tibetan transcripts pulled from a web crawl

Tibetan Unicode characters

License

The Esukhia Corpora are copyright Esukhia, and licensed under a Creative Commons Attribution 4.0 International License.