Various open source corpora assembled, annotated or maintained by the Applied Computational Linguistics (ACoLi) group at the University of Augsburg, Germany (since 2023), and previously at Goethe University Frankfurt, Germany (2013-2022).
Currently covering 1000+ languages in five partially overlapping collections:
- semantics (1 corpus, English)
- cuneiform (3 corpora, Sumerian)
- Germanic languages (13 languages, in parts with annotations for syntax and semantics)
- verse-aligned bibles (100+ languages, verse-aligned; build scripts for 700+ languages)
- parallel corpora (aside from the Bible, this includes the Teddy corpus with TED talks for 1688 languages as well as a small corpus of parallel fairy tales; note that these are aligned at the text level only)
language | collection | name+link | comments |
---|---|---|---|
English (en) | semantics | RRG Corpus | annotations for Role and Reference Grammar |
Sumerian (sux) | cuneiform | MTAAC Syntax Corpus | dependency syntax |
Sumerian (sux) | cuneiform | MTAAC Gold Corpus | morphology, named entities |
Sumerian (sux) | cuneiform | MTAAC Ur III Corpus | morphology, named entities, commodities |
Middle High German (gmh) | Germanic | ReM Treebank | phrase structure syntax, topological fields |
Afrikaans (af) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Bavarian (bar) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Danish (da) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Dutch (nl) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
English (en) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
German, 16th c. (deu) | Germanic, external | Early New High German corpus | phrase structure syntax (mirror of an external resource) |
German (deu) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Gothic (got) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Icelandic (is) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Middle Low German (gml) | Germanic, Biblical | Bugenhagen's Passion of Christ | text edition |
Norwegian (no) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Swedish (sv) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
Achiar-Shiwiar (acu) | Biblical | Bible | verse-aligned, CES/XML |
Aguaruna (agr) | Biblical | Bible | verse-aligned, CES/XML |
Akawaio (ake) | Biblical | Bible | verse-aligned, CES/XML |
Albanian (alb) | Biblical | Bible | verse-aligned, CES/XML |
Amharic (amh) | Biblical | Bible | verse-aligned, CES/XML |
Amuzgo (azm) | Biblical | Bible | verse-aligned, CES/XML |
Arabic (ar) | Biblical | Bible | verse-aligned, CES/XML |
Armenian (hye) | Biblical | Bible | verse-aligned, CES/XML |
Aukan (djk) | Biblical | Bible | verse-aligned, CES/XML |
Barasana (bsn) | Biblical | Bible | verse-aligned, CES/XML |
Basque (eus) | Biblical | Bible | verse-aligned, CES/XML |
Bulgarian (bul) | Biblical | Bible | verse-aligned, CES/XML |
Cabecar (cjp) | Biblical | Bible | verse-aligned, CES/XML |
Cakchiquel (cak) | Biblical | Bible | verse-aligned, CES/XML |
Campa (cni) | Biblical | Bible | verse-aligned, CES/XML |
Camsa (kbh) | Biblical | Bible | verse-aligned, CES/XML |
Catalan (ca) | Biblical | Bible | verse-aligned, CES/XML |
Cebuano (ceb) | Biblical | Bible | verse-aligned, CES/XML |
Chamorro (cha) | Biblical | Bible | verse-aligned, CES/XML |
Cherokee (chr) | Biblical | Bible | verse-aligned, CES/XML |
Chinantec (csa) | Biblical | Bible | verse-aligned, CES/XML |
Chinese (zh) | Biblical | Bible | verse-aligned, CES/XML |
Coptic (cop) | Biblical | Bible | verse-aligned, CES/XML |
Creole (crp) | Biblical | Bible | verse-aligned, CES/XML |
Croatian (hrv) | Biblical | Bible | verse-aligned, CES/XML |
Czech (cs) | Biblical | Bible | verse-aligned, CES/XML |
Dawro (dwr) | Biblical | Bible | verse-aligned, CES/XML |
Dinka (dik) | Biblical | Bible | verse-aligned, CES/XML |
Esperanto (epo) | Biblical | Bible | verse-aligned, CES/XML |
Estonian (es) | Biblical | Bible | verse-aligned, CES/XML |
Ewe (ewe) | Biblical | Bible | verse-aligned, CES/XML |
Farsi (fa) | Biblical | Bible | verse-aligned, CES/XML |
Finnish (fi) | Biblical | Bible | verse-aligned, CES/XML |
French (fr) | Biblical | Bible | verse-aligned, CES/XML |
Galela (gbi) | Biblical | Bible | verse-aligned, CES/XML |
Gujarati (guj) | Biblical | Bible | verse-aligned, CES/XML |
Hebrew (he) | Biblical | Bible | verse-aligned, CES/XML |
Hindi (hin) | Biblical | Bible | verse-aligned, CES/XML |
Hungarian (hu) | Biblical | Bible | verse-aligned, CES/XML |
Indonesian (id) | Biblical | Bible | verse-aligned, CES/XML |
Italian (it) | Biblical | Bible | verse-aligned, CES/XML |
Jakalteko (jak) | Biblical | Bible | verse-aligned, CES/XML |
Japanese (jp) | Biblical | Bible | verse-aligned, CES/XML |
Kabyle (kab) | Biblical | Bible | verse-aligned, CES/XML |
Kannada (kan) | Biblical | Bible | verse-aligned, CES/XML |
Korean (ko) | Biblical | Bible | verse-aligned, CES/XML |
Latin (la) | Biblical | Bible | verse-aligned, CES/XML |
Latvian (lav) | Biblical | Bible | verse-aligned, CES/XML |
Lithuanian (lit) | Biblical | Bible | verse-aligned, CES/XML |
Lukpa (dop) | Biblical | Bible | verse-aligned, CES/XML |
Malagasy (plt) | Biblical | Bible | verse-aligned, CES/XML |
Malayalam (mal) | Biblical | Bible | verse-aligned, CES/XML |
Mam (mam) | Biblical | Bible | verse-aligned, CES/XML |
Manx Gaelic (glv) | Biblical | Bible | verse-aligned, CES/XML |
Maori (mao) | Biblical | Bible | verse-aligned, CES/XML |
Marathi (mar) | Biblical | Bible | verse-aligned, CES/XML |
Modern Greek (ell) | Biblical | Bible | verse-aligned, CES/XML |
Myanmar (mya) | Biblical | Bible | verse-aligned, CES/XML |
Nahuatl (nhg) | Biblical | Bible | verse-aligned, CES/XML |
Nepali (nep) | Biblical | Bible | verse-aligned, CES/XML |
Ojibwa (ojb) | Biblical | Bible | verse-aligned, CES/XML |
Old Greek (grc) | Biblical | Bible | verse-aligned, CES/XML |
Paite (pck) | Biblical | Bible | verse-aligned, CES/XML |
Polish (pol) | Biblical | Bible | verse-aligned, CES/XML |
Portuguese (pt) | Biblical | Bible | verse-aligned, CES/XML |
Potawatomi (pot) | Biblical | Bible | verse-aligned, CES/XML |
Qeqchi (qeq) | Biblical | Bible | verse-aligned, CES/XML |
Quiche (quc) | Biblical | Bible | verse-aligned, CES/XML |
Quichua (cuq) | Biblical | Bible | verse-aligned, CES/XML |
Romani (rom) | Biblical | Bible | verse-aligned, CES/XML |
Romanian (ro) | Biblical | Bible | verse-aligned, CES/XML |
Russian (rus) | Biblical | Bible | verse-aligned, CES/XML |
Scottish Gaelic (gla) | Biblical | Bible | verse-aligned, CES/XML |
Serbian (srp) | Biblical | Bible | verse-aligned, CES/XML |
Shona (sna) | Biblical | Bible | verse-aligned, CES/XML |
Shuar (jiv) | Biblical | Bible | verse-aligned, CES/XML |
Slovak (slk) | Biblical | Bible | verse-aligned, CES/XML |
Slovene (slv) | Biblical | Bible | verse-aligned, CES/XML |
Somali (som) | Biblical | Bible | verse-aligned, CES/XML |
Spanish (es) | Biblical | Bible | verse-aligned, CES/XML |
Swahili (swa) | Biblical | Bible | verse-aligned, CES/XML |
Syriac (syc) | Biblical | Bible | verse-aligned, CES/XML |
Tachelhit (shi) | Biblical | Bible | verse-aligned, CES/XML |
Tagalog (tgl) | Biblical | Bible | verse-aligned, CES/XML |
Telugu (te) | Biblical | Bible | verse-aligned, CES/XML |
Thai (tha) | Biblical | Bible | verse-aligned, CES/XML |
Tuareg (tmh) | Biblical | Bible | verse-aligned, CES/XML |
Turkish (tur) | Biblical | Bible | verse-aligned, CES/XML |
Ukrainian (ukr) | Biblical | Bible | verse-aligned, CES/XML |
Uma (ppk) | Biblical | Bible | verse-aligned, CES/XML |
Uspanteco (usp) | Biblical | Bible | verse-aligned, CES/XML |
Vietnamese (vie) | Biblical | Bible | verse-aligned, CES/XML |
Wolaytta (wal) | Biblical | Bible | verse-aligned, CES/XML |
Wolof (wol) | Biblical | Bible | verse-aligned, CES/XML |
Xhosa (xho) | Biblical | Bible | verse-aligned, CES/XML |
Yalunka (yal) | Biblical | Bible | verse-aligned, CES/XML |
Zarma (dje) | Biblical | Bible | verse-aligned, CES/XML |
Zulu (zul) | Biblical | Bible | verse-aligned, CES/XML |
1688 languages | parallel | Teddy Corpus | TED transcripts and build routine |
Danish (da) | parallel | fairy tales | parallel text, no alignment |
German (de) | parallel | fairy tales | parallel text, no alignment |
Greek (el) | parallel | fairy tales | parallel text, no alignment |
English (en) | parallel | fairy tales | parallel text, no alignment |
Esperanto (eo) | parallel | fairy tales | parallel text, no alignment |
Spanish (es) | parallel | fairy tales | parallel text, no alignment |
Finnish (fi) | parallel | fairy tales | parallel text, no alignment |
French (fr) | parallel | fairy tales | parallel text, no alignment |
Hungarian (hu) | parallel | fairy tales | parallel text, no alignment |
Italian (it) | parallel | fairy tales | parallel text, no alignment |
Japanese (ja) | parallel | fairy tales | parallel text, no alignment |
Korean (ko) | parallel | fairy tales | parallel text, no alignment |
Low German (nds) | parallel | fairy tales | parallel text, no alignment |
Dutch (nl) | parallel | fairy tales | parallel text, no alignment |
Polish (pl) | parallel | fairy tales | parallel text, no alignment |
Portuguese (pt) | parallel | fairy tales | parallel text, no alignment |
Romanian (ro) | parallel | fairy tales | parallel text, no alignment |
Russian (ru) | parallel | fairy tales | parallel text, no alignment |
Turkish (tr) | parallel | fairy tales | parallel text, no alignment |
Ukrainian (uk) | parallel | fairy tales | parallel text, no alignment |
Vietnamese (vi) | parallel | fairy tales | parallel text, no alignment |
Chinese (zh) | parallel | fairy tales | parallel text, no alignment |
Most users will want a sparse checkout of selected sub-directories rather than all corpora at once. We describe both sparse and full checkout.
Some corpora are integrated here only as Git submodules. If you are interested in just these, go to their original source and check them out from there.
If you are interested in a sub-folder of the current repository, you can use the sparse checkout functionality of Git, illustrated for teddy/ below:
$> git clone --depth 1 --filter=blob:none --sparse https://github.com/acoli-repo/acoli-corpora
$> cd acoli-corpora
$> git sparse-checkout set teddy
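When several sub-directories are needed at once, the three commands above can be wrapped in a small shell function. This is only a sketch: `sparse_clone` is a hypothetical helper that is not shipped with this repository, and it assumes a Git version with sparse checkout support (2.25 or later).

```shell
# Hypothetical helper (not part of this repository): clone a repository
# sparsely and check out only the given sub-directories.
# Usage: sparse_clone <repo-url> <target-dir> <sub-dir> [<sub-dir> ...]
sparse_clone() {
  if [ "$#" -lt 3 ]; then
    echo "usage: sparse_clone <repo-url> <target-dir> <sub-dir>..." >&2
    return 2
  fi
  repo_url=$1
  target_dir=$2
  shift 2
  # Same flags as in the example above: shallow, blob-less, sparse clone.
  git clone --depth 1 --filter=blob:none --sparse "$repo_url" "$target_dir" || return 1
  # Restrict the working tree to the requested sub-directories.
  git -C "$target_dir" sparse-checkout set "$@"
}

# Example call for the Teddy corpus:
# sparse_clone https://github.com/acoli-repo/acoli-corpora acoli-corpora teddy
```

Further directory names can be appended as additional arguments; they are passed unchanged to `git sparse-checkout set`.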
- SVN provides a similar (and even slimmer) sparse checkout functionality. However, the GitSVN bridge seems to be limited in its capacity, so directories with many files (such as teddy/) are likely to run into a timeout. Roughly equivalent example call for the Teddy corpus (without files in root directory):
  $> mkdir acoli-corpora
  $> cd acoli-corpora
  $> svn co https://github.com/acoli-repo/acoli-corpora/trunk/teddy acoli-corpora/teddy
- Sparse checkouts were introduced with Git v2.25. However, in some releases (at least Git v2.25.1), sparse checkouts fail with
  fatal: cannot change to 'https://github.com/acoli-repo/acoli-corpora'
  This is a bug in Git that can be fixed by upgrading Git:
  $> sudo add-apt-repository ppa:git-core/ppa
  $> sudo apt-get update
  $> sudo apt-get install git
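Before attempting a sparse checkout, it can help to confirm that the installed Git is recent enough. The following is a minimal sketch: both function names are hypothetical, and it assumes a `sort` that supports version sorting with `-V` (as in GNU coreutils). It only checks for the 2.25 minimum mentioned above and does not detect the bug in individual releases.

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumes a sort(1) with the -V (version sort) option, e.g. GNU coreutils.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# Report whether the installed Git is recent enough for sparse checkouts.
check_git_for_sparse_checkout() {
  # "git --version" prints e.g. "git version 2.34.1"; take the third field.
  installed=$(git --version | awk '{print $3}')
  if version_at_least "$installed" 2.25; then
    echo "Git $installed provides sparse-checkout"
  else
    echo "Git $installed is older than 2.25; please upgrade" >&2
    return 1
  fi
}
```

Calling `check_git_for_sparse_checkout` prints a one-line verdict and returns a non-zero status if an upgrade is needed.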
Note that some of the data provided here is maintained separately, so this repo uses the submodule functionality of Git.
However, the aggregator repository is only occasionally updated to point to the most recent versions. To retrieve the most up-to-date versions, clone this repo using
$> git clone --recurse-submodules --remote-submodules https://github.com/acoli-repo/acoli-corpora
To update an existing installation in the directory ./acoli-corpora/, run
$> cd ./acoli-corpora/
$> git submodule update --recursive .
Note that these repositories do not have strong interdependencies in the aggregator; it was created mostly to facilitate a quick and easy local setup of all corpora in one go. For development or annotation, we recommend working within the submodule repositories directly.
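To see which directories of a local checkout are submodules (and therefore have their own upstream repository), the `.gitmodules` file in the repository root can be queried. The helper below is a hedged sketch; `list_submodule_paths` is a hypothetical name, and it relies only on `git config --file` and `sed`.

```shell
# Hypothetical helper: print the paths of all submodules declared in the
# .gitmodules file of the checkout at $1 (default: current directory).
list_submodule_paths() {
  gitmodules="${1:-.}/.gitmodules"
  # A checkout without .gitmodules declares no submodules.
  [ -f "$gitmodules" ] || return 0
  # Each matching line has the form "submodule.<name>.path <path>";
  # strip the key so that only the path remains.
  git config --file "$gitmodules" --get-regexp '^submodule\..*\.path$' |
    sed 's/^submodule\..*\.path //'
}

# Example: list_submodule_paths ./acoli-corpora
```

The URL of each submodule's original source can be looked up the same way by matching the `url` key instead of `path`.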
Note that some corpora require a special build routine. If so, a designated Readme.md file is provided.
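To find out which corpora in a local checkout come with their own build instructions, one option is to search the sub-directories for such files. This sketch assumes that the designated file sits somewhere below the corpus directory and that its name may vary in case; `list_readmes` is a hypothetical helper.

```shell
# Hypothetical helper: list Readme.md files below the checkout at $1
# (default: current directory), skipping the top-level Readme and
# anything inside .git directories. Matching is case-insensitive.
list_readmes() {
  find "${1:-.}" -mindepth 2 -type f -iname 'readme.md' ! -path '*/.git/*' | sort
}

# Example: list_readmes ./acoli-corpora
```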