open source corpora created, annotated or maintained by the ACoLi group at University of Augsburg, Germany.

acoli-repo/acoli-corpora

ACoLi Corpora

Various open source corpora assembled, annotated or maintained by the Applied Computational Linguistics (ACoLi) group at the University of Augsburg, Germany (since 2023), and previously at Goethe University Frankfurt, Germany (2013-2022).

Currently covering 1000+ languages in five partially overlapping collections:

  • semantics (1 corpus, English)
  • cuneiform (3 corpora, Sumerian)
  • Germanic languages (13 languages, in parts with annotations for syntax and semantics)
  • verse-aligned bibles (100+ languages, verse-aligned; build scripts for 700+ languages)
  • parallel corpora (aside from the Bible, this includes the Teddy corpus with TED talks for 1688 languages as well as a small corpus of parallel text; note that these are aligned at the text level only)

Content

| language | collection | name | comments |
|---|---|---|---|
| English (en) | semantics | RRG Corpus | annotations for Role and Reference Grammar |
| Sumerian (sux) | cuneiform | MTAAC Syntax Corpus | dependency syntax |
| Sumerian (sux) | cuneiform | MTAAC Gold Corpus | morphology, named entities |
| Sumerian (sux) | cuneiform | MTAAC Ur III Corpus | morphology, named entities, commodities |
| Middle High German (gmh) | Germanic | ReM Treebank | phrase structure syntax, topological fields |
| Afrikaans (af) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Bavarian (bar) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Danish (da) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Dutch (nl) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| English (en) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| German, 16th c. (deu) | Germanic, external | Early New High German corpus | phrase structure syntax (mirror of an external resource) |
| German (deu) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Gothic (got) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Icelandic (is) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Middle Low German (gml) | Germanic, Biblical | Bugenhagen's Passion of Christ | text edition |
| Norwegian (no) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Swedish (sv) | Germanic, Biblical | Bible | verse-aligned, CES/XML |
| Achiar-Shiwiar (acu) | Biblical | Bible | verse-aligned, CES/XML |
| Aguaruna (agr) | Biblical | Bible | verse-aligned, CES/XML |
| Akewaio (ake) | Biblical | Bible | verse-aligned, CES/XML |
| Albanian (alb) | Biblical | Bible | verse-aligned, CES/XML |
| Amharic (amh) | Biblical | Bible | verse-aligned, CES/XML |
| Amuzgo (azm) | Biblical | Bible | verse-aligned, CES/XML |
| Arabic (ar) | Biblical | Bible | verse-aligned, CES/XML |
| Armenian (hye) | Biblical | Bible | verse-aligned, CES/XML |
| Aukan (djk) | Biblical | Bible | verse-aligned, CES/XML |
| Barasana (bsn) | Biblical | Bible | verse-aligned, CES/XML |
| Basque (eus) | Biblical | Bible | verse-aligned, CES/XML |
| Bulgarian (bul) | Biblical | Bible | verse-aligned, CES/XML |
| Cabecar (cjp) | Biblical | Bible | verse-aligned, CES/XML |
| Cakchiquel (cak) | Biblical | Bible | verse-aligned, CES/XML |
| Campa (cni) | Biblical | Bible | verse-aligned, CES/XML |
| Camsa (kbh) | Biblical | Bible | verse-aligned, CES/XML |
| Catalan (ca) | Biblical | Bible | verse-aligned, CES/XML |
| Cebuano (ceb) | Biblical | Bible | verse-aligned, CES/XML |
| Chamorro (cha) | Biblical | Bible | verse-aligned, CES/XML |
| Cherokee (chr) | Biblical | Bible | verse-aligned, CES/XML |
| Chinantex (csa) | Biblical | Bible | verse-aligned, CES/XML |
| Chinese (zh) | Biblical | Bible | verse-aligned, CES/XML |
| Coptic (cop) | Biblical | Bible | verse-aligned, CES/XML |
| Creole (crp) | Biblical | Bible | verse-aligned, CES/XML |
| Croatian (hrv) | Biblical | Bible | verse-aligned, CES/XML |
| Czech (cs) | Biblical | Bible | verse-aligned, CES/XML |
| Dawro (dwr) | Biblical | Bible | verse-aligned, CES/XML |
| Dinka (dik) | Biblical | Bible | verse-aligned, CES/XML |
| Esperanto (epo) | Biblical | Bible | verse-aligned, CES/XML |
| Estonian (es) | Biblical | Bible | verse-aligned, CES/XML |
| Ewe (ewe) | Biblical | Bible | verse-aligned, CES/XML |
| Farsi (fa) | Biblical | Bible | verse-aligned, CES/XML |
| Finnish (fi) | Biblical | Bible | verse-aligned, CES/XML |
| French (fr) | Biblical | Bible | verse-aligned, CES/XML |
| Galela (gbi) | Biblical | Bible | verse-aligned, CES/XML |
| Gujarati (guj) | Biblical | Bible | verse-aligned, CES/XML |
| Hebrew (he) | Biblical | Bible | verse-aligned, CES/XML |
| Hindi (hin) | Biblical | Bible | verse-aligned, CES/XML |
| Hungarian (hu) | Biblical | Bible | verse-aligned, CES/XML |
| Indonesian (id) | Biblical | Bible | verse-aligned, CES/XML |
| Italian (it) | Biblical | Bible | verse-aligned, CES/XML |
| Jakalteko (jak) | Biblical | Bible | verse-aligned, CES/XML |
| Japanese (jp) | Biblical | Bible | verse-aligned, CES/XML |
| Kabyle (kab) | Biblical | Bible | verse-aligned, CES/XML |
| Kannada (kan) | Biblical | Bible | verse-aligned, CES/XML |
| Korean (ko) | Biblical | Bible | verse-aligned, CES/XML |
| Latin (la) | Biblical | Bible | verse-aligned, CES/XML |
| Latvian (lav) | Biblical | Bible | verse-aligned, CES/XML |
| Lithuanian (lit) | Biblical | Bible | verse-aligned, CES/XML |
| Lukpa (dop) | Biblical | Bible | verse-aligned, CES/XML |
| Malagasy (plt) | Biblical | Bible | verse-aligned, CES/XML |
| Malayalam (mal) | Biblical | Bible | verse-aligned, CES/XML |
| Mam (mam) | Biblical | Bible | verse-aligned, CES/XML |
| Manx Gaelic (glv) | Biblical | Bible | verse-aligned, CES/XML |
| Maori (mao) | Biblical | Bible | verse-aligned, CES/XML |
| Marathi (mar) | Biblical | Bible | verse-aligned, CES/XML |
| Modern Greek (ell) | Biblical | Bible | verse-aligned, CES/XML |
| Myanmar (mya) | Biblical | Bible | verse-aligned, CES/XML |
| Nahuatl (nhg) | Biblical | Bible | verse-aligned, CES/XML |
| Nepali (nep) | Biblical | Bible | verse-aligned, CES/XML |
| Ojibwa (ojb) | Biblical | Bible | verse-aligned, CES/XML |
| Old Greek (grc) | Biblical | Bible | verse-aligned, CES/XML |
| Paite (pck) | Biblical | Bible | verse-aligned, CES/XML |
| Polish (pol) | Biblical | Bible | verse-aligned, CES/XML |
| Portuguese (pt) | Biblical | Bible | verse-aligned, CES/XML |
| Potawatomi (pot) | Biblical | Bible | verse-aligned, CES/XML |
| Qeqchi (qeq) | Biblical | Bible | verse-aligned, CES/XML |
| Quiche (quc) | Biblical | Bible | verse-aligned, CES/XML |
| Quichua (cuq) | Biblical | Bible | verse-aligned, CES/XML |
| Romani (rom) | Biblical | Bible | verse-aligned, CES/XML |
| Romanian (ro) | Biblical | Bible | verse-aligned, CES/XML |
| Russian (rus) | Biblical | Bible | verse-aligned, CES/XML |
| Scottish Gaelic (gla) | Biblical | Bible | verse-aligned, CES/XML |
| Serbian (srp) | Biblical | Bible | verse-aligned, CES/XML |
| Shona (sna) | Biblical | Bible | verse-aligned, CES/XML |
| Shuar (jiv) | Biblical | Bible | verse-aligned, CES/XML |
| Slovak (slk) | Biblical | Bible | verse-aligned, CES/XML |
| Slovene (slv) | Biblical | Bible | verse-aligned, CES/XML |
| Somali (som) | Biblical | Bible | verse-aligned, CES/XML |
| Spanish (es) | Biblical | Bible | verse-aligned, CES/XML |
| Swahili (swa) | Biblical | Bible | verse-aligned, CES/XML |
| Syriac (syc) | Biblical | Bible | verse-aligned, CES/XML |
| Tachelhit (shi) | Biblical | Bible | verse-aligned, CES/XML |
| Tagalog (tgl) | Biblical | Bible | verse-aligned, CES/XML |
| Telugu (te) | Biblical | Bible | verse-aligned, CES/XML |
| Thai (tha) | Biblical | Bible | verse-aligned, CES/XML |
| Tuareg (tmh) | Biblical | Bible | verse-aligned, CES/XML |
| Turkish (tur) | Biblical | Bible | verse-aligned, CES/XML |
| Ukrainian (ukr) | Biblical | Bible | verse-aligned, CES/XML |
| Uma (ppk) | Biblical | Bible | verse-aligned, CES/XML |
| Uspanteco (usp) | Biblical | Bible | verse-aligned, CES/XML |
| Vietnamese (vie) | Biblical | Bible | verse-aligned, CES/XML |
| Wolaytta (wal) | Biblical | Bible | verse-aligned, CES/XML |
| Wolof (wol) | Biblical | Bible | verse-aligned, CES/XML |
| Xhosa (xho) | Biblical | Bible | verse-aligned, CES/XML |
| Yalunka (yal) | Biblical | Bible | verse-aligned, CES/XML |
| Zama (dje) | Biblical | Bible | verse-aligned, CES/XML |
| Zulu (zul) | Biblical | Bible | verse-aligned, CES/XML |
| 1688 languages | parallel | Teddy Corpus | TED transcripts and build routine |
| Danish (da) | parallel | fairy tales | parallel text, no alignment |
| German (de) | parallel | fairy tales | parallel text, no alignment |
| Greek (el) | parallel | fairy tales | parallel text, no alignment |
| English (en) | parallel | fairy tales | parallel text, no alignment |
| Esperanto (eo) | parallel | fairy tales | parallel text, no alignment |
| Spanish (es) | parallel | fairy tales | parallel text, no alignment |
| Finnish (fi) | parallel | fairy tales | parallel text, no alignment |
| French (fr) | parallel | fairy tales | parallel text, no alignment |
| Hungarian (hu) | parallel | fairy tales | parallel text, no alignment |
| Italian (it) | parallel | fairy tales | parallel text, no alignment |
| Japanese (ja) | parallel | fairy tales | parallel text, no alignment |
| Korean (ko) | parallel | fairy tales | parallel text, no alignment |
| Low German (nds) | parallel | fairy tales | parallel text, no alignment |
| Dutch (nl) | parallel | fairy tales | parallel text, no alignment |
| Polish (pl) | parallel | fairy tales | parallel text, no alignment |
| Portuguese (pt) | parallel | fairy tales | parallel text, no alignment |
| Romanian (ro) | parallel | fairy tales | parallel text, no alignment |
| Russian (ru) | parallel | fairy tales | parallel text, no alignment |
| Turkish (tr) | parallel | fairy tales | parallel text, no alignment |
| Ukrainian (uk) | parallel | fairy tales | parallel text, no alignment |
| Vietnamese (vi) | parallel | fairy tales | parallel text, no alignment |
| Chinese (zh) | parallel | fairy tales | parallel text, no alignment |

Setting it up

Most users will only want a sparse checkout of selected sub-directories rather than all corpora at once. Both sparse and full checkout are described below.

Sparse Checkout

Some corpora are integrated here only as Git submodules. If you are interested in just these, go to their original source and check them out from there.
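To find the original source of a submodule, the .gitmodules file in the repository root can be inspected. The following is a minimal sketch, assuming an existing clone (sparse or full) whose root directory is passed as the first argument; the function name list_submodule_sources is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repository):
# print "submodule.<name>.url <URL>" for every submodule declared
# in the .gitmodules file of the clone located at $1.
list_submodule_sources() {
    git config --file "$1/.gitmodules" --get-regexp '^submodule\..*\.url$'
}

# Usage, assuming the clone lives in ./acoli-corpora:
#   list_submodule_sources ./acoli-corpora
```

Each printed URL points to the repository where the corresponding corpus is maintained.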

If you are interested in a sub-folder of the current repository, you can use Git's sparse checkout functionality, illustrated for teddy/ below:

$> git clone --depth 1 --filter=blob:none --sparse https://github.com/acoli-repo/acoli-corpora
$> cd acoli-corpora
$> git sparse-checkout set teddy

Notes

  • SVN provides a similar (and even slimmer) sparse checkout functionality. However, the Git-SVN bridge seems to be limited in capacity, so directories with many files (such as teddy/) are likely to run into a timeout. A roughly equivalent call for the Teddy corpus (without the files in the root directory):

      $> mkdir acoli-corpora
      $> cd acoli-corpora
      $> svn co https://github.com/acoli-repo/acoli-corpora/trunk/teddy teddy
    
  • The git sparse-checkout command was introduced with Git 2.25. However, in some releases (at least Git 2.25.1), sparse checkouts fail with the error "fatal: cannot change to 'https://github.com/acoli-repo/acoli-corpora'". This is a bug in Git that can be fixed by upgrading Git, for example on Ubuntu:

      $> sudo add-apt-repository ppa:git-core/ppa
      $> sudo apt-get update
      $> sudo apt-get install git
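Before attempting a sparse checkout, it can help to check whether the installed Git is recent enough to provide the sparse-checkout command at all. The sketch below only tests for version 2.25 or later; it does not detect whether the bug mentioned above is fixed in a given release. The function name has_sparse_checkout is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the dotted version string in $1
# (e.g. "2.25.1") is at least 2.25.
has_sparse_checkout() {
    ver=$1
    major=${ver%%.*}    # text before the first dot
    rest=${ver#*.}      # text after the first dot
    minor=${rest%%.*}   # second version component
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 25 ]; }
}

# Usage: the third field of `git --version` is the version number.
#   has_sparse_checkout "$(git --version | awk '{print $3}')" \
#       || echo "Git is too old for sparse-checkout, please upgrade" >&2
```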
    

Full Checkout

Note that some of the data provided here is maintained separately, so this repository uses Git's submodule functionality. However, the aggregator repository is only occasionally updated to point to the most recent versions. To retrieve the most up-to-date versions, clone this repository using

$> git clone --recurse-submodules --remote-submodules https://github.com/acoli-repo/acoli-corpora

For updating an existing installation in the directory ./acoli-corpora/, run

$> cd ./acoli-corpora/
$> git submodule update --recursive .

Note that these repositories do not have strong interdependencies in the aggregator; it was created mostly to facilitate a quick and easy local setup of all corpora in one go. For development or annotation, we recommend working within the submodule repositories directly.

Note that some corpora require a special build routine. If so, a designated Readme.md file is provided.
