
havet/My_Block_World_Corpus


My_Block_World_Corpus

A synthetic corpus for testing machine translation

The block world corpus

It originates from the Department of Linguistics and Philology at Uppsala University, Sweden, www.lingfil.uu.se

With the kind consent of Professor Jörg Tiedemann, I have permission to use the corpus, translate it into new languages and license it as I see fit.

The corpus illustrates some linguistic features that are a nuisance to machine translation. You can imagine the context as a kind of game with four players, two men and two women. The playing board has many circles in different colours, some of which overlap. Each player has some markers in the form of three-dimensional objects such as blocks, cones and arrows. A marker can be put in any circle, and the shape of the blocks allows other objects to be put on top of them.

The corpus consists of two parts:

  1. Files for training and developing a translation model
  • these files are named blockworld.parallel + language suffix, e.g. blockworld.parallel.en
  2. Files for building a statistical language model
  • these files are named blockworld.full + language suffix, e.g. blockworld.full.en
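The naming scheme above can be turned into a small loader. The following is a minimal sketch, not part of the corpus itself: the function names are hypothetical, the language suffix for any language other than `en` is an assumption, and the sketch assumes that line N of one parallel file corresponds to line N of the other, which is the usual convention for parallel corpora but is not stated here.

```python
from pathlib import Path


def corpus_path(directory, part, lang):
    """Build a path such as <directory>/blockworld.parallel.en.

    `part` is "parallel" (translation model) or "full" (language model);
    `lang` is the language suffix, e.g. "en".
    """
    if part not in ("parallel", "full"):
        raise ValueError(f"unknown corpus part: {part!r}")
    return Path(directory) / f"blockworld.{part}.{lang}"


def read_lines(path):
    """Read a UTF-8 encoded corpus file as a list of lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_parallel(directory, src_lang, tgt_lang):
    """Pair up source and target sentences line by line.

    Assumption: the two files are line-aligned. A length mismatch is
    reported instead of being silently truncated by zip().
    """
    src = read_lines(corpus_path(directory, "parallel", src_lang))
    tgt = read_lines(corpus_path(directory, "parallel", tgt_lang))
    if len(src) != len(tgt):
        raise ValueError(
            f"line count differs: {len(src)} ({src_lang}) vs {len(tgt)} ({tgt_lang})"
        )
    return list(zip(src, tgt))
```

For example, `load_parallel(".", "en", "sv")` would return a list of (English, Swedish) sentence pairs, provided the Swedish files use the suffix `sv`; check the actual file names in the repository first.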

(The files are UTF-8 encoded, in line with common practice in machine translation. National characters such as å, ä and ö may be distorted if you use Windows and open the files with e.g. Notepad. I recommend Notepad++ for viewing and editing the files.)
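The distortion mentioned above typically appears when UTF-8 bytes are decoded with a legacy single-byte code page. The sketch below illustrates the effect on a name taken from this README; the choice of Windows-1252 (`cp1252`) as the wrong encoding is an assumption made for illustration, not a statement about what any particular editor does.

```python
# A word containing a national character, taken from this README.
sample = "Jörg"

# What is stored on disk when the text is saved as UTF-8:
# "ö" becomes the two bytes 0xC3 0xB6.
raw = sample.encode("utf-8")

# Decoding with the right encoding restores the text.
correct = raw.decode("utf-8")

# Decoding with a single-byte code page (assumed here: cp1252)
# turns each of the two bytes into a separate character.
garbled = raw.decode("cp1252")

print(correct)  # Jörg
print(garbled)  # JÃ¶rg
```

When reading the corpus files from a script, passing `encoding="utf-8"` to `open()` avoids depending on the platform's default encoding.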

Originally the corpus was intended for experiments with statistical machine translation, but it can equally well be used with rule-based systems, e.g. shallow-transfer systems such as Apertium.

The original corpus was in English and Swedish, but I have added translations into new languages: Danish, French, German and Spanish. I would appreciate it if the translations were checked by native speakers of these languages.

It would be useful to have the corpus translated into more languages. I would very much appreciate it if you translated the corpus into your language and sent the files to me.

Contributions are welcome. Just fork the repository and open a pull request, and I will merge it as soon as possible.
