Add README files.
mlj committed Apr 26, 2023
1 parent cdea4ed commit 525cee4
Showing 5 changed files with 160 additions and 164 deletions.
18 changes: 18 additions & 0 deletions README.md
@@ -0,0 +1,18 @@
# Syntacticus treebank data

Raw annotated data for the treebanks in the Syntacticus collection.

Releases of the collection are hosted on
[GitHub](https://github.com/syntacticus/syntacticus-treebank-data).

## Data formats

The texts in the collection are available in two formats:

1. PROIEL XML: These files are the authoritative source files and the only ones
that contain all available annotation. They contain the complete morphological,
syntactic and information-structure annotation, as well as the complete text,
including punctuation, section headers etc. The schema is defined in
[`proiel.xsd`](https://github.com/syntacticus/syntacticus-treebank-data/blob/master/proiel.xsd).

2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
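
As a rough illustration of working with the CoNLL-X files, the sketch below reads sentences from tab-separated CoNLL-X text. It relies only on the general CoNLL-X layout (ten tab-separated columns per token, a blank line between sentences). The `Token` type, the `read_conllx` function and the sample rows are hypothetical names and invented data for this example; the tag and relation labels in the sample are placeholders, not the treebank's actual tag set.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """The first eight CoNLL-X columns (PHEAD and PDEPREL are ignored)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # HEAD is 0 for the root; treat a non-numeric value such as "_" as missing.
        head = int(cols[6]) if cols[6].isdigit() else None
        sentence.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3],
                  cols[4], cols[5], head, cols[7])
        )
    if sentence:
        yield sentence


# Invented sample data, NOT taken from the treebank files.
SAMPLE = (
    "1\tGallia\tGallia\tN\tN\t_\t2\tSUBJ\t_\t_\n"
    "2\test\tsum\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tveni\tvenio\tV\tV\t_\t0\tROOT\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        print([tok.form for tok in sent])
```

To read an actual file, pass an open file handle instead of the sample, for example `read_conllx(open(path, encoding="utf-8"))`, where `path` is whatever location the CoNLL-X file has in your checkout. Since the PROIEL XML files are the only ones with the full annotation, information-structure annotation should not be expected in this output.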
75 changes: 28 additions & 47 deletions iswoc/README.md
@@ -1,59 +1,40 @@
The ISWOC Treebank
==================
## The ISWOC Treebank

The _ISWOC Treebank_ is a dependency treebank with morphosyntactic and
information-structure annotation. It includes texts in several older
Indo-European languages and is freely available under a [Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 License](
http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
information-structure annotation.

It includes texts in several older Indo-European languages and is freely
available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
License](https://creativecommons.org/licenses/by-nc-sa/4.0/).

Please cite as

> Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.com.
Releases of the ISWOC Treebank are hosted on
[GitHub](https://github.com/iswoc/iswoc-treebank).
Please see the XML files for detailed metadata and a full list of contributors.

Contents
--------
### Contents

The following texts are included in this release of the treebank:

(The _size_ column in the table below shows the number of annotated tokens in a
text. The number of tokens will be slightly larger than the number of words in
the original printed edition as some words have been split into multiple tokens
and some tokens have been inserted during annotation.)
Text | Language | Filename | Size
---- | -------- | -------- | ----
Ælfric's Lives of Saints | Old English | æls | 3137 tokens
Apollonius of Tyre | Old English | apt | 5541 tokens
Anglo-Saxon Chronicles | Old English | chrona | 5939 tokens
Orosius | Old English | or | 1728 tokens
West-Saxon Gospels | Old English | wscp | 13061 tokens
La Vie Saint Eustace | Old French | eustace | 2340 tokens
Crónica Geral de Espanha 2-12 | Portuguese | cge1 | 12074 tokens
Crónica Geral de Espanha 155-167 | Portuguese | cge2 | 10547 tokens
Décadas Livro 5, VIII, 9-14 | Portuguese | coutdec-v-8 | 13794 tokens
Crónica de Alfonso XI | Spanish | alfonso-xi | 7942 tokens
Crónica de España | Spanish | ce | 4627 tokens
El Conde Lucanor | Spanish | cdeluc | 17551 tokens
Estoria de Espanna I | Spanish | ee1 | 9488 tokens
General Estoria parte IV Daniel | Spanish | ge4 | 9233 tokens
Libro delos claros varones | Spanish | varones | 5820 tokens


(The 'size' column in the table above shows the number of annotated tokens in
a text. The number of tokens will be slightly larger than the number of words
in the original printed edition as some words have been split into multiple
tokens and some tokens have been inserted during annotation.)

Please see the XML files for detailed metadata and a full list of contributors.

Data formats
------------

The texts are available on two formats:

1. PROIEL XML: These files are the authoritative source files and the only ones
that contain all available annotation. They contain the complete morphological,
syntactic and information-structure annotation, as well as the complete text,
including punctuation, section headers etc. The schema is defined in
[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).

2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
----------------------------------------------------|---------------------|-------------|---------------
Ælfric's Lives of Saints | Old English | æls | 3,137 tokens
Crónica de Alfonso XI | Spanish | alfonso-xi | 7,941 tokens
Apollonius of Tyre | Old English | apt | 5,541 tokens
El Conde Lucanor | Spanish | cdeluc | 17,553 tokens
Crónica de España | Spanish | ce | 4,627 tokens
Crónica Geral de Espanha 2-12 (ed. Lindley 1951) | Portuguese | cge1 | 12,074 tokens
Crónica Geral de Espanha 155-167 (ed. Lindley 1951) | Portuguese | cge2 | 10,547 tokens
Anglo-Saxon Chronicles | Old English | chrona | 5,939 tokens
Décadas Livro 5, VIII, 9-14 (ed. 1. 1947) | Portuguese | coutdec-v-8 | 13,974 tokens
Estoria de Espanna I | Spanish | ee1 | 9,488 tokens
La Vie Saint Eustace | Old French | eustace | 2,340 tokens
General Estoria parte IV Daniel | Spanish | ge4 | 9,289 tokens
Orosius | Old English | or | 1,728 tokens
Libro delos claros varones | Spanish | varones | 5,820 tokens
West-Saxon Gospels | Old English | wscp | 13,061 tokens
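
The size column can be tallied per language with a few lines of code. The following is a minimal sketch that assumes rows shaped like the table above (four `|`-separated cells, with a size such as `1,728 tokens` in the last cell); the function name `tokens_per_language` is hypothetical, and the sample uses three rows copied from the table.

```python
import re
from collections import defaultdict
from typing import Dict

# Matches a size cell such as "13,061 tokens".
SIZE_RE = re.compile(r"^([\d,]+)\s+tokens$")


def tokens_per_language(table_text: str) -> Dict[str, int]:
    """Sum the token counts in the size column, grouped by language."""
    totals: Dict[str, int] = defaultdict(int)
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 4:
            continue  # not a table row
        match = SIZE_RE.match(cells[3])
        if match is None:
            continue  # header or separator row
        totals[cells[1]] += int(match.group(1).replace(",", ""))
    return dict(totals)


SAMPLE_ROWS = """\
Text | Language | Filename | Size
-----|----------|----------|-----
Apollonius of Tyre | Old English | apt | 5,541 tokens
La Vie Saint Eustace | Old French | eustace | 2,340 tokens
Orosius | Old English | or | 1,728 tokens
"""

if __name__ == "__main__":
    for language, total in sorted(tokens_per_language(SAMPLE_ROWS).items()):
        print(f"{language}: {total}")
```

Because the size column counts annotated tokens, these totals will be slightly larger than word counts taken from the printed editions.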
18 changes: 18 additions & 0 deletions menotec/README.md
@@ -0,0 +1,18 @@
## Menotec

### Contents

The following texts are included in this release of the treebank:

(The _size_ column in the table below shows the number of annotated tokens in a
text. The number of tokens will be slightly larger than the number of words in
the original printed edition as some words have been split into multiple tokens
and some tokens have been inserted during annotation.)
Text | Language | Filename | Size
----------------------------------------------------|---------------------|-------------|---------------
Konungs skuggsjá (in AM 243 bα fol, Old Norw., ca. 1275) (ed. Holm-Olsen 1945) | Old Norse | am243 | 44 tokens
The Old Norwegian homily book (in AM 619 4to, Old Norw., ca. 1200-1225) (ed. Indrebø 1931) | Old Norse | hom | 60,822 tokens
Landslǫg Magnúss Hákonarsónar (in Holm perg 34 4to, Old Norw., ca. 1275) | Old Norse | mll | 56,889 tokens
Óláfs saga ins helga (in Upps DG 8 II, Old Norw., ca. 1225-1250) (ed. Johnsen 1922) | Old Norse | olavssaga | 42,830 tokens
Pamphilus saga (in Upps DG 4-7, Old Norw., ca. 1270) | Old Norse | pamphilus | 4,254 tokens
Strengleikar (in Upps DG 4-7, Old Norw., ca. 1270) (ed. Keyser 1850) | Old Norse | strleik | 38,549 tokens
74 changes: 29 additions & 45 deletions proiel/README.md
@@ -1,46 +1,20 @@
The PROIEL Treebank
===================
## The PROIEL Treebank

The _PROIEL Treebank_ is a dependency treebank with morphosyntactic and
information-structure annotation. It includes texts in several ancient
Indo-European languages and is freely available under a [Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 License](
http://creativecommons.org/licenses/by-nc-sa/3.0/us/).
information-structure annotation.

It includes texts in several ancient Indo-European languages and is freely
available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0
License](https://creativecommons.org/licenses/by-nc-sa/4.0/).

Please cite as

> Dag T. T. Haug and Marius L. Jøhndal. 2008. 'Creating a Parallel Treebank of the Old Indo-European Bible Translations'. In Caroline Sporleder and Kiril Ribarov (eds.). Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008) (2008), pp. 27-34.
Releases of the PROIEL Treebank are hosted on
[GitHub](https://github.com/proiel/proiel-treebank).

Contents
--------

The following texts are included in this release of the treebank:

Text | Language | Filename | Size
----------------------------------------------------|---------------------|-------------|---------------
The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,763 tokens
The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,211 tokens
Codex Marianus (ed. Jagić 1883) | Old Church Slavonic | marianus | 58,269 tokens
Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
Caesar, Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,607 tokens
Cicero, De officiis (ed. Miller 1913) | Latin | cic-off | 10,644 tokens
Cicero, Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 42,855 tokens
Palladius, Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens
Herodotus, Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,080 tokens
Sphrantzes, Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens

(The 'size' column in the table above shows the number of annotated tokens in
a text. The number of tokens will be slightly larger than the number of words
in the original printed edition as some words have been split into multiple
tokens and some tokens have been inserted during annotation.)

Please see the XML files for detailed metadata and a full list of contributors.

### Completeness

Some sentences have not yet been annotated. This is an overview of where in the
texts unannotated sentences occur:

@@ -64,17 +38,27 @@ Sections or section ranges in which there are gaps:
* `marianus`: MATT 5, MARK 16, LUKE 2, LUKE 24, JOHN 1-2, JOHN 18, JOHN 20
* `pal-agr`: 1.4-1.12, 1.35-1.40, 2.3, 2.9-2.23, 3.9-3.10

These gaps will be completed in future releases.
These gaps may be closed in future releases.

Data formats
------------
### Contents

The texts are available on two formats:

1. PROIEL XML: These files are the authoritative source files and the only ones
that contain all available annotation. They contain the complete morphological,
syntactic and information-structure annotation, as well as the complete text,
including punctuation, section headers etc. The schema is defined in
[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd).
The following texts are included in this release of the treebank:

2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat)
(The _size_ column in the table below shows the number of annotated tokens in a
text. The number of tokens will be slightly larger than the number of words in
the original printed edition as some words have been split into multiple tokens
and some tokens have been inserted during annotation.)
Text | Language | Filename | Size
----------------------------------------------------|---------------------|-------------|---------------
The Armenian New Testament (ed. Künzle 1984) | Classical Armenian | armenian-nt | 23,513 tokens
Commentarii belli Gallici (ed. Holmes 1914) | Latin | caes-gal | 28,657 tokens
Chronicles (post-1453) (ed. Grecu 1966) | Ancient Greek | chron | 24,612 tokens
Epistulae ad Atticum (ed. Purser 1901) | Latin | cic-att | 47,528 tokens
De officiis (ed. Miller 1913) | Latin | cic-off | 11,995 tokens
The Gothic Bible (ed. Streitberg 1919) | Gothic | gothic-nt | 57,212 tokens
The Greek New Testament (ed. Tischendorf 1869) | Ancient Greek | greek-nt | 140,773 tokens
Histories (ed. Godley 1920) | Ancient Greek | hdt | 85,166 tokens
Jerome's Vulgate | Latin | latin-nt | 112,454 tokens
Codex Marianus (ed. Jagić 1883) | Church Slavic | marianus | 64,138 tokens
Opus agriculturae (ed. Schmitt 1898) | Latin | pal-agr | 12,148 tokens
Peregrinatio Aetheriae (ed. Heraeus 1908) | Latin | per-aeth | 18,356 tokens