Showing 5 changed files with 160 additions and 164 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Syntacticus treebank data | ||
|
||
Raw annotated data for the treebanks in the Syntacticus collection. | ||
|
||
Releases of the collection are hosted on | ||
[Github](https://github.com/syntacticus/syntacticus-treebank-data). | ||
|
||
## Data formats | ||
|
||
The texts in the collection are available in two formats: | ||
|
||
1. PROIEL XML: These files are the authoritative source files and the only ones | ||
that contain all available annotation. They contain the complete morphological, | ||
syntactic and information-structure annotation, as well as the complete text, | ||
including punctuation, section headers etc. The schema is defined in | ||
[`proiel.xsd`](https://github.com/syntacticus/syntacticus-treebank-data/blob/master/proiel.xsd). | ||
|
||
2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat) |
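
Note on the CoNLL-X files mentioned above: CoNLL-X is a plain-text format with one token per line, ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and a blank line between sentences. The following is a minimal reading sketch, not the project's own tooling. The function name `read_conllx` and the sample rows are hypothetical placeholders and are not taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One CoNLL-X token line; fields follow the ten standard columns."""
    id: str
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: str
    deprel: str
    phead: str
    pdeprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of Token) from CoNLL-X formatted lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        sentence.append(Token(*fields))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Placeholder rows for illustration only (not real treebank data).
SAMPLE = (
    "1\tw1\tl1\tX\tX\t_\t2\tdep\t_\t_\n"
    "2\tw2\tl2\tX\tX\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tw3\tl3\tX\tX\t_\t0\troot\t_\t_\n"
)

sentences = list(read_conllx(SAMPLE.splitlines()))
token_count = sum(len(s) for s in sentences)
```

Since the README states that only the PROIEL XML files contain all available annotation, a reader like this is suited to quick inspection rather than to anything that needs the full annotation.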
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,59 +1,40 @@ | ||
The ISWOC Treebank | ||
================== | ||
## The ISWOC Treebank | ||
|
||
The _ISWOC Treebank_ is a dependency treebank with morphosyntactic and | ||
information-structure annotation. It includes texts in several older | ||
Indo-European languages and is freely available under a [Creative Commons | ||
Attribution-NonCommercial-ShareAlike 3.0 License]( | ||
http://creativecommons.org/licenses/by-nc-sa/3.0/us/). | ||
information-structure annotation. | ||
|
||
It includes texts in several older Indo-European languages and is freely | ||
available under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 | ||
License](https://creativecommons.org/licenses/by-nc-sa/4.0/). | ||
|
||
Please cite as | ||
|
||
> Bech, Kristin and Kristine Eide. 2014. The ISWOC corpus. Department of Literature, Area Studies and European Languages, University of Oslo. http://iswoc.github.com. | ||
Releases of the ISWOC Treebank are hosted on | ||
[Github](https://github.com/iswoc/iswoc-treebank). | ||
Please see the XML files for detailed metadata and a full list of contributors. | ||
|
||
Contents | ||
-------- | ||
### Contents | ||
|
||
The following texts are included in this release of the treebank: | ||
|
||
(The _size_ column in the table below shows the number of annotated tokens in a | ||
text. The number of tokens will be slightly larger than the number of words in | ||
the original printed edition as some words have been split into multiple tokens | ||
and some tokens have been inserted during annotation.) | ||
Text | Language | Filename | Size | ||
---- | -------- | -------- | ---- | ||
Ælfric's Lives of Saints | Old English | æls | 3137 tokens | ||
Apollonius of Tyre | Old English | apt | 5541 tokens | ||
Anglo-Saxon Chronicles | Old English | chrona | 5939 tokens | ||
Orosius | Old English | or | 1728 tokens | ||
West-Saxon Gospels | Old English | wscp | 13061 tokens | ||
La Vie Saint Eustace | Old French | eustace | 2340 tokens | ||
Crónica Geral de Espanha 2-12 | Portuguese | cge1 | 12074 tokens | ||
Crónica Geral de Espanha 155-167 | Portuguese | cge2 | 10547 tokens | ||
Décadas Livro 5, VIII, 9-14 | Portuguese | coutdec-v-8 | 13794 tokens | ||
Crónica de Alfonso XI | Spanish | alfonso-xi | 7942 tokens | ||
Crónica de España | Spanish | ce | 4627 tokens | ||
El Conde Lucanor | Spanish | cdeluc | 17551 tokens | ||
Estoria de Espanna I | Spanish | ee1 | 9488 tokens | ||
General Estoria parte IV Daniel | Spanish | ge4 | 9233 tokens | ||
Libro delos claros varones | Spanish | varones | 5820 tokens | ||
|
||
|
||
(The 'size' column in the table above shows the number of annotated tokens in | ||
a text. The number of tokens will be slightly larger than the number of words | ||
in the original printed edition as some words have been split into multiple | ||
tokens and some tokens have been inserted during annotation.) | ||
|
||
Please see the XML files for detailed metadata and a full list of contributors. | ||
|
||
Data formats | ||
------------ | ||
|
||
The texts are available on two formats: | ||
|
||
1. PROIEL XML: These files are the authoritative source files and the only ones | ||
that contain all available annotation. They contain the complete morphological, | ||
syntactic and information-structure annotation, as well as the complete text, | ||
including punctuation, section headers etc. The schema is defined in | ||
[`proiel.xsd`](https://github.com/proiel/proiel-treebank/blob/master/proiel.xsd). | ||
|
||
2. [CoNLL-X format](http://nextens.uvt.nl/depparse-wiki/DataFormat) | ||
----------------------------------------------------|---------------------|-------------|--------------- | ||
Ælfric's Lives of Saints | Old English | æls | 3,137 tokens | ||
Crónica de Alfonso XI | Spanish | alfonso-xi | 7,941 tokens | ||
Apollonius of Tyre | Old English | apt | 5,541 tokens | ||
El Conde Lucanor | Spanish | cdeluc | 17,553 tokens | ||
Crónica de España | Spanish | ce | 4,627 tokens | ||
Crónica Geral de Espanha 2-12 (ed. Lindley 1951) | Portuguese | cge1 | 12,074 tokens | ||
Crónica Geral de Espanha 155-167 (ed. Lindley 1951) | Portuguese | cge2 | 10,547 tokens | ||
Anglo-Saxon Chronicles | Old English | chrona | 5,939 tokens | ||
Décadas Livro 5, VIII, 9-14 (ed. 1. 1947) | Portuguese | coutdec-v-8 | 13,974 tokens | ||
Estoria de Espanna I | Spanish | ee1 | 9,488 tokens | ||
La Vie Saint Eustace | Old French | eustace | 2,340 tokens | ||
General Estoria parte IV Daniel | Spanish | ge4 | 9,289 tokens | ||
Orosius | Old English | or | 1,728 tokens | ||
Libro delos claros varones | Spanish | varones | 5,820 tokens | ||
West-Saxon Gospels | Old English | wscp | 13,061 tokens |
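
Note on the size column: several token counts change in this diff, so it may help to recompute them from the PROIEL XML sources. The sketch below counts tokens in one file and separately counts tokens inserted during annotation. It rests on assumptions that should be checked against `proiel.xsd`: that tokens are `token` elements without an XML namespace, and that inserted (empty) tokens carry an `empty-token-sort` attribute. The function name `count_tokens` and the sample document are hypothetical. Whether the README figures include inserted tokens is not stated in the source.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def count_tokens(xml_text: str) -> Dict[str, int]:
    """Count token elements, and those marked as inserted (empty) tokens."""
    root = ET.fromstring(xml_text)
    total = 0
    inserted = 0
    # Assumption: every annotated token is a <token> element.
    for elem in root.iter("token"):
        total += 1
        # Assumption: inserted tokens are flagged by this attribute.
        if elem.get("empty-token-sort") is not None:
            inserted += 1
    return {"total": total, "inserted": inserted}


# Placeholder document for illustration only (not real treebank data).
SAMPLE = """<proiel><source id="demo"><div><sentence>
<token id="1" form="w1"/>
<token id="2" form="w2"/>
<token id="3" empty-token-sort="V"/>
</sentence></div></source></proiel>"""

counts = count_tokens(SAMPLE)
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to `count_tokens`; the result can then be compared with the size column.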
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
## Menotec | ||
|
||
### Contents | ||
|
||
The following texts are included in this release of the treebank: | ||
|
||
(The _size_ column in the table below shows the number of annotated tokens in a | ||
text. The number of tokens will be slightly larger than the number of words in | ||
the original printed edition as some words have been split into multiple tokens | ||
and some tokens have been inserted during annotation.) | ||
Text | Language | Filename | Size | ||
----------------------------------------------------|---------------------|-------------|--------------- | ||
Konungs skuggsjá (in AM 243 bα fol, Old Norw., ca. 1275) (ed. Holm-Olsen 1945) | Old Norse | am243 | 44 tokens | ||
The Old Norwegian homily book (in AM 619 4to, Old Norw., ca. 1200-1225) (ed. Indrebø 1931) | Old Norse | hom | 60,822 tokens | ||
Landslǫg Magnúss Hákonarsónar (in Holm perg 34 4to, Old Norw., ca. 1275) | Old Norse | mll | 56,889 tokens | ||
Óláfs saga ins helga (in Upps DG 8 II, Old Norw., ca. 1225-1250) (ed. Johnsen 1922) | Old Norse | olavssaga | 42,830 tokens | ||
Pamphilus saga (in Upps DG 4-7, Old Norw., ca. 1270) | Old Norse | pamphilus | 4,254 tokens | ||
Strengleikar (in Upps DG 4-7, Old Norw., ca. 1270) (ed. Keyser 1850) | Old Norse | strleik | 38,549 tokens |