Tagger and Parser

Environment setup

Create a conda environment with Python 3.8

conda create -n allennlp python=3.8

Activate the new environment

conda activate allennlp

Install allennlp (we use version 0.8.4) and other packages using pip

pip install -r requirements.txt

Internal note: both environments are already set up on coli servers, see instructions in the Wiki.

Parameter configuration

Adjust parameters including file paths in the respective .json config files, as needed. By default, the paths point to datasets in data. See respective README files there for details about the datasets.

Both our models consume data in CoNLL format where each line represents a token and columns are tab-separated. The column DEPRELS contains additional dependency relations if a token has more than one head. The tagger requires data in the CoNLL-2003 format with the relevant columns being the first (TEXT) and the fourth (LABEL). The parser requires data in the CoNLL-U format with the relevant columns being the second (FORM), the fifth (LABEL), the seventh (HEAD) and the eighth (DEPREL).
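As a rough illustration of the column layout described above, the following Python sketch extracts the relevant columns from tab-separated lines. The function names, the sample tokens and the placeholder labels are hypothetical, and skipping blank lines (assumed to separate sentences) and lines starting with # (assumed to be comments) is an assumption about the data files, not something taken from this repository.

```python
from typing import Dict, Iterable, List


def read_conll2003_tokens(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect TEXT (1st column) and LABEL (4th column) from tagger-style data."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # assumption: blank lines separate sentences
        cols = line.split("\t")
        tokens.append({"text": cols[0], "label": cols[3]})
    return tokens


def read_conllu_tokens(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Collect FORM (2nd), LABEL (5th), HEAD (7th) and DEPREL (8th) columns."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # assumption: blank lines and '#' comment lines carry no token
        cols = line.split("\t")
        tokens.append(
            {"form": cols[1], "label": cols[4], "head": cols[6], "deprel": cols[7]}
        )
    return tokens


# Hypothetical sample rows; the labels are placeholders, not the real tag set.
tagger_rows = ["Mix\t_\t_\tLABEL_A\n", "flour\t_\t_\tLABEL_B\n", "\n"]
parser_rows = [
    "# hypothetical comment line\n",
    "1\tMix\t_\t_\tLABEL_A\t_\t0\troot\t_\t_\n",
    "2\tflour\t_\t_\tLABEL_B\t_\t1\tREL_X\t_\t_\n",
]

print(read_conll2003_tokens(tagger_rows))
print(read_conllu_tokens(parser_rows))
```

HEAD is kept as a string here so that rows with a placeholder value do not raise an error; convert it to an integer only if your data guarantees numeric heads.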

Available AllenNLP 0.8 configurations:

tagger/tagger_with_bert_config.json - BiLSTM-CNN-CRF tagger using BERT embeddings
tagger/tagger_with_english_elmo_config.json - BiLSTM-CNN-CRF tagger using English ELMo embeddings
tagger/tagger_with_german_elmo_config.json - BiLSTM-CNN-CRF tagger using German ELMo embeddings
parser/parser_config.json - Biaffine dependency parser (Dozat and Manning, 2017)

For the ELMo taggers, we use the following ELMo parameters (i.e. options and weights):

English: weights and options (use the weights and options files under fta/ after unzipping)
German: weights and options

Internal note: the ELMo options and weight files can be found on the Saarland servers at /proj/cookbook/.

The weights and options files should be named and placed according to the paths specified in the .json files; alternatively, adjust the paths in the .json files.
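Before training, it can help to check that the ELMo paths in a config actually resolve. The sketch below walks a parsed config and lists every entry stored under the keys options_file and weight_file. Those key names are an assumption based on how AllenNLP's ELMo token embedder is usually configured, so check them against the .json files here. The helper names are hypothetical, and the sketch assumes the config is plain JSON without comments.

```python
import json
import os
from typing import Any, Iterator, List, Tuple

# Assumed key names for the ELMo options and weights entries.
ELMO_KEYS = ("options_file", "weight_file")


def find_elmo_paths(node: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (location, path) pairs for every ELMo file entry in a config tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            location = f"{prefix}.{key}" if prefix else key
            if key in ELMO_KEYS and isinstance(value, str):
                yield location, value
            else:
                yield from find_elmo_paths(value, location)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_elmo_paths(item, f"{prefix}[{index}]")


def missing_elmo_files(config_path: str) -> List[Tuple[str, str]]:
    """Return the ELMo entries whose path does not exist on the local disk."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return [
        (location, path)
        for location, path in find_elmo_paths(config)
        if not os.path.exists(path)
    ]


# Hypothetical config fragment used only to demonstrate the traversal.
example = {
    "model": {
        "text_field_embedder": {
            "elmo": {"options_file": "elmo/options.json", "weight_file": "elmo/weights.hdf5"}
        }
    }
}
for location, path in find_elmo_paths(example):
    print(location, "->", path)
```

An entry given as a URL rather than a local path would be reported as missing by missing_elmo_files, so treat its result as a hint rather than a definitive check.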

Training

Run allennlp train [params] -s [serialization dir] to train a model, where

[params] is the path to the .json config file.
[serialization dir] is the directory in which the trained model, logs and other results are saved.
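For scripted runs, the training command above can be assembled in Python. This is a minimal sketch: the helper name build_train_command and the serialization directory models/english_elmo_tagger are hypothetical, while the config path is one of the configurations listed earlier. The command is only printed here. Launching it requires the allennlp environment to be active.

```python
import shlex
from typing import List


def build_train_command(params: str, serialization_dir: str) -> List[str]:
    """Assemble `allennlp train [params] -s [serialization dir]` as an argument list."""
    return ["allennlp", "train", params, "-s", serialization_dir]


command = build_train_command(
    "tagger/tagger_with_english_elmo_config.json",
    "models/english_elmo_tagger",  # hypothetical output directory
)

# Print the command in a copy-and-paste friendly form.
print(" ".join(shlex.quote(part) for part in command))

# To actually start training, pass the list to subprocess, for example:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.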

Evaluation

Run allennlp evaluate [archive file] [input file] --output-file [output file] to evaluate the model on some evaluation data, where

[archive file] is the path to an archived trained model.
[input file] is the path to the file containing the evaluation data.
[output file] is an optional path to save the metrics as JSON; if not provided, the output will be displayed on the console.
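If the metrics are written to an output file, they can be inspected with a few lines of Python. The sketch below assumes the file holds a single JSON object that maps metric names to values. The function names and the metric names in the sample are hypothetical, and the real names depend on the model.

```python
import json
from typing import Any, Dict


def load_metrics(path: str) -> Dict[str, Any]:
    """Read the metrics file, assumed to contain one JSON object."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def format_metrics(metrics: Dict[str, Any]) -> str:
    """Render metrics one per line, sorted by name, floats to four decimals."""
    lines = []
    for name in sorted(metrics):
        value = metrics[name]
        if isinstance(value, float):
            lines.append(f"{name}: {value:.4f}")
        else:
            lines.append(f"{name}: {value}")
    return "\n".join(lines)


# Hypothetical metric names and values, for demonstration only.
sample = {"recall": 0.25, "precision": 0.5, "num_sentences": 10}
print(format_metrics(sample))
```

With a real file, call format_metrics(load_metrics(path)), where path is the value passed to --output-file.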

Performance

ERRATUM (Donatelli et al., EMNLP 2021): Please refer to our Wiki page for a list of corrections, particularly concerning the reporting of results and comparability.

Our tagger's performance compared to Y'20's performance and inter-annotator agreement (IAA).

Model	Corpus	Embedder	Precision	Recall	F-Score
IAA	100-r by Y'20		89.9	92.2	90.5

Y'20	300-r by Y'20		86.5	88.8	87.6
Our tagger	300-r by Y'20	English ELMo	89.9 ± 0.5	89.2 ± 0.4	89.6 ± 0.3
Our tagger	300-r by Y'20	multilingual BERT	88.7 ± 0.4	88.4 ± 0.1	88.5 ± 0.2

Our tagger	German	German ELMo	79.2 ± 1.4	81.2 ± 1.8	80.2 ± 1.6
Our tagger	German	multilingual BERT	75.3 ± 0.8	76.0 ± 1.0	75.7 ± 0.9

Our parser's performance compared to Y'20's performance and inter-annotator agreement (IAA).

Model	Corpus	Tag source	Precision	Recall	F-Score
IAA	100-r by Y'20	gold tags	84.4	80.4	82.3

Y'20	300-r by Y'20	gold tags	73.7	68.6	71.1
Our parser	300-r by Y'20	gold tags	80.4 ± 0.0	76.1 ± 0.0	78.2 ± 0.0

Our parser	German	gold tags	69.3 ± 0.0	91.3 ± 0.0	78.8 ± 0.0

Our parser's performance on machine-tagged data:

Model	Corpus	Tag source	Precision	Recall	F-Score
Y'20	300-r by Y'20	Y'20 tagger	51.1	37.7	43.3
Our parser	300-r by Y'20	our ELMo tagger	74.4 ± 0.5	70.4 ± 1.0	72.3 ± 0.8
Our parser	German	German ELMo tagger	56.5 ± 1.1	82.8 ± 2.2	67.1 ± 0.5

Prediction

Run allennlp predict [archive file] [input file] --use-dataset-reader --output-file [output file] to parse a file with a pretrained model, where

[archive file] is the path to an archived trained model.
[input file] is the path to the file you want to parse; this file should be in the same format as the training data, i.e. CoNLL-2003 for the tagger and CoNLL-U for the parser.
--use-dataset-reader tells the model to use the same dataset reader that was used during training.
[output file] is an optional path to save parsing results as JSON; if not provided, the output will be displayed on the console.

The output of the parser will be in JSON format. To transform this into the more readable CoNLL-U format, use data-scripts/json_to_conll.py. To get labeled evaluation results for parser output, use the script data-scripts/parser_evaluation.py. Instructions for their use can be found in data-scripts/README.md.
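For a quick look at the raw JSON predictions, a sketch like the following can help. It assumes the output file holds one JSON object per line, which should be checked against your own output, and it makes no assumption about which keys each object contains. The function names and the keys in the sample are hypothetical. This is not a substitute for data-scripts/json_to_conll.py.

```python
import json
from collections import Counter
from typing import Any, Dict, Iterable, List


def parse_predictions(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line (assumed output layout)."""
    predictions = []
    for line in lines:
        line = line.strip()
        if line:
            predictions.append(json.loads(line))
    return predictions


def key_counts(predictions: Iterable[Dict[str, Any]]) -> Counter:
    """Count how often each top-level key occurs across all predictions."""
    counts: Counter = Counter()
    for prediction in predictions:
        counts.update(prediction.keys())
    return counts


# Hypothetical prediction lines; real key names depend on the model.
sample_lines = [
    '{"words": ["Mix", "flour"], "heads": [0, 1]}\n',
    "\n",
    '{"words": ["Stir"], "heads": [0]}\n',
]
predictions = parse_predictions(sample_lines)
print(len(predictions), "predictions")
print(dict(key_counts(predictions)))
```

With a real file, pass an open file handle, for example: with open(path, encoding="utf-8") as handle: predictions = parse_predictions(handle).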

For sample inputs and outputs see English/Samples.
