🪐 Weasel Project: Universal Dependencies v2.5 Benchmarks

This project template lets you train a spaCy pipeline on any Universal Dependencies corpus (v2.5) for benchmarking purposes. The pipeline includes an experimental trainable tokenizer, an experimental edit tree lemmatizer, and the standard spaCy tagger, morphologizer and dependency parser components. The CoNLL 2018 evaluation script is used to evaluate the pipeline.

The template uses the UD_English-EWT treebank by default, but you can swap it out for any other available treebank. Just make sure to adjust the ud_treebank and spacy_lang settings in the config. Use xx (multi-language) for spacy_lang if a particular language is not supported by spaCy.

The tokenizer in particular is only intended for use in this generic benchmarking setup. It is not optimized for speed and it does not perform particularly well for languages without space-separated tokens. In production, custom rules for spaCy's rule-based tokenizer or a language-specific word segmenter such as jieba for Chinese or sudachipy for Japanese would be recommended instead.
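
The rule for picking the two settings when swapping treebanks can be sketched in a few lines of Python. This is only an illustration: the helper name `treebank_settings` and the `supported_codes` argument are hypothetical, and the set of language codes passed in below is a placeholder, not a statement about what spaCy actually supports.

```python
def treebank_settings(ud_treebank, lang_code, supported_codes):
    """Return the two settings to adjust when swapping treebanks.

    Hypothetical helper: `supported_codes` is whatever set of language
    codes you have confirmed your spaCy installation supports.
    """
    if not ud_treebank.startswith("UD_"):
        raise ValueError(f"expected a treebank name like 'UD_English-EWT', got {ud_treebank!r}")
    # Fall back to the multi-language code when the language is not supported.
    spacy_lang = lang_code if lang_code in supported_codes else "xx"
    return {"ud_treebank": ud_treebank, "spacy_lang": spacy_lang}


# Placeholder set of supported codes, for illustration only.
print(treebank_settings("UD_English-EWT", "en", {"en", "de"}))
print(treebank_settings("UD_Wolof-WTB", "wo", {"en", "de"}))
```

The second call falls back to `xx` because `wo` is not in the placeholder set.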

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| extract | Extract the data |
| convert | Convert the data to spaCy's format |
| train-tokenizer | Train tokenizer |
| train-transformer | Train transformer |
| assemble | Assemble full pipeline |
| evaluate | Evaluate on the test data and save the metrics |
| evaluate-with-senter | Evaluate on the test data and save the metrics |
| package | Package the trained model so it can be installed |
| clean | Remove intermediate files |
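
To make the "only re-run if inputs have changed" behaviour concrete, here is a minimal sketch of one common way to implement such a check: record a content hash for each input file and compare it on the next run. It is not Weasel's actual implementation; the function names and the JSON lock file layout are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file, or None if it does not exist."""
    path = Path(path)
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else None


def _load_lock(lock_path):
    """Load previously recorded checksums (empty if nothing was recorded yet)."""
    lock_path = Path(lock_path)
    return json.loads(lock_path.read_text()) if lock_path.exists() else {}


def needs_rerun(name, deps, lock_path):
    """Return True if any input of command `name` differs from the recorded state."""
    current = {str(dep): file_checksum(dep) for dep in deps}
    return _load_lock(lock_path).get(name) != current


def record_run(name, deps, lock_path):
    """Store the current input checksums after command `name` has run."""
    lock = _load_lock(lock_path)
    lock[name] = {str(dep): file_checksum(dep) for dep in deps}
    Path(lock_path).write_text(json.dumps(lock, indent=2))
```

With this scheme, a command is skipped only when every input hashes to exactly the value stored after its last successful run; editing, adding or deleting an input makes `needs_rerun` return `True` again.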

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | extract → convert → train-tokenizer → train-transformer → assemble → evaluate → evaluate-with-senter → package |
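
The `all` workflow already runs every step in order. If you want to drive only a subset of the steps from a script (for example, everything except `package`), a small wrapper around `weasel run [name]` could look like the sketch below. The helper names are hypothetical, and the `runner` parameter exists only so the loop can be exercised without invoking Weasel.

```python
import subprocess

# Step order of the `all` workflow, as listed in the table above.
ALL_STEPS = [
    "extract",
    "convert",
    "train-tokenizer",
    "train-transformer",
    "assemble",
    "evaluate",
    "evaluate-with-senter",
    "package",
]


def build_command(step):
    """Build the argument list for running a single project command."""
    return ["weasel", "run", step]


def run_steps(steps, runner=subprocess.run):
    """Run the given steps in order and return the ones that completed.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code, so later steps are not
    started after a failure.
    """
    completed = []
    for step in steps:
        runner(build_command(step), check=True)
        completed.append(step)
    return completed


# Example (run from the project directory): everything except packaging.
# run_steps([step for step in ALL_STEPS if step != "package"])
```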

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/ud-treebanks-v2.5.tgz | URL | |
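
Before changing the treebank setting, it can be handy to check which treebanks the downloaded archive actually contains. The sketch below lists them using Python's standard `tarfile` module. It assumes that each treebank sits in a directory whose name starts with `UD_` somewhere in the archive; the function name is hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def available_treebanks(archive_path):
    """Return the sorted names of all `UD_*` directories found in the archive.

    Assumption: every treebank lives in a directory named like
    `UD_English-EWT`, at any depth inside the tarball.
    """
    names = set()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Take the first path component that looks like a treebank directory.
            for part in PurePosixPath(member.name).parts:
                if part.startswith("UD_"):
                    names.add(part)
                    break
    return sorted(names)


# Example, after fetching the asset with `weasel assets`:
# "UD_English-EWT" in available_treebanks("assets/ud-treebanks-v2.5.tgz")
```

Scanning the full archive reads through the whole compressed file, so expect this to take a moment on the real asset.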