This project explores how effective pretraining techniques are for morphological analysis (the morphologizer component) through experiments on multiple languages. The objective is to demonstrate how pretraining word vectors on domain-specific data benefits morphologizer performance. We use the OSCAR dataset to pretrain our vectors for tok2vec and the UD_Treebanks dataset to train a morphologizer component. We then evaluate the different pretraining techniques and compare them with each other and with models trained without any pretraining.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
`install_requirements` | Download and install all requirements |
`download_oscar` | Download a subset of the OSCAR dataset |
`download_model` | Download the specified spaCy model for vector-objective pretraining |
`extract_ud` | Extract the ud-treebanks data |
`convert_ud` | Convert the ud-treebanks data to spaCy's format |
`train` | Train a morphologizer component without pretrained weights and static vectors |
`evaluate` | Evaluate the trained morphologizer component without pretrained weights and static vectors |
`train_static` | Train a morphologizer component with static vectors from a pretrained model |
`evaluate_static` | Evaluate the trained morphologizer component with static vectors |
`pretrain_char` | Pretrain a tok2vec component with the character objective |
`train_char` | Train a morphologizer component with pretrained weights (character objective) |
`evaluate_char` | Evaluate the trained morphologizer component with pretrained weights (character objective) |
`pretrain_vector` | Pretrain a tok2vec component with the vector objective |
`train_vector` | Train a morphologizer component with pretrained weights (vector objective) |
`evaluate_vector` | Evaluate the trained morphologizer component with pretrained weights (vector objective) |
`train_trf` | Train a morphologizer component with transformer embeddings |
`evaluate_trf` | Evaluate the trained morphologizer component with transformer embeddings |
`evaluate_metrics` | Evaluate all experiments and create a summary JSON file |
`reset_project` | Reset the project to its original state and delete all training progress |
`reset_training` | Reset the training progress |
`reset_metrics` | Delete the metrics folder |
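
The note that commands are only re-run if their inputs have changed can be pictured with a small checksum comparison. The following is a minimal sketch of that idea, not Weasel's actual implementation: the function names, the JSON lock file layout, and the choice of SHA-256 are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
import json
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_rerun(name: str, deps: list[Path], lock_path: Path) -> bool:
    """Return True if a command's inputs differ from the last recorded run.

    `name`, `deps`, and `lock_path` are hypothetical: a command name, the
    input files it depends on, and a JSON file storing earlier checksums.
    """
    # A missing input can never match a recorded checksum.
    if not all(p.exists() for p in deps):
        return True
    current = {str(p): file_checksum(p) for p in deps}
    recorded = {}
    if lock_path.exists():
        recorded = json.loads(lock_path.read_text()).get(name, {})
    return current != recorded


def record_run(name: str, deps: list[Path], lock_path: Path) -> None:
    """Store the current input checksums after a command has run."""
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    lock[name] = {str(p): file_checksum(p) for p in deps}
    lock_path.write_text(json.dumps(lock, indent=2))
```

With this sketch, a command would be skipped when `needs_rerun` returns `False`, and `record_run` would be called after each successful run so that the next comparison sees the latest checksums.
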
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
Workflow | Steps |
---|---|
`data` | `download_oscar` → `download_model` → `extract_ud` → `convert_ud` |
`training` | `train` → `evaluate` |
`training_static` | `train_static` → `evaluate_static` |
`training_char` | `pretrain_char` → `train_char` → `evaluate_char` |
`training_vector` | `pretrain_vector` → `train_vector` → `evaluate_vector` |
`training_trf` | `train_trf` → `evaluate_trf` |
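
To make the "commands in order" behaviour concrete, the sketch below expands a workflow name into the sequence of `weasel run` invocations it stands for. The `WORKFLOWS` mapping is copied from the table above. The helper names `workflow_commands` and `run_workflow` are hypothetical, and the sketch only illustrates the ordering; it does not reproduce Weasel's skipping of unchanged commands.

```python
from __future__ import annotations

import subprocess

# Workflow name -> ordered command names, copied from the workflows table.
WORKFLOWS: dict[str, list[str]] = {
    "data": ["download_oscar", "download_model", "extract_ud", "convert_ud"],
    "training": ["train", "evaluate"],
    "training_static": ["train_static", "evaluate_static"],
    "training_char": ["pretrain_char", "train_char", "evaluate_char"],
    "training_vector": ["pretrain_vector", "train_vector", "evaluate_vector"],
    "training_trf": ["train_trf", "evaluate_trf"],
}


def workflow_commands(name: str) -> list[list[str]]:
    """Return one `weasel run <step>` argument list per step, in order."""
    return [["weasel", "run", step] for step in WORKFLOWS[name]]


def run_workflow(name: str) -> None:
    """Run each step in order (hypothetical helper).

    Stops at the first failing step, because `check=True` raises
    `subprocess.CalledProcessError` on a non-zero exit code.
    """
    for args in workflow_commands(name):
        subprocess.run(args, check=True)
```

In practice a single `weasel run training_char` call covers the same three steps; spelling them out this way is mainly useful for seeing which commands a workflow will trigger.
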
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
File | Source | Description |
---|---|---|
`assets/ud-treebanks-v2.5.tgz` | URL |