This guide explains how to run the NER models in our code. Before you start, please make sure the packages required to run the code are installed. We share our NER datasets in data.zip.

Sentence-level training and evaluation has been widely used in previous work, while recent work finds that feeding the whole document into a transformer-based model can significantly improve NER accuracy (document-level). Both approaches are practical and essential in different scenarios, so this guide covers how to run both kinds of models.
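The difference between the two settings can be sketched in a few lines. This is an illustrative toy example only (the function names and the token budget are assumptions, not this repo's API): sentence-level encoding feeds each sentence on its own, while document-level encoding surrounds the target sentence with neighbouring sentences up to a budget.

```python
# Illustrative sketch only: contrasts sentence-level input (one sentence
# at a time) with document-level input (the sentence plus surrounding
# document context, up to a token budget). Names and the budget value
# are assumptions, not the repo's actual code.

def sentence_level_inputs(document):
    """Each sentence is encoded on its own."""
    return [list(sent) for sent in document]

def document_level_input(document, target_idx, max_tokens=512):
    """Concatenate sentences around the target until the budget is hit."""
    tokens = list(document[target_idx])
    left, right = target_idx - 1, target_idx + 1
    while left >= 0 or right < len(document):
        if left >= 0:
            cand = list(document[left]) + tokens
            if len(cand) > max_tokens:
                break
            tokens, left = cand, left - 1
        if right < len(document):
            cand = tokens + list(document[right])
            if len(cand) > max_tokens:
                break
            tokens, right = cand, right + 1
    return tokens

doc = [["EU", "rejects", "German", "call"],
       ["Peter", "Blackburn"],
       ["BRUSSELS", "1996-08-22"]]
print(sentence_level_inputs(doc)[1])            # ['Peter', 'Blackburn']
print(document_level_input(doc, 1, max_tokens=10))
```

In the document-level case the target sentence is encoded together with its neighbours, so the transformer's self-attention can use cross-sentence context when disambiguating entity mentions.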
Download conll_en_ner_model.zip at OneDrive. Unzip the file and move the unzipped directory to `resources/taggers`.

To evaluate the accuracy of the model, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --test
```
To train the model yourself, you first need to fine-tune the transformer-based models (i.e., BERT and M-BERT). Train them with:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/mbert-en-ner-finetune.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/bert-en-ner-finetune.yaml
```
You can find the trained models and their fine-tuned embeddings in `resources/taggers`. Modify the embedding paths in `config/conll_03_english.yaml` to point to your trained models. For example:
```yaml
TransformerWordEmbeddings-1:
  layers: '-1'
  model: resources/taggers/en-bert_10epoch_32batch_0.00005lr_10000lrrate_en_monolingual_nocrf_fast_relearn_sentbatch_sentloss_finetune_saving_nodev_newner3/bert-base-cased
  pooling_operation: first
TransformerWordEmbeddings-2:
  layers: '-1'
  model: resources/taggers/multi-bert_10epoch_32batch_0.00005lr_10000lrrate_en_monolingual_nocrf_fast_relearn_sentbatch_sentloss_finetune_saving_nodev_newner3/bert-base-multilingual-cased
  pooling_operation: first
```
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_english.yaml --parse --target_dir $dir --keep_order
```

Note that you may need to preprocess your file with dummy tags; please check this issue for more details.
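If your own file is plain text with one sentence per line, a preprocessing step along the following lines can add the dummy tags. This is a hedged sketch: the two-column CoNLL-style layout with a blank line between sentences is an assumption, so check the issue referenced above for the exact format the parser expects.

```python
# Hedged sketch: convert plain text (one sentence per line) into a
# CoNLL-style column file with a dummy "O" tag per token, so the
# --parse mode has a tagged file to read. The exact column layout the
# parser expects may differ; see the issue linked in the guide.

def add_dummy_tags(lines):
    out = []
    for line in lines:
        for token in line.split():
            out.append(f"{token} O")   # dummy tag for every token
        out.append("")                 # blank line separates sentences
    return "\n".join(out)

text = ["EU rejects German call", "Peter Blackburn"]
print(add_dummy_tags(text))
```

Running this on the two example sentences produces one `token O` pair per line, with an empty line after each sentence.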
- Download doc_ner_best.zip at OneDrive. Unzip the file and move the unzipped directory to `resources/taggers`.
- The model needs the pre-extracted document-level features of `bert-base-cased`, `bert-large-cased` and `bert-base-multilingual-cased`. Pre-extracted features: bert-base-cased.hdf5, bert-large-cased.hdf5 and bert-base-multilingual-cased.hdf5. (Note that bert-base-multilingual-cased.hdf5 contains the pre-extracted features for all languages of the CoNLL datasets, so you do not need to download it again if you have already downloaded it.) If you want to extract the features yourself, see (Optional) Extract Document Features for the guide to extracting the document-level features.
- The model also needs the fine-tuned document-level embeddings: download `en-xlm-roberta-large.zip`, `en-roberta-large.zip` and `en-xlnet-large-cased.zip` at OneDrive in fine-tuned models and unzip them.
- Change the path of the embeddings (`model:`) in `config/doc_ner_best.yaml`.
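For illustration, an embedding entry in `config/doc_ner_best.yaml` might end up looking like this after the change. The path below is a placeholder, not an actual directory name:

```yaml
# Placeholder path -- point `model:` at the directory you unzipped the
# corresponding fine-tuned embedding into; likewise for the other
# document-level embedding entries.
TransformerWordEmbeddings-0:
  layers: '-1'
  model: <path-to-unzipped>/en-xlnet-large-cased
  pooling_operation: first
  v2_doc: true
```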
To evaluate the accuracy of the model, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_best.yaml --test
```
To train the model yourself, you first need to fine-tune the transformer-based models (i.e., RoBERTa, XLM-R and XLNet). Train them with:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlnet-doc-en-ner-finetune.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-doc-en-ner-finetune.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/roberta-doc-en-ner-finetune.yaml
```
You can find the trained models and their fine-tuned embeddings in `resources/taggers`. Modify the embedding paths in `config/doc_ner_best.yaml` to point to your trained models. For example:
```yaml
TransformerWordEmbeddings-0:
  layers: '-1'
  model: resources/taggers/xlnet-first-docv2_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_eng_monolingual_nocrf_fast_norelearn_sentbatch_sentloss_finetune_nodev_saving_ner4/xlnet-large-cased
  pooling_operation: first
  v2_doc: true
TransformerWordEmbeddings-1:
  layers: '-1'
  model: resources/taggers/xlmr-first-docv2_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_eng_monolingual_nocrf_fast_norelearn_sentbatch_sentloss_finetune_nodev_saving_ner3/xlm-roberta-large
  pooling_operation: first
  v2_doc: true
TransformerWordEmbeddings-2:
  layers: '-1'
  model: resources/taggers/en-xlmr-first-docv2_10epoch_1batch_4accumulate_0.000005lr_10000lrrate_eng_monolingual_nocrf_fast_norelearn_sentbatch_sentloss_finetune_nodev_saving_ner5/roberta-large
  pooling_operation: first
  v2_doc: true
```
Please follow the instructions above.
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_best.yaml
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_best.yaml --parse --target_dir $dir --keep_order
```
Note that:
- You may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
- You need to pre-extract document-level features of the `bert-base-cased`, `bert-large-cased` and `bert-base-multilingual-cased` embeddings. Please follow (Optional) Extract Document Features.
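Since a run fails late if a feature file is absent, a quick sanity check before launching can save time. This is a hedged sketch: the file names come from the list above, but the directory your config points to is an assumption you should adjust.

```python
# Hedged sketch: check that the pre-extracted document-level feature
# files exist before launching a run. The directory is an assumption --
# pass whatever location your config actually points to.
from pathlib import Path

REQUIRED = [
    "bert-base-cased.hdf5",
    "bert-large-cased.hdf5",
    "bert-base-multilingual-cased.hdf5",
]

def missing_features(feature_dir):
    """Return the names of required feature files not found in feature_dir."""
    d = Path(feature_dir)
    return [name for name in REQUIRED if not (d / name).is_file()]

missing = missing_features(".")
if missing:
    print("Missing feature files:", ", ".join(missing))
```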
The CoNLL German dataset has two versions: the original 2003 version and a revised 2006 version. Currently, we release 2003 models for both sentence-level and document-level, and a 2006 model for document-level.
Download conll_03_de_model.zip at OneDrive. Unzip the file and move the unzipped directory to `resources/taggers`.

To evaluate the accuracy of the model, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_de_model.yaml --test
```
To train the model yourself, you first need to fine-tune the transformer-based models (i.e., BERT and M-BERT). Train them with:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/mbert-de-03-ner-finetune.yaml
CUDA_VISIBLE_DEVICES=0 python train.py --config config/bert-de-03-ner-finetune.yaml
```
You can find the trained models and their fine-tuned embeddings in `resources/taggers`. Modify the embedding paths in `config/conll_03_de_model.yaml` to point to your trained models. For example:
```yaml
TransformerWordEmbeddings-1:
  layers: '-1'
  model: resources/taggers/de-bert_10epoch_32batch_0.00005lr_10000lrrate_de_monolingual_nocrf_fast_relearn_sentbatch_sentloss_finetune_saving_nodev_newner3/bert-base-german-cased
  pooling_operation: first
TransformerWordEmbeddings-2:
  layers: '-1'
  model: resources/taggers/multi-bert_10epoch_32batch_0.00005lr_10000lrrate_de_monolingual_nocrf_fast_relearn_sentbatch_sentloss_finetune_saving_nodev_newner3/bert-base-multilingual-cased
  pooling_operation: first
```
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_de_model.yaml
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/conll_03_de_model.yaml --parse --target_dir $dir --keep_order
```

Note that you may need to preprocess your file with dummy tags; please check this issue for more details.
- Download doc_ner_de_03_best.zip and doc_ner_de_06_best.zip at OneDrive. Unzip the files and move the unzipped directories to `resources/taggers`.
- Both models need the pre-extracted document-level features of `bert-base-german-dbmdz-cased` and `bert-base-multilingual-cased`. Pre-extracted features: bert-base-german-dbmdz-cased.hdf5 and bert-base-multilingual-cased.hdf5. (Note that bert-base-multilingual-cased.hdf5 contains the pre-extracted features for all languages of the CoNLL datasets, so you do not need to download it again if you have already downloaded it.) If you want to extract the features yourself, see (Optional) Extract Document Features for the guide to extracting the document-level features.
- The models also need the fine-tuned document-level embeddings: download `de-xlm-roberta-large.zip` at OneDrive in fine-tuned models and unzip it.
- Change the path of the embeddings (`model:`) in `config/doc_ner_de_03_best.yaml` or `config/doc_ner_de_06_best.yaml`.
To evaluate the accuracy of the models, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_03_best.yaml --test # for version 2003
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_06_best.yaml --test # for version 2006
```
To train the model yourself, you first need to fine-tune the transformer-based model (i.e., XLM-R). Train the following models:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-doc-de-03-ner-finetune.yaml # for version 2003
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-doc-de-06-ner-finetune.yaml # for version 2006
```
You can find the trained models and their fine-tuned embeddings in `resources/taggers`. Modify the embedding paths in `config/doc_ner_de_03_best.yaml` or `config/doc_ner_de_06_best.yaml` to point to your trained models.
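The entries follow the same pattern as for English. With a hypothetical run-directory name (substitute the directory your own fine-tuning run actually created under `resources/taggers`), a single XLM-R entry would look roughly like:

```yaml
# <your-de-run-dir> is a placeholder for the run directory created by
# your fine-tuning run under resources/taggers.
TransformerWordEmbeddings-0:
  layers: '-1'
  model: resources/taggers/<your-de-run-dir>/xlm-roberta-large
  pooling_operation: first
  v2_doc: true
```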
Please follow the instructions above.
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_03_best.yaml # for version 2003
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_06_best.yaml # for version 2006
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_03_best.yaml --parse --target_dir $dir --keep_order # for version 2003
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_de_06_best.yaml --parse --target_dir $dir --keep_order # for version 2006
```
Note that:
- You may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
- You need to pre-extract document-level features of the `bert-base-german-dbmdz-cased` and `bert-base-multilingual-cased` embeddings. Please follow (Optional) Extract Document Features.
Currently, we only release the document-level model.
- Download doc_ner_nl_best.zip at OneDrive. Unzip the file and move the unzipped directory to `resources/taggers`.
- The model needs the pre-extracted document-level features of `bert-base-dutch-cased-finetuned-conll2002-ner` and `bert-base-multilingual-cased`. Pre-extracted features: bert-base-dutch-cased-finetuned-conll2002-ner.hdf5 and bert-base-multilingual-cased.hdf5. (Note that bert-base-multilingual-cased.hdf5 contains the pre-extracted features for all languages of the CoNLL datasets, so you do not need to download it again if you have already downloaded it.) If you want to extract the features yourself, see (Optional) Extract Document Features for the guide to extracting the document-level features.
- The model also needs the fine-tuned document-level embeddings: download `nl-xlm-roberta-large.zip` at OneDrive in fine-tuned models and unzip it.
- Change the path of the embeddings (`model:`) in `config/doc_ner_nl_best.yaml`.
To evaluate the accuracy of the model, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_nl_best.yaml --test
```
To train the model yourself, you first need to fine-tune the transformer-based model (i.e., XLM-R). Train the following model:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-doc-nl-ner-finetune.yaml
```
You can find the trained model and its fine-tuned embeddings in `resources/taggers`. Modify the embedding path in `config/doc_ner_nl_best.yaml` to point to your trained model. Please follow the instructions above.
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_nl_best.yaml
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_nl_best.yaml --parse --target_dir $dir --keep_order
```
Note that:
- You may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
- You need to pre-extract document-level features of the `bert-base-dutch-cased-finetuned-conll2002-ner` and `bert-base-multilingual-cased` embeddings. Please follow (Optional) Extract Document Features.
Currently, we only release the document-level model.
- Download es_doc_ner_new.zip at OneDrive. Unzip the file and move the unzipped directory to `resources/taggers`. (We use es_doc_ner_new.zip instead of doc_ner_es_best.zip because one of the embeddings from transformers was possibly trained on the test data, which results in an extremely high score on the dataset.)
- The model needs the pre-extracted document-level features of `bert-spanish-cased-finetuned-ner` and `bert-base-multilingual-cased`. Pre-extracted features: bert-spanish-cased-finetuned-ner.hdf5 and bert-base-multilingual-cased.hdf5. (Note that bert-base-multilingual-cased.hdf5 contains the pre-extracted features for all languages of the CoNLL datasets, so you do not need to download it again if you have already downloaded it.) If you want to extract the features yourself, see (Optional) Extract Document Features for the guide to extracting the document-level features.
- The model also needs the fine-tuned document-level embeddings: download `es-xlm-roberta-large.zip` at OneDrive in fine-tuned models and unzip it.
- Change the path of the embeddings (`model:`) in `config/doc_ner_es_best.yaml`.
To evaluate the accuracy of the model, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_es_best.yaml --test
```
To train the model yourself, you first need to fine-tune the transformer-based model (i.e., XLM-R). Train the following model:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/xlmr-doc-es-ner-finetune.yaml
```
You can find the trained model and its fine-tuned embeddings in `resources/taggers`. Modify the embedding path in `config/doc_ner_es_best.yaml` to point to your trained model. Please follow the instructions above.
Finally, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_es_best.yaml
```
To use the model to predict on your own file, run:

```
CUDA_VISIBLE_DEVICES=0 python train.py --config config/doc_ner_es_best.yaml --parse --target_dir $dir --keep_order
```
Note that:
- You may need to preprocess your file with dummy tags for prediction; please check this issue for more details.
- You need to pre-extract document-level features of the `bert-spanish-cased-finetuned-ner` and `bert-base-multilingual-cased` embeddings. Please follow (Optional) Extract Document Features.