This repository contains the code for BioNER, an LSTM-based model designed for biomedical named entity recognition (NER).
We provide models trained on the following datasets:
Dataset | Mirror (Siasky) | Mirror (Mega) |
---|---|---|
MedMentions full | Download Model | Download Model |
MedMentions ST21pv | Download Model | Download Model |
JNLPBA | Download Model | Download Model |
In addition, we provide word embeddings trained with fastText on the PubMed Baseline 2021 for the following character n-gram ranges:
n-gram range | Mirror (Siasky) | Mirror (Mega) | Mirror (Storj) |
---|---|---|---|
3-4 | Download | Download | Download |
3-6 | Download | Download | Download |
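If you want to sanity-check a downloaded embedding file before training, the fasttext Python bindings can load a binary model and query subword-based vectors. This is only an illustrative sketch: the file name `embeddings_3-4.bin` is a placeholder, and it assumes the mirrors ship fastText `.bin` models; BioNER itself consumes the embeddings via the `--embeddings` argument shown below.

```python
# Illustrative only: quick sanity check of a downloaded fastText binary.
# "embeddings_3-4.bin" is a placeholder name, not a file from this repository.
import fasttext

model = fasttext.load_model("embeddings_3-4.bin")

# fastText builds vectors from character n-grams, so even unseen
# biomedical tokens receive a subword-based embedding.
vector = model.get_word_vector("dexamethasone")
print(vector.shape)

# Nearest neighbours give a rough impression of embedding quality.
print(model.get_nearest_neighbors("dexamethasone", k=5))
```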
Install the dependencies:

```sh
pip install -r requirements.txt
```
As deterministic behaviour is enabled by default, you may need to set the environment variable CUBLAS_WORKSPACE_CONFIG to prevent RuntimeErrors when using CUDA:

```sh
export CUBLAS_WORKSPACE_CONFIG=:4096:8
```
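This is needed because PyTorch's deterministic mode makes certain cuBLAS operations raise a RuntimeError unless a fixed workspace size is configured. The snippet below is a minimal, self-contained illustration of that interaction, assuming BioNER runs on PyTorch; it is not code from this repository.

```python
# Minimal illustration (assumes PyTorch): deterministic algorithms on CUDA
# require CUBLAS_WORKSPACE_CONFIG, otherwise some operations raise a
# RuntimeError. Setting the variable here, before any CUDA work, has the
# same effect as the `export` shown above.
import os

os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

import torch

torch.use_deterministic_algorithms(True)

if torch.cuda.is_available():
    # Without CUBLAS_WORKSPACE_CONFIG this matmul would fail in
    # deterministic mode on CUDA >= 10.2.
    x = torch.randn(4, 4, device="cuda")
    print(x @ x)
```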
BioNER expects a dataset in the CoNLL-2003 format. We used the tool bconv for preprocessing the MedMentions dataset.
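For reference, the sketch below shows the token-per-line layout that CoNLL-2003-style files use (entity tag in the last column, blank line between sentences), together with a tiny reader. The example sentences and tag names (B-Chemical, B-Protein) are made up for illustration and are not taken from MedMentions or JNLPBA.

```python
# Illustrative sketch of a CoNLL-2003-style file: one token per line,
# the entity tag in the last column, and a blank line between sentences.
# The example data below is invented for demonstration purposes.
EXAMPLE = """\
Dexamethasone B-Chemical
suppresses O
cytokine O
release O
. O

Interleukin-6 B-Protein
levels O
decreased O
. O
"""

def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs."""
    sentence = []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))  # token, last column = tag
    if sentence:
        yield sentence

for sent in read_conll(EXAMPLE.splitlines()):
    print(sent)
```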
You can either use the provided Makefile to train the BioNER model or execute train_bioner.py directly.

Makefile:
Don't forget to fill in the empty fields in the Makefile before the first run.

```sh
make train-bioner ngrams=3-4
```
You can annotate a CoNLL-2003 dataset in the following way:

```sh
python annotate_dataset.py \
  --embeddings <path to the word embeddings file> \
  --dataset <path to the CoNLL-2003 dataset> \
  --outputFile <path to the output file for storing the annotated dataset> \
  --model <path to the trained BioNER model>
```
Furthermore, you can add the flag --enableExportCoNLL to export an additional file in the same parent folder as the outputFile, which can be used for evaluation with the original conlleval.pl Perl script (source).