The repo contains four training pipelines: general NER, bio-medical NER, wiki entity linking, and medical entity linking.
General NER
- Task: NER in the general domain.
- Model: General BERT.
- Dataset for model training: English data from the CoNLL 2003 shared task. It contains four types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC. It can be downloaded from the DeepAI website.
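The CoNLL 2003 data is a column-formatted text file: each line carries a token with its POS tag, chunk tag, and NER tag, and blank lines separate sentences. The snippet below is an illustrative sample (not taken from this repo); the exact IOB tagging variant can differ between copies of the dataset.

```python
# Illustrative CoNLL 2003-style sample (not from this repo); columns are
# token, POS tag, chunk tag, NER tag, with blank lines between sentences.
sample = """EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
"""

for line in sample.splitlines():
    if not line.strip():          # blank line marks a sentence boundary
        continue
    token, pos, chunk, ner = line.split()
    print(f"{token:10s} {ner}")
```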
Use case: Analyzing research papers on COVID-19
- Materials to use for testing: research papers on COVID-19.
- Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).
- Dataset for model training: NER: CORD-NER dataset; entity linking: CORD-NERD dataset.
Bio-medical NER
- Task: NER in the bio-medical domain.
- Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).
- Dataset for model training: MTL-Bioinformatics-2016. You can download this dataset and learn more about it on its GitHub repo.
Wiki entity linking
- Task: Entity linking in the general domain.
- Model: General BERT.
- Dataset for model training: AIDA CoNLL03 entity linking dataset. Entities are identified by YAGO2 entity name, Wikipedia URL, or Freebase mid. It has to be used together with the CoNLL03 NER dataset mentioned above. To prepare it:
  - First, download the CoNLL03 dataset, which contains the train/dev/test splits.
  - Second, download aida-yago2-dataset.zip from this website.
  - Third, in the downloaded folder, manually segment AIDA-YAGO2-annotations.tsv into three files corresponding to the CoNLL03 train/dev/test splits, then put them into the train/dev/test folders (a scripted sketch of this step follows this list).
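If you prefer to script the segmentation step, the sketch below splits the annotations file by document. It assumes each document starts with a "-DOCSTART-" line (as in the companion AIDA-YAGO2-dataset.tsv) and uses the standard AIDA-CoNLL split sizes of 946/216/231 documents; both assumptions should be checked against your copy of the file.

```python
# Hedged sketch: split AIDA-YAGO2-annotations.tsv into train/dev/test pieces
# aligned with the CoNLL03 splits. Assumes every document begins with a
# "-DOCSTART-" line and that the standard 946/216/231 document split applies.
from pathlib import Path

SPLITS = [("train", 946), ("dev", 216), ("test", 231)]

def split_annotations(src="AIDA-YAGO2-annotations.tsv", out_dir="."):
    docs, current = [], []
    for line in Path(src).read_text(encoding="utf-8").splitlines(keepends=True):
        if line.startswith("-DOCSTART-") and current:
            docs.append(current)
            current = []
        current.append(line)
    if current:
        docs.append(current)

    start = 0
    for name, count in SPLITS:
        part = docs[start:start + count]
        split_dir = Path(out_dir) / name
        split_dir.mkdir(parents=True, exist_ok=True)
        (split_dir / "annotations.tsv").write_text(
            "".join(line for doc in part for line in doc), encoding="utf-8")
        start += count

if __name__ == "__main__":
    split_annotations()
```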
Medical entity linking
- Task: Entity linking in the medical domain.
- Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).
- Dataset for model training: MedMentions st21pv sub-dataset. It can be downloaded from this GitHub repo.
Below are the steps to train a general NER model. You can modify the config directory to train the other pipelines.
- Create a conda virtual environment and git clone this repo, then in the command line:
  cd composing_information_system/
  export PYTHONPATH="$(pwd):$PYTHONPATH"
- Create an output directory:
  mkdir sample_output
- Run the training script:
  python examples/tagging/main_train_tagging.py --config-dir examples/tagging/configs_ner/
- After training, you will find your trained model in the following directory. It contains the trained model, vocabulary, train state, and training log:
  ls sample_output/ner/
- Download the pretrained models and vocabs from the list below. Put model.pt and vocab.pkl into the predict_path specified in config_predict.yml (a small sanity-check sketch follows the list), then run the following command:
  python examples/tagging/main_predict_tagging.py --config-dir examples/tagging/configs_ner/
- General NER: model, train_state, vocab
- Bio-medical NER: model, train_state, vocab
- Wiki entity linking: model, train_state, vocab
- Medical entity linking: model, train_state, vocab
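Before running prediction, you can quickly check that the files are in place. The snippet below is a minimal sketch: it assumes config_predict.yml stores the directory under a top-level predict_path key (the key name comes from the instructions above, but the exact nesting in this repo's configs may differ).

```python
# Minimal pre-flight check before prediction: confirm model.pt and vocab.pkl
# exist in the predict_path named in config_predict.yml. The flat "predict_path"
# key is an assumption; adjust if the config nests it differently.
import os
import yaml  # requires pyyaml

with open("examples/tagging/configs_ner/config_predict.yml") as f:
    cfg = yaml.safe_load(f)

predict_path = cfg["predict_path"]
for fname in ("model.pt", "vocab.pkl"):
    path = os.path.join(predict_path, fname)
    print(path, "found" if os.path.exists(path) else "MISSING")
```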
- You can use your trained bio-medical NER model and medical entity linking model to do inference on a new dataset.
- Inference dataset: CORD-NERD dataset. Information about this dataset and download links can be found here.
  python examples/tagging/main_predict_cord.py --config-dir examples/tagging/configs_cord_test/
Below are the performance metrics of the general NER task.

| General NER | Accuracy | Precision | Recall | F1    | #instance |
|-------------|----------|-----------|--------|-------|-----------|
| Overall     | 98.98    | 93.56     | 94.81  | 94.18 |           |
| LOC         |          | 95.94     | 96.41  | 96.17 | 18430     |
| MISC        |          | 86.15     | 89.97  | 88.02 | 9628      |
| ORG         |          | 91.05     | 91.99  | 91.52 | 13549     |
| PER         |          | 96.89     | 97.69  | 97.29 | 18563     |
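For reference, the F1 values reported here are the harmonic mean of precision and recall; for the overall row, 2 × 93.56 × 94.81 / (93.56 + 94.81) ≈ 94.18.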
Below are the performance metrics of the bio-medical NER task.

| Bio-medical NER | Accuracy | Precision | Recall | F1    | #instance |
|-----------------|----------|-----------|--------|-------|-----------|
| Overall         | 98.41    | 84.93     | 89.01  | 86.92 |           |
| Chemical        |          | 79.20     | 86.34  | 82.62 | 1428      |
| Organism        |          | 85.23     | 73.87  | 79.14 | 3337      |
| Protein         |          | 85.53     | 97.15  | 90.97 | 11972     |
Below are the performance metrics of the wiki entity linking task. Due to the large number of classes in entity linking tasks, we only show the overall performance.

| Wiki entity linking | Accuracy | Precision | Recall | F1    |
|---------------------|----------|-----------|--------|-------|
| Overall             | 91.27    | 51.86     | 38.60  | 44.25 |
Below are the performance metrics of the medical entity linking task. Since the MedMentions dataset does not provide word boundaries (only entity linking boundaries), the evaluation method here is to count exact matches of entities.

| Medical entity linking | Precision | Recall | F1    |
|------------------------|-----------|--------|-------|
| Exact match            | 26.25     | 22.24  | 24.07 |
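As an illustration of this exact-match counting (not the repo's evaluation script), the sketch below treats each entity as a (document, span, concept ID) tuple and counts a prediction as correct only when it matches a gold tuple exactly; the tuple layout and the toy UMLS-style IDs are assumptions made for the example.

```python
# Hedged sketch of exact-match entity-linking evaluation: a prediction counts
# only if its span and linked concept ID both match a gold entity exactly.
def exact_match_prf(gold, pred):
    """gold, pred: sets of (doc_id, start, end, concept_id) tuples."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with made-up spans and UMLS-style concept IDs:
gold = {("doc1", 0, 12, "C0009450"), ("doc1", 20, 28, "C0030705")}
pred = {("doc1", 0, 12, "C0009450"), ("doc1", 30, 35, "C0011847")}
print(exact_match_prf(gold, pred))  # (0.5, 0.5, 0.5)
```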