Composable Training Pipeline

The repo contains four training pipelines: general NER, bio-medical NER, wiki entity linking, and medical entity linking.

Tasks and Datasets

General NER

  • Task: NER in the general domain.

  • Model: General BERT.

  • Dataset for model training:
    English data from the CoNLL 2003 shared task. It contains four different types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC, and can be downloaded from the DeepAI website. A snippet of the file format is shown after this list.

  • Use Case: Analyzing research papers on COVID-19

    • Materials to use for testing: research papers on COVID-19.

    • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora)

    • Datasets for model training: NER: CORD-NER dataset; entity linking: CORD-NERD dataset
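
For reference, files in the CoNLL 2003 format hold one token per line with whitespace-separated columns (token, POS tag, chunk tag, NER tag) and blank lines between sentences. A short illustrative excerpt:

    U.N.      NNP  I-NP  I-ORG
    official  NN   I-NP  O
    Ekeus     NNP  I-NP  I-PER
    heads     VBZ  I-VP  O
    for       IN   I-PP  O
    Baghdad   NNP  I-NP  I-LOC
    .         .    O     O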

Bio-medical NER

  • Task: NER in the bio-medical domain.

  • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).

  • Dataset for model training:
    MTL-Bioinformatics-2016. Download and learn more about this dataset at this GitHub repo.

Wiki entity linking

  • Task: Entity linking in the general domain.

  • Model: General BERT.

  • Dataset for model training: AIDA CoNLL03 entity linking dataset. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase mid. It has to be used together with the CoNLL03 NER dataset mentioned above.

    First, download the CoNLL03 dataset, which contains the train/dev/test splits.

    Second, download aida-yago2-dataset.zip from this website.

    Third, in the downloaded folder, manually segment AIDA-YAGO2-annotations.tsv into three files corresponding to the CoNLL03 train/dev/test splits, then put them into the train/dev/test folders. A sketch of this segmentation follows.
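
    A minimal Python sketch of that third step, assuming the standard AIDA convention that dev and test documents are marked by "testa" and "testb" in their -DOCSTART- headers (file names and paths are illustrative; adjust them to your copy):

        # split_aida.py -- illustrative sketch, not part of the repo.
        splits = {"train": [], "dev": [], "test": []}
        current = "train"

        with open("aida-yago2-dataset/AIDA-YAGO2-annotations.tsv", encoding="utf-8") as f:
            for line in f:
                if line.startswith("-DOCSTART-"):
                    # Dev/test documents carry "testa"/"testb" in the document id.
                    if "testa" in line:
                        current = "dev"
                    elif "testb" in line:
                        current = "test"
                    else:
                        current = "train"
                splits[current].append(line)

        for name, lines in splits.items():
            with open(f"{name}/annotations.tsv", "w", encoding="utf-8") as out:
                out.writelines(lines)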

Medical entity linking

  • Task: Entity linking in the medical domain.

  • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).

  • Dataset for model training: MedMentions st21pv sub-dataset. It can be downloaded from this GitHub repo. An illustrative excerpt of its annotation format follows this list.
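
For orientation, MedMentions is distributed in PubTator format: a title line, an abstract line, then one tab-separated line per entity mention (PMID, start offset, end offset, mention text, semantic type, UMLS concept ID). The excerpt below is invented for illustration, not copied from the corpus:

    12345678|t|Cystic fibrosis and chronic infection
    12345678|a|We describe chronic airway infection in cystic fibrosis patients.
    12345678	0	15	Cystic fibrosis	T047	UMLS:C0010674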

How to train your models

  • The steps below train a general NER model; modify the config directory to train the others. An end-to-end command sketch follows this list.

  • Create a conda virtual environment and git clone this repo, then in command line:

    cd composing_information_system/

    export PYTHONPATH="$(pwd):$PYTHONPATH"

  • Create an output directory.

    mkdir sample_output

  • Run training script:

    python examples/tagging/main_train_tagging.py --config-dir examples/tagging/configs_ner/

  • After training, you will find your artifacts in the following directory. It contains the trained model, vocabulary, training state, and training log.

    ls sample_output/ner/
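
  • Putting the steps together, a minimal end-to-end command sketch (the repository URL, environment name, and Python version are assumptions, not taken from this document):

    # Illustrative sequence; substitute the real repository URL.
    conda create -n composing python=3.8    # Python version is an assumption
    conda activate composing
    git clone <repo-url>
    cd composing_information_system/
    export PYTHONPATH="$(pwd):$PYTHONPATH"
    mkdir sample_output
    python examples/tagging/main_train_tagging.py --config-dir examples/tagging/configs_ner/
    ls sample_output/ner/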

How to do inference

  • Download the pretrained models and vocabs from the list below. Put model.pt and vocab.pkl into the predict_path specified in config_predict.yml, then run the following command (a minimal config excerpt follows it).

    python examples/tagging/main_predict_tagging.py --config-dir examples/tagging/configs_ner/
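
  • A minimal excerpt of config_predict.yml, under the single assumption (stated above) that it contains a predict_path entry; the value shown is illustrative:

    # config_predict.yml (excerpt, illustrative)
    predict_path: sample_output/ner/   # directory holding model.pt and vocab.pkl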

Inference using two concatenated models

  • You can use your trained bio-medical NER model and medical entity linking model to do inference on a new dataset.

  • Inference dataset: CORD-NERD dataset. Information about this dataset and downloadable links can be found here. A conceptual sketch of the two-model chain follows the command below.

    python examples/tagging/main_predict_cord.py --config-dir examples/tagging/configs_cord_test/
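
  • Conceptually, the two models run back to back: the NER model proposes entity mentions, and the entity linking model assigns each mention a concept ID. The sketch below is purely hypothetical Python to show the data flow; ner_model and linker are stand-ins, not the repo's actual API:

    # Hypothetical illustration of chaining two models; not the repo's real API.
    from typing import Callable, List, Tuple

    Span = Tuple[int, int, str]  # (start offset, end offset, entity type)

    def link_entities(text: str,
                      ner_model: Callable[[str], List[Span]],
                      linker: Callable[[str], str]) -> List[Tuple[Span, str]]:
        """Run NER first, then link each predicted mention to a concept ID."""
        results = []
        for start, end, label in ner_model(text):
            mention = text[start:end]
            results.append(((start, end, label), linker(mention)))
        return results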

Evaluation performance examples

  • Below are the performance metrics of the general NER task.

    | General NER | Accuracy | Precision | Recall | F1    | #instance |
    |-------------|----------|-----------|--------|-------|-----------|
    | Overall     | 98.98    | 93.56     | 94.81  | 94.18 |           |
    | LOC         |          | 95.94     | 96.41  | 96.17 | 18430     |
    | MISC        |          | 86.15     | 89.97  | 88.02 | 9628      |
    | ORG         |          | 91.05     | 91.99  | 91.52 | 13549     |
    | PER         |          | 96.89     | 97.69  | 97.29 | 18563     |
  • Below are the performance metrics of the bio-medical NER task.

    | Bio-medical NER | Accuracy | Precision | Recall | F1    | #instance |
    |-----------------|----------|-----------|--------|-------|-----------|
    | Overall         | 98.41    | 84.93     | 89.01  | 86.92 |           |
    | Chemical        |          | 79.20     | 86.34  | 82.62 | 1428      |
    | Organism        |          | 85.23     | 73.87  | 79.14 | 3337      |
    | Protein         |          | 85.53     | 97.15  | 90.97 | 11972     |
  • Below are the performance metrics of the wiki entity linking task. Due to the large number of classes in entity linking tasks, we only show the overall performance.

    | Wiki entity linking | Accuracy | Precision | Recall | F1    |
    |---------------------|----------|-----------|--------|-------|
    | Overall             | 91.27    | 51.86     | 38.60  | 44.25 |
  • Below are the performance metrics of the medical entity linking task. Since the MedMentions dataset does not provide word boundaries (only entity linking boundaries), the evaluation counts exact matches of entities. A sketch of this exact-match scoring follows the table.

    | Medical entity linking | Precision | Recall | F1    |
    |------------------------|-----------|--------|-------|
    | Exact match            | 26.25     | 22.24  | 24.07 |
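
A minimal sketch of this exact-match scoring, assuming predictions and gold annotations are represented as sets of (start, end, concept ID) triples; the representation and names are illustrative, not the repo's evaluation code:

    # Illustrative exact-match precision/recall/F1 for entity linking.
    def exact_match_prf(gold: set, pred: set):
        """Each element is a (start, end, concept_id) triple."""
        tp = len(gold & pred)  # a prediction counts only if all fields match
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Example: one of two predictions exactly matches the single gold entity.
    gold = {(0, 15, "C0010674")}
    pred = {(0, 15, "C0010674"), (20, 28, "C0000000")}
    print(exact_match_prf(gold, pred))  # (0.5, 1.0, 0.666...)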