Composable Training Pipeline

The repo contains four training pipelines: general NER, bio-medical NER, wiki entity linking, and medical entity linking.

Tasks and Datasets

General NER

  • Task: NER in the general domain.

  • Model: General BERT.

  • Dataset for model training:
    English data from the CoNLL 2003 shared task. It contains four different types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC, and can be downloaded from the DeepAI website. A snippet of the file format is shown after this list.

  • Use Case: Analyzing research papers on COVID-19

    • Materials to use for testing: research papers on COVID-19.

    • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora)

    • Datasets for model training: NER: CORD-NER dataset; entity linking: CORD-NERD dataset
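
For reference, files in the CoNLL 2003 format hold one token per line with whitespace-separated columns (token, POS tag, chunk tag, NER tag) and blank lines between sentences. A short illustrative excerpt:

    U.N.      NNP  I-NP  I-ORG
    official  NN   I-NP  O
    Ekeus     NNP  I-NP  I-PER
    heads     VBZ  I-VP  O
    for       IN   I-PP  O
    Baghdad   NNP  I-NP  I-LOC
    .         .    O     O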

Bio-medical NER

  • Task: NER in the bio-medical domain.

  • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).

  • Dataset for model training:
    MTL-Bioinformatics-2016. Download and learn more about this dataset at this GitHub repo.

Wiki entity linking

  • Task: Entity linking in the general domain.

  • Model: General BERT.

  • Dataset for model training: AIDA CoNLL03 entity linking dataset. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase mid. It has to be used together with the CoNLL03 NER dataset mentioned above.

    First, download the CoNLL03 dataset, which contains the train/dev/test splits.

    Second, download aida-yago2-dataset.zip from this website.

    Third, in the downloaded folder, manually segment AIDA-YAGO2-annotations.tsv into three files corresponding to the CoNLL03 train/dev/test splits, then put them into the train/dev/test folders. A sketch of this segmentation follows.
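
    A minimal Python sketch of that third step, assuming the standard AIDA convention that dev and test documents are marked by "testa" and "testb" in their -DOCSTART- headers (file names and paths are illustrative; adjust them to your copy):

        # split_aida.py -- illustrative sketch, not part of the repo.
        splits = {"train": [], "dev": [], "test": []}
        current = "train"

        with open("aida-yago2-dataset/AIDA-YAGO2-annotations.tsv", encoding="utf-8") as f:
            for line in f:
                if line.startswith("-DOCSTART-"):
                    # Dev/test documents carry "testa"/"testb" in the document id.
                    if "testa" in line:
                        current = "dev"
                    elif "testb" in line:
                        current = "test"
                    else:
                        current = "train"
                splits[current].append(line)

        for name, lines in splits.items():
            with open(f"{name}/annotations.tsv", "w", encoding="utf-8") as out:
                out.writelines(lines)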

Medical entity linking

  • Task: Entity linking in the medical domain.

  • Model: BioBERT v1.1 (a domain-specific language representation model pre-trained on large-scale biomedical corpora).

  • Dataset for model training: MedMentions st21pv sub-dataset. It can be downloaded from this GitHub repo. An illustrative excerpt of its annotation format follows this list.
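
For orientation, MedMentions is distributed in PubTator format: a title line, an abstract line, then one tab-separated line per entity mention (PMID, start offset, end offset, mention text, semantic type, UMLS concept ID). The excerpt below is invented for illustration, not copied from the corpus:

    12345678|t|Cystic fibrosis and chronic infection
    12345678|a|We describe chronic airway infection in cystic fibrosis patients.
    12345678	0	15	Cystic fibrosis	T047	UMLS:C0010674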

How to train your models

  • The steps below train a general NER model; modify the config directory to train the others. An end-to-end command sketch follows this list.

  • Create a conda virtual environment and git clone this repo, then in command line:

    cd composing_information_system/

    export PYTHONPATH="$(pwd):$PYTHONPATH"

  • Create an output directory.

    mkdir sample_output

  • Run training script:

    python examples/tagging/main_train_tagging.py --config-dir examples/tagging/configs_ner/

  • After training, you will find your artifacts in the following directory. It contains the trained model, vocabulary, training state, and training log.

    ls sample_output/ner/
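
  • Putting the steps together, a minimal end-to-end command sketch (the repository URL, environment name, and Python version are assumptions, not taken from this document):

    # Illustrative sequence; substitute the real repository URL.
    conda create -n composing python=3.8    # Python version is an assumption
    conda activate composing
    git clone <repo-url>
    cd composing_information_system/
    export PYTHONPATH="$(pwd):$PYTHONPATH"
    mkdir sample_output
    python examples/tagging/main_train_tagging.py --config-dir examples/tagging/configs_ner/
    ls sample_output/ner/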

How to do inference

  • Download the pretrained models and vocabs from the list below. Put model.pt and vocab.pkl into the predict_path specified in config_predict.yml, then run the following command (a minimal config excerpt follows it).

    python examples/tagging/main_predict_tagging.py --config-dir examples/tagging/configs_ner/
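
  • A minimal excerpt of config_predict.yml, under the single assumption (stated above) that it contains a predict_path entry; the value shown is illustrative:

    # config_predict.yml (excerpt, illustrative)
    predict_path: sample_output/ner/   # directory holding model.pt and vocab.pkl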

Inference using two concatenated models

  • You can use your trained bio-medical NER model and medical entity linking model to do inference on a new dataset.

  • Inference dataset: CORD-NERD dataset. Information about this dataset and downloadable links can be found here. A conceptual sketch of the two-model chain follows the command below.

    python examples/tagging/main_predict_cord.py --config-dir examples/tagging/configs_cord_test/
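
  • Conceptually, the two models run back to back: the NER model proposes entity mentions, and the entity linking model assigns each mention a concept ID. The sketch below is purely hypothetical Python to show the data flow; ner_model and linker are stand-ins, not the repo's actual API:

    # Hypothetical illustration of chaining two models; not the repo's real API.
    from typing import Callable, List, Tuple

    Span = Tuple[int, int, str]  # (start offset, end offset, entity type)

    def link_entities(text: str,
                      ner_model: Callable[[str], List[Span]],
                      linker: Callable[[str], str]) -> List[Tuple[Span, str]]:
        """Run NER first, then link each predicted mention to a concept ID."""
        results = []
        for start, end, label in ner_model(text):
            mention = text[start:end]
            results.append(((start, end, label), linker(mention)))
        return results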

Evaluation performance examples

  • Below are the performance metrics of the general NER task.

    | General NER | Accuracy | Precision | Recall | F1    | #instance |
    |-------------|----------|-----------|--------|-------|-----------|
    | Overall     | 98.98    | 93.56     | 94.81  | 94.18 |           |
    | LOC         |          | 95.94     | 96.41  | 96.17 | 18430     |
    | MISC        |          | 86.15     | 89.97  | 88.02 | 9628      |
    | ORG         |          | 91.05     | 91.99  | 91.52 | 13549     |
    | PER         |          | 96.89     | 97.69  | 97.29 | 18563     |
  • Below are the performance metrics of the bio-medical NER task.

    | Bio-medical NER | Accuracy | Precision | Recall | F1    | #instance |
    |-----------------|----------|-----------|--------|-------|-----------|
    | Overall         | 98.41    | 84.93     | 89.01  | 86.92 |           |
    | Chemical        |          | 79.20     | 86.34  | 82.62 | 1428      |
    | Organism        |          | 85.23     | 73.87  | 79.14 | 3337      |
    | Protein         |          | 85.53     | 97.15  | 90.97 | 11972     |
  • Below are the performance metrics of the wiki entity linking task. Due to the large number of classes in entity linking tasks, we only show the overall performance.

    | Wiki entity linking | Accuracy | Precision | Recall | F1    |
    |---------------------|----------|-----------|--------|-------|
    | Overall             | 91.27    | 51.86     | 38.60  | 44.25 |
  • Below are the performance metrics of the medical entity linking task. Since the MedMentions dataset does not provide word boundaries (only entity linking boundaries), the evaluation counts exact matches of entities. A sketch of this exact-match scoring follows the table.

    | Medical entity linking | Precision | Recall | F1    |
    |------------------------|-----------|--------|-------|
    | Exact match            | 26.25     | 22.24  | 24.07 |
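
A minimal sketch of this exact-match scoring, assuming predictions and gold annotations are represented as sets of (start, end, concept ID) triples; the representation and names are illustrative, not the repo's evaluation code:

    # Illustrative exact-match precision/recall/F1 for entity linking.
    def exact_match_prf(gold: set, pred: set):
        """Each element is a (start, end, concept_id) triple."""
        tp = len(gold & pred)  # a prediction counts only if all fields match
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Example: one of two predictions exactly matches the single gold entity.
    gold = {(0, 15, "C0010674")}
    pred = {(0, 15, "C0010674"), (20, 28, "C0000000")}
    print(exact_match_prf(gold, pred))  # (0.5, 1.0, 0.666...)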