This repo hosts the code for the paper Biomedical Event Extraction with Hierarchical Knowledge Graphs. We represent knowledge from UMLS in hierarchical knowledge graphs, and integrate knowledge into pre-trained contextual representations with a proposed graph neural networks, Graph Edge-conditioned Attention Neural Networks (GEANet). Currently, only SciBERT Baseline is available, and our best performing model GEANet-SciBERT will soon be released.
Model | Dev Set F1 | Test Set F1 |
---|---|---|
SciBERT Baseline | 59.33 | 58.50 |
GEANet-SciBERT | 60.38 | 60.06 |
Previous SOTA | N/A | 58.65 |
With the increasing concern about the COVID-19 pandemic, researchers have been putting much effort in providing useful insights into the COVID-19 Open Research Dataset Challenge (CORD-19). This repo also demonstrates how we extract biomedical events with SciBERT, a BERT trained on scientific corpus, which was fine-tuned on the GENIA BioNLP shared task 2011. We took reference from the pipeline described in Bjorne et al., where the pipeline can be broken into 3 stages: trigger detection, edge/argument detection and unmerging. The extracted events can be found here.
In addition, we adopted the framework from Han et al., where trigger and edge detections are trained in a multitask setting.
Our experiments were ran with Python 3.6.10, PyTorch 1.4.0 and CUDA 10.1 on a CentOS machine. Please follow the instructions of GPU dependencies and usage as illustrated in the official PyTorch website. We utilized the NER model en_ner_jnlpba_md
provided by ScispaCy for tagging biomedical entities.
Run pip install -r requirements
to install all the required packages.
(Optional) If you would like to use the knowledge incorporation component, additional packages are required for Pytorch Geometric can be installed as follows:
$ pip install torch-scatter==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.4.0.html
$ pip install torch-sparse==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.4.0.html
$ pip install torch-cluster==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.4.0.html
$ pip install torch-spline-conv==latest+${CUDA} -f https://pytorch-geometric.com/whl/torch-1.4.0.html
, where ${CUDA}
should be replaced by your installed CUDA version. (e.g. cpu
, cu92
, cu101
, cu102
)
The trained model can be downloaded from here. You need to unzip decompress it before running the program.
- Run through
Preprocess CORD.ipynb
to generate the processed.txt
files togenia_cord_19
from the json files in thecustom_license/custom_license/pmc_json/
directory. - Type
. run_multitask_bert.sh
in the terminal to run the whole event extraction pipeline and generate event annotations intogenia_cord_19_output
.
Call the biomedical_evet_extraction
function in the predict.py
script. This function takes in a single string as parameter. The event extraction pipeline will be run on this string.
>>> biomedical_evet_extraction("BMP-6 inhibits growth of mature human B cells; induction of Smad phosphorylation and upregulation of Id1.")
>>> [{'tokens': ['BMP-6', 'inhibits', 'growth', 'of', 'mature', 'human', 'B', 'cells', ';', 'induction', 'of', 'Smad', 'phosphorylation', 'and', 'upregulation', 'of', 'Id1', '.'], 'events': [{'event_type': 'Positive_regulation', 'triggers': [{'event_type': 'Positive_regulation', 'text': 'upregulation', 'start_token': 14, 'end_token': 14}], 'arguments': [{'role': 'Theme', 'text': 'Id1', 'start_token': 16, 'end_token': 16}]}], 'ner': [[0, 0, 'PROTEIN'], [4, 7, 'CELL_TYPE'], [11, 11, 'PROTEIN'], [16, 16, 'PROTEIN']]}]
├── genia_cord_19
├── genia_cord_19_output
├── preprocessed_data
├── eval
└── weights
@inproceedings{huang-etal-2020-biomedical,
title = "Biomedical Event Extraction with Hierarchical Knowledge Graphs",
author = "Huang, Kung-Hsiang and
Yang, Mu and
Peng, Nanyun",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.114",
doi = "10.18653/v1/2020.findings-emnlp.114",
pages = "1277--1285",
abstract = "Biomedical event extraction is critical in understanding biomolecular interactions described in scientific corpus. One of the main challenges is to identify nested structured events that are associated with non-indicative trigger words. We propose to incorporate domain knowledge from Unified Medical Language System (UMLS) to a pre-trained language model via Graph Edge-conditioned Attention Networks (GEANet) and hierarchical graph representation. To better recognize the trigger words, each sentence is first grounded to a sentence graph based on a jointly modeled hierarchical knowledge graph from UMLS. The grounded graphs are then propagated by GEANet, a novel graph neural networks for enhanced capabilities in inferring complex events. On BioNLP 2011 GENIA Event Extraction task, our approach achieved 1.41{\%} F1 and 3.19{\%} F1 improvements on all events and complex events, respectively. Ablation studies confirm the importance of GEANet and hierarchical KG.",
}