This is the code for our paper "Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models", to appear in Findings of ACL 2024.
The original train/validation/test data and the generated synthetic training data have been uploaded to the Hugging Face Dataset Hub (note that `KG` and `LLM` stand for the two ways of incorporating external knowledge):
Corpus | # Train | # Test | # Class | Task | Link-KG | Link-LLM |
---|---|---|---|---|---|---|
LitCovid | 24960 | 6238 | 7 | Text Classification | litcovid | litcovid |
HOC | 3091 | 898 | 10 | Text Classification | hoc | hoc |
GAD | 4750 | 350 | 1 | Relation Extraction | gad | gad |
CDR | 8431 | 2522 | 1 | Relation Extraction | cdr | cdr |
ChemProt | 8793 | 10807 | 5 | Relation Extraction | chemprot | chemprot |
MedNLI | 11232 | 1422 | 3 | Natural Language Inference | mednli | mednli |
MEDIQA-NLI | - | 405 | 3 | Natural Language Inference | mediqa-nli | mediqa-nli |
MEDIQA-RQE | 8588 | 302 | 2 | Natural Language Inference | mediqa-rqe | mediqa-rqe |
PUBHEALTH | 9804 | 1231 | 4 | Fact Verification | pubhealth | pubhealth |
HealthVer | 10591 | 1824 | 3 | Fact Verification | healthver | healthver |
MQP | 10 | 3033 | 2 | Sentence Similarity | mqp | mqp |
BC5CDR-Disease | 4882 | 5085 | 1 | Named Entity Recognition | bc5cdr-disease | bc5cdr-disease |
BC5CDR-Chemical | 4882 | 5085 | 1 | Named Entity Recognition | bc5cdr-chemical | bc5cdr-chemical |
NCBI-Disease | 5336 | 921 | 1 | Named Entity Recognition | ncbi-disease | ncbi-disease |
CHEMDNER | 14522 | 12430 | 1 | Named Entity Recognition | chemdner | chemdner |
CASI | 5 | 100 | 6 | Attribute Extraction | casi | casi |
Note:
- Due to privacy constraints, we are not able to release the training set for MedNLI/MediQA-NLI.
- `train.jsonl` is the synthetic training set (may contain noise).
- `train_few.jsonl` contains the initial few-shot demonstrations.
- `test.jsonl` contains data from the test set.
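
The three files are in JSON Lines format (one JSON object per line), so they can be inspected with the Python standard library alone. The sketch below is a minimal illustration, not part of this repo: `load_jsonl` is a hypothetical helper and `data/litcovid` is an assumed local download path. It makes no assumption about field names and simply prints whatever keys the first record has.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list, one parsed object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Assumed local path; point this at wherever the downloaded files live.
    data_dir = Path("data/litcovid")
    for name in ("train.jsonl", "train_few.jsonl", "test.jsonl"):
        file_path = data_dir / name
        if not file_path.exists():
            print(name, "not found, skipping")
            continue
        records = load_jsonl(file_path)
        print(name, len(records), "records")
        # Field names differ per dataset, so list them instead of assuming any.
        if records and isinstance(records[0], dict):
            print("  fields:", sorted(records[0].keys()))
```

Since `train.jsonl` may contain noise, it is worth looking at a handful of records before training on it.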
First, apply for an OpenAI API key if you don't have one yet.
Then, replace `YOUR_API_KEY` in `clingen.py` with your own API key.
Finally, run `bash run_clingen.sh` with your specified dataset name and keyword type.
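
If you would rather not write the key into the source file, one option is to read it from an environment variable and assign the result where `clingen.py` has `YOUR_API_KEY`. This is a sketch of that alternative, not something the repo does itself: the helper name `resolve_api_key` and the variable name `OPENAI_API_KEY` are assumptions.

```python
import os

# The placeholder string that the instructions say to replace in clingen.py.
PLACEHOLDER = "YOUR_API_KEY"


def resolve_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError if the variable is unset, empty, or still the placeholder.
    """
    key = os.environ.get(env_var, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your OpenAI API key."
        )
    return key


if __name__ == "__main__":
    try:
        api_key = resolve_api_key()
        # Avoid printing the key itself; only confirm that it was found.
        print("API key loaded,", len(api_key), "characters")
    except RuntimeError as err:
        print("Error:", err)
```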
Feel free to contact ran.xu at emory.edu with any questions regarding this repo. Please describe the problem in detail so we can help you better and faster!
If you find this repository helpful, please kindly consider citing the corresponding paper. Thanks in advance!
@inproceedings{xu2024knowledge,
title={Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models},
author={Xu, Ran and Cui, Hejie and Yu, Yue and Kan, Xuan and Shi, Wenqi and Zhuang, Yuchen and Jin, Wei and Ho, Joyce and Yang, Carl},
booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
year={2024}
}