In this project, we experiment with a range of prompting strategies for genetic information extraction to evaluate performance and identify the limitations of generative technologies.
Organisation of information about genes, genetic variants, and associated diseases from vast quantities of scientific literature texts through automated information extraction (IE) strategies can facilitate progress in personalised medicine.
We systematically evaluate the performance of generative large language models (LLMs) on the extraction of specialised genetic information, focusing on end-to-end IE encompassing both named entity recognition and relation extraction. We experiment across multilingual datasets with a range of instruction strategies, including zero-shot and few-shot prompting, as well as providing an annotation guideline. The best results are obtained with few-shot prompting. However, we also find that generative LLMs fail to adhere to the instructions provided, leading to over-generation of entities and relations. We therefore carefully examine the effect of learning paradigms on the extent to which genetic entities are fabricated, and the limitations of exact matching for determining model performance.
- Download the datasets for IE tasks.
- Create train and test datasets in the format below for each of the datasets.
  - For each dataset, create a <dataset_type>_text.tsv file and a <dataset_type>_gold_annotations.tsv file.
  - <dataset_type>_text.tsv is a TSV file containing the columns pmid (ID of the paper) and text (text from the literature).
  - <dataset_type>_gold_annotations.tsv is a TSV file containing the ground truth (gold) annotations, used for pairwise comparisons to evaluate the performance of the system. It contains the columns below.
    - For Named Entity Recognition (NER):
      - pmid: PubMed ID of the paper
      - filename: File name of the paper the text is from
      - mark: Annotation ID following the BRAT format
      - label: Entity label, e.g. Disease
      - offset1: Starting index of the span
      - offset2: Ending index of the span
      - span: Identified entity, e.g. Síndrome de Gorlin
    - For Relation Extraction (RE) or joint NER and RE (NERRE):
      - pmid: PubMed ID of the paper
      - filename: File name of the paper the text is from
      - mark1: Annotation ID for the first entity following the BRAT format
      - label1: First entity label, e.g. Gene
      - offset1_start: Starting index of the first span
      - offset1_end: Ending index of the first span
      - span1: First entity identified, e.g. DUSP6
      - mark2: Annotation ID for the second entity following the BRAT format
      - label2: Second entity label, e.g. Disease
      - offset2_start: Starting index of the second span
      - offset2_end: Ending index of the second span
      - span2: Second entity identified, e.g. Mood Disorders
      - relation_mark: ID for the relation identified
      - relation_type: Relation type to annotate, e.g. biomarker
  - Alternatively, if the dataset is one of GenoVarDis, TBGA or Variome, the data can be cleaned and pre-processed once CLEAN-DATA=true is set in the .env file.
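The file layout described in this step can be sketched as follows. This is only an illustration: the pmid, file name and sentence are made up, the helper names (find_span, write_tsv, build_example) are hypothetical, and the end offset is assumed to be exclusive, as in BRAT standoff annotations.

```python
import csv
from pathlib import Path

# Hypothetical example record: the pmid, file name and sentence are invented.
PMID = "00000000"
FILENAME = "example.txt"
TEXT = "El paciente fue diagnosticado con Síndrome de Gorlin."

NER_HEADER = ["pmid", "filename", "mark", "label", "offset1", "offset2", "span"]


def find_span(text, span):
    """Return (start, end) character offsets of the first occurrence of span.

    The end offset is exclusive (an assumption, following BRAT standoff).
    """
    start = text.find(span)
    if start == -1:
        raise ValueError(f"span {span!r} not found in text")
    return start, start + len(span)


def write_tsv(path, header, rows):
    """Write a header line and rows to a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


def build_example(out_dir, dataset_type="train"):
    """Create <dataset_type>_text.tsv and an NER <dataset_type>_gold_annotations.tsv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / f"{dataset_type}_text.tsv"
    gold_path = out_dir / f"{dataset_type}_gold_annotations.tsv"

    write_tsv(text_path, ["pmid", "text"], [[PMID, TEXT]])

    span = "Síndrome de Gorlin"
    start, end = find_span(TEXT, span)
    write_tsv(gold_path, NER_HEADER, [[PMID, FILENAME, "T1", "Disease", start, end, span]])
    return text_path, gold_path
```

Calling build_example("data") would write both files into a data folder. A NERRE gold file would follow the same pattern with the longer column list given above.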
- Set up models
  - Currently supported models are:
    - GPT-3.5 Turbo, model id: gpt-35-turbo-16k
    - Llama 3 70b Instruct, model id: meta.llama3-70b-instruct-v1:0
  - Using Azure OpenAI
    Note: The environment variables should be set inside the .env file.
  - Using Amazon Bedrock
    - Getting started with Amazon Bedrock
    - Install AWS CLI
    - Configure SSO for authentication
      aws configure sso
      aws sso login --profile <PROFILE-NAME>
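Once the SSO login has succeeded, a call to the Llama 3 model through Bedrock could look roughly like the sketch below. It is not taken from this project's models.py: the function names and default parameters are hypothetical, boto3 is assumed to be installed, and the profile is assumed to have a region and access to the model.

```python
import json

MODEL_ID = "meta.llama3-70b-instruct-v1:0"


def build_llama_body(prompt, max_gen_len=512, temperature=0.0):
    """Serialise a request body using the fields Bedrock accepts for Meta Llama models."""
    return json.dumps(
        {"prompt": prompt, "max_gen_len": max_gen_len, "temperature": temperature}
    )


def invoke_llama(prompt, profile_name, region_name=None):
    """Send one prompt to Llama 3 70B Instruct and return the generated text."""
    # Imported lazily so build_llama_body can be used without boto3 installed.
    import boto3

    # Reuse the credentials created by `aws sso login --profile <PROFILE-NAME>`.
    session = boto3.Session(profile_name=profile_name, region_name=region_name)
    client = session.client("bedrock-runtime")
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_llama_body(prompt),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["generation"]
```

The request builder is kept separate from the network call so that prompts can be inspected without credentials.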
- Duplicate the .env-template file as .env and populate it according to the task and model.
- [Optional] Add custom prompts to the matching prompt library file: <task>_prompts.json.
- [Optional] To add other models, update the models.py file by creating the corresponding model class, similar to the class GPTModel.
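To make the role of the .env file concrete, here is a minimal sketch of reading a flag such as CLEAN-DATA from KEY=value lines. The project may well load these values with a dedicated library. The parser below is a stand-in that uses only the standard library, and EXAMPLE-KEY is a hypothetical entry that does not come from .env-template.

```python
def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def env_flag(values, name, default=False):
    """Interpret a setting as a boolean; missing keys fall back to default."""
    raw = values.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"true", "1", "yes"}


# Hypothetical .env content used only for illustration.
example = '# settings\nCLEAN-DATA=true\nEXAMPLE-KEY="value"\n'
settings = parse_env(example)
clean_data = env_flag(settings, "CLEAN-DATA")
```

With the example content above, clean_data is True, so a pipeline following this sketch would run its cleaning step before extraction.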
Run the Python program from an IDE or from the command line with python main.py.
Brat-Eval is the tool we have used for evaluation.
A summary of the datasets, extracted instances, hallucinated instances, visualisation of results, and performance details will be generated in <RESULT-FOLDER-PATH>/results once the program has finished running.
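As a rough illustration of what exact-match scoring and a fabricated-entity check involve, consider the sketch below. It is not Brat-Eval and not this project's evaluation code: the function names are hypothetical, the sentence is invented, and treating an entity as fabricated when its span does not occur in the source text is an assumption made for the example.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F1 over (label, span) pairs that match exactly."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def fabricated_spans(text, predicted):
    """Return predicted entities whose span text never appears in the source text."""
    return [(label, span) for label, span in predicted if span not in text]


# Invented sentence and annotations, for illustration only.
text = "Mutations in DUSP6 have been linked to Mood Disorders."
gold = [("Gene", "DUSP6"), ("Disease", "Mood Disorders")]
predicted = [("Gene", "DUSP6"), ("Disease", "Mood Disorder"), ("Gene", "BRCA1")]

precision, recall, f1 = exact_match_scores(gold, predicted)
fabricated = fabricated_spans(text, predicted)
```

In this example the near miss "Mood Disorder" counts as an error under exact matching even though it overlaps the gold span, while only "BRCA1" is flagged as fabricated because it is absent from the text.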
Milindi Kodikara
Karin Verspoor
© 2024 Copyright for this project by its contributors.
🧩 READ stands for Reading, Extraction, and Annotation of Documents!