Releases: Milindi-Kodikara/RMIT-READ-BioMed
ALTA 2024
RMIT READ-BioMed for ALTA
The RMIT University system for the 22nd Annual Workshop of the Australasian Language Technology Association (ALTA 2024).
In this project, we experiment with a range of prompting strategies for genetic information extraction to evaluate performance and identify the limitations of using generative technologies.
List of Publications
Lesser the shots, higher the hallucinations: Exploration of Genetic Information Extraction using Generative Large Language Models - TBA
Project overview
Organisation of information about genes, genetic variants, and associated diseases from vast quantities of scientific literature texts through automated information extraction (IE) strategies can facilitate progress in personalised medicine.
We systematically evaluate the performance of generative large language models (LLMs) on the extraction of specialised genetic information, focusing on end-to-end IE encompassing both named entity recognition and relation extraction. We experiment across multilingual datasets with a range of instruction strategies, including zero-shot and few-shot prompting along with providing an annotation guideline. The best results are obtained with few-shot prompting. However, we also find that generative LLMs fail to adhere to the instructions provided, leading to over-generation of entities and relations. We therefore carefully examine how the learning paradigm affects the extent to which genetic entities are fabricated, and the limitations of exact matching as a measure of model performance.
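To make the point about exact matching concrete, here is a minimal sketch of exact-match scoring over extracted entities. It is not the project's evaluation script: the function name `exact_match_score`, the label names (`Gene`, `Variant`, `Disease`), and the sample mentions are all assumptions chosen for illustration.

```python
from typing import NamedTuple


class Score(NamedTuple):
    precision: float
    recall: float
    f1: float


def exact_match_score(
    gold: set[tuple[str, str]], predicted: set[tuple[str, str]]
) -> Score:
    """Score predicted (mention, label) pairs against gold pairs by exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return Score(precision, recall, f1)


# Illustrative data only (hypothetical, not taken from the project's datasets).
gold = {
    ("BRCA1", "Gene"),
    ("c.68_69delAG", "Variant"),
    ("breast cancer", "Disease"),
}
predicted = {
    ("BRCA1", "Gene"),
    ("breast cancer", "Disease"),
    ("BRCA2", "Gene"),          # over-generated: not present in gold
    ("68_69delAG", "Variant"),  # near miss: still wrong under exact matching
}

score = exact_match_score(gold, predicted)
print(f"P={score.precision:.2f} R={score.recall:.2f} F1={score.f1:.2f}")
```

In this toy case the over-generated entity and the near-miss variant both count as plain errors, which is one way exact matching can understate how close a model's output is to the gold annotation.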
Full Changelog: v1.0...v2.0
GenoVarDis 2024
RMIT READ-BioMed for GenoVarDis
The RMIT University system for NER of genetic entities in biomedical literature for the GenoVarDis shared task at the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024).
List of Publications
Overview
This system was developed for the GenoVarDis shared task at IberLEF 2024, which focuses on Named Entity Recognition (NER) of genes, genetic variants, and associated diseases in Spanish-language scientific literature texts.
The approach explores a general-purpose generative Large Language Model (LLM), GPT-3.5, for NER. We compare a cross-linguistic setting, in which English-language instructions accompany the Spanish-language target text, with a within-language setting, in which the instruction language matches the language of the text. We further experiment with a range of instruction strategies, including zero-shot and few-shot prompting, under these two settings. The best results were obtained with English-language instructions under the few-shot learning paradigm, with an F1-score of 0.5. While this approach does not match the top results achieved for the shared task, our experiments provide insight into the limitations of simple prompting of LLMs in languages other than English.
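The two settings and two prompting strategies described above can be sketched as a small prompt builder. This is only an illustration under stated assumptions: the instruction wording, the `build_prompt` helper, the `Text:`/`Entities:` layout, and the sample sentences are hypothetical and are not the prompts used in the system. No model call is made here.

```python
# Hypothetical instruction texts; the wording actually used in the system may differ.
INSTRUCTIONS = {
    "en": "Extract all genes, genetic variants, and diseases from the text below.",
    "es": "Extrae todos los genes, variantes genéticas y enfermedades del texto siguiente.",
}


def build_prompt(text, instruction_lang="en", examples=()):
    """Assemble a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    instruction_lang="en" gives the cross-linguistic setting for Spanish text;
    instruction_lang="es" gives the within-language setting.
    """
    parts = [INSTRUCTIONS[instruction_lang]]
    for example_text, example_entities in examples:
        parts.append(f"Text: {example_text}\nEntities: {example_entities}")
    # The target text comes last, with the answer slot left open for the model.
    parts.append(f"Text: {text}\nEntities:")
    return "\n\n".join(parts)


# Illustrative inputs only (hypothetical, not from the GenoVarDis data).
target = "La mutación en el gen BRCA1 se asocia con cáncer de mama."
shots = [
    ("El gen TP53 está implicado en el cáncer de pulmón.",
     "TP53 (Gene); cáncer de pulmón (Disease)"),
]

zero_shot_cross = build_prompt(target, instruction_lang="en")
few_shot_within = build_prompt(target, instruction_lang="es", examples=shots)
print(few_shot_within)
```

Keeping the instruction language and the examples as separate parameters makes it straightforward to cover all four combinations (zero-shot or few-shot, cross-linguistic or within-language) with the same target text.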
Full Changelog: https://github.com/Milindi-Kodikara/RMIT-READ-BioMed-Version-2.0/commits/v1.0