Authors: Ghadeer Mobasher*, Olga Krebs, Wolfgang Müller, and Michael Gertz
Biomedical pre-trained language models (BioPLMs) have been achieving state-of-the-art results on various biomedical text-mining tasks. However, prevailing fine-tuning approaches naively train BioPLMs on targeted datasets without considering the class distributions. This is problematic, especially when dealing with imbalanced biomedical gold-standard datasets for named entity recognition (NER). Despite their high overall performance, state-of-the-art fine-tuned NER models are biased towards the "other" (O) tag and misclassify biomedical entities. To fill this gap, we propose WeLT, a cost-sensitive BERT fine-tuning approach that handles class imbalance for biomedical NER. We investigate the impact of WeLT against traditional fine-tuning approaches on mixed-domain and domain-specific BioPLMs, and evaluate it against other weighting schemes such as Inverse of Number of Samples (INS), Inverse of Square Root of Number of Samples (ISNS), and Effective Number of Samples (ENS). Our results show that WeLT outperforms these alternatives on four different biomedical BERT models and BioELECTRA across 8 gold-standard datasets.
Dependencies
- Python (>=3.6)
- PyTorch (>=1.2.0)

Clone this GitHub repository:
```bash
git clone https://github.com/mobashgr/WELT.git
```
Navigate to the WELT folder and install all necessary dependencies:
```bash
python3 -m pip install -r requirements.txt
```
Note: To install the appropriate torch build, follow the download instructions for your development environment.
NER Datasets
The NER datasets are directly retrieved from BioBERT via this link. We have extended the aforementioned NER datasets to include BioRED. To convert it from BioC XML/JSON to the CoNLL format, we used bconv and filtered the chemical and disease entities.
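For reference, such a conversion can be done with bconv along the following lines. This is a minimal sketch with placeholder file names; the entity-type filtering step is omitted, and the format identifiers follow bconv's documentation:

```python
import bconv

# Load a BioC XML collection and re-serialise it in CoNLL format.
# File names are placeholders, not actual paths from this repository.
coll = bconv.load('BioRED_train.xml', fmt='bioc_xml')
bconv.dump(coll, 'BioRED_train.conll', fmt='conll')
```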
Data Download
To directly download the NER datasets, use download.sh, or manually download them via this link into the WELT directory, then:
```bash
unzip datasets.zip
rm -r datasets.zip
```
Data Pre-processing
We adapted the preprocessing.sh script from BioBERT to include BioRED.
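After pre-processing, each dataset follows the usual two-column CoNLL layout: one token and one IOB tag per line, with a blank line between sentences. The tokens below are illustrative only; the actual tag set for each dataset is listed in its labels.txt:

```
Famotidine    B
-             O
associated    O
delirium      O
.             O
```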
We have conducted experiments on different BERT models using the WeLT weighting scheme, and compared WeLT against other existing weighting schemes as well as the corresponding traditional fine-tuning approaches (i.e., standard BioBERT fine-tuning).
Fine-tuning BERT Models
Model | Used version in HF 🤗 |
---|---|
BioBERT | model_name_or_path |
BlueBERT | model_name_or_path |
PubMedBERT | model_name_or_path |
SciBERT | model_name_or_path |
BioELECTRA | model_name_or_path |
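Any of these checkpoints can be passed to --model_name_or_path. As a quick check, a model can be loaded from the Hub as follows; the SciBERT identifier matches the usage example below, while num_labels=3 is an assumption for a plain B/I/O tag set:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# num_labels must match the number of tags in the dataset's labels.txt.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=3)
```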
Weighting Schemes
Name |
---|
Inverse of Number of Samples (INS) |
Inverse of Square Root of Number of Samples (ISNS) |
Effective Number of Samples (ENS) |
Weighted Loss Trainer (WeLT) (Ours) |
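For orientation, these weights can be derived from per-class tag counts roughly as below. INS, ISNS, and ENS follow their standard definitions; the WeLT branch and the final normalisation are simplified sketches of the paper's formulation, not the exact implementation in run_weight_scheme.py:

```python
import math

def class_weights(counts, scheme="WeLT", beta=0.3):
    """Per-class weights from tag counts, e.g. {'O': 90000, 'B': 4000, 'I': 2000}."""
    total = sum(counts.values())
    if scheme == "INS":
        w = {c: 1.0 / n for c, n in counts.items()}
    elif scheme == "ISNS":
        w = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    elif scheme == "ENS":  # effective number of samples, beta in [0, 1)
        w = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    else:  # WeLT (sketch): rare tags get larger weights, 'O' smaller ones
        w = {c: 1.0 - n / total for c, n in counts.items()}
    # Normalise so the weights sum to the number of classes (an assumption).
    s = sum(w.values())
    return {c: v * len(w) / s for c, v in w.items()}
```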
Cost-Sensitive Fine-Tuning
We adapted BioBERT's run_ner.py to develop run_weight_scheme.py, which extends the Trainer class to a WeightedLossTrainer and overrides the compute_loss function to apply INS, ISNS, ENS, and WeLT weights in a weighted cross-entropy loss function, as sketched below.
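This is a minimal sketch of the pattern, assuming the class weights are precomputed (e.g. with one of the schemes above) and passed in as a tensor; it is not a verbatim copy of run_weight_scheme.py:

```python
import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer whose loss re-weights classes via a fixed weight tensor."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # shape: (num_labels,)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, seq_len, num_labels)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device))
        # Flatten token-level predictions; padding positions carry the
        # label -100, which CrossEntropyLoss ignores by default.
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```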
Evaluation
For a fair comparison, we used the same entity-level NER evaluation approach as BioBERT.
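That evaluation scores complete mention spans rather than individual tokens. The snippet below illustrates the same kind of entity-level precision/recall/F1 with seqeval on made-up tag sequences; it is a stand-in for, not a copy of, BioBERT's evaluation scripts:

```python
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences, one inner list per sentence.
y_true = [["B-Chemical", "I-Chemical", "O", "B-Disease"]]
y_pred = [["B-Chemical", "I-Chemical", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level F1
print(classification_report(y_true, y_pred))  # per-type breakdown
```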
Usage Example
This is an example of fine-tuning SciBERT on BioRED-Chem using the ENS weighting scheme with a β factor of 0.3:

```bash
cd named-entity-recognition
./preprocess.sh

export SAVE_DIR=./output
export DATA_DIR=../datasets/NER
export MAX_LENGTH=384
export BATCH_SIZE=5
export NUM_EPOCHS=20
export SAVE_STEPS=1000
export ENTITY=BioRED-Chem
export SEED=1

python run_weight_scheme.py \
 --data_dir ${DATA_DIR}/${ENTITY}/ \
 --labels ${DATA_DIR}/${ENTITY}/labels.txt \
 --model_name_or_path allenai/scibert_scivocab_uncased \
 --output_dir ${ENTITY}-${MAX_LENGTH}-SciBERT-ENS-0.3 \
 --max_seq_length ${MAX_LENGTH} \
 --num_train_epochs ${NUM_EPOCHS} \
 --weight_scheme ENS \
 --beta_factor 0.3 \
 --per_device_train_batch_size ${BATCH_SIZE} \
 --save_steps ${SAVE_STEPS} \
 --seed ${SEED} \
 --do_train \
 --do_eval \
 --do_predict \
 --overwrite_output_dir
```
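The fine-tuned model and its evaluation/prediction files are written to the directory passed as --output_dir (here BioRED-Chem-384-SciBERT-ENS-0.3). The --beta_factor flag sets the β used by the chosen weighting scheme; for ENS it is the β in (1-β)/(1-β^n) for a class with n samples.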
- Usage of WeLT
- Hyperparameters
Citation
```bibtex
@inproceedings{mobasher-etal-2023-welt,
    title = "{W}e{LT}: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning",
    author = {Mobasher, Ghadeer and
      M{\"u}ller, Wolfgang and
      Krebs, Olga and
      Gertz, Michael},
    booktitle = "The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.bionlp-1.40",
    pages = "427--438"
}
```
Acknowledgment
Ghadeer Mobasher* is part of the PoLiMeR-ITN and is supported by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PoLiMeR, No 812616.