Authors: Ghadeer Mobasher*, Olga Krebs, Wolfgang Müller, and Michael Gertz
Biomedical pre-trained language models (BioPLMs) have been achieving state-of-the-art results on various biomedical text-mining tasks. However, prevailing fine-tuning approaches naively train BioPLMs on targeted datasets without considering the class distributions. This is problematic, especially when dealing with imbalanced biomedical gold-standard datasets for named entity recognition (NER). Despite their high overall performance, state-of-the-art fine-tuned NER models are biased towards the "other" (O) tag and misclassify biomedical entities. To fill this gap, we propose WeLT, a cost-sensitive BERT fine-tuning approach that handles class imbalance for biomedical NER. We investigate the impact of WeLT against traditional fine-tuning approaches on mixed-domain and domain-specific BioPLMs, and evaluate it against other weighting schemes such as Inverse of Number of Samples (INS), Inverse of Square Root of Number of Samples (ISNS), and Effective Number of Samples (ENS). Our results show that WeLT outperforms these alternatives on four different biomedical BERT models and BioELECTRA across 8 gold-standard datasets.
Dependencies
- Python (>=3.6)
- PyTorch (>=1.2.0)

Clone this GitHub repository:
```bash
git clone https://github.com/mobashgr/WELT.git
```
Navigate to the WELT folder and install all necessary dependencies:
```bash
python3 -m pip install -r requirements.txt
```
Note: To install the appropriate torch build, follow the download instructions for your development environment.
NER Datasets
The NER datasets are directly retrieved from BioBERT via this link. We have extended the aforementioned NER datasets to include BioRED. To convert it from BioC XML/JSON to the CoNLL format, we used bconv and filtered the chemical and disease entities.
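For reference, such a conversion can be done with bconv along the following lines. This is a minimal sketch with placeholder file names; the entity-type filtering step is omitted, and the format identifiers follow bconv's documentation:

```python
import bconv

# Load a BioC XML collection and re-serialise it in CoNLL format.
# File names are placeholders, not actual paths from this repository.
coll = bconv.load('BioRED_train.xml', fmt='bioc_xml')
bconv.dump(coll, 'BioRED_train.conll', fmt='conll')
```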
Data Download
To directly download the NER datasets, use download.sh, or manually download them via this link into the WELT directory, then:
```bash
unzip datasets.zip
rm -r datasets.zip
```
Data Pre-processing
We adapted the preprocessing.sh script from BioBERT to include BioRED.
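After pre-processing, each dataset follows the usual two-column CoNLL layout: one token and one IOB tag per line, with a blank line between sentences. The tokens below are illustrative only; the actual tag set for each dataset is listed in its labels.txt:

```
Famotidine    B
-             O
associated    O
delirium      O
.             O
```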
We have conducted experiments on different BERT models using the WeLT weighting scheme, and compared WeLT against other existing weighting schemes as well as the corresponding traditional fine-tuning approaches (i.e., standard BioBERT fine-tuning).
Fine-tuning BERT Models
Model | Used version in HF 🤗 |
---|---|
BioBERT | model_name_or_path |
BlueBERT | model_name_or_path |
PubMedBERT | model_name_or_path |
SciBERT | model_name_or_path |
BioELECTRA | model_name_or_path |
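Any of these checkpoints can be passed to --model_name_or_path. As a quick check, a model can be loaded from the Hub as follows; the SciBERT identifier matches the usage example below, while num_labels=3 is an assumption for a plain B/I/O tag set:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# num_labels must match the number of tags in the dataset's labels.txt.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=3)
```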
Weighting Schemes
Name |
---|
Inverse of Number of Samples (INS) |
Inverse of Square Root of Number of Samples (ISNS) |
Effective Number of Samples (ENS) |
Weighted Loss Trainer (WeLT) (Ours) |
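For orientation, these weights can be derived from per-class tag counts roughly as below. INS, ISNS, and ENS follow their standard definitions; the WeLT branch and the final normalisation are simplified sketches of the paper's formulation, not the exact implementation in run_weight_scheme.py:

```python
import math

def class_weights(counts, scheme="WeLT", beta=0.3):
    """Per-class weights from tag counts, e.g. {'O': 90000, 'B': 4000, 'I': 2000}."""
    total = sum(counts.values())
    if scheme == "INS":
        w = {c: 1.0 / n for c, n in counts.items()}
    elif scheme == "ISNS":
        w = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    elif scheme == "ENS":  # effective number of samples, beta in [0, 1)
        w = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    else:  # WeLT (sketch): rare tags get larger weights, 'O' smaller ones
        w = {c: 1.0 - n / total for c, n in counts.items()}
    # Normalise so the weights sum to the number of classes (an assumption).
    s = sum(w.values())
    return {c: v * len(w) / s for c, v in w.items()}
```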
Cost-Sensitive Fine-Tuning
We adapted BioBERT's run_ner.py to develop run_weight_scheme.py, which extends the Trainer class to a WeightedLossTrainer and overrides the compute_loss function to apply INS, ISNS, ENS, and WeLT weights in a weighted cross-entropy loss function, as sketched below.
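This is a minimal sketch of the pattern, assuming the class weights are precomputed (e.g. with one of the schemes above) and passed in as a tensor; it is not a verbatim copy of run_weight_scheme.py:

```python
import torch
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer whose loss re-weights classes via a fixed weight tensor."""

    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # shape: (num_labels,)

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits  # (batch, seq_len, num_labels)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device))
        # Flatten token-level predictions; padding positions carry the
        # label -100, which CrossEntropyLoss ignores by default.
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```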
Evaluation
For a fair comparison, we used the same entity-level NER evaluation approach as BioBERT.
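That evaluation scores complete mention spans rather than individual tokens. The snippet below illustrates the same kind of entity-level precision/recall/F1 with seqeval on made-up tag sequences; it is a stand-in for, not a copy of, BioBERT's evaluation scripts:

```python
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences, one inner list per sentence.
y_true = [["B-Chemical", "I-Chemical", "O", "B-Disease"]]
y_pred = [["B-Chemical", "I-Chemical", "O", "O"]]

print(f1_score(y_true, y_pred))               # entity-level F1
print(classification_report(y_true, y_pred))  # per-type breakdown
```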
Usage Example
This is an example of fine-tuning SciBERT on BioRED-Chem using the ENS weighting scheme with a β factor of 0.3:

```bash
cd named-entity-recognition
./preprocess.sh

export SAVE_DIR=./output
export DATA_DIR=../datasets/NER
export MAX_LENGTH=384
export BATCH_SIZE=5
export NUM_EPOCHS=20
export SAVE_STEPS=1000
export ENTITY=BioRED-Chem
export SEED=1

python run_weight_scheme.py \
 --data_dir ${DATA_DIR}/${ENTITY}/ \
 --labels ${DATA_DIR}/${ENTITY}/labels.txt \
 --model_name_or_path allenai/scibert_scivocab_uncased \
 --output_dir ${ENTITY}-${MAX_LENGTH}-SciBERT-ENS-0.3 \
 --max_seq_length ${MAX_LENGTH} \
 --num_train_epochs ${NUM_EPOCHS} \
 --weight_scheme ENS \
 --beta_factor 0.3 \
 --per_device_train_batch_size ${BATCH_SIZE} \
 --save_steps ${SAVE_STEPS} \
 --seed ${SEED} \
 --do_train \
 --do_eval \
 --do_predict \
 --overwrite_output_dir
```
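The fine-tuned model and its evaluation/prediction files are written to the directory passed as --output_dir (here BioRED-Chem-384-SciBERT-ENS-0.3). The --beta_factor flag sets the β used by the chosen weighting scheme; for ENS it is the β in (1-β)/(1-β^n) for a class with n samples.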
- Usage of WeLT
- Hyperparameters
Citation
```bibtex
@inproceedings{mobasher-etal-2023-welt,
    title = "{W}e{LT}: Improving Biomedical Fine-tuned Pre-trained Language Models with Cost-sensitive Learning",
    author = {Mobasher, Ghadeer and
      M{\"u}ller, Wolfgang and
      Krebs, Olga and
      Gertz, Michael},
    booktitle = "The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.bionlp-1.40",
    pages = "427--438"
}
```
Acknowledgment
Ghadeer Mobasher* is part of the PoLiMeR-ITN and is supported by the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement PoLiMeR, No 812616.