A tool to generate errors according to the methodology in our manuscript. To properly generate phonetically similar errors, you need to prepare your own vocabulary file. You can edit the method `_set_vocab()` to fit your vocab file into the algorithm.
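For orientation, here is a minimal sketch of what an adapted `_set_vocab()` might look like, assuming a one-word-per-row CSV; the attribute name `self.vocab` and the file layout are assumptions, so match them to the actual class in `error_generator.py`:

```python
import csv

def _set_vocab(self):
    # Hypothetical adaptation: read a one-word-per-row CSV into a flat list.
    # Both the attribute name (self.vocab) and the CSV layout are assumptions;
    # adjust them to your vocabulary file and the class in error_generator.py.
    with open(self.vocab_file, newline="") as f:
        self.vocab = [row[0].strip() for row in csv.reader(f) if row]
```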
Run this script to generate a corrupted corpus using our `error_generator.py` tool. It will create a pickle file where each row contains the original and corrupted sentences, the changed indices and words, and a one-hot vector in which `True` means the word at that position is changed while `False` means it remains the same. The default output file name is `error_corpus.pickle`.
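To sanity-check the output, you can load the pickle and inspect a few rows; a minimal sketch, assuming the file holds a list-like collection of rows laid out as described above:

```python
import pickle

# Load the generated corpus: each row holds the original sentence, the
# corrupted sentence, the changed indices/words, and the one-hot vector.
with open("error_corpus.pickle", "rb") as f:
    rows = pickle.load(f)

for row in rows[:3]:  # peek at the first few rows
    print(row)
```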
The input `VOCAB_FILE` should be either a CSV or a TXT file, consistent with the vocab file used in `error_generator.py`. The input `CORPUS_FILE` should be a plain-text file with one sentence per line. You can change the proportion of the error types by modifying the script (see the sketch after the command below). To run the script:
```bash
python generate_error_corpus.py \
  --vocab_file=$VOCAB_FILE \
  --corpus_file=$CORPUS_FILE \
  --save_dir=$SAVE_DIR
```
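The following is a hypothetical sketch of how such error-type weighting could be expressed; the type names and proportions below are illustrative only, not the script's actual values:

```python
import random

# Hypothetical error types and their sampling proportions (illustrative only;
# edit generate_error_corpus.py for the real configuration).
ERROR_TYPES = ["phonetic", "deletion", "insertion", "substitution"]
WEIGHTS = [0.4, 0.2, 0.2, 0.2]

def pick_error_type() -> str:
    # Draw one error type according to the configured proportions.
    return random.choices(ERROR_TYPES, weights=WEIGHTS, k=1)[0]
```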
Run this Python script to fine-tune your BERT model. The model we adopt for fine-tuning is `BertForTokenClassification` from the Hugging Face `transformers` library. On top of the hidden states output by the BERT model, it applies a linear layer for token classification.
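For orientation, loading the model looks roughly like this (a minimal sketch, assuming two labels to mirror the `True`/`False` change vector):

```python
from transformers import BertForTokenClassification, BertTokenizerFast

# Two labels: 0 = word unchanged, 1 = word corrupted. "bert-base-uncased"
# is just the README's example identifier; any compatible model works.
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
```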
Arguments:
- data_dir (required): The errored `.pickle` file.
- model_dir (required): Directory of a transformers-compatible BERT model, or a model identifier from `transformers`, e.g. 'bert-base-uncased'. The parameters will be initialised from this model.
- save_dir (required): Directory to save dataloaders, fine-tuned model, and outputs.
- epochs: Number of epochs to run fine-tuning. Default: 3.
- learn_rate: Learning rate for fine-tuning the model. Default: 3e-5.
- batch_size: Batch size for fine-tuning the model. Default: 32.
To start fine-tuning, just run:
```bash
EPOCHS=3
LEARN_RATE=3e-5
BATCH_SIZE=32

python finetune.py \
  --data_dir=$IN_DATA \
  --model_dir=$MODEL_DIR \
  --save_dir=$SAVE_DIR \
  --epochs=$EPOCHS \
  --learn_rate=$LEARN_RATE \
  --batch_size=$BATCH_SIZE
```
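Note that `BertForTokenClassification` predicts one label per wordpiece, while the change vector is per word, so the labels must be expanded to wordpiece level at some point in the pipeline. A sketch of one common way to do this (not necessarily how `finetune.py` implements it):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def align_labels(words, changed):
    """Expand word-level True/False labels to wordpiece-level labels.

    Subword pieces inherit their word's label; special tokens get -100,
    which the token-classification loss in transformers ignores by default.
    """
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = [-100 if word_id is None else int(changed[word_id])
              for word_id in encoding.word_ids()]
    return encoding, labels
```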
Run this script to test your fine-tuned model. It will compute the micro-averaged ROC curve and ROC area (AUC) and create a plot to illustrate them.
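For reference, the micro-averaged ROC/AUC amounts to the standard scikit-learn recipe applied to the flattened per-token scores; a minimal sketch with made-up numbers (not the script's actual code):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Illustrative data only: y_true holds the flattened True/False change
# labels, y_score the model's per-token probability of the "changed" class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
```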
Arguments:
- data_dir (required): The errored `.pickle` file.
- model_dir (required): Directory of a transformers-compatible BERT model (or the `SAVE_DIR` of `finetune.py`).
- save_dir (required): Directory to save dataloaders, fine-tuned model, and outputs.
- batch_size: Batch size for running the test. Default: 32.
To start testing, run:
```bash
BATCH_SIZE=32

python test.py \
  --data_dir=$IN_DATA \
  --model_dir=$MODEL_DIR \
  --save_dir=$SAVE_DIR \
  --batch_size=$BATCH_SIZE
```
In this folder, there are two sample files that serve as a good starting point for quickly running the code. For `sample_corpus.txt`, we scraped the findings sections of ten radiology reports from NATIONALRad and performed sentence segmentation with spaCy.
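The segmentation step is the standard spaCy recipe; a minimal sketch (the `en_core_web_sm` model name and the input file name are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

with open("findings.txt") as f:  # hypothetical raw-findings file
    doc = nlp(f.read())

for sent in doc.sents:
    print(sent.text.strip())  # one sentence per line, as CORPUS_FILE expects
```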
- Clone this repository to your local PC.
- Build and run the Dockerfile:
```bash
docker run --rm -it $(docker build -q .) bash
```
- Run the `generate_error_corpus.py` script to generate data for fine-tuning. We've provided you with a sample corpus file and a vocab file; feel free to generate your own errors with them. For example:

  ```bash
  python generate_error_corpus.py --vocab_file=sample_data/sample_vocab.csv --corpus_file=sample_data/sample_corpus.txt --save_dir=sample_data/
  ```
- Run `finetune.py` to fine-tune the BERT model for dictation error detection. You'll use the `error_corpus.pickle` that you created in step 3. For example:

  ```bash
  mkdir finetuned_model && python finetune.py --data_dir=sample_data/error_corpus.pickle --model_dir=bert-base-uncased --save_dir=finetuned_model/ --epochs=3 --learn_rate=3e-5 --batch_size=32
  ```
- Run `test.py` to evaluate your fine-tuned model. You can create another corrupted corpus (step 3) to test it. For example:

  ```bash
  python test.py --data_dir=sample_data/error_corpus.pickle --model_dir=finetuned_model/ --save_dir=finetuned_model/ --batch_size=32
  ```
- Clone this repository to your local PC.
- Install Python 3.6 or above.
- Install everything in the `requirements.txt` file.
- Run the `generate_error_corpus.py` script to generate data for fine-tuning. We've provided you with a sample corpus file and a vocab file; feel free to generate your own errors with them.
- Run `finetune.py` to fine-tune the BERT model for dictation error detection. You'll use the `error_corpus.pickle` that you created in step 4.
- Run `test.py` to evaluate your fine-tuned model. You can create another corrupted corpus (step 4) to test it.