A tool to generate errors according to the methodology in our manuscript. To properly generate phonetically similar errors, you need to prepare your own vocabulary file. You can edit the method `_set_vocab()` to fit your vocab file into the algorithm.
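For orientation, here is a minimal sketch of what an adapted `_set_vocab()` might look like, assuming a one-word-per-row CSV; the attribute name `self.vocab` and the file layout are assumptions, so match them to the actual class in `error_generator.py`:

```python
import csv

def _set_vocab(self):
    # Hypothetical adaptation: read a one-word-per-row CSV into a flat list.
    # Both the attribute name (self.vocab) and the CSV layout are assumptions;
    # adjust them to your vocabulary file and the class in error_generator.py.
    with open(self.vocab_file, newline="") as f:
        self.vocab = [row[0].strip() for row in csv.reader(f) if row]
```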
Run this script to generate a corrupted corpus using our `error_generator.py` tool. It will create a pickle file where each row contains the original and corrupted sentences, the changed indices and words, and a one-hot vector in which `True` means the word at that position is changed while `False` means it remains the same. The default output file name is `error_corpus.pickle`.
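To sanity-check the output, you can load the pickle and inspect a few rows; a minimal sketch, assuming the file holds a list-like collection of rows laid out as described above:

```python
import pickle

# Load the generated corpus: each row holds the original sentence, the
# corrupted sentence, the changed indices/words, and the one-hot vector.
with open("error_corpus.pickle", "rb") as f:
    rows = pickle.load(f)

for row in rows[:3]:  # peek at the first few rows
    print(row)
```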
The input `VOCAB_FILE` should be either a CSV or a TXT file, consistent with the vocab file used in `error_generator.py`. The input `CORPUS_FILE` should be a plain-text file with one sentence per line. You can change the proportion of the error types by modifying the script (see the sketch after the command below). To run the script:
```bash
python generate_error_corpus.py \
  --vocab_file=$VOCAB_FILE \
  --corpus_file=$CORPUS_FILE \
  --save_dir=$SAVE_DIR
```
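The following is a hypothetical sketch of how such error-type weighting could be expressed; the type names and proportions below are illustrative only, not the script's actual values:

```python
import random

# Hypothetical error types and their sampling proportions (illustrative only;
# edit generate_error_corpus.py for the real configuration).
ERROR_TYPES = ["phonetic", "deletion", "insertion", "substitution"]
WEIGHTS = [0.4, 0.2, 0.2, 0.2]

def pick_error_type() -> str:
    # Draw one error type according to the configured proportions.
    return random.choices(ERROR_TYPES, weights=WEIGHTS, k=1)[0]
```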
Run this Python script to fine-tune your BERT model. The model we adopt for fine-tuning is `BertForTokenClassification` from the Hugging Face `transformers` library. On top of the hidden states output by the BERT model, it applies a linear layer for token classification.
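For orientation, loading the model looks roughly like this (a minimal sketch, assuming two labels to mirror the `True`/`False` change vector):

```python
from transformers import BertForTokenClassification, BertTokenizerFast

# Two labels: 0 = word unchanged, 1 = word corrupted. "bert-base-uncased"
# is just the README's example identifier; any compatible model works.
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
```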
Arguments:
- data_dir (required): The errored `.pickle` file.
- model_dir (required): Directory of a transformers-compatible BERT model, or a model identifier from `transformers`, e.g. 'bert-base-uncased'. The parameters will be initialised from this model.
- save_dir (required): Directory to save dataloaders, fine-tuned model, and outputs.
- epochs: Number of epochs to run fine-tuning. Default: 3.
- learn_rate: Learning rate for fine-tuning the model. Default: 3e-5.
- batch_size: Batch size for fine-tuning the model. Default: 32.
To start fine-tuning, just run:
```bash
EPOCHS=3
LEARN_RATE=3e-5
BATCH_SIZE=32

python finetune.py \
  --data_dir=$IN_DATA \
  --model_dir=$MODEL_DIR \
  --save_dir=$SAVE_DIR \
  --epochs=$EPOCHS \
  --learn_rate=$LEARN_RATE \
  --batch_size=$BATCH_SIZE
```
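Note that `BertForTokenClassification` predicts one label per wordpiece, while the change vector is per word, so the labels must be expanded to wordpiece level at some point in the pipeline. A sketch of one common way to do this (not necessarily how `finetune.py` implements it):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def align_labels(words, changed):
    """Expand word-level True/False labels to wordpiece-level labels.

    Subword pieces inherit their word's label; special tokens get -100,
    which the token-classification loss in transformers ignores by default.
    """
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    labels = [-100 if word_id is None else int(changed[word_id])
              for word_id in encoding.word_ids()]
    return encoding, labels
```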
Run this script to test your fine-tuned model. It will compute the micro-averaged ROC curve and ROC area (AUC) and create a plot to illustrate them.
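For reference, the micro-averaged ROC/AUC amounts to the standard scikit-learn recipe applied to the flattened per-token scores; a minimal sketch with made-up numbers (not the script's actual code):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Illustrative data only: y_true holds the flattened True/False change
# labels, y_score the model's per-token probability of the "changed" class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.6, 0.2, 0.9])

fpr, tpr, _ = roc_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
```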
Arguments:
- data_dir (required): The errored `.pickle` file.
- model_dir (required): Directory of a transformers-compatible BERT model (or the `SAVE_DIR` of `finetune.py`).
- save_dir (required): Directory to save dataloaders, fine-tuned model, and outputs.
- batch_size: Batch size for running the test. Default: 32.
To start testing, run:
```bash
BATCH_SIZE=32

python test.py \
  --data_dir=$IN_DATA \
  --model_dir=$MODEL_DIR \
  --save_dir=$SAVE_DIR \
  --batch_size=$BATCH_SIZE
```
In this folder, there are two sample files that serve as a good starting point for quickly running the code. For `sample_corpus.txt`, we scraped the findings sections of ten radiology reports from NATIONALRad and performed sentence segmentation with spaCy.
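The segmentation step is the standard spaCy recipe; a minimal sketch (the `en_core_web_sm` model name and the input file name are assumptions):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model; any English pipeline works

with open("findings.txt") as f:  # hypothetical raw-findings file
    doc = nlp(f.read())

for sent in doc.sents:
    print(sent.text.strip())  # one sentence per line, as CORPUS_FILE expects
```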
- Clone this repository to your local PC.
- Build and run the Dockerfile:
```bash
docker run --rm -it $(docker build -q .) bash
```
- Run the `generate_error_corpus.py` script to generate data for fine-tuning. We've provided you with a sample corpus file and a vocab file; feel free to generate your own errors with them. For example:

  ```bash
  python generate_error_corpus.py --vocab_file=sample_data/sample_vocab.csv --corpus_file=sample_data/sample_corpus.txt --save_dir=sample_data/
  ```
- Run `finetune.py` to fine-tune the BERT model for dictation error detection. You'll use the `error_corpus.pickle` that you created in step 3. For example:

  ```bash
  mkdir finetuned_model && python finetune.py --data_dir=sample_data/error_corpus.pickle --model_dir=bert-base-uncased --save_dir=finetuned_model/ --epochs=3 --learn_rate=3e-5 --batch_size=32
  ```
- Run `test.py` to evaluate your fine-tuned model. You can create another corrupted corpus (step 3) to test it. For example:

  ```bash
  python test.py --data_dir=sample_data/error_corpus.pickle --model_dir=finetuned_model/ --save_dir=finetuned_model/ --batch_size=32
  ```
- Clone this repository to your local PC.
- Install Python 3.6 or above.
- Install everything in the `requirements.txt` file.
- Run the `generate_error_corpus.py` script to generate data for fine-tuning. We've provided you with a sample corpus file and a vocab file; feel free to generate your own errors with them.
- Run `finetune.py` to fine-tune the BERT model for dictation error detection. You'll use the `error_corpus.pickle` that you created in step 4.
- Run `test.py` to evaluate your fine-tuned model. You can create another corrupted corpus (step 4) to test it.