This is the source code for the Master's thesis project "Hospital readmission prediction with long clinical notes", presented in partial fulfilment of the requirements for the degree of MSc in Computer Science at the University of Cape Town.
The project aims to evaluate the effect of using a Transformer-based model with a sparse attention pattern to predict 30-day hospital readmission on a cohort from the MIMIC-III dataset.
Available here is the source code for data processing, model training, and evaluation.
The code was tested with Python 3.8 in a Linux environment, and using a virtual environment is recommended.
- Install the dependencies with:

  ```sh
  pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
  ```
If you plan on using GPUs, use the `requirements-hpc.txt` file instead.
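For a CPU-only setup, a minimal sketch using Python's built-in `venv` module (any other virtual-environment tool works just as well) looks like this:

```sh
# Create and activate an isolated environment, then install the CPU dependencies.
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
```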
To execute data processing, run the following module:

```sh
python -m src.clinical_longformer.data.processing
```
To get the help message, use the `--help` argument:
```
$ python -m src.clinical_longformer.data.processing --help
usage: processing.py [-h] [--n-days {1-30}] [-v] [-vv] mimic_path {ds,all} {-1,512,1024,2048,4096} [out_path]

Data processing

positional arguments:
  mimic_path            MIMIC-III dataset path
  {ds,all}              set notes category (ds - Discharge Summary)
  {-1,512,1024,2048,4096}
                        set note length, -1 means do not chunk text
  out_path              set output path

optional arguments:
  -h, --help            show this help message and exit
  --n-days {1-30}       set number of days (only used if category is set to all)
  -v, --verbose         set loglevel to INFO
  -vv, --very-verbose   set loglevel to DEBUG
```
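As an illustration, the following invocation (with placeholder paths) would process discharge summaries, chunk notes into 4096-token segments, and write the result to `data/processed` with INFO-level logging:

```sh
python -m src.clinical_longformer.data.processing -v \
    /path/to/mimic-iii ds 4096 data/processed
```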
Model training is done with the PyTorch Lightning framework.
There are four executable modules available in `src/clinical_longformer/model`: `dan.py`, `lstm.py`, `bert.py`, and `longformer.py`.
These modules run the PyTorch Lightning Trainer; you can find the available arguments by using the `--help` argument.
More information is available in the docs.
In the case of `longformer.py`, the maximum token length can be specified using the `--max_length` argument.
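For example (a sketch only; any further required arguments, such as data paths, are module-specific and are listed in the help output):

```sh
# List the arguments accepted by the Longformer module
# (the module's own options plus the PyTorch Lightning Trainer flags).
python -m src.clinical_longformer.model.longformer --help

# Sketch of a run with the maximum token length set to 4096.
python -m src.clinical_longformer.model.longformer --max_length 4096
```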
In the `hpc-uct` and `chpc` folders there are examples of how to run the models.
Hyperparameter tuning is done using Weights & Biases.
Look inside `hpc-uct` and `chpc` for examples of how to run the sweeps.
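As general background, a Weights & Biases sweep is registered from a configuration file and executed by one or more agents; the sketch below uses a hypothetical `sweep.yaml` and a placeholder sweep ID, while the actual configurations live in the `hpc-uct` and `chpc` job files:

```sh
# Register the sweep and print its ID (sweep.yaml is a placeholder name).
wandb sweep sweep.yaml

# Launch an agent that pulls hyperparameter combinations and runs trials.
wandb agent <entity>/<project>/<sweep_id>
```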
Pre-training is done using the Hugging Face Transformers library's language-modeling example script.
The script has been cloned into this repository, and job files for running it on `hpc-uct` are available.
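For reference, the masked-language-modeling variant of that example script (`run_mlm.py`) is typically invoked along the following lines; the model name, file paths, and sequence length below are placeholders, and the exact arguments used in this project are in the job files:

```sh
python run_mlm.py \
    --model_name_or_path allenai/longformer-base-4096 \
    --train_file data/notes_train.txt \
    --validation_file data/notes_val.txt \
    --max_seq_length 4096 \
    --do_train --do_eval \
    --output_dir output/pretrained-longformer
```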