Lung cancer screening with low-dose computed tomography significantly improves lung cancer outcomes, improving survival and reducing morbidity; two large randomized controlled lung cancer screening trials have demonstrated 20% (NLST trial) and 24% (NELSON trial) reductions in lung cancer mortality, respectively. These results have motivated the development of national lung screening programs. The success of these programs hinges on their ability to select the right patients for screening, balancing the benefits of early detection against the harms of overscreening. This capacity relies on our ability to estimate a patient's risk of developing lung cancer. In this class project, we will develop machine learning tools to predict lung cancer risk from PLCO questionnaires, develop screening guideline simulations, and compare the cost-effectiveness of these proposed guidelines against the current NLST criteria. The goal of this project is to give you hands-on experience developing machine learning tools from scratch and analyzing their clinical implications in a real-world setting. At the end of this project, you will write a short project report describing your model and your analyses. Submit your project code and project report by the due date.
You can log in to the CPH Departmental nodes via ssh. The steps are:
Log in to the UCSF VPN
ssh cph-dept-01.ucsf.edu
Note, we have four CPH nodes, namely cph-dept-[01-04], that are reserved for Cornerstone coursework. You can find the class GPU allocation spreadsheet here.
You might find the following unix tools useful: tmux, htop and oh-my-zsh.
Starter project code is available in this GitHub repository. You can clone it with the following command:
git clone git@github.com:yala/CPH200_24.git
To manage dependencies, we'll be using miniconda. You can install miniconda following these instructions.
After loading conda, you can then create your python3.10 environment and install the necessary python packages with the following commands:
conda create -n env_name python=3.10
conda activate env_name
pip install -r requirements.txt
The PLCO datasets, which include helpful data dictionaries and readmes, are available at:
/scratch/project1/plco
In this part of the project, you will extend the starter code available in vectorizer.py, logistic_regression.py and main.py to develop lung cancer risk models from the PLCO data.
To get started, we will implement logistic regression with Stochastic Gradient Descent to predict lung cancer risk using just patient age.
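Recall, we can define a logistic regression model, in its standard form, as:
$$p(y = 1 \mid x) = \sigma(\theta^\top x + b)$$
where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, and $\theta$ and $b$ are the model weights and bias.
We will train our model to perform classification using the binary cross entropy loss with L2 regularization:
$$L(\theta, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] + \frac{\lambda}{2} \lVert \theta \rVert_2^2$$
where $p_i = \sigma(\theta^\top x_i + b)$, $N$ is the number of training examples, and $\lambda$ is the regularization parameter.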
To complete this part of the project, you will want to extract age data (column name is "age") from the PLCO csv, featurize it, and implement SGD.
You will need to solve for the gradient of the loss with respect to your model parameters. Note, pay special attention to the numerical stability of your update rule. You may also need to experiment with the number of training steps, batch size, learning rate and regularization parameter.
Your validation set ROC AUC should be around 0.60.
In your project report, please include a plot of your training and validation loss curves and describe the details of your model implementation.
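As a rough illustration of the numerical stability point, here is a minimal NumPy sketch of minibatch SGD for logistic regression. It is not the starter code's interface: the function names (`sigmoid`, `bce_loss`, `train_sgd`) and default hyperparameters are hypothetical, and it assumes the L2 penalty is written as (lambda / 2) * ||theta||^2 and that features are already standardized. The loss is computed from logits so that it never takes the log of zero.

```python
import numpy as np


def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out


def bce_loss(z, y, theta, lam):
    """Mean binary cross entropy from logits z, plus an assumed (lam / 2) L2 penalty.

    Uses the identity BCE = log(1 + exp(z)) - y * z, with
    log(1 + exp(z)) evaluated stably by np.logaddexp(0, z).
    """
    return np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * lam * np.sum(theta ** 2)


def train_sgd(X, y, lr=0.1, lam=1e-4, batch_size=32, num_steps=2000, seed=0):
    """Minibatch SGD. X: (n, d) float array, y: (n,) array of 0/1 labels."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(num_steps):
        idx = rng.integers(0, n, size=batch_size)  # sample a batch with replacement
        xb, yb = X[idx], y[idx]
        err = sigmoid(xb @ theta + b) - yb  # dLoss/dlogit for each example
        theta -= lr * (xb.T @ err / batch_size + lam * theta)
        b -= lr * err.mean()
    return theta, b
```

Logging `bce_loss` on the training and validation sets every few hundred steps is one way to produce the loss curves requested for the report.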
A key challenge in developing effective machine learning tools is experiment management. Even for a simple model, such as your logistic regression model, and a simple structured dataset (i.e., PLCO), there is a wide range of hyperparameters to tune. It quickly becomes intractable to identify the best model configuration by hand. In this part of the project, you will develop a small job dispatcher that will run a grid search over a set of hyperparameters. Your dispatcher should take as input a list of hyperparameters and, for each, a list of values to search over. Your dispatcher should then run a job for each combination of hyperparameter values. Your dispatcher should also keep track of the results of each job and summarize the results in a convenient format. You can find some helpful starter code in dispatcher.py.
Complete the grid search dispatcher and use it to tune the hyperparameters of your age-based logistic regression model. In your project report, include a plot showing the relationship between L2 regularization strength and model training loss.
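The core of a grid search can be sketched in a few lines. The example below is a simplified, in-process illustration and does not reflect the actual structure of dispatcher.py: `expand_grid`, `run_grid`, and the hyperparameter names in the usage note are hypothetical. A real dispatcher would more likely launch each configuration as a separate job and run several in parallel.

```python
import itertools


def expand_grid(search_space):
    """Return one config dict per combination of hyperparameter values.

    search_space maps a hyperparameter name to the list of values to try.
    """
    names = sorted(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def run_grid(search_space, run_job):
    """Call run_job(config) for every config and collect one result row per job."""
    results = []
    for config in expand_grid(search_space):
        results.append({**config, "result": run_job(config)})
    return results
```

For example, `expand_grid({"learning_rate": [0.1, 0.01], "regularization_lambda": [0.0, 0.1, 1.0]})` yields six configurations. The rows returned by `run_grid` can be written to a CSV file for later plotting.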
Now that you have built a simple single-feature classifier, you will extend your model to include additional features from the PLCO dataset. A data dictionary from the NCI is available at /scratch/project1/plco/dictionary_lung_prsn-aug21-091521.pdf.
Note, this includes a wide range of questionnaire features including smoking history, family history, and other demographic information. Some of this data is numeric and some is categorical, and you will need to develop an efficient way to featurize this data. Moreover, you will also need to decide how to handle missing data, and how to deal with the scale of various features (e.g. age vs. pack years). For this step, you will find some hints on a suggested vectorizer design in vectorizer.py. Note, you do not need to use all the features in the questionnaire.
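One possible featurization strategy is sketched below. It is an illustration under stated assumptions, not the design suggested in vectorizer.py: the class name `SimpleVectorizer` is hypothetical, rows are assumed to be dicts of raw csv values, and a missing value is assumed to appear as an empty string or `None`. Numeric columns are standardized with training-set statistics and mean-imputed; categorical columns are one-hot encoded with an extra level for missing values.

```python
import numpy as np


class SimpleVectorizer:
    """Standardizes numeric columns and one-hot encodes categorical ones."""

    def __init__(self, numeric_cols, categorical_cols):
        self.numeric_cols = numeric_cols
        self.categorical_cols = categorical_cols

    @staticmethod
    def _level(value):
        # Missing categorical values get their own "<missing>" level.
        return "<missing>" if value in ("", None) else str(value)

    def fit(self, rows):
        # Statistics come from the training rows only, ignoring missing values.
        self.stats = {}
        for col in self.numeric_cols:
            vals = np.array([float(r[col]) for r in rows if r[col] not in ("", None)])
            self.stats[col] = (vals.mean(), vals.std() or 1.0)
        self.levels = {
            col: sorted({self._level(r[col]) for r in rows})
            for col in self.categorical_cols
        }
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            feats = []
            for col in self.numeric_cols:
                mean, std = self.stats[col]
                # Mean imputation: a missing value maps to 0 after standardization.
                missing = r[col] in ("", None)
                feats.append(0.0 if missing else (float(r[col]) - mean) / std)
            for col in self.categorical_cols:
                level = self._level(r[col])
                # Levels unseen during fit encode as all zeros.
                feats.extend(1.0 if level == known else 0.0 for known in self.levels[col])
            out.append(feats)
        return np.array(out)
```

Fitting on the training split only and reusing the fitted statistics for the validation and test splits avoids leaking information across splits. Other choices, such as adding a missingness indicator for numeric columns, may work better for some features.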
Beyond a richer set of features, you may want to consider more sophisticated models like Random Forest or Gradient Boosted Trees. You're invited to leverage the sklearn library for this part of the project to quickly explore other model implementations.
At the end of this phase, your validation ROC AUC should be greater than or equal to 0.83.
In your project report, please include a test ROC plot of your final model, compared to your age-based model, and describe the details of your final model implementation. Please also include any interesting ablations from your model development process.
Now that you have developed your lung cancer model and finalized your hyperparameters, you will focus on evaluating the performance of your model on the test set and on various subgroups of the test set.
In your project report, include ROC curves and precision-recall curves of your best model and highlight the operating point of the current NLST criteria (available in the "nlst_flag" column). In addition to performance on the overall test set, evaluate the performance of your model (using ROC AUC) on the following subgroups:
- sex (sex column)
- race (race7 column)
- educational status (educat column)
- cigarette smoking status (cig_stat column)
- NLST eligibility (nlst_flag column)
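A subgroup analysis like the one above can be sketched as follows. This is an illustrative NumPy implementation with hypothetical helper names (`roc_auc`, `subgroup_auc`); it computes ROC AUC as the probability that a randomly chosen positive scores higher than a randomly chosen negative. Because it builds a full pairwise comparison matrix, sklearn.metrics.roc_auc_score is likely the more practical choice on the full test set.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """ROC AUC as P(score of a random positive > score of a random negative).

    Ties count as one half. Returns None when either class is absent,
    since the AUC is undefined in that case.
    """
    y_true, y_score = np.asarray(y_true), np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        return None
    diff = pos[:, None] - neg[None, :]  # all positive-negative score pairs
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def subgroup_auc(y_true, y_score, groups):
    """Return {group value: (group size, AUC)} for each distinct value in groups."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        out[g.item()] = (int(mask.sum()), roc_auc(y_true[mask], y_score[mask]))
    return out
```

Reporting the group size next to each AUC helps flag subgroups that are too small, or contain too few cancers, for the estimate to be reliable.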
Are there any meaningful performance differences across these groups? What are the limitations of these analyses? What do these analyses tell us about our lung cancer risk model?
In addition to overall and subgroup analyses, list the top 3 most important features in your model. Note, depending on the type of model (e.g. tree method vs logistic regression) you use, you may need to leverage different model interpretability techniques. In your report, list the most important features and describe how you identified them.
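One model-agnostic option is permutation importance: shuffle one feature column at a time and measure how much a performance metric drops. The sketch below is a minimal NumPy illustration with hypothetical argument names (`predict`, `metric`); sklearn.inspection.permutation_importance offers a maintained implementation. For a logistic regression on standardized features, comparing coefficient magnitudes is a simpler alternative.

```python
import numpy as np


def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Mean drop in metric(y, predict(X)) when each column of X is shuffled.

    predict maps an (n, d) array to predictions; metric(y, predictions) is
    assumed to be higher-is-better (e.g. accuracy or ROC AUC).
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the label.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j] += baseline - metric(y, predict(X_perm))
    return drops / n_repeats
```

Note that with one-hot encoded categorical features, each column is permuted separately, so the importance of a categorical variable is spread across its levels unless the columns are shuffled together.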
Recall that lung cancer screening guidelines must balance the early detection of lung cancer against the harms of overscreening. In this part of the project, you will simulate the clinical utility of your model by comparing the cost-effectiveness of your model against the current national screening criteria (also known as the NLST criteria, available in the nlst_flag column).
To start off, compute the sensitivity, specificity and positive predictive value (PPV) of the NLST criteria on the PLCO test set. Note, you can use the sklearn.metrics library to compute these metrics. If you were to match the specificity (i.e. amount of overscreening), sensitivity (i.e. fraction of cancer patients benefiting from early detection) or PPV (i.e. fraction of screened patients that will develop cancer) of the NLST criteria, what performance would your risk model enable?
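The matching procedure can be sketched as follows, here for specificity. This is an illustrative NumPy version with hypothetical function names (`screening_metrics`, `threshold_matching_specificity`), and it assumes that patients whose risk score is at or above the threshold are screened. Matching sensitivity or PPV follows the same pattern with a different stopping condition.

```python
import numpy as np


def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity and PPV of a binary screening decision.

    A metric is NaN when its denominator is zero (e.g. PPV if no one is screened).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fn = np.sum(y_true & ~y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


def threshold_matching_specificity(y_true, y_score, target_specificity):
    """Lowest score threshold whose specificity is at least the target.

    Patients with y_score >= threshold are screened, so specificity is the
    fraction of non-cancer patients scoring strictly below the threshold.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    for t in np.unique(y_score):  # np.unique returns sorted values
        if np.mean(y_score[~y_true] < t) >= target_specificity:
            return float(t)
    return float("inf")  # no threshold reaches the target: screen no one
```

With the threshold in hand, `screening_metrics(y_true, y_score >= threshold)` gives the model's sensitivity and PPV at the same level of overscreening as the NLST criteria.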
How would you choose a risk threshold for lung screening and why? Note, this is a subjective choice.
For your chosen risk threshold, please compute its performance metrics across the patient subgroups listed above.
In the closing section of your project report, please discuss the implications of your findings and the limitations of these analyses in shaping lung cancer screening guidelines. What is missing in these analyses? What additional studies are needed to broaden clinical screening criteria?