Goal: An online, containerized machine learning microservice, powered by the Red Hat portfolio, that predicts which patients are likely to become septic based on vital signs and laboratory values.
From O'REILLY Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow 2nd Edition
- Define the objective:
- early prediction of Sepsis based on patient Vital Signs and Laboratory Values
- How will the solution be used:
- Patient vitals and labs are fetched on a recurring time interval and passed through the data transformation pipeline and model for Sepsis prediction
- What are the current solutions/workarounds:
- provider manually enters values in a Sepsis application
- How should you frame this problem:
- supervised, online, classification
- How should performance be measured:
- f1_score - the harmonic mean of precision and recall, combining both into a single measure
- recall - the number of true positives divided by the total number of true positives and false negatives; good for unbalanced data
- precision - the number of true positives divided by the total number of true positives and false positives; good for unbalanced data
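The three measures above can be worked out by hand from the confusion counts. The following is a minimal, library-free sketch; the function names and the label lists are hypothetical and only illustrate the arithmetic.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives for label 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1); each is 0.0 when its denominator is zero."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = septic, 0 = not septic.
y_true = [1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With two true positives, one false positive, and one false negative, all three measures come out to 2/3 in this example.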
- What is the minimum performance needed to attain the objective for a prototype:
- 60%
- What are the comparable problems? Can you reuse experience or tools?
- Severe Septic Shock, Refractory Septic Shock, Hypertension, etc.
- Is human expertise available?
- yes
- How would you solve the problem manually?
- heuristics
- List the assumptions made so far:
- the data used to train the model is synthetic
- Verify the assumptions:
- in the data, "ICULOS" is the strongest indicator for predicting Sepsis; removing it from the data negatively impacts model accuracy.
python functions
- List the data needed and how much:
- vital signs
- laboratory values
- minimum >50k records
- Where to get the data:
- How much space will it take:
- 11.4MB
- Check legal obligations:
- Attribution 4.0 International (CC BY 4.0)
- Free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
- Get access authorizations.
- Create workspace with enough storage:
- Open Data Hub on Red Hat OpenShift
- Get the data:
- submodules/fetch_data.py
- Convert the data to a format to manipulate without changing the data:
- submodules/load_data.py
- Ensure sensitive data is deleted or protected (e.g. anonymized).
- Check the size and type of data (time series, sample, geographical):
- Total size = 11.4MB
- Total patient entries = 36,302
- Total attributes = 41
- Combined data in .CSV format containing Vitals, Labs and Demographics
- Labeled data column 41 "isSepsis" with 0=False 1=True
- Numeric (float64 and int64)
- Includes null/NaN
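A size and type check like the one above can be sketched with the standard library alone. The three-row sample and the `HR` and `Temp` column names are assumptions for illustration; only `ICULOS` and `isSepsis` are named in this document, and the real file is data/raw/dataSepsis.csv.

```python
import csv
import io

# Hypothetical sample standing in for data/raw/dataSepsis.csv.
SAMPLE = """HR,Temp,ICULOS,isSepsis
80,36.6,3,0
,38.2,10,1
112,,24,1
"""

MISSING = ("", "NaN", "nan")


def is_missing(value):
    """Treat absent cells, empty strings, and NaN markers as missing."""
    return value is None or value in MISSING


def infer_type(values):
    """Classify a column as int, float, or text, ignoring missing cells."""
    present = [v for v in values if not is_missing(v)]
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in present:
                caster(v)
        except ValueError:
            continue
        return name
    return "text"


def describe(csv_file):
    """Return the row count, a column-to-type mapping, and whether nulls exist."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    rows = list(reader)
    types = {c: infer_type([r[c] for r in rows]) for c in columns}
    has_nulls = any(is_missing(r[c]) for r in rows for c in columns)
    return len(rows), types, has_nulls


n_rows, types, has_nulls = describe(io.StringIO(SAMPLE))
print(f"entries={n_rows} attributes={len(types)} nulls={has_nulls}")
print(types)
```

To run it against the real data, pass an open file handle instead of the `io.StringIO` sample.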
data-engineering.ipynb
- Create a copy of the data for exploration
- Create a Jupyter Notebook to keep record of exploration
- Study each attribute:
- name
- data type (numeric, categorical, bounded/un, text, structured, etc.)
- % missing values
- Noisiness (stochastic, outliers, rounding errors, etc.)
- Usefulness for the task
- Type of distribution (Gaussian, uniform, logarithmic, etc.)
- For supervised learning tasks, identify the target attributes
- Visualize the data
- Study the correlations between attributes
- Study how to manually solve the problem
- Identify promising transformations
- Identify extra data that would be useful (go back to Get the Data.)
- Document
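Two of the per-attribute studies listed above, % missing values and correlation with the target, reduce to short calculations. This is a sketch under assumptions: the column values are made up, `HR` is a hypothetical attribute name, and missing entries are represented as `None`.

```python
import math


def percent_missing(values):
    """Share of entries that are None, as a percentage."""
    return 100.0 * sum(v is None for v in values) / len(values)


def pearson(xs, ys):
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0  # not enough data; treated as 0.0 in this sketch
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # undefined for a constant column; treated as 0.0 here
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attribute columns and target labels (1 = septic).
data = {
    "HR": [80.0, None, 112.0, 95.0],
    "ICULOS": [3.0, 10.0, 24.0, 5.0],
}
target = [0, 1, 1, 0]

for name, values in data.items():
    print(name, f"missing={percent_missing(values):.1f}%",
          f"corr_with_target={pearson(values, target):.3f}")
```

Ranking attributes by the absolute value of the correlation gives a quick first view of usefulness for the task, although it only captures linear relationships.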
data-science.ipynb
- Work on copies of the data
- see data/ subdirectory
- Write functions for all data transformations.
- Data cleaning:
- fix or remove outliers
- fill in missing values
- drop rows or columns
- Feature selection:
- drop attributes that provide no useful information for the task
- Feature engineering:
- Discretize continuous features
- Decompose features
- Add transformations
- Aggregate features
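As an illustration of writing transformations as functions, here is a minimal sketch of two steps from the list above: filling in missing values and adding a scaling transformation. Median imputation and min-max scaling are assumptions chosen for the example (the file name data/transform/pipeline_minmax.pkl suggests min-max scaling is used, but that reading is also an assumption), and the heart-rate values are made up.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: map everything to 0.0
    return [(v - lo) / (hi - lo) for v in values]


def transform(column):
    """Chain the two steps so the same function can be reused on any column."""
    return min_max_scale(impute_median(column))


# Hypothetical heart-rate column with one missing reading.
heart_rate = [80.0, None, 120.0, 100.0]
print(transform(heart_rate))
```

In practice the fill value and the min/max range would be learned from the training split only and then reused on new data, so that test data does not leak into the transformation.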
data-science.ipynb
- Sample smaller training sets (if affordable)
- Train many models from different categories
- Linear Support Vector Machine "SVM" / Support Vector Classifier "SVC"
- Naive Bayes
- K-Neighbors Classifier
- Random Forest Classifier
- Logistic Regression
- Stochastic Gradient Descent "SGD" Classifier
- Neural Network Multi-Layer Perceptron Classifier
- XGBoost Classifier
- Measure and compare performance
- Cross-validation with f1 scoring
- Analyze the most significant variables for each algorithm
- Analyze the types of errors the models make
- would a human have avoided them?
- Perform a round of feature engineering and selection
- Perform 1-2 quick iterations
- Shortlist the top 3-5 most promising models that make different errors
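The measure-and-compare step relies on cross-validation scored with F1. The mechanics can be sketched without any library: split the data into k folds, train on k-1 of them, and score on the held-out fold. `ThresholdModel` is a hypothetical one-feature stand-in for a real classifier, and the data is made up.

```python
def f1(y_true, y_pred):
    """F1 for label 1, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


class ThresholdModel:
    """Hypothetical classifier: predict 1 when x is at or above a learned cut."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        # Cut halfway between the two class means.
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x >= self.threshold else 0 for x in xs]


def cross_val_f1(model_factory, xs, ys, k=3):
    """Return one F1 score per fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = model_factory().fit([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        preds = model.predict([xs[i] for i in test_idx])
        scores.append(f1([ys[i] for i in test_idx], preds))
    return scores


# Hypothetical single feature and labels, interleaved so every fold has both classes.
xs = [1, 9, 2, 8, 3, 7]
ys = [0, 1, 0, 1, 0, 1]
scores = cross_val_f1(ThresholdModel, xs, ys, k=3)
print(scores, sum(scores) / len(scores))
```

The folds here are contiguous and not stratified. With unbalanced labels such as Sepsis outcomes, a stratified split that preserves the class ratio in each fold is usually the safer choice.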
*_model.ipynb
- Want as much data as possible
- Fine-tune hyperparameters using cross-validation
- Try Ensemble methods combining models
- Measure final model on Test set to estimate the error
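One simple way to combine models is hard voting, where each model casts one vote per patient and the majority label wins. The sketch below assumes three binary classifiers; the names mirror the mlp, xgbc, and rfc notebooks, but the prediction lists are made up.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions from three models for the same four patients.
mlp_pred = [1, 0, 1, 0]
xgbc_pred = [1, 1, 0, 0]
rfc_pred = [0, 1, 1, 0]

ensemble_pred = majority_vote([mlp_pred, xgbc_pred, rfc_pred])
print(ensemble_pred)
```

Using an odd number of models avoids ties for binary labels. Voting helps most when the models make different errors, which is why the shortlist step favors models that disagree.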
- Document what you have done
- Create a presentation
- Explain why the solution achieves the business objective
- Highlight interesting points
- what worked?
- what didn't?
- assumptions
- limitations
- Get model ready for production
- Write monitoring code to check the system's live performance at regular intervals and trigger alerts when it drops
- Beware:
- slow degradation as data evolves
- monitor input quality as well as output
- Retrain models on fresh data regularly (automate it!)
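The monitoring idea can be sketched as a rolling window over recently confirmed outcomes. `RecallMonitor` is a hypothetical helper, not part of this repository. Using recall as the tracked measure and reusing the 60% prototype target as the alert threshold are both assumptions.

```python
from collections import deque


class RecallMonitor:
    """Track recall over the most recent labelled predictions and flag drops."""

    def __init__(self, window_size, min_recall):
        self.window = deque(maxlen=window_size)  # oldest entries fall off
        self.min_recall = min_recall

    def record(self, y_true, y_pred):
        """Store one confirmed outcome alongside the model's prediction."""
        self.window.append((y_true, y_pred))

    def recall(self):
        """Recall over the window, or None if it holds no positive cases."""
        tp = sum(1 for t, p in self.window if t == 1 and p == 1)
        fn = sum(1 for t, p in self.window if t == 1 and p == 0)
        return tp / (tp + fn) if (tp + fn) else None

    def should_alert(self):
        """True when recall is measurable and below the configured minimum."""
        current = self.recall()
        return current is not None and current < self.min_recall


monitor = RecallMonitor(window_size=4, min_recall=0.6)
for outcome in [(1, 1), (1, 1), (0, 0), (1, 0)]:
    monitor.record(*outcome)
alert_before = monitor.should_alert()  # recall is 2/3, above the minimum

monitor.record(1, 0)  # another missed case pushes the oldest hit out
alert_after = monitor.should_alert()   # recall is now 1/3, below the minimum
print(alert_before, alert_after)
```

A scheduled job could call `should_alert()` at regular intervals. The same pattern applies to input quality, for example by tracking the share of missing values in incoming records.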
.
├── README.md
├── data
│   ├── in # simulate new patient(s) vitals, labs, etc. data
│   │   └── new_data.csv # sample new data
│   ├── out # simulate predictions from input data
│   │   └── new_data_results.csv # sample new data predictions
│   ├── raw # raw data for exploration to training
│   │   ├── archive.gz
│   │   ├── archive.zip
│   │   └── dataSepsis.csv
│   └── transform # data transformation pipelines serialized to disk
│       ├── pipeline.pkl
│       └── pipeline_minmax.pkl
├── images # images used in the notebooks for illustration
│   ├── SIRSvsqSOFA.jpg
│   ├── SepsisDetection.png
│   ├── confusion_matrix.png
│   └── scikitlearn-choose-right-estimator.png
├── main.py # main prediction python script
├── models # serialized model files from experiment to final downselection
│   ├── experiment # serialized from data-science notebook
│   │   ├── gnb_model.pkl
│   │   ├── knn_model.pkl
│   │   ├── log_model.pkl
│   │   ├── mlp_model.pkl
│   │   ├── rfc_model.pkl
│   │   ├── sgd_model.pkl
│   │   ├── svc_model.pkl
│   │   └── xgbc_model.pkl
│   ├── final # production-ready models
│   │   ├── mlp_model.pkl
│   │   └── xgbc_model.pkl
│   └── tune # hyperparameter tuned models
│       ├── mlp_model.pkl
│       └── xgbc_model.pkl
├── notebooks
│   ├── README.md
│   ├── data-engineering.ipynb # exploring data
│   ├── data-science.ipynb # feature engineering and model exploring
│   ├── mlp-model.ipynb # downselected model for production
│   ├── rfc-model.ipynb # downselected model for production
│   └── xgbc-model.ipynb # downselected model for production
├── reports # saved figures for reference
│   └── figures
│       ├── experiment
│       │   ├── gnb_cm.png
│       │   ├── gnb_prc.png
│       │   ├── knn_cm.png
│       │   ├── knn_prc.png
│       │   ├── log_cm.png
│       │   ├── log_prc.png
│       │   ├── mlp_cm.png
│       │   ├── mlp_prc.png
│       │   ├── rfc_cm.png
│       │   ├── rfc_prc.png
│       │   ├── sgd_cm.png
│       │   ├── sgd_prc.png
│       │   ├── svc_cm.png
│       │   ├── svc_prc.png
│       │   ├── xgbc_cm.png
│       │   └── xgbc_prc.png
│       ├── final
│       │   ├── mlp_cm.png
│       │   ├── mlp_prc.png
│       │   ├── rfc_cm.png
│       │   ├── rfc_prc.png
│       │   ├── xgbc_cm.png
│       │   └── xgbc_prc.png
│       └── tune
│           ├── mlp_cm.png
│           ├── rfc_cm.png
│           └── xgbc_cm.png
├── requirements.txt # required packages
├── scratch # scratch/to-delete folder
│   ├── model-performance-assessment.ipynb
│   ├── scratch-pad
│   ├── train_svc.py
│   ├── train_svc_grid.py
│   └── train_svc_rand.py
├── serving # model serving work
│   ├── README.md
│   └── fn
│       ├── func.py
│       ├── func.yaml
│       ├── requirements.txt
│       └── test_func.py
└── submodules # reusable python functions/methods
    ├── __init__.py
    ├── __pycache__
    │   ├── __init__.cpython-39.pyc
    │   ├── config.cpython-39.pyc
    │   ├── fetch.cpython-39.pyc
    │   ├── fetch_data.cpython-39.pyc
    │   └── load_data.cpython-39.pyc
    ├── config.py
    ├── fetch_data.py
    └── load_data.py