Trains machine learning algorithms to predict the age and the risk of dying for participants of NHANES dataset

HMS-AgeVSSurvival/TrainingCenter

TrainingCenter

This repository is part of a larger project that studies age prediction and survival prediction from the NHANES dataset. The code of this project is split into 3 repositories:

  • 📦NHANES_preprocessing to scrape the NHANES website and preprocess the data.
  • 📦TrainingCenter to train the algorithms from the dataset created in the previous repository.
  • 📦CorrelationCenter to study the outputs of the models trained in the previous repository.

Feel free to start a discussion to ask anything here.

Installation

To set up your virtual environment:

pip install pip==20.0.2
pip install -e .

Structure to have before launching the jobs

Before launching the jobs, get the datasets and set up the folds by running:

make_folds --main_category MAIN_CATEGORY --category CATEGORY --number_folds NUMBER_FOLDS

For this command line to work, you need to have this folder structure:

┣ 📦NHANES_preprocessing 
┃  ┗ 📂merge
┃    ┗ 📂data
┃       ┣ 📂examination
┃       ┃ ┗ 📜[category].feather
┃       ┣ 📂laboratory
┃       ┃ ┗ 📜[category].feather
┃       ┗ 📂questionnaire
┃         ┗ 📜[category].feather
┗ 📦TrainingCenter
   ┗ 📂[...]
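
As a minimal sketch of the layout above, the following Python helper builds the path where a given [category].feather file is expected and checks that it exists. The function names, the `root` argument (the folder that contains both repositories) and the assumption that MAIN_CATEGORY matches the folder names in the tree are illustrative and are not part of this repository.

```python
from pathlib import Path

# Main categories, taken from the folder names in the tree above.
MAIN_CATEGORIES = ("examination", "laboratory", "questionnaire")


def expected_feather_path(root, main_category, category):
    """Return the path where the tree above places [category].feather.

    Hypothetical helper: `root` is assumed to be the folder that holds
    both NHANES_preprocessing and TrainingCenter.
    """
    if main_category not in MAIN_CATEGORIES:
        raise ValueError(f"unknown main category: {main_category!r}")
    return (
        Path(root) / "NHANES_preprocessing" / "merge" / "data"
        / main_category / f"{category}.feather"
    )


def dataset_is_available(root, main_category, category):
    """Return True if the expected feather file exists under `root`."""
    return expected_feather_path(root, main_category, category).is_file()
```

Calling `dataset_is_available("..", "laboratory", "my_category")` from inside TrainingCenter would then tell you whether the file is in place before you run make_folds; `my_category` is a placeholder.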

Pipelines

There are three pipelines available.

Predictions

To predict the biological age or the risk of dying, use the prediction command:

prediction --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH

Basic predictions

To get a control for the survival predictions, you can train the models on age, sex and ethnicity only with this command:

basic_prediction --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH

Feature importances

To get the feature importances of the models, you can use:

feature_importances --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH
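
The three commands above take the same flags, so a batch of runs can be scripted by assembling the command line programmatically. The sketch below only builds the argument list; it does not run anything. The helper name `build_command` is hypothetical, and the values passed in the example (`my_category`, `my_target`, `my_algorithm`) are placeholders, since the accepted values are not listed here.

```python
import shlex

# Command names, taken from the three pipelines described above.
PIPELINES = ("prediction", "basic_prediction", "feature_importances")


def build_command(pipeline, main_category, category, target, algorithm,
                  random_state, n_inner_search):
    """Return the argument list for one pipeline run (hypothetical helper)."""
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    return [
        pipeline,
        "--main_category", str(main_category),
        "--category", str(category),
        "--target", str(target),
        "--algorithm", str(algorithm),
        "--random_state", str(random_state),
        "--n_inner_search", str(n_inner_search),
    ]


# Placeholder values: replace them with real categories, targets and algorithms.
command = build_command("prediction", "laboratory", "my_category",
                        "my_target", "my_algorithm", 1, 2)
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting issues if a category name contains spaces.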

Results

All the results are available in this spreadsheet. The spreadsheet is updated automatically when the computations are done.

Executing the file ./shape_age_range/export_information.py adds the shapes and the age ranges to the spreadsheet for each category and each target.

Launching jobs

The folder fit_running gathers the scripts that launch jobs on a computer cluster using Slurm, without you having to specify the memory or the time limit yourself.

To run the tests

python -m unittest
