This repository is part of a larger project that studies age prediction and survival prediction from the NHANES dataset. The code for this project is split across three repositories:
- 📦NHANES_preprocessing to scrape the NHANES website and preprocess the data.
- 📦TrainingCenter to train the algorithms from the dataset created in the previous repository.
- 📦CorrelationCenter to study the outputs of the models trained in the previous repository.
Feel free to start a discussion here if you have any questions.
To set up your virtual environment:
pip install pip==20.0.2
pip install -e .
Before launching any jobs, you need to get the datasets and create the folds by running:
make_folds --main_category MAIN_CATEGORY --category CATEGORY --number_folds NUMBER_FOLDS
For this command to work, you need the following folder structure:
┣ 📦NHANES_preprocessing
┃ ┗ 📂merge
┃   ┗ 📂data
┃     ┣ 📂examination
┃     ┃ ┗ 📜[category].feather
┃     ┣ 📂laboratory
┃     ┃ ┗ 📜[category].feather
┃     ┗ 📂questionnaire
┃       ┗ 📜[category].feather
┣ 📦TrainingCenter
┗ 📂[...]
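Before running make_folds, it can help to check that the expected input files are in place. The following is a minimal sketch, not part of this repository: the helper names `feather_path` and `missing_inputs` are hypothetical, and it assumes that the value passed to --main_category matches one of the three folder names shown above and that --category matches the file name without its extension.

```python
from pathlib import Path

# Folder names taken from the layout above; assumed to be the only main categories.
MAIN_CATEGORIES = ("examination", "laboratory", "questionnaire")


def feather_path(root, main_category, category):
    """Build the path where the layout above places [category].feather.

    `root` is the directory that contains both NHANES_preprocessing and
    TrainingCenter (hypothetical helper, not provided by this repository).
    """
    if main_category not in MAIN_CATEGORIES:
        raise ValueError(f"unknown main category: {main_category!r}")
    return (
        Path(root)
        / "NHANES_preprocessing"
        / "merge"
        / "data"
        / main_category
        / f"{category}.feather"
    )


def missing_inputs(root, requests):
    """Return the expected files that do not exist.

    `requests` is an iterable of (main_category, category) pairs.
    """
    expected = (feather_path(root, main, cat) for main, cat in requests)
    return [path for path in expected if not path.is_file()]
```

Calling `missing_inputs("..", [("laboratory", "my_category")])` from inside TrainingCenter would then list any file that still needs to be produced by NHANES_preprocessing; `my_category` is a placeholder.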
There are three pipelines available.
To predict the biological age or the risk of dying, use the dedicated command:
prediction --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH
To obtain a control for the survival predictions, you can train the models on age, sex and ethnicity only with this command:
basic_prediction --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH
To get the feature importances of the models, you can use:
feature_importances --main_category MAIN_CATEGORY --category CATEGORY --target TARGET --algorithm ALGORITHM --random_state RANDOM_STATE --n_inner_search N_INNER_SEARCH
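Since the three pipelines take the same arguments, they can be driven from a small script. The sketch below only assembles the command lines shown above; `build_command` is a hypothetical helper, and the values `my_category`, `age` and `my_algorithm` are placeholders, since the accepted values for --category, --target and --algorithm are not listed here.

```python
import shlex

# Entry points named in this README.
PIPELINES = ("prediction", "basic_prediction", "feature_importances")


def build_command(pipeline, main_category, category, target,
                  algorithm, random_state, n_inner_search):
    """Return the argument list for one pipeline run (hypothetical helper)."""
    if pipeline not in PIPELINES:
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    return [
        pipeline,
        "--main_category", main_category,
        "--category", category,
        "--target", target,
        "--algorithm", algorithm,
        "--random_state", str(random_state),
        "--n_inner_search", str(n_inner_search),
    ]


# Placeholder values: replace them with ones the pipelines actually accept.
command = build_command(
    "prediction", "examination", "my_category", "age", "my_algorithm", 0, 10
)
print(shlex.join(command))
```

Passing the resulting list to `subprocess.run(command, check=True)` would launch the job from Python, provided the package has been installed with `pip install -e .` so that the entry points are on the path.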
All the results are available in this spreadsheet, which is updated automatically once the computations are done.
Executing the file ./shape_age_range/export_information.py adds the shapes and the age ranges to the spreadsheet for each category and each target.
The folder fit_running gathers the scripts needed to launch jobs on a Slurm cluster without having to specify the memory or the time limit yourself.
To run the unit tests:
python -m unittest