Cardiovascular Disease Risk Prediction Using NHANES 2013-2014 Data

Project Overview

This project uses the 2013-2014 CDC NHANES (National Health and Nutrition Examination Survey) dataset to build a predictive model that assesses cardiovascular risk in individuals aged 20 and above. The primary goal is a practical cardiovascular risk assessment tool that needs only simple, non-invasive inputs, such as age, education level, cardiovascular history, and symptoms. The tool is deployed as a Streamlit web application.

Dataset Source

The dataset used in this project was sourced from Kaggle and includes the following sub-datasets:

Demographics
Diet
Examination
Labs
Medications
Questionnaire

The focus was on demographic factors for the final model, but the broader dataset provides comprehensive health information on the survey participants.

Project Motivation

Problem Area

Heart disease is the leading cause of death in the United States and globally, affecting men, women, and individuals of various ethnic backgrounds. Early detection of cardiovascular risk can help prevent the onset of heart disease, reduce mortality rates, and improve population health. This project aims to create a machine learning model capable of predicting cardiovascular disease (CVD) risk based on simple, non-invasive demographic factors.

The Impact

Reducing cardiovascular disease-related deaths has profound societal and economic benefits:

Healthier Workforce: A healthier population leads to increased productivity and reduced sick days.

Lower Healthcare Costs: Early intervention can reduce the need for expensive treatments.

Global Relevance: Although the dataset covers the U.S. population, the insights and models could be adapted to other countries with similar health challenges.

Data Used

The NHANES dataset provides rich information about the health of participants, segmented into various categories. For this project, we utilized:

Demographic Data: Age, sex, education level, marital status

Questionnaire: Cardiovascular history, smoking habits, and symptoms

Examination Data: Height, weight, and blood pressure

Dataset Descriptions:

Demographic: Holds basic demographic information of survey participants, such as age, sex, education level, and income.

Questionnaire: Contains responses about medical history, smoking habits, dietary behavior, and family-level information.

Labs: Blood and urine test results, including cholesterol, blood glucose, and other metabolic markers.

Diet: Detailed dietary intake information of participants.

Examination: Physical examination results, including BMI, height, waist circumference, and blood pressure.

Methodology

Approach

The project followed these steps:

Data Cleaning
Data Preprocessing
Modeling
Model Evaluation

Data Cleaning

Imputation: Missing values were filled with the column mean, the column median, or a random draw from the observed values in the same column.
Re-encode Features: Features were re-encoded and one-hot encoded before modeling to make them easier to interpret.
Rename Features: Features were renamed in each dataset, and a dictionary keyed on SEQN (the respondent number) was used to confirm that the same respondents were present in every dataset.
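
The imputation options and the SEQN alignment described above can be sketched as follows. This is a minimal illustration on toy data, not the project's actual cleaning code: the helper names `impute_column` and `keep_shared_respondents`, the toy column names, and the choice of strategy per column are all assumptions.

```python
import numpy as np
import pandas as pd


def impute_column(series: pd.Series, strategy: str, rng: np.random.Generator) -> pd.Series:
    """Fill missing values with the mean, the median, or random draws from observed values."""
    missing = series.isna()
    if not missing.any():
        return series.copy()
    if strategy == "mean":
        return series.fillna(series.mean())
    if strategy == "median":
        return series.fillna(series.median())
    if strategy == "random":
        observed = series.dropna().to_numpy()
        filled = series.copy()
        # Draw one observed value (with replacement) for each missing entry.
        filled[missing] = rng.choice(observed, size=int(missing.sum()))
        return filled
    raise ValueError(f"unknown strategy: {strategy}")


def keep_shared_respondents(frames: dict) -> dict:
    """Keep only rows whose SEQN appears in every dataset."""
    shared = set.intersection(*(set(df["SEQN"]) for df in frames.values()))
    return {
        name: df[df["SEQN"].isin(shared)].reset_index(drop=True)
        for name, df in frames.items()
    }


# Toy stand-ins for two NHANES sub-datasets (values are made up).
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "age": [34.0, np.nan, 61.0, 47.0]})
exam = pd.DataFrame({"SEQN": [2, 3, 4, 5], "weight_kg": [70.0, np.nan, 82.0, 90.0]})

rng = np.random.default_rng(0)
demo["age"] = impute_column(demo["age"], "median", rng)
exam["weight_kg"] = impute_column(exam["weight_kg"], "random", rng)

aligned = keep_shared_respondents({"demo": demo, "exam": exam})
print(aligned["demo"])
print(aligned["exam"])
```

Random imputation preserves the spread of the observed values, whereas mean or median imputation pulls every filled value toward the center, so the choice may matter for skewed columns.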

Data Preprocessing

Feature Selection: Respondent gender, age, height, weight, education level, cardiovascular illness symptoms, and family history were prioritized in this project.
Feature Engineering: Age bins were created to handle class imbalance. Two cardiovascular disease indicators, "has_angina" and "has_family_history", were added. A third feature, "have_cvd", flags respondents who already have CVD; it is derived from the one-hot encoded cardiovascular condition features (congestive_heart_failure_Yes, coronary_heart_disease_Yes, heart_attack_Yes, stroke_Yes).
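
A hedged sketch of the "have_cvd" derivation and the age binning follows. The four condition column names come from the description above; the rule "any condition present", the bin edges, the bin labels, and the `age` and `age_group` column names are assumptions made for illustration.

```python
import pandas as pd

# One-hot encoded condition columns named in the feature engineering step.
CVD_COLUMNS = [
    "congestive_heart_failure_Yes",
    "coronary_heart_disease_Yes",
    "heart_attack_Yes",
    "stroke_Yes",
]


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Assumption: a respondent "has CVD" if any of the four conditions is flagged.
    out["have_cvd"] = out[CVD_COLUMNS].any(axis=1).astype(int)
    # Hypothetical bin edges; intervals are left-closed: [20, 40), [40, 60), [60, inf).
    out["age_group"] = pd.cut(
        out["age"],
        bins=[20, 40, 60, float("inf")],
        right=False,
        labels=["20-39", "40-59", "60+"],
    )
    return out


# Toy respondents (values are made up).
toy = pd.DataFrame(
    {
        "age": [25, 45, 80],
        "congestive_heart_failure_Yes": [0, 0, 1],
        "coronary_heart_disease_Yes": [0, 0, 0],
        "heart_attack_Yes": [0, 1, 0],
        "stroke_Yes": [0, 0, 1],
    }
)
features = add_engineered_features(toy)
print(features[["age", "age_group", "have_cvd"]])
```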

Modeling

The project employs machine learning techniques to predict whether a participant is at risk of cardiovascular disease (CVD) based on the available demographic and cardiovascular symptom data. The following models were evaluated:

Logistic Regression
Random Forest
XGBoost
Ensemble Methods
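
One way to compare candidates like these is stratified cross-validation, sketched below with scikit-learn on synthetic, imbalanced data. The synthetic data, the hyperparameters, the F1 scoring choice, and the soft-voting ensemble are assumptions for illustration and are not the project's actual setup. XGBoost is left out only to keep the sketch dependent on scikit-learn alone; an `xgboost.XGBClassifier` could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared feature matrix (about 15% positive class).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

log_reg = LogisticRegression(max_iter=1000)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
candidates = {
    "logistic_regression": log_reg,
    "random_forest": forest,
    # One possible ensemble: average the predicted probabilities of both models.
    "soft_voting": VotingClassifier([("lr", log_reg), ("rf", forest)], voting="soft"),
}

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="f1").mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```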

Model Evaluation

Final Model

After testing various models, the Random Forest algorithm was chosen for its superior accuracy and its balance of precision and recall. The model was optimized with hyperparameter tuning, and class imbalance was handled with SMOTE-ENN resampling and random sampling.

Performance Metrics

The performance of the model was evaluated using:

Accuracy
Precision
Recall
F1-Score
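
For reference, the four metrics can be computed from the confusion-matrix counts as in the sketch below. The function name `classification_metrics` and the toy labels are made up for illustration; in practice `sklearn.metrics` offers equivalent functions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = at risk)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 respondents with CVD, 7 without.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On imbalanced data such as this, accuracy alone can look strong even when positive cases are missed, which is why precision, recall, and F1 are reported alongside it.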

The Streamlit app allows users to input demographic data and assess their cardiovascular risk based on the model’s predictions.

Streamlit App

The cardiovascular risk prediction model has been deployed as a web application using Streamlit. Users can input simple demographic information such as age, sex, cardiovascular history, education, and symptoms to receive a cardiovascular risk assessment.

Installation Instructions

To run the app locally:

Clone this repository:

git clone https://github.com/your-repo/cvd-risk-assessment.git

Install the required dependencies:
```
pip install -r requirements.txt
```
Run the Streamlit app:
```
streamlit run app.py
```

The application features:

User Input Fields: Age, sex, cardiovascular history, education level, smoking history, etc.

Prediction Output: Displays the predicted risk of cardiovascular disease.
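
The flow from input fields to prediction output might look roughly like the sketch below. It is not the code in `app.py`: the feature names, the education option string, the widget labels, and the assumption that the trained model exposes `predict_proba` are all hypothetical. Streamlit is imported inside `render_app` so that the helper can be used without it.

```python
def build_feature_row(age, sex, education, has_angina, has_family_history):
    """Turn raw form inputs into a one-row feature dict (hypothetical column names)."""
    return {
        "age": int(age),
        "sex_Male": int(sex == "Male"),
        "education_college": int(education == "College graduate or above"),
        "has_angina": int(has_angina),
        "has_family_history": int(has_family_history),
    }


def render_app(model):
    """Draw the form and show a prediction; `model` is any fitted classifier
    with a predict_proba method trained on the columns built above."""
    import pandas as pd
    import streamlit as st

    st.title("Cardiovascular Risk Assessment")
    age = st.number_input("Age", min_value=20, max_value=80, value=45)
    sex = st.selectbox("Sex", ["Female", "Male"])
    education = st.selectbox(
        "Education level", ["High school or less", "College graduate or above"]
    )
    has_angina = st.checkbox("Chest pain or angina symptoms")
    has_family_history = st.checkbox("Family history of heart disease")

    if st.button("Assess risk"):
        row = pd.DataFrame(
            [build_feature_row(age, sex, education, has_angina, has_family_history)]
        )
        # Probability of the positive class (column 1 of predict_proba).
        risk = model.predict_proba(row)[0, 1]
        st.write(f"Predicted CVD risk: {risk:.0%}")


example_row = build_feature_row(45, "Male", "High school or less", True, False)
print(example_row)
```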

File Directory Structure

Data: Original and cleaned datasets.

Docs: Presentation materials and documentation.

Models: Trained models for the Streamlit app.

Notebooks: Jupyter notebooks for data cleaning and modeling.

Streamlit: Streamlit app and requirements.

README.md: This README file.

LICENSE: Project license information.

Future Work

This project can be further improved by:

Expanding Input Features: Incorporating additional clinical and lab data to improve prediction accuracy.

Testing Other Models: Exploring deep learning methods like neural networks to enhance predictive performance.

Generalization: Applying the model to different populations to test its robustness across diverse demographic groups.

Conclusion

By utilizing machine learning models on publicly available health data, this project demonstrates the potential of predictive analytics in the early detection of cardiovascular disease risk. The deployed Streamlit app makes it easy for users to input their data and receive actionable insights regarding their cardiovascular health.

Feel free to contribute to this project or raise any issues through GitHub.

License

This project is licensed under the MIT License.

