Chronic kidney disease (CKD) has been on the rise in recent years and is a major cause of mortality and health expenditure in the United States. This project uses 235 features drawn from U.S. Census Bureau data to test whether hyper-local CKD rates can be predicted from readily available demographic data. The features cover age, sex, marital status, disability, employment, profession, household type, housing costs, and type of insurance. Regression and ensemble methods were used to predict CKD rates. Gradient boosted decision trees proved to be the best model, with an adjusted R² of 0.8394, meaning the model explains about 84% of the variance in CKD rates after accounting for the number of features.
The purpose of this project is to help federal, state, and local public health agencies and organizations better target public health campaigns for chronic kidney disease prevention. The predictive model supports this goal by allowing limited resources to be directed to the neighborhoods with the greatest need for intervention.
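Because the model is evaluated with adjusted R² rather than plain R², it may help to see how the two relate. The following is a minimal sketch using only the standard library; the function names are illustrative and are not taken from the project notebooks. Adjusted R² is computed as 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p is the number of features.

```python
def r_squared(y_true, y_pred):
    """Plain coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R^2 for the number of predictors in the model."""
    if n_samples <= n_features + 1:
        raise ValueError("adjusted R^2 needs n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers for illustration only (not the project's results):
# a plain R^2 of 0.90 on 1,000 samples with 235 features.
example = adjusted_r_squared(0.90, n_samples=1000, n_features=235)
```

With as many as 235 features, the penalty term is not negligible, which is why the adjusted figure is the more conservative one to report.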
Data Sources
- 500 Cities: Local Data for Better Health (Centers for Disease Control)
- American Community Survey 5-year Data API (U.S. Census Bureau)
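As a rough illustration of how features might be requested from the American Community Survey 5-year Data API, the sketch below only builds a request URL and does not send it. The year (2019), the variable `B01001_001E` (total population), the tract-level geography, and the helper name `build_acs5_url` are all assumptions made for this example; the project's notebooks may use different years, variables, and geographies.

```python
from urllib.parse import urlencode

# Base endpoint for the ACS 5-year dataset; the year is filled in per request.
ACS5_BASE = "https://api.census.gov/data/{year}/acs/acs5"


def build_acs5_url(year, variables, state_fips, api_key=None):
    """Build a tract-level ACS 5-year query URL (hypothetical helper)."""
    params = {
        "get": ",".join(["NAME", *variables]),
        "for": "tract:*",                        # every census tract...
        "in": f"state:{state_fips} county:*",    # ...in all counties of one state
    }
    if api_key:
        params["key"] = api_key  # optional Census API key
    # Keep ':', '*', and ',' unescaped so the URL stays readable.
    return ACS5_BASE.format(year=year) + "?" + urlencode(params, safe=":*,")


url = build_acs5_url(2019, ["B01001_001E"], "06")
```

The resulting URL can be passed to any HTTP client. The API returns JSON in which the first row holds the column names.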
Project Files
- Final Report: A summary of the project process, results, and actionable insights.
- Slide Deck: Slides used to present the findings.
- Notebooks: Used in the following order to create the code base for this project.
  - Data Wrangling: collecting, organizing, and cleaning datasets
  - Data Storytelling: using exploratory data analysis to tell a story about the data
  - Exploratory Data Analysis: exploring the data for initial insights, correlations, and potentially important features
  - Regression Analysis: using various regression and ensemble methods to predict CKD prevalence
- Reports: These reports were written to track progress and explain the process throughout the project.
- Images: All saved plot and map outputs