SC1015-Diabetes_Prediction

About

This repository contains our mini-project for SC1015 (Introduction to Data Science & Artificial Intelligence). Our project identifies the health metrics that matter most when predicting the risk of diabetes.

Contributors

  • @lohzhishen
  • @YoNG-Zaii
  • @TANERNHONG

Presentation

Diabetes Prediction Youtube Video

Usage

Download "heart_2020_cleaned.csv" and "SC1015_Intro_to_DSAI.ipynb". Place both files in the same folder.

Jupyter Notebook:

  • Open the notebook and run all cells in order.

Google Colaboratory:

  • Upload "SC1015_Intro_to_DSAI.ipynb" to Google Colab.
  • Upload "heart_2020_cleaned.csv" to the runtime environment. Ensure that the whole file has finished uploading before running the notebook, or there will be errors.

Problem Statement

What are some of the important health metrics in determining the risk of diabetes?

Approach

We take a data-driven approach and reframe the question as a classification problem: predicting whether an individual has diabetes from their health metrics. From the models we trained, we extracted the relative feature importance and used it to draw conclusions about the importance of the different health metrics.

Dataset used

Dataset from Kaggle: "Personal Key Indicators of Heart Disease" by Kamil Pytlak
Source : https://www.kaggle.com/kamilpytlak/personal-key-indicators-of-heart-disease (requires login to download)

Although we are not using the dataset for its intended purpose, it provides the essential information we need, namely whether an individual has diabetes and their health metrics.

New Tools Used

EDA:

  • Chi-square test of independence (chi2_contingency from scipy), Cramér's V and Tschuprow's T.
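
The three EDA statistics above can be computed together from one contingency table. The following is a minimal sketch, not the notebook's code: the helper name `association_stats` and the example counts are hypothetical, and Yates' continuity correction is switched off as an assumption so that 2x2 tables are treated like larger ones.

```python
import numpy as np
from scipy.stats import chi2_contingency


def association_stats(table):
    """Chi-square test plus Cramér's V and Tschuprow's T for a contingency table."""
    observed = np.asarray(table, dtype=float)
    # correction=False disables Yates' continuity correction (an assumption here).
    chi2, p_value, dof, _expected = chi2_contingency(observed, correction=False)
    n = observed.sum()
    r, c = observed.shape
    # Cramér's V normalises chi2 by the smaller table dimension.
    cramers_v = np.sqrt(chi2 / (n * min(r - 1, c - 1)))
    # Tschuprow's T normalises by the geometric mean of both dimensions.
    tschuprows_t = np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1))))
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        "cramers_v": float(cramers_v),
        "tschuprows_t": float(tschuprows_t),
    }


# Hypothetical counts: rows = diabetic (no / yes), columns = three categories.
example = [[30, 10, 5], [10, 20, 25]]
print(association_stats(example))
```

Both V and T lie between 0 (no association) and 1. They coincide for square tables, and T is smaller than V whenever the table has unequal numbers of rows and columns.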

Data preprocessing:

  • Undersampling the majority class (RandomUnderSampler from imblearn)
  • Class weights
  • Transforms (StandardScaler and PolynomialFeatures from sklearn)
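
As a rough illustration of the preprocessing steps listed above, the sketch below runs them on synthetic data. It is written under several assumptions: the data is randomly generated, the helper `undersample_majority` is a hypothetical plain-NumPy stand-in for imblearn's RandomUnderSampler (whose `fit_resample` method plays the same role), and the order of the two transforms is one plausible choice rather than the notebook's confirmed pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in data: 900 negatives and 100 positives.
X = rng.normal(size=(1000, 3))
y = np.array([0] * 900 + [1] * 100)


def undersample_majority(X, y, rng):
    """Randomly drop rows so every class keeps as many rows as the rarest one."""
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


# Option 1: balance the classes by undersampling.
X_bal, y_bal = undersample_majority(X, y, rng)

# Option 2: keep every row and weight classes inversely to their frequency.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
class_weight = {0: float(weights[0]), 1: float(weights[1])}

# Transforms: expand to degree-2 polynomial terms, then standardise each column.
transform = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
)
X_t = transform.fit_transform(X_bal)

print(np.bincount(y_bal), class_weight, X_t.shape)
```

In a real workflow the transforms would be fitted on the training split only and then applied to the test split, to avoid leaking test information into the model.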

Models:

  • Logistic regression (LogisticRegression from sklearn)
  • Random forest classifier (RandomForestClassifier from sklearn)
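
For readers unfamiliar with the two estimators, here is a minimal sketch of fitting both on the same split. The data comes from `make_classification`, and every hyperparameter shown (`class_weight="balanced"`, `max_iter`, `n_estimators`, the random seeds) is an assumption made for illustration, not the configuration used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (about 90% negatives).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(
        class_weight="balanced", max_iter=1000
    ),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {scores[name]:.3f}")
```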

Evaluation tools:

  • Evaluation reports (classification_report from sklearn)
  • ROC curve and AUC score (roc_curve and roc_auc_score from sklearn)
  • Recursive feature elimination (RFE from sklearn)
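
The evaluation tools above fit together roughly as follows. This is a sketch on synthetic data, assuming a plain LogisticRegression as the estimator and an arbitrary choice of three features for RFE to keep; neither is taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision, recall and F1 as a text table.
report = classification_report(y_test, model.predict(X_test))
print(report)

# ROC curve and AUC need scores for the positive class, not hard labels.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(f"AUC = {auc:.3f}")

# RFE repeatedly refits the estimator and drops the weakest feature.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_train, y_train)
print("kept features:", rfe.support_)
print("elimination ranking (1 = kept):", rfe.ranking_)
```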

Conclusions

Results

Accuracy of models

LogisticRegression is the better-performing model: it achieves higher accuracy and a higher AUC score than RandomForestClassifier, and it has lower false positive and false negative rates.
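
For reference, accuracy and the false positive and false negative rates mentioned above all derive from the confusion matrix. The sketch below uses a hypothetical helper `error_rates` and made-up labels; it does not reproduce the project's results.

```python
from sklearn.metrics import confusion_matrix


def error_rates(y_true, y_pred):
    """Accuracy, false positive rate and false negative rate for binary labels."""
    # For labels [0, 1], ravel() yields counts in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": float((tp + tn) / (tn + fp + fn + tp)),
        "false_positive_rate": float(fp / (fp + tn)),
        "false_negative_rate": float(fn / (fn + tp)),
    }


# Made-up labels: 4 negatives followed by 4 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]
rates = error_rates(y_true, y_pred)
print(rates)
# accuracy 0.625, false positive rate 0.25, false negative rate 0.5
```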

Feature importance

The LogisticRegression model placed approximately equal importance on BMI, AgeCategory and GenHealth as predictors of diabetes. In contrast, the RandomForestClassifier placed the highest importance on BMI, with GenHealth and AgeCategory about half as important as BMI and SleepTime about a quarter as important.
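
One common way to obtain comparable relative importances from the two model types is sketched below. It is an assumption, not the notebook's confirmed method: for LogisticRegression it normalises the absolute coefficients fitted on standardised features, and for RandomForestClassifier it reads the impurity-based `feature_importances_`. The data is synthetic and the feature names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
    random_state=0,
)
names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # placeholders

# Standardise so coefficient magnitudes are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

log_reg = LogisticRegression(max_iter=1000).fit(X_scaled, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_scaled, y)

# Logistic regression: share of total absolute coefficient per feature.
abs_coef = np.abs(log_reg.coef_[0])
lr_importance = abs_coef / abs_coef.sum()

# Random forest: impurity-based importances, already normalised to sum to 1.
rf_importance = forest.feature_importances_

for name, lr, rf in zip(names, lr_importance, rf_importance):
    print(f"{name}: logistic = {lr:.2f}, forest = {rf:.2f}")
```

The two measures are computed differently, so only the relative ranking within each model is directly comparable, not the absolute values across models.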

Insights

Both models suggest that BMI is an important factor and SleepTime is a relatively unimportant factor in predicting the risk of diabetes.

However, LogisticRegression distributed importance more evenly across the variables, with SleepTime as the exception noted above, whereas the RandomForestClassifier placed much greater relative importance on BMI than on the other variables.

  • BMI, Age and General Health are important health metrics in predicting the risk of diabetes.
  • Sleep Time is a relatively unimportant health metric in predicting the risk of diabetes.

Learning Points

  • Handling imbalanced datasets using resampling methods and class weights.
  • Parametric and nonparametric machine-learning algorithms (LogisticRegression and RandomForestClassifier respectively) from sklearn.
  • Recursive feature elimination and the sklearn package.
  • Collaborating using Google Colab.
  • Concepts of PolynomialFeatures and StandardScaler.
  • Concepts of Chi-square Test of Independence, Cramér's V, and Tschuprow's T.

References

Bali A. (2022, April 5). Chi-Square Formula: Definition, P-value, Applications, Examples. collegedunia.
     https://collegedunia.com/exams/chi-square-formula-definition-pvalue-applications-examples-articleid-4167#fi

Chao, D. Y. (2021, May 22). Chi-Square Test, with Python. Towards Data Science.
     https://towardsdatascience.com/chi-square-test-with-python-d8ba98117626

Common pitfalls and recommended practices. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/common_pitfalls.html

Diabetes. (2022). World Health Organization.
     https://www.who.int/health-topics/diabetes#tab=tab_1

Imbalanced Data. (2011, Nov 1). Google Developers.
     https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

Pytlak, K. (2022). Personal Key Indicators of Heart Disease. Kaggle.
     https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

RandomUnderSampler. (n.d.). ImbalancedLearn.
     https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html

Richmond, S. (2016, March 21). Algorithms Exposed: Random Forest. bccvl.
     https://bccvl.org.au/algorithms-exposed-random-forest/#:~:text=ASSUMPTIONS,are%20ordinal%20or%20non%2Dordinal

Sarang, N. (2018, June 27). Understanding AUC - ROC Curve. Towards Data Science.
     https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Seb. (2021, April 8). Chi-Square Distribution Table. Programmathically.
     https://programmathically.com/chi-square-distribution-table/

Singapore's War on Diabetes. (2021, May 26). HealthHub.
     https://www.healthhub.sg/live-healthy/1273/d-day-for-diabetes

sklearn.ensemble.RandomForestClassifier. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

sklearn.feature_selection.RFE. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html?highlight=rfe#sklearn.feature_selection.RFE

sklearn.linear_model.LogisticRegression. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

sklearn.metrics.classification_report. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report

sklearn.metrics.roc_auc_score. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score

sklearn.metrics.roc_curve. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve

sklearn.preprocessing.PolynomialFeatures. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures

sklearn.preprocessing.StandardScaler. (n.d.). Scikit-learn.
     https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
