A multi-layer perceptron which predicts whether an individual is susceptible to diabetes. The model has been trained on the Pima Indians Diabetes Database, provided by the National Institute of Diabetes and Digestive and Kidney Diseases.
matplotlib
pandas
Keras
NumPy
seaborn
scikit-learn
Note: 'outcome' refers to whether an individual does, or does not, have diabetes
- Variables are on different scales, and therefore must be standardized
- The majority of data has been collected from individuals between 20 and 30 years of age
- BMI, Blood Pressure, and Glucose are normally distributed
- This is to be expected when such statistics are collected from a population
- It is impossible for BMI, Blood Pressure, and Glucose to have a value of zero
- Missing or incomplete data?
- Certain individuals have had up to 15 pregnancies
- While not implausible, this information should still be considered
- This data-set suggests that 35% of the population has diabetes (65% do not)
- The World Health Organisation estimates that only 8.5% of the global population suffers from diabetes
- ...this data-set is therefore not representative of the global population, which is to be expected due to its nature
- Glucose, BMI, and Age appear to be the strongest predictors of diabetes
- Blood Pressure and Skin Thickness do not appear to have a significant correlation with the distribution of diabetic and non-diabetic individuals
- There are a total of 768 entries
- Pregnancies, Glucose Concentration, Blood Pressure, Skin Thickness, Insulin, and BMI appear to have a minimum value of zero. For every variable except Pregnancies, a value of zero is impossible, which indicates missing values
- There is a significant number of missing values. Most notably, a large number of entries for Insulin and Skin Thickness are missing
- Because missing values have been identified by searching for entries with a value of zero, Pregnancies can be ignored, as an individual with zero pregnancies is perfectly valid
- Missing values have been replaced with the mean of non-missing values
- The values for Outcome have been copied from the original dataset as they do not require standardization
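The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names follow the commonly distributed CSV version of the dataset (an assumption), the rows are toy values invented for the example, and `zero_as_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the real dataset.
df = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [120, 0, 150, 90],
    "Insulin": [0, 80, 0, 100],
    "BMI": [30.0, 25.0, 0.0, 35.0],
    "Outcome": [1, 0, 1, 0],
})

# Columns where a zero is treated as a missing value.
# Pregnancies is deliberately excluded: zero pregnancies is valid.
zero_as_missing = ["Glucose", "Insulin", "BMI"]

cleaned = df.copy()
# Mark zeros as NaN, then fill each column with the mean of its non-missing values.
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].mean()
)

# Standardize the features (zero mean, unit variance); Outcome is copied unchanged.
features = cleaned.drop(columns="Outcome")
scaled = (features - features.mean()) / features.std(ddof=0)
scaled["Outcome"] = cleaned["Outcome"]
```

Using `ddof=0` matches the population standard deviation used by scikit-learn's `StandardScaler`, which would be a reasonable drop-in alternative for the manual arithmetic here.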
The dataset has been split into training (80%) and testing (20%) splits. The training set has then been further divided into training (80%) and validation (20%) splits.
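One way to produce such a two-stage split is with scikit-learn's `train_test_split`, sketched below. The placeholder arrays, the feature count of 8, and `random_state=0` are assumptions made for the example; the original code may have shuffled or seeded differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's 768 rows; contents are dummy values.
X = np.zeros((768, 8))
y = np.arange(768) % 2

# First split: 80% training, 20% testing.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: the training portion becomes 80% training, 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=0
)
```

Because the second split is taken from the 80% training portion, the final proportions of the whole dataset are roughly 64% training, 16% validation, and 20% testing.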
Once trained, the model was able to achieve 96.74% accuracy on the training set and 70.13% accuracy on the testing set.
- In the case of diabetes prediction, false negatives are the least desirable outcome, as they would result in patients being informed that they will not develop diabetes when in fact they may.
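Since accuracy alone does not reveal how many false negatives occur, it can help to count them directly. The sketch below uses made-up labels and predictions (hypothetical values, not the model's real output) to show how false negatives and recall could be computed with NumPy.

```python
import numpy as np

# Hypothetical labels and predictions (1 = diabetic, 0 = non-diabetic).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# Count each cell of the confusion matrix.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # diabetic, predicted diabetic
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # diabetic, predicted non-diabetic
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-diabetic, predicted diabetic
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-diabetic, predicted non-diabetic

accuracy = (tp + tn) / len(y_true)
# Recall (sensitivity): the share of diabetic individuals the model catches.
recall = tp / (tp + fn)
```

A model can score well on accuracy while still having low recall, so recall is the more direct measure of the false-negative concern raised above.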