title: "Diabetes Relation Analysis"
author: "Aashay Sharma"
date: "24/07/2020"
In this analysis we will look for relations between the variables in the data set and fit models to predict the diabetes outcome.
The data is already clean and does not require any specific cleaning, but we will convert the Outcome variable to a factor in order to make some plots.
library(ggplot2); library(gridExtra)  # plotting
library(caret)                        # model fitting and evaluation

data <- "/Users/aashaysharma/Desktop/RStudio/diabetes/diabetes.csv"
diabetes <- read.csv(data)
head(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
We have many predictors, but our outcome is discrete: binary data, 1 or 0.
First I will run an ANOVA (aov) to get the variance distribution of the data, so we can see which variables account for how much of the variance.
variance_analysis <- aov(Outcome ~ . , data = diabetes)
summary(variance_analysis)
## Df Sum Sq Mean Sq F value Pr(>F)
## Pregnancies 1 8.59 8.59 53.638 6.16e-13 ***
## Glucose 1 34.02 34.02 212.406 < 2e-16 ***
## BloodPressure 1 0.12 0.12 0.771 0.380213
## SkinThickness 1 0.86 0.86 5.393 0.020481 *
## Insulin 1 0.26 0.26 1.594 0.207108
## BMI 1 6.78 6.78 42.331 1.40e-10 ***
## DiabetesPedigreeFunction 1 1.82 1.82 11.349 0.000793 ***
## Age 1 0.46 0.46 2.865 0.090922 .
## Residuals 759 121.57 0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The variables that account for the most variance and have the lowest p-values are:
- Pregnancies
- Glucose
- BMI
- DiabetesPedigreeFunction
- SkinThickness (though its p-value is somewhat higher than the others in this list)
Converting the outcome to a factor variable:
diabetes$Outcome2 <- as.factor(diabetes$Outcome)
a <- ggplot(data = diabetes, aes(x = Outcome2, y = Pregnancies)) + geom_boxplot()
b <- ggplot(data = diabetes, aes(x = Outcome2, y = Glucose)) + geom_boxplot()
c <- ggplot(data = diabetes, aes(x = Outcome2, y = BMI)) + geom_boxplot()
d <- ggplot(data = diabetes, aes(x = Outcome2, y = DiabetesPedigreeFunction)) + geom_boxplot()
e <- ggplot(data = diabetes, aes(x = Outcome2, y = SkinThickness)) + geom_boxplot()
grid.arrange(a, b, c, d, e, nrow = 3, ncol = 2)
Looking at the plots, Pregnancies and Glucose show a clear difference in means between the two outcomes with few outliers; BMI and DiabetesPedigreeFunction show a smaller mean difference but many outliers, which may make the fit less accurate; SkinThickness shows a smaller mean difference with one outlier. We will therefore try three feature sets:
- Pregnancies and Glucose
- BMI and DiabetesPedigreeFunction along with the first two
- all five features, including SkinThickness
We will use the caret package for fitting and evaluation.
First we split the data into a training and a testing set.
set.seed(1234)
inTrain <- createDataPartition(y = diabetes$Outcome2, list = FALSE, p = 0.65)
train <- diabetes[inTrain,]
test <- diabetes[-inTrain,]
RF_model <- train(Outcome2 ~ Pregnancies + Glucose, method = "rf", data = train, ntree = 100)
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
RF_predict <- predict(RF_model, test)
confusionMatrix(test$Outcome2, RF_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 139 36
## 1 36 57
##
## Accuracy : 0.7313
## 95% CI : (0.674, 0.7835)
## No Information Rate : 0.653
## P-Value [Acc > NIR] : 0.003716
##
## Kappa : 0.4072
##
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.7943
## Specificity : 0.6129
## Pos Pred Value : 0.7943
## Neg Pred Value : 0.6129
## Prevalence : 0.6530
## Detection Rate : 0.5187
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7036
##
## 'Positive' Class : 0
##
LR_model <- train(Outcome2 ~ Pregnancies + Glucose, method = "glm", family = "binomial", data = train)
LR_predict <- predict(LR_model, test)
confusionMatrix(test$Outcome2, LR_predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 155 20
## 1 40 53
##
## Accuracy : 0.7761
## 95% CI : (0.7214, 0.8246)
## No Information Rate : 0.7276
## P-Value [Acc > NIR] : 0.04112
##
## Kappa : 0.4798
##
## Mcnemar's Test P-Value : 0.01417
##
## Sensitivity : 0.7949
## Specificity : 0.7260
## Pos Pred Value : 0.8857
## Neg Pred Value : 0.5699
## Prevalence : 0.7276
## Detection Rate : 0.5784
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7604
##
## 'Positive' Class : 0
##
RF_model2 <- train(Outcome2 ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, method = "rf", data = train, ntree = 100)
RF_predict2 <- predict(RF_model2, test)
confusionMatrix(test$Outcome2, RF_predict2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 150 25
## 1 32 61
##
## Accuracy : 0.7873
## 95% CI : (0.7334, 0.8347)
## No Information Rate : 0.6791
## P-Value [Acc > NIR] : 5.73e-05
##
## Kappa : 0.5223
##
## Mcnemar's Test P-Value : 0.4268
##
## Sensitivity : 0.8242
## Specificity : 0.7093
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.6559
## Prevalence : 0.6791
## Detection Rate : 0.5597
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7667
##
## 'Positive' Class : 0
##
LR_model2 <- train(Outcome2 ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, method = "glm", family = "binomial", data = train)
LR_predict2 <- predict(LR_model2, test)
confusionMatrix(test$Outcome2, LR_predict2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 154 21
## 1 37 56
##
## Accuracy : 0.7836
## 95% CI : (0.7294, 0.8314)
## No Information Rate : 0.7127
## P-Value [Acc > NIR] : 0.005295
##
## Kappa : 0.5024
##
## Mcnemar's Test P-Value : 0.048885
##
## Sensitivity : 0.8063
## Specificity : 0.7273
## Pos Pred Value : 0.8800
## Neg Pred Value : 0.6022
## Prevalence : 0.7127
## Detection Rate : 0.5746
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7668
##
## 'Positive' Class : 0
##
RF_model3 <- train(Outcome2 ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + SkinThickness, method = "rf", data = train, ntree = 100)
RF_predict3 <- predict(RF_model3, test)
confusionMatrix(test$Outcome2, RF_predict3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 149 26
## 1 36 57
##
## Accuracy : 0.7687
## 95% CI : (0.7135, 0.8178)
## No Information Rate : 0.6903
## P-Value [Acc > NIR] : 0.002792
##
## Kappa : 0.4763
##
## Mcnemar's Test P-Value : 0.253038
##
## Sensitivity : 0.8054
## Specificity : 0.6867
## Pos Pred Value : 0.8514
## Neg Pred Value : 0.6129
## Prevalence : 0.6903
## Detection Rate : 0.5560
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7461
##
## 'Positive' Class : 0
##
LR_model3 <- train(Outcome2 ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + SkinThickness, method = "glm", family = "binomial", data = train)
LR_predict3 <- predict(LR_model3, test)
confusionMatrix(test$Outcome2, LR_predict3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 155 20
## 1 37 56
##
## Accuracy : 0.7873
## 95% CI : (0.7334, 0.8347)
## No Information Rate : 0.7164
## P-Value [Acc > NIR] : 0.00512
##
## Kappa : 0.5097
##
## Mcnemar's Test P-Value : 0.03407
##
## Sensitivity : 0.8073
## Specificity : 0.7368
## Pos Pred Value : 0.8857
## Neg Pred Value : 0.6022
## Prevalence : 0.7164
## Detection Rate : 0.5784
## Detection Prevalence : 0.6530
## Balanced Accuracy : 0.7721
##
## 'Positive' Class : 0
##
Feature key: 1 = Pregnancies, 2 = Glucose, 3 = BMI, 4 = DiabetesPedigreeFunction, 5 = SkinThickness
print("RF Model 1 + 2")
## [1] "RF Model 1 + 2"
confusionMatrix(test$Outcome2, RF_predict)$overall[1]
## Accuracy
## 0.7313433
print("RF Model 1 + 2 + 3 + 4")
## [1] "RF Model 1 + 2 + 3 + 4"
confusionMatrix(test$Outcome2, RF_predict2)$overall[1]
## Accuracy
## 0.7873134
print("RF Model 1 + 2 + 3 + 4 + 5")
## [1] "RF Model 1 + 2 + 3 + 4 + 5"
confusionMatrix(test$Outcome2, RF_predict3)$overall[1]
## Accuracy
## 0.7686567
print("LR Model 1 + 2")
## [1] "LR Model 1 + 2"
confusionMatrix(test$Outcome2, LR_predict)$overall[1]
## Accuracy
## 0.7761194
print("LR Model 1 + 2 + 3 + 4")
## [1] "LR Model 1 + 2 + 3 + 4"
confusionMatrix(test$Outcome2, LR_predict2)$overall[1]
## Accuracy
## 0.7835821
print("LR Model 1 + 2 + 3 + 4 + 5")
## [1] "LR Model 1 + 2 + 3 + 4 + 5"
confusionMatrix(test$Outcome2, LR_predict3)$overall[1]
## Accuracy
## 0.7873134
The accuracies are fairly close. The outliers in the BMI and DiabetesPedigreeFunction variables may make those fits less reliable, but overall all five variables have a significant effect on the diabetes outcome.
We can try more feature combinations and other models to see what fits best; we cannot rely on accuracy alone.
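As a sketch of looking beyond raw accuracy, the snippet below (which relies on the `test`, `RF_predict2`, and `LR_predict2` objects created above) also pulls Kappa and balanced accuracy out of the confusion matrices for a side-by-side comparison:

```r
# Compare models on more than accuracy: Kappa and balanced accuracy
# (uses the test set and predictions created earlier in this analysis)
metrics <- sapply(
  list(RF = RF_predict2, LR = LR_predict2),
  function(p) {
    cm <- confusionMatrix(test$Outcome2, p)
    c(Accuracy = unname(cm$overall["Accuracy"]),
      Kappa    = unname(cm$overall["Kappa"]),
      BalAcc   = unname(cm$byClass["Balanced Accuracy"]))
  }
)
round(metrics, 3)
```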
After some exploratory analysis with the selected attributes, I applied some common ML algorithms widely used for classification problems. The models perform well and can be used for inference and baseline predictions, but a deep learning approach could help us infer more about the data and give better, more real-world predictions.
The dataset used in this problem contains only 768 instances (i.e., rows or records). That is sufficient for a machine learning problem but very small for deep learning: after splitting the data further into training and test sets, we would be left with a test set too small to give good, significant metrics for our model. To mitigate this, we can use k-fold cross-validation to train and evaluate the model, so that even with such a small dataset we get reasonably good and, most importantly, significant metrics.
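One way to set up k-fold cross-validation with the caret package already used above (a sketch of the validation idea, not the deep learning pipeline itself) is through `trainControl`:

```r
# 3-fold cross-validation on the full data set via caret's trainControl;
# across the folds, every row is used for both training and validation
ctrl <- trainControl(method = "cv", number = 3)
LR_cv <- train(
  Outcome2 ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + SkinThickness,
  data = diabetes, method = "glm", family = "binomial",
  trControl = ctrl
)
LR_cv$results  # resampled accuracy and Kappa averaged over the 3 folds
```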
The choice of model is also kept very simple due to the data limitation. The model schema is summarized below:
INPUT LAYER : 8 UNITS, SHAPE (5,), ACTIVATION = "RELU"
HIDDEN LAYER : 8 UNITS, ACTIVATION = "RELU"
OUTPUT LAYER : 1 UNIT, ACTIVATION = "SIGMOID" (FOR A BINARY CLASSIFICATION PROBLEM)
LOSS FUNCTION : BINARY_CROSSENTROPY
OPTIMIZER : STOCHASTIC GRADIENT DESCENT (SGD)
METRICS : ACCURACY (FRACTION OF TEST LABELS PREDICTED CORRECTLY)
VALIDATION METHOD : K-FOLD CROSS-VALIDATION WITH 3 FOLDS
NOTE : I also tried 5 and 10 folds, but the final metrics showed no significant or drastic change, so I chose 3 folds.
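A minimal sketch of this schema using the `keras` R package (this assumes keras and a TensorFlow backend are installed; layer sizes follow the schema above, and the cross-validation loop is not shown):

```r
library(keras)

# Two dense ReLU layers of 8 units each over the 5 selected features,
# then a single sigmoid unit for the binary diabetes outcome
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = c(5)) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  loss = "binary_crossentropy",
  optimizer = optimizer_sgd(),
  metrics = "accuracy"
)
```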