---
title: "assignment06"
author: "Benjamin Reese"
format: html
self-contained: true
---
## Exercise 01: Calculating MSE, RMSE, and MAE
The following work calculates the MSE, RMSE, and MAE for this table:
| true_value | predicted_value |
|------------|-----------------|
| 1          | 2               |
| 2          | 2               |
| 3          | 1               |
| 4          | 8               |
| 5          | 4               |
First, mean squared error, or MSE, is calculated by the equation:
$$\hat{MSE} = \frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\hat{Y_{i}})^2$$
Starting with the first row, $(1-2)^2=1$, $(2-2)^2=0$, $(3-1)^2=4$, $(4-8)^2=16$, and $(5-4)^2=1$.
Summing these squared differences gives us $1+0+4+16+1=22$. The final step is to divide by the sample size, which for us is $n=5$. So $22/5$, or $4.4$, is the mean squared error.
$$\hat{MSE}=\frac{22}{5}=4.4$$
The root mean square error, or RMSE, is calculated by taking the square root of the MSE. Here is the formula:
$$\hat{RMSE} = \sqrt{\frac{1}{n}\sum^{n}_{i=1}(Y_{i}-\hat{Y_{i}})^2}$$
Taking the square root of our MSE gives us the root mean square error: $\sqrt{4.4}\approxeq2.097618$.
$$\hat{RMSE}=\sqrt{4.4}\approxeq2.097618$$
Finally, the mean absolute error, or MAE, is determined by the formula:
$$\hat{MAE} = \frac{1}{n}\sum^{n}_{i=1}|Y_{i}-\hat{Y_{i}}|$$
This formula is very similar to the MSE and RMSE formulas, except we take the absolute values of our errors rather than the squared values. The MAE for the data in the table above is calculated as $|1-2|=1$, $|2-2|=0$, $|3-1|=2$, $|4-8|=4$, and $|5-4|=1$. Summing these absolute values gives us $1+0+2+4+1=8$. Dividing by our sample size, $n=5$, gives our MAE: $8/5=1.6$.
$$\hat{MAE}=\frac{8}{5}=1.6$$
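As a quick sanity check, all three metrics can be computed in a few lines of R using the values from the table:
```{r}
## Verifying the hand calculations above with the table's values
true_value <- c(1, 2, 3, 4, 5)
predicted_value <- c(2, 2, 1, 8, 4)
mean((true_value - predicted_value)^2)       ## MSE: 4.4
sqrt(mean((true_value - predicted_value)^2)) ## RMSE: ~2.097618
mean(abs(true_value - predicted_value))      ## MAE: 1.6
```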
While all three of these metrics measure prediction error, they handle outliers differently. Because the mean squared error method, and by extension the root mean square error method, squares the differences, it is greatly influenced by outliers: large errors are weighted much more heavily than small ones. MAE, in contrast, can be interpreted as the average absolute distance between true and predicted values, so every unit of error counts equally. In sum, RMSE is more influenced by outliers than mean absolute error.
## Exercise 02: Creating A Binary Classification Confusion Matrix
Below find my work for exercise 02. I start by replicating the table before creating the confusion matrix and calculating precision, sensitivity, and accuracy.
```{r, out.width="90%", out.height="90%"}
knitr::include_graphics("images/ex02.pdf")
```
The key metrics are reported below:
- Accuracy: $\frac{7}{10}=.7$
- Precision: $\frac{3}{4}=.75$
- Recall/Sensitivity: $\frac{3}{5}=.6$
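These can be verified in R from the cell counts implied by the metrics above (TP = 3, FP = 1, FN = 2, TN = 4); note these counts are reconstructed from the reported values, not copied from the original table:
```{r}
## Cell counts implied by the reported metrics above
tp <- 3; fp <- 1; fn <- 2; tn <- 4
(tp + tn) / (tp + fp + fn + tn) ## accuracy: .7
tp / (tp + fp)                  ## precision: .75
tp / (tp + fn)                  ## recall/sensitivity: .6
```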
## Exercise 03: Creating A Multiclass Classification Confusion Matrix
Below, find my work for exercise 03. Like exercise 02, I start by replicating the table, creating the confusion matrix, classifying each prediction, and then calculating accuracy and the misclassification rate. Accuracy is the proportion of correct predictions, found by summing the main diagonal of the confusion matrix and dividing by the total. The misclassification rate is the proportion of incorrect predictions, which we can find by subtracting accuracy from 1, or $1-Accuracy$. The key results are reported below:
```{r, out.width="90%", out.height="90%"}
knitr::include_graphics("images/ex03.pdf")
```
- Accuracy: $\frac{10}{15}=\frac{2}{3}\approxeq.6667$
- Misclassification Rate: $\frac{5}{15}=\frac{1}{3}\approxeq.3333$
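In R, given any multiclass confusion matrix, accuracy is the diagonal sum over the grand total, and the misclassification rate is its complement. A minimal sketch, using a hypothetical 3x3 matrix rather than the assignment's table:
```{r}
## Hypothetical 3x3 confusion matrix (rows = truth, columns = prediction);
## the cell values are illustrative only
m <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 2, 3), nrow = 3, byrow = TRUE)
sum(diag(m)) / sum(m)     ## accuracy: 10/15
1 - sum(diag(m)) / sum(m) ## misclassification rate: 5/15
```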
## Exercise 04: Guessing & Predicting
In this case, the highest accuracy we can achieve would be 51%, if we were to choose 1 every time. We could only achieve 49% accuracy if we were to choose 0 every time, and thus we should predict 1 for the highest possible accuracy.
In the second case, the highest accuracy we can achieve would be 99%, if we were to choose 0 every time. We could only achieve 1% accuracy if we were to choose 1 every time, and thus we should predict 0 for the highest possible accuracy.
If we could only pick one number to guess every time, then our accuracy is limited by the proportion of actual observations that are that number.
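A small simulation illustrates this ceiling for the 51/49 case (the seed and sample size here are arbitrary):
```{r}
## Simulated labels with a 51/49 split; always guessing one class
## caps accuracy at that class's proportion
set.seed(1)
y <- sample(c(1, 0), size = 10000, replace = TRUE, prob = c(.51, .49))
mean(y == 1) ## always guess 1: about .51
mean(y == 0) ## always guess 0: about .49
```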
Context is important when comparing calculated accuracy in machine learning settings because of relative costs. Depending on situational factors, true positives, true negatives, false positives, and false negatives can have varying levels of severity. These concerns must be considered by researchers, perhaps before even estimating a model. If a medical test, such as a test for cancer, results in a false negative, then the patient could have undiagnosed cancer that severely decreases their chances of survival. A false positive, on the other hand, may just lead to a few more tests. In that scenario, false negatives are far more consequential than false positives, so we should probably be concerned with sensitivity over accuracy.
There could be other instances, such as detecting radiation, where a false positive could result in mass evacuations, and the accompanying social hardship, when there were actually no dangerous levels of radiation. In that instance, we would be more interested in precision.
We have several metrics for the quality of our predictions (accuracy, precision, sensitivity, and specificity); each is qualitatively different, and the researcher should choose among them based on expertise and substantive knowledge of the relative costs.
## Exercise 05: The Marble Bag Problem
```{r, warning=FALSE, message=FALSE}
## Packages for Exercise05
library(readr)
library(tidymodels)
library(tidyverse)
## Loading in Data
marbles <- read_csv("data/marbles.csv")
```
### 1. Dividing the Marbles Dataset into Training and Testing Data
```{r}
## Setting the seed
set.seed(20200229)
## Splitting
split <- initial_split(data = marbles, prop = .8)
## Training and Testing
marbles_train <- training(x = split)
marbles_test <- testing(x = split)
```
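A quick dimension check confirms the 80/20 split:
```{r}
## Rows in each partition
nrow(marbles_train)
nrow(marbles_test)
```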
### 2. Developing Mental Model for Predicting Marbles
The chart, tibble, and probabilities below suggest that we should guess that a marble is black if it is big and white if it is small.
```{r}
## Basic Barplot
marbles_train %>%
  ggplot(aes(x = size, fill = color)) +
  geom_bar(color = "black") +
  scale_fill_manual(values = c("black", "white")) +
  theme_minimal() +
  labs(x = "Marble Size", y = "Number of Marbles",
       title = "The Size and Color of Marbles in A Bag",
       subtitle = "Most Big Marbles are Black")

## Counting Marbles By Size and Color
marbles_train %>%
  count(color, size, sort = TRUE)
```
As shown in the barplot and tibble above, more of the big marbles are black than white, and more of the small marbles are white than black. Therefore, if we reach into the bag and feel that a marble is big, we should guess that it is black; if the marble is small, we should guess that it is white. Because black marbles make up a clear majority of the big marbles, we should be fairly accurate when predicting that a big marble is black. Further, small black marbles are so few that we should be even more accurate when guessing that a small marble is white.
Analytically, we can think of this problem in terms of conditional probabilities. Conditional probability is defined by the formula:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
We can think of this problem, and model our intuitive results, as wanting to know the conditional probability of a marble being black given it is big, and of a marble being white given it is small. If the conditional probability of a marble being white given it is small is larger than the conditional probability of a marble being black given it is small, then we should predict small marbles will be white. Similarly, if the conditional probability of a marble being black given it is big is larger than the conditional probability of a marble being white given it is big, then we should predict big marbles will be black. The code below uses the tibble above to construct probabilities and shows that the conditional probabilities are consistent with our intuitive predictions. The following code evaluates permutations of this probability formula:
$$P(W \mid S) = \frac{P(W \cap S)}{P(S)}$$
Where $P(W \mid S)$ is the probability that a marble is white given that it is small (the quantity we are trying to find), $P(W \cap S)$ is the probability that a marble is white AND small, and $P(S)$ is the probability that a marble is small. I evaluate the analogous quantity for white given big to see which probability is larger, and repeat for all size and color combinations.
```{r}
## Probability of white AND small
p_white_small <- 27/100
## Probability of small
p_small <- 37/100
## Conditional Probability of white given small
p_white_small/p_small
## Probability of white AND big
p_white_big <- 21/100
## Probability of big
p_big <- 63/100
## Conditional Probability of white given big
p_white_big/p_big
## Probability of black AND small
p_black_small <- 10/100
## Conditional probability of black given small
p_black_small/p_small
## Probability of black AND big
p_black_big <- 42/100
## Conditional probability of black given big
p_black_big/p_big
```
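The same conditional probabilities can also be computed directly from the training data rather than hard-coded; for example:
```{r}
## P(color | size), computed from the training counts
marbles_train %>%
  count(size, color) %>%
  group_by(size) %>%
  mutate(p_color_given_size = n / sum(n))
```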
### 3. Creating Simple Function to Predict Marble Color
Below find the simple function that predicts the color of a marble based on its size. It is a very simple function that predicts a marble will be white if it is small and black if it is big.
```{r}
#' Simple Marble Color Predictor
#'
#' @param x a character vector of marble sizes ("small" or "big")
#'
#' @return a one-column tibble (`yhat`) of predicted marble colors based on size
#'
#' @examples color_predict(marbles_train$size)
color_predict <- function(x) {
  x <- as_tibble(x) %>%
    mutate(yhat = case_when(
      x == "small" ~ "White",
      x == "big" ~ "Black"
    )) %>%
    select(yhat)
  return(x)
}
## Applying to Testing Data
color_predict(marbles_test$size)
```
### 4. Confusion Matrix and Accuracy Function
The code below creates a function that generates a confusion matrix and calculates accuracy. It takes two arguments: actual observed values and predicted values, such as those produced above by the color_predict() function. The accuracy is .76, and the confusion matrix, formatted as a table, can be seen below.
```{r}
## Creating the predicted values
yhat <- color_predict(marbles_test$size)
#' Generating Confusion Matrix and Accuracy
#'
#' @param y A character vector of actual observed values to be compared to predicted values
#' @param yhat A character vector of predicted values to be compared to the actually observed values
#'
#' @return This function returns two objects, first a basic confusion matrix formatted as a table object, and, second, a numeric value representing the accuracy as calculated by the sum of the table diagonal divided by the total
#'
#'
#' @examples confusion_matrix(y= marbles_test$color, yhat=yhat)
confusion_matrix <- function(y, yhat) {
  conf_tib <- tibble(y, yhat)                         ## putting variables together
  conf_matrix <- table(conf_tib)                      ## creating table
  Accuracy <- sum(diag(conf_matrix)) / nrow(conf_tib) ## summing diagonal, dividing by total
  return(list(conf_matrix, Accuracy))                 ## output
}
confusion_matrix(y= marbles_test$color, yhat=yhat)
```
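As a cross-check, the same accuracy can be computed with yardstick (loaded via tidymodels). This sketch assumes the observed and predicted labels differ at most in capitalization, so tolower() harmonizes them:
```{r}
## accuracy_vec() expects factors with identical levels
truth <- factor(tolower(marbles_test$color))
estimate <- factor(tolower(yhat$yhat), levels = levels(truth))
accuracy_vec(truth = truth, estimate = estimate)
```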
### 5. Estimating a Classification Tree/CART Model
The code below runs a decision tree/CART model and includes the decision tree output. The results are analyzed in *6*.
```{r, warning=FALSE, message=FALSE}
## Creating the Recipe
cart_rec <- recipe(formula = color ~ ., data = marbles_train)

## Creating the CART Model
cart_mod <- decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")

## Workflow
cart_wf <- workflow() %>%
  add_recipe(cart_rec) %>%
  add_model(cart_mod)

## Fitting the Model
cart_fit <- cart_wf %>%
  fit(data = marbles_train)

## Creating a Decision Tree Visualization
rpart.plot::rpart.plot(x = cart_fit$fit$fit$fit)

## Selecting the Predictions
predictions <- bind_cols(
  marbles_test,
  predict(object = cart_fit, new_data = marbles_test),
  predict(object = cart_fit, new_data = marbles_test, type = "prob")
) %>%
  mutate(color = as.factor(color))

## The Predictions
select(predictions, color, starts_with(".pred"))

## Confusion matrix
conf_mat(data = predictions, truth = color, estimate = .pred_class)

## Accuracy
accuracy(data = predictions, truth = color, estimate = .pred_class)
```
### 6. Comparing the CART Model to the Mental Model
The results of the CART model are consistent with my intuitive/mental model from part 2. The intuition that big marbles will most likely be black and small marbles will most likely be white is supported by the CART model results. The predictions in both part 2 and part 5 are the same, because size is the only predictor of marble color that we have. We identified through a plot and count that size can help predict color, and we confirmed this through our machine learning model. In this simple case, we can use plots and simple counts to create predictions. For more complex applications, the CART model should be a far better method of generating predictions than simply looking at plots and counts.
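The claim that the two sets of predictions match can be checked directly. This sketch assumes the data's color labels differ from color_predict()'s output only in capitalization:
```{r}
## Proportion of test cases where the CART prediction agrees with
## the simple size-based rule; should be 1
mean(tolower(predictions$.pred_class) == tolower(yhat$yhat))
```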