# Bagging for classification
```{r setup}
#| include: false
source("_common.R")
```
## Instability of trees (motivation)
Just like in the regression case, classification trees can be highly unstable:
relatively small changes in the training set may result in considerably different trees.
We illustrate the problem with the simple graduate school admissions example
(3 classes, 2-dimensional covariates) used in class. First we read the data:
```{r inst1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
mm <- read.table("data/T11-6.DAT", header = FALSE)
```
We transform the response variable `V3` into a factor, which is how class labels
are represented in `R` and what `rpart()` expects in the response variable in order
to build a classifier:
```{r inst11}
mm$V3 <- as.factor(mm$V3)
```
To obtain better-looking plots later, we now re-scale one of the features
(so that both explanatory variables have similar ranges):
```{r inst12}
mm[, 2] <- mm[, 2] / 150
```
We use the function `rpart` to train a classification tree on these data, using
deviance-based (`information`) splits:
```{r inst13}
library(rpart)
a.t <- rpart(V3 ~ V1 + V2, data = mm, method = "class", parms = list(split = "information"))
```
To illustrate the instability of this tree (i.e. how the tree changes when the data are perturbed
slightly), we create a new training set (`mm2`) that is identical to
the original one (in `mm`), except that two observations have their
responses changed from class `1` to class `2`:
```{r inst20}
mm2 <- mm
mm2[1, 3] <- 2
mm2[7, 3] <- 2
```
The following plot shows the new training set, with the two changed
observations (located around the point `(GPA, GMAT) = (3, 4)`)
highlighted with a blue dot (their new class) and a red ring around them
(their old class was "red"):
```{r inst2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, fig.show='hold'}
plot(mm2[, 1:2],
pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm2[, 3]],
xlab = "GPA", "GMAT", xlim = c(2, 5), ylim = c(2, 5)
)
points(mm[c(1, 7), -3], pch = "O", cex = 1.1, col = c("red", "blue", "green")[mm[c(1, 7), 3]])
```
As we did above, we now train a classification tree on the perturbed data in `mm2`:
```{r inst2.5, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
a2.t <- rpart(V3 ~ V1 + V2, data = mm2, method = "class", parms = list(split = "information"))
```
To visualize the differences between the two trees, we build a fine grid of points
and compare the predicted probabilities of each class at each point of the grid.
First, we construct the grid:
```{r inst2.0}
aa <- seq(2, 5, length = 200)
bb <- seq(2, 5, length = 200)
dd <- expand.grid(aa, bb)
names(dd) <- names(mm)[1:2]
```
Now we compute the estimated conditional probabilities of each of the 3 classes
at each of the 40,000 points of the grid `dd`:
```{r inst2.1}
p.t <- predict(a.t, newdata = dd, type = "prob")
p2.t <- predict(a2.t, newdata = dd, type = "prob")
```
The next figures show the estimated probabilities of class "red" with each of the two
trees:
```{r inst2.2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
filled.contour(aa, bb, matrix(p.t[, 1], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
}
)
filled.contour(aa, bb, matrix(p2.t[, 1], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm2[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm2[, 3]])
points(mm[c(1, 7), -3], pch = "O", cex = 1.1, col = c("red", "blue", "green")[mm[c(1, 7), 3]])
}
)
```
Similarly, the estimated conditional probabilities for class "blue" at each point of the grid are:
```{r kk}
# blues
filled.contour(aa, bb, matrix(p.t[, 2], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
}
)
filled.contour(aa, bb, matrix(p2.t[, 2], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
  panel.last = {
points(mm2[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm2[, 3]])
points(mm[c(1, 7), -3], pch = "O", cex = 1.1, col = c("red", "blue", "green")[mm[c(1, 7), 3]])
}
)
```
And finally, for class "green":
```{r kk2}
# greens
filled.contour(aa, bb, matrix(p.t[, 3], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
}, panel.last = {
points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
}
)
filled.contour(aa, bb, matrix(p2.t[, 3], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
  panel.last = {
points(mm2[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm2[, 3]])
points(mm[c(1, 7), -3], pch = "O", cex = 1.1, col = c("red", "blue", "green")[mm[c(1, 7), 3]])
}
)
```
Note, for example, that the regions of the feature space (the
explanatory variables) that would be classified as "red" or "green" by the
trees trained on the original and the slightly changed training sets
differ quite noticeably, even though the difference between the two training
sets is relatively small. Below we show how an ensemble of classifiers
constructed via bagging can provide a more stable classifier.
## Bagging trees
Just as we did for regression, bagging consists of building an ensemble of
predictors (in this case, classifiers) using bootstrap samples.
If we use *B* bootstrap samples, we construct *B* classifiers, and
given a point **x**, we now have *B* estimated conditional probabilities
for each of the possible *K* classes.
Unlike what happens with regression problems, we now have a choice to make
when deciding how to combine the *B* outputs for each point: we can
either take a majority vote over the *B* separate decisions, or
average the *B* estimated probabilities for each of the *K* classes to obtain
bagged estimated conditional probabilities. As discussed and illustrated
in class, the latter approach is usually preferred.
To illustrate the increased stability of bagged classification trees,
we repeat the experiment above: we build an ensemble of 1000 classification
trees trained on the original data, and a second ensemble (also of 1000
trees) using the slightly modified data.
Each ensemble is constructed in exactly the same way as in
the regression case.
For the first ensemble we train `NB = 1000` trees
and store them in a list (called `ts`) for future use:
```{r bag0}
my.c <- rpart.control(minsplit = 3, cp = 1e-6, xval = 10)
NB <- 1000
ts <- vector("list", NB)
set.seed(123)
n <- nrow(mm)
for (j in 1:NB) {
ii <- sample(1:n, replace = TRUE)
ts[[j]] <- rpart(V3 ~ V1 + V2, data = mm[ii, ], method = "class", parms = list(split = "information"), control = my.c)
}
```
### Using the ensemble
As discussed in class, there are two possible ways to use this ensemble given
a new observation: we can classify it to the class with most votes among the
*B* bagged classifiers, or we can compute the average conditional probabilities
over the *B* classifiers, and use this average as our esimated conditional
probability. We illustrate both of these with the point `(GPA, GMAT) = (3.3, 3.0)`.
#### Majority vote
The simplest, though less elegant, way to compute the votes for each class
across the *B* trees in the ensemble is to loop over them and
count:
```{r predbag0}
x0 <- t(c(V1 = 3.3, V2 = 3.0))
votes <- vector("numeric", 3)
names(votes) <- 1:3
for (j in 1:NB) {
k <- predict(ts[[j]], newdata = data.frame(x0), type = "class")
votes[k] <- votes[k] + 1
}
(votes)
```
And we see that the class with the most votes is `r as.numeric(which.max(votes))`.
The above calculation can be done more elegantly with the function `sapply`
(or `lapply`):
```{r predbag1}
votes2 <- sapply(ts, FUN = function(a, newx) predict(a, newdata = newx, type = "class"), newx = data.frame(x0))
table(votes2)
```
#### Average probabilities (over the ensemble)
If we wanted to compute the average of the conditional probabilities across
the *B* different estimates, we could do it in a very similar way. Here I show how
to do it using `sapply`. You are strongly encouraged to verify these calculations
yourself; a possible for-loop version is sketched after the `sapply` code below.
```{r predbag2}
votes2 <- sapply(ts, FUN = function(a, newx) predict(a, newdata = newx, type = "prob"), newx = data.frame(x0))
(rowMeans(votes2))
```
And again, we see that class `1` has a much higher estimated probability
at this point.
### Increased stability of ensembles
To illustrate that ensembles of tree-based classifiers tend to be more stable
than a single tree, we construct a second ensemble, this time using
the slightly modified data. This ensemble is stored in the list `ts2`:
```{r bag02}
mm2 <- mm
mm2[1, 3] <- 2
mm2[7, 3] <- 2
NB <- 1000
ts2 <- vector("list", NB)
set.seed(123)
n <- nrow(mm)
for (j in 1:NB) {
ii <- sample(1:n, replace = TRUE)
ts2[[j]] <- rpart(V3 ~ V1 + V2, data = mm2[ii, ], method = "class", parms = list(split = "information"), control = my.c)
}
```
We use the same fine grid as before to show the
estimated conditional probabilities, this time
obtained with the two ensembles.
```{r grid0}
aa <- seq(2, 5, length = 200)
bb <- seq(2, 5, length = 200)
dd <- expand.grid(aa, bb)
names(dd) <- names(mm)[1:2]
```
To combine (average) the `NB = 1000` estimated probabilities
of each of the 3 classes for each of the 40,000 points
in the grid `dd`, I use the function `vapply`
and store the result in a 3-dimensional array. The
averaged probabilities over the 1000 bagged trees
can then be obtained by averaging across the 3rd dimension.
This approach may not be very intuitive at
first sight (a simple for-loop alternative is sketched after the two
chunks below). *You are strongly encouraged to ignore my code
below and compute the bagged conditional probabilities for the
3 classes at each point in the grid in a way that is
clear to you*. The main goal is to understand the method
and be able to do it on your own. Efficient and/or elegant code
can be written later, but it is not the focus of this course.
The ensemble of trees trained with the original data:
```{r bag1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
pp0 <- vapply(ts, FUN = predict, FUN.VALUE = matrix(0, 200 * 200, 3), newdata = dd, type = "prob")
pp <- apply(pp0, c(1, 2), mean)
```
And the ensemble of trees trained with the slightly modified data:
```{r bag010, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
pp02 <- vapply(ts2, FUN = predict, FUN.VALUE = matrix(0, 200 * 200, 3), newdata = dd, type = "prob")
pp2 <- apply(pp02, c(1, 2), mean)
```
The plots below show the estimated conditional probabilities for class "red"
in each point of the grid, with each of the two ensembles. Note
how similar they are (and contrast this with the results obtained before
without bagging):
```{r bag1.plot}
filled.contour(aa, bb, matrix(pp[, 1], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
}
)
filled.contour(aa, bb, matrix(pp2[, 1], 200, 200),
col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm2[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm2[, 3]])
points(mm[c(1, 7), -3], pch = "O", cex = 1.2, col = c("red", "blue", "green")[mm[c(1, 7), 3]])
}
)
```
You are strongly encouraged to obtain the corresponding plots comparing
the estimated conditional probabilities from both ensembles
for each of the other 2 classes ("blue" and "green"). A starting sketch for
class "blue" with the first ensemble is given below.
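As a hint (a minimal sketch, not run here): class "blue" corresponds to the second column of the
averaged probabilities, so the plot for the first ensemble could be obtained with
```{r blue-first-ensemble}
#| eval: false
# Estimated probabilities of class "blue" (2nd column) with the first ensemble
filled.contour(aa, bb, matrix(pp[, 2], 200, 200),
  col = terrain.colors(20), xlab = "GPA", ylab = "GMAT",
  plot.axes = {
    axis(1)
    axis(2)
  },
  panel.last = {
    points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
  }
)
```
The plots for class "green" (3rd column) and for the second ensemble (`pp2`, with the training
data `mm2`) follow the same pattern.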