# Random Forests
```{r setup}
#| include: false
source("_common.R")
```
Even though a *bagged* ensemble of trees usually results in a more
stable predictor / classifier, the ensemble can be improved further by training
each of its members more carefully. The main idea is to reduce the
potential (conditional) correlation among the predictions of the bagged trees,
as discussed in class. To do this, each of the bootstrap trees in the ensemble is
grown using only a randomly selected set of features when partitioning each node.
More specifically, at each node only a random subset of the explanatory variables
is considered to determine the optimal split, and this subset is drawn
independently at each node as the tree is being constructed.
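To fix ideas, here is a conceptual sketch (not the actual implementation of any package) of the extra randomization step performed at each node; `x` and `mtry` below are placeholders, so the chunk is not evaluated:
```{r rf.idea, eval=FALSE}
# conceptual sketch only: at each node, a random subset of the p features
# is drawn, and the best split is searched only among those candidates
p <- ncol(x) # x: matrix of explanatory variables (placeholder)
mtry <- floor(sqrt(p)) # a common default size for the random subset in classification
candidate.features <- sample(1:p, mtry)
# the optimal split for this node is then chosen among `candidate.features` only
```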
To train a Random Forest in `R` we use the
function `randomForest` from the package of the same name.
The syntax is similar to that of `rpart`, but the tuning parameters
for each of the *trees* in the *forest* are different from those of `rpart`.
Refer to the help page if you need to modify them.
We load and prepare the admissions data as before:
```{r init}
mm <- read.table("data/T11-6.DAT", header = FALSE)
mm$V3 <- as.factor(mm$V3) # V3 (the admission group) is the categorical response
mm[, 2] <- mm[, 2] / 150 # rescale the GMAT scores (V2) so they are comparable with the GPA
```
and then train a Random Forest with 500 trees, using all the default tuning parameters:
```{r rf1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(randomForest)
a.rf <- randomForest(V3 ~ V1 + V2, data = mm, ntree = 500)
```
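The main tuning parameters of the forest can also be set explicitly in this call, in particular `mtry` (the number of randomly chosen features considered at each split) and `nodesize` (the minimum size of terminal nodes). A minimal sketch with illustrative values (these are not tuned, and the resulting object is not used below):
```{r rf1b, message=FALSE, warning=FALSE}
# illustrative only: for classification, mtry defaults to floor(sqrt(p))
# and nodesize defaults to 1
a.rf.alt <- randomForest(V3 ~ V1 + V2, data = mm, ntree = 500, mtry = 1, nodesize = 5)
```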
Predictions can be obtained with the `predict` method, as usual, by
specifying the `newdata` argument. Refer to the help page
of `predict.randomForest` for details on the different
behaviour of `predict` for Random Forest objects depending on whether
the argument `newdata` is present or missing.
To visualize the predicted classes obtained with a Random Forest
on our example data, we compute the corresponding
predicted conditional class probabilities on the
same grid used before:
```{r inst2.0.2}
aa <- seq(2, 5, length = 200) # grid for GPA (V1)
bb <- seq(2, 5, length = 200) # grid for the rescaled GMAT (V2)
dd <- expand.grid(aa, bb)
names(dd) <- names(mm)[1:2] # use the same variable names as in the training data
```
The estimated conditional probabilities for class *red*
are shown in the plot below.
(How are these estimated conditional probabilities computed, exactly? See the sketch after the plot.)
```{r rf1.1, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
pp.rf <- predict(a.rf, newdata = dd, type = "prob")
filled.contour(aa, bb, matrix(pp.rf[, 1], 200, 200),
col = terrain.colors(20),
xlab = "GPA", ylab = "GMAT",
plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3],
pch = 19, cex = 1.5,
col = c("red", "blue", "green")[mm[, 3]]
)
}
)
```
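One way to answer the question above: with `type = "prob"`, the values returned by `predict.randomForest` are the proportions of trees in the forest that vote for each class. A minimal sketch that checks this on a few grid points, using the `predict.all` argument (the two sets of numbers should agree, at least up to how ties and vote normalization are handled):
```{r rf1.2, message=FALSE, warning=FALSE}
# predictions of each individual tree on the first few grid points
pp.all <- predict(a.rf, newdata = dd[1:5, ], predict.all = TRUE)
# proportion of trees voting for the first class ("red") at each point
rowMeans(pp.all$individual == levels(mm$V3)[1])
# compare with the reported conditional probabilities
pp.rf[1:5, 1]
```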
The predicted conditional probabilities for the other two classes are:
```{r rf2, fig.width=6, fig.height=6, message=FALSE, warning=FALSE, echo=TRUE}
filled.contour(aa, bb, matrix(pp.rf[, 2], 200, 200),
col = terrain.colors(20),
xlab = "GPA", ylab = "GMAT", plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3],
pch = 19, cex = 1.5,
col = c("red", "blue", "green")[mm[, 3]]
)
}
)
filled.contour(aa, bb, matrix(pp.rf[, 3], 200, 200),
col = terrain.colors(20), xlab = "GPA",
ylab = "GMAT", plot.axes = {
axis(1)
axis(2)
},
panel.last = {
points(mm[, -3], pch = 19, cex = 1.5, col = c("red", "blue", "green")[mm[, 3]])
}
)
```
A very interesting exercise is to train a Random Forest on the perturbed data
(in `mm2`) and verify that the predicted conditional probabilities do not change much,
as was the case for the bagged classifier. A sketch of this exercise is given below.
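A minimal sketch of that exercise, assuming the perturbed data set `mm2` was constructed as in the bagging section (it is not re-created here, so the chunk is not evaluated):
```{r rf.mm2, eval=FALSE}
# not run: mm2 is assumed to exist from the bagging section
a.rf2 <- randomForest(V3 ~ V1 + V2, data = mm2, ntree = 500)
pp.rf2 <- predict(a.rf2, newdata = dd, type = "prob")
# compare, for example, the class-"red" probabilities over the grid
summary(abs(pp.rf2[, 1] - pp.rf[, 1]))
```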
## Another example
We will now use a more interesting example. The ISOLET data, available
here:
[http://archive.ics.uci.edu/ml/datasets/ISOLET](http://archive.ics.uci.edu/ml/datasets/ISOLET),
contains data
on sound recordings of 150 speakers saying each letter of the
alphabet (twice). See the original source for more details. Since
the full data set is rather large, here we only use the subset
corresponding to the observations for the letters **C** and **Z**.
We first load the training and test data sets, and force the response
variable to be categorical, so that the `R` implementations of the
different predictors we will use below will build
classifiers and not their regression counterparts:
```{r rf.isolet, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
xtr <- read.table("data/isolet-train-c-z.data", sep = ",")
xte <- read.table("data/isolet-test-c-z.data", sep = ",")
xtr$V618 <- as.factor(xtr$V618)
xte$V618 <- as.factor(xte$V618)
```
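A quick sanity check of the sample sizes and class distribution in each set (the two letters are coded by their position in the alphabet, so **C** is 3 and **Z** is 26):
```{r rf.isolet01}
dim(xtr)
table(xtr$V618)
table(xte$V618)
```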
To train a Random Forest we use the function `randomForest` in the
package of the same name. The code underlying this package was originally
written by Leo Breiman. We first train a Random Forest, using all the default parameters
```{r rf.isolet02, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(randomForest)
set.seed(123)
(a.rf <- randomForest(V618 ~ ., data = xtr, ntree = 500))
```
We now check its performance on the test set:
```{r rf.isolet.te}
p.rf <- predict(a.rf, newdata = xte, type = "response")
table(p.rf, xte$V618)
```
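The test misclassification rate implied by this table can also be computed directly:
```{r rf.isolet.te2}
mean(p.rf != xte$V618)
```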
Note that the Random Forest makes only one mistake on the 120 observations
in the test set (approximately 0.8%). However, the OOB error rate estimate is slightly over 2%.
The next plot shows the evolution of the OOB error rate estimate as a function of the
number of classifiers in the ensemble (trees in the forest). Note that 500 trees
appears to be a reasonable forest size, in the sense that the OOB error rate estimate has stabilized.
```{r rf0.oob, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
plot(a.rf, lwd = 3, lty = 1)
```
## Using a test set instead of OOB
Given that in this case we do have a test set, we can use it
to monitor the error rate (instead of using the OOB error estimates):
```{r rf.isolet.test, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
x.train <- model.matrix(V618 ~ ., data = xtr)
y.train <- xtr$V618
x.test <- model.matrix(V618 ~ ., data = xte)
y.test <- xte$V618
set.seed(123)
a.rf <- randomForest(x = x.train, y = y.train, xtest = x.test, ytest = y.test, ntree = 500)
# a.rf$test$err.rate has one row per forest size (1 to ntree); the first
# column is the overall test error rate, the rest are class-specific rates
test.err <- a.rf$test$err.rate
ma <- max(c(test.err))
# class-specific error rates in red and green, overall test error in black
plot(test.err[, 2], lwd = 2, lty = 1, col = "red", type = "l", ylim = c(0, max(c(0, ma))))
lines(test.err[, 3], lwd = 2, lty = 1, col = "green")
lines(test.err[, 1], lwd = 2, lty = 1, col = "black")
```
According to the help page for the `plot` method for objects of class
`randomForest`, the following plot should show both sets of error rates
(the OOB estimates and those computed on the test set):
```{r rf.isolet.test.plot, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
plot(a.rf, lwd = 2)
```
## Feature sequencing / Variable ranking
To explore which variables were used in the forest,
and their importance ranking as discussed in
class, we can use the function `varImpPlot`:
```{r rf.isolet3, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
varImpPlot(a.rf, n.var = 20)
```
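The numeric importance scores behind this plot can also be extracted with the `importance` accessor; a minimal sketch listing the ten highest-ranked variables:
```{r rf.isolet3b}
imp <- importance(a.rf)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
```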
## Comparing RF with other classifiers
We now compare the Random Forest with some of the other classifiers we saw in class,
using their classification error rate on the test set as our comparison measure.
We start with K-NN:
```{r rf.isolet4, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(class)
u1 <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = 1)
table(u1, xte$V618)
u5 <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = 5)
table(u5, xte$V618)
u10 <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = 10)
table(u10, xte$V618)
u20 <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = 20)
table(u20, xte$V618)
u50 <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = 50)
table(u50, xte$V618)
```
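For a more compact comparison across values of `k`, the test misclassification rates can be collected in a short loop; a minimal sketch:
```{r rf.isolet4b, message=FALSE, warning=FALSE}
ks <- c(1, 5, 10, 20, 50)
err.knn <- sapply(ks, function(k) {
  pr <- knn(train = xtr[, -618], test = xte[, -618], cl = xtr[, 618], k = k)
  mean(pr != xte$V618)
})
names(err.knn) <- ks
err.knn
```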
To use logistic regression we first create a new variable that is 1
for the letter **C** and 0 for the letter **Z**, and use it as
our response variable.
```{r rf.isoletglm, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
xtr$V619 <- as.numeric(xtr$V618 == 3)
d.glm <- glm(V619 ~ . - V618, data = xtr, family = binomial)
pr.glm <- as.numeric(predict(d.glm, newdata = xte, type = "response") > 0.5)
table(pr.glm, xte$V618)
```
Question for the reader: why do you think this classifier's performance
is so disappointing?
It is interesting to see how a simple LDA classifier does:
```{r rf.isolet5, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(MASS)
xtr$V619 <- NULL
d.lda <- lda(V618 ~ ., data = xtr)
pr.lda <- predict(d.lda, newdata = xte)$class
table(pr.lda, xte$V618)
```
Note also that a carefully built classification tree
performs remarkably well, using only 3 features:
```{r rf.isolet6, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
library(rpart)
my.c <- rpart.control(minsplit = 5, cp = 1e-8, xval = 10)
set.seed(987)
a.tree <- rpart(V618 ~ ., data = xtr, method = "class",
  parms = list(split = "information"), control = my.c
)
cp <- a.tree$cptable[which.min(a.tree$cptable[, "xerror"]), "CP"]
a.tp <- prune(a.tree, cp = cp)
p.t <- predict(a.tp, newdata = xte, type = "vector")
table(p.t, xte$V618)
```
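To see which features the pruned tree actually uses, one can inspect the `frame` component of the `rpart` object; a minimal sketch:
```{r rf.isolet6b}
# variables used for splits in the pruned tree ("<leaf>" marks terminal nodes)
setdiff(unique(as.character(a.tp$frame$var)), "<leaf>")
```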
Finally, note that if you train a single classification tree with the
default values for the stopping criterion tuning parameters, the
tree also uses only 3 features, but its classification error rate
on the test set is larger than that of the pruned one:
```{r rf.isolet7, fig.width=6, fig.height=6, message=FALSE, warning=FALSE}
set.seed(987)
a2.tree <- rpart(V618 ~ ., data = xtr, method = "class", parms = list(split = "information"))
p2.t <- predict(a2.tree, newdata = xte, type = "vector")
table(p2.t, xte$V618)
```
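As a convenient wrap-up (not part of the original comparison), the test misclassification rates of the classifiers fitted above can be collected in a single vector. Note that `predict.rpart` with `type = "vector"` returns the index of the predicted class level, so the factor response is converted to level indices when comparing with the two trees:
```{r rf.isolet8}
c(
  rf = mean(p.rf != xte$V618),
  knn.1 = mean(u1 != xte$V618),
  logistic = mean(pr.glm != as.numeric(xte$V618 == "3")),
  lda = mean(pr.lda != xte$V618),
  tree.pruned = mean(p.t != as.numeric(xte$V618)),
  tree.default = mean(p2.t != as.numeric(xte$V618))
)
```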