-
Notifications
You must be signed in to change notification settings - Fork 0
/
hw3.qmd
302 lines (218 loc) · 9.73 KB
/
hw3.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
---
title: "Homework #3: Penalized Regression"
author: "**Rohan Shroff**"
format: ds6030hw-html
---
# Required R packages and Directories {.unnumbered .unlisted}
```{r packages, message=FALSE, warning=FALSE}
data_dir = 'https://mdporter.github.io/teaching/data/' # data directory
library(mlbench)
library(glmnet)
library(tidymodels)# for optional tidymodels solutions
library(tidyverse) # functions for data manipulation
```
# Problem 1: Optimal Tuning Parameters
In cross-validation, we discussed choosing the tuning parameter values that minimized the cross-validation error. Another approach, called the "one-standard error" rule \[ISL pg 214, ESL pg 61\], uses the values corresponding to the least complex model whose cv error is within one standard error of the best model. The goal of this assignment is to compare these two rules.
Use simulated data from `mlbench.friedman1(n, sd=2)` in the `mlbench` R package to fit *lasso models*. The tuning parameter $\lambda$ (corresponding to the penalty on the coefficient magnitude) is the one we will focus one. Generate training data, use k-fold cross-validation to get $\lambda_{\rm min}$ and $\lambda_{\rm 1SE}$, generate test data, make predictions for the test data, and compare performance of the two rules under a squared error loss using a hypothesis test.
Choose reasonable values for:
- Number of cv folds ($K$)
- Note: you are free to use repeated CV, repeated hold-outs, or bootstrapping instead of plain cross-validation; just be sure to describe what do did so it will be easier to follow.
- Number of training and test observations
- Number of simulations
- If everyone uses different values, we will be able to see how the results change over the different settings.
- Don't forget to make your results reproducible (e.g., set seed)
This pseudo code (using k-fold cv) will get you started:
``` yaml
library(mlbench)
library(glmnet)
#-- Settings
n_train = # number of training obs
n_test = # number of test obs
K = # number of CV folds
alpha = # glmnet tuning alpha (1 = lasso, 0 = ridge)
M = # number of simulations
#-- Data Generating Function
getData <- function(n) mlbench.friedman1(n, sd=2) # data generating function
#-- Simulations
# Set Seed Here
for(m in 1:M) {
# 1. Generate Training Data
# 2. Build Training Models using cross-validation, e.g., cv.glmnet()
# 3. get lambda that minimizes cv error and 1 SE rule
# 4. Generate Test Data
# 5. Predict y values for test data (for each model: min, 1SE)
# 6. Evaluate predictions
}
#-- Compare
# compare performance of the approaches / Statistical Test
```
## a. Code for the simulation and performance results
::: {.callout-note title="Solution"}
```{r}
library(mlbench)
library(glmnet)
n_train <- 1000
n_test <- 5000
K <- 5
alpha <- 1
M <- 100
folds = rep(1:K, length=n_train)
getData <- function(n) mlbench.friedman1(n, sd=2)
set.seed(600)
lambda_min_mse <- vector()
lambda_1se_mse <- vector()
for(m in 1:M) {
# 1. Generate Training Data
data_train <- getData(n_train)
x_data <- data_train$x
y_data <- data_train$y
x_data <- as.matrix(data_train$x)
y_data <- data_train$y
# 2. Build Training Models using cross-validation, e.g., cv.glmnet()
train_model <- cv.glmnet(x_data, y_data, alpha=alpha, foldid= folds)
# 3. get lambda that minimizes cv error and 1 SE rule
lambda_min <- train_model$lambda.min
lambda_1se <- train_model$lambda.1se
# 4. Generate Test Data
data_test <- getData(n_test)
x_test_data <- as.matrix(data_test$x)
y_test_data <- data_test$y
# 5. Predict y values for test data (for each model: min, 1SE)
yhat_lambda_min = predict(train_model, x_test_data, s = "lambda.min")
yhat_lambda_1se = predict(train_model, x_test_data, s = "lambda.1se")
mse_lambda_min <- mean((y_test_data - yhat_lambda_min)^2)
mse_lambda_1se <- mean((y_test_data - yhat_lambda_1se)^2)
# 6. Evaluate predictions
lambda_min_mse[m] <- mse_lambda_min
lambda_1se_mse[m] <- mse_lambda_1se
}
lambda_min_mse <- unlist(lambda_min_mse)
lambda_1se_mse <- unlist(lambda_1se_mse)
mean_lambda_min_MSE <- mean(lambda_min_mse)
mean_lambda_1se_MSE <- mean(lambda_1se_mse)
```
```{r}
mean_lambda_min_MSE
mean_lambda_1se_MSE
```
:::
## b. Hypothesis test
Provide results and discussion of a hypothesis test comparing $\lambda_{\rm min}$ and $\lambda_{\rm 1SE}$.
::: {.callout-note title="Solution"}
The MSE for the minimum lambda is slightly lower than the 1se lambda which makes sense. We can now perform a hypothesis test comparing lambda min and lambda 1se. We can therefore reject the null hypothesis.
```{r}
t_test_result <- t.test(lambda_min_mse, lambda_1se_mse)
t_test_result
```
The confidence range values are both negative, indicating there is a statistically significant difference between the two MSE values.
:::
# Problem 2 Prediction Contest: Real Estate Pricing
This problem uses the [realestate-train](%60r%20file.path(data_dir,%20'realestate-train.csv')%60) and [realestate-test](%60r%20file.path(data_dir,%20'realestate-test.csv')%60) (click on links for data).
The goal of this contest is to predict sale price (in thousands) (`price` column) using an *elastic net* model. Evaluation of the test data will be based on the root mean squared error ${\rm RMSE}= \sqrt{\frac{1}{m}\sum_i (y_i - \hat{y}_i)^2}$ for the $m$ test set observations.
## a. Load and pre-process data
Load the data and create necessary data structures for running *elastic net*.
- You are free to use any data transformation or feature engineering
- Note: there are some categorical predictors so at the least you will have to convert those to something numeric (e.g., one-hot or dummy coding).
::: {.callout-note title="Solution"}
```{r}
rs_train <- read_csv('realestate-train.csv')
rs_train['SqFeet'] <- rs_train$SqFeet/sd(rs_train$SqFeet)
rs_train['LotSize'] <- rs_train$LotSize/sd(rs_train$LotSize)
rs_train['CentralAir'] <- ifelse(rs_train$CentralAir=="Y",1,0)
rs_train['HouseStyle'] <- as.numeric(factor(rs_train$HouseStyle))
rs_train <- rs_train %>%
mutate(HouseStyle = recode(HouseStyle,
'2Story' = 1,
'1Story' = 2,
'1.5Unf' = 3,
'SLvl' = 4,
'2.5Unf' = 5,
'1.5Fin' = 6,
'2.5Fin' = 7,
'SFoyer' = 8
))
rs_train <- rs_train[,c('price', 'GarageCars', 'Fireplaces', 'TotRmsAbvGrd', 'Baths','SqFeet', 'CentralAir', 'Age', 'LotSize', 'HouseStyle', 'condition')]
rs_train
```
```{r}
rs_test <- read_csv('realestate-test.csv')
rs_test['SqFeet'] <- rs_test$SqFeet/sd(rs_test$SqFeet)
rs_test['LotSize'] <- rs_test$LotSize/sd(rs_test$LotSize)
rs_test['CentralAir'] <- ifelse(rs_test$CentralAir=="Y",1,0)
rs_test['HouseStyle'] <- as.numeric(factor(rs_test$HouseStyle))
rs_test <- rs_test %>%
mutate(HouseStyle = recode(HouseStyle,
'2Story' = 1,
'1Story' = 2,
'1.5Unf' = 3,
'SLvl' = 4,
'2.5Unf' = 5,
'1.5Fin' = 6,
'2.5Fin' = 7,
'SFoyer' = 8
))
rs_test <- rs_test[,c('GarageCars', 'Fireplaces', 'TotRmsAbvGrd', 'Baths','SqFeet', 'CentralAir', 'Age', 'LotSize', 'HouseStyle', 'condition')]
rs_test
```
:::
## b. Fit elastic net model
Use an *elastic net* model to predict the `price` of the test data.
- You are free to use any data transformation or feature engineering
- You are free to use any tuning parameters
- Report the $\alpha$ and $\lambda$ parameters you used to make your final predictions.
- Describe how you choose those tuning parameters
::: {.callout-note title="Solution"}
```{r}
# finding minimum alpha and lambda based on training data
alpha <- seq(0,1,0.05)
k <- 10
folds = rep(1:K, length=nrow(rs_train))
min_MSE_list <- vector()
min_lambda_list <- vector()
rs_yhat_test_list <- list()
for (a in 1:length(alpha)){
x_train <- as.matrix(rs_train[,-1])
y_train <- rs_train$price
rs_train_model <- cv.glmnet(x_train, y_train, alpha=alpha[a], foldid = folds)
lambda_min <- rs_train_model$lambda.min
x_test <- as.matrix(rs_test)
rs_yhat_test <- predict(rs_train_model, x_test, s=rs_train_model$lambda.min)
min_lambda_index <- which(rs_train_model$lambda == lambda_min)
min_mse <- rs_train_model$cvm[min_lambda_index]
min_MSE_list[a] <- min_mse
min_lambda_list[a] <- lambda_min
rs_yhat_test_list[[a]] <- rs_yhat_test
}
min_MSE_list
min_lambda_list
```
```{r}
min_MSE_list <- unlist(min_MSE_list)
a <- which.min(min_MSE_list)
a
min_lambda_list <- unlist(min_lambda_list)
b <- min_lambda_list[a]
b
yhat <- unlist(rs_yhat_test_list[2])
yhat <- as.data.frame(yhat)
yhat
```
The minimum MSE corresponds to an alpha value of 0.1 and a lambda value of 0.477.
:::
## c. Submit predictions
Submit a .csv file (ensure comma separated format) named `lastname_firstname.csv` that includes your predictions in a column named *yhat*. We will use automated evaluation, so the format must be exact.
- You will receive credit for a proper submission; the top five scores will receive 2 bonus points.
::: {.callout-note title="Solution"}
```{r}
write.csv(yhat, "C:/Users/shrof/R/UVA/DS6030-rs7aq/Shroff_Rohan.csv", row.names=FALSE)
```
:::
## d. Report anticpated performance
Report the anticipated performance of your method in terms of RMSE. We will see how close your performance assessment matches the actual value.
::: {.callout-note title="Solution"}
The anticipated performance in terms of RMSE is below. To calculate this, I took the minimum mean MSE, which occurred at alpha = 0.1, and took the square root of this.
```{r}
antic_perf <- (min_MSE_list[2] ** 0.5)
antic_perf
```
:::