---
title: "Predictive Modeling Review"
abstract: "This chapter reviews fundamental concepts for predictive modeling. The chapter focuses on concepts instead of code, but introduces complete examples with code at the end."
format:
  html:
    toc: true
    embed-resources: true
    code-line-numbers: true
bibliography: references.bib
editor_options:
  chunk_output_type: console
---
```{r}
#| echo: false
exercise_number <- 1
```
```{r}
#| echo: false
#| warning: false
#| message: false
library(tidyverse)
library(here)
library(gridExtra)
theme_set(theme_minimal())
```
## Motivation
There are many policy applications where it is useful to make accurate predictions on "unseen" data. Unseen data means any data not used to develop a model for predictions.
The broad process of developing models to predict a pre-defined variable is called predictive modeling or supervised machine learning. The narrower process of developing a specific model with data is called fitting, learning, or estimating a model. The pre-defined variable the model is intended to predict is the **outcome variable**, and the remaining variables used to predict it are **predictors**.[^names]
[^names]: There are many names for outcome variables, including targets and dependent variables. There are many names for predictors, including features and independent variables.
In most applications we have **modeling data** that include the outcome of interest and **implementation data** that do not include the outcome of interest. Our objective is to fit, or estimate, some $\hat{f}(\vec{x})$ using the modeling data to predict the missing outcome in the implementation data.
![Modeling Data and Implementation Data for predicting the presence of lead in homes](images/data.png){#fig-data}
::: callout-tip
## Modeling Data
Modeling data contain the outcome variable of interest and the predictors used to fit models that predict it.
Modeling data are often historical data or high-cost data.
:::
::: callout-tip
## Implementation Data
Implementation data contain predictors for predicting the outcome of interest but don't necessarily contain the outcome of interest.
The implementation data need to contain all of the predictors used in the modeling data, and the implementation data can't be too different from the modeling data.
:::
### Examples
Predictive modeling is used in public policy for targeting interventions, amplified asking, scaled labeling, and forecasting and nowcasting. We explore each of these topics in more depth below.
::: callout-tip
## Targeting Interventions
Using predictive modeling to target the person, household, or place most in need of a policy intervention. Effective targeting can maximize the impact of a limited intervention.
:::
Consider "[Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning](https://www.dssgfellowship.org/wp-content/uploads/2016/01/p2039-potash.pdf)" by @potash2015.
The authors use historical data in Chicago from home lead inspectors and observable characteristics about homes and neighborhoods to predict if homes contain lead.
- **Motivation:** Lead poisoning is terrible for children, but Chicago has limited lead inspectors.
- **Implementation data:** Observed characteristics about homes and neighborhoods in Chicago.
- **Modeling data:** Historical characteristics about homes and neighborhoods in Chicago and results from home lead inspections.
- **Objective:** Predict the probability that a home contains lead in order to (1) prioritize sending inspectors to the homes of the most at-risk children before a child suffers from elevated blood lead levels (BLL) and (2) create a risk score for children that can be used by parents and health providers.
- **Tools:** Cross-validation, regularized logistic regression, linear SVC, and random forests.
- **Results:** The precision of the model was two to five times better than the baseline scenario of randomly sending inspectors. The advantage faded over time and in situations where more inspectors were available.
::: callout-tip
## Amplified Asking
Amplified asking [@salganik2018] trains a predictive model on a small amount of survey data linked to a larger data source and then applies the model to the rest of the larger source, producing estimates at a scale or geographic granularity not possible with either data set on its own.
:::
Consider "Calling for Better Measurement: Estimating an Individual's Wealth and Well-Being from Mobile Phone Transaction Records" by @blumenstock and "Predicting poverty and wealth from mobile phone metadata" by @blumenstock2015.
- **Motivation:** Measuring wealth and poverty in developing nations is challenging and expensive.
- **Implementation data:** 2005-2009 phone metadata from about 1.5 million phone customers in Rwanda.
- **Modeling data:** A follow-up stratified phone survey of 856 of the phone customers, with questions about asset ownership, home ownership, and basic welfare indicators, linked to the phone metadata.
- **Objective:** Use the phone metadata to accurately predict variables about asset ownership, home ownership, and basic welfare indicators. One of their outcome variables is the first principal component of the asset variables.
- **Tools:** 5-fold cross-validation, elastic net regression, regularized logistic regression, and deterministic finite automata.
- **Results:** The results in the first paper were poor, but follow-up work resulted in models with AUC as high as 0.88 and $R^2$ as high as 0.46.
::: callout-tip
## Scaled Labeling
The process of labeling a subset of data and then using predictive models to label the full data.
This is most effective when labeling the data is time consuming or expensive.
:::
Consider "[How Urban Piloted Data Science Techniques to Collect Land-Use Reform Data](https://urban-institute.medium.com/how-urban-piloted-data-science-techniques-to-collect-land-use-reform-data-475409903b88)" by @zheng2020.
- **Motivation:** Land-use reform and zoning are poorly understood at the national level. The Urban Institute set out to create a data set from newspaper articles that reflects land-use reforms.
- **Implementation data:** About 76,000 newspaper articles
- **Modeling data:** 568 hand-labelled newspaper articles
- **Objective:** Label the nature of zoning reforms on roughly 76,000 newspaper articles.
- **Tools:** Hand coding, natural language processing, cross-validation, and random forests
- **Results:** The results were not accurate enough to treat as a data set but could be used to target subsequent hand coding and investigation.
::: callout-tip
## Forecasting and Nowcasting
A forecast attempts to predict the most-likely future. A nowcast attempts to predict the most-likely present in situations where information is collected and reported with a lag.
:::
Consider the high-profile Google Flu Trends model by @ginsberg2009.
- **Motivation:** The CDC tracks flu-like illnesses and releases data on about a two-week lag. Understanding flu-like illnesses on a one-day lag can improve public health responses.
- **Implementation data:** Real-time counts of Google search queries.
- **Modeling data:** 2003-2008 CDC data about flu-like illnesses and counts for 50 million of the most common Google search queries in each state.
- **Objective:** Predict regional-level flu outbreaks on a one-day lag using Google search queries.
- **Tools:** Logistic regression and the top 45 influenza-related query terms.
- **Results:** The in-sample results were very promising, but the model performed poorly once implemented. Google launched Google Flu Trends but later discontinued it. The two major issues were:
1. **Drift:** The model's performance decayed over time because of changes in how people searched.
2. **Algorithmic confounding:** The model overestimated the 2009 flu season because of fear-induced search traffic during the swine flu outbreak.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. What other policy applications do you know for making accurate predictions on unseen data?
:::
## Regression
Predictive modeling is split into two approaches: regression and classification.
::: callout-tip
## Regression
Predictive modeling with a numeric outcome.
**Note:** Regression and linear regression are different ideas. We will use many algorithms for regression applications that are not linear regression.
:::
::: callout-tip
## Classification
Predictive modeling with a categorical outcome. The output of these models can be predicted classes of a categorical variable or predicted probabilities (e.g. 0.75 for "A" and 0.25 for "B").
:::
This chapter will focus on regression. A later chapter focuses on classification.
With regression, our general goal is to fit $\hat{f}(\vec{x})$ using modeling data that will minimize predictive errors on unseen data. There are many algorithms for fitting $\hat{f}(\vec{x})$. For regression, most of these algorithms fit conditional means; however, some use conditional quantiles like the conditional median.
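As a quick illustration of this distinction (added here as a sketch; the vector `outcomes` and the loss functions below are made up for the example), the constant prediction that minimizes squared error is the mean, while the constant that minimizes absolute error is the median:
```{r}
# illustration: squared error is minimized by the mean,
# absolute error is minimized by the median
outcomes <- c(1, 2, 3, 4, 100)

squared_loss <- function(c) mean((outcomes - c) ^ 2)
absolute_loss <- function(c) mean(abs(outcomes - c))

# search for the constant prediction that minimizes each loss
optimize(squared_loss, interval = range(outcomes))$minimum  # near mean(outcomes) = 22
optimize(absolute_loss, interval = range(outcomes))$minimum # near median(outcomes) = 3
```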
Families of predictive models:
- linear
- trees
- naive
- kernel
- NN
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Name all of the algorithms you know for each family of predictive model.
:::
## Regression Algorithms
Let's explore a few algorithms for generating predictions in regression applications. Consider some simulated data with 1,000 observations, one predictor, and one outcome.
```{r}
#| warning: false
library(tidymodels)
set.seed(20201004)
# simulate one predictor and a nonlinear, noisy outcome
x <- runif(n = 1000, min = 0, max = 10)
data1 <- bind_cols(
x = x,
y = 10 * sin(x) + x + 20 + rnorm(n = length(x), mean = 0, sd = 2)
)
```
For reasons explained later, we split the data into a training data set with 750 observations and a testing data set with 250 observations.
```{r}
set.seed(20201007)
# create a split object
data1_split <- initial_split(data = data1, prop = 0.75)
# create the training and testing data
data1_train <- training(x = data1_split)
data1_test <- testing(x = data1_split)
```
@fig-simulated-training-data visualizes the training data.
```{r}
#| label: fig-simulated-training-data
#| fig-cap: Simulated training data
# visualize the data
data1_train |>
ggplot(aes(x = x, y = y)) +
geom_point(alpha = 0.25) +
labs(title = "Example 1 Data") +
theme_minimal()
```
### Linear Regression
1. Find the line that minimizes the sum of squared errors between the line and the observed data.
**Note:** Linear regression is linear in its coefficients but can fit non-linear patterns in the data with the inclusion of higher order terms (for example, $x^2$).
@fig-simulated-regression shows four different linear regression models fit to the training data. Degree is the highest power of $x$ included in the model.
For example, "Degree = 1" means $\hat{y}_i = b_0 + b_1x_i$ and "Degree = 3" means $\hat{y}_i = b_0 + b_1x_i + b_2x_i^2 + b_3x_i^3$.
```{r}
#| code-fold: true
#| label: fig-simulated-regression
#| fig-cap: Fitting the non-linear pattern in the data requires higher order terms
lin_reg_model1 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
lin_reg_model2 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 2, raw = TRUE), data = data1_train)
lin_reg_model3 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 3, raw = TRUE), data = data1_train)
lin_reg_model4 <- linear_reg() |>
set_engine(engine = "lm") |>
set_mode(mode = "regression") |>
fit(formula = y ~ poly(x, degree = 4, raw = TRUE), data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Degree = 1` = predict(object = lin_reg_model1, new_data = new_data)$.pred,
`Degree = 2` = predict(object = lin_reg_model2, new_data = new_data)$.pred,
`Degree = 3` = predict(object = lin_reg_model3, new_data = new_data)$.pred,
`Degree = 4` = predict(object = lin_reg_model4, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Linear Regression)",
subtitle = "Prediction in red"
) +
theme_minimal()
```
### KNN
1. Find the $k$ closest observations using only the predictors. Closeness is typically measured with Euclidean distance.
2. Take the mean of the outcome variables for the $k$ closest observations.
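As a minimal sketch of these two steps (the function `knn_predict()` below is illustrative and written in base R rather than with `library(kknn)`), a KNN prediction for a single new observation can be computed directly:
```{r}
# a hand-rolled KNN prediction for one new observation
knn_predict <- function(x_new, x_train, y_train, k = 5) {
  # step 1: distance from the new point to every training point
  # (with one predictor, Euclidean distance is the absolute difference)
  distances <- abs(x_train - x_new)
  nearest <- order(distances)[1:k]
  # step 2: average the outcome of the k closest training observations
  mean(y_train[nearest])
}

knn_predict(x_new = 5, x_train = data1_train$x, y_train = data1_train$y, k = 5)
```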
```{r}
#| code-fold: true
#| label: fig-simulated-knn
#| fig-cap: Changing k leads to models that under- and over-fit to the data
# create a knn model specification
knn_mod1 <-
nearest_neighbor(neighbors = 1) |>
set_engine(engine = "kknn") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
knn_mod5 <-
nearest_neighbor(neighbors = 5) |>
set_engine(engine = "kknn") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
knn_mod50 <-
nearest_neighbor(neighbors = 50) |>
set_engine(engine = "kknn") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
knn_mod500 <-
nearest_neighbor(neighbors = 500) |>
set_engine(engine = "kknn") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`KNN with k=1` = predict(object = knn_mod1, new_data = new_data)$.pred,
`KNN with k=5` = predict(object = knn_mod5, new_data = new_data)$.pred,
`KNN with k=50` = predict(object = knn_mod50, new_data = new_data)$.pred,
`KNN with k=500` = predict(object = knn_mod500, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
# visualize the data
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (KNN)",
subtitle = "Prediction in red"
) +
theme_minimal()
```
### Regression Trees
1. Consider all binary splits of all predictors to split the data into two groups. Choose the split that minimizes the combined MSE of the outcome variable across the two resulting groups.
2. Repeat step 1 recursively until a stopping parameter is reached (cost complexity, minimum group size, tree depth).
[Decision Tree: The Obama-Clinton Divide](https://archive.nytimes.com/www.nytimes.com/imagepages/2008/04/16/us/20080416_OBAMA_GRAPHIC.html?scp=5&sq=Decision%20Obama%20clinton&st=cse) from the New York Times is a clear example of a regression tree.
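To make step 1 concrete, here is a small sketch (the helper `best_split()` is illustrative, not part of `library(rpart)`) that evaluates every candidate split point of the single predictor `x` and returns the split that minimizes the combined within-group MSE:
```{r}
# find the best single binary split of x for predicting y
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  # combined MSE if we split at s: each side is predicted by its own mean
  split_mse <- function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    if (length(left) == 0 || length(right) == 0) return(Inf)
    (sum((left - mean(left)) ^ 2) + sum((right - mean(right)) ^ 2)) / length(y)
  }
  mses <- vapply(candidates, split_mse, numeric(1))
  candidates[which.min(mses)]
}

best_split(x = data1_train$x, y = data1_train$y)
```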
```{r}
#| code-fold: true
#| label: fig-simulated-regression-trees
#| fig-cap: Changing the cost complexity parameter leads to models that are under- or over-fit to the data
reg_tree_mod1 <-
decision_tree(cost_complexity = 0.1) |>
set_engine(engine = "rpart") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
reg_tree_mod2 <-
decision_tree(cost_complexity = 0.01) |>
set_engine(engine = "rpart") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
reg_tree_mod3 <-
decision_tree(cost_complexity = 0.001) |>
set_engine(engine = "rpart") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
reg_tree_mod4 <-
decision_tree(cost_complexity = 0.0001) |>
set_engine(engine = "rpart") |>
set_mode(mode = "regression") |>
fit(formula = y ~ x, data = data1_train)
# create a grid of predictions
new_data <- tibble(x = seq(0, 10, 0.1))
predictions_grid <- tibble(
x = seq(0, 10, 0.1),
`Regression Tree with cp=0.1` = predict(object = reg_tree_mod1, new_data = new_data)$.pred,
`Regression Tree with cp=0.01` = predict(object = reg_tree_mod2, new_data = new_data)$.pred,
`Regression Tree with cp=0.001` = predict(object = reg_tree_mod3, new_data = new_data)$.pred,
`Regression Tree with cp=0.0001` = predict(object = reg_tree_mod4, new_data = new_data)$.pred
) |>
pivot_longer(-x, names_to = "model", values_to = ".pred")
# visualize the data
ggplot() +
geom_point(data = data1_train, aes(x = x, y = y), alpha = 0.25) +
geom_path(data = predictions_grid, aes(x = x, y = .pred), color = "red") +
facet_wrap(~model) +
labs(
title = "Example 1: Data with Predictions (Regression Trees)",
subtitle = "Prediction in red"
) +
theme_minimal()
```
Regression trees generate a series of binary splits that can be visualized. Consider the simple tree from @fig-simulated-regression-trees where `cp=0.1`.
```{r}
#| code-fold: true
#| message: false
library(rpart.plot)
rpart.plot(reg_tree_mod1$fit, roundint = FALSE)
```
Linear regression is a parametric approach. It requires making assumptions about the functional form of the data and reducing the model to a finite number of parameters.
KNN and regression trees are nonparametric approaches to predictive modeling because they do not require an assumption about the functional form of the model. The data-driven approach of nonparametric models is appealing because it requires fewer assumptions, but it often requires more data and care to not overfit the modeling data.
Regression trees are the simplest tree-based algorithms. Other tree-based algorithms, like gradient-boosted trees and random forests, often have far better performance than regression trees.
::: callout
#### [`r paste("Exercise", exercise_number)`]{style="color:#1696d2;"}
```{r}
#| echo: false
exercise_number <- exercise_number + 1
```
1. Use `library(tidymodels)` to fit a KNN model to the `data1_train`. Pick any plausible value of $k$.
2. Make predictions on the testing data.
3. Evaluate the model using `rmse()`.
:::
## Error
Different applications call for different error metrics. Root mean squared error (RMSE) and mean squared error (MSE) are popular metrics for regression.
$$
RMSE = \sqrt{\frac{1}{n}\sum_{i = 1}^n (y_i - \hat{y}_i)^2}
$$ {#eq-rmse}
$$
MSE = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat{y}_i)^2
$$ {#eq-mse}
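As a small check (added for illustration), RMSE can be computed directly from @eq-rmse or with `rmse_vec()` from `library(yardstick)`; the two approaches should match. Here we use the degree-1 linear regression model fit earlier:
```{r}
# predictions from the degree-1 linear regression model on the training data
preds <- predict(lin_reg_model1, new_data = data1_train)$.pred

# RMSE by hand, following the formula above
sqrt(mean((data1_train$y - preds) ^ 2))

# RMSE with yardstick (loaded with tidymodels)
rmse_vec(truth = data1_train$y, estimate = preds)
```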
Let's consider the MSE for a single observation $(x_0, y_0)$ and think about the model that generates $\hat{y}_0$, which we will denote $\hat{f}(x_0)$.
$$
MSE = (y_0 - \hat{f}(x_0))^2
$$ {#eq-mse-single}
Assuming the data are generated as $y_i = f(x_i) + \epsilon_i$ with independent errors $\epsilon_i \sim N(0, \sigma^2)$, we can decompose the expected value of the MSE into three types of error: irreducible error, bias error, and variance error. Understanding the types of error will inform modeling decisions.
$$
E[MSE] = E[(y_0 - \hat{f}(x_0))^2] = Var(\epsilon) + [Bias(\hat{f}(x_0))]^2 + Var(\hat{f}(x_0))
$$ {#eq-mse-ev1}
$$
E[MSE] = E[(y_0 - \hat{f}(x_0))^2] = \sigma^2 + (\text{model bias})^2 + \text{model variance}
$$ {#eq-mse-ev2}
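For completeness, here is a short derivation of the decomposition (added here; it assumes $y_0 = f(x_0) + \epsilon$ with $E[\epsilon] = 0$, $Var(\epsilon) = \sigma^2$, and $\epsilon$ independent of $\hat{f}(x_0)$):
$$
\begin{aligned}
E[(y_0 - \hat{f}(x_0))^2] &= E[(f(x_0) + \epsilon - \hat{f}(x_0))^2] \\
&= E[\epsilon^2] + 2E[\epsilon]E[f(x_0) - \hat{f}(x_0)] + E[(f(x_0) - \hat{f}(x_0))^2] \\
&= \sigma^2 + [E[\hat{f}(x_0)] - f(x_0)]^2 + Var(\hat{f}(x_0))
\end{aligned}
$$
The cross term vanishes because $E[\epsilon] = 0$, and $E[\hat{f}(x_0)] - f(x_0)$ is exactly $Bias(\hat{f}(x_0))$.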
::: callout-tip
## Irreducible error ($\sigma^2$)
Error that can't be reduced regardless of model quality. This is often caused by factors that affect the outcome of interest that aren't measured or included in the data set.
:::
::: callout-tip
## Bias error
Difference between the expected prediction of a predictive model and the correct value. Bias error is generally the error that comes from approximating complicated real-world problems with relatively simple models.
:::
Here, the model on the left does a poor job fitting the data (high bias) and the model on the right does a decent job fitting the data (low bias).
```{r}
#| label: model-bias
#| code-fold: true
#| fig-height: 2.5
#| fig-width: 4
#| fig-align: "center"
#| fig-cap: Bias
#| message: false
#| warning: false
set.seed(20200225)
sample1 <- tibble(
x = runif(100, min = -10, max = 10),
noise = rnorm(100, mean = 10, sd = 10),
y = -(x ^ 2) + noise
)
grid.arrange(
sample1 |>
ggplot(aes(x, y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm",
se = FALSE) +
theme_void() +
labs(title = "High Bias") +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm")),
sample1 |>
ggplot(aes(x, y)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE,
span = 0.08) +
theme_void() +
labs(title = "Low Bias") +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm")),
ncol = 2
)
```
::: callout-tip
## Variance error
How much the predicted value ($\hat{y}$) and fitted model ($\hat{f}$) change for a given data point when using a different sample of data to fit the model.
:::
Here, the model on the left does not change much as the training data change (low variance) and the model on the right changes a lot when the training data change (high variance).
```{r}
#| label: model-variance
#| code-fold: true
#| fig-cap: Variance
#| message: false
#| warning: false
#| fig-height: 4
set.seed(20200226)
sample2 <- tibble(
sample_number = rep(c("Sample 1", "Sample 2", "Sample 3", "Sample 4"), 100),
x = runif(400, min = -10, max = 10),
noise = rnorm(400, mean = 10, sd = 10),
y = -(x ^ 2) + noise
)
grid.arrange(
sample2 |>
ggplot(aes(x, y)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm",
se = FALSE,
alpha = 0.5) +
facet_wrap(~sample_number) +
theme_void() +
labs(title = "Low Variance") +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm")),
sample2 |>
ggplot(aes(x, y)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE,
span = 0.08,
alpha = 0.5) +
facet_wrap(~sample_number) +
theme_void() +
labs(title = "High Variance") +
theme(plot.margin = unit(c(1, 1, 1, 1), "cm")),
ncol = 2
)
```
As can be seen in the error decomposition and plots above, there is a trade-off between bias error and variance error: as a model becomes more flexible, bias error tends to fall while variance error tends to rise. Minimizing total error requires balancing the two.
The bias-variance trade-off can also be framed as
- the underfitting-overfitting trade-off
- the simplicity-complexity trade-off
```{r}
#| echo: false
#| fig-height: 2.5
#| fig-width: 4
#| fig-align: "center"
tribble(
~x, ~bias_squared, ~variance,
0, 10, 3,
1, 7, 3.02,
2, 5, 3.05,
3, 4, 3.1,
4, 3.5, 3.15,
5, 3.25, 3.25,
6, 3.15, 3.5,
7, 3.1, 4,
8, 3.05, 5,
9, 3.02, 7,
10, 3, 10
) |>
mutate(total_error = bias_squared + variance) |>
pivot_longer(-x, names_to = "key", values_to = "value") |>
ggplot(aes(x, value, color = key)) +
geom_point() +
geom_line() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(limits = c(0, NA),
labels = NULL) +
labs(x = "Complexity or Overfitting",
y = "Error") +
theme_minimal()
```
Our objective is to make accurate predictions on unseen data, so it is not enough for a model to predict well on the modeling data; it also needs to predict well on the implementation data.
::: callout-tip
## In-sample error
The predictive error of a model measured on the data used to estimate the predictive model.
:::
::: callout-tip
## Out-of-sample error
The predictive error of a model measured on the data not used to estimate the predictive model. Out-of-sample error is generally greater than the in-sample error.
:::
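To see this gap concretely, here is a quick check (added purely for illustration; in a real workflow you would not touch the testing data this early) comparing the in-sample and out-of-sample RMSE of the k = 1 KNN model fit earlier, which nearly memorizes its training data:
```{r}
# k = 1 KNN nearly memorizes the training data, so its in-sample error
# is far lower than its out-of-sample error
bind_rows(
  `in-sample` = rmse(
    bind_cols(data1_train, predict(knn_mod1, new_data = data1_train)),
    truth = y, estimate = .pred
  ),
  `out-of-sample` = rmse(
    bind_cols(data1_test, predict(knn_mod1, new_data = data1_test)),
    truth = y, estimate = .pred
  ),
  .id = "sample"
)
```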
::: callout-tip
## Generalizability
How well a model makes predictions on unseen data relative to how well it makes predictions on the data used to estimate the model.
:::
## Spending Data
Predictive modelers strategically "spend data" to create accurate models that generalize to the implementation data. We will focus on two strategies: the training-testing split and v-fold cross validation.
::: callout-tip
## Training data
A subset of data used to develop a predictive model. The share of data committed to a training set depends on the number of observations, the number of predictors in a model, and heterogeneity in the data. 0.8 is a common share.
:::
::: callout-tip
## Testing data
A subset of data used to estimate model performance. The testing set usually includes all observations not included in the training set. Do not look at these data until the very end, and only estimate the out-of-sample error rate on the testing set once. If the error rate is estimated more than once on the testing data, it will underestimate the true error rate.
:::
![](images/cross-validation.jpeg){#fig-cv}
::: callout-tip
## $v$-fold Cross-Validation
Break the data into $v$ equal-sized, mutually exclusive groups (folds). Train the model on data from $v - 1$ folds (analysis data) and measure the error metric (e.g., RMSE) on the left-out fold (assessment data). Repeat this process $v$ times, each time leaving out a different fold. Average the $v$ error metrics.
$v$-fold cross-validation is sometimes called $k$-fold cross-validation.
:::
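As a sketch of what happens inside cross-validation (added for illustration; `tune_grid()` automates all of this later in the chapter), the loop below fits a degree-3 linear regression on the analysis data of each fold and computes the RMSE on each assessment set:
```{r}
# a manual pass through 10-fold cross-validation
set.seed(20201010)
manual_folds <- vfold_cv(data = data1_train, v = 10)

fold_rmse <- map_dbl(
  manual_folds$splits,
  function(split) {
    # fit on the analysis data for this fold
    model_fit <- linear_reg() |>
      set_engine(engine = "lm") |>
      fit(y ~ poly(x, degree = 3), data = analysis(split))
    # evaluate on the held-out assessment data
    preds <- predict(model_fit, new_data = assessment(split))$.pred
    rmse_vec(truth = assessment(split)$y, estimate = preds)
  }
)

# the cross-validated error estimate is the average of the per-fold errors
mean(fold_rmse)
```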
Strategies for spending data differ based on the application and the size and nature of the data. Ideally, the training-testing split is made at the beginning of the modeling process. $v$-fold cross-validation is then used on the training data to compare
1. approaches for feature/target engineering
2. algorithms
3. hyperparameter tuning
::: callout-tip
## Data Leakage
When information that won't be available when the model makes out-of-sample predictions is used when estimating a model. Looking at data from the testing set creates data leakage. Data leakage leads to an underestimate of out-of-sample error.
:::
::: {.callout-warning}
Feature and target engineering is a common source of data leakage.
For example, regularized regression models expect variables to be normalized (subtract the sample mean from each predictor and divide by the sample standard deviation). A naive approach would be to calculate the means and standard deviations once on the full data set and use them on the testing data or within cross-validation. This causes data leakage and biased underestimates of the out-of-sample error rate.
New means and standard deviations need to be estimated every time a model is fit, including within each iteration of a resampling method! This is a lot of work, but `library(tidymodels)` makes it simple.
:::
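Here is a sketch of the leakage-safe pattern (added for illustration; the object names are arbitrary, and the same pattern appears in the KNN pipeline below): the normalization is written into a recipe inside a workflow, so the means and standard deviations are re-estimated from the analysis data within every resample rather than computed once on the full data set.
```{r}
# leakage-safe: step_normalize() lives inside the workflow, so its means and
# standard deviations are re-estimated on the analysis data of each resample
leakage_safe_workflow <-
  workflow() |>
  add_recipe(
    recipe(y ~ x, data = data1_train) |>
      step_normalize(all_predictors())
  ) |>
  add_model(
    nearest_neighbor(neighbors = 5) |>
      set_engine(engine = "kknn") |>
      set_mode(mode = "regression")
  )

# fit_resamples() preps the recipe separately within each fold
leakage_safe_workflow |>
  fit_resamples(resamples = vfold_cv(data = data1_train, v = 10)) |>
  collect_metrics()
```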
## Predictive Modeling Pipelines
The rest of this chapter focuses on a predictive modeling pipeline applied to the simulated data. This pipeline is a general approach to creating predictive models and can change based on the specifics of an application. We will demonstrate the pipeline using KNN and regression trees.
0. Problem Formulation
1. Split data into training and testing data
2. Exploratory Data Analysis
3. Set up Resampling for Model Selection
4. Create Candidate Models
5. Test and Choose the "Best" Model
6. Optional: Expand the Model Fit
7. Evaluate the Final Model
8. Optional: Implement the Final Model
### Example: KNN
#### 1. Split data into training and testing data
```{r}
set.seed(20201007)
# create a split object
data1_split <- initial_split(data = data1, prop = 0.75)
# create the training and testing data
data1_train <- training(x = data1_split)
data1_test <- testing(x = data1_split)
```
#### 2. Exploratory Data Analysis
:::{.callout-warning}
Only perform EDA on the training data.
Exploring the testing data will lead to data leakage.
:::
```{r}
data1_train |>
ggplot(aes(x = x, y = y)) +
geom_point(alpha = 0.25) +
labs(title = "Example 1 Data") +
theme_minimal()
```
#### 3. Set up Resampling for Model Selection
We use 10-fold cross validation for hyperparameter tuning and model selection. The training data are small (750 observations and one predictor), so we will repeat the entire cross validation process five times.
Estimating out-of-sample error rates with cross-validation is an uncertain process. Repeated cross-validation improves the accuracy of our estimated error rates.
```{r}
data1_folds <- vfold_cv(data = data1_train, v = 10, repeats = 5)
```
#### 4. Create Candidate Models
We have three main levers for creating candidate models:
1. **Feature and target engineering.** Feature and target engineering is generally specified when creating a recipe with `recipe()` and `step_*()` functions.
2. **Switching algorithms.** Algorithms are specified using functions from `library(parsnip)`.
3. **Hyperparameter tuning.** Hyperparameter tuning is typically handled by using `tune()` when picking an algorithm and then creating a hyperparameter tuning grid.
It is common to normalize (subtract the sample mean and divide by the sample standard deviation) predictors when using KNN.
```{r}
knn_recipe <-
recipe(formula = y ~ ., data = data1_train) |>
step_normalize(all_predictors())
```
We use `nearest_neighbor()` to create a KNN model and we use `tune()` as a placeholder for `k`. Note that we can tune many hyperparameters for a given model. The Regression Tree example later in this chapter illustrates this concept.
```{r}
knn_mod <-
nearest_neighbor(neighbors = tune()) |>
set_engine(engine = "kknn") |>
set_mode(mode = "regression")
```
We use `grid_regular()` to specify a hyperparameter tuning grid for potential values of $k$.
```{r}
knn_grid <- grid_regular(neighbors(range = c(1, 99)), levels = 10)
knn_grid
```
Finally, we combine the recipe and model objects into a `workflow()`.
```{r}
knn_workflow <-
workflow() |>
add_recipe(recipe = knn_recipe) |>
add_model(spec = knn_mod)
```
#### 5. Test and Choose the "Best" Model
`tune_grid()` fits a model for every combination of hyperparameters and resample in the repeated cross-validation and captures the RMSE on each assessment data set.
```{r}
knn_res <-
knn_workflow |>
tune_grid(
resamples = data1_folds,
grid = knn_grid,
metrics = metric_set(rmse)
)
```
We can quickly extract information from the tuned results object with functions like `collect_metrics()`, `show_best()`, and `autoplot()`.
```{r}
knn_res |>
collect_metrics()
knn_res |>
show_best()
knn_res |>
autoplot()
```
#### 6. Optional: Expand the Model Fit
Sometimes we will pick the best predictive model specification (feature and target engineering + algorithm + hyperparameters) and fit the model on all of the training data.
First, finalize the workflow with the "best" hyperparameters from the tuned results.
```{r}
final_wf <-
knn_workflow |>
finalize_workflow(select_best(knn_res))
```
Second, use `last_fit()` to fit the finalized model on all of the training data and evaluate it on the testing data.
```{r}
final_fit <-
final_wf |>
last_fit(data1_split)
```
::: {.callout-note}
`select_best()` takes a narrow view of model selection and picks the model with the lowest error rate.
It is important to consider other elements when picking a "best" model. Other considerations include parsimony, cost, and equity.
:::
#### 7. Evaluate the Final Model
The final fit object contains the out-of-sample error metric from the testing data. This metric is our best estimate of model performance on unseen data.
In this case, the testing data error is slightly higher than the training data error, which makes sense.
```{r}
final_fit |>
collect_metrics()
```
#### 8. Optional: Implement the Final Model
Only implement the model if it achieves the objectives identified during problem formulation.
`final_fit` and `predict()` can be used to apply the model to new, unseen data.
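For example, here is a minimal sketch (the new observations below are hypothetical): the workflow trained by `last_fit()` can be extracted with `extract_workflow()` and then passed to `predict()`.
```{r}
# extract the workflow that was trained by last_fit()
final_model <- extract_workflow(final_fit)

# hypothetical new data with the same predictor(s) as the modeling data
new_observations <- tibble(x = c(1.5, 4.2, 8.8))

predict(final_model, new_data = new_observations)
```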
### Example: Regression Trees
We'll next work with a subset of data from the `Chicago` data set from `library(tidymodels)`.
```{r}
chicago_small <- Chicago |>
select(
ridership,
Clark_Lake,
Quincy_Wells,
Irving_Park,
Monroe,
Polk,
temp,
percip,
Bulls_Home,
Bears_Home,
WhiteSox_Home,
Cubs_Home,
date
) |>
mutate(weekend = wday(date, label = TRUE) %in% c("Sat", "Sun")) |>
glimpse()
```
#### 0. Problem Formulation
The variable `ridership` is the outcome variable. It measures ridership at the Clark/Lake L station in Chicago. The variables `Clark_Lake`, `Quincy_Wells`, `Irving_Park`, `Monroe`, and `Polk` measure ridership at several stations *14 days before* the `ridership` variable.
The objective is to predict `ridership` to better allocate staff and resources to the Clark/Lake L station.
#### 1. Split data into training and testing data
```{r}
set.seed(20231106)
# create a split object
chicago_split <- initial_split(data = chicago_small, prop = 0.8)
# create the training and testing data
chicago_train <- training(x = chicago_split)
chicago_test <- testing(x = chicago_split)
```
#### 2. Exploratory Data Analysis
[This chapter](https://bookdown.org/max/FES/visualizations-for-numeric-data-exploring-train-ridership-data.html) of *Feature Engineering and Selection* contains EDA for the data set of interest.
#### 3. Set up Resampling for Model Selection
We use 10-fold cross validation for hyperparameter tuning and model selection. The training data are larger than in the previous example, so we skip repeats.
```{r}
chicago_folds <- vfold_cv(data = chicago_train, v = 10)
```
#### 4. Create Candidate Models
We will use regression trees for this example. For feature and target engineering, we will simply remove the date column.
```{r}
rpart_recipe <-
recipe(formula = ridership ~ ., data = chicago_train) |>
step_rm(date)
```
We specify a regression tree with the rpart engine.
```{r}
rpart_mod <-
decision_tree(
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) |>
set_engine(engine = "rpart") |>
set_mode(mode = "regression")
```
We use `grid_regular()` to specify a hyperparameter tuning grid for potential values of the cost-complexity parameter, tree depth parameter, and minimum n.
```{r}
rpart_grid <- grid_regular(
cost_complexity(range = c(-5, -1)),
tree_depth(range = c(3, 15)),
min_n(),
levels = 10
)
rpart_grid
```
Finally, we combine the recipe and model objects into a `workflow()`.
```{r}
rpart_workflow <-
workflow() |>
add_recipe(recipe = rpart_recipe) |>
add_model(spec = rpart_mod)
```
#### 5. Test and Choose the "Best" Model
```{r}
rpart_res <-
rpart_workflow |>
tune_grid(resamples = chicago_folds,
grid = rpart_grid,
metrics = metric_set(rmse))
rpart_res |>
collect_metrics()
rpart_res |>