<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>nutriverse</title>
<link>/</link>
<atom:link href="/index.xml" rel="self" type="application/rss+xml" />
<description>nutriverse</description>
<generator>Hugo -- gohugo.io</generator><language>en-gb</language><lastBuildDate>Thu, 25 Jun 2020 00:00:00 +0000</lastBuildDate>
<item>
<title>Build a model</title>
<link>/start/models/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/start/models/</guid>
<description><h2 id="intro">Introduction</h2>
<p>How do you create a statistical model using tidymodels? In this article, we will walk you through the steps. We start with data for modeling, learn how to specify and train models with different engines using the
<a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip package</a>, and understand why these functions are designed this way.</p>
<p>To use code in this article, you will need to install the following packages: readr, rstanarm, and tidymodels.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(tidymodels) <span style="color:#408080;font-style:italic"># for the parsnip package, along with the rest of tidymodels</span>
<span style="color:#408080;font-style:italic"># Helper packages</span>
<span style="color:#00f">library</span>(readr) <span style="color:#408080;font-style:italic"># for importing data</span>
</code></pre></div><h2 id="data">The Sea Urchins Data</h2>
<p>Let&rsquo;s use the data from
<a href="https://link.springer.com/article/10.1007/BF00349318" target="_blank" rel="noopener">Constable (1993)</a> to explore how three different feeding regimes affect the size of sea urchins over time. The initial size of the sea urchins at the beginning of the experiment probably affects how big they grow as they are fed.</p>
<p>To start, let&rsquo;s read our urchins data into R, which we&rsquo;ll do by providing
<a href="https://readr.tidyverse.org/reference/read_delim.html" target="_blank" rel="noopener"><code>readr::read_csv()</code></a> with a url where our CSV data is located (&ldquo;<a href="https://tidymodels.org/start/models/urchins.csv">https://tidymodels.org/start/models/urchins.csv</a>&rdquo;):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">urchins <span style="color:#666">&lt;-</span>
<span style="color:#408080;font-style:italic"># Data were assembled for a tutorial </span>
<span style="color:#408080;font-style:italic"># at https://www.flutterbys.com.au/stats/tut/tut7.5a.html</span>
<span style="color:#00f">read_csv</span>(<span style="color:#ba2121">&#34;https://tidymodels.org/start/models/urchins.csv&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#408080;font-style:italic"># Change the names to be a little more verbose</span>
<span style="color:#00f">setNames</span>(<span style="color:#00f">c</span>(<span style="color:#ba2121">&#34;food_regime&#34;</span>, <span style="color:#ba2121">&#34;initial_volume&#34;</span>, <span style="color:#ba2121">&#34;width&#34;</span>)) <span style="color:#666">%&gt;%</span>
<span style="color:#408080;font-style:italic"># Factors are very helpful for modeling, so we convert one column</span>
<span style="color:#00f">mutate</span>(food_regime <span style="color:#666">=</span> <span style="color:#00f">factor</span>(food_regime, levels <span style="color:#666">=</span> <span style="color:#00f">c</span>(<span style="color:#ba2121">&#34;Initial&#34;</span>, <span style="color:#ba2121">&#34;Low&#34;</span>, <span style="color:#ba2121">&#34;High&#34;</span>)))
<span style="color:#408080;font-style:italic">#&gt; Parsed with column specification:</span>
<span style="color:#408080;font-style:italic">#&gt; cols(</span>
<span style="color:#408080;font-style:italic">#&gt; TREAT = col_character(),</span>
<span style="color:#408080;font-style:italic">#&gt; IV = col_double(),</span>
<span style="color:#408080;font-style:italic">#&gt; SUTW = col_double()</span>
<span style="color:#408080;font-style:italic">#&gt; )</span>
</code></pre></div><p>Let&rsquo;s take a quick look at the data:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">urchins
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 72 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; food_regime initial_volume width</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 Initial 3.5 0.01 </span>
<span style="color:#408080;font-style:italic">#&gt; 2 Initial 5 0.02 </span>
<span style="color:#408080;font-style:italic">#&gt; 3 Initial 8 0.061</span>
<span style="color:#408080;font-style:italic">#&gt; 4 Initial 10 0.051</span>
<span style="color:#408080;font-style:italic">#&gt; 5 Initial 13 0.041</span>
<span style="color:#408080;font-style:italic">#&gt; 6 Initial 13 0.061</span>
<span style="color:#408080;font-style:italic">#&gt; 7 Initial 15 0.041</span>
<span style="color:#408080;font-style:italic">#&gt; 8 Initial 15 0.071</span>
<span style="color:#408080;font-style:italic">#&gt; 9 Initial 16 0.092</span>
<span style="color:#408080;font-style:italic">#&gt; 10 Initial 17 0.051</span>
<span style="color:#408080;font-style:italic">#&gt; # … with 62 more rows</span>
</code></pre></div><p>The urchins data is a
<a href="https://tibble.tidyverse.org/index.html" target="_blank" rel="noopener">tibble</a>. If you are new to tibbles, the best place to start is the
<a href="https://r4ds.had.co.nz/tibbles.html" target="_blank" rel="noopener">tibbles chapter</a> in <em>R for Data Science</em>. For each of the 72 urchins, we know their:</p>
<ul>
<li>experimental feeding regime group (<code>food_regime</code>: either <code>Initial</code>, <code>Low</code>, or <code>High</code>),</li>
<li>size in milliliters at the start of the experiment (<code>initial_volume</code>), and</li>
<li>suture width at the end of the experiment (<code>width</code>).</li>
</ul>
<p>As a first step in modeling, it&rsquo;s always a good idea to plot the data:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">ggplot</span>(urchins,
<span style="color:#00f">aes</span>(x <span style="color:#666">=</span> initial_volume,
y <span style="color:#666">=</span> width,
group <span style="color:#666">=</span> food_regime,
col <span style="color:#666">=</span> food_regime)) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>() <span style="color:#666">+</span>
<span style="color:#00f">geom_smooth</span>(method <span style="color:#666">=</span> lm, se <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">FALSE</span>) <span style="color:#666">+</span>
<span style="color:#00f">scale_color_viridis_d</span>(option <span style="color:#666">=</span> <span style="color:#ba2121">&#34;plasma&#34;</span>, end <span style="color:#666">=</span> <span style="color:#666">.7</span>)
<span style="color:#408080;font-style:italic">#&gt; `geom_smooth()` using formula &#39;y ~ x&#39;</span>
</code></pre></div><p><img src="figs/urchin-plot-1.svg" width="672" /></p>
<p>We can see that urchins that were larger in volume at the start of the experiment tended to have wider sutures at the end, but the slopes of the lines look different so this effect may depend on the feeding regime condition.</p>
<h2 id="build-model">Build and fit a model</h2>
<p>A standard two-way analysis of variance (
<a href="https://www.itl.nist.gov/div898/handbook/prc/section4/prc43.htm" target="_blank" rel="noopener">ANOVA</a>) model makes sense for this dataset because we have both a continuous predictor and a categorical predictor. Since the slopes appear to be different for at least two of the feeding regimes, let&rsquo;s build a model that allows for two-way interactions. Specifying an R formula with our variables in this way:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">width <span style="color:#666">~</span> initial_volume <span style="color:#666">*</span> food_regime
</code></pre></div><p>allows our regression model to estimate a separate slope and intercept for the effect of initial volume within each food regime.</p>
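<p>As a quick sketch of what that expansion looks like (a toy data frame here; the variable names are made up for illustration), base R&rsquo;s <code>model.matrix()</code> shows that an interaction formula adds a shifted intercept and a shifted slope for each non-reference factor level:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"># Toy example: one numeric predictor and one two-level factor
d &lt;- data.frame(x = 1:4, g = factor(c(&#34;a&#34;, &#34;a&#34;, &#34;b&#34;, &#34;b&#34;)))
# The interaction formula yields an intercept, a slope, and a
# level-specific intercept and slope adjustment for level &#34;b&#34;
colnames(model.matrix(~ x * g, data = d))
#&gt; [1] &#34;(Intercept)&#34; &#34;x&#34;           &#34;gb&#34;          &#34;x:gb&#34;
</code></pre></div>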
<p>For this kind of model, ordinary least squares is a good initial approach. With tidymodels, we start by specifying the <em>functional form</em> of the model that we want using the
<a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip package</a>. Since there is a numeric outcome and the model should be linear with slopes and intercepts, the model type is
<a href="https://tidymodels.github.io/parsnip/reference/linear_reg.html" target="_blank" rel="noopener">&ldquo;linear regression&rdquo;</a>. We can declare this with:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">linear_reg</span>()
<span style="color:#408080;font-style:italic">#&gt; Linear Regression Model Specification (regression)</span>
</code></pre></div><p>That is pretty underwhelming since, on its own, it doesn&rsquo;t really do much. However, now that the type of model has been specified, a method for <em>fitting</em> or training the model can be stated using the <strong>engine</strong>. The engine value is often a mash-up of the software that can be used to fit or train the model as well as the estimation method. For example, to use ordinary least squares, we can set the engine to be <code>lm</code>:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">linear_reg</span>() <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;lm&#34;</span>)
<span style="color:#408080;font-style:italic">#&gt; Linear Regression Model Specification (regression)</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Computational engine: lm</span>
</code></pre></div><p>The
<a href="https://tidymodels.github.io/parsnip/reference/linear_reg.html" target="_blank" rel="noopener">documentation page for <code>linear_reg()</code></a> lists the possible engines. We&rsquo;ll save this model object as <code>lm_mod</code>.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">lm_mod <span style="color:#666">&lt;-</span>
<span style="color:#00f">linear_reg</span>() <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;lm&#34;</span>)
</code></pre></div><p>From here, the model can be estimated or trained using the
<a href="https://tidymodels.github.io/parsnip/reference/fit.html" target="_blank" rel="noopener"><code>fit()</code></a> function:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">lm_fit <span style="color:#666">&lt;-</span>
lm_mod <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(width <span style="color:#666">~</span> initial_volume <span style="color:#666">*</span> food_regime, data <span style="color:#666">=</span> urchins)
lm_fit
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 3ms </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call:</span>
<span style="color:#408080;font-style:italic">#&gt; stats::lm(formula = formula, data = data)</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Coefficients:</span>
<span style="color:#408080;font-style:italic">#&gt; (Intercept) initial_volume </span>
<span style="color:#408080;font-style:italic">#&gt; 0.0331216 0.0015546 </span>
<span style="color:#408080;font-style:italic">#&gt; food_regimeLow food_regimeHigh </span>
<span style="color:#408080;font-style:italic">#&gt; 0.0197824 0.0214111 </span>
<span style="color:#408080;font-style:italic">#&gt; initial_volume:food_regimeLow initial_volume:food_regimeHigh </span>
<span style="color:#408080;font-style:italic">#&gt; -0.0012594 0.0005254</span>
</code></pre></div><p>Perhaps our analysis requires a description of the model parameter estimates and their statistical properties. Although the <code>summary()</code> function for <code>lm</code> objects can provide that, it gives the results back in an unwieldy format. Many models have a <code>tidy()</code> method that provides the summary results in a more predictable and useful format (e.g. a data frame with standard column names):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">tidy</span>(lm_fit)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 6 x 5</span>
<span style="color:#408080;font-style:italic">#&gt; term estimate std.error statistic p.value</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 (Intercept) 0.0331 0.00962 3.44 0.00100 </span>
<span style="color:#408080;font-style:italic">#&gt; 2 initial_volume 0.00155 0.000398 3.91 0.000222</span>
<span style="color:#408080;font-style:italic">#&gt; 3 food_regimeLow 0.0198 0.0130 1.52 0.133 </span>
<span style="color:#408080;font-style:italic">#&gt; 4 food_regimeHigh 0.0214 0.0145 1.47 0.145 </span>
<span style="color:#408080;font-style:italic">#&gt; 5 initial_volume:food_regimeLow -0.00126 0.000510 -2.47 0.0162 </span>
<span style="color:#408080;font-style:italic">#&gt; 6 initial_volume:food_regimeHigh 0.000525 0.000702 0.748 0.457</span>
</code></pre></div><h2 id="predict-model">Use a model to predict</h2>
<p>This fitted object <code>lm_fit</code> has the <code>lm</code> model output built-in, which you can access with <code>lm_fit$fit</code>, but there are some benefits to using the fitted parsnip model object when it comes to predicting.</p>
<p>Suppose that, for a publication, it would be particularly interesting to make a plot of the mean body size for urchins that started the experiment with an initial volume of 20ml. To create such a graph, we start with some new example data that we will make predictions for, to show in our graph:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">new_points <span style="color:#666">&lt;-</span> <span style="color:#00f">expand.grid</span>(initial_volume <span style="color:#666">=</span> <span style="color:#666">20</span>,
food_regime <span style="color:#666">=</span> <span style="color:#00f">c</span>(<span style="color:#ba2121">&#34;Initial&#34;</span>, <span style="color:#ba2121">&#34;Low&#34;</span>, <span style="color:#ba2121">&#34;High&#34;</span>))
new_points
<span style="color:#408080;font-style:italic">#&gt; initial_volume food_regime</span>
<span style="color:#408080;font-style:italic">#&gt; 1 20 Initial</span>
<span style="color:#408080;font-style:italic">#&gt; 2 20 Low</span>
<span style="color:#408080;font-style:italic">#&gt; 3 20 High</span>
</code></pre></div><p>To get our predicted results, we can use the <code>predict()</code> function to find the mean values at 20ml.</p>
<p>It is also important to communicate the variability, so we also need to find the predicted confidence intervals. If we had used <code>lm()</code> to fit the model directly, a few minutes of reading the
<a href="https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.lm.html" target="_blank" rel="noopener">documentation page</a> for <code>predict.lm()</code> would explain how to do this. However, if we decide to use a different model to estimate urchin size (<em>spoiler:</em> we will!), it is likely that a completely different syntax would be required.</p>
<p>Instead, with tidymodels, the types of predicted values are standardized so that we can use the same syntax to get these values.</p>
<p>First, let&rsquo;s generate the mean body width values:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">mean_pred <span style="color:#666">&lt;-</span> <span style="color:#00f">predict</span>(lm_fit, new_data <span style="color:#666">=</span> new_points)
mean_pred
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 3 x 1</span>
<span style="color:#408080;font-style:italic">#&gt; .pred</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 0.0642</span>
<span style="color:#408080;font-style:italic">#&gt; 2 0.0588</span>
<span style="color:#408080;font-style:italic">#&gt; 3 0.0961</span>
</code></pre></div><p>When making predictions, the tidymodels convention is to always produce a tibble of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">conf_int_pred <span style="color:#666">&lt;-</span> <span style="color:#00f">predict</span>(lm_fit,
new_data <span style="color:#666">=</span> new_points,
type <span style="color:#666">=</span> <span style="color:#ba2121">&#34;conf_int&#34;</span>)
conf_int_pred
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 3 x 2</span>
<span style="color:#408080;font-style:italic">#&gt; .pred_lower .pred_upper</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 0.0555 0.0729</span>
<span style="color:#408080;font-style:italic">#&gt; 2 0.0499 0.0678</span>
<span style="color:#408080;font-style:italic">#&gt; 3 0.0870 0.105</span>
<span style="color:#408080;font-style:italic"># Now combine: </span>
plot_data <span style="color:#666">&lt;-</span>
new_points <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(mean_pred) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(conf_int_pred)
<span style="color:#408080;font-style:italic"># and plot:</span>
<span style="color:#00f">ggplot</span>(plot_data, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> food_regime)) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(<span style="color:#00f">aes</span>(y <span style="color:#666">=</span> .pred)) <span style="color:#666">+</span>
<span style="color:#00f">geom_errorbar</span>(<span style="color:#00f">aes</span>(ymin <span style="color:#666">=</span> .pred_lower,
ymax <span style="color:#666">=</span> .pred_upper),
width <span style="color:#666">=</span> <span style="color:#666">.2</span>) <span style="color:#666">+</span>
<span style="color:#00f">labs</span>(y <span style="color:#666">=</span> <span style="color:#ba2121">&#34;urchin size&#34;</span>)
</code></pre></div><p><img src="figs/lm-all-pred-1.svg" width="672" /></p>
<h2 id="new-engine">Model with a different engine</h2>
<p>Everyone on your team is happy with that plot <em>except</em> that one person who just read their first book on
<a href="https://bayesian.org/what-is-bayesian-analysis/" target="_blank" rel="noopener">Bayesian analysis</a>. They are interested in knowing if the results would be different if the model were estimated using a Bayesian approach. In such an analysis, a
<a href="https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7" target="_blank" rel="noopener"><em>prior distribution</em></a> needs to be declared for each model parameter that represents the possible values of the parameters (before being exposed to the observed data). After some discussion, the group agrees that the priors should be bell-shaped but, since no one has any idea what the range of values should be, to take a conservative approach and make the priors <em>wide</em> using a Cauchy distribution (which is the same as a t-distribution with a single degree of freedom).</p>
<p>The
<a href="https://mc-stan.org/rstanarm/articles/priors.html" target="_blank" rel="noopener">documentation</a> on the rstanarm package shows us that the <code>stan_glm()</code> function can be used to estimate this model, and that the function arguments that need to be specified are called <code>prior</code> and <code>prior_intercept</code>. It turns out that <code>linear_reg()</code> has a
<a href="https://tidymodels.github.io/parsnip/reference/linear_reg.html#details" target="_blank" rel="noopener"><code>stan</code> engine</a>. Since these prior distribution arguments are specific to the Stan software, they are passed as arguments to
<a href="https://tidymodels.github.io/parsnip/reference/set_engine.html" target="_blank" rel="noopener"><code>parsnip::set_engine()</code></a>. After that, the same exact <code>fit()</code> call is used:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#408080;font-style:italic"># set the prior distribution</span>
prior_dist <span style="color:#666">&lt;-</span> rstanarm<span style="color:#666">::</span><span style="color:#00f">student_t</span>(df <span style="color:#666">=</span> <span style="color:#666">1</span>)
<span style="color:#00f">set.seed</span>(<span style="color:#666">123</span>)
<span style="color:#408080;font-style:italic"># make the parsnip model</span>
bayes_mod <span style="color:#666">&lt;-</span>
<span style="color:#00f">linear_reg</span>() <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;stan&#34;</span>,
prior_intercept <span style="color:#666">=</span> prior_dist,
prior <span style="color:#666">=</span> prior_dist)
<span style="color:#408080;font-style:italic"># train the model</span>
bayes_fit <span style="color:#666">&lt;-</span>
bayes_mod <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(width <span style="color:#666">~</span> initial_volume <span style="color:#666">*</span> food_regime, data <span style="color:#666">=</span> urchins)
<span style="color:#00f">print</span>(bayes_fit, digits <span style="color:#666">=</span> <span style="color:#666">5</span>)
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 1.5s </span>
<span style="color:#408080;font-style:italic">#&gt; stan_glm</span>
<span style="color:#408080;font-style:italic">#&gt; family: gaussian [identity]</span>
<span style="color:#408080;font-style:italic">#&gt; formula: width ~ initial_volume * food_regime</span>
<span style="color:#408080;font-style:italic">#&gt; observations: 72</span>
<span style="color:#408080;font-style:italic">#&gt; predictors: 6</span>
<span style="color:#408080;font-style:italic">#&gt; ------</span>
<span style="color:#408080;font-style:italic">#&gt; Median MAD_SD </span>
<span style="color:#408080;font-style:italic">#&gt; (Intercept) 0.03452 0.00883</span>
<span style="color:#408080;font-style:italic">#&gt; initial_volume 0.00150 0.00037</span>
<span style="color:#408080;font-style:italic">#&gt; food_regimeLow 0.01805 0.01221</span>
<span style="color:#408080;font-style:italic">#&gt; food_regimeHigh 0.01934 0.01367</span>
<span style="color:#408080;font-style:italic">#&gt; initial_volume:food_regimeLow -0.00119 0.00047</span>
<span style="color:#408080;font-style:italic">#&gt; initial_volume:food_regimeHigh 0.00061 0.00065</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Auxiliary parameter(s):</span>
<span style="color:#408080;font-style:italic">#&gt; Median MAD_SD </span>
<span style="color:#408080;font-style:italic">#&gt; sigma 0.02121 0.00186</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; ------</span>
<span style="color:#408080;font-style:italic">#&gt; * For help interpreting the printed output see ?print.stanreg</span>
<span style="color:#408080;font-style:italic">#&gt; * For info on the priors used see ?prior_summary.stanreg</span>
</code></pre></div><p>This kind of Bayesian analysis (like many models) involves randomly generated numbers in its fitting procedure. We can use <code>set.seed()</code> to ensure that the same (pseudo-)random numbers are generated each time we run this code. The number <code>123</code> isn&rsquo;t special or related to our data; it is just a &ldquo;seed&rdquo; used to choose random numbers.</p>
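<p>A minimal, self-contained sketch of what <code>set.seed()</code> buys us (plain <code>rnorm()</code> draws here, standing in for the random numbers used during model fitting):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"># Re-setting the same seed reproduces the same pseudo-random draws
set.seed(123)
a &lt;- rnorm(3)
set.seed(123)
b &lt;- rnorm(3)
identical(a, b)
#&gt; [1] TRUE
</code></pre></div>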
<p>To update the parameter table, the <code>tidy()</code> method is once again used:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">tidy</span>(bayes_fit, intervals <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">TRUE</span>)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 6 x 5</span>
<span style="color:#408080;font-style:italic">#&gt; term estimate std.error lower upper</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 (Intercept) 0.0345 0.00883 0.0200 0.0490 </span>
<span style="color:#408080;font-style:italic">#&gt; 2 initial_volume 0.00150 0.000369 0.000895 0.00212 </span>
<span style="color:#408080;font-style:italic">#&gt; 3 food_regimeLow 0.0181 0.0122 -0.00181 0.0380 </span>
<span style="color:#408080;font-style:italic">#&gt; 4 food_regimeHigh 0.0193 0.0137 -0.00317 0.0420 </span>
<span style="color:#408080;font-style:italic">#&gt; 5 initial_volume:food_regimeLow -0.00119 0.000472 -0.00199 -0.000413</span>
<span style="color:#408080;font-style:italic">#&gt; 6 initial_volume:food_regimeHigh 0.000610 0.000651 -0.000490 0.00170</span>
</code></pre></div><p>A goal of the tidymodels packages is that the <strong>interfaces to common tasks are standardized</strong> (as seen in the <code>tidy()</code> results above). The same is true for getting predictions; we can use the same code even though the underlying packages use very different syntax:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">bayes_plot_data <span style="color:#666">&lt;-</span>
new_points <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(<span style="color:#00f">predict</span>(bayes_fit, new_data <span style="color:#666">=</span> new_points)) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(<span style="color:#00f">predict</span>(bayes_fit, new_data <span style="color:#666">=</span> new_points, type <span style="color:#666">=</span> <span style="color:#ba2121">&#34;conf_int&#34;</span>))
<span style="color:#00f">ggplot</span>(bayes_plot_data, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> food_regime)) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(<span style="color:#00f">aes</span>(y <span style="color:#666">=</span> .pred)) <span style="color:#666">+</span>
<span style="color:#00f">geom_errorbar</span>(<span style="color:#00f">aes</span>(ymin <span style="color:#666">=</span> .pred_lower, ymax <span style="color:#666">=</span> .pred_upper), width <span style="color:#666">=</span> <span style="color:#666">.2</span>) <span style="color:#666">+</span>
<span style="color:#00f">labs</span>(y <span style="color:#666">=</span> <span style="color:#ba2121">&#34;urchin size&#34;</span>) <span style="color:#666">+</span>
<span style="color:#00f">ggtitle</span>(<span style="color:#ba2121">&#34;Bayesian model with t(1) prior distribution&#34;</span>)
</code></pre></div><p><img src="figs/stan-pred-1.svg" width="672" /></p>
<p>This isn&rsquo;t very different from the non-Bayesian results (except in interpretation).</p>
<div class="note">The <a href="https://parsnip.tidymodels.org/">parsnip</a> package can work with many model types, engines, and arguments. Check out <a href="/find/parsnip/">tidymodels.org/find/parsnip</a> to see what is available.</div>
<h2 id="why">Why does it work that way?</h2>
<p>The extra step of defining the model using a function like <code>linear_reg()</code> might seem superfluous since a call to <code>lm()</code> is much more succinct. However, the problem with standard modeling functions is that they don&rsquo;t separate what you want to do from the execution. For example, the process of executing a formula has to happen repeatedly across model calls even when the formula does not change; we can&rsquo;t recycle those computations.</p>
<p>Also, using the tidymodels framework, we can do some interesting things by incrementally creating a model (instead of using a single function call).
<a href="/start/tuning/">Model tuning</a> with tidymodels uses the specification of the model to declare what parts of the model should be tuned. That would be very difficult to do if <code>linear_reg()</code> immediately fit the model.</p>
<p>If you are familiar with the tidyverse, you may have noticed that our modeling code uses the magrittr pipe (<code>%&gt;%</code>). With dplyr and other tidyverse packages, the pipe works well because all of the functions take the <em>data</em> as the first argument. For example:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">urchins <span style="color:#666">%&gt;%</span>
<span style="color:#00f">group_by</span>(food_regime) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">summarize</span>(med_vol <span style="color:#666">=</span> <span style="color:#00f">median</span>(initial_volume))
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 3 x 2</span>
<span style="color:#408080;font-style:italic">#&gt; food_regime med_vol</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;fct&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 Initial 20.5</span>
<span style="color:#408080;font-style:italic">#&gt; 2 Low 19.2</span>
<span style="color:#408080;font-style:italic">#&gt; 3 High 15</span>
</code></pre></div><p>whereas the modeling code uses the pipe to pass around the <em>model object</em>:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">bayes_mod <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(width <span style="color:#666">~</span> initial_volume <span style="color:#666">*</span> food_regime, data <span style="color:#666">=</span> urchins)
</code></pre></div><p>This may seem jarring if you have used dplyr a lot, but it is extremely similar to how ggplot2 operates:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">ggplot</span>(urchins,
<span style="color:#00f">aes</span>(initial_volume, width)) <span style="color:#666">+</span> <span style="color:#408080;font-style:italic"># returns a ggplot object </span>
<span style="color:#00f">geom_jitter</span>() <span style="color:#666">+</span> <span style="color:#408080;font-style:italic"># same</span>
<span style="color:#00f">geom_smooth</span>(method <span style="color:#666">=</span> lm, se <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">FALSE</span>) <span style="color:#666">+</span> <span style="color:#408080;font-style:italic"># same </span>
<span style="color:#00f">labs</span>(x <span style="color:#666">=</span> <span style="color:#ba2121">&#34;Volume&#34;</span>, y <span style="color:#666">=</span> <span style="color:#ba2121">&#34;Width&#34;</span>) <span style="color:#408080;font-style:italic"># etc</span>
</code></pre></div><h2 id="session-info">Session information</h2>
<pre><code>#&gt; ─ Session info ───────────────────────────────────────────────────────────────
#&gt; setting value
#&gt; version R version 4.0.0 (2020-04-24)
#&gt; os macOS Mojave 10.14.6
#&gt; system x86_64, darwin17.0
#&gt; ui X11
#&gt; language (EN)
#&gt; collate en_US.UTF-8
#&gt; ctype en_US.UTF-8
#&gt; tz America/New_York
#&gt; date 2020-05-19
#&gt;
#&gt; ─ Packages ───────────────────────────────────────────────────────────────────
#&gt; package * version date lib source
#&gt; broom * 0.5.6 2020-04-20 [1] CRAN (R 4.0.0)
#&gt; dials * 0.0.6 2020-04-03 [1] CRAN (R 4.0.0)
#&gt; dplyr * 0.8.5 2020-03-07 [1] CRAN (R 4.0.0)
#&gt; ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 4.0.0)
#&gt; infer * 0.5.1 2019-11-19 [1] CRAN (R 4.0.0)
#&gt; parsnip * 0.1.1 2020-05-06 [1] CRAN (R 4.0.0)
#&gt; purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.0)
#&gt; readr * 1.3.1 2018-12-21 [1] CRAN (R 4.0.0)
#&gt; recipes * 0.1.12 2020-05-01 [1] CRAN (R 4.0.0)
#&gt; rlang 0.4.6 2020-05-02 [1] CRAN (R 4.0.0)
#&gt; rsample * 0.0.6 2020-03-31 [1] CRAN (R 4.0.0)
#&gt; rstanarm * 2.19.3 2020-02-11 [1] CRAN (R 4.0.0)
#&gt; tibble * 3.0.1 2020-04-20 [1] CRAN (R 4.0.0)
#&gt; tidymodels * 0.1.0 2020-02-16 [1] CRAN (R 4.0.0)
#&gt; tune * 0.1.0 2020-04-02 [1] CRAN (R 4.0.0)
#&gt; workflows * 0.1.1 2020-03-17 [1] CRAN (R 4.0.0)
#&gt; yardstick * 0.0.6 2020-03-17 [1] CRAN (R 4.0.0)
#&gt;
#&gt; [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
</code></pre></description>
</item>
<item>
<title>Regression models two ways</title>
<link>/learn/models/parsnip-ranger-glmnet/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/learn/models/parsnip-ranger-glmnet/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>To use the code in this article, you will need to install the following packages: AmesHousing, glmnet, randomForest, ranger, and tidymodels.</p>
<p>We can create regression models with the tidymodels package
<a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip</a> to predict continuous or numeric quantities. Here, let&rsquo;s first fit a random forest model, which does <em>not</em> require all numeric input (see discussion
<a href="https://bookdown.org/max/FES/categorical-trees.html" target="_blank" rel="noopener">here</a>) and discuss how to use <code>fit()</code> and <code>fit_xy()</code>, as well as <em>data descriptors</em>.</p>
<p>Second, let&rsquo;s fit a regularized linear regression model to demonstrate how to move between different types of models using parsnip.</p>
<h2 id="the-ames-housing-data">The Ames housing data</h2>
<p>We&rsquo;ll use the Ames housing data set to demonstrate how to create regression models using parsnip. First, set up the data set and create a simple training/test set split:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(AmesHousing)
ames <span style="color:#666">&lt;-</span> <span style="color:#00f">make_ames</span>()
<span style="color:#00f">library</span>(tidymodels)
<span style="color:#00f">set.seed</span>(<span style="color:#666">4595</span>)
data_split <span style="color:#666">&lt;-</span> <span style="color:#00f">initial_split</span>(ames, strata <span style="color:#666">=</span> <span style="color:#ba2121">&#34;Sale_Price&#34;</span>, p <span style="color:#666">=</span> <span style="color:#666">0.75</span>)
ames_train <span style="color:#666">&lt;-</span> <span style="color:#00f">training</span>(data_split)
ames_test <span style="color:#666">&lt;-</span> <span style="color:#00f">testing</span>(data_split)
</code></pre></div><p>The use of the test set here is <em>only for illustration</em>; normally in a data analysis these data would be saved to the very end after many models have been evaluated.</p>
<h2 id="random-forest">Random forest</h2>
<p>We&rsquo;ll start by fitting a random forest model to a small set of parameters. Let&rsquo;s create a model with the predictors <code>Longitude</code>, <code>Latitude</code>, <code>Lot_Area</code>, <code>Neighborhood</code>, and <code>Year_Sold</code>. A simple random forest model can be specified via:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">rf_defaults <span style="color:#666">&lt;-</span> <span style="color:#00f">rand_forest</span>(mode <span style="color:#666">=</span> <span style="color:#ba2121">&#34;regression&#34;</span>)
rf_defaults
<span style="color:#408080;font-style:italic">#&gt; Random Forest Model Specification (regression)</span>
</code></pre></div><p>The model will be fit with the ranger package by default. Since we didn&rsquo;t add any extra arguments to <code>fit()</code>, <em>many</em> of the arguments will be set to their defaults from the function <code>ranger::ranger()</code>. The help pages for the model function describe the default parameters, and you can also use the <code>translate()</code> function to check out such details.</p>
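<p>For example, a quick sketch of checking those details with <code>translate()</code> (output omitted here, since the exact template printed depends on the installed parsnip version):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">rf_defaults <span style="color:#666">%&gt;%</span>
  <span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;ranger&#34;</span>) <span style="color:#666">%&gt;%</span>
  <span style="color:#408080;font-style:italic"># show how the specification maps to the ranger::ranger() call</span>
  <span style="color:#00f">translate</span>()
</code></pre></div>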
<p>The parsnip package provides two different interfaces to fit a model:</p>
<ul>
<li>the formula interface (<code>fit()</code>), and</li>
<li>the non-formula interface (<code>fit_xy()</code>).</li>
</ul>
<p>Let&rsquo;s start with the non-formula interface:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">preds <span style="color:#666">&lt;-</span> <span style="color:#00f">c</span>(<span style="color:#ba2121">&#34;Longitude&#34;</span>, <span style="color:#ba2121">&#34;Latitude&#34;</span>, <span style="color:#ba2121">&#34;Lot_Area&#34;</span>, <span style="color:#ba2121">&#34;Neighborhood&#34;</span>, <span style="color:#ba2121">&#34;Year_Sold&#34;</span>)
rf_xy_fit <span style="color:#666">&lt;-</span>
rf_defaults <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;ranger&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit_xy</span>(
x <span style="color:#666">=</span> ames_train[, preds],
y <span style="color:#666">=</span> <span style="color:#00f">log10</span>(ames_train<span style="color:#666">$</span>Sale_Price)
)
rf_xy_fit
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 952ms </span>
<span style="color:#408080;font-style:italic">#&gt; Ranger result</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call:</span>
<span style="color:#408080;font-style:italic">#&gt; ranger::ranger(formula = formula, data = data, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Type: Regression </span>
<span style="color:#408080;font-style:italic">#&gt; Number of trees: 500 </span>
<span style="color:#408080;font-style:italic">#&gt; Sample size: 2199 </span>
<span style="color:#408080;font-style:italic">#&gt; Number of independent variables: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Mtry: 2 </span>
<span style="color:#408080;font-style:italic">#&gt; Target node size: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Variable importance mode: none </span>
<span style="color:#408080;font-style:italic">#&gt; Splitrule: variance </span>
<span style="color:#408080;font-style:italic">#&gt; OOB prediction error (MSE): 0.00844 </span>
<span style="color:#408080;font-style:italic">#&gt; R squared (OOB): 0.736</span>
</code></pre></div><p>The non-formula interface doesn&rsquo;t do anything to the predictors before passing them to the underlying model function. This particular model does <em>not</em> require indicator variables (sometimes called &ldquo;dummy variables&rdquo;) to be created prior to fitting the model. Note that the output shows &ldquo;Number of independent variables: 5&rdquo;.</p>
<p>For regression models, we can use the basic <code>predict()</code> method, which returns a tibble with a column named <code>.pred</code>:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">test_results <span style="color:#666">&lt;-</span>
ames_test <span style="color:#666">%&gt;%</span>
<span style="color:#00f">select</span>(Sale_Price) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">mutate</span>(Sale_Price <span style="color:#666">=</span> <span style="color:#00f">log10</span>(Sale_Price)) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(
<span style="color:#00f">predict</span>(rf_xy_fit, new_data <span style="color:#666">=</span> ames_test[, preds])
)
test_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">slice</span>(<span style="color:#666">1</span><span style="color:#666">:</span><span style="color:#666">5</span>)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 5 x 2</span>
<span style="color:#408080;font-style:italic">#&gt; Sale_Price .pred</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 5.33 5.22</span>
<span style="color:#408080;font-style:italic">#&gt; 2 5.02 5.21</span>
<span style="color:#408080;font-style:italic">#&gt; 3 5.27 5.25</span>
<span style="color:#408080;font-style:italic">#&gt; 4 5.60 5.51</span>
<span style="color:#408080;font-style:italic">#&gt; 5 5.28 5.24</span>
<span style="color:#408080;font-style:italic"># summarize performance</span>
test_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">metrics</span>(truth <span style="color:#666">=</span> Sale_Price, estimate <span style="color:#666">=</span> .pred)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 3 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; .metric .estimator .estimate</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 rmse standard 0.0914</span>
<span style="color:#408080;font-style:italic">#&gt; 2 rsq standard 0.717 </span>
<span style="color:#408080;font-style:italic">#&gt; 3 mae standard 0.0662</span>
</code></pre></div><p>Note that:</p>
<ul>
<li>If the model required indicator variables, we would have to create them manually prior to using <code>fit()</code> (perhaps using the recipes package).</li>
<li>We had to manually log the outcome prior to modeling.</li>
</ul>
<p>Now, for illustration, let&rsquo;s use the formula method using some new parameter values:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">rand_forest</span>(mode <span style="color:#666">=</span> <span style="color:#ba2121">&#34;regression&#34;</span>, mtry <span style="color:#666">=</span> <span style="color:#666">3</span>, trees <span style="color:#666">=</span> <span style="color:#666">1000</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;ranger&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(
<span style="color:#00f">log10</span>(Sale_Price) <span style="color:#666">~</span> Longitude <span style="color:#666">+</span> Latitude <span style="color:#666">+</span> Lot_Area <span style="color:#666">+</span> Neighborhood <span style="color:#666">+</span> Year_Sold,
data <span style="color:#666">=</span> ames_train
)
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 2.6s </span>
<span style="color:#408080;font-style:italic">#&gt; Ranger result</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call:</span>
<span style="color:#408080;font-style:italic">#&gt; ranger::ranger(formula = formula, data = data, mtry = ~3, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Type: Regression </span>
<span style="color:#408080;font-style:italic">#&gt; Number of trees: 1000 </span>
<span style="color:#408080;font-style:italic">#&gt; Sample size: 2199 </span>
<span style="color:#408080;font-style:italic">#&gt; Number of independent variables: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Mtry: 3 </span>
<span style="color:#408080;font-style:italic">#&gt; Target node size: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Variable importance mode: none </span>
<span style="color:#408080;font-style:italic">#&gt; Splitrule: variance </span>
<span style="color:#408080;font-style:italic">#&gt; OOB prediction error (MSE): 0.00848 </span>
<span style="color:#408080;font-style:italic">#&gt; R squared (OOB): 0.735</span>
</code></pre></div><p>Suppose that we would like to use the randomForest package instead of ranger. To do so, the only part of the syntax that needs to change is the <code>set_engine()</code> argument:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">rand_forest</span>(mode <span style="color:#666">=</span> <span style="color:#ba2121">&#34;regression&#34;</span>, mtry <span style="color:#666">=</span> <span style="color:#666">3</span>, trees <span style="color:#666">=</span> <span style="color:#666">1000</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;randomForest&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(
<span style="color:#00f">log10</span>(Sale_Price) <span style="color:#666">~</span> Longitude <span style="color:#666">+</span> Latitude <span style="color:#666">+</span> Lot_Area <span style="color:#666">+</span> Neighborhood <span style="color:#666">+</span> Year_Sold,
data <span style="color:#666">=</span> ames_train
)
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 2.1s </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call:</span>
<span style="color:#408080;font-style:italic">#&gt; randomForest(x = as.data.frame(x), y = y, ntree = ~1000, mtry = ~3) </span>
<span style="color:#408080;font-style:italic">#&gt; Type of random forest: regression</span>
<span style="color:#408080;font-style:italic">#&gt; Number of trees: 1000</span>
<span style="color:#408080;font-style:italic">#&gt; No. of variables tried at each split: 3</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Mean of squared residuals: 0.013</span>
<span style="color:#408080;font-style:italic">#&gt; % Var explained: 59.4</span>
</code></pre></div><p>Look at the formula code that was printed out; one function uses the argument name <code>ntree</code> and the other uses <code>num.trees</code>. The parsnip models don&rsquo;t require you to know the specific names of the main arguments.</p>
<p>Now suppose that we want to modify the value of <code>mtry</code> based on the number of predictors in the data. Usually, a good default value is <code>floor(sqrt(num_predictors))</code>, but a pure bagging model requires an <code>mtry</code> value equal to the total number of predictors. You may not know how many predictors will be present when the model is fit (perhaps due to the generation of indicator variables or a variable filter), so this value can be difficult to fix ahead of time when you write your code.</p>
<p>When the model is being fit by parsnip,
<a href="https://tidymodels.github.io/parsnip/reference/descriptors.html" target="_blank" rel="noopener"><em>data descriptors</em></a> are made available. These attempt to let you know what you will have available when the model is fit. When a model object is created (say using <code>rand_forest()</code>), the values of the arguments that you give it are <em>immediately evaluated</em> unless you delay them. To delay the evaluation of any argument, you can use <code>rlang::expr()</code> to make an expression.</p>
<p>Two relevant data descriptors for our example model are:</p>
<ul>
<li><code>.preds()</code>: the number of predictor <em>variables</em> in the data set that are associated with the predictors <strong>prior to dummy variable creation</strong>.</li>
<li><code>.cols()</code>: the number of predictor <em>columns</em> after dummy variables (or other encodings) are created.</li>
</ul>
<p>Since ranger won&rsquo;t create indicator values, <code>.preds()</code> would be appropriate for <code>mtry</code> for a bagging model.</p>
<p>For example, let&rsquo;s use an expression with the <code>.preds()</code> descriptor to fit a bagging model:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">rand_forest</span>(mode <span style="color:#666">=</span> <span style="color:#ba2121">&#34;regression&#34;</span>, mtry <span style="color:#666">=</span> <span style="color:#00f">.preds</span>(), trees <span style="color:#666">=</span> <span style="color:#666">1000</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;ranger&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(
<span style="color:#00f">log10</span>(Sale_Price) <span style="color:#666">~</span> Longitude <span style="color:#666">+</span> Latitude <span style="color:#666">+</span> Lot_Area <span style="color:#666">+</span> Neighborhood <span style="color:#666">+</span> Year_Sold,
data <span style="color:#666">=</span> ames_train
)
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 3.6s </span>
<span style="color:#408080;font-style:italic">#&gt; Ranger result</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call:</span>
<span style="color:#408080;font-style:italic">#&gt; ranger::ranger(formula = formula, data = data, mtry = ~.preds(), num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Type: Regression </span>
<span style="color:#408080;font-style:italic">#&gt; Number of trees: 1000 </span>
<span style="color:#408080;font-style:italic">#&gt; Sample size: 2199 </span>
<span style="color:#408080;font-style:italic">#&gt; Number of independent variables: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Mtry: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Target node size: 5 </span>
<span style="color:#408080;font-style:italic">#&gt; Variable importance mode: none </span>
<span style="color:#408080;font-style:italic">#&gt; Splitrule: variance </span>
<span style="color:#408080;font-style:italic">#&gt; OOB prediction error (MSE): 0.00869 </span>
<span style="color:#408080;font-style:italic">#&gt; R squared (OOB): 0.728</span>
</code></pre></div><h2 id="regularized-regression">Regularized regression</h2>
<p>A linear model might work for this data set as well. We can use the <code>linear_reg()</code> parsnip model. There are two engines that can perform regularization/penalization, the glmnet and sparklyr packages. Let&rsquo;s use the former here. The glmnet package only implements a non-formula method, but parsnip will allow either one to be used.</p>
<p>When regularization is used, the predictors should first be centered and scaled before being passed to the model. The formula method won&rsquo;t do that automatically so we will need to do this ourselves. We&rsquo;ll use the
<a href="https://tidymodels.github.io/recipes/" target="_blank" rel="noopener">recipes</a> package for these steps.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">norm_recipe <span style="color:#666">&lt;-</span>
<span style="color:#00f">recipe</span>(
Sale_Price <span style="color:#666">~</span> Longitude <span style="color:#666">+</span> Latitude <span style="color:#666">+</span> Lot_Area <span style="color:#666">+</span> Neighborhood <span style="color:#666">+</span> Year_Sold,
data <span style="color:#666">=</span> ames_train
) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_other</span>(Neighborhood) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_dummy</span>(<span style="color:#00f">all_nominal</span>()) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_center</span>(<span style="color:#00f">all_predictors</span>()) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_scale</span>(<span style="color:#00f">all_predictors</span>()) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_log</span>(Sale_Price, base <span style="color:#666">=</span> <span style="color:#666">10</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#408080;font-style:italic"># estimate the means and standard deviations</span>
<span style="color:#00f">prep</span>(training <span style="color:#666">=</span> ames_train, retain <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">TRUE</span>)
<span style="color:#408080;font-style:italic"># Now let&#39;s fit the model using the processed version of the data</span>
glmn_fit <span style="color:#666">&lt;-</span>
<span style="color:#00f">linear_reg</span>(penalty <span style="color:#666">=</span> <span style="color:#666">0.001</span>, mixture <span style="color:#666">=</span> <span style="color:#666">0.5</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;glmnet&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(Sale_Price <span style="color:#666">~</span> ., data <span style="color:#666">=</span> <span style="color:#00f">juice</span>(norm_recipe))
glmn_fit
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 13ms </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Call: glmnet::glmnet(x = as.matrix(x), y = y, family = &#34;gaussian&#34;, alpha = ~0.5) </span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Df %Dev Lambda</span>
<span style="color:#408080;font-style:italic">#&gt; 1 0 0.000 0.1370</span>
<span style="color:#408080;font-style:italic">#&gt; 2 1 0.019 0.1250</span>
<span style="color:#408080;font-style:italic">#&gt; 3 1 0.036 0.1140</span>
<span style="color:#408080;font-style:italic">#&gt; 4 1 0.050 0.1040</span>
<span style="color:#408080;font-style:italic">#&gt; 5 2 0.068 0.0946</span>
<span style="color:#408080;font-style:italic">#&gt; 6 4 0.093 0.0862</span>
<span style="color:#408080;font-style:italic">#&gt; 7 5 0.125 0.0785</span>
<span style="color:#408080;font-style:italic">#&gt; 8 5 0.153 0.0716</span>
<span style="color:#408080;font-style:italic">#&gt; 9 7 0.184 0.0652</span>
<span style="color:#408080;font-style:italic">#&gt; 10 7 0.214 0.0594</span>
<span style="color:#408080;font-style:italic">#&gt; 11 7 0.240 0.0541</span>
<span style="color:#408080;font-style:italic">#&gt; 12 8 0.262 0.0493</span>
<span style="color:#408080;font-style:italic">#&gt; 13 8 0.286 0.0449</span>
<span style="color:#408080;font-style:italic">#&gt; 14 8 0.306 0.0409</span>
<span style="color:#408080;font-style:italic">#&gt; 15 8 0.323 0.0373</span>
<span style="color:#408080;font-style:italic">#&gt; 16 8 0.338 0.0340</span>
<span style="color:#408080;font-style:italic">#&gt; 17 8 0.350 0.0310</span>
<span style="color:#408080;font-style:italic">#&gt; 18 8 0.361 0.0282</span>
<span style="color:#408080;font-style:italic">#&gt; 19 9 0.370 0.0257</span>
<span style="color:#408080;font-style:italic">#&gt; 20 9 0.379 0.0234</span>
<span style="color:#408080;font-style:italic">#&gt; 21 9 0.386 0.0213</span>
<span style="color:#408080;font-style:italic">#&gt; 22 9 0.392 0.0195</span>
<span style="color:#408080;font-style:italic">#&gt; 23 9 0.397 0.0177</span>
<span style="color:#408080;font-style:italic">#&gt; 24 9 0.401 0.0161</span>
<span style="color:#408080;font-style:italic">#&gt; 25 9 0.405 0.0147</span>
<span style="color:#408080;font-style:italic">#&gt; 26 9 0.408 0.0134</span>
<span style="color:#408080;font-style:italic">#&gt; 27 10 0.410 0.0122</span>
<span style="color:#408080;font-style:italic">#&gt; 28 11 0.413 0.0111</span>
<span style="color:#408080;font-style:italic">#&gt; 29 11 0.415 0.0101</span>
<span style="color:#408080;font-style:italic">#&gt; 30 11 0.417 0.0092</span>
<span style="color:#408080;font-style:italic">#&gt; 31 12 0.418 0.0084</span>
<span style="color:#408080;font-style:italic">#&gt; 32 12 0.420 0.0077</span>
<span style="color:#408080;font-style:italic">#&gt; 33 12 0.421 0.0070</span>
<span style="color:#408080;font-style:italic">#&gt; 34 12 0.422 0.0064</span>
<span style="color:#408080;font-style:italic">#&gt; 35 12 0.423 0.0058</span>
<span style="color:#408080;font-style:italic">#&gt; 36 12 0.423 0.0053</span>
<span style="color:#408080;font-style:italic">#&gt; 37 12 0.424 0.0048</span>
<span style="color:#408080;font-style:italic">#&gt; 38 12 0.425 0.0044</span>
<span style="color:#408080;font-style:italic">#&gt; 39 12 0.425 0.0040</span>
<span style="color:#408080;font-style:italic">#&gt; 40 12 0.425 0.0036</span>
<span style="color:#408080;font-style:italic">#&gt; 41 12 0.426 0.0033</span>
<span style="color:#408080;font-style:italic">#&gt; 42 12 0.426 0.0030</span>
<span style="color:#408080;font-style:italic">#&gt; 43 12 0.426 0.0028</span>
<span style="color:#408080;font-style:italic">#&gt; 44 12 0.426 0.0025</span>
<span style="color:#408080;font-style:italic">#&gt; 45 12 0.426 0.0023</span>
<span style="color:#408080;font-style:italic">#&gt; 46 12 0.426 0.0021</span>
<span style="color:#408080;font-style:italic">#&gt; 47 12 0.427 0.0019</span>
<span style="color:#408080;font-style:italic">#&gt; 48 12 0.427 0.0017</span>
<span style="color:#408080;font-style:italic">#&gt; 49 12 0.427 0.0016</span>
<span style="color:#408080;font-style:italic">#&gt; 50 12 0.427 0.0014</span>
<span style="color:#408080;font-style:italic">#&gt; 51 12 0.427 0.0013</span>
<span style="color:#408080;font-style:italic">#&gt; 52 12 0.427 0.0012</span>
<span style="color:#408080;font-style:italic">#&gt; 53 12 0.427 0.0011</span>
<span style="color:#408080;font-style:italic">#&gt; 54 12 0.427 0.0010</span>
<span style="color:#408080;font-style:italic">#&gt; 55 12 0.427 0.0009</span>
<span style="color:#408080;font-style:italic">#&gt; 56 12 0.427 0.0008</span>
<span style="color:#408080;font-style:italic">#&gt; 57 12 0.427 0.0008</span>
<span style="color:#408080;font-style:italic">#&gt; 58 12 0.427 0.0007</span>
<span style="color:#408080;font-style:italic">#&gt; 59 12 0.427 0.0006</span>
<span style="color:#408080;font-style:italic">#&gt; 60 12 0.427 0.0006</span>
<span style="color:#408080;font-style:italic">#&gt; 61 12 0.427 0.0005</span>
<span style="color:#408080;font-style:italic">#&gt; 62 12 0.427 0.0005</span>
<span style="color:#408080;font-style:italic">#&gt; 63 12 0.427 0.0004</span>
<span style="color:#408080;font-style:italic">#&gt; 64 12 0.427 0.0004</span>
<span style="color:#408080;font-style:italic">#&gt; 65 12 0.427 0.0004</span>
</code></pre></div><p>If <code>penalty</code> were not specified, all of the <code>lambda</code> values would be computed.</p>
<p>To get the predictions for this specific value of <code>lambda</code> (aka <code>penalty</code>):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#408080;font-style:italic"># First, get the processed version of the test set predictors:</span>
test_normalized <span style="color:#666">&lt;-</span> <span style="color:#00f">bake</span>(norm_recipe, new_data <span style="color:#666">=</span> ames_test, <span style="color:#00f">all_predictors</span>())
test_results <span style="color:#666">&lt;-</span>
test_results <span style="color:#666">%&gt;%</span>
<span style="color:#00f">rename</span>(`random forest` <span style="color:#666">=</span> .pred) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(
<span style="color:#00f">predict</span>(glmn_fit, new_data <span style="color:#666">=</span> test_normalized) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">rename</span>(glmnet <span style="color:#666">=</span> .pred)
)
test_results
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 731 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; Sale_Price `random forest` glmnet</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 5.33 5.22 5.27</span>
<span style="color:#408080;font-style:italic">#&gt; 2 5.02 5.21 5.17</span>
<span style="color:#408080;font-style:italic">#&gt; 3 5.27 5.25 5.23</span>
<span style="color:#408080;font-style:italic">#&gt; 4 5.60 5.51 5.25</span>
<span style="color:#408080;font-style:italic">#&gt; 5 5.28 5.24 5.25</span>
<span style="color:#408080;font-style:italic">#&gt; 6 5.17 5.19 5.19</span>
<span style="color:#408080;font-style:italic">#&gt; 7 5.02 4.97 5.19</span>
<span style="color:#408080;font-style:italic">#&gt; 8 5.46 5.50 5.49</span>
<span style="color:#408080;font-style:italic">#&gt; 9 5.44 5.46 5.48</span>
<span style="color:#408080;font-style:italic">#&gt; 10 5.33 5.50 5.47</span>
<span style="color:#408080;font-style:italic">#&gt; # … with 721 more rows</span>
test_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">metrics</span>(truth <span style="color:#666">=</span> Sale_Price, estimate <span style="color:#666">=</span> glmnet)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 3 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; .metric .estimator .estimate</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 rmse standard 0.132 </span>
<span style="color:#408080;font-style:italic">#&gt; 2 rsq standard 0.410 </span>
<span style="color:#408080;font-style:italic">#&gt; 3 mae standard 0.0956</span>
test_results <span style="color:#666">%&gt;%</span>
<span style="color:#00f">gather</span>(model, prediction, <span style="color:#666">-</span>Sale_Price) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">ggplot</span>(<span style="color:#00f">aes</span>(x <span style="color:#666">=</span> prediction, y <span style="color:#666">=</span> Sale_Price)) <span style="color:#666">+</span>
<span style="color:#00f">geom_abline</span>(col <span style="color:#666">=</span> <span style="color:#ba2121">&#34;green&#34;</span>, lty <span style="color:#666">=</span> <span style="color:#666">2</span>) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(alpha <span style="color:#666">=</span> <span style="color:#666">.4</span>) <span style="color:#666">+</span>
<span style="color:#00f">facet_wrap</span>(<span style="color:#666">~</span>model) <span style="color:#666">+</span>
<span style="color:#00f">coord_fixed</span>()
</code></pre></div><p><img src="figs/glmn-pred-1.svg" width="672" /></p>
<p>This final plot compares the performance of the random forest and regularized regression models on the test set.</p>
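<p>If predictions at several <code>penalty</code> values are needed from the same glmnet fit, parsnip&rsquo;s <code>multi_predict()</code> can produce them without refitting. A quick sketch (the penalty values here are arbitrary, chosen only for illustration):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"># Returns one row per observation; the .pred column is a list of
# tibbles, each with one row per requested penalty value:
multi_predict(glmn_fit, new_data = test_normalized,
              penalty = c(0.001, 0.01, 0.1))
</code></pre></div>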
<h2 id="session-information">Session information</h2>
<pre><code>#&gt; ─ Session info ───────────────────────────────────────────────────────────────
#&gt; setting value
#&gt; version R version 3.6.2 (2019-12-12)
#&gt; os macOS Mojave 10.14.6
#&gt; system x86_64, darwin15.6.0
#&gt; ui X11
#&gt; language (EN)
#&gt; collate en_US.UTF-8
#&gt; ctype en_US.UTF-8
#&gt; tz America/Denver
#&gt; date 2020-04-17
#&gt;
#&gt; ─ Packages ───────────────────────────────────────────────────────────────────
#&gt; package * version date lib source
#&gt; AmesHousing * 0.0.3 2017-12-17 [1] CRAN (R 3.6.0)
#&gt; broom * 0.5.5 2020-02-29 [1] CRAN (R 3.6.0)
#&gt; dials * 0.0.6 2020-04-03 [1] CRAN (R 3.6.2)
#&gt; dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
#&gt; ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
#&gt; glmnet * 3.0-2 2019-12-11 [1] CRAN (R 3.6.0)
#&gt; infer * 0.5.1 2019-11-19 [1] CRAN (R 3.6.0)
#&gt; parsnip * 0.1.0 2020-04-09 [1] CRAN (R 3.6.2)
#&gt; purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
#&gt; randomForest * 4.6-14 2018-03-25 [1] CRAN (R 3.6.0)
#&gt; ranger * 0.12.1 2020-01-10 [1] CRAN (R 3.6.0)
#&gt; recipes * 0.1.10 2020-03-18 [1] CRAN (R 3.6.0)
#&gt; rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
#&gt; rsample * 0.0.6 2020-03-31 [1] CRAN (R 3.6.2)
#&gt; tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.2)
#&gt; tidymodels * 0.1.0 2020-02-16 [1] CRAN (R 3.6.0)
#&gt; tune * 0.1.0 2020-04-02 [1] CRAN (R 3.6.2)
#&gt; workflows * 0.1.1 2020-03-17 [1] CRAN (R 3.6.0)
#&gt; yardstick * 0.0.6 2020-03-17 [1] CRAN (R 3.6.0)
#&gt;
#&gt; [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
</code></pre></description>
</item>
<item>
<title>Classification models using a neural network</title>
<link>/learn/models/parsnip-nnet/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/learn/models/parsnip-nnet/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>To use the code in this article, you will need to install the following packages: keras and tidymodels. You will also need the python keras library installed (see <code>?keras::install_keras()</code>).</p>
<p>We can create classification models with the tidymodels package
<a href="https://tidymodels.github.io/parsnip/" target="_blank" rel="noopener">parsnip</a> to predict categorical quantities or class labels. Here, let&rsquo;s fit a single classification model using a neural network and evaluate using a validation set. While the
<a href="https://tidymodels.github.io/tune/" target="_blank" rel="noopener">tune</a> package has functionality to also do this, the parsnip package is the center of attention in this article so that we can better understand its usage.</p>
<h2 id="fitting-a-neural-network">Fitting a neural network</h2>
<p>Let&rsquo;s fit a model to a small, two predictor classification data set. The data are in the modeldata package (part of tidymodels) and have been split into training, validation, and test data sets. In this analysis, the test set is left untouched; this article tries to emulate a good data usage methodology where the test set would only be evaluated once at the end after a variety of models have been considered.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">data</span>(bivariate)
<span style="color:#00f">nrow</span>(bivariate_train)
<span style="color:#408080;font-style:italic">#&gt; [1] 1009</span>
<span style="color:#00f">nrow</span>(bivariate_val)
<span style="color:#408080;font-style:italic">#&gt; [1] 300</span>
</code></pre></div><p>A plot of the data shows two right-skewed predictors:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">ggplot</span>(bivariate_train, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> A, y <span style="color:#666">=</span> B, col <span style="color:#666">=</span> Class)) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(alpha <span style="color:#666">=</span> <span style="color:#666">.2</span>)
</code></pre></div><p><img src="figs/biv-plot-1.svg" width="576" /></p>
<p>Let&rsquo;s use a single hidden layer neural network to predict the outcome. To do this, we transform the predictor columns to be more symmetric (via the <code>step_BoxCox()</code> function) and on a common scale (using <code>step_normalize()</code>). We can use
<a href="https://tidymodels.github.io/recipes/" target="_blank" rel="noopener">recipes</a> to do so:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">biv_rec <span style="color:#666">&lt;-</span>
<span style="color:#00f">recipe</span>(Class <span style="color:#666">~</span> ., data <span style="color:#666">=</span> bivariate_train) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_BoxCox</span>(<span style="color:#00f">all_predictors</span>())<span style="color:#666">%&gt;%</span>
<span style="color:#00f">step_normalize</span>(<span style="color:#00f">all_predictors</span>()) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">prep</span>(training <span style="color:#666">=</span> bivariate_train, retain <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">TRUE</span>)
<span style="color:#408080;font-style:italic"># We will juice() to get the processed training set back</span>
<span style="color:#408080;font-style:italic"># For validation:</span>
val_normalized <span style="color:#666">&lt;-</span> <span style="color:#00f">bake</span>(biv_rec, new_data <span style="color:#666">=</span> bivariate_val, <span style="color:#00f">all_predictors</span>())
<span style="color:#408080;font-style:italic"># For testing when we arrive at a final model: </span>
test_normalized <span style="color:#666">&lt;-</span> <span style="color:#00f">bake</span>(biv_rec, new_data <span style="color:#666">=</span> bivariate_test, <span style="color:#00f">all_predictors</span>())
</code></pre></div><p>We can use the keras package to fit a model with 5 hidden units and a 10% dropout rate to regularize the model:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">set.seed</span>(<span style="color:#666">57974</span>)
nnet_fit <span style="color:#666">&lt;-</span>
<span style="color:#00f">mlp</span>(epochs <span style="color:#666">=</span> <span style="color:#666">100</span>, hidden_units <span style="color:#666">=</span> <span style="color:#666">5</span>, dropout <span style="color:#666">=</span> <span style="color:#666">0.1</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_mode</span>(<span style="color:#ba2121">&#34;classification&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#408080;font-style:italic"># Also set engine-specific `verbose` argument to prevent logging the results: </span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;keras&#34;</span>, verbose <span style="color:#666">=</span> <span style="color:#666">0</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(Class <span style="color:#666">~</span> ., data <span style="color:#666">=</span> <span style="color:#00f">juice</span>(biv_rec))
nnet_fit
<span style="color:#408080;font-style:italic">#&gt; parsnip model object</span>
<span style="color:#408080;font-style:italic">#&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; Fit time: 8.7s </span>
<span style="color:#408080;font-style:italic">#&gt; Model</span>
<span style="color:#408080;font-style:italic">#&gt; Model: &#34;sequential&#34;</span>
<span style="color:#408080;font-style:italic">#&gt; ________________________________________________________________________________</span>
<span style="color:#408080;font-style:italic">#&gt; Layer (type) Output Shape Param # </span>
<span style="color:#408080;font-style:italic">#&gt; ================================================================================</span>
<span style="color:#408080;font-style:italic">#&gt; dense (Dense) (None, 5) 15 </span>
<span style="color:#408080;font-style:italic">#&gt; ________________________________________________________________________________</span>
<span style="color:#408080;font-style:italic">#&gt; dense_1 (Dense) (None, 5) 30 </span>
<span style="color:#408080;font-style:italic">#&gt; ________________________________________________________________________________</span>
<span style="color:#408080;font-style:italic">#&gt; dropout (Dropout) (None, 5) 0 </span>
<span style="color:#408080;font-style:italic">#&gt; ________________________________________________________________________________</span>
<span style="color:#408080;font-style:italic">#&gt; dense_2 (Dense) (None, 2) 12 </span>
<span style="color:#408080;font-style:italic">#&gt; ================================================================================</span>
<span style="color:#408080;font-style:italic">#&gt; Total params: 57</span>
<span style="color:#408080;font-style:italic">#&gt; Trainable params: 57</span>
<span style="color:#408080;font-style:italic">#&gt; Non-trainable params: 0</span>
<span style="color:#408080;font-style:italic">#&gt; ________________________________________________________________________________</span>
</code></pre></div><h2 id="model-performance">Model performance</h2>
<p>In parsnip, the <code>predict()</code> function can be used to characterize performance on the validation set. Since parsnip always produces tibble outputs, these can just be column bound to the original data:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">val_results <span style="color:#666">&lt;-</span>
bivariate_val <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(
<span style="color:#00f">predict</span>(nnet_fit, new_data <span style="color:#666">=</span> val_normalized),
<span style="color:#00f">predict</span>(nnet_fit, new_data <span style="color:#666">=</span> val_normalized, type <span style="color:#666">=</span> <span style="color:#ba2121">&#34;prob&#34;</span>)
)
val_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">slice</span>(<span style="color:#666">1</span><span style="color:#666">:</span><span style="color:#666">5</span>)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 5 x 6</span>
<span style="color:#408080;font-style:italic">#&gt; A B Class .pred_class .pred_One .pred_Two</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;fct&gt; &lt;fct&gt; &lt;dbl&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 1061. 74.5 One Two 0.473 0.527 </span>
<span style="color:#408080;font-style:italic">#&gt; 2 1241. 83.4 One Two 0.484 0.516 </span>
<span style="color:#408080;font-style:italic">#&gt; 3 939. 71.9 One One 0.636 0.364 </span>
<span style="color:#408080;font-style:italic">#&gt; 4 813. 77.1 One One 0.925 0.0746</span>
<span style="color:#408080;font-style:italic">#&gt; 5 1706. 92.8 Two Two 0.355 0.645</span>
val_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">roc_auc</span>(truth <span style="color:#666">=</span> Class, .pred_One)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 1 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; .metric .estimator .estimate</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 roc_auc binary 0.815</span>
val_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">accuracy</span>(truth <span style="color:#666">=</span> Class, .pred_class)
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 1 x 3</span>
<span style="color:#408080;font-style:italic">#&gt; .metric .estimator .estimate</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 1 accuracy binary 0.737</span>
val_results <span style="color:#666">%&gt;%</span> <span style="color:#00f">conf_mat</span>(truth <span style="color:#666">=</span> Class, .pred_class)
<span style="color:#408080;font-style:italic">#&gt; Truth</span>
<span style="color:#408080;font-style:italic">#&gt; Prediction One Two</span>
<span style="color:#408080;font-style:italic">#&gt; One 150 27</span>
<span style="color:#408080;font-style:italic">#&gt; Two 52 71</span>
</code></pre></div><p>Let&rsquo;s also create a grid to get a visual sense of the class boundary for the validation set.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">a_rng <span style="color:#666">&lt;-</span> <span style="color:#00f">range</span>(bivariate_train<span style="color:#666">$</span>A)
b_rng <span style="color:#666">&lt;-</span> <span style="color:#00f">range</span>(bivariate_train<span style="color:#666">$</span>B)
x_grid <span style="color:#666">&lt;-</span>
<span style="color:#00f">expand.grid</span>(A <span style="color:#666">=</span> <span style="color:#00f">seq</span>(a_rng[1], a_rng[2], length.out <span style="color:#666">=</span> <span style="color:#666">100</span>),
B <span style="color:#666">=</span> <span style="color:#00f">seq</span>(b_rng[1], b_rng[2], length.out <span style="color:#666">=</span> <span style="color:#666">100</span>))
x_grid_trans <span style="color:#666">&lt;-</span> <span style="color:#00f">bake</span>(biv_rec, x_grid)
<span style="color:#408080;font-style:italic"># Make predictions using the transformed predictors but </span>
<span style="color:#408080;font-style:italic"># attach them to the predictors in the original units: </span>
x_grid <span style="color:#666">&lt;-</span>
x_grid <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(<span style="color:#00f">predict</span>(nnet_fit, x_grid_trans, type <span style="color:#666">=</span> <span style="color:#ba2121">&#34;prob&#34;</span>))
<span style="color:#00f">ggplot</span>(x_grid, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> A, y <span style="color:#666">=</span> B)) <span style="color:#666">+</span>
<span style="color:#00f">geom_contour</span>(<span style="color:#00f">aes</span>(z <span style="color:#666">=</span> .pred_One), breaks <span style="color:#666">=</span> <span style="color:#666">.5</span>, col <span style="color:#666">=</span> <span style="color:#ba2121">&#34;black&#34;</span>) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(data <span style="color:#666">=</span> bivariate_val, <span style="color:#00f">aes</span>(col <span style="color:#666">=</span> Class), alpha <span style="color:#666">=</span> <span style="color:#666">0.3</span>)
</code></pre></div><p><img src="figs/biv-boundary-1.svg" width="576" /></p>
<h2 id="session-information">Session information</h2>
<pre><code>#&gt; ─ Session info ───────────────────────────────────────────────────────────────
#&gt; setting value
#&gt; version R version 3.6.2 (2019-12-12)
#&gt; os macOS Mojave 10.14.6
#&gt; system x86_64, darwin15.6.0
#&gt; ui X11
#&gt; language (EN)
#&gt; collate en_US.UTF-8
#&gt; ctype en_US.UTF-8
#&gt; tz America/Denver
#&gt; date 2020-04-17
#&gt;
#&gt; ─ Packages ───────────────────────────────────────────────────────────────────
#&gt; package * version date lib source
#&gt; broom * 0.5.5 2020-02-29 [1] CRAN (R 3.6.0)
#&gt; dials * 0.0.6 2020-04-03 [1] CRAN (R 3.6.2)
#&gt; dplyr * 0.8.5 2020-03-07 [1] CRAN (R 3.6.0)
#&gt; ggplot2 * 3.3.0 2020-03-05 [1] CRAN (R 3.6.0)
#&gt; infer * 0.5.1 2019-11-19 [1] CRAN (R 3.6.0)
#&gt; keras 2.2.5.0 2019-10-08 [1] CRAN (R 3.6.0)
#&gt; parsnip * 0.1.0 2020-04-09 [1] CRAN (R 3.6.2)
#&gt; purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
#&gt; recipes * 0.1.10 2020-03-18 [1] CRAN (R 3.6.0)
#&gt; rlang 0.4.5 2020-03-01 [1] CRAN (R 3.6.0)
#&gt; rsample * 0.0.6 2020-03-31 [1] CRAN (R 3.6.2)
#&gt; tibble * 2.1.3 2019-06-06 [1] CRAN (R 3.6.2)
#&gt; tidymodels * 0.1.0 2020-02-16 [1] CRAN (R 3.6.0)
#&gt; tune * 0.1.0 2020-04-02 [1] CRAN (R 3.6.2)
#&gt; workflows * 0.1.1 2020-03-17 [1] CRAN (R 3.6.0)
#&gt; yardstick * 0.0.6 2020-03-17 [1] CRAN (R 3.6.0)
#&gt;
#&gt; [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
</code></pre></description>
</item>
<item>
<title>Nested resampling</title>
<link>/learn/work/nested-resampling/</link>
<pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
<guid>/learn/work/nested-resampling/</guid>
<description><h2 id="introduction">Introduction</h2>
<p>To use the code in this article, you will need to install the following packages: furrr, kernlab, mlbench, scales, and tidymodels.</p>
<p>In this article, we discuss an alternative method for evaluating and tuning models, called
<a href="https://scholar.google.com/scholar?hl=en&amp;as_sdt=0%2C7&amp;q=%22nested&#43;resampling%22&#43;inner&#43;outer&amp;btnG=" target="_blank" rel="noopener">nested resampling</a>. While it is more computationally taxing and challenging to implement than other resampling methods, it has the potential to produce better estimates of model performance.</p>
<h2 id="resampling-models">Resampling models</h2>
<p>A typical scheme for splitting the data when developing a predictive model is to create an initial split of the data into a training and test set. If resampling is used, it is executed on the training set. A series of binary splits is created. In rsample, we use the term <em>analysis set</em> for the data that are used to fit the model and the term <em>assessment set</em> for the set used to compute performance:</p>
<p><img src="figs/resampling.svg" width="70%" style="display: block; margin: auto;" /></p>
<p>A common method for tuning models is
<a href="/learn/work/tune-svm/">grid search</a> where a candidate set of tuning parameters is created. The full set of models for every combination of the tuning parameter grid and the resamples is fitted. Each time, the assessment data are used to measure performance and the average value is determined for each tuning parameter.</p>
<p>The potential problem is that once we pick the tuning parameter associated with the best performance, this performance value is usually quoted as the performance of the model. There is serious potential for <em>optimization bias</em> since we use the same data to tune the model and to assess performance. This would result in an optimistic estimate of performance.</p>
<p>Nested resampling uses an additional layer of resampling that separates the tuning activities from the process used to estimate the efficacy of the model. An <em>outer</em> resampling scheme is used and, for every split in the outer resample, another full set of resampling splits is created on the original analysis set. For example, if 10-fold cross-validation is used on the outside and 5-fold cross-validation on the inside, a total of 500 models will be fit. The parameter tuning will be conducted 10 times and the best parameters are determined from the average of the 5 assessment sets. This process occurs 10 times.</p>
<p>Once the tuning results are complete, a model is fit to each of the outer resampling splits using the best parameter associated with that resample. The average of the outer method&rsquo;s assessment sets is an unbiased estimate of the model&rsquo;s performance.</p>
<p>We will simulate some regression data to illustrate the methods. The mlbench package has a function <code>mlbench::mlbench.friedman1()</code> that can simulate a complex regression data structure from the
<a href="https://scholar.google.com/scholar?hl=en&amp;q=%22Multivariate&#43;adaptive&#43;regression&#43;splines%22&amp;btnG=&amp;as_sdt=1%2C7&amp;as_sdtp=" target="_blank" rel="noopener">original MARS publication</a>. A training set of 100 data points is generated, as well as a large set that will be used to characterize how well the resampling procedure performed.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(mlbench)
sim_data <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(n) {
tmp <span style="color:#666">&lt;-</span> <span style="color:#00f">mlbench.friedman1</span>(n, sd <span style="color:#666">=</span> <span style="color:#666">1</span>)
tmp <span style="color:#666">&lt;-</span> <span style="color:#00f">cbind</span>(tmp<span style="color:#666">$</span>x, tmp<span style="color:#666">$</span>y)
tmp <span style="color:#666">&lt;-</span> <span style="color:#00f">as.data.frame</span>(tmp)
<span style="color:#00f">names</span>(tmp)<span style="color:#00f">[ncol</span>(tmp)] <span style="color:#666">&lt;-</span> <span style="color:#ba2121">&#34;y&#34;</span>
tmp
}
<span style="color:#00f">set.seed</span>(<span style="color:#666">9815</span>)
train_dat <span style="color:#666">&lt;-</span> <span style="color:#00f">sim_data</span>(<span style="color:#666">100</span>)
large_dat <span style="color:#666">&lt;-</span> <span style="color:#00f">sim_data</span>(<span style="color:#666">10</span>^5)
</code></pre></div><h2 id="nested-resampling">Nested resampling</h2>
<p>To get started, the types of resampling methods need to be specified. This isn&rsquo;t a large data set, so 5 repeats of 10-fold cross-validation will be used as the <em>outer</em> resampling method for generating the estimate of overall performance. To tune the model, it would be good to have precise estimates for each of the values of the tuning parameter, so let&rsquo;s use 25 iterations of the bootstrap. This means that there will eventually be <code>5 * 10 * 25 = 1250</code> models that are fit to the data <em>per tuning parameter</em>. These models will be discarded once the performance of the model has been quantified.</p>
<p>To create the tibble with the resampling specifications:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(tidymodels)
results <span style="color:#666">&lt;-</span> <span style="color:#00f">nested_cv</span>(train_dat,
outside <span style="color:#666">=</span> <span style="color:#00f">vfold_cv</span>(repeats <span style="color:#666">=</span> <span style="color:#666">5</span>),
inside <span style="color:#666">=</span> <span style="color:#00f">bootstraps</span>(times <span style="color:#666">=</span> <span style="color:#666">25</span>))
results
<span style="color:#408080;font-style:italic">#&gt; [1] &#34;nested_cv&#34; &#34;vfold_cv&#34; &#34;rset&#34; &#34;tbl_df&#34; &#34;tbl&#34; </span>
<span style="color:#408080;font-style:italic">#&gt; [6] &#34;data.frame&#34;</span>
<span style="color:#408080;font-style:italic">#&gt; # Nested resampling:</span>
<span style="color:#408080;font-style:italic">#&gt; # outer: 10-fold cross-validation repeated 5 times</span>
<span style="color:#408080;font-style:italic">#&gt; # inner: Bootstrap sampling</span>
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 50 x 4</span>
<span style="color:#408080;font-style:italic">#&gt; splits id id2 inner_resamples </span>
<span style="color:#408080;font-style:italic">#&gt; &lt;named list&gt; &lt;chr&gt; &lt;chr&gt; &lt;named list&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; 1 &lt;split [90/10]&gt; Repeat1 Fold01 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 2 &lt;split [90/10]&gt; Repeat1 Fold02 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 3 &lt;split [90/10]&gt; Repeat1 Fold03 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 4 &lt;split [90/10]&gt; Repeat1 Fold04 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 5 &lt;split [90/10]&gt; Repeat1 Fold05 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 6 &lt;split [90/10]&gt; Repeat1 Fold06 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 7 &lt;split [90/10]&gt; Repeat1 Fold07 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 8 &lt;split [90/10]&gt; Repeat1 Fold08 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 9 &lt;split [90/10]&gt; Repeat1 Fold09 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; 10 &lt;split [90/10]&gt; Repeat1 Fold10 &lt;tibble [25 × 2]&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; # … with 40 more rows</span>
</code></pre></div><p>The splitting information for each resample is contained in the <code>split</code> objects. Focusing on the second fold of the first repeat:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">results<span style="color:#666">$</span>splits[[2]]
<span style="color:#408080;font-style:italic">#&gt; &lt;Training/Validation/Total&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;90/10/100&gt;</span>
</code></pre></div><p><code>&lt;90/10/100&gt;</code> indicates the number of observations in the analysis set, assessment set, and the original data.</p>
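<p>The analysis and assessment portions of any split can be extracted as data frames with rsample&rsquo;s <code>analysis()</code> and <code>assessment()</code> functions. A quick sketch using the split printed above:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"># The 90 rows used to fit models for this outer resample:
outer_analysis &lt;- analysis(results$splits[[2]])
# The 10 held-out rows used to measure performance:
outer_assessment &lt;- assessment(results$splits[[2]])
</code></pre></div>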
<p>Each element of <code>inner_resamples</code> has its own tibble with the bootstrapping splits.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">results<span style="color:#666">$</span>inner_resamples[[5]]
<span style="color:#408080;font-style:italic">#&gt; # Bootstrap sampling </span>
<span style="color:#408080;font-style:italic">#&gt; # A tibble: 25 x 2</span>
<span style="color:#408080;font-style:italic">#&gt; splits id </span>
<span style="color:#408080;font-style:italic">#&gt; &lt;list&gt; &lt;chr&gt; </span>
<span style="color:#408080;font-style:italic">#&gt; 1 &lt;split [90/31]&gt; Bootstrap01</span>
<span style="color:#408080;font-style:italic">#&gt; 2 &lt;split [90/33]&gt; Bootstrap02</span>
<span style="color:#408080;font-style:italic">#&gt; 3 &lt;split [90/37]&gt; Bootstrap03</span>
<span style="color:#408080;font-style:italic">#&gt; 4 &lt;split [90/31]&gt; Bootstrap04</span>
<span style="color:#408080;font-style:italic">#&gt; 5 &lt;split [90/32]&gt; Bootstrap05</span>
<span style="color:#408080;font-style:italic">#&gt; 6 &lt;split [90/32]&gt; Bootstrap06</span>
<span style="color:#408080;font-style:italic">#&gt; 7 &lt;split [90/36]&gt; Bootstrap07</span>
<span style="color:#408080;font-style:italic">#&gt; 8 &lt;split [90/34]&gt; Bootstrap08</span>
<span style="color:#408080;font-style:italic">#&gt; 9 &lt;split [90/29]&gt; Bootstrap09</span>
<span style="color:#408080;font-style:italic">#&gt; 10 &lt;split [90/31]&gt; Bootstrap10</span>
<span style="color:#408080;font-style:italic">#&gt; # … with 15 more rows</span>
</code></pre></div><p>These inner resamples are self-contained, meaning that each bootstrap sample is aware that it is a sample of a specific 90% of the data:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">results<span style="color:#666">$</span>inner_resamples[[5]]<span style="color:#666">$</span>splits[[1]]
<span style="color:#408080;font-style:italic">#&gt; &lt;Training/Validation/Total&gt;</span>
<span style="color:#408080;font-style:italic">#&gt; &lt;90/31/90&gt;</span>
</code></pre></div><p>To start, we need to define how the model will be created and measured. Let&rsquo;s use a radial basis support vector machine model via the function <code>kernlab::ksvm</code>. This model is generally considered to have <em>two</em> tuning parameters: the SVM cost value and the kernel parameter <code>sigma</code>. For illustration purposes here, only the cost value will be tuned and the function <code>kernlab::sigest</code> will be used to estimate <code>sigma</code> during each model fit. This is automatically done by <code>ksvm</code>.</p>
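<p>To see what <code>ksvm</code> estimates automatically, <code>kernlab::sigest()</code> can also be called directly; it returns the 0.1, 0.5, and 0.9 quantiles of a plausible range for <code>sigma</code>. This is only an illustrative sketch, reusing the <code>results</code> object from above, and is not part of the tuning code that follows:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">library(kernlab)
# Quantiles of candidate values for the RBF kernel parameter `sigma`;
# the automatic estimation inside ksvm draws on this calculation
sigest(y ~ ., data = analysis(results$splits[[1]]))
</code></pre></div>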
<p>After the model is fit to the analysis set, the root mean squared error (RMSE) is computed on the assessment set. <strong>One important note:</strong> for this model, it is critical to center and scale the predictors before computing dot products. We don&rsquo;t do that here because <code>mlbench.friedman1</code> simulates all of the predictors to be standardized uniform random variables.</p>
<p>Our function to fit the model and compute the RMSE is:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(kernlab)
<span style="color:#408080;font-style:italic"># `object` will be an `rsplit` object from our `results` tibble</span>
<span style="color:#408080;font-style:italic"># `cost` is the tuning parameter</span>
svm_rmse <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(object, cost <span style="color:#666">=</span> <span style="color:#666">1</span>) {
y_col <span style="color:#666">&lt;-</span> <span style="color:#00f">ncol</span>(object<span style="color:#666">$</span>data)
mod <span style="color:#666">&lt;-</span>
<span style="color:#00f">svm_rbf</span>(mode <span style="color:#666">=</span> <span style="color:#ba2121">&#34;regression&#34;</span>, cost <span style="color:#666">=</span> cost) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">set_engine</span>(<span style="color:#ba2121">&#34;kernlab&#34;</span>) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">fit</span>(y <span style="color:#666">~</span> ., data <span style="color:#666">=</span> <span style="color:#00f">analysis</span>(object))
holdout_pred <span style="color:#666">&lt;-</span>
<span style="color:#00f">predict</span>(mod, <span style="color:#00f">assessment</span>(object) <span style="color:#666">%&gt;%</span> dplyr<span style="color:#666">::</span><span style="color:#00f">select</span>(<span style="color:#666">-</span>y)) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">bind_cols</span>(<span style="color:#00f">assessment</span>(object) <span style="color:#666">%&gt;%</span> dplyr<span style="color:#666">::</span><span style="color:#00f">select</span>(y))
<span style="color:#00f">rmse</span>(holdout_pred, truth <span style="color:#666">=</span> y, estimate <span style="color:#666">=</span> .pred)<span style="color:#666">$</span>.estimate
}
<span style="color:#408080;font-style:italic"># In some cases, we want to parameterize the function over the tuning parameter:</span>
rmse_wrapper <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(cost, object) <span style="color:#00f">svm_rmse</span>(object, cost)
</code></pre></div><p>For the nested resampling, a model needs to be fit for each tuning parameter and each bootstrap split. To do this, create a wrapper:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#408080;font-style:italic"># `object` will be an `rsplit` object for the bootstrap samples</span>
tune_over_cost <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(object) {
<span style="color:#00f">tibble</span>(cost <span style="color:#666">=</span> <span style="color:#666">2</span> ^ <span style="color:#00f">seq</span>(<span style="color:#666">-2</span>, <span style="color:#666">8</span>, by <span style="color:#666">=</span> <span style="color:#666">1</span>)) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">mutate</span>(RMSE <span style="color:#666">=</span> <span style="color:#00f">map_dbl</span>(cost, rmse_wrapper, object <span style="color:#666">=</span> object))
}
</code></pre></div><p>Since this will be called across the set of outer cross-validation splits, another wrapper is required:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#408080;font-style:italic"># `object` is an `rsplit` object in `results$inner_resamples` </span>
summarize_tune_results <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(object) {
<span style="color:#408080;font-style:italic"># Return a row-bound tibble that has the 25 bootstrap results</span>
<span style="color:#00f">map_df</span>(object<span style="color:#666">$</span>splits, tune_over_cost) <span style="color:#666">%&gt;%</span>
<span style="color:#408080;font-style:italic"># For each value of the tuning parameter, compute the </span>
<span style="color:#408080;font-style:italic"># average RMSE which is the inner bootstrap estimate. </span>
<span style="color:#00f">group_by</span>(cost) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">summarize</span>(mean_RMSE <span style="color:#666">=</span> <span style="color:#00f">mean</span>(RMSE, na.rm <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">TRUE</span>),
n <span style="color:#666">=</span> <span style="color:#00f">length</span>(RMSE))
}
</code></pre></div><p>Now that those functions are defined, we can execute all the inner resampling loops:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">tuning_results <span style="color:#666">&lt;-</span> <span style="color:#00f">map</span>(results<span style="color:#666">$</span>inner_resamples, summarize_tune_results)
</code></pre></div><p>Alternatively, since these computations can be run in parallel, we can use the furrr package. Instead of using <code>map()</code>, the function <code>future_map()</code> parallelizes the iterations using the
<a href="https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html" target="_blank" rel="noopener">future package</a>. The <code>multisession</code> plan uses the local cores to process the inner resampling loop. The end results are the same as the sequential computations.</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(furrr)
<span style="color:#00f">plan</span>(multisession)
tuning_results <span style="color:#666">&lt;-</span> <span style="color:#00f">future_map</span>(results<span style="color:#666">$</span>inner_resamples, summarize_tune_results)
</code></pre></div><p>The object <code>tuning_results</code> is a list of data frames for each of the 50 outer resamples.</p>
<p>Let&rsquo;s make a plot of the averaged results to see the relationship between the RMSE and the tuning parameter for each of the inner bootstrapping operations:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"><span style="color:#00f">library</span>(scales)
pooled_inner <span style="color:#666">&lt;-</span> tuning_results <span style="color:#666">%&gt;%</span> bind_rows
best_cost <span style="color:#666">&lt;-</span> <span style="color:#00f">function</span>(dat) dat<span style="color:#00f">[which.min</span>(dat<span style="color:#666">$</span>mean_RMSE),]
p <span style="color:#666">&lt;-</span>
<span style="color:#00f">ggplot</span>(pooled_inner, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> cost, y <span style="color:#666">=</span> mean_RMSE)) <span style="color:#666">+</span>
<span style="color:#00f">scale_x_continuous</span>(trans <span style="color:#666">=</span> <span style="color:#ba2121">&#39;log2&#39;</span>) <span style="color:#666">+</span>
<span style="color:#00f">xlab</span>(<span style="color:#ba2121">&#34;SVM Cost&#34;</span>) <span style="color:#666">+</span> <span style="color:#00f">ylab</span>(<span style="color:#ba2121">&#34;Inner RMSE&#34;</span>)
<span style="color:#00f">for </span>(i in <span style="color:#666">1</span><span style="color:#666">:</span><span style="color:#00f">length</span>(tuning_results))
p <span style="color:#666">&lt;-</span> p <span style="color:#666">+</span>
<span style="color:#00f">geom_line</span>(data <span style="color:#666">=</span> tuning_results[[i]], alpha <span style="color:#666">=</span> <span style="color:#666">.2</span>) <span style="color:#666">+</span>
<span style="color:#00f">geom_point</span>(data <span style="color:#666">=</span> <span style="color:#00f">best_cost</span>(tuning_results[[i]]), pch <span style="color:#666">=</span> <span style="color:#666">16</span>, alpha <span style="color:#666">=</span> <span style="color:#666">3</span><span style="color:#666">/</span><span style="color:#666">4</span>)
p <span style="color:#666">&lt;-</span> p <span style="color:#666">+</span> <span style="color:#00f">geom_smooth</span>(data <span style="color:#666">=</span> pooled_inner, se <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">FALSE</span>)
p
</code></pre></div><p><img src="figs/rmse-plot-1.svg" width="672" /></p>
<p>Each gray line is a separate bootstrap resampling curve created from a different 90% of the data. The blue line is a LOESS smooth of all the results pooled together.</p>
<p>To determine the best parameter estimate for each of the outer resampling iterations:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">cost_vals <span style="color:#666">&lt;-</span>
tuning_results <span style="color:#666">%&gt;%</span>
<span style="color:#00f">map_df</span>(best_cost) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">select</span>(cost)
results <span style="color:#666">&lt;-</span>
<span style="color:#00f">bind_cols</span>(results, cost_vals) <span style="color:#666">%&gt;%</span>
<span style="color:#00f">mutate</span>(cost <span style="color:#666">=</span> <span style="color:#00f">factor</span>(cost, levels <span style="color:#666">=</span> <span style="color:#00f">paste</span>(<span style="color:#666">2</span> ^ <span style="color:#00f">seq</span>(<span style="color:#666">-2</span>, <span style="color:#666">8</span>, by <span style="color:#666">=</span> <span style="color:#666">1</span>))))
<span style="color:#00f">ggplot</span>(results, <span style="color:#00f">aes</span>(x <span style="color:#666">=</span> cost)) <span style="color:#666">+</span>
<span style="color:#00f">geom_bar</span>() <span style="color:#666">+</span>
<span style="color:#00f">xlab</span>(<span style="color:#ba2121">&#34;SVM Cost&#34;</span>) <span style="color:#666">+</span>
<span style="color:#00f">scale_x_discrete</span>(drop <span style="color:#666">=</span> <span style="color:#008000;font-weight:bold">FALSE</span>)
</code></pre></div><p><img src="figs/choose-1.svg" width="672" /></p>
<p>Most of the resamples produced an optimal cost value of 2.0, but the distribution is right-skewed due to the flat trend in the resampling profile once the cost value becomes 10 or larger.</p>
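<p>The same information can be tabulated directly (a quick sketch, using the <code>results</code> object after the <code>cost</code> column was added above):</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r"># Count how often each cost value was selected across the 50 outer resamples
table(results$cost)
</code></pre></div>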
<p>Now that we have these estimates, we can compute the outer resampling results for each of the 50 splits using the corresponding tuning parameter value:</p>
<div class="highlight"><pre style=";-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-r" data-lang="r">results <span style="color:#666">&lt;-</span>
results <span style="color:#666">%&gt;%</span>
<span style="color:#00f">mutate</span>(RMSE <span style="color:#666">=</span> <span style="color:#00f">map2_dbl</span>(splits, cost, svm_rmse))
<span style="color:#00f">summary</span>(results<span style="color:#666">$</span>RMSE)
<span style="color:#408080;font-style:italic">#&gt; Min. 1st Qu. Median Mean 3rd Qu. Max. </span>
<span style="color:#408080;font-style:italic">#&gt; 1.57 2.09 2.68 2.69 3.25 4.25</span>