Slides 13
gpleiss committed Oct 9, 2024
1 parent c4205b4 commit 16b7d95
Showing 10 changed files with 3,743 additions and 2,437 deletions.

Large diffs are not rendered by default.

705 changes: 351 additions & 354 deletions _freeze/schedule/slides/13-gams-trees/figure-revealjs/big-tree-1.svg
174 changes: 84 additions & 90 deletions _freeze/schedule/slides/13-gams-trees/figure-revealjs/gam-mod-1.svg
1,232 changes: 612 additions & 620 deletions _freeze/schedule/slides/13-gams-trees/figure-revealjs/partition-view-1.svg
2,283 changes: 1,137 additions & 1,146 deletions _freeze/schedule/slides/13-gams-trees/figure-revealjs/unnamed-chunk-1-1.svg
195 changes: 141 additions & 54 deletions schedule/slides/13-gams-trees.qmd
@@ -83,9 +83,32 @@ $\Expect{Y \given X=x} = \beta_0 + f_1(x_{1})+\cdots+f_p(x_{p}),$

then

$\textrm{MSE}(\hat f) = \frac{Cp}{n^{4/5}} + \sigma^2.$
$$
R_n^{(\mathrm{GAM})} =
\underbrace{\frac{C_1^{(\mathrm{GAM})}}{n^{4/5}}}_{\mathrm{bias}^2} +
\underbrace{\frac{C_2^{(\mathrm{GAM})}}{n^{4/5}}}_{\mathrm{var}} +
\sigma^2.
$$
Compare with OLS and non-additive local smoothers:

$$
R_n^{(\mathrm{OLS})} =
\underbrace{C_1^{(\mathrm{OLS})}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{OLS})}}{n/p}}_{\mathrm{var}} +
\sigma^2,
\qquad
R_n^{(\mathrm{local})} =
\underbrace{\tfrac{C_1^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\mathrm{bias}^2} +
\underbrace{\tfrac{C_2^{(\mathrm{local})}}{n^{4/(4+p)}}}_{\mathrm{var}} +
\sigma^2.
$$
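To see the rate difference concretely, here is a small numeric sketch (illustrative only, not part of the original slides) of the error exponents: the additive rate $n^{-4/5}$ versus the fully nonparametric rate $n^{-4/(4+p)}$, with the constants $C_1$ and $C_2$ ignored.

```{r}
#| code-fold: true
# Illustrative sketch: how the bias^2 + variance terms shrink with n,
# ignoring the constants C_1 and C_2.
rates <- expand.grid(n = c(100, 1000, 10000), p = c(2, 5, 10))
rates$gam_rate   <- rates$n^(-4 / 5)              # additive model: exponent free of p
rates$local_rate <- rates$n^(-4 / (4 + rates$p))  # local smoother: exponent decays with p
rates
```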

* Exponent no longer depends on $p$. Converges faster. (If the truth is additive.)
---

* We no longer have an exponential dependence on $p$!

* But our predictor is restricted to functions that decompose additively.
(This is a big limitation.)

* You could also use the same methods to include "some" interactions like

@@ -108,64 +131,53 @@ plot(ex_smooth2,

## Regression trees

Trees involve stratifying or segmenting the predictor space into a number of simple regions.

Trees are simple and useful for interpretation.

Basic trees are not great at prediction.

Modern methods that use trees are much better (Module 4)

## Regression trees

Regression trees estimate piecewise-constant functions.

The regions are axis-parallel rectangles $R_1,\ldots,R_K$ based on $\X$.

In each region, we average the $y_i$'s: $\hat\mu_1,\ldots,\hat\mu_K$.

Minimize $\sum_{k=1}^K \sum_{i:\, x_i \in R_k} (y_i-\mu_k)^2$ over $R_k,\mu_k$ for $k\in \{1,\ldots,K\}$.

. . .

This sounds more complicated than it is.

The minimization is performed __greedily__ (like forward stepwise regression).
* Trees involve stratifying or segmenting the predictor space into a number of simple regions.
* Trees are simple and useful for interpretation.
* Basic trees are not great at prediction.
* Modern methods that use trees are much better (Module 4)
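As a toy illustration of the least-squares criterion above (a sketch with simulated data, not from the original slides): once the regions are fixed, the best constant $\hat\mu_k$ in each region is the within-region average, and the criterion is the pooled squared error.

```{r}
#| code-fold: true
# Hypothetical 1-D example: three fixed regions, region-wise means, and
# the criterion sum_k sum_{i in R_k} (y_i - mu_k)^2.
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)
regions <- cut(x, breaks = c(0, 1 / 3, 2 / 3, 1))  # K = 3 intervals ("rectangles" in 1-D)
mu_hat <- ave(y, regions)                          # each point gets its region's average
sum((y - mu_hat)^2)                                # the quantity minimized over regions
```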


## Example with mobility data

##
::: flex
::: w-50


![](https://www.aafp.org/dam/AAFP/images/journals/blogs/inpractice/covid_dx_algorithm4.png)



## Mobility data

```{r small-tree-prelim, echo=FALSE}
"Small" tree
```{r}
#| code-fold: true
#| fig-width: 8
data("mobility", package = "Stat406")
library(tree)
library(maptree)
mob <- mobility[complete.cases(mobility), ] %>% dplyr::select(-ID, -Name)
set.seed(12345)
par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0))
```

```{r}
#| fig-width: 8
bigtree <- tree(Mobility ~ ., data = mob)
smalltree <- prune.tree(bigtree, k = .09)
draw.tree(smalltree, digits = 2)
```
:::

This is called the [dendrogram]{.secondary}
::: w-50
"Big" tree
```{r big-tree, echo=FALSE}
#| fig-width: 8
#| fig-height: 5
draw.tree(bigtree, digits = 2)
```
:::
:::

[Terminology]{.secondary}

* We call each split or end point a *node*.
* Each terminal node is referred to as a *leaf*.

## Partition view
## Example with mobility data

```{r partition-view}
#| fig-width: 8
#| code-fold: true
#| fig-width: 10
mob$preds <- predict(smalltree)
par(mfrow = c(1, 2), mar = c(5, 3, 0, 0))
draw.tree(smalltree, digits = 2)
@@ -178,24 +190,97 @@ partition.tree(smalltree, add = TRUE, ordvars = c("Black", "Commute"))
```


We predict all observations in a region with the same value.
$\bullet$ The three regions correspond to the leaves of the tree.
[(The three regions correspond to the leaves of the tree.)]{.small}
\

* Trees are *piecewise constant functions*.\
[We predict all observations in a region with the same value; see the quick check below.]{.small}
* Prediction regions are axis-parallel rectangles $R_1,\ldots,R_K$ based on $\X$
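A quick sanity check of the piecewise-constant claim (assuming the `smalltree` object fit in the earlier chunk): the fitted values take only one distinct value per leaf.

```{r}
# The predictions are constant within each region, so (generically)
# the number of distinct fitted values equals the number of leaves.
length(unique(predict(smalltree)))
table(round(predict(smalltree), 3))
```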

##

```{r big-tree}
#| fig-width: 8
#| fig-height: 5
draw.tree(bigtree, digits = 2)
```

<!-- ## -->

[Terminology]{.secondary}

We call each split or end point a node. Each terminal node is referred to as a leaf.
<!-- ![](https://www.aafp.org/dam/AAFP/images/journals/blogs/inpractice/covid_dx_algorithm4.png) -->


<!-- ## Dendrogram view -->

<!-- ```{r} -->
<!-- #| code-fold: true -->
<!-- #| fig-width: 8 -->
<!-- data("mobility", package = "Stat406") -->
<!-- library(tree) -->
<!-- library(maptree) -->
<!-- mob <- mobility[complete.cases(mobility), ] %>% dplyr::select(-ID, -Name) -->
<!-- set.seed(12345) -->
<!-- par(mar = c(0, 0, 0, 0), oma = c(0, 0, 0, 0)) -->
<!-- smalltree <- prune.tree(bigtree, k = .09) -->
<!-- draw.tree(smalltree, digits = 2) -->
<!-- ``` -->

<!-- This is called the [dendrogram]{.secondary} -->


<!-- ## Partition view -->

<!-- ```{r partition-view} -->
<!-- #| code-fold: true -->
<!-- #| fig-width: 10 -->
<!-- mob$preds <- predict(smalltree) -->
<!-- par(mfrow = c(1, 2), mar = c(5, 3, 0, 0)) -->
<!-- draw.tree(smalltree, digits = 2) -->
<!-- cols <- viridisLite::viridis(20, direction = -1)[cut(log(mob$Mobility), 20)] -->
<!-- plot(mob$Black, mob$Commute, -->
<!-- pch = 19, cex = .4, bty = "n", las = 1, col = cols, -->
<!-- ylab = "Commute time", xlab = "% Black" -->
<!-- ) -->
<!-- partition.tree(smalltree, add = TRUE, ordvars = c("Black", "Commute")) -->
<!-- ``` -->

The interior nodes lead to branches.


## Constructing Trees

::: flex
::: w-60

Iterative algorithm:

* While ($\mathtt{depth} \ne \mathtt{max.depth}$):
* For each existing region $R_k$
* For a given *splitting variable* $j$ and *split value* $s$,
define
$$
\begin{align}
R_k^> &= \{x \in R_k : x^{(j)} > s\} \\
R_k^< &= \{x \in R_k : x^{(j)} \le s\}
\end{align}
$$
* Choose $j$ and $s$
to minimize
$$|R_k^>| \cdot \widehat{Var}(R_k^>) + |R_k^<| \cdot \widehat{Var}(R_k^<)$$

:::

::: w-35
```{r echo=FALSE}
#| fig-width: 5
#| fig-height: 4
plot(mob$Black, mob$Commute,
pch = 19, cex = .4, bty = "n", las = 1, col = cols,
ylab = "Commute time", xlab = "% Black"
)
partition.tree(smalltree, add = TRUE, ordvars = c("Black", "Commute"))
```
::: fragment
This algorithm is *greedy*, so it doesn't find the optimal tree\
[(But it works well! A worked sketch of the split search follows this slide.)]{.small}

:::
:::
:::
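To make one greedy step concrete, here is a minimal base-R sketch of the split search (illustrative only; it is not how the `tree` package is implemented).

```{r}
#| code-fold: true
# Hypothetical sketch: scan every (variable j, split value s) pair and return the
# one minimizing |R>| * Var(R>) + |R<| * Var(R<), i.e. the total within-child SSE.
best_split <- function(X, y) {
  sse <- function(v) sum((v - mean(v))^2)  # |R| times the (MLE) variance estimate
  best <- list(score = Inf)
  for (j in seq_len(ncol(X))) {
    for (s in unique(X[, j])) {
      hi <- X[, j] > s
      if (!any(hi) || all(hi)) next        # skip splits that leave a child empty
      score <- sse(y[hi]) + sse(y[!hi])
      if (score < best$score) best <- list(var = colnames(X)[j], split = s, score = score)
    }
  }
  best
}
# Example call on the mobility data used above:
best_split(as.matrix(mob[, c("Black", "Commute")]), mob$Mobility)
```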


## Advantages and disadvantages of trees
Expand All @@ -206,11 +291,13 @@ The interior nodes lead to branches.

πŸŽ‰ Trees can easily be displayed graphically no matter the dimension of the data.

πŸŽ‰ Trees can easily handle qualitative predictors without the need to create dummy variables.
πŸŽ‰ Trees can easily handle categorical predictors without the need to create one-hot encodings.

πŸŽ‰ *Trees are GREAT for missing data!!!*

πŸ’© Trees aren't very good at prediction.

πŸ’© Full trees badly overfit, so we "prune" them using CV
πŸ’© Big trees badly overfit, so we "prune" them using CV

. . .
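As a small demonstration of two of the πŸŽ‰ points (a sketch with simulated data, using `rpart` rather than the `tree` package used elsewhere in these slides): a factor predictor goes in as-is, and rows with missing values are still routed down the tree via rpart's default surrogate splits.

```{r}
#| code-fold: true
# Illustrative sketch: a factor predictor (no one-hot encoding needed) plus
# missing values in x, handled by rpart's surrogate splits.
library(rpart)
set.seed(1)
df <- data.frame(
  group = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  x = rnorm(200)
)
df$y <- ifelse(df$group == "c", 2, 0) + df$x + rnorm(200, sd = 0.3)
df$x[sample(200, 20)] <- NA                      # inject some missingness
fit <- rpart(y ~ group + x, data = df)
predict(fit, newdata = head(df[is.na(df$x), ]))  # predictions for rows with missing x
```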

