diff --git a/_freeze/schedule/slides/20-boosting/execute-results/html.json b/_freeze/schedule/slides/20-boosting/execute-results/html.json index a710d7f..56f21bf 100644 --- a/_freeze/schedule/slides/20-boosting/execute-results/html.json +++ b/_freeze/schedule/slides/20-boosting/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "4e41ba0061438c4140349f5002e37fd6", + "hash": "ec86535b178e899a578c3a3c7779af0c", "result": { - "markdown": "---\nlecture: \"20 Boosting\"\nformat: \n revealjs:\n multiplex: true\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Last time\n\n\n\nWe learned about bagging, for averaging [low-bias]{.secondary} / [high-variance]{.tertiary} estimators.\n\nToday, we examine it's opposite: Boosting.\n\nBoosting also combines estimators, but it combines [high-bias]{.secondary} / [low-variance]{.tertiary} estimators.\n\nBoosting has a number of flavours. And if you Google descriptions, most are wrong.\n\nFor a deep (and accurate) treatment, see [ESL] Chapter 10\n\n\n. . 
.\n\nWe'll discuss 2 flavours: [AdaBoost]{.secondary} and [Gradient Boosting]{.secondary}\n\nNeither requires a tree, but that's the typical usage.\n\nBoosting needs a \"weak learner\", so small trees (stumps) are natural.\n\n\n\n## AdaBoost intuition (for classification)\n\nAt each iteration, we weight the [observations]{.secondary}.\n\nObservations that are currently misclassified, get [higher]{.tertiary} weights.\n\nSo on the next iteration, we'll try harder to correctly classify our mistakes.\n\nThe number of iterations must be chosen.\n\n\n\n## AdaBoost (Freund and Schapire, generic)\n\nLet $G(x, \\theta)$ be any weak learner \n\n⛭ imagine a tree with one split: then $\\theta=$ (feature, split point)\n\n\n\nAlgorithm (AdaBoost) 🛠️\n\n* Set observation weights $w_i=1/n$.\n* Until we quit ( $m\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n```\n:::\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7-12|17|\"}\nlibrary(gbm)\ntrain_boost <- train |>\n mutate(mobile = as.integer(mobile) - 1)\n# needs {0, 1} responses\ntest_boost <- test |>\n mutate(mobile = as.integer(mobile) - 1)\nadab <- gbm(\n mobile ~ .,\n data = train_boost,\n n.trees = 500,\n distribution = \"adaboost\"\n)\npreds$adab <- as.numeric(\n predict(adab, test_boost) > 0\n)\npar(mar = c(5, 11, 0, 1))\ns <- summary(adab, las = 1)\n```\n\n::: {.cell-output-display}\n![](20-boosting_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Forward stagewise additive modeling (FSAM, completely generic)\n\nAlgorithm 🛠️\n\n* Set initial predictor $f_0(x)=0$\n* Until we quit ( $m 0) != truth)), 2)\n )\n ) +\n annotate(\"text\",\n x = 4, y = -5, color = red,\n label = paste(\"adaboost error\\n\", round(with(boost_preds, mean((adaboost > 0) != truth)), 2))\n )\nboost_oob <- tibble(\n adaboost = adab$oobag.improve, gbm = grad_boost$oobag.improve,\n ntrees = 1:500\n)\ng2 <- boost_oob %>%\n pivot_longer(-ntrees, values_to = \"OOB_Error\") %>%\n ggplot(aes(x = ntrees, y = OOB_Error, color = name)) +\n geom_line() +\n scale_color_manual(values = c(orange, blue)) +\n theme(legend.title = element_blank())\nplot_grid(g1, g2, rel_widths = c(.4, .6))\n```\n\n::: {.cell-output-display}\n![](20-boosting_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Major takeaways\n\n* Two flavours of Boosting \n 1. AdaBoost (the original) and \n 2. 
gradient boosting (easier and more computationally friendly)\n\n* The connection is \"Forward stagewise additive modelling\" (AdaBoost is a special case)\n\n* The connection reveals that AdaBoost \"isn't robust because it uses exponential loss\" (squared error is even worse)\n\n* Gradient boosting is a computationally easier version of FSAM\n\n* All use **weak learners** (compare to Bagging)\n\n* Think about the Bias-Variance implications\n\n* You can use these for regression or classification\n\n* You can do this with other weak learners besides trees.\n\n\n\n# Next time...\n\nNeural networks and deep learning, the beginning\n", + "markdown": "---\nlecture: \"20 Boosting\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## Last time\n\n\n\nWe learned about bagging, for averaging [low-bias]{.secondary} / [high-variance]{.tertiary} estimators.\n\nToday, we examine its opposite: Boosting.\n\nBoosting also combines estimators, but it combines [high-bias]{.secondary} / [low-variance]{.tertiary} estimators.\n\nBoosting has a number of flavours. And if you Google descriptions, most are wrong.\n\nFor a deep (and accurate) treatment, see [ESL] Chapter 10\n\n\n. . 
.\n\nWe'll discuss 2 flavours: [AdaBoost]{.secondary} and [Gradient Boosting]{.secondary}\n\nNeither requires a tree, but that's the typical usage.\n\nBoosting needs a \"weak learner\", so small trees (stumps) are natural.\n\n\n\n## AdaBoost intuition (for classification)\n\nAt each iteration, we weight the [observations]{.secondary}.\n\nObservations that are currently misclassified, get [higher]{.tertiary} weights.\n\nSo on the next iteration, we'll try harder to correctly classify our mistakes.\n\nThe number of iterations must be chosen.\n\n\n\n## AdaBoost (Freund and Schapire, generic)\n\nLet $G(x, \\theta)$ be any weak learner \n\n⛭ imagine a tree with one split: then $\\theta=$ (feature, split point)\n\n\n\nAlgorithm (AdaBoost) 🛠️\n\n* Set observation weights $w_i=1/n$.\n* Until we quit ( $m\n mutate(mobile = as.factor(Mobility > .1)) |>\n select(-ID, -Name, -Mobility, -State) |>\n drop_na()\nn <- nrow(mob)\ntrainidx <- sample.int(n, floor(n * .75))\ntestidx <- setdiff(1:n, trainidx)\ntrain <- mob[trainidx, ]\ntest <- mob[testidx, ]\nrf <- randomForest(mobile ~ ., data = train)\nbag <- randomForest(mobile ~ ., data = train, mtry = ncol(mob) - 1)\npreds <- tibble(truth = test$mobile, rf = predict(rf, test), bag = predict(bag, test))\n```\n:::\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7-12|17|\"}\nlibrary(gbm)\ntrain_boost <- train |>\n mutate(mobile = as.integer(mobile) - 1)\n# needs {0, 1} responses\ntest_boost <- test |>\n mutate(mobile = as.integer(mobile) - 1)\nadab <- gbm(\n mobile ~ .,\n data = train_boost,\n n.trees = 500,\n distribution = \"adaboost\"\n)\npreds$adab <- as.numeric(\n predict(adab, test_boost) > 0\n)\npar(mar = c(5, 11, 0, 1))\ns <- summary(adab, las = 1)\n```\n\n::: {.cell-output-display}\n![](20-boosting_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Forward stagewise additive modeling (FSAM, completely generic)\n\nAlgorithm 🛠️\n\n* Set initial predictor $f_0(x)=0$\n* Until we quit ( $m 0) != truth)), 2)\n )\n ) +\n annotate(\"text\",\n x = 4, y = -5, color = red,\n label = paste(\"adaboost error\\n\", round(with(boost_preds, mean((adaboost > 0) != truth)), 2))\n )\nboost_oob <- tibble(\n adaboost = adab$oobag.improve, gbm = grad_boost$oobag.improve,\n ntrees = 1:500\n)\ng2 <- boost_oob %>%\n pivot_longer(-ntrees, values_to = \"OOB_Error\") %>%\n ggplot(aes(x = ntrees, y = OOB_Error, color = name)) +\n geom_line() +\n scale_color_manual(values = c(orange, blue)) +\n theme(legend.title = element_blank())\nplot_grid(g1, g2, rel_widths = c(.4, .6))\n```\n\n::: {.cell-output-display}\n![](20-boosting_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Major takeaways\n\n* Two flavours of Boosting \n 1. AdaBoost (the original) and \n 2. gradient boosting (easier and more computationally friendly)\n\n* The connection is \"Forward stagewise additive modelling\" (AdaBoost is a special case)\n\n* The connection reveals that AdaBoost \"isn't robust because it uses exponential loss\" (squared error is even worse)\n\n* Gradient boosting is a computationally easier version of FSAM\n\n* All use **weak learners** (compare to Bagging)\n\n* Think about the Bias-Variance implications\n\n* You can use these for regression or classification\n\n* You can do this with other weak learners besides trees.\n\n\n\n# Next time...\n\nNeural networks and deep learning, the beginning\n", "supporting": [ "20-boosting_files" ],