module 4 done

UBC-STAT · Oct 12, 2023 · b74d83d
1 parent 9af2ec4
commit b74d83d

Showing 11 changed files with 2,986 additions and 773 deletions.
diff --git a/_freeze/schedule/slides/23-nnets-other/execute-results/html.json b/_freeze/schedule/slides/23-nnets-other/execute-results/html.json
@@ -0,0 +1,23 @@
+{
+  "hash": "76e653849650b32b289ca1c075d509ab",
+  "result": {
+    "markdown": "---\nlecture: \"23 Neural nets - other considerations\"\nformat: revealjs\nmetadata-files: \n  - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 12 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n\n## Estimation procedures (training)\n\n\nBack-propagation\n\n[Advantages:]{.secondary}\n\n-   Its updates only depend on local\n    information in the sense that if objects in the hierarchical model\n    are unrelated to each other, the updates aren't affected\n\n    (This helps in many ways, most notably in parallel architectures)\n\n-   It doesn't require second-derivative information\n\n-   As the updates are only in terms of $\\hat{R}_i$, the algorithm can\n    be run in either batch or online mode\n\n[Downsides:]{.tertiary}\n\n-   It can be very slow\n\n-   Need to choose the 
learning rate\n    $\\gamma_t$\n\n## Other algorithms\n\nThere are many variations on the fitting algorithm\n\n[Stochastic gradient descent:]{.secondary} (SGD) discussed in the optimization lecture\n\nThe rest are variations that use lots of tricks\n\n* RMSprop\n* Adam\n* Adadelta\n* Adagrad\n* Adamax\n* Nadam\n* Ftrl\n\n\n## Regularizing neural networks\n\nNNets can almost always achieve 0 training error, even with regularization, because they have so many parameters.\n\nFlavours:\n\n-   a complexity penalization term $\\longrightarrow$ solve $\\min \\hat{R} + \\rho(\\alpha,\\beta)$\n-   early stopping on the back-propagation algorithm used for fitting\n\n\nWeight decay\n: This is like ridge regression in that we penalize the squared Euclidean norm of the weights $\\rho(\\mathbf{W},\\mathbf{B}) = \\sum w_i^2 + \\sum b_i^2$\n\nWeight elimination\n: This encourages more shrinking of small weights $\\rho(\\mathbf{W},\\mathbf{B}) =  \\sum \\frac{w_i^2}{1+w_i^2} + \\sum \\frac{b_i^2}{1 + b_i^2}$ or Lasso-type\n\nDropout\n: In each epoch, randomly choose $z\\%$ of the nodes and set those weights to zero.\n\n\n\n## Other common pitfalls\n\nThere are a few areas to watch out for\n\n[Nonconvexity:]{.tertiary} \n\nThe neural network optimization problem is non-convex. \n\nThis makes any numerical solution highly dependent on the initial values. These should be\n\n* chosen carefully, typically random near 0. 
[DON'T]{.hand} use all 0.\n* regenerated several times to check sensitivity\n\n[Scaling:]{.tertiary}  \nBe sure to standardize the covariates before training\n\n## Other common pitfalls\n\n[Number of hidden units:]{.tertiary}  \nIt is generally\nbetter to have too many hidden units than too few (regularization\ncan eliminate some).\n\n\n[Sifting the output:]{.tertiary}\n\n* Choose the solution that minimizes training error\n* Choose the solution that minimizes the penalized training error\n* Average the solutions across runs\n\n\n## Tuning parameters\n\nThere are many.\n\n* Regularization\n* Stopping criterion\n* Learning rate\n* Architecture\n* Dropout %\n* Others...\n\nThese are hard to tune.\n\nIn practice, people might choose \"some\" with a validation set, and fix the rest largely arbitrarily\n\n. . .\n\nMore often, people set them all arbitrarily\n\n\n## Thoughts on NNets {.smaller}\n\nOff the top of my head, without lots of justification\n\n::: flex\n::: w-50\n\n🤬😡 [Why don't statisticians like them?]{.tertiary} 🤬😡\n\n- There is little theory (though this is increasing)\n- Stat theory applies to global minima; here, we only reach local minima determined by the optimizer\n- Little understanding of when they work\n- In large part, NNets look like logistic regression + feature creation. 
We understand that well, and in many applications, it performs just as well\n- Explosion of tuning parameters without a way to decide\n- Require massive datasets to work\n- Lots of examples where they perform _exceedingly_ poorly\n:::\n\n::: w-50\n\n\n🔥🔥[Why are they hot?]{.tertiary}🔥🔥\n\n- Perform exceptionally well on typical CS tasks (images, translation)\n- Take advantage of SOTA computing (parallel, GPUs)\n- Very good for multinomial logistic regression\n- An excellent example of \"transfer learning\"\n- They generate pretty pictures (the nets, pseudo-responses at hidden units)\n\n:::\n:::\n\n\n## Keras\n\nMost people who do deep learning use Python $+$ Keras $+$ TensorFlow\n\nIt takes some work to get all this software up and running.\n\nIt is possible to do this in R using an [interface to Keras](https://keras.rstudio.com/index.html).\n\n. . .\n\nI used to try to do a walk-through, but the interface is quite brittle\n\nIf you want to explore, see the handout:\n\n* Knitted: <https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.html>\n* Rmd: <https://ubc-stat.github.io/stat-406-lectures/handouts/keras-nnet.Rmd>\n\n\n# Double descent and model complexity\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n```{=html}\n<blockquote class=\"twitter-tweet\" data-width=\"550\" data-lang=\"en\" data-dnt=\"true\" data-theme=\"light\"><p lang=\"en\" dir=\"ltr\">The Bias-Variance Trade-Off &amp; &quot;DOUBLE DESCENT&quot; 🧵<br><br>Remember the bias-variance trade-off? It says that models  perform well for an &quot;intermediate level of flexibility&quot;.  
You&#39;ve seen the picture of the U-shape test error curve.<br><br>We try to hit the &quot;sweet spot&quot; of flexibility.<br><br>1/🧵 <a href=\"https://t.co/HPk05izkZh\">pic.twitter.com/HPk05izkZh</a></p>&mdash; Daniela Witten (@daniela_witten) <a href=\"https://twitter.com/daniela_witten/status/1292293102103748609?ref_src=twsrc%5Etfw\">August 9, 2020</a></blockquote>\n\n```\n:::\n:::\n\n\n\n## Where does this U shape come from?\n\n\n[MSE = Squared Bias + Variance + Irreducible Noise]{.secondary}\n\n\nAs we increase flexibility:\n\n* Squared bias goes down\n* Variance goes up\n* Eventually, | $\\partial$ Variance | $>$ | $\\partial$ Squared Bias |.\n\n\n[Goal:]{.secondary} Choose the amount of flexibility to balance these and minimize MSE.\n\n. . .\n\n[Use CV or something to estimate MSE and decide how much flexibility.]{.hand}\n\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n```{=html}\n<blockquote class=\"twitter-tweet\" data-conversation=\"none\" data-width=\"550\" data-lang=\"en\" data-dnt=\"true\" data-theme=\"light\"><p lang=\"en\" dir=\"ltr\">In the past few yrs, (and particularly in the context of deep learning) ppl have noticed &quot;double descent&quot; -- when you continue to fit increasingly flexible models that interpolate the training data, then the test error can start to DECREASE again!! <br><br>Check it out: <br>3/ <a href=\"https://t.co/Vo54tRVRNG\">pic.twitter.com/Vo54tRVRNG</a></p>&mdash; Daniela Witten (@daniela_witten) <a href=\"https://twitter.com/daniela_witten/status/1292293104855158784?ref_src=twsrc%5Etfw\">August 9, 2020</a></blockquote>\n\n```\n:::\n:::\n\n\n\n\n## Zero training error and model saturation\n\n* In Deep Learning, the recommendation is to \"fit until you get zero training error\"\n\n* This somehow, magically, leads to a continued decrease in test error.\n\n* So, who cares about the Bias-Variance Trade-off!!\n\n. . .\n\n[Lesson:]{.secondary}\n\nBV Trade-off is not wrong. 
😢\n\nThis is a misunderstanding of black box algorithms and flexibility.\n\nWe don't even need deep learning to illustrate. \n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(splines)\nset.seed(20221102)\nn <- 20\ndf <- tibble(\n  x = seq(-1.5 * pi, 1.5 * pi, length.out = n),\n  y = sin(x) + runif(n, -0.5, 0.5)\n)\ng <- ggplot(df, aes(x, y)) + geom_point() + stat_function(fun = sin) + ylim(c(-2, 2))\ng + stat_smooth(method = lm, formula = y ~ bs(x, df = 4), se = FALSE, color = green) + # too smooth\n  stat_smooth(method = lm, formula = y ~ bs(x, df = 8), se = FALSE, color = orange) # looks good\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nxn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 20, intercept = TRUE)\nXn <- bs(xn, df = 20, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x=xn, y=yhat), colour = orange) +\n  ggtitle(\"20 degrees of freedom\")\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n##\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nxn <- seq(-1.5 * pi, 1.5 * pi, length.out = 1000)\n# Spline by hand\nX <- bs(df$x, df = 40, intercept = TRUE)\nXn <- bs(xn, df = 40, intercept = TRUE)\nS <- svd(X)\nyhat <- Xn %*% S$v %*% diag(1/S$d) %*% crossprod(S$u, df$y)\ng + geom_line(data = tibble(x = xn, y = yhat), colour = orange) +\n  ggtitle(\"40 degrees of freedom\")\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## What happened?!\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code  code-line-numbers=\"1|3-12|13-16|\"}\ndoffs <- 4:50\nmse <- function(x, y) mean((x - 
y)^2)\nget_errs <- function(doff) {\n  X <- bs(df$x, df = doff, intercept = TRUE)\n  Xn <- bs(xn, df = doff, intercept = TRUE)\n  S <- svd(X)\n  yh <- S$u %*% crossprod(S$u, df$y)\n  bhat <- S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n  yhat <- Xn %*% S$v %*% diag(1 / S$d) %*% crossprod(S$u, df$y)\n  nb <- sqrt(sum(bhat^2))\n  tibble(train = mse(df$y, yh), test = mse(yhat, sin(xn)), norm = nb)\n}\nerrs <- map(doffs, get_errs) |>\n  list_rbind() |> \n  mutate(`degrees of freedom` = doffs) |> \n  pivot_longer(train:test, values_to = \"error\")\n```\n:::\n\n\n## What happened?!\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code  code-fold=\"true\"}\nggplot(errs, aes(`degrees of freedom`, error, color = name)) +\n  geom_line(linewidth = 2) + \n  coord_cartesian(ylim = c(0, .12)) +\n  scale_x_log10() + \n  scale_colour_manual(values = c(blue, orange), name = \"\") +\n  geom_vline(xintercept = 20)\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n## What happened?!\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code  code-fold=\"true\"}\nbest_test <- errs |> filter(name == \"test\")\nmin_norm <- best_test$norm[which.min(best_test$error)]\nggplot(best_test, aes(norm, error)) +\n  geom_line(colour = blue, size = 2) + ylab(\"test error\") +\n  geom_vline(xintercept = min_norm, colour = orange) +\n  scale_y_log10() + scale_x_log10() + geom_vline(xintercept = 20)\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Degrees of freedom and complexity\n\n* In low dimensions (where $n \\gg p$), with linear smoothers, df and model complexity are roughly the same.\n\n* But this relationship breaks down in more complicated settings\n\n* We've already seen this:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(glmnet)\nout <- cv.glmnet(X, df$y, nfolds = n) # leave 
one out\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code  code-fold=\"true\"}\nwith(\n  out, \n  tibble(lambda = lambda, df = nzero, cv = cvm, cvup = cvup, cvlo = cvlo )\n) |> \n  filter(df > 0) |>\n  pivot_longer(lambda:df) |> \n  ggplot(aes(x = value)) +\n  geom_errorbar(aes(ymax = cvup, ymin = cvlo)) +\n  geom_point(aes(y = cv), colour = orange) +\n  facet_wrap(~ name, strip.position = \"bottom\", scales = \"free_x\") +\n  scale_y_log10() +\n  scale_x_log10() + theme(axis.title.x = element_blank())\n```\n\n::: {.cell-output-display}\n![](23-nnets-other_files/figure-revealjs/unnamed-chunk-10-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Infinite solutions\n\n* In Lasso, df is not really the right measure of complexity\n\n* Better is $\\lambda$ or the norm of the coefficients (these are basically the same)\n\n* So what happened with the splines?\n\n. . .\n\n* When df $= 20$, there's a unique solution that interpolates the data\n\n* When df $> 20$, there are infinitely many solutions that interpolate the data.\n\nBecause we used the SVD to solve the system, we happened to pick one: the one that has the smallest $\\Vert\\hat\\beta\\Vert_2$\n\nRecent work in Deep Learning shows that SGD has the same property: it returns the local optimum with the smallest norm.\n\nIf we measure complexity in terms of the norm of the weights, rather than by counting parameters, we don't see double descent anymore.\n\n\n## The lesson\n\n* Deep learning isn't magic.\n\n* Zero training error with lots of parameters doesn't mean good test error.\n\n* We still need the bias-variance trade-off\n\n* Its intuition still applies: more flexibility eventually leads to increased MSE\n\n* But we need to be careful how we measure complexity.\n\n::: aside\n\nThere is very interesting recent theory that says \nwhen we can expect lower test error to the right of the interpolation threshold\nthan to the left. 
\n\n:::\n\n\n\n# Next time...\n\n[Module 5]{.secondary}\n\n[unsupervised learning]{.secondary}\n\n\n\n\n",
+    "supporting": [
+      "23-nnets-other_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {
+      "include-in-header": [
+        "<script src=\"../../site_libs/twitter-widget/widgets.js\"></script>\n"
+      ],
+      "include-after-body": [
+        "\n<script>\n  // htmlwidgets need to know to resize themselves when slides are shown/hidden.\n  // Fire the \"slideenter\" event (handled by htmlwidgets.js) when the current\n  // slide changes (different for each slide format).\n  (function () {\n    // dispatch for htmlwidgets\n    function fireSlideEnter() {\n      const event = window.document.createEvent(\"Event\");\n      event.initEvent(\"slideenter\", true, true);\n      window.document.dispatchEvent(event);\n    }\n\n    function fireSlideChanged(previousSlide, currentSlide) {\n      fireSlideEnter();\n\n      // dispatch for shiny\n      if (window.jQuery) {\n        if (previousSlide) {\n          window.jQuery(previousSlide).trigger(\"hidden\");\n        }\n        if (currentSlide) {\n          window.jQuery(currentSlide).trigger(\"shown\");\n        }\n      }\n    }\n\n    // hookup for slidy\n    if (window.w3c_slidy) {\n      window.w3c_slidy.add_observer(function (slide_num) {\n        // slide_num starts at position 1\n        fireSlideChanged(null, w3c_slidy.slides[slide_num - 1]);\n      });\n    }\n\n  })();\n</script>\n\n"
+      ]
+    },
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}