
Commit

Merge branch 'main' of https://github.com/UBC-STAT/stat-406
dajmcdon committed Nov 3, 2023
2 parents b0c7cf3 + ec8a139 commit 4eb925e
Showing 9 changed files with 28 additions and 1,187 deletions.


4 changes: 2 additions & 2 deletions _freeze/schedule/slides/20-boosting/execute-results/html.json

Large diffs are not rendered by default.

_freeze/schedule/slides/21-nnets-intro/execute-results/html.json
@@ -1,7 +1,7 @@
{
"hash": "f3fc08f00287583f6b5da27e5f8904c8",
"hash": "1764e2e9b2555daeb42521f17b1bf76f",
"result": {
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 12 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or \"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\nIn the mid 2000's, new approaches for\n[initializing]{.tertiary} neural networks became\navailable\n\n \nThese approaches are collectively known as [deep learning]{.secondary}\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall nonparametric regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k h_k(x)$ some basis $h_1,h_2,\\ldots$ (using $h$ instead of $\\phi$ to match ISLR)\n\n2. 
Truncate this expansion at $K$: \n $f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k h_k(x)$\n\n3. Estimate $\\beta_k$ with least squares\n\n\n## Recall nonparametric regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If $p$ is large, then nonparametrics doesn't work well at all (recall the Curse of Dimensionality)\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ g(w_{k0} + w_k^{\\top}x)\\\\\n&= \\beta_0 + \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\beta_0 + \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_{k0} + w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_0},{\\beta_k},{w_{k0}},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n- $\\beta_0$ and $w_{k0}$ are usually called [biases]{.secondary} (I'm going to set them to 0 and ignore them in future formulas. Just for space. It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = {\\beta_0} + \\sum_{k=1}^{{K}} {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n$\\beta \\in \\R^k$. \n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability: $\\hat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n## Two observations\n\n\n1. 
The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"markdown": "---\nlecture: \"21 Neural nets\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 02 November 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n## Overview\n\nNeural networks are models for supervised\nlearning\n\n \nLinear combinations of features are passed\nthrough a non-linear transformation in successive layers\n\n \nAt the top layer, the resulting latent\nfactors are fed into an algorithm for\npredictions\n\n(Most commonly via least squares or logistic loss)\n\n \n\n\n\n## Background\n\n::: flex\n::: w-50\n\nNeural networks have come about in 3 \"waves\" \n\nThe first was an attempt in the 1950s to model the mechanics of the human brain\n\nIt appeared the brain worked by\n\n- taking atomic units known as [neurons]{.tertiary},\n which can be \"on\" or \"off\"\n- putting them in [networks]{.tertiary} \n\nA neuron itself interprets the status of other neurons\n\nThere weren't really computers, so we couldn't estimate these things\n:::\n\n::: w-50\n\n\n![](https://miro.medium.com/v2/resize:fit:870/0*j0gW8xn8GkL7MrOs.gif){fig-align=\"center\" width=600}\n\n:::\n:::\n\n## Background\n\nAfter the development of parallel, distributed computation in the 1980s,\nthis \"artificial intelligence\" view was diminished\n\nAnd neural networks gained popularity \n\nBut, the growing popularity of SVMs and boosting/bagging in the late\n1990s, neural networks again fell out of favor\n\nThis was due to many of the problems we'll discuss (non-convexity being\nthe main one)\n\n. . .\n\n \nState-of-the-art performance on various classification\ntasks has been accomplished via neural networks\n\nToday, Neural Networks/Deep Learning are the hottest...\n\n\n\n\n\n## High level overview\n\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" height=500}\n\n\n\n\n## Recall nonparametric regression\n\nSuppose $Y \\in \\mathbb{R}$ and we are trying estimate\nthe regression function $$\\Expect{Y\\given X} = f_*(X)$$\n\n \nIn Module 2, we discussed basis expansion, \n\n\n\n1. We know $f_*(x) =\\sum_{k=1}^\\infty \\beta_k \\phi_k(x)$ some basis \n$\\phi_1,\\phi_2,\\ldots$\n\n2. Truncate this expansion at $K$: \n$f_*^K(x) \\approx \\sum_{k=1}^K \\beta_k \\phi_k(x)$\n\n3. 
Estimate $\\beta_k$ with least squares\n\n\n## Recall nonparametric regression\n\nThe weaknesses of this approach are:\n\n- The basis is fixed and independent of the data\n- If $p$ is large, then nonparametrics doesn't work well at all (recall the Curse of Dimensionality)\n- If the basis doesn't \"agree\" with $f_*$, then $K$ will have to be\n large to capture the structure\n- What if parts of $f_*$ have substantially different structure? Say $f_*(x)$ really wiggly for $x \\in [-1,3]$ but smooth elsewhere\n\nAn alternative would be to have the data\n[tell]{.secondary} us what kind of basis to use (Module 5)\n\n\n## 1-layer for Regression\n\n::: flex\n::: w-50\n\nA single layer neural network model is\n$$\n\\begin{aligned}\n&f(x) = \\sum_{k=1}^K \\beta_k h_k(x) \\\\\n&= \\sum_{k=1}^K \\beta_k \\ g(w_k^{\\top}x)\\\\\n&= \\sum_{k=1}^K \\beta_k \\ A_k\\\\\n\\end{aligned}\n$$\n\n[Compare:]{.secondary} A nonparametric regression\n$$f(x) = \\sum_{k=1}^K \\beta_k {\\phi_k(x)}$$\n\n:::\n\n::: w-50\n\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g( w_k^{\\top}x)}$$\nThe main components are\n\n- The derived features ${A_k = g(w_k^{\\top}x)}$ and are called the [hidden units]{.secondary} or [activations]{.secondary}\n- The function $g$ is called the [activation function]{.secondary} (more on this later)\n- The parameters\n${\\beta_k},{w_k}$ are estimated from the data for all $k = 1,\\ldots, K$.\n- The number of hidden units ${K}$ is a tuning\n parameter\n \n$$f(x) = \\sum_{k=1}^{{K}} \\beta_0 + {\\beta_k} {g(w_{k0} + w_k^{\\top}x)}$$\n\n- Could add $\\beta_0$ and $w_{k0}$. Called [biases]{.secondary} \n(I'm going to ignore them. It's just an intercept) \n\n\n## Terminology\n\n$$f(x) = \\sum_{k=1}^{{K}} {\\beta_k} {g(w_k^{\\top}x)}$$\n\n\nNotes (no biases):\n\n<br/>\n\n$\\beta \\in \\R^k$ \n\n$w_k \\in \\R^p,\\ k = 1,\\ldots,K$ \n\n$\\mathbf{W} \\in \\R^{K\\times p}$\n\n\n## What about classification (10 classes, 2 layers)\n\n\n::: flex\n::: w-40\n\n$$\n\\begin{aligned}\nA_k^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_\\ell^{(2)} &= g\\left(\\sum_{k=1}^{K_1} w^{(2)}_{\\ell,k} A_k^{(1)} \\right)\\\\\nz_m &= \\sum_{\\ell=1}^{K_2} \\beta_{m,\\ell} A_\\ell^{(2)}\\\\\nf_m(x) &= \\frac{1}{1 + \\exp(-z_m)}\\\\\n\\end{aligned}\n$$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\nPredict class with largest probability \n$\\longrightarrow\\ \\widehat{Y} = \\argmax_{m} f_m(x)$\n\n## What about classification (10 classes, 2 layers)\n\n::: flex\n::: w-40\n\nNotes:\n\n$B \\in \\R^{M\\times K_2}$ (here $M=10$). \n\n$\\mathbf{W}_2 \\in \\R^{K_2\\times K_1}$ \n\n$\\mathbf{W}_1 \\in \\R^{K_1\\times p}$\n\n:::\n::: w-60\n\n![](gfx/two-layer-net.svg){fig-align=\"center\" width=500}\n\n:::\n:::\n\n\n## Two observations\n\n\n1. 
The $g$ function generates a [feature map]{.secondary}\n\nWe start with $p$ covariates and we generate $K$ features (1-layer)\n\n::: flex\n\n::: w-50\n\n[Logistic / Least-squares with a polynomial transformation]{.tertiary}\n\n$$\n\\begin{aligned}\n&\\Phi(x) \\\\\n& = \n(1, x_1, \\ldots, x_p, x_1^2,\\ldots,x_p^2,\\ldots\\\\\n& \\quad \\ldots x_1x_2, \\ldots, x_{p-1}x_p) \\\\\n& =\n(\\phi_1(x),\\ldots,\\phi_{K_2}(x))\\\\\nf(x) &= \\sum_{k=1}^{K_2} \\beta_k \\phi_k(x) = \\beta^\\top \\Phi(x)\n\\end{aligned}\n$$\n\n:::\n\n::: w-50\n[Neural network]{.secondary}\n\n\n\n$$\\begin{aligned}\nA_k &= g\\left( \\sum_{j=1}^p w_{kj}x_j\\right) = g\\left( w_{k}^{\\top}x\\right)\\\\\n\\Phi(x) &= (A_1,\\ldots, A_K)^\\top \\in \\mathbb{R}^{K}\\\\\nf(x) &=\\beta^{\\top} \\Phi(x)=\\beta^\\top A\\\\ \n&= \\sum_{k=1}^K \\beta_k g\\left( \\sum_{j=1}^p w_{kj}x_j\\right)\\end{aligned}$$\n\n:::\n:::\n\n## Two observations\n\n2. If $g(u) = u$, (or $=3u$) then neural networks reduce to (massively underdetermined) ordinary least squares (try to show this)\n\n* ReLU is the current fashion (used to be tanh or logistic)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](21-nnets-intro_files/figure-revealjs/sigmoid-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nHow do we estimate these monsters?\n",
"supporting": [
"21-nnets-intro_files"
],
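The slide content in the diff above defines a single-layer regression network, f(x) = Σ_k β_k g(w_kᵀ x), and a two-layer, ten-class classifier whose prediction is the argmax over the per-class outputs f_m(x). The snippet below is a minimal NumPy sketch of those two forward passes, included purely to illustrate the formulas; it is not part of the commit, and the function names (`one_layer`, `two_layer_classify`), the toy dimensions, and the choice of ReLU for g are assumptions of this sketch rather than anything in the repository.

```python
import numpy as np

# Illustrative sketch of the formulas in the "21 Neural nets" slides above.
# Names, dimensions, and the use of ReLU/NumPy are assumptions, not repo code.

rng = np.random.default_rng(406)

def relu(u):
    return np.maximum(u, 0.0)

def one_layer(X, W, beta, g=relu):
    """Single-layer regression net: f(x) = sum_k beta_k * g(w_k^T x)."""
    A = g(X @ W.T)      # n x K matrix of hidden units / activations
    return A @ beta     # n-vector of fitted values

n, p, K = 200, 5, 20
X = rng.normal(size=(n, p))
W = rng.normal(size=(K, p))   # W in R^{K x p}, matching the slide notation
beta = rng.normal(size=K)     # beta in R^K

yhat = one_layer(X, W, beta)

# Observation 2 in the slides: with the identity activation g(u) = u,
# the model collapses to a linear model in X with coefficients W^T beta.
assert np.allclose(one_layer(X, W, beta, g=lambda u: u), X @ (W.T @ beta))

def two_layer_classify(X, W1, W2, B, g=relu):
    """Two-layer classifier: A1 = g(X W1^T), A2 = g(A1 W2^T), z = A2 B^T,
    f_m(x) = sigmoid(z_m), predicted class = argmax_m f_m(x)."""
    A1 = g(X @ W1.T)                  # n x K1
    A2 = g(A1 @ W2.T)                 # n x K2
    Z = A2 @ B.T                      # n x M class scores
    F = 1.0 / (1.0 + np.exp(-Z))      # per-class probabilities f_m(x)
    return F.argmax(axis=1)

M, K1, K2 = 10, 32, 16
W1 = rng.normal(size=(K1, p))
W2 = rng.normal(size=(K2, K1))
B = rng.normal(size=(M, K2))
print(two_layer_classify(X, W1, W2, B)[:10])   # predicted classes, first 10 obs
```

The weights here are random rather than estimated, since the slides defer estimation ("How do we estimate these monsters?") to the next lecture; the sketch only demonstrates the shapes and the forward computation.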
57 changes: 0 additions & 57 deletions _freeze/site_libs/revealjs/plugin/multiplex/multiplex.js

This file was deleted.

8 changes: 0 additions & 8 deletions _freeze/site_libs/revealjs/plugin/multiplex/plugin.yml

This file was deleted.

9 changes: 0 additions & 9 deletions _freeze/site_libs/revealjs/plugin/multiplex/socket.io.js

This file was deleted.


0 comments on commit 4eb925e
