From 023d8c6bf65de3868d270472efe8d8122e54d8c4 Mon Sep 17 00:00:00 2001 From: "Daniel J. McDonald" Date: Thu, 12 Oct 2023 14:55:34 -0700 Subject: [PATCH] estimation caching --- .../slides/22-nnets-estimation/execute-results/html.json | 4 ++-- schedule/slides/22-nnets-estimation.qmd | 1 - 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/_freeze/schedule/slides/22-nnets-estimation/execute-results/html.json b/_freeze/schedule/slides/22-nnets-estimation/execute-results/html.json index af20988..5e8b711 100644 --- a/_freeze/schedule/slides/22-nnets-estimation/execute-results/html.json +++ b/_freeze/schedule/slides/22-nnets-estimation/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "bc679338e07c951a98af45617743d3eb", + "hash": "dbbe4d645c0c54d5c4a2b19e0fac6c80", "result": { - "markdown": "---\nlecture: \"22 Neural nets - estimation\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 12 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Neural Network terms again (T hidden layers, regression)\n\n\n::: flex\n::: w-50\n\n$$\n\\begin{aligned}\nA_{k}^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_{\\ell}^{(t)} &= g\\left(\\sum_{k=1}^{K_{t-1}} w^{(t)}_{\\ell,k} A_{k}^{(t-1)} \\right)\\\\\n\\hat{Y} &= z_m = \\sum_{\\ell=1}^{K_T} \\beta_{m,\\ell} A_{\\ell}^{(T)}\\ \\ (M = 1)\n\\end{aligned}\n$$\n\n* $B \\in \\R^{M\\times K_T}$. \n* $M=1$ for regression \n* $\\mathbf{W}_t \\in \\R^{K_2\\times K_1}$ $t=1,\\ldots,T$ \n\n:::\n::: w-50\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n:::\n:::\n\n\n## Training neural networks. 
First, choices\n\n\n\n* Choose the architecture: how many layers, units per layer, what connections?\n\n* Choose the loss: common choices (for each data point $i$)\n\n\nRegression\n: $\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2$ (the 1/2 just makes the derivative nice)\n\nClassification\n: $\\hat{R}_i = I(y_i = m)\\log( 1 + \\exp(-z_{im}))$\n\n* Choose the activation function $g$\n\n\n\n## Training neural networks (intuition)\n\n* We need to estimate $B$, $\\mathbf{W}_t$, $t=1,\\ldots,T$\n\n* We want to minimize $\\hat{R} = \\sum_{i=1}^n \\hat{R}_i$ as a function of all this.\n\n* We use gradient descent, but in this dialect, we call it [back propagation]{.secondary}\n\n::: flex\n::: w-50\n \nDerivatives via the chain\nrule: computed by a forward and backward sweep\n\nAll the $g(u)$'s that get used have $g'(u)$ \"nice\".\n\n\nIf $g$ is ReLu: \n\n* $g(u) = xI(x>0)$\n* $g'(u) = I(x>0)$\n:::\n\n::: w-50\n\nOnce we have derivatives from backprop,\n\n$$\n\\begin{align}\n\\widetilde{B} &\\leftarrow B - \\gamma \\frac{\\partial \\widehat{R}}{\\partial B}\\\\\n\\widetilde{\\mathbf{W}_t} &\\leftarrow \\mathbf{W}_t - \\gamma \\frac{\\partial \\widehat{R}}{\\partial \\mathbf{W}_t}\n\\end{align}\n$$\n\n:::\n:::\n\n\n## Chain rule {.smaller}\n\n\nWe want $\\frac{\\partial}{\\partial B} \\hat{R}_i$ and $\\frac{\\partial}{\\partial W_{t}}\\hat{R}_i$ for all $t$.\n\n[Regression:]{.secondary} $\\hat{R}_i = \\frac{1}{2}(y_i - \\hat{y}_i)^2$\n\n\n$$\\begin{aligned}\n\\frac{\\partial\\hat{R}_i}{\\partial B} &= -(y_i - \\hat{y}_i)\\frac{\\partial \\hat{y_i}}{\\partial B} =\\underbrace{-(y_i - \\hat{y}_i)}_{-r_i} \\mathbf{A}^{(T)}\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_T} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_T} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_T}\\\\ \n&= -\\left(r_i B \\odot g'(\\mathbf{W}_T \\mathbf{A}^{(T)}) \\right) \\left(\\mathbf{A}^{(T-1)}\\right)^\\top\\\\\n\\frac{\\partial}{\\partial \\mathbf{W}_{T-1}} \\hat{R}_i &= -(y_i - \\hat{y}_i)\\frac{\\partial\\hat{y_i}}{\\partial \\mathbf{W}_{T-1}} = -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n&= -r_i \\frac{\\partial \\hat{y}_i}{\\partial \\mathbf{A}^{(T)}} \\frac{\\partial \\mathbf{A}^{(T)}}{\\partial \\mathbf{W}_{T}}\\frac{\\partial \\mathbf{W}_{T}}{\\partial \\mathbf{A}^{(T-1)}}\\frac{\\partial \\mathbf{A}^{(T-1)}}{\\partial \\mathbf{W}_{T-1}}\\\\\n\\cdots &= \\cdots\n\\end{aligned}$$\n\n\n\n## Mapping it out {.smaller}\n\nGiven current $\\mathbf{W}_t, B$, we want to get new, $\\widetilde{\\mathbf{W}}_t,\\ \\widetilde B$ for $t=1,\\ldots,T$\n\n* Squared error for regression, cross-entropy for classification\n\n::: flex\n::: w-50\n\n[Feed forward]{.tertiary} ``{=html}\n\n$$\\mathbf{A}^{(0)} = \\mathbf{X} \\in \\R^{n\\times p}$$\n\nRepeat, $t= 1,\\ldots, T$\n\n1. $\\mathbf{Z}_{t} = \\mathbf{A}^{(t-1)}\\mathbf{W}_t \\in \\R^{n\\times K_t}$\n1. $\\mathbf{A}^{(t)} = g(\\mathbf{Z}_{t})$ (component wise)\n1. 
$\\dot{\\mathbf{A}}^{(t)} = g'(\\mathbf{Z}_t)$\n\n$$\\begin{cases}\n\\hat{\\mathbf{y}} =\\mathbf{A}^{(T)} B \\in \\R^n \\\\\n\\hat{\\Pi} = \\left(1 + \\exp\\left(-\\mathbf{A}^{(T)}\\mathbf{B}\\right)\\right)^{-1} \\in \\R^{n \\times M}\\end{cases}$$\n\n:::\n\n::: w-50\n\n\n[Back propogate]{.secondary} ``{=html}\n\n$$r = \\begin{cases}\n-\\left(\\mathbf{y} - \\widehat{\\mathbf{y}}\\right) \\\\\n-\\left(1 - \\widehat{\\Pi}\\right)[y]\\end{cases}$$\n\n\n$$\n\\begin{aligned}\n\\frac{\\partial}{\\partial \\mathbf{B}} \\widehat{R} &= \\left(\\mathbf{A}^{(T)}\\right)^\\top \\mathbf{r}\\\\\n\\boldsymbol{\\Gamma} &\\leftarrow \\mathbf{r}\\\\\n\\mathbf{W}_{T+1} &\\leftarrow \\mathbf{B}\n\\end{aligned}\n$$\n\n\nRepeat, $t = T,...,1$,\n\n1. $\\boldsymbol{\\Gamma} \\leftarrow \\left(\\boldsymbol{\\Gamma} \\mathbf{W}_{t+1}\\right) \\odot\\dot{\\mathbf{A}}^{(t)}$\n1. $\\frac{\\partial R}{\\partial \\mathbf{W}_t} =\\left(\\mathbf{A}^{(t)}\\right)^\\top \\Gamma$\n\n:::\n:::\n\n\n\n## Deep nets\n\n\nSome comments on adding layers:\n\n- It has been shown that one hidden layer is sufficient to approximate\n any bounded piecewise continuous function\n\n- However, this may take a huge number of hidden units (i.e. $K_1 \\gg 1$). \n\n- This is what people mean when they say that NNets are \"universal approximators\"\n \n- By including multiple layers, we can have fewer hidden units per\n layer. \n \n- Also, we can encode (in)dependencies that can speed computations \n\n- We don't have to connect everything the way we have been\n\n\n## Simple example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestdata <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nnn_out <- neuralnet(y ~ x, data = df, hidden = c(10, 5, 15), threshold = 0.01, rep = 3)\nnn_preds <- map(1:3, ~ compute(nn_out, testdata, .x)$net.result)\nyhat <- nn_preds |> bind_cols() |> rowMeans() # average over the runs\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='22-nnets-estimation_cache/revealjs/unnamed-chunk-2_52e662922e12e41fe8cd6b6f64387e3c'}\n\n```{.r .cell-code code-fold=\"true\"}\n# This code will reproduce the analysis, takes some time\nset.seed(406406406)\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestx <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nlibrary(splines)\nfstar <- sin(1 / testx)\nspline_test_err <- function(k) {\n fit <- lm(y ~ bs(x, df = k), data = df)\n yhat <- predict(fit, newdata = tibble(x = testx))\n mean((yhat - fstar)^2)\n}\nKs <- 1:15 * 10\nSplineErr <- map_dbl(Ks, ~ spline_test_err(.x))\n\nJgrid <- c(5, 10, 15)\nNNerr <- double(length(Jgrid)^3)\nNNplot <- character(length(Jgrid)^3)\nsweep <- 0\nfor (J1 in Jgrid) {\n for (J2 in Jgrid) {\n for (J3 in Jgrid) {\n sweep <- sweep + 1\n NNplot[sweep] <- paste(J1, J2, J3, sep = \" \")\n nn_out <- neuralnet(y ~ x, df,\n hidden = c(J1, J2, J3),\n threshold = 0.01, rep = 3\n )\n nn_results <- sapply(1:3, function(x) {\n compute(nn_out, testx, x)$net.result\n })\n # Run them through the neural network\n Yhat <- rowMeans(nn_results)\n NNerr[sweep] <- mean((Yhat - fstar)^2)\n }\n }\n}\n\nbestK <- Ks[which.min(SplineErr)]\nbestspline <- predict(lm(y ~ bs(x, bestK), data = df), newdata = tibble(x = testx))\nbesthidden <- as.numeric(unlist(strsplit(NNplot[which.min(NNerr)], \" \")))\nnn_out <- neuralnet(y ~ x, df, hidden = besthidden, 
threshold = 0.01, rep = 3)\nnn_results <- sapply(1:3, function(x) compute(nn_out, testdata, x)$net.result)\n# Run them through the neural network\nbestnn <- rowMeans(nn_results)\nplotd <- data.frame(\n x = testdata, spline = bestspline, nnet = bestnn, truth = fstar\n)\nsave.image(file = \"data/nnet-example.Rdata\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](22-nnets-estimation_files/figure-revealjs/fun-nnet-spline-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Different architectures\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](22-nnets-estimation_files/figure-revealjs/nnet-vs-spline-plots-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nOther considerations\n", + "markdown": "---\nlecture: \"22 Neural nets - estimation\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 12 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Neural Network terms again (T hidden layers, regression)\n\n\n::: flex\n::: w-50\n\n$$\n\\begin{aligned}\nA_{k}^{(1)} &= g\\left(\\sum_{j=1}^p w^{(1)}_{k,j} x_j\\right)\\\\\nA_{\\ell}^{(t)} &= g\\left(\\sum_{k=1}^{K_{t-1}} w^{(t)}_{\\ell,k} A_{k}^{(t-1)} \\right)\\\\\n\\hat{Y} &= z_m = \\sum_{\\ell=1}^{K_T} \\beta_{m,\\ell} A_{\\ell}^{(T)}\\ \\ (M = 1)\n\\end{aligned}\n$$\n\n* $B \\in \\R^{M\\times K_T}$. \n* $M=1$ for regression \n* $\\mathbf{W}_t \\in \\R^{K_2\\times K_1}$ $t=1,\\ldots,T$ \n\n:::\n::: w-50\n![](gfx/single-layer-net.svg){fig-align=\"center\" width=500}\n:::\n:::\n\n\n## Training neural networks. 
First, choices\n\n\n\n* Choose the architecture: how many layers, units per layer, what connections?\n\n* Choose the loss: common choices (for each data point $i$)\n\n\nRegression\n: $\hat{R}_i = \frac{1}{2}(y_i - \hat{y}_i)^2$ (the 1/2 just makes the derivative nice)\n\nClassification\n: $\hat{R}_i = I(y_i = m)\log( 1 + \exp(-z_{im}))$\n\n* Choose the activation function $g$\n\n\n\n## Training neural networks (intuition)\n\n* We need to estimate $B$, $\mathbf{W}_t$, $t=1,\ldots,T$\n\n* We want to minimize $\hat{R} = \sum_{i=1}^n \hat{R}_i$ as a function of all this.\n\n* We use gradient descent, but in this dialect, we call it [back propagation]{.secondary}\n\n::: flex\n::: w-50\n \nDerivatives via the chain\nrule: computed by a forward and backward sweep\n\nAll the $g(u)$'s that get used have $g'(u)$ \"nice\".\n\n\nIf $g$ is ReLU: \n\n* $g(u) = uI(u>0)$\n* $g'(u) = I(u>0)$\n:::\n\n::: w-50\n\nOnce we have derivatives from backprop,\n\n$$\n\begin{align}\n\widetilde{B} &\leftarrow B - \gamma \frac{\partial \widehat{R}}{\partial B}\\\n\widetilde{\mathbf{W}_t} &\leftarrow \mathbf{W}_t - \gamma \frac{\partial \widehat{R}}{\partial \mathbf{W}_t}\n\end{align}\n$$\n\n:::\n:::\n\n\n## Chain rule {.smaller}\n\n\nWe want $\frac{\partial}{\partial B} \hat{R}_i$ and $\frac{\partial}{\partial W_{t}}\hat{R}_i$ for all $t$.\n\n[Regression:]{.secondary} $\hat{R}_i = \frac{1}{2}(y_i - \hat{y}_i)^2$\n\n\n$$\begin{aligned}\n\frac{\partial\hat{R}_i}{\partial B} &= -(y_i - \hat{y}_i)\frac{\partial \hat{y_i}}{\partial B} =\underbrace{-(y_i - \hat{y}_i)}_{-r_i} \mathbf{A}^{(T)}\\\n\frac{\partial}{\partial \mathbf{W}_T} \hat{R}_i &= -(y_i - \hat{y}_i)\frac{\partial\hat{y_i}}{\partial \mathbf{W}_T} = -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{W}_T}\\ \n&= -\left(r_i B \odot g'(\mathbf{W}_T \mathbf{A}^{(T-1)}) \right) \left(\mathbf{A}^{(T-1)}\right)^\top\\\n\frac{\partial}{\partial \mathbf{W}_{T-1}} \hat{R}_i &= -(y_i - \hat{y}_i)\frac{\partial\hat{y_i}}{\partial \mathbf{W}_{T-1}} = -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{W}_{T-1}}\\\n&= -r_i \frac{\partial \hat{y}_i}{\partial \mathbf{A}^{(T)}} \frac{\partial \mathbf{A}^{(T)}}{\partial \mathbf{A}^{(T-1)}}\frac{\partial \mathbf{A}^{(T-1)}}{\partial \mathbf{W}_{T-1}}\\\n\cdots &= \cdots\n\end{aligned}$$\n\n\n\n## Mapping it out {.smaller}\n\nGiven current $\mathbf{W}_t, B$, we want to get new, $\widetilde{\mathbf{W}}_t,\ \widetilde B$ for $t=1,\ldots,T$\n\n* Squared error for regression, cross-entropy for classification\n\n::: flex\n::: w-50\n\n[Feed forward]{.tertiary} ``{=html}\n\n$$\mathbf{A}^{(0)} = \mathbf{X} \in \R^{n\times p}$$\n\nRepeat, $t= 1,\ldots, T$\n\n1. $\mathbf{Z}_{t} = \mathbf{A}^{(t-1)}\mathbf{W}_t \in \R^{n\times K_t}$\n1. $\mathbf{A}^{(t)} = g(\mathbf{Z}_{t})$ (component-wise)\n1. 
$\\dot{\\mathbf{A}}^{(t)} = g'(\\mathbf{Z}_t)$\n\n$$\\begin{cases}\n\\hat{\\mathbf{y}} =\\mathbf{A}^{(T)} B \\in \\R^n \\\\\n\\hat{\\Pi} = \\left(1 + \\exp\\left(-\\mathbf{A}^{(T)}\\mathbf{B}\\right)\\right)^{-1} \\in \\R^{n \\times M}\\end{cases}$$\n\n:::\n\n::: w-50\n\n\n[Back propogate]{.secondary} ``{=html}\n\n$$r = \\begin{cases}\n-\\left(\\mathbf{y} - \\widehat{\\mathbf{y}}\\right) \\\\\n-\\left(1 - \\widehat{\\Pi}\\right)[y]\\end{cases}$$\n\n\n$$\n\\begin{aligned}\n\\frac{\\partial}{\\partial \\mathbf{B}} \\widehat{R} &= \\left(\\mathbf{A}^{(T)}\\right)^\\top \\mathbf{r}\\\\\n\\boldsymbol{\\Gamma} &\\leftarrow \\mathbf{r}\\\\\n\\mathbf{W}_{T+1} &\\leftarrow \\mathbf{B}\n\\end{aligned}\n$$\n\n\nRepeat, $t = T,...,1$,\n\n1. $\\boldsymbol{\\Gamma} \\leftarrow \\left(\\boldsymbol{\\Gamma} \\mathbf{W}_{t+1}\\right) \\odot\\dot{\\mathbf{A}}^{(t)}$\n1. $\\frac{\\partial R}{\\partial \\mathbf{W}_t} =\\left(\\mathbf{A}^{(t)}\\right)^\\top \\Gamma$\n\n:::\n:::\n\n\n\n## Deep nets\n\n\nSome comments on adding layers:\n\n- It has been shown that one hidden layer is sufficient to approximate\n any bounded piecewise continuous function\n\n- However, this may take a huge number of hidden units (i.e. $K_1 \\gg 1$). \n\n- This is what people mean when they say that NNets are \"universal approximators\"\n \n- By including multiple layers, we can have fewer hidden units per\n layer. \n \n- Also, we can encode (in)dependencies that can speed computations \n\n- We don't have to connect everything the way we have been\n\n\n## Simple example\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestdata <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nnn_out <- neuralnet(y ~ x, data = df, hidden = c(10, 5, 15), threshold = 0.01, rep = 3)\nnn_preds <- map(1:3, ~ compute(nn_out, testdata, .x)$net.result)\nyhat <- nn_preds |> bind_cols() |> rowMeans() # average over the runs\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\n# This code will reproduce the analysis, takes some time\nset.seed(406406406)\nn <- 200\ndf <- tibble(\n x = seq(.05, 1, length = n),\n y = sin(1 / x) + rnorm(n, 0, .1) # Doppler function\n)\ntestx <- matrix(seq(.05, 1, length.out = 1e3), ncol = 1)\nlibrary(neuralnet)\nlibrary(splines)\nfstar <- sin(1 / testx)\nspline_test_err <- function(k) {\n fit <- lm(y ~ bs(x, df = k), data = df)\n yhat <- predict(fit, newdata = tibble(x = testx))\n mean((yhat - fstar)^2)\n}\nKs <- 1:15 * 10\nSplineErr <- map_dbl(Ks, ~ spline_test_err(.x))\n\nJgrid <- c(5, 10, 15)\nNNerr <- double(length(Jgrid)^3)\nNNplot <- character(length(Jgrid)^3)\nsweep <- 0\nfor (J1 in Jgrid) {\n for (J2 in Jgrid) {\n for (J3 in Jgrid) {\n sweep <- sweep + 1\n NNplot[sweep] <- paste(J1, J2, J3, sep = \" \")\n nn_out <- neuralnet(y ~ x, df,\n hidden = c(J1, J2, J3),\n threshold = 0.01, rep = 3\n )\n nn_results <- sapply(1:3, function(x) {\n compute(nn_out, testx, x)$net.result\n })\n # Run them through the neural network\n Yhat <- rowMeans(nn_results)\n NNerr[sweep] <- mean((Yhat - fstar)^2)\n }\n }\n}\n\nbestK <- Ks[which.min(SplineErr)]\nbestspline <- predict(lm(y ~ bs(x, bestK), data = df), newdata = tibble(x = testx))\nbesthidden <- as.numeric(unlist(strsplit(NNplot[which.min(NNerr)], \" \")))\nnn_out <- neuralnet(y ~ x, df, hidden = besthidden, threshold = 0.01, rep = 3)\nnn_results <- sapply(1:3, function(x) compute(nn_out, testdata, 
x)$net.result)\n# Run them through the neural network\nbestnn <- rowMeans(nn_results)\nplotd <- data.frame(\n x = testdata, spline = bestspline, nnet = bestnn, truth = fstar\n)\nsave.image(file = \"data/nnet-example.Rdata\")\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](22-nnets-estimation_files/figure-revealjs/fun-nnet-spline-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Different architectures\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](22-nnets-estimation_files/figure-revealjs/nnet-vs-spline-plots-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n# Next time...\n\nOther considerations\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/schedule/slides/22-nnets-estimation.qmd b/schedule/slides/22-nnets-estimation.qmd index 219f3fc..e558e65 100644 --- a/schedule/slides/22-nnets-estimation.qmd +++ b/schedule/slides/22-nnets-estimation.qmd @@ -201,7 +201,6 @@ yhat <- nn_preds |> bind_cols() |> rowMeans() # average over the runs ```{r} #| eval: false -#| cache: true #| code-fold: true # This code will reproduce the analysis, takes some time set.seed(406406406)
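
The back-propagation recursion spelled out in the slides above maps almost line for line onto code. Below is a minimal base-R sketch of the feed-forward / back-propagate / gradient-update sweep for the simplest case: one hidden layer, ReLU activation, squared-error loss, and plain gradient descent with a fixed step size. It is illustrative only and is not part of this patch or the lecture code; the names (`relu`, `drelu`, `backprop`, `loss_fn`), the toy data, the initialization, and the step size are all assumptions made for the sketch. Because `B` is stored here as a K x 1 column, the slide's `Gamma W_{T+1}` step appears as `r %*% t(B)`.

```r
## Sketch of the slide's recursion:
##   Z_1 = A^(0) W_1,  A^(1) = g(Z_1),  yhat = A^(1) B,
##   r = -(y - yhat),  dR/dB = t(A^(1)) r,
##   Gamma = (r B') * g'(Z_1),  dR/dW_1 = t(A^(0)) Gamma.
set.seed(1)
relu  <- function(u) u * (u > 0)   # g(u)  = u I(u > 0)
drelu <- function(u) 1 * (u > 0)   # g'(u) = I(u > 0)

n <- 100; p <- 2; K <- 10
X <- matrix(rnorm(n * p), n, p)            # A^(0)
y <- sin(X[, 1]) + rnorm(n, sd = 0.1)      # toy regression target
W1 <- matrix(rnorm(p * K, sd = 0.5), p, K) # hidden-layer weights
B  <- matrix(rnorm(K, sd = 0.5), K, 1)     # output weights (M = 1, regression)

backprop <- function(X, y, W1, B) {
  # feed forward
  Z1 <- X %*% W1                    # Z_1 = A^(0) W_1
  A1 <- relu(Z1)                    # A^(1) = g(Z_1)
  yhat <- drop(A1 %*% B)
  # back propagate
  r <- -(y - yhat)                  # negative residuals
  gB <- t(A1) %*% r                 # dR/dB
  Gamma <- (r %*% t(B)) * drelu(Z1) # error signal passed back through B, then g'
  gW1 <- t(X) %*% Gamma             # dR/dW_1
  list(gW1 = gW1, gB = gB, loss = 0.5 * sum(r^2))
}

# gradient descent: W <- W - gamma * dR/dW,  B <- B - gamma * dR/dB
gamma <- 1e-3
for (i in 1:500) {
  g <- backprop(X, y, W1, B)
  W1 <- W1 - gamma * g$gW1
  B <- B - gamma * g$gB
}
g$loss  # training loss at the last sweep; should have decreased

# sanity check: backprop derivative vs. a finite difference (should agree closely)
loss_fn <- function(W1, B) 0.5 * sum((y - drop(relu(X %*% W1) %*% B))^2)
eps <- 1e-6
W1p <- W1; W1p[1, 1] <- W1p[1, 1] + eps
c(backprop = backprop(X, y, W1, B)$gW1[1, 1],
  finite_diff = (loss_fn(W1p, B) - loss_fn(W1, B)) / eps)
```

Extending the sketch to T hidden layers amounts to storing each A^(t) and g'(Z_t) during the forward sweep and applying the Gamma recursion layer by layer on the way back, as in the "Mapping it out" slide.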