diff --git a/_freeze/schedule/slides/00-classification-losses/execute-results/html.json b/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
index 4df31ef..4b83642 100644
--- a/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
+++ b/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
@@ -1,7 +1,7 @@
{
- "hash": "8b2edae6a14401844329d3d63dfd8c75",
+ "hash": "a52fe0e79bf5e4db92eca003f346d642",
"result": {
- "markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n
\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n), \n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |> filter(phat > .15, phat < .25) |> summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n target obs\n \n1 0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |> \n group_by(bins) |>\n summarise(p = mean(y, na.rm = TRUE), \n se = sqrt(p * (1 - p) / n())) |>\n arrange(p)\n sum_dat$x <- midpts\n \n ggplot(sum_dat, aes(x = x)) + \n geom_errorbar(aes(ymin = pmax(p - 1.96*se, 0), ymax = pmin(p + 1.96*se, 1))) +\n geom_point(aes(y = p), color = blue) + \n geom_abline(slope = 1, intercept = 0, color = orange) +\n ylab(\"observed frequency\") + xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = 'b')\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. 
Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. \n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty), \n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(color = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
+ "markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. 
If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n),\n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |>\n filter(phat > .15, phat < .25) |>\n summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n target obs\n \n1 0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |>\n group_by(bins) |>\n summarise(\n p = mean(y, na.rm = TRUE),\n se = sqrt(p * (1 - p) / n())\n ) |>\n arrange(p)\n sum_dat$x <- midpts\n\n ggplot(sum_dat, aes(x = x)) +\n geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n geom_point(aes(y = p), colour = blue) +\n geom_abline(slope = 1, intercept = 0, colour = orange) +\n ylab(\"observed frequency\") +\n xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. 
\n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty),\n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(colour = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
"supporting": [
"00-classification-losses_files"
],
diff --git a/_freeze/schedule/slides/00-gradient-descent/execute-results/html.json b/_freeze/schedule/slides/00-gradient-descent/execute-results/html.json
new file mode 100644
index 0000000..e32b26c
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/execute-results/html.json
@@ -0,0 +1,20 @@
+{
+ "hash": "4263e78f44b33351fdfdb717c8d23c2b",
+ "result": {
+ "markdown": "---\nlecture: \"00 Gradient descent\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n## Simple optimization techniques\n\n\nWe'll see \"gradient descent\" a few times: \n\n1. solves logistic regression (simple version of IRWLS)\n1. gradient boosting\n1. Neural networks\n\nThis seems like a good time to explain it.\n\nSo what is it and how does it work?\n\n\n## Very basic example\n\n::: flex\n::: w-65\n\nSuppose I want to minimize $f(x)=(x-6)^2$ numerically.\n\nI start at a point (say $x_1=23$)\n\nI want to \"go\" in the negative direction of the gradient.\n\nThe gradient (at $x_1=23$) is $f'(23)=2(23-6)=34$.\n\nMove current value toward current value - 34.\n\n$x_2 = x_1 - \\gamma 34$, for $\\gamma$ small.\n\nIn general, $x_{n+1} = x_n -\\gamma f'(x_n)$.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nniter <- 10\ngam <- 0.1\nx <- double(niter)\nx[1] <- 23\ngrad <- function(x) 2 * (x - 6)\nfor (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])\n```\n:::\n\n\n:::\n\n::: w-35\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-2-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n:::\n\n\n## Why does this work?\n\n\n[Heuristic interpretation:]{.secendary}\n\n* Gradient tells me the slope.\n\n* negative gradient points toward the minimum\n\n* go that way, but not too far (or we'll miss it)\n\n## Why does this work?\n\n[More rigorous interpretation:]{.secondary}\n\n- Taylor expansion\n$$\nf(x) \\approx f(x_0) + \\nabla f(x_0)^{\\top}(x-x_0) + \\frac{1}{2}(x-x_0)^\\top H(x_0) (x-x_0)\n$$\n- replace $H$ with $\\gamma^{-1} I$\n\n- minimize this quadratic approximation in $x$:\n$$\n0\\overset{\\textrm{set}}{=}\\nabla f(x_0) + \\frac{1}{\\gamma}(x-x_0) \\Longrightarrow x = x_0 - \\gamma \\nabla f(x_0)\n$$\n\n\n## Visually\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Visually\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## What $\\gamma$? 
(more details than we have time for)\n\nWhat to use for $\\gamma_k$? \n\n\n[Fixed]{.secondary}\n\n- Only works if $\\gamma$ is exactly right \n- Usually does not work\n\n[Decay on a schedule]{.secondary}\n\n$\\gamma_{k+1} = \\frac{\\gamma_k}{1+ck}$ or $\\gamma_{k} = \\gamma_0 b^k$\n\n[Exact line search]{.secondary}\n\n- Tells you exactly how far to go.\n- At each $k$, solve\n$\\gamma_k = \\arg\\min_{s \\geq 0} f( x^{(k)} - s f(x^{(k-1)}))$\n- Usually can't solve this.\n\n\n\n##\n\n$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx <- matrix(0, 40, 2); x[1, ] <- c(1, 1)\ngrad <- function(x) c(2, 1) * x\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n##\n\n$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngamma <- .1\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-8-1.svg){fig-align='center'}\n:::\n:::\n\n\n##\n\n$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngamma <- .9 # bigger gamma\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-11-1.svg){fig-align='center'}\n:::\n:::\n\n\n##\n\n$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngamma <- .9 # big, but decrease it on schedule\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * .9^k * grad(x[k - 1, ])\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-14-1.svg){fig-align='center'}\n:::\n:::\n\n\n##\n\n$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ngamma <- .5 # theoretically optimal\nfor (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-17-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## When do we stop?\n\nFor $\\epsilon>0$, small\n\n\nCheck any / all of\n\n1. $|f'(x)| < \\epsilon$\n1. $|x^{(k)} - x^{(k-1)}| < \\epsilon$\n1. 
$|f(x^{(k)}) - f(x^{(k-1)})| < \\epsilon$\n\n\n## Stochastic gradient descent\n\nSuppose $f(x) = \\frac{1}{n}\\sum_{i=1}^n f_i(x)$\n\nLike if $f(\\beta) = \\frac{1}{n}\\sum_{i=1}^n (y_i - x^\\top_i\\beta)^2$.\n\nThen $f'(\\beta) = \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)$ \n\nIf $n$ is really big, it may take a long time to compute $f'$\n\nSo, just sample some partition our data into mini-batches $\\mathcal{M}_j$\n\nAnd approximate (imagine the Law of Large Numbers, use a sample to approximate the population)\n$$f'(x) = \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)$$\n\n## SGD\n\n$$\n\\begin{aligned}\nf'(\\beta) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(\\beta) = \\frac{1}{n} \\sum_{i=1}^n -2x_i^\\top(y_i - x^\\top_i\\beta)\\\\\nf'(x) &= \\frac{1}{n}\\sum_{i=1}^n f'_i(x) \\approx \\frac{1}{m}\\sum_{i\\in\\mathcal{M}_j}f'_{i}(x)\n\\end{aligned}\n$$\n\n\nUsually cycle through \"mini-batches\":\n\n* Use a different mini-batch at each iteration of GD\n* Cycle through until we see all the data\n\n\nThis is the workhorse for neural network optimization\n\n\n\n# Practice with GD and Logistic regression\n\n\n## Gradient descent for Logistic regression\n\nSuppose $Y=1$ with probability $p(x)$ and $Y=0$ with probability $1-p(x)$, $x \\in \\R$. \n\nI want to model $P(Y=1| X=x)$. \n\nI'll assume that $\\log\\left(\\frac{p(x)}{1-p(x)}\\right) = ax$ for some scalar $a$. This means that\n$p(x) = \\frac{\\exp(ax)}{1+\\exp(ax)} = \\frac{1}{1+\\exp(-ax)}$\n\n. . .\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nn <- 100\na <- 2\nx <- runif(n, -5, 5)\nlogit <- function(x) 1 / (1 + exp(-x))\np <- logit(a * x)\ny <- rbinom(n, 1, p)\ndf <- tibble(x, y)\nggplot(df, aes(x, y)) +\n geom_point(colour = \"cornflowerblue\") +\n stat_function(fun = ~ logit(a * .x))\n```\n\n::: {.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/generate-data-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Reminder: the likelihood\n\n$$\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n$$\n\n\n$$\n\\begin{aligned}\n\\ell(y | a, x) &= \\log \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i} \n= \\sum_{i=1}^n y_i\\log p(x_i) + (1-y_i)\\log(1-p(x_i))\\\\\n&= \\sum_{i=1}^n\\log(1-p(x_i)) + y_i\\log\\left(\\frac{p(x_i)}{1-p(x_i)}\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(1-p(x_i)\\right)\\\\\n&=\\sum_{i=1}^n ax_i y_i + \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)\n\\end{aligned}\n$$\n\n## Reminder: the likelihood\n\n$$\nL(y | a, x) = \\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n$$\n\nNow, we want the negative of this. Why? \n\nWe would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by $1/n$)\n\n\n$$-\\ell(y | a, x) = \\frac{1}{n}\\sum_{i=1}^n -ax_i y_i - \\log\\left(\\frac{1}{1+\\exp(ax_i)}\\right)$$\n\n## Reminder: the likelihood\n\n$$\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n$$\n\nThis is, in the notation of our slides $f(a)$. \n\nWe want to minimize it in $a$ by gradient descent. \n\nSo we need the derivative with respect to $a$: $f'(a)$. \n\nNow, conveniently, this simplifies a lot. 
\n\n\n$$\n\\begin{aligned}\n\\frac{d}{d a} f(a) &= \\frac{1}{n}\\sum_{i=1}^n -x_i y_i - \\left(-\\frac{x_i \\exp(ax_i)}{1+\\exp(ax_i)}\\right)\\\\\n&=\\frac{1}{n}\\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \\frac{1}{n}\\sum_{i=1}^n -x_i(y_i-p(x_i)).\n\\end{aligned}\n$$\n\n## Reminder: the likelihood\n\n$$\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n$$\n\n(Simple) gradient descent to minimize $-\\ell(a)$ or maximize $L(y|a,x)$ is:\n\n1. Input $a_1,\\ \\gamma>0,\\ j_\\max,\\ \\epsilon>0,\\ \\frac{d}{da} -\\ell(a)$.\n2. For $j=1,\\ 2,\\ \\ldots,\\ j_\\max$,\n$$a_j = a_{j-1} - \\gamma \\frac{d}{da} (-\\ell(a_{j-1}))$$\n3. Stop if $\\epsilon > |a_j - a_{j-1}|$ or $|d / da\\ \\ell(a)| < \\epsilon$.\n\n## Reminder: the likelihood\n\n$$\n\\frac{1}{n}L(y | a, x) = \\frac{1}{n}\\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\\textrm{ and }\np(x) = \\frac{1}{1+\\exp(-ax)}\n$$\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"1,10-11|2-3|4-9|\"}\namle <- function(x, y, a0, gam = 0.5, jmax = 50, eps = 1e-6) {\n a <- double(jmax) # place to hold stuff (always preallocate space)\n a[1] <- a0 # starting value\n for (j in 2:jmax) { # avoid possibly infinite while loops\n px <- logit(a[j - 1] * x)\n grad <- mean(-x * (y - px))\n a[j] <- a[j - 1] - gam * grad\n if (abs(grad) < eps || abs(a[j] - a[j - 1]) < eps) break\n }\n a[1:j]\n}\n```\n:::\n\n\n\n\n## Try it:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nround(too_big <- amle(x, y, 5, 50), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 5.000 3.360 2.019 1.815 2.059 1.782 2.113 1.746 2.180 1.711 2.250 1.684\n[13] 2.309 1.669 2.344 1.663 2.359 1.661 2.364 1.660 2.365 1.660 2.366 1.660\n[25] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[37] 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660 2.366 1.660\n[49] 2.366 1.660\n```\n:::\n\n```{.r .cell-code}\nround(too_small <- amle(x, y, 5, 1), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 5.000 4.967 4.934 4.902 4.869 4.837 4.804 4.772 4.739 4.707 4.675 4.643\n[13] 4.611 4.579 4.547 4.515 4.483 4.451 4.420 4.388 4.357 4.326 4.294 4.263\n[25] 4.232 4.201 4.170 4.140 4.109 4.078 4.048 4.018 3.988 3.957 3.927 3.898\n[37] 3.868 3.838 3.809 3.779 3.750 3.721 3.692 3.663 3.635 3.606 3.578 3.550\n[49] 3.522 3.494\n```\n:::\n\n```{.r .cell-code}\nround(just_right <- amle(x, y, 5, 10), 3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 5.000 4.672 4.351 4.038 3.735 3.445 3.171 2.917 2.688 2.488 2.322 2.191\n[13] 2.094 2.027 1.983 1.956 1.940 1.930 1.925 1.922 1.920 1.919 1.918 1.918\n[25] 1.918 1.918 1.918 1.917 1.917 1.917 1.917\n```\n:::\n:::\n\n\n\n## Visual\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nnegll <- function(a) {\n -a * mean(x * y) -\n rowMeans(log(1 / (1 + exp(outer(a, x)))))\n}\nblah <- list_rbind(\n map(\n rlang::dots_list(\n too_big, too_small, just_right, .named = TRUE\n ), \n as_tibble),\n names_to = \"gamma\"\n) |> mutate(negll = negll(value))\nggplot(blah, aes(value, negll)) +\n geom_point(aes(colour = gamma)) +\n facet_wrap(~gamma, ncol = 1) +\n stat_function(fun = negll, xlim = c(-2.5, 5)) +\n scale_y_log10() + \n xlab(\"a\") + \n ylab(\"negative log likelihood\") +\n geom_vline(xintercept = tail(just_right, 1)) +\n scale_colour_brewer(palette = \"Set1\") +\n theme(legend.position = \"none\")\n```\n\n::: 
{.cell-output-display}\n![](00-gradient-descent_files/figure-revealjs/unnamed-chunk-18-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Check vs. `glm()`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(glm(y ~ x - 1, family = \"binomial\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nglm(formula = y ~ x - 1, family = \"binomial\")\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \nx 1.9174 0.4785 4.008 6.13e-05 ***\n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 138.629 on 100 degrees of freedom\nResidual deviance: 32.335 on 99 degrees of freedom\nAIC: 34.335\n\nNumber of Fisher Scoring iterations: 7\n```\n:::\n:::\n",
+ "supporting": [
+ "00-gradient-descent_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {
+ "include-after-body": [
+ "\n\n\n"
+ ]
+ },
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/generate-data-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/generate-data-1.svg
new file mode 100644
index 0000000..bcff84a
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/generate-data-1.svg
@@ -0,0 +1,340 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-10-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-10-1.svg
new file mode 100644
index 0000000..0be094a
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-10-1.svg
@@ -0,0 +1,375 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-11-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-11-1.svg
new file mode 100644
index 0000000..bc95a3d
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-11-1.svg
@@ -0,0 +1,385 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-13-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-13-1.svg
new file mode 100644
index 0000000..b329028
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-13-1.svg
@@ -0,0 +1,375 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-14-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-14-1.svg
new file mode 100644
index 0000000..59d7ece
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-14-1.svg
@@ -0,0 +1,385 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-16-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-16-1.svg
new file mode 100644
index 0000000..17e9b91
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-16-1.svg
@@ -0,0 +1,375 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-17-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-17-1.svg
new file mode 100644
index 0000000..52a36ae
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-17-1.svg
@@ -0,0 +1,385 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-18-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-18-1.svg
new file mode 100644
index 0000000..1459783
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-18-1.svg
@@ -0,0 +1,720 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-2-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-2-1.svg
new file mode 100644
index 0000000..52ed4cf
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-2-1.svg
@@ -0,0 +1,258 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-1.svg
new file mode 100644
index 0000000..1f398d0
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-1.svg
@@ -0,0 +1,288 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-2.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-2.svg
new file mode 100644
index 0000000..da31a52
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-3-2.svg
@@ -0,0 +1,292 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-4-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-4-1.svg
new file mode 100644
index 0000000..da31a52
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-4-1.svg
@@ -0,0 +1,292 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-5-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-5-1.svg
new file mode 100644
index 0000000..dfd9722
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-5-1.svg
@@ -0,0 +1,334 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-6-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-6-1.svg
new file mode 100644
index 0000000..de1d0da
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-6-1.svg
@@ -0,0 +1,329 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-7-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-7-1.svg
new file mode 100644
index 0000000..f22e835
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-7-1.svg
@@ -0,0 +1,375 @@
+
+
diff --git a/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-8-1.svg b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-8-1.svg
new file mode 100644
index 0000000..2f08f97
--- /dev/null
+++ b/_freeze/schedule/slides/00-gradient-descent/figure-revealjs/unnamed-chunk-8-1.svg
@@ -0,0 +1,385 @@
+
+
diff --git a/schedule/slides/00-classification-losses.qmd b/schedule/slides/00-classification-losses.qmd
index 6c16630..ff2f7c0 100644
--- a/schedule/slides/00-classification-losses.qmd
+++ b/schedule/slides/00-classification-losses.qmd
@@ -70,13 +70,15 @@ Of course, we didn't predict **exactly** $\hat{p}(x_i)=0.2$ ever, so lets look a
#| output-location: fragment
n <- 250
dat <- tibble(
- x = seq(-5, 5, length.out = n),
+ x = seq(-5, 5, length.out = n),
p = 1 / (1 + exp(-x)),
y = rbinom(n, 1, p)
)
fit <- glm(y ~ x, family = binomial, data = dat)
dat$phat <- predict(fit, type = "response") # predicted probabilities
-dat |> filter(phat > .15, phat < .25) |> summarize(target = .2, obs = mean(y))
+dat |>
+ filter(phat > .15, phat < .25) |>
+ summarize(target = .2, obs = mean(y))
```
@@ -88,20 +90,23 @@ binary_calibration_plot <- function(y, phat, nbreaks = 10) {
mutate(bins = cut_number(phat, n = nbreaks))
midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)
midpts <- midpts[-length(midpts)] + diff(midpts) / 2
- sum_dat <- dat |>
+ sum_dat <- dat |>
group_by(bins) |>
- summarise(p = mean(y, na.rm = TRUE),
- se = sqrt(p * (1 - p) / n())) |>
+ summarise(
+ p = mean(y, na.rm = TRUE),
+ se = sqrt(p * (1 - p) / n())
+ ) |>
arrange(p)
sum_dat$x <- midpts
-
- ggplot(sum_dat, aes(x = x)) +
- geom_errorbar(aes(ymin = pmax(p - 1.96*se, 0), ymax = pmin(p + 1.96*se, 1))) +
- geom_point(aes(y = p), color = blue) +
- geom_abline(slope = 1, intercept = 0, color = orange) +
- ylab("observed frequency") + xlab("average predicted probability") +
+
+ ggplot(sum_dat, aes(x = x)) +
+ geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +
+ geom_point(aes(y = p), colour = blue) +
+ geom_abline(slope = 1, intercept = 0, colour = orange) +
+ ylab("observed frequency") +
+ xlab("average predicted probability") +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
- geom_rug(data = dat, aes(x = phat), sides = 'b')
+ geom_rug(data = dat, aes(x = phat), sides = "b")
}
```
@@ -179,13 +184,13 @@ roc <- function(prediction, y) {
noty <- rev(tapply(noty, preds, sum))
}
tibble(
- FPR = cumsum(noty) / sum(noty),
+ FPR = cumsum(noty) / sum(noty),
TPR = cumsum(y) / sum(y)
)
}
ggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +
- geom_step(color = blue, size = 2) +
+ geom_step(colour = blue, size = 2) +
geom_abline(slope = 1, intercept = 0)
```
diff --git a/schedule/slides/00-gradient-descent.Rmd b/schedule/slides/00-gradient-descent.Rmd
deleted file mode 100644
index 942295a..0000000
--- a/schedule/slides/00-gradient-descent.Rmd
+++ /dev/null
@@ -1,471 +0,0 @@
----
-title: "00 Gradient descent"
-author:
- - "STAT 406"
- - "Daniel J. McDonald"
-date: 'Last modified - `r Sys.Date()`'
----
-
-## Simple optimization techniques
-
-$$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
-\newcommand{\Var}[1]{\mathbb{V}\left[ #1 \right]}
-\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
-\newcommand{\given}{\ \vert\ }
-\newcommand{\argmin}{\arg\min}
-\newcommand{\argmax}{\arg\max}
-\newcommand{\R}{\mathbb{R}}
-\newcommand{\P}{\mathbb{P}}
-\renewcommand{\hat}{\widehat}
-\newcommand{\tr}[1]{\mbox{tr}(#1)}
-\newcommand{\X}{\mathbf{X}}$$
-
-
-```{r setup, include=FALSE, warning=FALSE, message=FALSE}
-source("rmd_config.R")
-```
-
-
-```{r css-extras, file="css-extras.R", echo=FALSE}
-```
-
-
-We'll see "gradient descent" a few times:
-
-1. solves logistic regression (simple version of IRWLS)
-
-2. gradient boosting
-
-3. Neural networks
-
-This seems like a good time to explain it.
-
-So what is it and how does it work?
-
----
-
-## Very basic example
-
-.pull-left-wide[
-Suppose I want to minimize $f(x)=(x-6)^2$ numerically.
-
-I start at a point (say $x_1=23$)
-
-I want to "go" in the negative direction of the gradient.
-
-The gradient (at $x_1=23$) is $f'(23)=2(23-6)=34$.
-
-Move current value toward current value - 34.
-
-$x_2 = x_1 - \gamma 34$, for $\gamma$ small.
-
-In general, $x_{n+1} = x_n -\gamma f'(x_n)$.
-
-```{r}
-niter <- 10
-gam <- 0.1
-x <- double(niter)
-x[1] <- 23
-grad <- function(x) 2 * (x - 6)
-for (i in 2:niter) x[i] = x[i - 1] - gam * grad(x[i - 1])
-```
-
-
-]
-
-.pull-right-narrow[
-
-
-```{r echo=FALSE, fig.height=4,fig.width=4,fig.align='center'}
-ggplot(data.frame(x=x, y=(x-6)^2)) + geom_path(aes(x,y)) +
- geom_point(aes(x,y)) +
- coord_cartesian(xlim=c(6-24,24),ylim=c(0,300)) +
- geom_vline(xintercept = 6, color=red,linetype="dotted") +
- geom_hline(yintercept = 0,color=red,linetype="dotted") +
- stat_function(data=data.frame(x=c(6-24,24)),aes(x), fun=function(x) (x-6)^2, color=blue, alpha=.4) +
- ylab(bquote(f(x)))
-```
-
-]
-
----
-
-## Why does this work?
-
-__Heuristic interpretation:__
-
-* Gradient tells me the slope.
-
-* negative gradient points toward the minimum
-
-* go that way, but not too far (or we'll miss it)
-
---
-
-__More rigorous interpretation:__
-
-- Taylor expansion
-$$
-f(x) \approx f(x_0) + \nabla f(x_0)^{\top}(x-x_0) + \frac{1}{2}(x-x_0)^\top H(x_0) (x-x_0)
-$$
-- replace $H$ with $\gamma^{-1} I$
-
-- minimize the quadratic approximation in $x$:
-$$
-0\overset{\textrm{set}}{=}\nabla f(x_0) + \frac{1}{\gamma}(x-x_0) \Longrightarrow x = x_0 - \gamma \nabla f(x_0)
-$$
-
----
-layout: true
-
-## Visually
-
----
-
-```{r, fig.height=6, fig.width=12, fig.align="center", echo=FALSE}
-f <- function(x) (x-1)^2*(x>1) + log(1+exp(-2*x))
-fp <- function(x) 2*(x-1)*(x>1) -2/(1+exp(2*x))
-quad <- function(x, x0, gam=.1) f(x0) + fp(x0)*(x-x0) + 1/(2*gam)*(x-x0)^2
-x = c(-1.75,-1,-.5)
-
-ggplot(data.frame(x=c(-2,3)), aes(x)) +
- stat_function(fun=f, color=blue) +
- geom_point(data=data.frame(x=x,y=f(x)),aes(x,y),color=red) +
- stat_function(fun=quad, args = list(x0=-1.75), color=red) +
- stat_function(fun=quad, args = list(x0=-1), color=red) +
- stat_function(fun=quad, args = list(x0=-.5), color=red) +
- coord_cartesian(ylim=c(0,4)) + ggtitle("gamma = 0.1")
-```
-
----
-
-```{r, fig.height=6, fig.width=12, fig.align="center", echo=FALSE}
-ggplot(data.frame(x=c(-2,3)), aes(x)) +
- stat_function(fun=f, color=blue) +
- geom_point(data=data.frame(x=x,y=f(x)),aes(x,y),color=red) +
- stat_function(fun=quad, args = list(x0=-1.75, gam = .25), color=red) +
- stat_function(fun=quad, args = list(x0=-1, gam = .25), color=red) +
- stat_function(fun=quad, args = list(x0=-.5, gam = .25), color=red) +
- coord_cartesian(ylim=c(0,4)) + ggtitle("gamma = 0.25")
-```
-
-
----
-layout: false
-
-## What $\gamma$? (more details than we have time for)
-
-What to use for $\gamma_k$?
-
-
-__Fixed__
-
-- Only works if $\gamma$ is exactly right
-- Usually does not work
-
-__Decay on a schedule__
-
-$$\gamma_{k+1} = \frac{\gamma_k}{1+ck} \quad \textrm{or} \quad \gamma_{k} = \gamma_0 b^k$$
-
-
-
-__Exact line search__
-
-- Tells you exactly how far to go.
-- At each $k$, solve
-$\gamma_k = \arg\min_{s \geq 0} f( x^{(k)} - s f(x^{(k-1)}))$
-- Usually can't solve this.
-
-
-
----
-layout: true
-
-$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
-
-```{r, message=FALSE, echo=TRUE, fig.align='center', fig.align='center'}
-x <- matrix(0, 40, 2)
-x[1, ] <- c(1, 1)
-grad <- function(x) c(2, 1) * x
-```
-
----
-
-```{r, message=FALSE, echo=FALSE, fig.align='center', fig.height=5, fig.width=8}
-df = expand.grid(b1 = seq(-1,1,length.out = 100), b2 = seq(-1,1,length.out = 100))
-df = df %>% mutate(f = b1^2 + b2^2/2)
-g = ggplot(df, aes(b1,b2)) + geom_raster(aes(fill=log10(f))) + scale_fill_viridis_c() + xlab('x1') + ylab('x2')
-g
-```
-
----
-
-
-```{r, message=FALSE, echo=TRUE}
-gamma <- .1
-for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])
-```
-
-```{r, echo=FALSE, fig.align='center', fig.height=4, fig.width=8}
-b = tibble(b1=x[,1],b2=x[,2])
-g + geom_path(data=b,aes(b1,b2)) +
- geom_point(data=b,aes(b1,b2),size=2)
-```
-
----
-
-```{r, message=FALSE, echo=FALSE}
-x <- matrix(0, 40, 2)
-x[1, ] <- c(1, 1)
-```
-
-```{r, echo=TRUE}
-gamma <- .9 # bigger gamma
-for (k in 2:40) x[k,] <- x[k - 1,] - gamma * grad(x[k-1,])
-```
-
-```{r, echo=FALSE, fig.align='center', fig.height=4, fig.width=8}
-b = tibble(b1=x[,1],b2=x[,2])
-g + geom_path(data=b,aes(b1,b2)) +
- geom_point(data=b,aes(b1,b2),size=2) +
- coord_cartesian(c(-1,1),c(-1,1))
-```
-
----
-
-
-```{r, message=FALSE, echo=FALSE, fig.align='center'}
-x <- matrix(0, 40, 2)
-x[1,] <- c(1,1)
-```
-
-```{r, echo=TRUE}
-gamma <- .9 # big, but decrease it on schedule
-for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * .9^k * grad(x[k - 1, ])
-```
-
-```{r, echo=FALSE, fig.align='center', fig.height=4, fig.width=8}
-b = tibble(b1=x[,1],b2=x[,2])
-g + geom_path(data=b,aes(b1,b2)) + geom_point(data=b,aes(b1,b2),size=2)
-```
-
----
-
-
-```{r, message=FALSE, echo=FALSE, fig.align='center'}
-x <- matrix(0, 40, 2)
-x[1,] <- c(1,1)
-```
-
-```{r, echo=TRUE}
-gamma <- .5 # theoretically optimal
-for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])
-```
-
-```{r, echo=FALSE, fig.align='center', fig.height=4, fig.width=8}
-b = tibble(b1=x[,1],b2=x[,2])
-g + geom_path(data=b,aes(b1,b2)) + geom_point(data=b,aes(b1,b2),size=2)
-```
-
-
-
----
-layout: false
-
-## When do we stop?
-
-For $\epsilon>0$, small
-
-
-Check any / all of
-
-1. $|f'(x)| < \epsilon$
-
-2. $|x^{(k)} - x^{(k-1)}| < \epsilon$
-
-3. $|f(x^{(k)}) - f(x^{(k-1)})| < \epsilon$
-
----
-
-## Stochastic gradient descent
-
-Suppose $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$
-
-Like if $f(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x^\top_i\beta)^2$.
-
-Then $f'(\beta) = \frac{1}{n}\sum_{i=1}^n f'_i(\beta) = \frac{1}{n} \sum_{i=1}^n -2x_i^\top(y_i - x^\top_i\beta)$
-
-If $n$ is really big, it may take a long time to compute $f'$
-
-So, just sample some partition our data into mini-batches $\mathcal{M}_j$
-
-And approximate (imagine the Law of Large Numbers, use a sample to approximate the population)
-$$f'(x) = \frac{1}{n}\sum_{i=1}^n f'_i(x) \approx \frac{1}{m}\sum_{i\in\mathcal{M}_j}f'_{i}(x)$$
-
-Usually cycle through "mini-batches":
-* Use a different mini-batch at each iteration of GD
-* Cycle through until we see all the data
-
---
-
-This is the workhorse for neural network optimization
-
----
-class: center, middle, inverse
-
-# Practice with GD and Logistic regression
-
----
-
-## Gradient descent for Logistic regression
-
-Suppose $Y=1$ with probability $p(x)$ and $Y=0$ with probability $1-p(x)$, $x \in \R$.
-
-I want to model $P(Y=1| X=x)$.
-
-I'll assume that $\log\left(\frac{p(x)}{1-p(x)}\right) = ax$ for some scalar $a$. This means that
-$p(x) = \frac{\exp(ax)}{1+\exp(ax)} = \frac{1}{1+\exp(-ax)}$
-
-
---
-
-.pull-left[
-```{r generate-data}
-n <- 100
-a <- 2
-x <- runif(n, -5, 5)
-logit <- function(x) 1 / (1 + exp(-x))
-p <- logit(a * x)
-y <- rbinom(n, 1, p)
-df <- tibble(x, y)
-g <- ggplot(df, aes(x, y)) +
- geom_point(color="cornflowerblue") +
- stat_function(fun = ~ logit(a * .x))
-```
-]
-
-
-.pull-right[
-```{r, echo=FALSE, fig.height=4}
-g
-```
-]
-
----
-layout: true
-
-
-## Reminder: the likelihood
-
-$$L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}$$
-
-Remember, we defined $p(x) = \frac{\exp(ax)}{1+\exp(ax)} = \frac{1}{1+\exp(-ax)}$.
-
----
-
-The log likelihood is therefore
-
-$$
-\begin{aligned}
-\ell(y | a, x) &= \log \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i} \\
-&= \sum_{i=1}^n y_i\log p(x_i) + (1-y_i)\log(1-p(x_i))\\
-&= \sum_{i=1}^n\log(1-p(x_i)) + y_i\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\\
-&=\sum_{i=1}^n ax_i y_i + \log\left(1-p(x_i)\right)\\
-&=\sum_{i=1}^n ax_i y_i + \log\left(\frac{1}{1+\exp(ax_i)}\right)
-\end{aligned}
-$$
-
----
-
-Now, we want the negative of this. Why?
-
-We would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood.
-
-
-$$-\ell(y | a, x) = \sum_{i=1}^n -ax_i y_i - \log\left(\frac{1}{1+\exp(ax_i)}\right)$$
-
----
-
-This is, in the notation of our slides $f(a)$.
-
-We want to minimize it in $a$ by gradient descent.
-
-So we need the derivative with respect to $a$: $f'(a)$.
-
-Now, conveniently, this simplifies a lot.
-
-
-$$\frac{d}{d a} f(a) = \sum_{i=1}^n -x_i y_i - \left(-\frac{x_i \exp(ax_i)}{1+\exp(ax_i)}\right) =
-\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \sum_{i=1}^n -x_i(y_i-p(x_i)).$$
-
----
-
-(Simple) gradient descent to minimize $-\ell(a)$ or maximize $L(y|a,x)$ is:
-
-1. Input $a_1,\ \gamma>0,\ j_\max,\ \epsilon>0,\ \frac{d}{da} -\ell(a)$.
-2. For $j=1,\ 2,\ \ldots,\ j_\max$,
-$$a_j = a_{j-1} - \gamma \frac{d}{da} (-\ell(a_{j-1}))$$
-3. Stop if $\epsilon > |a_j - a_{j-1}|$ or $|d / da\ \ell(a)| < \epsilon$.
-
----
-
-```{r amlecorrect, echo=TRUE}
-amle <- function(x, y, a0, gam = 0.5, jmax = 50, eps = 1e-6){
- a <- double(jmax) # place to hold stuff (always preallocate space)
- a[1] <- a0 # starting value
- for (j in 2:jmax) { # avoid possibly infinite while loops
- px <- logit(a[j - 1] * x)
- grad <- sum( -x * (y - px) )
- a[j] <- a[j - 1] - gam * grad
- if (abs(grad) < eps || abs(a[j] - a[j-1]) < eps) break
- }
- a[1:j]
-}
-```
-
----
-layout: false
-
-## Try it:
-
-```{r ourmle1}
-round(too_big <- amle(x, y, 5, .5), 3)
-round(too_small <- amle(x, y, 5, .01), 3)
-round(just_right <- amle(x, y, 5, .1), 3)
-```
-
----
-
-## Visual
-
-
-.pull-left[
-```{r}
-negll <- function(a) {
- -a * sum(x*y) -
- rowSums(log(1 / (1 + exp(outer(a, x)))))
-}
-tb <- tibble(a = too_big, negll = negll(a))
-ts <- tibble(a = too_small, negll = negll(a))
-jr <- tibble(a = just_right, negll = negll(a))
-ppp <- bind_rows(too_big = tb,
- too_small = ts,
- just_right = jr,
- .id = "gamma")
-g <- ggplot(ppp, aes(a, negll)) +
- geom_point(aes(color = gamma)) +
- facet_wrap(~gamma, ncol = 1) +
- stat_function(fun = negll, xlim = c(-2.5,5)) +
- scale_y_log10() +
- scale_color_brewer(palette = "Set1") +
- theme(legend.position = "bottom")
-summary(glm(y ~ x - 1, family = "binomial"))
-```
-]
-
-.pull-right[
-```{r}
-g
-```
-]
-
diff --git a/schedule/slides/00-gradient-descent.html b/schedule/slides/00-gradient-descent.html
deleted file mode 100644
index 1ae4c35..0000000
--- a/schedule/slides/00-gradient-descent.html
+++ /dev/null
@@ -1,646 +0,0 @@
-
-
-
- 00 Gradient descent
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
diff --git a/schedule/slides/00-gradient-descent.qmd b/schedule/slides/00-gradient-descent.qmd
new file mode 100644
index 0000000..7e584c5
--- /dev/null
+++ b/schedule/slides/00-gradient-descent.qmd
@@ -0,0 +1,484 @@
+---
+lecture: "00 Gradient descent"
+format: revealjs
+metadata-files:
+ - _metadata.yml
+---
+
+{{< include _titleslide.qmd >}}
+
+## Simple optimization techniques
+
+
+We'll see "gradient descent" a few times:
+
+1. solves logistic regression (simple version of IRWLS)
+1. gradient boosting
+1. Neural networks
+
+This seems like a good time to explain it.
+
+So what is it and how does it work?
+
+
+## Very basic example
+
+::: flex
+::: w-65
+
+Suppose I want to minimize $f(x)=(x-6)^2$ numerically.
+
+I start at a point (say $x_1=23$)
+
+I want to "go" in the negative direction of the gradient.
+
+The gradient (at $x_1=23$) is $f'(23)=2(23-6)=34$.
+
+Move the current value toward (current value - 34).
+
+$x_2 = x_1 - \gamma 34$, for $\gamma$ small.
+
+In general, $x_{n+1} = x_n -\gamma f'(x_n)$.
+
+```{r}
+niter <- 10
+gam <- 0.1
+x <- double(niter)
+x[1] <- 23
+grad <- function(x) 2 * (x - 6)
+for (i in 2:niter) x[i] <- x[i - 1] - gam * grad(x[i - 1])
+```
+
+:::
+
+::: w-35
+
+
+```{r echo=FALSE}
+#| fig-width: 5
+#| fig-height: 8
+ggplot(data.frame(x = x, y = (x - 6)^2)) +
+ geom_path(aes(x, y)) +
+ geom_point(aes(x, y)) +
+ coord_cartesian(xlim = c(6 - 24, 24), ylim = c(0, 300)) +
+ geom_vline(xintercept = 6, colour = red, linetype = "dotted") +
+ geom_hline(yintercept = 0, colour = red, linetype = "dotted") +
+ stat_function(data = data.frame(x = c(6 - 24, 24)), aes(x),
+ fun = function(x) (x - 6)^2, colour = blue, alpha = .4) +
+ ylab(bquote(f(x)))
+```
+
+:::
+:::
+
+
+## Why does this work?
+
+
+[Heuristic interpretation:]{.secondary}
+
+* Gradient tells me the slope.
+
+* Negative gradient points toward the minimum.
+
+* Go that way, but not too far (or we'll miss it).
+
+## Why does this work?
+
+[More rigorous interpretation:]{.secondary}
+
+- Taylor expansion
+$$
+f(x) \approx f(x_0) + \nabla f(x_0)^{\top}(x-x_0) + \frac{1}{2}(x-x_0)^\top H(x_0) (x-x_0)
+$$
+- replace $H$ with $\gamma^{-1} I$
+
+- minimize this quadratic approximation in $x$:
+$$
+0\overset{\textrm{set}}{=}\nabla f(x_0) + \frac{1}{\gamma}(x-x_0) \Longrightarrow x = x_0 - \gamma \nabla f(x_0)
+$$
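+
+A quick numerical sanity check of this equivalence, reusing $f(x)=(x-6)^2$ from the basic example: minimizing the quadratic surrogate (a throwaway helper here) with `optimize()` lands on the same point as the closed-form step $x_0 - \gamma \nabla f(x_0)$.
+
+```{r quadratic-surrogate-check}
+# Sketch: minimizing the quadratic surrogate reproduces the gradient-descent step
+f <- function(x) (x - 6)^2
+fp <- function(x) 2 * (x - 6) # gradient of f
+surrogate <- function(x, x0, gam) { # Taylor expansion with H replaced by 1 / gam
+  f(x0) + fp(x0) * (x - x0) + (x - x0)^2 / (2 * gam)
+}
+x0 <- 23
+gam <- 0.1
+optimize(surrogate, interval = c(-100, 100), x0 = x0, gam = gam)$minimum
+x0 - gam * fp(x0) # closed-form step; the two should match
+```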
+
+
+## Visually
+
+```{r}
+#| echo: false
+f <- function(x) (x - 1)^2 * (x > 1) + log(1 + exp(-2 * x))
+fp <- function(x) 2 * (x - 1) * (x > 1) - 2 / (1 + exp(2 * x))
+quad <- function(x, x0, gam = .1) f(x0) + fp(x0) * (x - x0) + 1 / (2 * gam) * (x - x0)^2
+x <- c(-1.75, -1, -.5)
+
+ggplot(data.frame(x = c(-2, 3)), aes(x)) +
+ stat_function(fun = f, colour = blue) +
+ geom_point(data = data.frame(x = x, y = f(x)), aes(x, y), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -1.75), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -1), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -.5), colour = red) +
+ coord_cartesian(ylim = c(0, 4)) +
+ ggtitle("gamma = 0.1")
+```
+
+## Visually
+
+```{r}
+#| echo: false
+ggplot(data.frame(x = c(-2, 3)), aes(x)) +
+ stat_function(fun = f, colour = blue) +
+ geom_point(data = data.frame(x = x, y = f(x)), aes(x, y), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -1.75, gam = .25), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -1, gam = .25), colour = red) +
+ stat_function(fun = quad, args = list(x0 = -.5, gam = .25), colour = red) +
+ coord_cartesian(ylim = c(0, 4)) +
+ ggtitle("gamma = 0.25")
+```
+
+
+## What $\gamma$? (more details than we have time for)
+
+What to use for $\gamma_k$?
+
+
+[Fixed]{.secondary}
+
+- Only works if $\gamma$ is exactly right
+- Usually does not work
+
+[Decay on a schedule]{.secondary}
+
+$\gamma_{k+1} = \frac{\gamma_k}{1+ck}$ or $\gamma_{k} = \gamma_0 b^k$
+
+[Exact line search]{.secondary}
+
+- Tells you exactly how far to go.
+- At each $k$, solve
+$\gamma_k = \arg\min_{s \geq 0} f( x^{(k)} - s \nabla f(x^{(k)}))$
+- Usually can't solve this.
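+
+The minimization over $s$ is one-dimensional, though, so it can at least be approximated numerically. A rough sketch with `optimize()`, reusing $f(x)=(x-6)^2$ from the basic example:
+
+```{r line-search-sketch}
+# Sketch: numerical line search on the 1-D example
+f <- function(x) (x - 6)^2
+fp <- function(x) 2 * (x - 6)
+x <- 23
+for (k in 1:5) {
+  g <- fp(x)
+  if (abs(g) < 1e-10) break # already at the minimum
+  # pick the step that minimizes f along the descent direction
+  gam <- optimize(function(s) f(x - s * g), interval = c(0, 1))$minimum
+  x <- x - gam * g
+}
+x # essentially the minimizer, 6
+```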
+
+
+
+##
+
+$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
+
+```{r, message=FALSE, echo=TRUE}
+x <- matrix(0, 40, 2); x[1, ] <- c(1, 1)
+grad <- function(x) c(2, 1) * x
+```
+
+
+```{r, message=FALSE, echo=FALSE}
+#| fig-width: 8
+#| fig-height: 4
+df <- expand.grid(b1 = seq(-1, 1, length.out = 100), b2 = seq(-1, 1, length.out = 100))
+df <- df %>% mutate(f = b1^2 + b2^2 / 2)
+g <- ggplot(df, aes(b1, b2)) +
+ geom_raster(aes(fill = log10(f))) +
+ scale_fill_viridis_c() +
+ xlab("x1") +
+ ylab("x2") +
+ coord_cartesian(expand = FALSE, ylim = c(-1, 1), xlim = c(-1, 1))
+g
+```
+
+##
+
+$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
+
+```{r, message=FALSE, echo=TRUE}
+gamma <- .1
+for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])
+```
+
+```{r, echo=FALSE}
+#| fig-width: 8
+#| fig-height: 4
+b <- tibble(b1 = x[, 1], b2 = x[, 2])
+g + geom_path(data = b, aes(b1, b2)) +
+ geom_point(data = b, aes(b1, b2), size = 2)
+```
+
+##
+
+$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
+
+```{r, message=FALSE, echo=FALSE}
+x <- matrix(0, 40, 2)
+x[1, ] <- c(1, 1)
+```
+
+```{r, echo=TRUE}
+gamma <- .9 # bigger gamma
+for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])
+```
+
+```{r, echo=FALSE}
+#| fig-width: 8
+#| fig-height: 4
+b <- tibble(b1 = x[, 1], b2 = x[, 2])
+g + geom_path(data = b, aes(b1, b2)) +
+ geom_point(data = b, aes(b1, b2), size = 2)
+```
+
+##
+
+$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
+
+```{r, message=FALSE, echo=FALSE, fig.align='center'}
+x <- matrix(0, 40, 2)
+x[1, ] <- c(1, 1)
+```
+
+```{r, echo=TRUE}
+gamma <- .9 # big, but decrease it on schedule
+for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * .9^k * grad(x[k - 1, ])
+```
+
+```{r, echo=FALSE}
+#| fig-width: 8
+#| fig-height: 4
+b <- tibble(b1 = x[, 1], b2 = x[, 2])
+g + geom_path(data = b, aes(b1, b2)) + geom_point(data = b, aes(b1, b2), size = 2)
+```
+
+##
+
+$$ f(x_1,x_2) = x_1^2 + 0.5x_2^2$$
+
+```{r, message=FALSE, echo=FALSE, fig.align='center'}
+x <- matrix(0, 40, 2)
+x[1, ] <- c(1, 1)
+```
+
+```{r, echo=TRUE}
+gamma <- .5 # theoretically optimal
+for (k in 2:40) x[k, ] <- x[k - 1, ] - gamma * grad(x[k - 1, ])
+```
+
+```{r, echo=FALSE}
+#| fig-width: 8
+#| fig-height: 4
+b <- tibble(b1 = x[, 1], b2 = x[, 2])
+g + geom_path(data = b, aes(b1, b2)) + geom_point(data = b, aes(b1, b2), size = 2)
+```
+
+
+
+
+## When do we stop?
+
+For $\epsilon>0$, small
+
+
+Check any / all of
+
+1. $|f'(x)| < \epsilon$
+1. $|x^{(k)} - x^{(k-1)}| < \epsilon$
+1. $|f(x^{(k)}) - f(x^{(k-1)})| < \epsilon$
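+
+One possible way to bundle these checks into a small helper (the `amle()` function later in the deck uses the first two directly):
+
+```{r convergence-check}
+# Sketch: the three stopping checks as a single predicate
+converged <- function(grad_new, x_new, x_old, f_new, f_old, eps = 1e-6) {
+  abs(grad_new) < eps || # 1. gradient is (nearly) zero
+    abs(x_new - x_old) < eps || # 2. iterates have stopped moving
+    abs(f_new - f_old) < eps # 3. objective has stopped improving
+}
+```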
+
+
+## Stochastic gradient descent
+
+Suppose $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$
+
+Like if $f(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x^\top_i\beta)^2$.
+
+Then $f'(\beta) = \frac{1}{n}\sum_{i=1}^n f'_i(\beta) = \frac{1}{n} \sum_{i=1}^n -2x_i^\top(y_i - x^\top_i\beta)$
+
+If $n$ is really big, it may take a long time to compute $f'$
+
+So, just use a sample: partition our data into mini-batches $\mathcal{M}_j$
+
+And approximate (imagine the Law of Large Numbers, use a sample to approximate the population)
+$$f'(x) = \frac{1}{n}\sum_{i=1}^n f'_i(x) \approx \frac{1}{m}\sum_{i\in\mathcal{M}_j}f'_{i}(x)$$
+
+## SGD
+
+$$
+\begin{aligned}
+f'(\beta) &= \frac{1}{n}\sum_{i=1}^n f'_i(\beta) = \frac{1}{n} \sum_{i=1}^n -2x_i^\top(y_i - x^\top_i\beta)\\
+f'(x) &= \frac{1}{n}\sum_{i=1}^n f'_i(x) \approx \frac{1}{m}\sum_{i\in\mathcal{M}_j}f'_{i}(x)
+\end{aligned}
+$$
+
+
+Usually cycle through "mini-batches":
+
+* Use a different mini-batch at each iteration of GD
+* Cycle through until we see all the data
+
+
+This is the workhorse for neural network optimization
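+
+Here is a minimal sketch of the idea for the squared-error loss above, on simulated data with a fixed step size (the sample size, seed, batch size, and step are arbitrary choices for illustration):
+
+```{r sgd-sketch}
+# Sketch: mini-batch SGD for squared-error loss on simulated data
+set.seed(406)
+n <- 1000
+X <- cbind(1, rnorm(n))
+beta_true <- c(1, 2)
+y <- X %*% beta_true + rnorm(n)
+
+beta <- c(0, 0)
+gam <- 0.1
+m <- 100 # mini-batch size
+batches <- split(sample(n), rep(1:(n / m), each = m))
+for (epoch in 1:20) {
+  for (idx in batches) { # one pass over all mini-batches = one epoch
+    Xi <- X[idx, , drop = FALSE]
+    yi <- y[idx]
+    grad_hat <- -2 * crossprod(Xi, yi - Xi %*% beta) / m # mini-batch estimate of f'(beta)
+    beta <- beta - gam * grad_hat
+  }
+}
+drop(beta) # should be close to beta_true = c(1, 2)
+```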
+
+
+
+# Practice with GD and Logistic regression
+
+
+## Gradient descent for Logistic regression
+
+Suppose $Y=1$ with probability $p(x)$ and $Y=0$ with probability $1-p(x)$, $x \in \R$.
+
+I want to model $P(Y=1| X=x)$.
+
+I'll assume that $\log\left(\frac{p(x)}{1-p(x)}\right) = ax$ for some scalar $a$. This means that
+$p(x) = \frac{\exp(ax)}{1+\exp(ax)} = \frac{1}{1+\exp(-ax)}$
+
+. . .
+
+```{r generate-data}
+#| output-location: column
+#| fig-width: 6
+#| fig-height: 4
+n <- 100
+a <- 2
+x <- runif(n, -5, 5)
+logit <- function(x) 1 / (1 + exp(-x))
+p <- logit(a * x)
+y <- rbinom(n, 1, p)
+df <- tibble(x, y)
+ggplot(df, aes(x, y)) +
+ geom_point(colour = "cornflowerblue") +
+ stat_function(fun = ~ logit(a * .x))
+```
+
+
+
+## Reminder: the likelihood
+
+$$
+L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and }
+p(x) = \frac{1}{1+\exp(-ax)}
+$$
+
+
+$$
+\begin{aligned}
+\ell(y | a, x) &= \log \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}
+= \sum_{i=1}^n y_i\log p(x_i) + (1-y_i)\log(1-p(x_i))\\
+&= \sum_{i=1}^n\log(1-p(x_i)) + y_i\log\left(\frac{p(x_i)}{1-p(x_i)}\right)\\
+&=\sum_{i=1}^n ax_i y_i + \log\left(1-p(x_i)\right)\\
+&=\sum_{i=1}^n ax_i y_i + \log\left(\frac{1}{1+\exp(ax_i)}\right)
+\end{aligned}
+$$
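+
+A quick numerical check that the first and last lines agree, using the simulated `x`, `y`, and `logit()` from the previous chunk (the value of $a$ here is arbitrary):
+
+```{r loglik-identity-check}
+# Check that the two forms of the log-likelihood give the same number
+a0 <- 1.5
+p0 <- logit(a0 * x)
+sum(y * log(p0) + (1 - y) * log(1 - p0)) # the definition
+sum(a0 * x * y + log(1 / (1 + exp(a0 * x)))) # the simplified form
+```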
+
+## Reminder: the likelihood
+
+$$
+L(y | a, x) = \prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and }
+p(x) = \frac{1}{1+\exp(-ax)}
+$$
+
+Now, we want the negative of this. Why?
+
+We would maximize the likelihood/log-likelihood, so we minimize the negative likelihood/log-likelihood (and scale by $1/n$)
+
+
+$$-\ell(y | a, x) = \frac{1}{n}\sum_{i=1}^n -ax_i y_i - \log\left(\frac{1}{1+\exp(ax_i)}\right)$$
+
+## Reminder: the likelihood
+
+$$
+\frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and }
+p(x) = \frac{1}{1+\exp(-ax)}
+$$
+
+This is, in the notation of our slides, $f(a)$.
+
+We want to minimize it in $a$ by gradient descent.
+
+So we need the derivative with respect to $a$: $f'(a)$.
+
+Now, conveniently, this simplifies a lot.
+
+
+$$
+\begin{aligned}
+\frac{d}{d a} f(a) &= \frac{1}{n}\sum_{i=1}^n -x_i y_i - \left(-\frac{x_i \exp(ax_i)}{1+\exp(ax_i)}\right)\\
+&=\frac{1}{n}\sum_{i=1}^n -x_i y_i + p(x_i)x_i = \frac{1}{n}\sum_{i=1}^n -x_i(y_i-p(x_i)).
+\end{aligned}
+$$
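+
+To double-check the algebra, compare this derivative with a central finite difference at an arbitrary $a$ (throwaway helpers `fa()` and `fa_prime()`, with `x`, `y`, and `logit()` from before):
+
+```{r gradient-check}
+# Sanity check: analytic derivative vs. a central finite difference
+fa <- function(a) mean(-a * x * y - log(1 / (1 + exp(a * x)))) # f(a) = -loglik / n
+fa_prime <- function(a) mean(-x * (y - logit(a * x))) # the derivative derived above
+a0 <- 1.5
+fa_prime(a0)
+(fa(a0 + 1e-6) - fa(a0 - 1e-6)) / 2e-6 # should agree to several decimal places
+```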
+
+## Reminder: the likelihood
+
+$$
+\frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and }
+p(x) = \frac{1}{1+\exp(-ax)}
+$$
+
+(Simple) gradient descent to minimize $-\ell(a)$ or maximize $L(y|a,x)$ is:
+
+1. Input $a_1,\ \gamma>0,\ j_\max,\ \epsilon>0,\ \frac{d}{da} -\ell(a)$.
+2. For $j=1,\ 2,\ \ldots,\ j_\max$,
+$$a_j = a_{j-1} - \gamma \frac{d}{da} (-\ell(a_{j-1}))$$
+3. Stop if $\epsilon > |a_j - a_{j-1}|$ or $|d / da\ \ell(a)| < \epsilon$.
+
+## Reminder: the likelihood
+
+$$
+\frac{1}{n}L(y | a, x) = \frac{1}{n}\prod_{i=1}^n p(x_i)^{y_i}(1-p(x_i))^{1-y_i}\textrm{ and }
+p(x) = \frac{1}{1+\exp(-ax)}
+$$
+
+```{r amlecorrect}
+#| code-line-numbers: "1,10-11|2-3|4-9|"
+amle <- function(x, y, a0, gam = 0.5, jmax = 50, eps = 1e-6) {
+ a <- double(jmax) # place to hold stuff (always preallocate space)
+ a[1] <- a0 # starting value
+ for (j in 2:jmax) { # avoid possibly infinite while loops
+ px <- logit(a[j - 1] * x)
+ grad <- mean(-x * (y - px))
+ a[j] <- a[j - 1] - gam * grad
+ if (abs(grad) < eps || abs(a[j] - a[j - 1]) < eps) break
+ }
+ a[1:j]
+}
+```
+
+
+
+## Try it:
+
+```{r ourmle1}
+round(too_big <- amle(x, y, 5, 50), 3)
+round(too_small <- amle(x, y, 5, 1), 3)
+round(just_right <- amle(x, y, 5, 10), 3)
+```
+
+
+## Visual
+
+
+
+```{r}
+#| output-location: column
+#| fig-width: 6
+#| fig-height: 8
+negll <- function(a) {
+ -a * mean(x * y) -
+ rowMeans(log(1 / (1 + exp(outer(a, x)))))
+}
+blah <- list_rbind(
+ map(
+ rlang::dots_list(
+ too_big, too_small, just_right, .named = TRUE
+ ),
+ as_tibble),
+ names_to = "gamma"
+) |> mutate(negll = negll(value))
+ggplot(blah, aes(value, negll)) +
+ geom_point(aes(colour = gamma)) +
+ facet_wrap(~gamma, ncol = 1) +
+ stat_function(fun = negll, xlim = c(-2.5, 5)) +
+ scale_y_log10() +
+ xlab("a") +
+ ylab("negative log likelihood") +
+ geom_vline(xintercept = tail(just_right, 1)) +
+ scale_colour_brewer(palette = "Set1") +
+ theme(legend.position = "none")
+```
+
+## Check vs. `glm()`
+
+```{r}
+summary(glm(y ~ x - 1, family = "binomial"))
+```