gradient descent redone

UBC-STAT · Oct 16, 2023 · 5cfc994 · 5cfc994
1 parent 1628837
commit 5cfc994
Show file tree

Hide file tree

Showing 22 changed files with 6,418 additions and 1,133 deletions.
diff --git a/_freeze/schedule/slides/00-classification-losses/execute-results/html.json b/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
@@ -1,7 +1,7 @@
 {
-  "hash": "8b2edae6a14401844329d3d63dfd8c75",
+  "hash": "a52fe0e79bf5e4db92eca003f346d642",
   "result": {
-    "markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n  - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n        [,1]  [,2] [,3]\n[1,]       0     2   30\n[2,]      10     0  100\n[3,] 1000000 50000    0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n<hr>\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code  code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n  x = seq(-5, 5, length.out = n), \n  p = 1 / (1 + exp(-x)),\n  y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |> filter(phat > .15, phat < .25) |> summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n  target   obs\n   <dbl> <dbl>\n1    0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n  dat <- tibble(y = y, phat = phat) |>\n    mutate(bins = cut_number(phat, n = nbreaks))\n  midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n  midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n  sum_dat <- dat |> \n    group_by(bins) |>\n    summarise(p = mean(y, na.rm = TRUE), \n              se = sqrt(p * (1 - p) / n())) |>\n    arrange(p)\n  sum_dat$x <- midpts\n  \n  ggplot(sum_dat, aes(x = x)) + \n    geom_errorbar(aes(ymin = pmax(p - 1.96*se, 0), ymax = pmin(p + 1.96*se, 1))) +\n    geom_point(aes(y = p), color = blue) + \n    geom_abline(slope = 1, intercept = 0, color = orange) +\n    ylab(\"observed frequency\") + xlab(\"average predicted probability\") +\n    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n    geom_rug(data = dat, aes(x = phat), sides = 'b')\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive  / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative  / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative  / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive  / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. \n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n  op <- order(prediction, decreasing = TRUE)\n  preds <- prediction[op]\n  y <- y[op]\n  noty <- 1 - y\n  if (any(duplicated(preds))) {\n    y <- rev(tapply(y, preds, sum))\n    noty <- rev(tapply(noty, preds, sum))\n  }\n  tibble(\n    FPR = cumsum(noty) / sum(noty), \n    TPR = cumsum(y) / sum(y)\n  )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n  geom_step(color = blue, size = 2) +\n  geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
+    "markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n  - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n        [,1]  [,2] [,3]\n[1,]       0     2   30\n[2,]      10     0  100\n[3,] 1000000 50000    0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n<hr>\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code  code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n  x = seq(-5, 5, length.out = n),\n  p = 1 / (1 + exp(-x)),\n  y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |>\n  filter(phat > .15, phat < .25) |>\n  summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n  target   obs\n   <dbl> <dbl>\n1    0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n  dat <- tibble(y = y, phat = phat) |>\n    mutate(bins = cut_number(phat, n = nbreaks))\n  midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n  midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n  sum_dat <- dat |>\n    group_by(bins) |>\n    summarise(\n      p = mean(y, na.rm = TRUE),\n      se = sqrt(p * (1 - p) / n())\n    ) |>\n    arrange(p)\n  sum_dat$x <- midpts\n\n  ggplot(sum_dat, aes(x = x)) +\n    geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n    geom_point(aes(y = p), colour = blue) +\n    geom_abline(slope = 1, intercept = 0, colour = orange) +\n    ylab(\"observed frequency\") +\n    xlab(\"average predicted probability\") +\n    coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n    geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive  / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative  / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative  / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive  / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. \n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n  op <- order(prediction, decreasing = TRUE)\n  preds <- prediction[op]\n  y <- y[op]\n  noty <- 1 - y\n  if (any(duplicated(preds))) {\n    y <- rev(tapply(y, preds, sum))\n    noty <- rev(tapply(noty, preds, sum))\n  }\n  tibble(\n    FPR = cumsum(noty) / sum(noty),\n    TPR = cumsum(y) / sum(y)\n  )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n  geom_step(colour = blue, size = 2) +\n  geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
     "supporting": [
       "00-classification-losses_files"
     ],