Update last of classification slides
gpleiss committed Oct 24, 2024
1 parent c1417d2 commit 7293a21
Showing 16 changed files with 8,606 additions and 3,063 deletions.
@@ -1,7 +1,8 @@
{
"hash": "a52fe0e79bf5e4db92eca003f346d642",
"hash": "03d8618359c4ba22b893a73c0d7d89b5",
"result": {
"markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n<hr>\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. 
If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n),\n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |>\n filter(phat > .15, phat < .25) |>\n summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n target obs\n <dbl> <dbl>\n1 0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |>\n group_by(bins) |>\n summarise(\n p = mean(y, na.rm = TRUE),\n se = sqrt(p * (1 - p) / n())\n ) |>\n arrange(p)\n sum_dat$x <- midpts\n\n ggplot(sum_dat, aes(x = x)) +\n geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n geom_point(aes(y = p), colour = blue) +\n geom_abline(slope = 1, intercept = 0, colour = orange) +\n ylab(\"observed frequency\") +\n xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. 
\n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty),\n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(colour = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
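The deviance-loss slide in the deck above defines the loss as minus twice the binomial log-likelihood of the fitted class-1 probabilities, which is easy to check numerically. A minimal sketch, assuming the `dat` tibble (with columns `y` and `phat`) and the `fit` object created in the calibration chunk of these slides:

```r
# Deviance loss = -2 * log-likelihood of the fitted class-1 probabilities.
# Assumes `dat` (columns y, phat) and `fit` from the calibration chunk above.
dev_by_hand <- -2 * sum(dat$y * log(dat$phat) + (1 - dat$y) * log(1 - dat$phat))
dev_by_hand
deviance(fit) # glm() reports the same quantity for a binomial fit with a 0/1 response
```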
"engine": "knitr",
"markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 23 October 2024\n\n\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n\\newcommand{\\U}{\\mathbf{U}}\n\\newcommand{\\D}{\\mathbf{D}}\n\\newcommand{\\V}{\\mathbf{V}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n. . .\n\n[Generalization: Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\n\\\nE.g. MRI screening. Goal is \"person OK\", \"person has a disease\"\n\n1. If classify OK, but was disease, lose 1,000,000\n1. If classify disease, but was OK, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 2x2 matrix of losses with 0 on the diagonal.\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 0 1000000\n[2,] 10 0\n```\n\n\n:::\n:::\n\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n<hr>\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n<!-- (Technically, it's the difference between this and the loss of the null model, but people play fast and loose) -->\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. 
If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n## Calibration plot\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n),\n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |>\n group_by(bins) |>\n summarise(\n p = mean(y, na.rm = TRUE),\n se = sqrt(p * (1 - p) / n())\n ) |>\n arrange(p)\n sum_dat$x <- midpts\n\n ggplot(sum_dat, aes(x = x)) +\n geom_errorbar(aes(ymin = pmax(p - 1.96 * se, 0), ymax = pmin(p + 1.96 * se, 1))) +\n geom_point(aes(y = p), colour = blue) +\n geom_abline(slope = 1, intercept = 0, colour = orange) +\n ylab(\"observed frequency\") +\n xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = \"b\")\n}\n```\n:::\n\n\n\n\n## Amazingly well-calibrated\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-3-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Less well-calibrated\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## Decision making\n\nGiven a logistic regression output $\\hat P(Y \\mid X) = 0.56$,\nshould we assign $\\hat Y = 1$ or $\\hat Y = 0$?\n\nE.g. $P(Y=1 \\mid X)$ is predicted probability that email $X$ is spam. \\\nDo we send it to the spam folder ($\\hat Y=1$) or the inbox ($\\hat Y=0$)?\n\n. . .\n\nSo far we've been making the \"decision\" $\\hat Y=1$ if $\\hat P(Y=1 \\mid X) > \\hat P(Y=0 \\mid X)$ \\\ni.e. $\\hat Y = \\begin{cases} 1 & \\hat P(Y=1 \\mid X) > 0.5 \\\\ 0 & \\mathrm{o.w.} \\end{cases}$.\n\nBut maybe (for our application) a \"better\" decision is $$\\hat Y = \\begin{cases} 1 & \\hat P(Y=1 \\mid X) > t \\\\ 0 & \\mathrm{o.w.} \\end{cases}$$\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n: [Each point corresponds to a different $0 \\leq t \\leq 1$.]{.small}\n \nAUC (Area under the curve)\n: Integral of ROC. 
Closer to 1 is better.\n\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty),\n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(colour = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n\n\n\n\n\n# Next time ... {background-image=\"https://i1.wp.com/bdtechtalks.com/wp-content/uploads/2018/12/artificial-intelligence-deep-learning-neural-networks-ai.jpg?w=1392&ssl=1\" background-opacity=.4}\n\n\n[Module 4]{.secondary}\n\n[boosting, bagging, random forests, and neural nets]{.secondary}\n",
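The new "Decision making" slide asks when to predict class 1 given a predicted probability, and the MRI slide supplies an asymmetric 2x2 loss matrix. One way to choose the cutoff is to pick, for each observation, the action with the smaller expected loss. A sketch under those slides' loss values; the `decide()` helper is a hypothetical addition, not part of the slide code:

```r
# Minimum-expected-loss decisions under the 2x2 asymmetric loss from the MRI slide.
# L[i, j] = loss of predicting class i when the truth is class j.
L <- matrix(c(0, 10, 1e6, 0), nrow = 2) # row 1: predict "OK"; row 2: predict "disease"

# decide() is a hypothetical helper: pick the action with smaller expected loss.
decide <- function(p_disease, L) {
  loss_ok      <- (1 - p_disease) * L[1, 1] + p_disease * L[1, 2]
  loss_disease <- (1 - p_disease) * L[2, 1] + p_disease * L[2, 2]
  ifelse(loss_disease < loss_ok, "disease", "OK")
}

decide(c(1e-6, 1e-4, 0.5), L) # "OK" "disease" "disease"
# Equivalently, predict "disease" whenever p_disease > L[2, 1] / (L[2, 1] + L[1, 2]),
# roughly 1e-5 here, far below the default 0.5 cutoff.
```

The more costly a false negative is relative to a false positive, the lower the threshold should be, which is exactly the point of varying t on the decision-making and ROC slides.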
"supporting": [
"00-classification-losses_files"
],
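The ROC slide defines AUC as the integral of the ROC curve. One common approximation is the trapezoid rule applied to the output of the `roc()` helper shown in the slides. A sketch, assuming `roc()` and the `dat` tibble from the slide code above; the `auc()` wrapper itself is a hypothetical addition:

```r
# Area under the ROC curve by the trapezoid rule.
# Assumes roc() and `dat` from the slide code above; auc() is not part of the slides.
auc <- function(prediction, y) {
  r <- roc(prediction, y)
  fpr <- c(0, r$FPR) # prepend the (0, 0) corner of the curve
  tpr <- c(0, r$TPR)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

auc(dat$phat, dat$y) # values near 1 indicate the two classes are ranked well
```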
