diff --git a/_freeze/schedule/slides/00-classification-losses/execute-results/html.json b/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
new file mode 100644
index 0000000..4df31ef
--- /dev/null
+++ b/_freeze/schedule/slides/00-classification-losses/execute-results/html.json
@@ -0,0 +1,20 @@
+{
+ "hash": "8b2edae6a14401844329d3d63dfd8c75",
+ "result": {
+ "markdown": "---\nlecture: \"00 Evaluating classifiers\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 16 October 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## How do we measure accuracy?\n\n[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.\n\n[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.\n\nFor example, consider facial recognition. Goal is \"person OK\", \"person has expired passport\", \"person is a known terrorist\"\n\n1. If classify OK, but was terrorist, lose 1,000,000\n1. If classify OK, but expired passport, lose 2\n1. If classify terrorist, but was OK, lose 100\n1. If classify terrorist, but was expired passport, lose 10\n1. etc.\n\n. . .\n\n\nResults in a 3x3 matrix of losses with 0 on the diagonal.\n\n\n::: {.cell layout-align=\"center\" R.options='{\"scipen\":8}'}\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3]\n[1,] 0 2 30\n[2,] 10 0 100\n[3,] 1000000 50000 0\n```\n:::\n:::\n\n\n\n## Deviance loss\n\nSometimes we output [probabilities]{.secondary} as well as class labels.\n\nFor example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \\given x_i) = 1 / (1 + \\exp\\{-x'_i \\hat\\beta\\})$\n\nLDA and QDA produce probabilities as well. So do Neural Networks (typically)\n\n(Trees \"don't\", neither does KNN, though you could fake it)\n\n. . .\n\n
\n\n* Deviance loss for 2-class classification is $-2\\textrm{loglikelihood}(y, \\hat{p}) = -2 (y_i x'_i\\hat{\\beta} - \\log (1-\\hat{p}))$\n\n(Technically, it's the difference between this and the loss of the null model, but people play fast and loose)\n\n* Could also use cross entropy or Gini index.\n\n\n\n## Calibration\n\nSuppose we predict some probabilities for our data, how often do those events happen?\n\nIn principle, if we predict $\\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0. (In training set and test set)\n\nThe same goes for the other probabilities. If we say \"20% chance of rain\" it should rain 20% of such days.\n\n\nOf course, we didn't predict **exactly** $\\hat{p}(x_i)=0.2$ ever, so lets look at $[.15, .25]$.\n\n\n::: {.cell layout-align=\"center\" output-location='fragment'}\n\n```{.r .cell-code code-line-numbers=\"1-6|7|8-9\"}\nn <- 250\ndat <- tibble(\n x = seq(-5, 5, length.out = n), \n p = 1 / (1 + exp(-x)),\n y = rbinom(n, 1, p)\n)\nfit <- glm(y ~ x, family = binomial, data = dat)\ndat$phat <- predict(fit, type = \"response\") # predicted probabilities\ndat |> filter(phat > .15, phat < .25) |> summarize(target = .2, obs = mean(y))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 2\n target obs\n \n1 0.2 0.222\n```\n:::\n:::\n\n\n\n## Calibration plot\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot <- function(y, phat, nbreaks = 10) {\n dat <- tibble(y = y, phat = phat) |>\n mutate(bins = cut_number(phat, n = nbreaks))\n midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)\n midpts <- midpts[-length(midpts)] + diff(midpts) / 2\n sum_dat <- dat |> \n group_by(bins) |>\n summarise(p = mean(y, na.rm = TRUE), \n se = sqrt(p * (1 - p) / n())) |>\n arrange(p)\n sum_dat$x <- midpts\n \n ggplot(sum_dat, aes(x = x)) + \n geom_errorbar(aes(ymin = pmax(p - 1.96*se, 0), ymax = pmin(p + 1.96*se, 1))) +\n geom_point(aes(y = p), color = blue) + \n geom_abline(slope = 1, intercept = 0, color = orange) +\n ylab(\"observed frequency\") + xlab(\"average predicted probability\") +\n coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +\n geom_rug(data = dat, aes(x = phat), sides = 'b')\n}\n```\n:::\n\n\n\n## Amazingly well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbinary_calibration_plot(dat$y, dat$phat, 20L)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Less well-calibrated\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## True positive, false negative, sensitivity, specificity\n\nTrue positive rate\n: \\# correct predict positive / \\# actual positive (1 - FNR)\n\nFalse negative rate\n: \\# incorrect predict negative / \\# actual positive (1 - TPR), Type II Error\n\nTrue negative rate\n: \\# correct predict negative / \\# actual negative\n\nFalse positive rate\n: \\# incorrect predict positive / \\# actual negative (1 - TNR), Type I Error\n\nSensitivity\n: TPR, 1 - Type II error\n\nSpecificity\n: TNR, 1 - Type I error\n\n\n\n## ROC and thresholds\n\nROC (Receiver Operating Characteristic) Curve\n: TPR (sensitivity) vs. FPR (1 - specificity)\n \nAUC (Area under the curve)\n: Integral of ROC. 
Closer to 1 is better.\n \nSo far, we've been thresholding at 0.5, though you shouldn't always do that. \n \nWith unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).\n \nTo make the [ROC]{.secondary} we look at our errors [as we vary the cutoff]{.secondary}\n \n\n## ROC curve\n\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nroc <- function(prediction, y) {\n op <- order(prediction, decreasing = TRUE)\n preds <- prediction[op]\n y <- y[op]\n noty <- 1 - y\n if (any(duplicated(preds))) {\n y <- rev(tapply(y, preds, sum))\n noty <- rev(tapply(noty, preds, sum))\n }\n tibble(\n FPR = cumsum(noty) / sum(noty), \n TPR = cumsum(y) / sum(y)\n )\n}\n\nggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +\n geom_step(color = blue, size = 2) +\n geom_abline(slope = 1, intercept = 0)\n```\n\n::: {.cell-output-display}\n![](00-classification-losses_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## Other stuff\n\n![](gfx/huge-roc.png)\n\n* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n",
+ "supporting": [
+ "00-classification-losses_files"
+ ],
+ "filters": [
+ "rmarkdown/pagebreak.lua"
+ ],
+ "includes": {
+ "include-after-body": [
+ "\n\n\n"
+ ]
+ },
+ "engineDependencies": {},
+ "preserve": {},
+ "postProcess": true
+ }
+}
\ No newline at end of file
diff --git a/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-4-1.svg b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-4-1.svg
new file mode 100644
index 0000000..fdd4da7
--- /dev/null
+++ b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-4-1.svg
@@ -0,0 +1,1744 @@
+
+
diff --git a/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-5-1.svg b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-5-1.svg
new file mode 100644
index 0000000..f534d4b
--- /dev/null
+++ b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-5-1.svg
@@ -0,0 +1,1929 @@
+
+
diff --git a/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-6-1.svg b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-6-1.svg
new file mode 100644
index 0000000..49586d2
--- /dev/null
+++ b/_freeze/schedule/slides/00-classification-losses/figure-revealjs/unnamed-chunk-6-1.svg
@@ -0,0 +1,257 @@
+
+
diff --git a/schedule/slides/00-classification-losses.html b/schedule/slides/00-classification-losses.html
deleted file mode 100644
index 209c0e4..0000000
--- a/schedule/slides/00-classification-losses.html
+++ /dev/null
@@ -1,381 +0,0 @@
- 00 Evaluating classifiers
diff --git a/schedule/slides/00-classification-losses.Rmd b/schedule/slides/00-classification-losses.qmd
similarity index 51%
rename from schedule/slides/00-classification-losses.Rmd
rename to schedule/slides/00-classification-losses.qmd
index bdff0d9..6c16630 100644
--- a/schedule/slides/00-classification-losses.Rmd
+++ b/schedule/slides/00-classification-losses.qmd
@@ -1,68 +1,64 @@
---
-title: "00 Evaluating classifiers"
-author:
- - "STAT 406"
- - "Daniel J. McDonald"
-date: 'Last modified - `r Sys.Date()`'
+lecture: "00 Evaluating classifiers"
+format: revealjs
+metadata-files:
+ - _metadata.yml
---
-```{r setup, include=FALSE, warning=FALSE, message=FALSE}
-source("rmd_config.R")
-```
-
-```{r css-extras, file="css-extras.R", echo=FALSE}
-```
+{{< include _titleslide.qmd >}}
## How do we measure accuracy?
-**So far** --- 0-1 loss. If correct class, lose 0 else lose 1.
+[So far]{.secondary} --- 0-1 loss. If correct class, lose 0 else lose 1.
-**Asymmetric classification loss** --- If correct class, lose 0 else lose something.
+[Asymmetric classification loss]{.secondary} --- If correct class, lose 0 else lose something.
-For example, consider facial recognition. Goal is "person ok", "person has expired passport", "person is a known terrorist"
+For example, consider facial recognition. The goal is to decide among "person OK", "person has expired passport", and "person is a known terrorist".
-1. If classify ok, but was terrorist, lose 1000000
-1. If classify ok, but expired passport, lose 2
-1. If classify terrorist, but was ok, lose 100
+1. If classify OK, but was terrorist, lose 1,000,000
+1. If classify OK, but expired passport, lose 2
+1. If classify terrorist, but was OK, lose 100
1. If classify terrorist, but was expired passport, lose 10
1. etc.
+. . .
+
+
Results in a 3x3 matrix of losses with 0 on the diagonal.
```{r echo=FALSE, R.options=list(scipen=8)}
-matrix(c(0,10,1000000, 2, 0, 50000, 30, 100, 0), nrow = 3)
+matrix(c(0, 10, 1000000, 2, 0, 50000, 30, 100, 0), nrow = 3)
```
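+
+A minimal sketch, assuming a loss matrix `L` with rows indexing the true class and columns the predicted class (the vectors below are purely illustrative):
+
+```{r}
+#| eval: false
+# swap in your own loss matrix, true labels, and predicted labels
+L <- matrix(c(0, 10, 1000000, 2, 0, 50000, 30, 100, 0), nrow = 3)
+truth <- c(1, 1, 2, 3, 1) # true classes, coded 1:3
+pred <- c(1, 3, 1, 1, 1) # predicted classes, coded 1:3
+mean(L[cbind(truth, pred)]) # average asymmetric loss
+```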
----
## Deviance loss
-Sometimes we output **probabilities** as well as class labels.
+Sometimes we output [probabilities]{.secondary} as well as class labels.
-For example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1) = 1 / (1 + \exp\{-x'_i \hat\beta\})$
+For example, logistic regression returns the probability that an observation is in class 1. $P(Y_i = 1 \given x_i) = 1 / (1 + \exp\{-x'_i \hat\beta\})$
LDA and QDA produce probabilities as well. So do Neural Networks (typically)
(Trees "don't", neither does KNN, though you could fake it)
---
+. . .
- ---
+
-* Deviance loss for 2-class classification is $\sum_{i=1}^n -2\textrm{log-likelihood}(y, \hat{p}) = \sum_{i=1}^n -2 (y_i x'_i\hat{\beta} - \log (1-\hat{p}))$
+* Deviance loss for 2-class classification is $-2\textrm{log-likelihood}(y_i, \hat{p}_i) = -2 (y_i x'_i\hat{\beta} + \log (1-\hat{p}_i))$
(Technically, deviance is measured against the saturated model rather than the null model, but people play fast and loose; a quick numerical check follows below)
* Could also use cross entropy or Gini index.
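+
+A minimal sketch, assuming Bernoulli (0/1) responses so that the saturated log-likelihood is 0 and the residual deviance equals $-2$ times the log-likelihood:
+
+```{r}
+#| eval: false
+# illustrative simulation; any binomial glm fit works the same way
+set.seed(406)
+xx <- rnorm(100)
+yy <- rbinom(100, 1, 1 / (1 + exp(-xx)))
+ff <- glm(yy ~ xx, family = binomial)
+pp <- predict(ff, type = "response")
+c(by_hand = -2 * sum(dbinom(yy, 1, pp, log = TRUE)), from_glm = deviance(ff))
+```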
----
+
## Calibration
Suppose we predict some probabilities for our data: how often do those events actually happen?
-In principle, if we predict $\hat{p}(x_i)=0.2$ for a bunch of events observations $i$, we'd like to see about 20% 1 and 80% 0.
+In principle, if we predict $\hat{p}(x_i)=0.2$ for a bunch of observations $i$, we'd like to see about 20% 1s and 80% 0s (in both the training set and the test set).
The same goes for the other probabilities. If we say "20% chance of rain", it should rain on 20% of such days.
@@ -70,29 +66,32 @@ The same goes for the other probabilities. If we say "20% chance of rain" it sho
Of course, we never predict **exactly** $\hat{p}(x_i)=0.2$, so let's look at $[.15, .25]$.
```{r}
-x <- seq(-5, 5, length.out = 1000)
-p <- 1 / (1 + exp(-x))
-y <- rbinom(1000, 1, p)
-dat <- data.frame(y = y, x = x)
+#| code-line-numbers: "1-6|7|8-9"
+#| output-location: fragment
+n <- 250
+dat <- tibble(
+ x = seq(-5, 5, length.out = n),
+ p = 1 / (1 + exp(-x)),
+ y = rbinom(n, 1, p)
+)
fit <- glm(y ~ x, family = binomial, data = dat)
dat$phat <- predict(fit, type = "response") # predicted probabilities
-mean(dat$y[dat$phat > 0.15 & dat$phat < 0.25])
+dat |> filter(phat > .15, phat < .25) |> summarize(target = .2, obs = mean(y))
```
----
## Calibration plot
```{r}
binary_calibration_plot <- function(y, phat, nbreaks = 10) {
- dat <- data.frame(y = y, phat = phat) %>%
+ dat <- tibble(y = y, phat = phat) |>
mutate(bins = cut_number(phat, n = nbreaks))
midpts <- quantile(dat$phat, seq(0, 1, length.out = nbreaks + 1), na.rm = TRUE)
midpts <- midpts[-length(midpts)] + diff(midpts) / 2
- sum_dat <- dat %>%
- group_by(bins) %>%
+ sum_dat <- dat |>
+ group_by(bins) |>
summarise(p = mean(y, na.rm = TRUE),
- se = sqrt(p * (1 - p) / n())) %>%
+ se = sqrt(p * (1 - p) / n())) |>
arrange(p)
sum_dat$x <- midpts
@@ -106,66 +105,70 @@ binary_calibration_plot <- function(y, phat, nbreaks = 10) {
}
```
----
## Amazingly well-calibrated
-```{r, fig.width=8, fig.height=5, fig.align="center"}
-binary_calibration_plot(y, dat$phat, 20L)
+```{r}
+#| fig-width: 8
+#| fig-height: 5
+binary_calibration_plot(dat$y, dat$phat, 20L)
```
----
## Less well-calibrated
-```{r, echo=FALSE, fig.width=8, fig.height=6, fig.align="center"}
-data("bakeoff_train", package = "Stat406")
-baked <- bakeoff_train %>% dplyr::select(winners, percent_star:technical_median)
-baked <- baked[complete.cases(baked),]
-phat <- predict(glm(winners ~ ., family = binomial, data = baked), type = "response")
-binary_calibration_plot(baked$winners, phat, 10L)
+```{r}
+#| echo: false
+#| fig-width: 8
+#| fig-height: 5
+binary_calibration_plot(rbinom(250, 1, 0.5), rbeta(250, 1.3, 1), 15L)
```
----
## True positive, false negative, sensitivity, specificity
-**True positive rate** - # correct predict positive / # actual positive (1 - FNR)
+True positive rate
+: \# correct predict positive / \# actual positive (1 - FNR)
-**False negative rate** - # incorrect predict negative / # actual positive (1 - TPR), Type II Error
+False negative rate
+: \# incorrect predict negative / \# actual positive (1 - TPR), Type II Error
-**True negative rate** - # correct predict negative / # actual negative
+True negative rate
+: \# correct predict negative / \# actual negative
-**False positive rate** - # incorrect predict positive / # actual negative (1 - TNR), Type I Error
+False positive rate
+: \# incorrect predict positive / \# actual negative (1 - TNR), Type I Error
+Sensitivity
+: TPR, 1 - Type II error
-**Sensitivity** - TPR, 1 - Type II error
+Specificity
+: TNR, 1 - Type I error
-**Specificity** - TNR, 1 - Type I error
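+
+A minimal sketch, assuming 0/1 vectors `truth` and `pred` (i.e., a threshold has already been applied):
+
+```{r}
+#| eval: false
+# illustrative vectors; replace with your own labels and predictions
+truth <- c(1, 1, 0, 0, 1, 0, 1, 0)
+pred <- c(1, 0, 0, 1, 1, 0, 1, 0)
+tab <- table(predicted = pred, actual = truth)
+TPR <- tab["1", "1"] / sum(tab[, "1"]) # sensitivity
+TNR <- tab["0", "0"] / sum(tab[, "0"]) # specificity
+c(TPR = TPR, FNR = 1 - TPR, TNR = TNR, FPR = 1 - TNR)
+```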
---
- ---
-
-**ROC (Receiver Operating Characteristic) Curve** - TPR (sensitivity) vs. FPR (1 - specificity)
+## ROC and thresholds
+
+ROC (Receiver Operating Characteristic) Curve
+: TPR (sensitivity) vs. FPR (1 - specificity)
-**AUC (Area under the curve)** - Integral of ROC. Closer to 1 is better.
+AUC (Area under the curve)
+: Integral of ROC. Closer to 1 is better.
So far, we've been thresholding at 0.5, though you shouldn't always do that.
With unbalanced data (say 10% 0 and 90% 1), if you care equally about predicting both classes, you might want to choose a different cutoff (like in LDA).
-To make the **ROC** we look at our errors **as we vary the cutoff**
+To make the [ROC]{.secondary}, we look at our errors [as we vary the cutoff]{.secondary}.
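+
+A minimal sketch, assuming the `dat$phat` and `dat$y` from the calibration slides, comparing confusion matrices at two cutoffs:
+
+```{r}
+#| eval: false
+table(pred = dat$phat > 0.5, truth = dat$y) # threshold at 0.5
+table(pred = dat$phat > 0.3, truth = dat$y) # lower cutoff: more false positives, fewer false negatives
+```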
----
## ROC curve
-.pull-left[
-
```{r}
+#| output-location: column-fragment
+#| fig-width: 8
+#| fig-height: 6
roc <- function(prediction, y) {
op <- order(prediction, decreasing = TRUE)
preds <- prediction[op]
@@ -175,31 +178,21 @@ roc <- function(prediction, y) {
y <- rev(tapply(y, preds, sum))
noty <- rev(tapply(noty, preds, sum))
}
- data.frame(
+ tibble(
FPR = cumsum(noty) / sum(noty),
- TPR = cumsum(y) / sum(y))
+ TPR = cumsum(y) / sum(y)
+ )
}
-roc_plt <- ggplot(roc(dat$phat, dat$y),
- aes(FPR, TPR)) +
+ggplot(roc(dat$phat, dat$y), aes(FPR, TPR)) +
geom_step(color = blue, size = 2) +
geom_abline(slope = 1, intercept = 0)
```
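+
+A minimal sketch of the AUC, reusing `roc()` above and integrating with the trapezoid rule:
+
+```{r}
+#| eval: false
+auc <- function(prediction, y) {
+  r <- roc(prediction, y)
+  # trapezoid rule over (FPR, TPR), starting from the implicit point (0, 0)
+  sum(diff(c(0, r$FPR)) * (r$TPR + c(0, head(r$TPR, -1))) / 2)
+}
+auc(dat$phat, dat$y) # closer to 1 is better
+```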
-]
-.pull-right[
-
-```{r, echo=FALSE}
-roc_plt
-```
-
-]
-
----
## Other stuff
-![:scale 75%](gfx/huge-roc.png)
+![](gfx/huge-roc.png)
* Source: worth exploring [Wikipedia](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)