Commit 7d26b30

technical point on AIC

dajmcdon committed Sep 26, 2023
1 parent 6e104e0 commit 7d26b30
Showing 2 changed files with 10 additions and 5 deletions.
@@ -1,8 +1,10 @@
{
"hash": "fff10fe2207c47aa6a2ddbb9e7fb106e",
"hash": "8e8209d71d2cc213b3854b91d53a0f11",
"result": {
"markdown": "---\nlecture: \"06 Information criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 18 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n$$\n\n\n\n\n\n## Generalized CV\n\nLast time we saw a nice trick, that works some of the time (OLS, Ridge regression,...)\n\n$$\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.$$\n\n1. $\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}$ for some matrix $\\mathbf{H}$.\n2. A technical thing.\n\n$$\\newcommand{\\H}{\\mathbf{H}}$$\n\n\n## This is another nice trick.\n\nIdea: replace $h_{ii}$ with $\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})$\n\nLet's call $\\textrm{tr}(\\mathbf{H})$ the _degrees-of-freedom_ (or just _df_)\n\n$$\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\n\n[Where does this stuff come from?]{.hand}\n\n\n## What are `hatvalues`?\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\nIn OLS, $\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y$\n\nWe often call $\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top$ the Hat matrix, because it [puts the hat]{.hand} on $\\y$\n\nGCV uses $\\textrm{tr}(\\mathbf{H})$. \n\nFor `lm()`, this is just `p`, the number of predictors (Why?)\n\nThis is one way of understanding the name _degrees-of-freedom_\n\n\n\n## Alternative interpretation:\n\n\nSuppose, $Y_i$ is independent from some distribution with mean $\\mu_i$ and variance $\\sigma^2$\n\n(remember: in the linear model $\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i$ )\n\nLet $\\widehat{\\mathbf{Y}}$ be an estimator of $\\mu$ (all $i=1,\\ldots,n$ elements of the vector).\n\n. . 
.\n\n\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots =\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\n\n\n## Alternative interpretation:\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}$$\n\nNow, if $\\widehat{\\mathbf{Y}} = \\H \\mathbf{Y}$ for some matrix $\\H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{\\mathbf{Y}^\\top \\H \\mathbf{Y}} = \\sigma^2 \\textrm{tr}(\\H)$\n\n\nThis gives _Mallow's $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$MSE + 2\\hat{\\sigma}^2\\textrm{df}/n$\n\n::: callout-important\nUnfortunately, df may be difficult or impossible to calculate for complicated\nprediction methods. But one can often estimate it well. This idea is beyond\nthe level of this course.\n:::\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors, \n\n\\begin{aligned}\n\\textrm{AIC} &= -n + 2n\\log(2\\pi) - 2 + 2\\log(n) - 2\\log(RSS) + 2(p+2) \\\\\n&\\propto -2\\log(RSS) + 2(p+2)\n\\end{aligned}\n\n( $p+2$ because of the intercept and the unknown variance)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a [low-bias / high-variance]{.hand} situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a [high-bias / low-variance]{.hand} situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Recommendations\n\n::: secondary\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. \n\nCV is usually easiest to make sense of and doesn't depend on other unknown parameters.\n\nBut, it requires refitting the model.\n\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.\n:::\n\n. . 
.\n\nHigh-level intuition of these:\n\n* GCV tends to choose \"dense\" models.\n\n* Theory says AIC chooses the \"best predicting model\" asymptotically.\n\n* Theory says BIC should choose the \"true model\" asymptotically, tends to select fewer predictors.\n\n* In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n[Use CV]{.hand .secondary}\n\n\n## A few more caveats\n\nIt is often tempting to \"just compare\" risk estimates from vastly different models. \n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n[This is not always justified.]{.secondary}\n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. Residuals / response variables on different scales aren't directly comparable.\n\n\"Validation set\" is easy, because you're always comparing to the \"right\" thing. But it has lots of drawbacks.\n\n\n\n# Next time ...\n\nGreedy selection\n",
"supporting": [],
"markdown": "---\nlecture: \"06 Information criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 26 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n\\newcommand{\\norm}[1]{\\left\\lVert #1 \\right\\rVert}\n\\newcommand{\\snorm}[1]{\\lVert #1 \\rVert}\n\\newcommand{\\tr}[1]{\\mbox{tr}(#1)}\n\\newcommand{\\brt}{\\widehat{\\beta}^R_{s}}\n\\newcommand{\\brl}{\\widehat{\\beta}^R_{\\lambda}}\n\\newcommand{\\bls}{\\widehat{\\beta}_{ols}}\n\\newcommand{\\blt}{\\widehat{\\beta}^L_{s}}\n\\newcommand{\\bll}{\\widehat{\\beta}^L_{\\lambda}}\n$$\n\n\n\n\n\n## Generalized CV\n\nLast time we saw a nice trick, that works some of the time (OLS, Ridge regression,...)\n\n$$\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.$$\n\n1. $\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}$ for some matrix $\\mathbf{H}$.\n2. A technical thing.\n\n$$\\newcommand{\\H}{\\mathbf{H}}$$\n\n\n## This is another nice trick.\n\nIdea: replace $h_{ii}$ with $\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})$\n\nLet's call $\\textrm{tr}(\\mathbf{H})$ the _degrees-of-freedom_ (or just _df_)\n\n$$\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\n\n[Where does this stuff come from?]{.hand}\n\n\n## What are `hatvalues`?\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\nIn OLS, $\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y$\n\nWe often call $\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top$ the Hat matrix, because it [puts the hat]{.hand} on $\\y$\n\nGCV uses $\\textrm{tr}(\\mathbf{H})$. \n\nFor `lm()`, this is just `p`, the number of predictors (Why?)\n\nThis is one way of understanding the name _degrees-of-freedom_\n\n\n\n## Alternative interpretation:\n\n\nSuppose, $Y_i$ is independent from some distribution with mean $\\mu_i$ and variance $\\sigma^2$\n\n(remember: in the linear model $\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i$ )\n\nLet $\\widehat{\\mathbf{Y}}$ be an estimator of $\\mu$ (all $i=1,\\ldots,n$ elements of the vector).\n\n. . 
.\n\n\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots =\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\n\n\n## Alternative interpretation:\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}$$\n\nNow, if $\\widehat{\\mathbf{Y}} = \\H \\mathbf{Y}$ for some matrix $\\H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{\\mathbf{Y}^\\top \\H \\mathbf{Y}} = \\sigma^2 \\textrm{tr}(\\H)$\n\n\nThis gives _Mallow's $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$MSE + 2\\hat{\\sigma}^2\\textrm{df}/n$\n\n::: callout-important\nUnfortunately, df may be difficult or impossible to calculate for complicated\nprediction methods. But one can often estimate it well. This idea is beyond\nthe level of this course.\n:::\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors, \n\n\\begin{aligned}\n\\textrm{AIC} &= -n + 2n\\log(2\\pi) - 2 + 2\\log(n) - 2\\log(RSS) + 2(p+2) \\\\\n&\\propto -2\\log(RSS) + 2(p+2)\n\\end{aligned}\n\n( $p+2$ because of the intercept and the unknown variance)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a [low-bias / high-variance]{.hand} situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a [high-bias / low-variance]{.hand} situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Recommendations\n\n::: secondary\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. 
\n\nCV is usually easiest to make sense of and doesn't depend on other unknown parameters.\n\nBut, it requires refitting the model.\n\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.\n:::\n\n\n\n## High-level intuition of these:\n\n* GCV tends to choose \"dense\" models.\n\n* Theory says AIC chooses the \"best predicting model\" asymptotically.\n\n* Theory says BIC should choose the \"true model\" asymptotically, tends to select fewer predictors.\n\n* In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n\n\n* As a technical point, CV (or validation set) is estimating error on \n[new data]{.secondary}, unseen $(X_0, Y_0)$, while AIC / CP are estimating error on [new Y]{.secondary} at the observed $x_1,\\ldots,x_n$. This is subtle.\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n[Use CV]{.hand .secondary}\n\n\n## A few more caveats\n\nIt is often tempting to \"just compare\" risk estimates from vastly different models. \n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n[This is not always justified.]{.secondary}\n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. Residuals / response variables on different scales aren't directly comparable.\n\n\"Validation set\" is easy, because you're always comparing to the \"right\" thing. But it has lots of drawbacks.\n\n\n\n# Next time ...\n\nGreedy selection\n",
"supporting": [
"06-information-criteria_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
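The freeze file above carries the full lecture markdown, which defines GCV, Mallows' Cp (the SURE form), and the LOO-CV shortcut all in terms of a fitted `lm()`. As a side-by-side illustration of those formulas (this sketch is not part of the commit, and every object name in it is ad hoc):

```r
# Illustrative sketch of the slide formulas; not code from the repository.
set.seed(406)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 2 * x1 - x2 + rnorm(n)          # x3 is pure noise
mdl <- lm(y ~ x1 + x2 + x3)

h    <- hatvalues(mdl)                # diagonal of the hat matrix H
df   <- sum(h)                        # tr(H); for lm() this is p + 1 (intercept)
mse  <- mean(residuals(mdl)^2)
sig2 <- summary(mdl)$sigma^2          # usual estimate of sigma^2

gcv   <- mse / (1 - df / n)^2                   # GCV from the slides
cp    <- mse + 2 * sig2 * df / n                # Mallows Cp / SURE form
loocv <- mean((residuals(mdl) / (1 - h))^2)     # same as cv_nice() in the deck
c(gcv = gcv, cp = cp, loocv = loocv)

AIC(mdl)  # on the -2 log-likelihood scale, so only compare it across candidate models
```

For a single well-specified linear model these three risk estimates usually land close together, which is behind the slide's remark that in some special cases AIC = Cp = SURE ≈ LOO-CV.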
7 changes: 5 additions & 2 deletions schedule/slides/06-information-criteria.qmd
@@ -152,9 +152,9 @@ But, it requires refitting the model.
Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.
:::

-. . .

-High-level intuition of these:

+## High-level intuition of these:

* GCV tends to choose "dense" models.

@@ -165,6 +165,9 @@ High-level intuition of these:
* In some special cases, AIC = Cp = SURE $\approx$ LOO-CV


+* As a technical point, CV (or validation set) is estimating error on
+[new data]{.secondary}, unseen $(X_0, Y_0)$, while AIC / CP are estimating error on [new Y]{.secondary} at the observed $x_1,\ldots,x_n$. This is subtle.

::: aside
For more information: see [ESL] Chapter 7.
This material is more challenging than the level of this course, and is easily and often misunderstood.
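To make the newly added technical point concrete, a rough simulation sketch (again illustrative only, not part of the commit; `one_rep()`, the sample sizes, and all other names are made up) can compare the two targets: Cp tracks the error of predicting fresh responses at the same design points, while LOO-CV tracks the error at freshly drawn predictors.

```r
# Illustrative only: fixed-x error (Cp's target) vs new-(X0, Y0) error (CV's target).
set.seed(1)
n <- 30; p <- 10; sigma <- 1
beta <- rep(1, p)

one_rep <- function() {
  X     <- matrix(rnorm(n * p), n, p)
  dat   <- data.frame(X)
  dat$y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit   <- lm(y ~ ., data = dat)

  h    <- hatvalues(fit)
  sig2 <- summary(fit)$sigma^2
  cp   <- mean(residuals(fit)^2) + 2 * sig2 * sum(h) / n   # Cp / SURE form
  loo  <- mean((residuals(fit) / (1 - h))^2)               # LOO-CV shortcut

  # new responses at the *observed* x's: the quantity Cp is estimating
  y_star    <- drop(X %*% beta) + rnorm(n, sd = sigma)
  err_fixed <- mean((fitted(fit) - y_star)^2)

  # an entirely new draw (X0, Y0): closer to what (LOO-)CV is estimating
  new_X <- data.frame(matrix(rnorm(n * p), n, p))
  y0    <- drop(as.matrix(new_X) %*% beta) + rnorm(n, sd = sigma)
  err_new <- mean((predict(fit, newdata = new_X) - y0)^2)

  c(cp = cp, err_fixed = err_fixed, loo = loo, err_new = err_new)
}

round(rowMeans(replicate(500, one_rep())), 3)
```

With p not small relative to n, the new-(X0, Y0) error tends to sit above the fixed-x error and each criterion tracks its own target; when p is much smaller than n the two targets nearly coincide, which is why the distinction is subtle.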
