Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Sep 24, 2024
1 parent 2dbf12e commit 141c983
Showing 4 changed files with 77 additions and 87 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
-87f2f05c
+57dbfbac
45 changes: 21 additions & 24 deletions schedule/slides/06-information-criteria.html
@@ -458,7 +458,7 @@ <h2>LOO-CV: Math to the rescue!</h2>
<section id="loo-cv-math-to-the-rescue-1" class="slide level2">
<h2>LOO-CV: Math to the rescue!</h2>
<p>For models where predictions are a <strong>linear function</strong> of the training responses*,</p>
-<p><strong>LOO-CV has a closed-form expression!</strong></p>
+<p><strong>LOO-CV has a closed-form expression!</strong> Just need to fit <em>once</em>:</p>
<p><span class="math display">\[\mbox{LOO-CV} \,\, \hat R_n = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i -\widehat{Y}_i)^2}{(1-{\boldsymbol H}_{ii})^2}.\]</span></p>
<ul>
<li>Numerator is the <em>squared residual</em> (loss) for training point <span class="math inline">\(i\)</span>.</li>
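The shortcut is easy to check numerically. Below is a minimal R sketch (the `mtcars` model is an illustrative assumption, not from the slides): LOO-CV from a single fit via `hatvalues()`, verified against the brute-force leave-one-out loop, plus the GCV variant that replaces each \(H_{ii}\) by the average \(\mathrm{tr}(H)/n\).

```r
# LOO-CV for a linear smoother from one fit, using the hat-value formula above.
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)                 # leverage values H_ii
r <- residuals(fit)                 # Y_i - Yhat_i
loo_cv <- mean((r / (1 - h))^2)     # closed-form LOO-CV risk estimate

# Brute-force check: refit n times, leaving out one point each time.
brute <- mean(sapply(seq_len(nrow(mtcars)), function(i) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  (mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ]))^2
}))
all.equal(loo_cv, brute)            # TRUE, up to floating-point error

# GCV replaces each H_ii by the average tr(H)/n:
gcv <- mean((r / (1 - mean(h)))^2)
```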
@@ -651,8 +651,8 @@ <h3 id="observations-1">Observations</h3>
<h2>AIC and BIC</h2>
<p>These have a very similar flavor to <span class="math inline">\(C_p\)</span>, but their genesis is different.</p>
<p>Without going into too much detail, they look like</p>
-<p><span class="math inline">\(\textrm{AIC}/n = -2\textrm{loglikelihood}/n + 2\textrm{df}/n\)</span></p>
-<p><span class="math inline">\(\textrm{BIC}/n = -2\textrm{loglikelihood}/n + \log(n)\textrm{df}/n\)</span></p>
+<p><span class="math inline">\(\textrm{AIC}/n = -2\textrm{log-likelihood}/n + 2\textrm{df}/n\)</span></p>
+<p><span class="math inline">\(\textrm{BIC}/n = -2\textrm{log-likelihood}/n + \log(n)\textrm{df}/n\)</span></p>
<div class="fragment">
<p>In the case of a linear model with Gaussian errors and <span class="math inline">\(p\)</span> predictors</p>
<span class="math display">\[\begin{aligned}
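A hedged sketch of the Gaussian special case in R (the example model is an assumption; as the note below warns, software counts df and constants differently, so compare differences between models, not raw values):

```r
# AIC/n and BIC/n for a Gaussian linear model, following the generic forms above.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)
p <- length(coef(fit)) - 1          # predictors, excluding the intercept
rss <- sum(residuals(fit)^2)

aic_n <- log(2 * pi) + log(rss / n) + 2 * (p + 1) / n
bic_n <- log(2 * pi) + log(rss / n) + log(n) * (p + 1) / n

# R's built-ins count df = p + 2 (coefficients plus the variance) and keep an
# additive constant, so the levels differ from the above, but model *rankings*
# under a fixed likelihood agree.
AIC(fit)
BIC(fit)
```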
@@ -683,31 +683,29 @@ <h2>Over-fitting vs.&nbsp;Under-fitting</h2>
<blockquote>
<p>Over-fitting means estimating a really complicated function when you don’t have enough data.</p>
</blockquote>
-<p>This is likely a <span class="hand">low-bias / high-variance</span> situation.</p>
+<p>This is likely a <strong>low-bias / high-variance</strong> situation.</p>
<blockquote>
<p>Under-fitting means estimating a really simple function when you have lots of data.</p>
</blockquote>
-<p>This is likely a <span class="hand">high-bias / low-variance</span> situation.</p>
+<p>This is likely a <strong>high-bias / low-variance</strong> situation.</p>
<p>Both of these outcomes are bad (they have high risk <span class="math inline">\(=\)</span> big <span class="math inline">\(R_n\)</span> ).</p>
<p>The best way to avoid them is to use a reasonable estimate of <em>prediction risk</em> to choose how complicated your model should be.</p>
</section>
-<section id="recommendations" class="slide level2">
-<h2>Recommendations</h2>
-<div class="secondary">
-<p>When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.</p>
-<p>CV is usually easiest to make sense of and doesn’t depend on other unknown parameters.</p>
-<p>But, it requires refitting the model.</p>
-<p>Also, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.</p>
-</div>
-</section>
-<section id="high-level-intuition-of-these" class="slide level2">
-<h2>High-level intuition of these:</h2>
+<section id="commentary" class="slide level2">
+<h2>Commentary</h2>
<ul>
+<li>When comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.
+<ul>
+<li>In some special cases, AIC = Cp = SURE <span class="math inline">\(\approx\)</span> LOO-CV</li>
+</ul></li>
+<li>CV is generic, easy, and doesn’t depend on unknowns.
+<ul>
-<li><p>GCV tends to choose “dense” models.</p></li>
-<li><p>Theory says AIC chooses the “best predicting model” asymptotically.</p></li>
-<li><p>Theory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.</p></li>
-<li><p>In some special cases, AIC = Cp = SURE <span class="math inline">\(\approx\)</span> LOO-CV</p></li>
-<li><p>As a technical point, CV (or validation set) is estimating error on <span class="secondary">new data</span>, unseen <span class="math inline">\((X_0, Y_0)\)</span>, while AIC / CP are estimating error on <span class="secondary">new Y</span> at the observed <span class="math inline">\(x_1,\ldots,x_n\)</span>. This is subtle.</p></li>
+<li>But requires refitting, and nontrivial for discrete predictors, time series, etc.</li>
+</ul></li>
+<li>GCV tends to choose “dense” models.</li>
+<li>Theory says AIC chooses “best predicting model” asymptotically.</li>
+<li>Theory says BIC chooses “true model” asymptotically, tends to select fewer predictors.</li>
+<li>Technical: CV (or validation set) is estimating error on <span class="secondary">new data</span>, unseen <span class="math inline">\((X_0, Y_0)\)</span>; AIC / CP are estimating error on <span class="secondary">new Y</span> at the observed <span class="math inline">\(x_1,\ldots,x_n\)</span>. This is subtle.</li>
</ul>

<aside><div>
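To make the “CV is generic and easy” point concrete, here is a hedged K-fold sketch in R (the helper name `kfold_cv`, the fold count, and the data are illustrative assumptions, not from the slides):

```r
# Generic K-fold CV risk estimate for a linear model formula.
kfold_cv <- function(formula, data, K = 10) {
  yname <- all.vars(formula)[1]                       # response variable name
  folds <- sample(rep(seq_len(K), length.out = nrow(data)))
  fold_err <- sapply(seq_len(K), function(k) {
    test <- data[folds == k, , drop = FALSE]
    fit <- lm(formula, data = data[folds != k, , drop = FALSE])
    mean((test[[yname]] - predict(fit, newdata = test))^2)  # held-out MSE
  })
  mean(fold_err)                                      # estimate of R_n
}

set.seed(406)
kfold_cv(mpg ~ wt + hp, mtcars, K = 5)
```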
@@ -716,11 +714,11 @@ <h2>High-level intuition of these:</h2>
<section>
<section id="my-recommendation" class="title-slide slide level1 center">
<h1>My recommendation:</h1>
-<p><span class="hand secondary">Use CV</span></p>
+<p><strong>Use CV.</strong></p>
</section>
<section id="a-few-more-caveats" class="slide level2">
<h2>A few more caveats</h2>
-<p>It is often tempting to “just compare” risk estimates from vastly different models.</p>
+<p>Tempting to “just compare” risk estimates from vastly different models.</p>
<p>For example,</p>
<ul>
<li><p>different transformations of the predictors,</p></li>
@@ -733,7 +731,6 @@ <h2>A few more caveats</h2>
<li><p>Different likelihoods aren’t comparable.</p></li>
<li><p>Residuals / response variables on different scales aren’t directly comparable.</p></li>
</ol>
-<p>“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks.</p>
</section></section>
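The scale caveat in particular bites often. A hedged R sketch (the models are illustrative assumptions) of why residuals on different response scales can’t be compared directly:

```r
fit_raw <- lm(mpg ~ wt, data = mtcars)
fit_log <- lm(log(mpg) ~ wt, data = mtcars)

# Not comparable: these residuals live on different scales.
mean(residuals(fit_raw)^2)   # units of (miles per gallon)^2
mean(residuals(fit_log)^2)   # units of (log mpg)^2 -- smaller, but meaninglessly so

# Better: put predictions back on the mpg scale before computing loss
# (this simple exp() back-transform ignores retransformation bias).
mean((mtcars$mpg - fitted(fit_raw))^2)
mean((mtcars$mpg - exp(fitted(fit_log)))^2)
```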
<section id="next-time" class="title-slide slide level1 center">
<h1>Next time …</h1>
21 changes: 7 additions & 14 deletions search.json
@@ -3126,7 +3126,7 @@
"href": "schedule/slides/06-information-criteria.html#loo-cv-math-to-the-rescue-1",
"title": "UBC Stat406 2024W",
"section": "LOO-CV: Math to the rescue!",
"text": "LOO-CV: Math to the rescue!\nFor models where predictions are a linear function of the training responses*,\nLOO-CV has a closed-form expression!\n\\[\\mbox{LOO-CV} \\,\\, \\hat R_n = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}.\\]\n\nNumerator is the squared residual (loss) for training point \\(i\\).\nDenominator weights each residual by diagonal of \\(H\\) some factor\n\\(H_{ii}\\) are leverage/hat values: tell you what happens when moving data point \\(i\\) a bit\n\n*: plus some technicalities\n\n\n\n\n\n\n\nTip\n\n\nDeriving this sucks. I wouldn’t recommend doing it yourself."
"text": "LOO-CV: Math to the rescue!\nFor models where predictions are a linear function of the training responses*,\nLOO-CV has a closed-form expression! Just need to fit once:\n\\[\\mbox{LOO-CV} \\,\\, \\hat R_n = \\frac{1}{n} \\sum_{i=1}^n \\frac{(Y_i -\\widehat{Y}_i)^2}{(1-{\\boldsymbol H}_{ii})^2}.\\]\n\nNumerator is the squared residual (loss) for training point \\(i\\).\nDenominator weights each residual by diagonal of \\(H\\) some factor\n\\(H_{ii}\\) are leverage/hat values: tell you what happens when moving data point \\(i\\) a bit\n\n*: plus some technicalities\n\n\n\n\n\n\n\nTip\n\n\nDeriving this sucks. I wouldn’t recommend doing it yourself."
},
{
"objectID": "schedule/slides/06-information-criteria.html#computing-the-formula",
@@ -3182,7 +3182,7 @@
"href": "schedule/slides/06-information-criteria.html#aic-and-bic",
"title": "UBC Stat406 2024W",
"section": "AIC and BIC",
"text": "AIC and BIC\nThese have a very similar flavor to \\(C_p\\), but their genesis is different.\nWithout going into too much detail, they look like\n\\(\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n\\)\n\\(\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n\\)\n\nIn the case of a linear model with Gaussian errors and \\(p\\) predictors\n\\[\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\\]\n( \\(p+1\\) because of the unknown variance, intercept included in \\(p\\) or not)\n\n\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here."
"text": "AIC and BIC\nThese have a very similar flavor to \\(C_p\\), but their genesis is different.\nWithout going into too much detail, they look like\n\\(\\textrm{AIC}/n = -2\\textrm{log-likelihood}/n + 2\\textrm{df}/n\\)\n\\(\\textrm{BIC}/n = -2\\textrm{log-likelihood}/n + 2\\log(n)\\textrm{df}/n\\)\n\nIn the case of a linear model with Gaussian errors and \\(p\\) predictors\n\\[\\begin{aligned}\n\\textrm{AIC}/n &= \\log(2\\pi) + \\log(RSS/n) + 2(p+1)/n \\\\\n&\\propto \\log(RSS) + 2(p+1)/n\n\\end{aligned}\\]\n( \\(p+1\\) because of the unknown variance, intercept included in \\(p\\) or not)\n\n\n\n\n\n\n\n\nImportant\n\n\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is super annoying.\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here."
},
{
"objectID": "schedule/slides/06-information-criteria.html#over-fitting-vs.-under-fitting",
@@ -3192,25 +3192,18 @@
"text": "Over-fitting vs. Under-fitting\n\nOver-fitting means estimating a really complicated function when you don’t have enough data.\n\nThis is likely a low-bias / high-variance situation.\n\nUnder-fitting means estimating a really simple function when you have lots of data.\n\nThis is likely a high-bias / low-variance situation.\nBoth of these outcomes are bad (they have high risk \\(=\\) big \\(R_n\\) ).\nThe best way to avoid them is to use a reasonable estimate of prediction risk to choose how complicated your model should be."
},
{
"objectID": "schedule/slides/06-information-criteria.html#recommendations",
"href": "schedule/slides/06-information-criteria.html#recommendations",
"objectID": "schedule/slides/06-information-criteria.html#commentary",
"href": "schedule/slides/06-information-criteria.html#commentary",
"title": "UBC Stat406 2024W",
"section": "Recommendations",
"text": "Recommendations\n\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.\nCV is usually easiest to make sense of and doesn’t depend on other unknown parameters.\nBut, it requires refitting the model.\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc."
},
{
"objectID": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
"href": "schedule/slides/06-information-criteria.html#high-level-intuition-of-these",
"title": "UBC Stat406 2024W",
"section": "High-level intuition of these:",
"text": "High-level intuition of these:\n\nGCV tends to choose “dense” models.\nTheory says AIC chooses the “best predicting model” asymptotically.\nTheory says BIC should choose the “true model” asymptotically, tends to select fewer predictors.\nIn some special cases, AIC = Cp = SURE \\(\\approx\\) LOO-CV\nAs a technical point, CV (or validation set) is estimating error on new data, unseen \\((X_0, Y_0)\\), while AIC / CP are estimating error on new Y at the observed \\(x_1,\\ldots,x_n\\). This is subtle.\n\n\n\nFor more information: see [ESL] Chapter 7. This material is more challenging than the level of this course, and is easily and often misunderstood."
"section": "Commentary",
"text": "Commentary\n\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV.\n\nIn some special cases, AIC = Cp = SURE \\(\\approx\\) LOO-CV\n\nCV is generic, easy, and doesn’t depend on unknowns.\n\nBut requires refitting, and nontrivial for discrete predictors, time series, etc.\n\nGCV tends to choose “dense” models.\nTheory says AIC chooses “best predicting model” asymptotically.\nTheory says BIC chooses “true model” asymptotically, tends to select fewer predictors.\nTechnical: CV (or validation set) is estimating error on new data, unseen \\((X_0, Y_0)\\); AIC / CP are estimating error on new Y at the observed \\(x_1,\\ldots,x_n\\). This is subtle.\n\n\n\nFor more information: see [ESL] Chapter 7. This material is more challenging than the level of this course, and is easily and often misunderstood."
},
{
"objectID": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
"href": "schedule/slides/06-information-criteria.html#a-few-more-caveats",
"title": "UBC Stat406 2024W",
"section": "A few more caveats",
"text": "A few more caveats\nIt is often tempting to “just compare” risk estimates from vastly different models.\nFor example,\n\ndifferent transformations of the predictors,\ndifferent transformations of the response,\nPoisson likelihood vs. Gaussian likelihood in glm()\n\nThis is not always justified.\n\nThe “high-level intuition” is for “nested” models.\nDifferent likelihoods aren’t comparable.\nResiduals / response variables on different scales aren’t directly comparable.\n\n“Validation set” is easy, because you’re always comparing to the “right” thing. But it has lots of drawbacks."
"text": "A few more caveats\nTempting to “just compare” risk estimates from vastly different models.\nFor example,\n\ndifferent transformations of the predictors,\ndifferent transformations of the response,\nPoisson likelihood vs. Gaussian likelihood in glm()\n\nThis is not always justified.\n\nThe “high-level intuition” is for “nested” models.\nDifferent likelihoods aren’t comparable.\nResiduals / response variables on different scales aren’t directly comparable."
},
{
"objectID": "schedule/slides/19-bagging-and-rf.html#section",
