diff --git a/_freeze/schedule/slides/00-r-review/execute-results/html.json b/_freeze/schedule/slides/00-r-review/execute-results/html.json index dc02962..0523c7f 100644 --- a/_freeze/schedule/slides/00-r-review/execute-results/html.json +++ b/_freeze/schedule/slides/00-r-review/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "de70733d64f6b7d191db020c494ceffe", + "hash": "86d9e3f58e1e8414feba1c2862895ab7", "result": { - "markdown": "---\nlecture: \"00 R, Rmarkdown, code, and `{tidyverse}`:
A whirlwind tour\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n# The basics\n\n\n## Tour of Rstudio\n\nThings to note\n\n1. Console\n1. Terminal\n1. Scripts, .Rmd, Knit\n1. Files, Projects\n1. Getting help\n1. Environment, Git\n\n\n\n## Simple stuff\n\n:::: {.columns}\n::: {.column width=\"45%\"}\n\n### Vectors:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx <- c(1, 3, 4)\nx[1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nx[-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3 4\n```\n:::\n\n```{.r .cell-code}\nrev(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4 3 1\n```\n:::\n\n```{.r .cell-code}\nc(x, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 3 4 1 3 4\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"45%\"}\n\n### Matrices:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx <- matrix(1:25, nrow = 5, ncol = 5)\nx[1,]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 6 11 16 21\n```\n:::\n\n```{.r .cell-code}\nx[,-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4]\n[1,] 6 11 16 21\n[2,] 7 12 17 22\n[3,] 8 13 18 23\n[4,] 9 14 19 24\n[5,] 10 15 20 25\n```\n:::\n\n```{.r .cell-code}\nx[c(1,3), 2:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 6 11\n[2,] 8 13\n```\n:::\n:::\n\n\n:::\n::::\n\n\n## Simple stuff\n\n::: flex\n::: w-50\n\n### Lists\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(l <- list(\n a = letters[1:2], \n b = 1:4, \n c = list(a = 1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] \"a\" \"b\"\n\n$b\n[1] 1 2 3 4\n\n$c\n$c$a\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nl$a\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\"\n```\n:::\n\n```{.r .cell-code}\nl$c$a\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nl[\"b\"] # compare to l[[\"b\"]] == l$b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$b\n[1] 1 2 3 4\n```\n:::\n:::\n\n:::\n\n\n::: w-50\n\n### Data frames\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(dat <- data.frame(\n z = 1:5, \n b = 6:10, \n c = letters[1:5]))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n z b c\n1 1 6 a\n2 2 7 b\n3 3 8 c\n4 4 9 d\n5 5 10 e\n```\n:::\n\n```{.r .cell-code}\nclass(dat)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\ndat$b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6 7 8 9 10\n```\n:::\n\n```{.r .cell-code}\ndat[1,]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n z b c\n1 1 6 a\n```\n:::\n:::\n\n\n:::\n:::\n\n[Data frames are sort-of lists and sort-of matrices]{.secondary}\n\n\n## Tibbles\n\n[These are `{tidyverse}` data frames]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(dat2 <- tibble(z = 1:5, b = z + 5, c = letters[z]))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n z b c \n \n1 1 6 a \n2 2 7 b \n3 3 8 c \n4 4 9 d \n5 5 10 e \n```\n:::\n\n```{.r .cell-code}\nclass(dat2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n:::\n\n\nWe'll return to classes in a moment. A `tbl_df` is a \"subclass\" of `data.frame`.\n\nAnything that `data.frame` can do, `tbl_df` can do (better).\n\nFor instance, the printing is more informative.\n\nAlso, you can construct one by referencing previously constructed columns.\n\n\n\n# Functions\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nsig <- sig::sig\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r .cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments\n* If you name them, you can pass them out of order (but you shouldn't).\n\n\n## Write lots of functions. I can't emphasize this enough.\n\n::: flex\n\n::: w-50\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf <- function(arg1, arg2, arg3 = 12, ...) 
{\n stuff <- arg1 * arg3\n stuff2 <- stuff + arg2\n plot(arg1, stuff2, ...)\n return(stuff2)\n}\nx <- rnorm(100)\n```\n:::\n\n:::\n\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny1 <- f(x, 3, 15, col = 4, pch = 19)\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n\n```{.r .cell-code}\nstr(y1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n num [1:100] -3.8 12.09 -24.27 12.45 -1.14 ...\n```\n:::\n:::\n\n\n:::\n:::\n\n\n## Outputs vs. Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function does, outputs can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 89 142 200 193 170 74 38 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.178 0.284 0.4 0.386 0.34 0.148 0.076 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## When writing functions, program defensively, ensure behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n\n## How to keep track\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n if (!is.numeric(inc_by)) {\n 
stop(\"`inc_by` must be numeric\")\n }\n x + inc_by\n}\nexpect_error(incrementer(\"a\"))\nexpect_equal(incrementer(1:3), 2:4)\nexpect_equal(incrementer(2, -3), -1)\nexpect_error(incrementer(1, \"b\"))\nexpect_identical(incrementer(1:3), 2:4)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: incrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n:::\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\n```\n:::\n\n:::\n:::\n\n. . .\n\n::: callout-important\nIf you copy something, write a function.\n\nValidate your arguments.\n\nTo ensure proper functionality, write tests to check if inputs result in predicted outputs.\n:::\n\n\n\n# Classes and methods\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\nWe saw some of these earlier:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib )\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n -0.1742 1.0454 2.0470 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: `obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) 
\n\n* The convention is\nthat the specialized function is named `method.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `method.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`\ncould have been done two centuries ago, or a Batrachian Emphasis Machine\nwhich won't be invented for another five years. \n\n* The one draw-back is the help pages for the generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. \n\n\n\n# Environments \n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R` code run in the Console produces objects in the \"Global environment\"\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside\n\nFor example `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available)\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But, you can call those functions by prefixing the package name `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::` like `dplyr:::as_across_fn_call()`\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\"\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to bottom, with code chunks depending on previous\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 
😱\n\nObjects in your local environment are not available in the `.Rmd`\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Environment that don't exist in the `.Rmd`\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Carefully keeping Labs / Assignments in their current location will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n * Make sure you're calling the function correctly.\n * Try running the examples.\n * paste the error into Google (if you share the error on Slack, I often do this first)\n * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n * Did you make the mistake on the last slide?\n * Did it Knit before? Then the bug is in whatever you added.\n * Did you never Knit it? Why not?\n * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will Stop where you added `browser()` and you'll have access to the local environment to play around\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get on Slack that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just paste a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that. And we'll both get frustrated.\n:::\n\nWhat you need is a Reproducible Example or `reprex`.\n\n* This is a small chunk of code that \n 1. runs in it's own environment \n 1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples, How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save)\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex(venue = \"r\")` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Go to Slack and paste it in a message. Then press `Cmd+Shift+Enter` (on Mac) or `Ctrl+Shift+Enter` (Windows/Linux). Under Type, select `R`.\n\n1. Send the message, perhaps with more description and an SOS emoji.\n\n::: {.callout-note}\nBecause Reprex runs in it's own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. 
You'll have to load these things in the script.\n:::\n\n\n# Understanding `{tidyverse}`\n\n\n## `{tidyverse}` is huge\n\nCore tidyverse is nearly 30 different R packages, but we're going to just talk about a few of them.\n\nFalls roughly into a few categories:\n\n1. [Convenience functions:]{.secondary} `{magrittr}` and many many others.\n1. [Data processing:]{.secondary} `{dplyr}` and many others.\n1. [Graphing:]{.secondary} `{ggplot2}` and some others like `{scales}`.\n1. [Utilities]{.secondary}\n\n\n. . .\n\n
\n\nWe're going to talk quickly about some of it, but ignore much of 2.\n\nThere's a lot that's great about these packages, especially ease of data processing.\n\nBut it doesn't always jive with base `R` (it's almost a separate proglang at this point).\n\n\n## Piping\n\nThis was introduced by `{magrittr}` as `%>%`, \n\nbut is now in base R (>=4.1.0) as `|>`.\n\nNote: there are other pipes in `{magrittr}` (e.g. `%$%` and `%T%`) but I've never used them.\n\nI've used the old version for so long, that it's hard for me to adopt the new one.\n\nThe point of the pipe is to [logically sequence nested operations]{.secondary}\n\n## Example\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmse1 <- print(\n sum(\n residuals(\n lm(y~., data = mutate(\n tib, \n x3 = x1^2,\n x4 = log(x2 + abs(min(x2)) + 1)\n )\n )\n )^2\n )\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6.469568e-29\n```\n:::\n:::\n\n\n:::\n\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"|5-6\"}\nmse2 <- tib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() |>\n print()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6.469568e-29\n```\n:::\n:::\n\n\n:::\n::::\n\n## \n\nIt may seem like we should push this all the way\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"|9-10\"}\ntib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() ->\n mse3\n```\n:::\n\n\nThis works, but it's really annoying.\n\n## A new one...\n\nJust last week, I learned\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"1-4|5-8|\"}\nlibrary(magrittr)\ntib <- tibble(x = 1:5, z = 6:10)\ntib <- tib |> mutate(b = x + z)\ntib\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n x z b\n \n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n```\n:::\n\n```{.r .cell-code code-line-numbers=\"1-4|5-8|\"}\n# start over\ntib <- tibble(x = 1:5, z = 6:10)\ntib %<>% mutate(b = x + z)\ntib\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n x z b\n \n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n```\n:::\n:::\n\n\n\n\n## Data processing in `{dplyr}` {.smaller}\n\nThis package has all sorts of things. And it interacts with `{tibble}` generally.\n\nThe basic idea is \"tibble in, tibble out\".\n\nSatisfies [data masking]{.secondary} which means you can refer to columns by name or use helpers like `ends_with(\"_rate\")`\n\nMajorly useful operations:\n\n1. `select()` (chooses columns to keep)\n1. `mutate()` (showed this already)\n1. `group_by()`\n1. `pivot_longer()` and `pivot_wider()`\n1. `left_join()` and `full_join()`\n1. `summarise()`\n\n::: {.callout-note}\n`filter()` and `select()` are functions in Base R. \n\nSometimes you get 🐞 because it called the wrong version. 
\n\nTo be sure, prefix it like `dplyr::select()`.\n:::\n\n## A useful data frame\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(epidatr)\ncovid <- covidcast(\n source = \"jhu-csse\",\n signals = \"confirmed_7dav_incidence_prop,deaths_7dav_incidence_prop\",\n time_type = \"day\",\n geo_type = \"state\",\n time_values = epirange(20220801, 20220821),\n geo_values = \"ca,wa\") |>\n fetch() |>\n select(geo_value, time_value, signal, value)\n\ncovid\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 84 × 4\n geo_value time_value signal value\n \n 1 ca 2022-08-01 confirmed_7dav_incidence_prop 45.4\n 2 wa 2022-08-01 confirmed_7dav_incidence_prop 27.7\n 3 ca 2022-08-02 confirmed_7dav_incidence_prop 44.9\n 4 wa 2022-08-02 confirmed_7dav_incidence_prop 27.7\n 5 ca 2022-08-03 confirmed_7dav_incidence_prop 44.5\n 6 wa 2022-08-03 confirmed_7dav_incidence_prop 26.6\n 7 ca 2022-08-04 confirmed_7dav_incidence_prop 42.3\n 8 wa 2022-08-04 confirmed_7dav_incidence_prop 26.6\n 9 ca 2022-08-05 confirmed_7dav_incidence_prop 40.7\n10 wa 2022-08-05 confirmed_7dav_incidence_prop 34.6\n# ℹ 74 more rows\n```\n:::\n:::\n\n\n## Examples\n\nRename the `signal` to something short.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid <- covid |> \n mutate(signal = case_when(\n str_starts(signal, \"confirmed\") ~ \"case_rate\", \n TRUE ~ \"death_rate\"\n ))\n```\n:::\n\n\n\nSort by `time_value` then `geo_value`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid <- covid |> arrange(time_value, geo_value)\n```\n:::\n\n\nCalculate grouped medians\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid |> \n group_by(geo_value, signal) |>\n summarise(med = median(value), .groups = \"drop\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 3\n geo_value signal med\n \n1 ca case_rate 33.2 \n2 ca death_rate 0.112\n3 wa case_rate 23.2 \n4 wa death_rate 0.178\n```\n:::\n:::\n\n\n## Examples\n\nSplit the data into two tibbles by signal\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncases <- covid |> \n filter(signal == \"case_rate\") |>\n rename(case_rate = value) |> select(-signal)\ndeaths <- covid |> \n filter(signal == \"death_rate\") |>\n rename(death_rate = value) |> select(-signal)\n```\n:::\n\n\nJoin them together\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\njoined <- full_join(cases, deaths, by = c(\"geo_value\", \"time_value\"))\n```\n:::\n\n\nDo the same thing by pivoting\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid |> pivot_wider(names_from = signal, values_from = value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 42 × 4\n geo_value time_value case_rate death_rate\n \n 1 ca 2022-08-01 45.4 0.105\n 2 wa 2022-08-01 27.7 0.169\n 3 ca 2022-08-02 44.9 0.106\n 4 wa 2022-08-02 27.7 0.169\n 5 ca 2022-08-03 44.5 0.107\n 6 wa 2022-08-03 26.6 0.173\n 7 ca 2022-08-04 42.3 0.112\n 8 wa 2022-08-04 26.6 0.173\n 9 ca 2022-08-05 40.7 0.116\n10 wa 2022-08-05 34.6 0.225\n# ℹ 32 more rows\n```\n:::\n:::\n\n\n\n\n## Plotting with `{ggplot2}`\n\n* Everything you can do with `ggplot()`, you can do with `plot()`. 
But the \ndefaults are _much_ prettier.\n\n* It's also much easier to adjust by aesthetics / panels by factors.\n\n* It also uses \"data masking\": data goes into `ggplot(data = mydata)`, then the columns are available to the rest.\n\n* It (sort of) pipes, but by adding **layers** with `+`\n\n* It [strongly prefers]{.secondary} \"long\" data frames over \"wide\" data frames.\n\n
\n\nI'll give a very fast overview of some confusing bits.\n\n# \n\nI suggest exploring\n\n🔗 [This slide deck](https://djnavarro.net/slides-starting-ggplot/)\n\nfor more help\n\n\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> \n filter(signal == \"case_rate\")\n) +\n geom_point(\n mapping = aes(\n x = time_value,\n y = value\n )\n ) + \n geom_smooth( \n mapping = aes( \n x = time_value, \n y = value \n ) \n ) \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n::: {.notes}\n\n* The complete code\n* Data is specified in the ggplot, passed along\n* (we show only case_rate)\n\n\n* The Grey SE shading is pretty ugly\n* And there are two states mashed together\n* That trend is awfully wiggly\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> filter(signal == \"case_rate\")\n) +\n geom_point(\n mapping = aes(\n x = time_value,\n y = value,\n colour = geo_value\n )\n ) + \n geom_smooth( \n mapping = aes( \n x = time_value, \n y = value,\n colour = geo_value\n ),\n se = FALSE,\n method = \"lm\"\n ) \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms2-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\n* Separate out the states by colour\n* straight lines instead\n* no more grey shading\n* Why do I keep writing all that mapping = stuff?\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> filter(signal == \"case_rate\"),\n mapping = aes(\n x = time_value,\n y = value,\n colour = geo_value\n )\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms3-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\nmapping in the `ggplot()` call is shared across the rest\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid |> filter(signal == \"case_rate\"),\n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms4-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\nDon't need to name the arguments.\n\nThis is typically what ggplot code looks like.\n\nLet's go a bit further to spruce this up.\n\n:::\n\n---\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid, \n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") +\n facet_grid(signal ~ geo_value) +\n scale_colour_manual(\n name = NULL, \n values = c(blue, orange)) +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms5-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\n* use facet_grid to split out states / show both signals (formula)\n* change the colour scaling, remove the annoying title\n* put the legend on the bottom\n* But the y-axis scale is shared, measurements are on different scales\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid, \n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = 
\"lm\") +\n facet_grid(signal ~ geo_value, scales = \"free_y\") +\n scale_colour_manual(\n name = NULL, \n values = c(blue, orange)) +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms6-1.svg){fig-align='center'}\n:::\n:::\n", + "markdown": "---\nlecture: \"00 R, Rmarkdown, code, and `{tidyverse}`:
A whirlwind tour\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n# The basics\n\n\n## Tour of Rstudio\n\nThings to note\n\n1. Console\n1. Terminal\n1. Scripts, .Rmd, Knit\n1. Files, Projects\n1. Getting help\n1. Environment, Git\n\n\n\n## Simple stuff\n\n:::: {.columns}\n::: {.column width=\"45%\"}\n\n### Vectors:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx <- c(1, 3, 4)\nx[1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nx[-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3 4\n```\n:::\n\n```{.r .cell-code}\nrev(x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4 3 1\n```\n:::\n\n```{.r .cell-code}\nc(x, x)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 3 4 1 3 4\n```\n:::\n:::\n\n:::\n\n::: {.column width=\"10%\"}\n:::\n\n::: {.column width=\"45%\"}\n\n### Matrices:\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nx <- matrix(1:25, nrow = 5, ncol = 5)\nx[1,]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 6 11 16 21\n```\n:::\n\n```{.r .cell-code}\nx[,-1]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2] [,3] [,4]\n[1,] 6 11 16 21\n[2,] 7 12 17 22\n[3,] 8 13 18 23\n[4,] 9 14 19 24\n[5,] 10 15 20 25\n```\n:::\n\n```{.r .cell-code}\nx[c(1,3), 2:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 6 11\n[2,] 8 13\n```\n:::\n:::\n\n\n:::\n::::\n\n\n## Simple stuff\n\n::: flex\n::: w-50\n\n### Lists\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(l <- list(\n a = letters[1:2], \n b = 1:4, \n c = list(a = 1)))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$a\n[1] \"a\" \"b\"\n\n$b\n[1] 1 2 3 4\n\n$c\n$c$a\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nl$a\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"a\" \"b\"\n```\n:::\n\n```{.r .cell-code}\nl$c$a\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nl[\"b\"] # compare to l[[\"b\"]] == l$b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$b\n[1] 1 2 3 4\n```\n:::\n:::\n\n:::\n\n\n::: w-50\n\n### Data frames\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(dat <- data.frame(\n z = 1:5, \n b = 6:10, \n c = letters[1:5]))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n z b c\n1 1 6 a\n2 2 7 b\n3 3 8 c\n4 4 9 d\n5 5 10 e\n```\n:::\n\n```{.r .cell-code}\nclass(dat)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\ndat$b\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6 7 8 9 10\n```\n:::\n\n```{.r .cell-code}\ndat[1,]\n```\n\n::: {.cell-output 
.cell-output-stdout}\n```\n z b c\n1 1 6 a\n```\n:::\n:::\n\n\n:::\n:::\n\n[Data frames are sort-of lists and sort-of matrices]{.secondary}\n\n\n## Tibbles\n\n[These are `{tidyverse}` data frames]{.secondary}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\n(dat2 <- tibble(z = 1:5, b = z + 5, c = letters[z]))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n z b c \n \n1 1 6 a \n2 2 7 b \n3 3 8 c \n4 4 9 d \n5 5 10 e \n```\n:::\n\n```{.r .cell-code}\nclass(dat2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n:::\n\n\nWe'll return to classes in a moment. A `tbl_df` is a \"subclass\" of `data.frame`.\n\nAnything that `data.frame` can do, `tbl_df` can do (better).\n\nFor instance, the printing is more informative.\n\nAlso, you can construct one by referencing previously constructed columns.\n\n\n\n# Functions\n\n\n\n## Understanding signatures\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\nsig <- sig::sig\n```\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsig(lm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(formula, data, subset, weights, na.action, method = \"qr\", model\n = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts =\n NULL, offset, ...)\n```\n:::\n\n```{.r .cell-code}\nsig(`+`)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(e1, e2)\n```\n:::\n\n```{.r .cell-code}\nsig(dplyr::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(.data, ..., .by = NULL, .preserve = FALSE)\n```\n:::\n\n```{.r .cell-code}\nsig(stats::filter)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(x, filter, method = c(\"convolution\", \"recursive\"), sides = 2,\n circular = FALSE, init = NULL)\n```\n:::\n\n```{.r .cell-code}\nsig(rnorm)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nfn <- function(n, mean = 0, sd = 1)\n```\n:::\n:::\n\n\n\n## These are all the same\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(3, 0, 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n\n```{.r .cell-code}\nset.seed(12345)\nrnorm(sd = 1, n = 3, mean = 0)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.5855288 0.7094660 -0.1093033\n```\n:::\n:::\n\n\n* Functions can have default values.\n* You may, but don't have to, name the arguments\n* If you name them, you can pass them out of order (but you shouldn't).\n\n\n## Write lots of functions. I can't emphasize this enough.\n\n::: flex\n\n::: w-50\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nf <- function(arg1, arg2, arg3 = 12, ...) 
{\n stuff <- arg1 * arg3\n stuff2 <- stuff + arg2\n plot(arg1, stuff2, ...)\n return(stuff2)\n}\nx <- rnorm(100)\n```\n:::\n\n:::\n\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ny1 <- f(x, 3, 15, col = 4, pch = 19)\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/plot-it-1.svg){fig-align='center'}\n:::\n\n```{.r .cell-code}\nstr(y1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n num [1:100] -3.8 12.09 -24.27 12.45 -1.14 ...\n```\n:::\n:::\n\n\n:::\n:::\n\n\n## Outputs vs. Side effects\n\n::: flex\n::: w-50\n* Side effects are things a function does; outputs can be assigned to variables\n* A good example is the `hist` function\n* You have probably only seen the side effect, which is to plot the histogram\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmy_histogram <- hist(rnorm(1000))\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstr(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nList of 6\n $ breaks : num [1:14] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...\n $ counts : int [1:13] 4 21 41 89 142 200 193 170 74 38 ...\n $ density : num [1:13] 0.008 0.042 0.082 0.178 0.284 0.4 0.386 0.34 0.148 0.076 ...\n $ mids : num [1:13] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...\n $ xname : chr \"rnorm(1000)\"\n $ equidist: logi TRUE\n - attr(*, \"class\")= chr \"histogram\"\n```\n:::\n\n```{.r .cell-code}\nclass(my_histogram)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"histogram\"\n```\n:::\n:::\n\n\n:::\n:::\n\n\n\n## When writing functions, program defensively to ensure correct behaviour\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n x + 1\n}\n \nincrementer(2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer(1:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in x + 1: non-numeric argument to binary operator\n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n stopifnot(is.numeric(x))\n return(x + 1)\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): is.numeric(x) is not TRUE\n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + 1\n}\nincrementer(\"a\")\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in incrementer(\"a\"): `x` must be numeric\n```\n:::\n\n```{.r .cell-code}\nincrementer(2, -3) ## oops!\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n x + inc_by\n}\nincrementer(2, -3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] -1\n```\n:::\n:::\n\n:::\n:::\n\n\n## How to keep track\n\n::: flex\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(testthat)\nincrementer <- function(x, inc_by = 1) {\n if (!is.numeric(x)) {\n stop(\"`x` must be numeric\")\n }\n if (!is.numeric(inc_by)) {\n 
stop(\"`inc_by` must be numeric\")\n }\n x + inc_by\n}\nexpect_error(incrementer(\"a\"))\nexpect_equal(incrementer(1:3), 2:4)\nexpect_equal(incrementer(2, -3), -1)\nexpect_error(incrementer(1, \"b\"))\nexpect_identical(incrementer(1:3), 2:4)\n```\n\n::: {.cell-output .cell-output-error}\n```\nError: incrementer(1:3) not identical to 2:4.\nObjects equal but not identical\n```\n:::\n:::\n\n:::\n\n\n::: w-50\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nis.integer(2:4)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.integer(incrementer(1:3))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nexpect_identical(incrementer(1:3, 1L), 2:4)\n```\n:::\n\n:::\n:::\n\n. . .\n\n::: callout-important\nIf you copy something, write a function.\n\nValidate your arguments.\n\nTo ensure proper functionality, write tests to check if inputs result in predicted outputs.\n:::\n\n\n\n# Classes and methods\n\n\n\n## Classes\n\n::: flex\n::: w-50\n\nWe saw some of these earlier:\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(\n x1 = rnorm(100), \n x2 = rnorm(100), \n y = x1 + 2 * x2 + rnorm(100)\n)\nmdl <- lm(y ~ ., data = tib )\nclass(tib)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"tbl_df\" \"tbl\" \"data.frame\"\n```\n:::\n\n```{.r .cell-code}\nclass(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"lm\"\n```\n:::\n:::\n\n\nThe class allows for the use of \"methods\"\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint(mdl)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = y ~ ., data = tib)\n\nCoefficients:\n(Intercept) x1 x2 \n -0.1742 1.0454 2.0470 \n```\n:::\n:::\n\n\n:::\n\n\n::: w-50\n\n\n* `R` \"knows what to do\" when you `print()` an object of class `\"lm\"`.\n\n* `print()` is called a \"generic\" function. \n\n* You can create \"methods\" that get dispatched.\n\n* For any generic, `R` looks for a method for the class.\n\n* If available, it calls that function.\n\n:::\n:::\n\n## Viewing the dispatch chain\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(incrementer))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.function\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(tib))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n print.tbl_df\n=> print.tbl\n * print.data.frame\n * print.default\n```\n:::\n\n```{.r .cell-code}\nsloop::s3_dispatch(print(mdl))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n=> print.lm\n * print.default\n```\n:::\n:::\n\n\n\n## R-Geeky But Important\n\nThere are [lots]{.secondary} of generic functions in `R`\n\nCommon ones are `print()`, `summary()`, and `plot()`.\n\nAlso, lots of important statistical modelling concepts:\n`residuals()` `coef()` \n\n(In `python`, these work the opposite way: `obj.residuals`. The dot after the object accesses methods defined for that type of object. But the dispatch behaviour is less robust.) 
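\n\nA minimal sketch of dispatch in action (the `dog` class is invented here purely for illustration):\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nfido <- structure(list(name = \"Fido\"), class = \"dog\") # an object with a made-up class\nprint.dog <- function(x, ...) { # a method for the generic `print()`\n cat(\"Woof! My name is\", x$name, \"\\n\")\n invisible(x)\n}\nfido # auto-printing dispatches to print.dog()\n```\n:::\n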
\n\n* The convention is\nthat the specialized function is named `generic.class()`, e.g., `summary.lm()`.\n\n* If no specialized function is defined, R will try to use `generic.default()`.\n\nFor this reason, `R` programmers try to avoid `.` in names of functions or objects.\n\n\n## Wherefore methods?\n\n\n* The advantage is that you don't have to learn a totally\nnew syntax to grab residuals or plot things\n\n* You just use `residuals(mdl)` whether `mdl` has class `lm`, was fit\ntwo centuries ago, or comes from a Batrachian Emphasis Machine\nthat won't be invented for another five years. \n\n* The one drawback is that the help pages for generic methods tend\nto be pretty vague\n\n* Compare `?summary` with `?summary.lm`. \n\n\n\n# Environments \n\n\n## Different environments\n\n* These are often tricky, but are very common.\n\n* Most programming languages have this concept in one way or another.\n\n* In `R`, code run in the Console produces objects in the \"Global environment\"\n\n* You can see what you create in the \"Environment\" tab.\n\n* But there's lots of other stuff.\n\n* Many packages are automatically loaded at startup, so you have access to the functions and data inside\n\nFor example, `mean()`, `lm()`, `plot()`, `iris` (technically `iris` is lazy-loaded, meaning it's not in memory until you call it, but it is available)\n\n\n\n##\n\n* Other packages require you to load them with `library(pkg)` before their functions are available.\n\n* But you can call those functions by prefixing the package name: `ggplot2::ggplot()`.\n\n* You can also access functions that the package developer didn't \"export\" for use with `:::`, like `dplyr:::as_across_fn_call()`\n\n::: {.notes}\n\nThat is all about accessing \"objects in package environments\"\n\n:::\n\n\n## Other issues with environments\n\n\nAs one might expect, functions create an environment inside the function.\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nz <- 1\nfun <- function(x) {\n z <- x\n print(z)\n invisible(z)\n}\nfun(14)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 14\n```\n:::\n:::\n\n\nNon-trivial cases are `data-masking` environments.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntib <- tibble(x1 = rnorm(100), x2 = rnorm(100), y = x1 + 2 * x2)\nmdl <- lm(y ~ x2, data = tib)\nx2\n```\n\n::: {.cell-output .cell-output-error}\n```\nError in eval(expr, envir, enclos): object 'x2' not found\n```\n:::\n:::\n\n\n* `lm()` looks \"inside\" the `tib` to find `y` and `x2`\n* The data variables are added to the `lm()` environment\n\n\n## Other issues with environments\n\n[When Knit, `.Rmd` files run in their OWN environment.]{.fourth-colour}\n\nThey are run from top to bottom, with code chunks depending on previous ones.\n\nThis makes them reproducible.\n\nJupyter notebooks don't do this. 
😱\n\nObjects in your local environment are not available in the `.Rmd`\n\nObjects in the `.Rmd` are not available locally.\n\n::: {.callout-tip}\nThe most frequent error I see is:\n\n* running chunks individually, 1-by-1, and it works\n* Knitting, and it fails\n\nThe reason is almost always that the chunks refer to objects in the Environment that don't exist in the `.Rmd`\n:::\n\n##\n\n\n### This error also happens because:\n\n* `library()` calls were made globally but not in the `.Rmd` \n * so the packages aren't loaded\n\n* paths to data or other objects are not relative to the `.Rmd` in your file system \n * they must be\n\n\n* Carefully keeping Labs / Assignments in their current location will help to avoid some of these.\n\n\n# Debugging\n\n\n\n## How to fix code\n\n* If you're using a function in a package, start with `?function` to see the help\n * Make sure you're calling the function correctly.\n * Try running the examples.\n * Paste the error into Google (if you share the error on Slack, I often do this first)\n * Go to the package website if it exists, and browse around\n \n* If your `.Rmd` won't Knit\n * Did you make the mistake on the last slide?\n * Did it Knit before? Then the bug is in whatever you added.\n * Did you never Knit it? Why not?\n * Call `rstudioapi::restartSession()`, then run the Chunks 1-by-1\n \n##\n \nAdding `browser()`\n\n* Only useful with your own functions.\n* Open the script with the function, and add `browser()` to the code somewhere\n* Then call your function.\n* The execution will stop where you added `browser()` and you'll have access to the local environment to play around\n\n\n## Reproducible examples\n\n::: {.callout-tip}\n## Question I get on Slack that I hate:\n\n\"I ran the code like you had on Slide 39, but it didn't work.\"\n:::\n\n* If you want to ask me why the code doesn't work, you need to show me what's wrong.\n\n::: {.callout-warning}\n## Don't just paste a screenshot!\n\nUnless you get lucky, I won't be able to figure it out from that. And we'll both get frustrated.\n:::\n\nWhat you need is a Reproducible Example or `reprex`.\n\n* This is a small chunk of code that \n 1. runs in its own environment \n 1. and produces the error.\n \n#\n\nThe best way to do this is with the `{reprex}` package.\n\n\n## Reproducible examples: How it works {.smaller}\n\n1. Open a new `.R` script.\n\n1. Paste your buggy code in the file (no need to save)\n\n1. Edit your code to make sure it's \"enough to produce the error\" and nothing more. (By rerunning the code a few times.)\n\n1. Copy your code.\n\n1. Call `reprex::reprex(venue = \"r\")` from the console. This will run your code in a new environment and show the result in the Viewer tab. Does it create the error you expect?\n\n1. If it creates other errors, that may be the problem. You may fix the bug on your own!\n\n1. If it doesn't have errors, then your global environment is Farblunget.\n\n1. The Output is now on your clipboard. Go to Slack and paste it in a message. Then press `Cmd+Shift+Enter` (on Mac) or `Ctrl+Shift+Enter` (Windows/Linux). Under Type, select `R`.\n\n1. Send the message, perhaps with more description and an SOS emoji.\n\n::: {.callout-note}\nBecause Reprex runs in its own environment, it doesn't have access to any of the libraries you loaded or the stuff in your global environment. 
You'll have to load these things in the script.\n:::\n\n\n# Understanding `{tidyverse}`\n\n\n## `{tidyverse}` is huge\n\nThe full `{tidyverse}` is nearly 30 different R packages (the core is more like nine), but we're going to talk about just a few of them.\n\nThey fall roughly into a few categories:\n\n1. [Convenience functions:]{.secondary} `{magrittr}` and many, many others.\n1. [Data processing:]{.secondary} `{dplyr}` and many others.\n1. [Graphing:]{.secondary} `{ggplot2}` and some others like `{scales}`.\n1. [Utilities]{.secondary}\n\n\n. . .\n\n
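\n\nTo see the scale for yourself, here is a quick check (a sketch, assuming `{tidyverse}` is installed; the exact list depends on your version):\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(tidyverse) # attaches only the core packages\ntidyverse_packages() # names every package in the full collection\nlength(tidyverse_packages()) # roughly 30\n```\n:::\n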
\n\nWe're going to talk quickly about some of it, but ignore much of category 2.\n\nThere's a lot that's great about these packages, especially ease of data processing.\n\nBut it doesn't always jibe with base `R` (it's almost a separate programming language at this point).\n\n\n## Piping\n\nThis was introduced by `{magrittr}` as `%>%`, \n\nbut is now in base R (>=4.1.0) as `|>`.\n\nNote: there are other pipes in `{magrittr}` (e.g. `%$%` and `%T>%`) but I've never used them.\n\nI've used the old version for so long that it's hard for me to adopt the new one.\n\nThe point of the pipe is to [logically sequence nested operations]{.secondary}\n\n## Example\n\n:::: {.columns}\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmse1 <- print(\n sum(\n residuals(\n lm(y~., data = mutate(\n tib, \n x3 = x1^2,\n x4 = log(x2 + abs(min(x2)) + 1)\n )\n )\n )^2\n )\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6.469568e-29\n```\n:::\n:::\n\n\n:::\n\n\n::: {.column width=\"50%\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"|5-6\"}\nmse2 <- tib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() |>\n print()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 6.469568e-29\n```\n:::\n:::\n\n\n:::\n::::\n\n## \n\nIt may seem like we should push this all the way\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-line-numbers=\"|9-10\"}\ntib |>\n mutate(\n x3 = x1^2, \n x4 = log(x2 + abs(min(x2)) + 1)\n ) %>% # base pipe only goes to first arg\n lm(y ~ ., data = .) |> # note the use of `.`\n residuals() |>\n magrittr::raise_to_power(2) |> # same as `^`(2)\n sum() ->\n mse3\n```\n:::\n\n\nThis works, but it's really annoying.\n\n## A new one...\n\nJust last week, I learned about `%<>%`, the `{magrittr}` compound assignment pipe\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(magrittr)\ntib <- tibble(x = 1:5, z = 6:10)\ntib <- tib |> mutate(b = x + z)\ntib\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n```\n:::\n\n```{.r .cell-code}\n# start over\ntib <- tibble(x = 1:5, z = 6:10)\ntib %<>% mutate(b = x + z)\ntib\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 × 3\n x z b\n <int> <int> <int>\n1 1 6 7\n2 2 7 9\n3 3 8 11\n4 4 9 13\n5 5 10 15\n```\n:::\n:::\n\n\n\n\n## Data processing in `{dplyr}` {.smaller}\n\nThis package has all sorts of things. And it interacts with `{tibble}` generally.\n\nThe basic idea is \"tibble in, tibble out\".\n\nIt supports [data masking]{.secondary}, which means you can refer to columns by name, or use helpers like `ends_with(\"_rate\")`\n\nMajorly useful operations:\n\n1. `select()` (chooses columns to keep)\n1. `mutate()` (showed this already)\n1. `group_by()`\n1. `pivot_longer()` and `pivot_wider()` (technically from `{tidyr}`)\n1. `left_join()` and `full_join()`\n1. `summarise()`\n\n::: {.callout-note}\n`filter()` and `select()` also exist outside the tidyverse: `stats::filter()` ships with `R`, and `{MASS}` has its own `select()`. \n\nSometimes you get 🐞 because `R` picked the wrong version. 
\n\nTo be sure, prefix it like `dplyr::select()`.\n:::\n\n## A useful data frame\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(epidatr)\ncovid <- covidcast(\n source = \"jhu-csse\",\n signals = \"confirmed_7dav_incidence_prop,deaths_7dav_incidence_prop\",\n time_type = \"day\",\n geo_type = \"state\",\n time_values = epirange(20220801, 20220821),\n geo_values = \"ca,wa\") |>\n fetch() |>\n select(geo_value, time_value, signal, value)\n\ncovid\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 84 × 4\n geo_value time_value signal value\n \n 1 ca 2022-08-01 confirmed_7dav_incidence_prop 45.4\n 2 wa 2022-08-01 confirmed_7dav_incidence_prop 27.7\n 3 ca 2022-08-02 confirmed_7dav_incidence_prop 44.9\n 4 wa 2022-08-02 confirmed_7dav_incidence_prop 27.7\n 5 ca 2022-08-03 confirmed_7dav_incidence_prop 44.5\n 6 wa 2022-08-03 confirmed_7dav_incidence_prop 26.6\n 7 ca 2022-08-04 confirmed_7dav_incidence_prop 42.3\n 8 wa 2022-08-04 confirmed_7dav_incidence_prop 26.6\n 9 ca 2022-08-05 confirmed_7dav_incidence_prop 40.7\n10 wa 2022-08-05 confirmed_7dav_incidence_prop 34.6\n# ℹ 74 more rows\n```\n:::\n:::\n\n\n## Examples\n\nRename the `signal` to something short.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid <- covid |> \n mutate(signal = case_when(\n str_starts(signal, \"confirmed\") ~ \"case_rate\", \n TRUE ~ \"death_rate\"\n ))\n```\n:::\n\n\n\nSort by `time_value` then `geo_value`\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid <- covid |> arrange(time_value, geo_value)\n```\n:::\n\n\nCalculate grouped medians\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid |> \n group_by(geo_value, signal) |>\n summarise(med = median(value), .groups = \"drop\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 × 3\n geo_value signal med\n \n1 ca case_rate 33.2 \n2 ca death_rate 0.112\n3 wa case_rate 23.2 \n4 wa death_rate 0.178\n```\n:::\n:::\n\n\n## Examples\n\nSplit the data into two tibbles by signal\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncases <- covid |> \n filter(signal == \"case_rate\") |>\n rename(case_rate = value) |> select(-signal)\ndeaths <- covid |> \n filter(signal == \"death_rate\") |>\n rename(death_rate = value) |> select(-signal)\n```\n:::\n\n\nJoin them together\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\njoined <- full_join(cases, deaths, by = c(\"geo_value\", \"time_value\"))\n```\n:::\n\n\nDo the same thing by pivoting\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncovid |> pivot_wider(names_from = signal, values_from = value)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 42 × 4\n geo_value time_value case_rate death_rate\n \n 1 ca 2022-08-01 45.4 0.105\n 2 wa 2022-08-01 27.7 0.169\n 3 ca 2022-08-02 44.9 0.106\n 4 wa 2022-08-02 27.7 0.169\n 5 ca 2022-08-03 44.5 0.107\n 6 wa 2022-08-03 26.6 0.173\n 7 ca 2022-08-04 42.3 0.112\n 8 wa 2022-08-04 26.6 0.173\n 9 ca 2022-08-05 40.7 0.116\n10 wa 2022-08-05 34.6 0.225\n# ℹ 32 more rows\n```\n:::\n:::\n\n\n\n\n## Plotting with `{ggplot2}`\n\n* Everything you can do with `ggplot()`, you can do with `plot()`. 
But the defaults are _much_ prettier.\n\n* It's also much easier to adjust aesthetics and to create panels (facets) by factors.\n\n* It also uses \"data masking\": data goes into `ggplot(data = mydata)`, then the columns are available to the rest.\n\n* It (sort of) pipes, but by adding [layers]{.secondary} with `+`.\n\n* It [strongly prefers]{.secondary} \"long\" data frames over \"wide\" data frames (see the sketch below).\n\n
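Here's a minimal sketch of the long-vs-wide point (my own toy numbers, loosely echoing the covid example; assumes the `{tidyverse}` is attached):\n\n```r\n# \"wide\": one column per state -- there's no single column to map to colour\nwide <- tibble(day = 1:3, ca = c(45.4, 44.9, 44.5), wa = c(27.7, 27.7, 26.6))\n\n# \"long\": one row per (state, day) -- now colour = geo_value is one mapping\nlong <- wide |> pivot_longer(-day, names_to = \"geo_value\", values_to = \"value\")\n\nggplot(long, aes(day, value, colour = geo_value)) + geom_point()\n```\n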
\n\nI'll give a very fast overview of some confusing bits.\n\n# \n\nI suggest exploring\n\n🔗 [This slide deck](https://djnavarro.net/slides-starting-ggplot/)\n\nfor more help\n\n\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> \n filter(signal == \"case_rate\")\n) +\n geom_point(\n mapping = aes(\n x = time_value,\n y = value\n )\n ) + \n geom_smooth( \n mapping = aes( \n x = time_value, \n y = value \n ) \n ) \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n::: {.notes}\n\n* The complete code\n* Data is specified in the ggplot, passed along\n* (we show only case_rate)\n\n\n* The Grey SE shading is pretty ugly\n* And there are two states mashed together\n* That trend is awfully wiggly\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> filter(signal == \"case_rate\")\n) +\n geom_point(\n mapping = aes(\n x = time_value,\n y = value,\n colour = geo_value\n )\n ) + \n geom_smooth( \n mapping = aes( \n x = time_value, \n y = value,\n colour = geo_value\n ),\n se = FALSE,\n method = \"lm\"\n ) \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms2-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\n* Separate out the states by colour\n* straight lines instead\n* no more grey shading\n* Why do I keep writing all that mapping = stuff?\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n data = covid |> filter(signal == \"case_rate\"),\n mapping = aes(\n x = time_value,\n y = value,\n colour = geo_value\n )\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms3-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\nmapping in the `ggplot()` call is shared across the rest\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid |> filter(signal == \"case_rate\"),\n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") \n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms4-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\nDon't need to name the arguments.\n\nThis is typically what ggplot code looks like.\n\nLet's go a bit further to spruce this up.\n\n:::\n\n---\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid, \n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = \"lm\") +\n facet_grid(signal ~ geo_value) +\n scale_colour_manual(\n name = NULL, \n values = c(blue, orange)) +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms5-1.svg){fig-align='center'}\n:::\n:::\n\n\n::: {.notes}\n\n* use facet_grid to split out states / show both signals (formula)\n* change the colour scaling, remove the annoying title\n* put the legend on the bottom\n* But the y-axis scale is shared, measurements are on different scales\n\n:::\n\n---\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(\n covid, \n aes(time_value, value, colour = geo_value)\n) +\n geom_point() + \n geom_smooth(se = FALSE, method = 
\"lm\") +\n facet_grid(signal ~ geo_value, scales = \"free_y\") +\n scale_colour_manual(\n name = NULL, \n values = c(blue, orange)) +\n theme(legend.position = \"bottom\")\n```\n\n::: {.cell-output-display}\n![](00-r-review_files/figure-revealjs/adding-geoms6-1.svg){fig-align='center'}\n:::\n:::\n", "supporting": [ "00-r-review_files" ], diff --git a/_freeze/schedule/slides/02-lm-example/execute-results/html.json b/_freeze/schedule/slides/02-lm-example/execute-results/html.json index dd6944a..89034e3 100644 --- a/_freeze/schedule/slides/02-lm-example/execute-results/html.json +++ b/_freeze/schedule/slides/02-lm-example/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "71d496db5e43221b519fddff51e2899a", + "hash": "a2b9571494f1a1b32f51caa1df4d09fd", "result": { - "markdown": "---\nlecture: \"02 Linear model example\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n## Economic mobility\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"mobility\", package = \"Stat406\")\nmobility\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n \n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty , Seg_affluence , Commute ,\n# Income , Gini , Share01 , Gini_99 , Middle_class ,\n# Local_tax_rate , Local_gov_spending , Progressivity ,\n# EITC , School_spending , Student_teacher_ratio ,\n# Test_scores , HS_dropout , Colleges , Tuition ,\n# Graduation , Labor_force_participation , Manufacturing , …\n```\n:::\n:::\n\n\n::: {.notes}\n\nNote how many observations and predictors it has.\n\nWe'll use `Mobility` as the response\n\n:::\n\n\n\n## A linear model\n\n\n$$\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i$$ \n \nor equivalently\n\n$$E \\left[ \\biggl. 
\\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban}, \n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} + \n \\beta_2 \\, \\mbox{Urban} + \\cdots$$\n\n\n\n## Analysis\n\n\n- Randomly split into a training (say 3/4) and a test set (1/4)\n\n- Use training set to fit a model \n\n- Fit the \"full\" model\n\n- \"Look\" at the fit\n\n. . .\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ]\nn <- nrow(mob)\nmob <- mob |> select(-Name, -ID, -State)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n```\n:::\n\n\n::: {.notes}\n\nWhy don't we include `Name` or `ID`?\n\n:::\n\n\n## Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808,\tAdjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## Diagnostic plots\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\npar(mar = c(5, 3, 0, 0))\nplot(full, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n##\n\n(Those were `plot` methods for objects of class `lm`)\n\n\nSame thing in `ggplot`\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nstuff <- tibble(\n residuals = residuals(full), \n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point(colour = \"salmon\") +\n geom_smooth(\n se = FALSE, \n colour = \"steelblue\", \n linewidth = 2) +\n ggtitle(\"Residuals vs Fitted\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(colour = \"purple\", size = 2) +\n geom_qq_line(colour = \"peachpuff\", linewidth = 2) +\n labs(\n x = \"Theoretical quantiles\", \n y = \"Standardized residuals\",\n title = \"Normal Q-Q\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Fit a reduced model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers, \n data = train)\n\nsummary(reduced)$coefficients |> as_tibble()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n Estimate `Std. Error` `t value` `Pr(>|t|)`\n \n1 0.166 0.0178 9.36 1.83e-18\n2 0.0637 0.0149 4.27 2.62e- 5\n3 -0.109 0.0390 -2.79 5.64e- 3\n4 0.000500 0.000256 1.95 5.19e- 2\n5 -0.216 0.0820 -2.64 8.81e- 3\n6 -0.159 0.0202 -7.89 5.65e-14\n7 -0.389 0.172 -2.26 2.42e- 2\n8 0.0436 0.0105 4.16 4.08e- 5\n9 -0.286 0.0466 -6.15 2.44e- 9\n```\n:::\n\n```{.r .cell-code}\nreduced |> broom::glance() |> print(width = 120)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. 
-1429.\n deviance df.residual nobs\n \n1 0.159 304 313\n```\n:::\n:::\n\n\n\n## Diagnostic plots for reduced model\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-10-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## How do we decide which model is better?\n\n::: flex\n\n::: w-50\n\n* Goodness of fit versus prediction power\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmap( # smaller AIC is better\n list(full = full, reduced = reduced), \n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n```\n:::\n:::\n\n\n* Use both models to predict `Mobility` \n\n* Compare both sets of predictions\n\n:::\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmses <- function(preds, obs) {\n round(mean((obs - preds)^2), 5)\n}\nc(\n full = mses(\n predict(full, newdata = test), \n test$Mobility),\n reduced = mses(\n predict(reduced, newdata = test), \n test$Mobility)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n full reduced \n0.00072 0.00084 \n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |> \n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) + \n geom_point(color = \"orange\") + \n facet_wrap(~name, 2) +\n xlab('observed mobility') + \n ylab('predicted mobility') +\n geom_abline(slope = 1, intercept = 0, col = \"darkblue\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-13-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n:::\n\n\n# Next time... 
\n\n\n[Module 1]{.secondary .large}\n\n[Selecting models]{.secondary .large}\n\n\n\n", + "markdown": "---\nlecture: \"02 Linear model example\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n## Economic mobility\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ndata(\"mobility\", package = \"Stat406\")\nmobility\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 741 × 43\n ID Name Mobility State Population Urban Black Seg_racial Seg_income\n \n 1 100 Johnson Ci… 0.0622 TN 576081 1 0.021 0.09 0.035\n 2 200 Morristown 0.0537 TN 227816 1 0.02 0.093 0.026\n 3 301 Middlesbor… 0.0726 TN 66708 0 0.015 0.064 0.024\n 4 302 Knoxville 0.0563 TN 727600 1 0.056 0.21 0.092\n 5 401 Winston-Sa… 0.0448 NC 493180 1 0.174 0.262 0.072\n 6 402 Martinsvil… 0.0518 VA 92753 0 0.224 0.137 0.024\n 7 500 Greensboro 0.0474 NC 1055133 1 0.218 0.22 0.068\n 8 601 North Wilk… 0.0517 NC 90016 0 0.032 0.114 0.012\n 9 602 Galax 0.0796 VA 64676 0 0.029 0.131 0.005\n10 700 Spartanburg 0.0431 SC 354533 1 0.207 0.139 0.045\n# ℹ 731 more rows\n# ℹ 34 more variables: Seg_poverty , Seg_affluence , Commute ,\n# Income , Gini , Share01 , Gini_99 , Middle_class ,\n# Local_tax_rate , Local_gov_spending , Progressivity ,\n# EITC , School_spending , Student_teacher_ratio ,\n# Test_scores , HS_dropout , Colleges , Tuition ,\n# Graduation , Labor_force_participation , Manufacturing , …\n```\n:::\n:::\n\n\n::: {.notes}\n\nNote how many observations and predictors it has.\n\nWe'll use `Mobility` as the response\n\n:::\n\n\n\n## A linear model\n\n\n$$\\mbox{Mobility}_i = \\beta_0 + \\beta_1 \\, \\mbox{State}_i + \\beta_2 \\, \\mbox{Urban}_i + \\cdots + \\epsilon_i$$ \n \nor equivalently\n\n$$E \\left[ \\biggl. \\mbox{mobility} \\, \\biggr| \\, \\mbox{State}, \\mbox{Urban}, \n \\ldots \\right] = \\beta_0 + \\beta_1 \\, \\mbox{State} + \n \\beta_2 \\, \\mbox{Urban} + \\cdots$$\n\n\n\n## Analysis\n\n\n- Randomly split into a training (say 3/4) and a test set (1/4)\n\n- Use training set to fit a model \n\n- Fit the \"full\" model\n\n- \"Look\" at the fit\n\n. . 
.\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nset.seed(20220914)\nmob <- mobility[complete.cases(mobility), ]\nn <- nrow(mob)\nmob <- mob |> select(-Name, -ID, -State)\nset <- sample.int(n, floor(n * .75), FALSE)\ntrain <- mob[set, ]\ntest <- mob[setdiff(1:n, set), ]\nfull <- lm(Mobility ~ ., data = train)\n```\n:::\n\n\n::: {.notes}\n\nWhy don't we include `Name` or `ID`?\n\n:::\n\n\n## Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nsummary(full)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nlm(formula = Mobility ~ ., data = train)\n\nResiduals:\n Min 1Q Median 3Q Max \n-0.072092 -0.010256 -0.001452 0.009170 0.090428 \n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 1.849e-01 8.083e-02 2.288 0.022920 * \nPopulation 3.378e-09 2.478e-09 1.363 0.173916 \nUrban 2.853e-03 3.892e-03 0.733 0.464202 \nBlack 7.807e-02 2.859e-02 2.731 0.006735 ** \nSeg_racial -5.626e-02 1.780e-02 -3.160 0.001754 ** \nSeg_income 8.677e-01 9.355e-01 0.928 0.354453 \nSeg_poverty -7.416e-01 5.014e-01 -1.479 0.140316 \nSeg_affluence -2.224e-01 4.763e-01 -0.467 0.640874 \nCommute 6.313e-02 2.838e-02 2.225 0.026915 * \nIncome 4.207e-07 6.997e-07 0.601 0.548112 \nGini 3.592e+00 3.357e+00 1.070 0.285578 \nShare01 -3.635e-02 3.357e-02 -1.083 0.279925 \nGini_99 -3.657e+00 3.356e+00 -1.090 0.276704 \nMiddle_class 1.031e-01 4.835e-02 2.133 0.033828 * \nLocal_tax_rate 2.268e-01 2.620e-01 0.866 0.387487 \nLocal_gov_spending 1.273e-07 3.016e-06 0.042 0.966374 \nProgressivity 4.983e-03 1.324e-03 3.764 0.000205 ***\nEITC -3.324e-04 4.528e-04 -0.734 0.463549 \nSchool_spending -9.019e-04 2.272e-03 -0.397 0.691658 \nStudent_teacher_ratio -1.639e-03 1.123e-03 -1.459 0.145748 \nTest_scores 2.487e-04 3.137e-04 0.793 0.428519 \nHS_dropout -1.698e-01 9.352e-02 -1.816 0.070529 . \nColleges -2.811e-02 7.661e-02 -0.367 0.713942 \nTuition 3.459e-07 4.362e-07 0.793 0.428417 \nGraduation -1.702e-02 1.425e-02 -1.194 0.233650 \nLabor_force_participation -7.850e-02 5.405e-02 -1.452 0.147564 \nManufacturing -1.605e-01 2.816e-02 -5.700 3.1e-08 ***\nChinese_imports -5.165e-04 1.004e-03 -0.514 0.607378 \nTeenage_labor -1.019e+00 2.111e+00 -0.483 0.629639 \nMigration_in 4.490e-02 3.480e-01 0.129 0.897436 \nMigration_out -4.475e-01 4.093e-01 -1.093 0.275224 \nForeign_born 9.137e-02 5.494e-02 1.663 0.097454 . \nSocial_capital -1.114e-03 2.728e-03 -0.408 0.683245 \nReligious 4.570e-02 1.298e-02 3.520 0.000506 ***\nViolent_crime -3.393e+00 1.622e+00 -2.092 0.037373 * \nSingle_mothers -3.590e-01 9.442e-02 -3.802 0.000177 ***\nDivorced 1.707e-02 1.603e-01 0.107 0.915250 \nMarried -5.894e-02 7.246e-02 -0.813 0.416720 \nLongitude -4.239e-05 2.239e-04 -0.189 0.850001 \nLatitude 6.725e-04 5.687e-04 1.182 0.238037 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\nResidual standard error: 0.02128 on 273 degrees of freedom\nMultiple R-squared: 0.7808,\tAdjusted R-squared: 0.7494 \nF-statistic: 24.93 on 39 and 273 DF, p-value: < 2.2e-16\n```\n:::\n:::\n\n\n\n## Diagnostic plots\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\npar(mar = c(5, 3, 0, 0))\nplot(full, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-4-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(full, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-5-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n##\n\n(Those were `plot` methods for objects of class `lm`)\n\n\nSame thing in `ggplot`\n\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nstuff <- tibble(\n residuals = residuals(full), \n fitted = fitted(full),\n stdresiduals = rstandard(full)\n)\nggplot(stuff, aes(fitted, residuals)) +\n geom_point(colour = \"salmon\") +\n geom_smooth(\n se = FALSE, \n colour = \"steelblue\", \n linewidth = 2) +\n ggtitle(\"Residuals vs Fitted\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-6-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nggplot(stuff, aes(sample = stdresiduals)) +\n geom_qq(colour = \"purple\", size = 2) +\n geom_qq_line(colour = \"peachpuff\", linewidth = 2) +\n labs(\n x = \"Theoretical quantiles\", \n y = \"Standardized residuals\",\n title = \"Normal Q-Q\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-7-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Fit a reduced model\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nreduced <- lm(\n Mobility ~ Commute + Gini_99 + Test_scores + HS_dropout +\n Manufacturing + Migration_in + Religious + Single_mothers, \n data = train)\n\nsummary(reduced)$coefficients |> as_tibble()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 9 × 4\n Estimate `Std. Error` `t value` `Pr(>|t|)`\n \n1 0.166 0.0178 9.36 1.83e-18\n2 0.0637 0.0149 4.27 2.62e- 5\n3 -0.109 0.0390 -2.79 5.64e- 3\n4 0.000500 0.000256 1.95 5.19e- 2\n5 -0.216 0.0820 -2.64 8.81e- 3\n6 -0.159 0.0202 -7.89 5.65e-14\n7 -0.389 0.172 -2.26 2.42e- 2\n8 0.0436 0.0105 4.16 4.08e- 5\n9 -0.286 0.0466 -6.15 2.44e- 9\n```\n:::\n\n```{.r .cell-code}\nreduced |> broom::glance() |> print(width = 120)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 × 12\n r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC\n \n1 0.718 0.711 0.0229 96.9 5.46e-79 8 743. -1466. 
-1429.\n deviance df.residual nobs\n \n1 0.159 304 313\n```\n:::\n:::\n\n\n\n## Diagnostic plots for reduced model\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 1)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-9-1.svg){fig-align='center'}\n:::\n:::\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\nplot(reduced, 2)\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-10-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n\n## How do we decide which model is better?\n\n::: flex\n\n::: w-50\n\n* Goodness of fit versus prediction power\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmap( # smaller AIC is better\n list(full = full, reduced = reduced), \n ~ c(aic = AIC(.x), rsq = summary(.x)$r.sq))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$full\n aic rsq \n-1482.5981023 0.7807509 \n\n$reduced\n aic rsq \n-1466.088492 0.718245 \n```\n:::\n:::\n\n\n* Use both models to predict `Mobility` \n\n* Compare both sets of predictions\n\n:::\n\n::: w-50\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nmses <- function(preds, obs) {\n round(mean((obs - preds)^2), 5)\n}\nc(\n full = mses(\n predict(full, newdata = test), \n test$Mobility),\n reduced = mses(\n predict(reduced, newdata = test), \n test$Mobility)\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n full reduced \n0.00072 0.00084 \n```\n:::\n:::\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code code-fold=\"true\"}\ntest$full <- predict(full, newdata = test)\ntest$reduced <- predict(reduced, newdata = test)\ntest |> \n select(Mobility, full, reduced) |>\n pivot_longer(-Mobility) |>\n ggplot(aes(Mobility, value)) + \n geom_point(color = \"orange\") + \n facet_wrap(~name, 2) +\n xlab('observed mobility') + \n ylab('predicted mobility') +\n geom_abline(slope = 1, intercept = 0, col = \"darkblue\")\n```\n\n::: {.cell-output-display}\n![](02-lm-example_files/figure-revealjs/unnamed-chunk-13-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n:::\n\n\n# Next time... 
\n\n\n[Module 1]{.secondary .large}\n\nSelecting models\n\n\n\n", "supporting": [ "02-lm-example_files" ], diff --git a/_freeze/schedule/slides/06-information-criteria/execute-results/html.json b/_freeze/schedule/slides/06-information-criteria/execute-results/html.json index a8561ad..ffcf729 100644 --- a/_freeze/schedule/slides/06-information-criteria/execute-results/html.json +++ b/_freeze/schedule/slides/06-information-criteria/execute-results/html.json @@ -1,8 +1,10 @@ { - "hash": "78c53cf00c38813aded6a13a9dc30abf", + "hash": "1d0c2ea24c3b17c70eef1c5f998b6716", "result": { - "markdown": "---\nlecture: \"06 Information criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 05 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n## Generalized CV\n\nLast time we saw a nice trick, that works some of the time (OLS, Ridge regression,...)\n\n$$\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.$$\n\n1. $\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}$ for some matrix $\\mathbf{H}$.\n2. A technical thing.\n\n$$\\newcommand{\\H}{\\mathbf{H}}$$\n\n. . .\n\n### This is another nice trick.\n\nIdea: replace $h_{ii}$ with $\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})$\n\nLet's call $\\textrm{tr}(\\mathbf{H})$ the _degrees-of-freedom_ (or just _df_)\n\n$$\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\n\n[Where does this stuff come from?]{.hand}\n\n\n## What are `hatvalues`?\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\nIn OLS, $\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y$\n\nWe often call $\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top$ the Hat matrix, because it [puts the hat]{.hand} on $\\y$\n\nGCV uses $\\textrm{tr}(\\mathbf{H})$. For `lm()`, this is just `p`, the number of predictors (Why?)\n\nThis is one way of understanding the name _effective degrees-of-freedom_\n\n\n\n## Alternative interpretation:\n\n\nSuppose, $Y_i$ comes (independently) from some distribution and has mean $\\mu_i$ and variance $\\sigma^2$\n\n(remember: in the linear model $\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i$ )\n\nLet $\\widehat{Y}$ be an estimator of $\\mu$ (all $i=1,\\ldots,n$ elements of the vector).\n\n. . 
.\n\n\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots =\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\n\n\n## Alternative interpretation:\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}$$\n\nNow, if $\\widehat{Y} = \\H Y$ for some matrix $\\H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{Y^\\top \\H Y} = \\sigma^2 \\textrm{tr}(\\H)$\n\n\nThis gives _Mallow's $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$MSE + 2\\hat{\\sigma}^2\\textrm{df}/n$\n\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + 2\\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors, \n\n\\begin{aligned}\n\\textrm{AIC} &= -n + 2n\\log(2\\pi) - 2 + 2\\log(n) - 2\\log(RSS) + 2(p+2) \\\\\n&\\propto -2\\log(RSS) + 2(p+2)\n\\end{aligned}\n\n( $p+2$ because of the intercept and the unknown variance)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a [low-bias / high-variance]{.hand} situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a [high-bias / low-variance]{.hand} situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Recommendations\n\n::: secondary\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. \n\nCV is usually easiest to make sense of and doesn't depend on other unknown parameters.\n\nBut, it requires refitting the model.\n\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.\n:::\n\n. . .\n\nHigh-level intuition of these:\n\n* GCV tends to choose \"dense\" models.\n\n* Theory says AIC chooses the \"best predictor\" asymptotically.\n\n* Theory says BIC should choose the \"true model\" asymptotically, tends to select fewer predictors.\n\n* In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n[Use CV]{.hand-blue}\n\n\n## A few more caveats\n\nIt is often tempting to \"just compare\" risk estimates from vastly different models. 
\n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n**This is not always justified.** \n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. Residuals / response variables on different scales aren't directly comparable.\n\n\"Validation set\" is easy, because you're always comparing to the \"right\" thing. But it has lots of drawbacks.\n\n\n\n# Next time ...\n\nGreedy selection\n", - "supporting": [], + "markdown": "---\nlecture: \"06 Information criteria\"\nformat: revealjs\nmetadata-files: \n - _metadata.yml\n---\n---\n---\n\n## {{< meta lecture >}} {.large background-image=\"gfx/smooths.svg\" background-opacity=\"0.3\"}\n\n[Stat 406]{.secondary}\n\n[{{< meta author >}}]{.secondary}\n\nLast modified -- 06 September 2023\n\n\n\n$$\n\\DeclareMathOperator*{\\argmin}{argmin}\n\\DeclareMathOperator*{\\argmax}{argmax}\n\\DeclareMathOperator*{\\minimize}{minimize}\n\\DeclareMathOperator*{\\maximize}{maximize}\n\\DeclareMathOperator*{\\find}{find}\n\\DeclareMathOperator{\\st}{subject\\,\\,to}\n\\newcommand{\\E}{E}\n\\newcommand{\\Expect}[1]{\\E\\left[ #1 \\right]}\n\\newcommand{\\Var}[1]{\\mathrm{Var}\\left[ #1 \\right]}\n\\newcommand{\\Cov}[2]{\\mathrm{Cov}\\left[#1,\\ #2\\right]}\n\\newcommand{\\given}{\\ \\vert\\ }\n\\newcommand{\\X}{\\mathbf{X}}\n\\newcommand{\\x}{\\mathbf{x}}\n\\newcommand{\\y}{\\mathbf{y}}\n\\newcommand{\\P}{\\mathcal{P}}\n\\newcommand{\\R}{\\mathbb{R}}\n$$\n\n\n\n\n\n## Generalized CV\n\nLast time we saw a nice trick that works some of the time (OLS, Ridge regression,...)\n\n$$\\mbox{LOO-CV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{(y_i -\\widehat{y}_i)^2}{(1-h_{ii})^2} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-h_{ii})^2}.$$\n\n1. $\\widehat{\\y} = \\widehat{f}(\\mathbf{X}) = \\mathbf{H}\\mathbf{y}$ for some matrix $\\mathbf{H}$.\n2. A technical thing.\n\n$$\\newcommand{\\H}{\\mathbf{H}}$$\n\n\n## This is another nice trick.\n\nIdea: replace $h_{ii}$ with $\\frac{1}{n}\\sum_{i=1}^n h_{ii} = \\frac{1}{n}\\textrm{tr}(\\mathbf{H})$\n\nLet's call $\\textrm{tr}(\\mathbf{H})$ the _degrees-of-freedom_ (or just _df_)\n\n$$\\textrm{GCV} = \\frac{1}{n} \\sum_{i=1}^n \\frac{\\widehat{e}_i^2}{(1-\\textrm{df}/n)^2} = \\frac{\\textrm{MSE}}{(1-\\textrm{df}/n)^2}$$\n\n\n[Where does this stuff come from?]{.hand}\n\n\n## What are `hatvalues`?\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ncv_nice <- function(mdl) mean((residuals(mdl) / (1 - hatvalues(mdl)))^2)\n```\n:::\n\n\nIn OLS, $\\widehat{\\y} = \\X\\widehat{\\beta} = \\X(\\X^\\top \\X)^{-1}\\X^\\top \\y$\n\nWe often call $\\mathbf{H} = \\X(\\X^\\top \\X)^{-1}\\X^\\top$ the Hat matrix, because it [puts the hat]{.hand} on $\\y$\n\nGCV uses $\\textrm{tr}(\\mathbf{H})$. For `lm()`, this is just `p`, the number of predictors (Why?)\n\nThis is one way of understanding the name _effective degrees-of-freedom_\n\n\n\n## Alternative interpretation:\n\n\nSuppose $Y_i$ is drawn independently from some distribution with mean $\\mu_i$ and variance $\\sigma^2$\n\n(remember: in the linear model $\\Expect{Y_i} = x_i^\\top \\beta = \\mu_i$ )\n\nLet $\\widehat{Y}$ be an estimator of $\\mu$ (all $i=1,\\ldots,n$ elements of the vector).\n\n. . 
.\n\n\\begin{aligned}\n& \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} \\\\\n&= \\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-Y_i + Y_i -\\mu_i)^2}\\\\\n&= \\frac{1}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)^2} + \\frac{1}{n}\\Expect{\\sum (Y_i-\\mu_i)^2} + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)}\\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} + \\sigma^2 + \\frac{2}{n}\\Expect{\\sum (\\widehat Y_i-Y_i)(Y_i-\\mu_i)} = \\cdots \\\\\n&= \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}\n\\end{aligned}\n\n\n## Alternative interpretation:\n\n$$\\Expect{\\frac{1}{n}\\sum (\\widehat Y_i-\\mu_i)^2} = \\frac{1}{n}\\sum \\Expect{(\\widehat Y_i-Y_i)^2} - \\sigma^2 + \\frac{2}{n}\\sum\\Cov{Y_i}{\\widehat Y_i}$$\n\nNow, if $\\widehat{Y} = \\H Y$ for some matrix $\\H$,\n\n$\\sum\\Cov{Y_i}{\\widehat Y_i} = \\Expect{Y^\\top \\H Y} = \\sigma^2 \\textrm{tr}(\\H)$\n\n\nThis gives _Mallows' $C_p$_ aka _Stein's Unbiased Risk Estimator_:\n\n$MSE + 2\\hat{\\sigma}^2\\textrm{df}/n$\n\n\n\n## AIC and BIC\n\nThese have a very similar flavor to $C_p$, but their genesis is different.\n\nWithout going into too much detail, they look like\n\n$\\textrm{AIC}/n = -2\\textrm{loglikelihood}/n + 2\\textrm{df}/n$\n\n$\\textrm{BIC}/n = -2\\textrm{loglikelihood}/n + \\log(n)\\textrm{df}/n$\n\n. . .\n\nIn the case of a linear model with Gaussian errors, \n\n\\begin{aligned}\n\\textrm{AIC} &= n\\log(2\\pi) + n\\log(RSS/n) + n + 2(p+2) \\\\\n&\\propto n\\log(RSS) + 2(p+2)\n\\end{aligned}\n\n( $p+2$ because of the intercept and the unknown variance)\n\n. . .\n\n::: callout-important\nUnfortunately, different books/software/notes define these differently. Even different R packages. This is __super annoying__. \n\nForms above are in [ESL] eq. (7.29) and (7.35). [ISLR] gives special cases in Section 6.1.3. Remember the generic form here.\n:::\n\n\n\n## Over-fitting vs. Under-fitting\n\n\n> Over-fitting means estimating a really complicated function when you don't have enough data.\n\n\nThis is likely a [low-bias / high-variance]{.hand} situation.\n\n\n> Under-fitting means estimating a really simple function when you have lots of data. \n\n\nThis is likely a [high-bias / low-variance]{.hand} situation.\n\nBoth of these outcomes are bad (they have high risk $=$ big $R_n$ ).\n\nThe best way to avoid them is to use a reasonable estimate of _prediction risk_ to choose how complicated your model should be.\n\n\n## Recommendations\n\n::: secondary\nWhen comparing models, choose one criterion: CV / AIC / BIC / Cp / GCV. \n\nCV is usually easiest to make sense of and doesn't depend on other unknown parameters.\n\nBut it requires refitting the model.\n\nAlso, it can be strange in cases with discrete predictors, time series, repeated measurements, graph structures, etc.\n:::\n\n. . .\n\nHigh-level intuition of these:\n\n* GCV tends to choose \"dense\" models.\n\n* Theory says AIC chooses the \"best predictor\" asymptotically.\n\n* Theory says BIC should choose the \"true model\" asymptotically, and tends to select fewer predictors.\n\n* In some special cases, AIC = Cp = SURE $\\approx$ LOO-CV\n\n\n::: aside\nFor more information: see [ESL] Chapter 7.\nThis material is more challenging than the level of this course, and is easily and often misunderstood.\n:::\n\n\n\n# My recommendation: \n\n[Use CV]{.hand-blue}\n\n\n## A few more caveats\n\nIt is often tempting to \"just compare\" risk estimates from vastly different models. 
\n\nFor example, \n\n* different transformations of the predictors, \n\n* different transformations of the response, \n\n* Poisson likelihood vs. Gaussian likelihood in `glm()`\n\n\n[This is not always justified.]{.secondary}\n\n1. The \"high-level intuition\" is for \"nested\" models.\n\n1. Different likelihoods aren't comparable.\n\n1. Residuals / response variables on different scales aren't directly comparable.\n\n\"Validation set\" is easy, because you're always comparing to the \"right\" thing. But it has lots of drawbacks.\n\n\n\n# Next time ...\n\nGreedy selection\n", + "supporting": [ + "06-information-criteria_files" + ], "filters": [ "rmarkdown/pagebreak.lua" ],