diff --git a/_freeze/archive/CaseStudy01/execute-results/html.json b/_freeze/archive/CaseStudy01/execute-results/html.json new file mode 100644 index 0000000..df84273 --- /dev/null +++ b/_freeze/archive/CaseStudy01/execute-results/html.json @@ -0,0 +1,18 @@ +{ + "hash": "c061862e27098cb02ccfc71ba1c8ea30", + "result": { + "markdown": "---\ntitle: \"Algorithmic Thinking Case Study 1\"\nsubtitle: \"SISMID 2024 -- Introduction to R\"\nformat:\n revealjs:\n toc: false\nexecute: \n echo: false\n---\n\n\n## Learning goals\n\n* Use logical operators, subsetting functions, and math calculations in R\n* Translate human-understandable problem descriptions into instructions that\nR can understand.\n\n# Remember, R always does EXACTLY what you tell it to do!\n\n## Instructions\n\n* Make a new R script for this case study, and save it to your code folder.\n* We'll use the diphtheria serosample data from Exercise 1 for this case study.\nLoad it into R and use the functions we've learned to look at it.\n\n## Instructions\n\n* Make a new R script for this case study, and save it to your code folder.\n* We'll use the diphtheria serosample data from Exercise 1 for this case study.\nLoad it into R and use the functions we've learned to look at it.\n* The `str()` of your dataset should look like this.\n\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\ntibble [250 × 5] (S3: tbl_df/tbl/data.frame)\n $ age_months : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...\n $ group : chr [1:250] \"urban\" \"rural\" \"urban\" \"urban\" ...\n $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...\n $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...\n $ DP_vacc : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...\n```\n:::\n:::\n\n\n## Q1: Was the overall prevalence higher in urban or rural areas?\n\n::: {.incremental}\n\n1. How do we calculate the prevalence from the data?\n1. How do we calculate the prevalence separately for urban and rural areas?\n1. 
How do we determine which prevalence is higher and if the difference is\nmeaningful?\n\n:::\n\n## Q1: How do we calculate the prevalence from the data?\n\n::: {.incremental}\n\n* The variable `DP_infection` in our dataset is binary / dichotomous.\n* The prevalence is the number or percent of people who had the disease over\nsome duration.\n* The average of a binary variable gives the prevalence!\n\n:::\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(diph$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.8\n```\n:::\n:::\n\n\n## Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmean(diph[diph$group == \"urban\", ]$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.8235294\n```\n:::\n\n```{.r .cell-code}\nmean(diph[diph$group == \"rural\", ]$DP_infection)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.778626\n```\n:::\n:::\n\n\n. . .\n\n* There are many ways you could write this code! 
You can use `subset()` or you\ncan write the indices many ways.\n* Using `tbl_df` objects from `haven` uses different `[[` rules than a base R\ndata frame.\n\n## Q1: How do we calculate the prevalence separately for urban and rural areas?\n\n* One easy way is to use the `aggregate()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\naggregate(DP_infection ~ group, data = diph, FUN = mean)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n group DP_infection\n1 rural 0.7786260\n2 urban 0.8235294\n```\n:::\n:::\n\n\n## Q1: How do we determine which prevalence is higher and if the difference is meaningful?\n\n::: {.incremental}\n\n* We probably need to include a confidence interval in our calculation.\n* This is actually not so easy without more advanced tools that we will learn\nin upcoming modules.\n* Right now the best options are to do it by hand or google a function.\n\n:::\n\n## Q1: By hand\n\n\n::: {.cell}\n\n```{.r .cell-code}\np_urban <- mean(diph[diph$group == \"urban\", ]$DP_infection)\np_rural <- mean(diph[diph$group == \"rural\", ]$DP_infection)\nse_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == \"urban\", ]))\nse_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == \"rural\", ])) \n\nresult_urban <- paste0(\n\t\"Urban: \", round(p_urban, 2), \"; 95% CI: (\",\n\tround(p_urban - 1.96 * se_urban, 2), \", \",\n\tround(p_urban + 1.96 * se_urban, 2), \")\"\n)\n\nresult_rural <- paste0(\n\t\"Rural: \", round(p_rural, 2), \"; 95% CI: (\",\n\tround(p_rural - 1.96 * se_rural, 2), \", \",\n\tround(p_rural + 1.96 * se_rural, 2), \")\"\n)\n\ncat(result_urban, result_rural, sep = \"\\n\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nUrban: 0.82; 95% CI: (0.76, 0.89)\nRural: 0.78; 95% CI: (0.71, 0.85)\n```\n:::\n:::\n\n\n## Q1: By hand\n\n* We can see that the 95% CI's overlap, so the groups are probably not that\ndifferent. **To be sure, we need to do a 2-sample test! 
But this is not a\nstatistics class.**\n* Some people will tell you that coding like this is \"bad\". **But 'bad' code\nthat gives you answers is better than broken code!** We will learn techniques for writing this with less work and less repetition\nin upcoming modules.\n\n## Q1: Googling a package\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# install.packages(\"DescTools\")\nlibrary(DescTools)\n\naggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 rural 0.7786260 0.7065872 0.8506647\n2 urban 0.8235294 0.7540334 0.8930254\n```\n:::\n:::\n\n\n## You try it!\n\n* Using any of the approaches you can think of, answer this question!\n* **How many children under 5 were vaccinated? In children under 5, did\nvaccination lower the prevalence of infection?**\n\n## You try it!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# How many children under 5 were vaccinated\nsum(diph$DP_vacc[diph$age_months < 60])\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 91\n```\n:::\n\n```{.r .cell-code}\n# Prevalence in both vaccine groups for children under 5\naggregate(\n\tDP_infection ~ DP_vacc,\n\tdata = subset(diph, age_months < 60),\n\tFUN = DescTools::MeanCI\n)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci\n1 0 0.4285714 0.1977457 0.6593972\n2 1 0.6373626 0.5366845 0.7380407\n```\n:::\n:::\n\n\nIt appears that prevalence was HIGHER in the vaccine group? 
That is\ncounterintuitive, but the sample size for the unvaccinated group is too small\nto be sure.\n\n## Congratulations for finishing the first case study!\n\n* What R functions and skills did you practice?\n* What other questions could you answer about the same dataset with the skills\nyou know now?\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module00-Welcome/execute-results/html.json b/_freeze/modules/Module00-Welcome/execute-results/html.json index 01be4ab..8a8349c 100644 --- a/_freeze/modules/Module00-Welcome/execute-results/html.json +++ b/_freeze/modules/Module00-Welcome/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "c70ee3c3328bbebb542de6ef3986aef7", + "hash": "8bfd8f2bc8586d363e99a6e7f763f712", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Welcome to SISMID Workshop: Introduction to R\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Welcome to SISMID Workshop: Introduction to R!\n\n**Amy Winter (she/her)** \nAssistant Professor, Department of Epidemiology and Biostatistics\nEmail: awinter@uga.edu\n\n**Zane Billings (he/him)** \nPhD Candidate, Department of Epidemiology and Biostatistics\nEmail: Wesley.Billings@uga.edu\n\n\n## Introductions\n\n* Name?\n* Current position / institution?\n* Past experience with other statistical programs, including R?\n* Why do you want to learn R?\n* Favorite useful app\n* Favorite guilty pleasure app\n\n\n## What is R?\n\n- R is a language and environment for statistical computing and graphics developed in 1991\n\n- R is the open source implementation of the [S language](https://en.wikipedia.org/wiki/S_(programming_language)), which was developed by [Bell laboratories](https://ca.slack-edge.com/T023TPZA8LF-U024EN26Q0L-113294823b2c-512) in the 70s.\n\n- The 
aim of the S language, as expressed by John Chambers, is \"to turn ideas into software, quickly and faithfully\"\n\n## What is R?\n\n- **R**oss Ihaka and **R**obert Gentleman at the University of Auckland, New Zealand developed R\n\n\n- R is both [open source](https://en.wikipedia.org/wiki/Open_source) and [open development](https://en.wikipedia.org/wiki/Open-source_software_development)\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://www.r-project.org/logo/Rlogo.png){fig-align='center' fig-alt='R logo' width=20%}\n:::\n:::\n\n\n\n## What is R?\n\n* R possesses an extensive catalog of statistical and graphical methods \n * includes machine learning algorithm, linear regression, time series, statistical inference to name a few. \n\n* Data analysis with R is done in a series of steps; programming, transforming, discovering, modeling and communicate the results\n\n\n## What is R?\n\n- Program: R is a clear and accessible programming tool\n- Transform: R is made up of a collection of libraries designed specifically for data science\n- Discover: Investigate the data, refine your hypothesis and analyze them\n- Model: R provides a wide array of tools to capture the right model for your data\n- Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world\n\n\n## Why R?\n\n* Free (open source)\n\n* High level language designed for statistical computing\n\n* Powerful and flexible - especially for data wrangling and visualization\n\n* Extensive add-on software (packages)\n\n* Strong community \n\n\n## Why not R?\n\n \n* Little centralized support, relies on online community and package developers\n\n* Annoying to update\n\n* Slower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)\n\n\n## Is R DIfficult?\n\n* Short answer – It has a steep learning curve. \n* Years ago, R was a difficult language to master. 
The language was confusing and not as structured as the other programming tools. \n* Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.\n\n\n\n## Overall Workshop Objectives\n\nBy the end of this workshop, you should be able to \n\n1. start a new project, read in data, and conduct basic data manipulation, analysis, and visualization\n2. know how to use and find packages/functions that we did not specifically learn in class\n3. troubleshoot errors (xxzane? -- not included right now)\n\n\n## This workshop differs from \"Introduction to Tidyverse\"\n\nWe will focus this class on using **Base R** functions and packages, i.e., pre-installed into R and the basis for most other functions and packages! If you know Base R then are will be more equipped to use all the other useful/pretty packages that exit.\n\nthe Tidyverse is one set of useful/pretty packages, designed to can make your code more **intuitive** as compared to the original older Base R. 
**Tidyverse advantages**: \n\n-\t**consistent structure** - making it easier to learn how to use different packages\n-\tparticularly good for **wrangling** (manipulating, cleaning, joining) data \n-\tmore flexible for **visualizing** data \n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://tidyverse.tidyverse.org/logo.png){fig-align='center' fig-alt='Tidyverse hex sticker' width=10%}\n:::\n:::\n\n\n\n\n## Workshop Overview\n\n14 lecture blocks that will each:\n- Start with learning objectives\n- End with summary slides\n- Include mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n- Reproducibility\n- Good coding techniques\n- Thinking algorithmically\n- [Basic terms / R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)\n\n\n## Reproducibility\n\nxxzane slides\n\n## Good coding techniques\n\n\n## Thinking algorithmically \n\n\n## Useful (+ Free) Resources\n\n**Want more?** \n\n- R for Data Science: http://r4ds.had.co.nz/ \n(great general information)\n\n- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/ \n\n- R for Epidemiology: https://www.r4epi.com/\n\n- The Epidemiologist R Handbook: https://epirhandbook.com/en/\n\n- R basics by Rafael A. 
Irizarry: https://rafalab.github.io/dsbook/r-basics.html\n(great general information)\n \n- Open Case Studies: https://www.opencasestudies.org/ \n(resource for specific public health cases with statistical implementation and interpretation)\n\n## Useful (+Free) Resources\n\n**Need help?** \n\n- Various \"Cheat Sheets\": https://github.com/rstudio/cheatsheets/\n\n- R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf \n\n- R jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n## Installing R\n\n\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. Please open up your RStudio and leave it open as we check everyone's laptops.\n\n- Install the latest version from: [http://cran.r-project.org/](http://cran.r-project.org/ )\n- [Install RStudio](https://www.rstudio.com/products/rstudio/download/)\n\n\n", + "markdown": "---\ntitle: \"Welcome to SISMID Workshop: Introduction to R\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Welcome to SISMID Workshop: Introduction to R!\n\n**Amy Winter (she/her)** \n\nAssistant Professor, Department of Epidemiology and Biostatistics\n\nEmail: awinter@uga.edu\n\n
\n\n**Zane Billings (he/him)** \n\nPhD Candidate, Department of Epidemiology and Biostatistics\n\nEmail: Wesley.Billings@uga.edu\n\n\n## Introductions\n\n* Name?\n* Current position / institution?\n* Past experience with other statistical programs, including R?\n* Why do you want to learn R?\n* Favorite useful app\n* Favorite guilty pleasure app\n\n\n## What is R?\n\n- R is a language and environment for statistical computing and graphics developed in 1991\n\n- R is the open source implementation of the [S language](https://en.wikipedia.org/wiki/S_(programming_language)), which was developed by [Bell laboratories](https://ca.slack-edge.com/T023TPZA8LF-U024EN26Q0L-113294823b2c-512) in the 70s.\n\n- The aim of the S language, as expressed by John Chambers, is \"to turn ideas into software, quickly and faithfully\"\n\n## What is R?\n\n- **R**oss Ihaka and **R**obert Gentleman at the University of Auckland, New Zealand developed R\n\n\n- R is both [open source](https://en.wikipedia.org/wiki/Open_source) and [open development](https://en.wikipedia.org/wiki/Open-source_software_development)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://www.r-project.org/logo/Rlogo.png){fig-align='center' fig-alt='R logo' width=20%}\n:::\n:::\n\n\n## What is R?\n\n* R possesses an extensive catalog of statistical and graphical methods \n * includes machine learning algorithm, linear regression, time series, statistical inference to name a few. 
\n\n* Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling, and communicating the results\n\n\n## What is R?\n\n- Program: R is a clear and accessible programming tool\n- Transform: R is made up of a collection of packages/libraries designed specifically for statistical computing\n- Discover: Investigate the data, refine your hypotheses, and analyze them\n- Model: R provides a wide array of tools to capture the right model for your data\n- Communicate: Integrate code, graphs, and output into a report with R Markdown or build Shiny apps to share with the world\n\n\n## Why R?\n\n* Free (open source)\n\n* High level language designed for statistical computing\n\n* Powerful and flexible - especially for data wrangling and visualization\n\n* Extensive add-on software (packages)\n\n* Strong community \n\n\n## Why not R?\n\n \n* Little centralized support, relies on online community and package developers\n\n* Annoying to update\n\n* Slower, and more memory intensive, than the more traditional programming languages (C, Perl, Python)\n\n\n## Is R Difficult?\n\n* Short answer – It has a steep learning curve, like all programming languages\n* Years ago, R was a difficult language to master. \n* Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.\n\n\n## Overall Workshop Objectives\n\nBy the end of this workshop, you should be able to \n\n1. start a new project, read in data, and conduct basic data manipulation, analysis, and visualization\n2. know how to use and find packages/functions that we did not specifically learn in class\n3. troubleshoot errors\n\n\n## This workshop differs from \"Introduction to Tidyverse\"\n\nWe will focus this class on using **Base R** functions and packages, i.e., those pre-installed with R and the basis for most other functions and packages! 
If you know Base R, then you will be better equipped to use all the other useful/pretty packages that exist.\n\nThe Tidyverse is one set of useful/pretty packages, designed to make your code more **intuitive** compared to the original, older Base R. **Tidyverse advantages**: \n\n-\t**consistent structure** - making it easier to learn how to use different packages\n-\tparticularly good for **wrangling** (manipulating, cleaning, joining) data \n-\tmore flexible for **visualizing** data \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://tidyverse.tidyverse.org/logo.png){fig-align='center' fig-alt='Tidyverse hex sticker' width=10%}\n:::\n:::\n\n\n\n## Workshop Overview\n\n14 lecture blocks that will each:\n\n- Start with learning objectives\n- End with summary slides\n- Include mini-exercise(s) or a full exercise\n\nThemes that will show up throughout the workshop:\n\n- Reproducibility\n- Good coding techniques\n- Thinking algorithmically\n- [Basic terms / R jargon](https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf)\n\n\n## Reproducibility\n\nxxzane slides\n\n\n## Useful (+ Free) Resources\n\n**Want more?** \n\n- R for Data Science: http://r4ds.had.co.nz/ \n(great general information)\n\n- Fundamentals of Data Visualization: https://clauswilke.com/dataviz/ \n\n- R for Epidemiology: https://www.r4epi.com/\n\n- The Epidemiologist R Handbook: https://epirhandbook.com/en/\n\n- R basics by Rafael A. 
Irizarry: https://rafalab.github.io/dsbook/r-basics.html\n(great general information)\n \n- Open Case Studies: https://www.opencasestudies.org/ \n(resource for specific public health cases with statistical implementation and interpretation)\n\n## Useful (+Free) Resources\n\n**Need help?** \n\n- Various \"Cheat Sheets\": https://github.com/rstudio/cheatsheets/\n\n- R reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf \n\n- R jargon: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R vs Stata: https://link.springer.com/content/pdf/bbm%3A978-1-4419-1318-0%2F1.pdf \n\n- R terminology: https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf\n\n\n## Installing R\n\n\nHopefully everyone has pre-installed R and RStudio. We will take a moment to go around and make sure everyone is ready to go. Please open up your RStudio and leave it open as we check everyone's laptops.\n\n- Install the latest version from: [http://cran.r-project.org/](http://cran.r-project.org/ )\n- [Install RStudio](https://www.rstudio.com/products/rstudio/download/)\n\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module01-Intro/execute-results/html.json b/_freeze/modules/Module01-Intro/execute-results/html.json index f5f7304..0484237 100644 --- a/_freeze/modules/Module01-Intro/execute-results/html.json +++ b/_freeze/modules/Module01-Intro/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "f445c448019a47d959fea49d68987f67", + "hash": "f7be0bcf0c004397e5a35535a3dd9a72", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 1: Introduction to RStudio and R Basics\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n\n## Learning Objectives\n\nAfter module 1, you should be able to...\n\n- Create and save an R script\n- Describe the utility and differences b/w the console and an R script\n- Modify R Studio windows\n- Create objects\n- Describe the difference b/w character, 
numeric, list, and matrix objects\n- Reference objects in the RStudio Global Environment\n- Use basic arithmetic operators in R\n- Use comments within an R script to create header, sections, and make notes\n\n## Working with R -- RStudio\n\nRStudio is an Integrated Development Environment (IDE) for R\n\n- It helps the user effectively use R\n- Makes things easier\n- Is NOT a dropdown statistical tool (such as Stata)\n - See [Rcmdr](https://cran.r-project.org/web/packages/Rcmdr/index.html) or [Radiant](http://vnijs.github.io/radiant/)\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://d33wubrfki0l68.cloudfront.net/62bcc8535a06077094ca3c29c383e37ad7334311/a263f/assets/img/logo.svg){fig-align='center' fig-alt='RStudio logo' width=30%}\n:::\n:::\n\n\n\n\n## RStudio\n\nEasier working with R\n\n- Syntax highlighting, code completion, and smart indentation\n- Easily manage multiple working directories and projects\n\nMore information\n\n- Workspace browser and data viewer\n- Plot history, zooming, and flexible image and file export\n- Integrated R help and documentation\n- Searchable command history\n\n## RStudio\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://ayeimanolr.files.wordpress.com/2013/04/r-rstudio-1-1.png?w=640&h=382){fig-align='center' fig-alt='RStudio' width=80%}\n:::\n:::\n\n\n\n\n## Getting the editor\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/both.png){width=90%}\n:::\n:::\n\n\n\n\n## Working with R in RStudio - 2 major panes:\n\n1) The **Source/Editor**: \"Analysis\" Script + Interactive Exploration\n - Static copy of what you did (reproducibility)\n - Top by default\n2) The **R Console**: \"interprets\" whatever you type\n - Calculator\n - Try things out interactively, then add to your editor\n - Bottom by default\n\n## Source / Editor\n\n- Where files open to\n- Have R code and comments in them\n- Where code is saved\n\n\n\n\n::: {.cell}\n::: 
{.cell-output-display}\n![](images/rstudio_script.png){width=200%}\n:::\n:::\n\n\n\n\n## R Console\n\n- Where code is executed (where things happen)\n- You can type here for things interactively\n- Code is **not saved**\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rstudio_console.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\n\n\n## RStudio\n\nUseful RStudio \"cheat sheet\": \n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rstudio_sheet.png){fig-align='center' fig-alt='RStudio' width=65%}\n:::\n:::\n\n\n\n\n\n## RStudio Layout\n\nIf RStudio doesn't look the way you want (or like our RStudio), then do:\n\nRStudio --\\> View --\\> Panes --\\> Pane Layout\n\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/pane_layout.png){fig-align='center' width=500px}\n:::\n:::\n\n\n\n\n## Workspace/Environment\n\n- Tells you what **objects** are in R\n- What exists in memory/what is loaded?/what did I read in?\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/rstudio_environment.png){width=90%}\n:::\n:::\n\n\n\n\n## Workspace/History\n\n- Shows previous commands. 
Good to look at for debugging, but **don't rely** on it.\n- Also type the \"up\" key in the Console to scroll through previous commands\n\n## Workspace/Other Panes\n\n- **Files** - shows the files on your computer of the directory you are working in\n- **Viewer** - can view data or R objects\n- **Help** - shows help of R commands\n- **Plots** - pictures and figures\n- **Packages** - list of R packages that are loaded in memory\n\n## Getting Started\n\n- File --\\> New File --\\> R Script\n- Save the blank R script as Module1.R\n\n## Explaining output on slides\n\nIn slides, a command (we'll also call them code or a code chunk) will look like this\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(\"I'm code\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"I'm code\"\n```\n\n\n:::\n:::\n\n\n\n\nAnd then directly after it, will be the output of the code. \nSo `print(\"I'm code\")` is the code chunk and `[1] \"I'm code\"` is the output.\n\n## R as a calculator\n\nYou can do basic arithmetic in R, which I surprisingly use all the time.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n2 + 2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 4\n```\n\n\n:::\n\n```{.r .cell-code}\n2 * 4\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n\n```{.r .cell-code}\n2^3\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 8\n```\n\n\n:::\n:::\n\n\n\n\n## R as a calculator\n\n- The R console is a full calculator\n- Try to play around with it:\n - +, -, /, * are add, subtract, divide and multiply\n - ^ or ** is power\n - parentheses -- ( and ) -- work with order of operations \n - %% finds the remainder\n \n\n## Execute / Run Code\n\nTo execute or run a line of code, you just put your cursor on line of code and then:\n\n 1. Press Run (which you will find at the top of your window)\n\n OR\n\n 2. 
Press `Cmd + Return` (iOS) OR `Ctrl + Enter` (Windows).\n\nTo execute or run multiple lines of code, you just need to highlight the code you want to run and then follow option 1 or 2.\n\n## Mini exercise \n\nExecute `5+4` from your .R file, and then find the answer 9 in the Console.\n\n## Commenting in Scripts\n\nThe syntax `#` creates a comment, which means anything to the right of `#` will not be executed / run\n\nCommenting is useful to:\n\n1. Create headers for R Scripts\n2. Create sections within an R Script\n3. Explain what is happening in your code \n\n## Commenting an R Script header\n\nAdd a comment header to Module1.R. This is the one I typically use, but you may have your own preference. The goal is that you are consistent so that future you / collaborators can make sense of your code.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### Title: Module 1\n### Author: Amy Winter \n### Objective: Mini Exercise - Developing first R Script\n### Date: 15 July 2024\n```\n:::\n\n\n\n\n## Commenting to create sections\n\nYou can also create sections within your code by ending a comment with 4 hash marks. **This is very useful for creating an outline of your R Script.** The \"Outline\" can be found in the top right of the your source window.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Section 1 Header ####\n## Section 2 Sub-header ####\n### Section 3 Sub-sub-header ####\n#### Section 4 Sub-sub-sub-header ####\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/outline.png){width=90%}\n:::\n:::\n\n\n\n\n\n## Commenting to explain code\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## this # is still a comment\n### you can use many #'s as you want\n\n# sometimes you have a really long comment,\n# like explaining what you are doing\n# for a step in analysis. 
\n# Take it to another line\n```\n:::\n\n\n\n\n## Commenting to explain code\n\nI tend to use:\n\n- One hash tag with a space to describe what is happening in the following few lines of code\n- One hastag with no space after a command to list specifics \n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Practicing my arithmetic\n5+2\n3*5\n9/8\n\n5+2 #5 plus 2 \n```\n:::\n\n\n\n\n## Object - Basic terms\n\n**Object** - an object is something that can be worked with in R - can be lots of different things!\n\n- a scalar / number\n- a vector\n- a matrix of numbers\n- a list\n- a plot\n- a function\n\n... many more\n\n## Objects\n\n- You can create objects from within the R environment and from files on your computer\n- R uses `<-` to assign values to an object name \n- Note: Object names are case-sensitive, i.e. X and x are different\n- Here are examples of creating five different objects:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\n\n\nNote, `c()` and `matrix()` are functions, which we will talk more about in module 2.\n\n\n## Mini Exercise\n\nTry creating one or two of these objects in your R script\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\n\n\n## Objects \n\nNote, you can find these objects now in the Global Environment.\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/global_env.png){width=90%}\n:::\n:::\n\n\n\n\n\nAlso, you can call them anytime (i.e, see them in the Console) by executing (running) the object. 
For example,\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncharacter.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] \"blue\"\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\n\n## Assignment - Good coding\n\n`=` and `<-` can both be used for assignment, but `<-` is better coding practice, because `==` is a logical operator. We will talk about this more, later.\n\n## Lists\n\nList is a special data class, that can hold vectors, strings, matrices, models, list of other lists.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n\n\n:::\n:::\n\n\n\n\n\n## Useful R Studio Shortcuts\n\nWill certainly save you time\n\n- `Cmd + Return` (iOS) OR `Ctrl + Enter` (Windows) in your script evaluates current line/selection\n - It's like copying and pasting the code into the console for it to run.\n- pressing Up/Down in the Console allows you to navigate command history\n\nSee for many more\n\n\n## RStudio helps with \"tab completion\"\n\nIf you start typing a object, RStudio will show you options that you can choose without typing out the whole object.\n\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/tab.completion.png){width=90%}\n:::\n:::\n\n\n\n\n\n\n\n## Summary\n\n- RStudio makes working in R easier\n- The Editor is for static code like R Scripts\n- The Console is for testing code that can't be saved\n- Commenting is your new best friend\n- In R we create objects that can be viewed in the Environment panel and called anytime\n- An object is something that can be worked with in R\n- Use `<-` syntax to create objects\n\n\n## Mini Exercise\n\n1. 
Create a new number object and name it `my.object`\n2. Create a vector of 4 numbers and name it `my.vector` using the `c()` function\n3. Add `my.object` and `my.vector` together use arithmatic operator\n\n## Acknowledgements\n\nThese are the materials I looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- Some RStudio snapshots were pulled from \n", + "markdown": "---\ntitle: \"Module 1: Introduction to RStudio and R Basics\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 1, you should be able to...\n\n- Create and save an R script\n- Describe the utility and differences b/w the Console and the Source panes\n- Modify R Studio panes\n- Create objects\n- Describe the difference b/w character, numeric, list, and matrix objects\n- Reference objects in the RStudio Environment pane\n- Use basic arithmetic operators in R\n- Use comments within an R script to create header, sections, and make notes\n\n## Working with R -- RStudio\n\nRStudio is an Integrated Development Environment (IDE) for R\n\n- It helps the user effectively use R\n- Makes things easier\n- Is NOT a dropdown statistical tool (such as Stata)\n - See [jamovi](https://www.jamovi.org/) or also [Rcmdr](https://cran.r-project.org/web/packages/Rcmdr/index.html), [Radiant](http://vnijs.github.io/radiant/)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://d33wubrfki0l68.cloudfront.net/62bcc8535a06077094ca3c29c383e37ad7334311/a263f/assets/img/logo.svg){fig-align='center' fig-alt='RStudio logo' width=30%}\n:::\n:::\n\n\n## RStudio\n\nEasier working with R\n\n- Syntax highlighting, code completion, and smart indentation\n- Easily manage multiple working directories and projects\n\nMore information\n\n- Workspace browser and data viewer\n- Plot history, zooming, and 
flexible image and file export\n- Integrated R help and documentation\n- Searchable command history\n\n## RStudio\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](https://ayeimanolr.files.wordpress.com/2013/04/r-rstudio-1-1.png?w=640&h=382){fig-align='center' fig-alt='RStudio' width=80%}\n:::\n:::\n\n\n## Getting the editor\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/both.png){width=90%}\n:::\n:::\n\n\n## Working with R in RStudio - 2 major panes:\n\n1) The **Source/Editor**: xxamy\n\n- \"Analysis\" Script\n- Static copy of what you did (reproducibility)\n- Top by default\n \n2) The **R Console**: \"interprets\" whatever you type:\n\n - Calculator\n - Try things out interactively, then add to your editor\n - Bottom by default\n\n## Source / Editor\n\n- Where files open to\n- Have R code and comments in them\n- Where code is saved\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/rstudio_script.png){width=200%}\n:::\n:::\n\n\n## R Console\n\n- Where code is executed (where things happen)\n- You can type here for things interactively\n- Code is **not saved**\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rstudio_console.png){fig-align='center' width=60%}\n:::\n:::\n\n\n\n## RStudio\n\nUseful RStudio \"cheat sheet\": \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rstudio_sheet.png){fig-align='center' fig-alt='RStudio' width=65%}\n:::\n:::\n\n\n\n## RStudio Layout\n\nIf RStudio doesn't look the way you want (or like our RStudio), then do:\n\nIn R Studio Menu Bar go to View Menu --\\> Panes --\\> Pane Layout\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/pane_layout.png){fig-align='center' width=500px}\n:::\n:::\n\n\n## Workspace/Environment\n\n- Tells you what **objects** are in R\n- What exists in memory/what is loaded?/what did I read in?\n\n\n::: {.cell}\n::: 
{.cell-output-display}\n![](images/rstudio_environment.png){width=90%}\n:::\n:::\n\n\n## Workspace/History\n\n- Shows previous commands. Good to look at for debugging, but **don't rely** on it.\n- Also type the \"up\" and \"down\" key in the Console to scroll through previous commands\n\n## Workspace/Other Panes\n\n- **Files** - shows the files on your computer of the directory you are working in\n- **Viewer** - can view data or R objects\n- **Help** - shows help of R commands\n- **Plots** - pictures and figures\n- **Packages** - list of R packages that are loaded in memory\n\n## Getting Started\n\n- In R Studio Menu Bar go to File Menu --\\> New File --\\> R Script\n- Save the blank R script as Module1.R\n\n## Explaining output on slides\n\nIn slides, the R command/code will be in a box, and then directly after it, will be the output of the code starting with `[1]`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nprint(\"I'm code\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"I'm code\"\n```\n:::\n:::\n\n\nSo `print(\"I'm code\")` is the command and `[1] \"I'm code\"` is the output.\n\n
\n\nCommands/code and output written as inline text will be in typewriter blue font. For example `code`\n\n## R as a calculator\n\nYou can do basic arithmetic in R, which I surprisingly use all the time.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n2 + 2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 4\n```\n:::\n\n```{.r .cell-code}\n2 * 4\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 8\n```\n:::\n\n```{.r .cell-code}\n2^3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 8\n```\n:::\n:::\n\n\n## R as a calculator\n\n- The R console is a full calculator\n- Arithmetic operators:\n - `+`, `-`, `/`, `*` are add, subtract, divide and multiply\n - `^` or `**` is power\n - parentheses -- `(` and `)` -- work with order of operations \n - `%%` finds the remainder\n \n\n## Execute / Run Code\n\nTo execute or run a line of code (i.e., command), you just put your cursor on the command and then:\n\n 1. Press Run (which you will find at the top of your window)\n\n OR\n\n 2. Press `Cmd + Return` (macOS) OR `Ctrl + Enter` (Windows).\n\nTo execute or run multiple lines of code, you need to highlight the code you want to run and then follow option 1 or 2.\n\n## Mini exercise \n\nExecute `5+4` from your .R file, and then find the answer 9 in the Console.\n\n## Commenting in Scripts\n\nThe syntax `#` creates a comment, which means anything to the right of `#` will not be executed / run\n\nCommenting is useful to:\n\n1. Create headers for R Scripts\n2. Create sections within an R Script\n3. Explain what is happening in your code \n\n## Commenting an R Script header\n\nAdd a comment header to Module1.R. This is the one I typically use, but you may have your own preference. 
The goal is that you are consistent so that future you / collaborators can make sense of your code.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### Title: Module 1\n### Author: Amy Winter \n### Objective: Mini Exercise - Developing first R Script\n### Date: 15 July 2024\n```\n:::\n\n\n## Commenting to create sections\n\nYou can also create sections within your code by ending a comment with 4 hash marks. **This is very useful for creating an outline of your R Script.** The \"Outline\" can be found in the top right of your Source pane\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Section 1 Header ####\n## Section 2 Sub-header ####\n### Section 3 Sub-sub-header ####\n#### Section 4 Sub-sub-sub-header ####\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/outline.png){width=90%}\n:::\n:::\n\n\n\n## Commenting to explain code\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## this # is still a comment\n### you can use as many #'s as you want\n\n# sometimes you have a really long comment,\n# like explaining what you are doing\n# for a step in analysis. \n# Take it to another line\n```\n:::\n\n\nI tend to use:\n\n- One hash mark with a space to describe what is happening in the following few lines of code\n- One hash mark with no space after a command to list specifics \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Practicing my arithmetic\n5+2\n3*5\n9/8\n\n5+2 #5 plus 2 \n```\n:::\n\n\n## Object - Basic terms\n\n**Object** - an object is something that can be worked with in R - can be lots of different things!\n\n- a scalar / number\n- a vector\n- a matrix of numbers\n- a list\n- a plot\n- a function\n\n... many more\n\n## Objects\n\n- You can create objects from within the R environment and from files on your computer\n- R uses `<-` to assign values to an object name \n- Note: Object names are case-sensitive, i.e. 
`X` and `x` are different\n- Here are examples of creating five different objects:\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\nNote, `c()` and `matrix()` are functions, which we will talk more about in module 2.\n\n\n## Mini Exercise\n\nTry creating one or two of these objects in your R script\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\n## Objects \n\nNote, you can find these objects now in the Global Environment.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/global_env.png){width=90%}\n:::\n:::\n\n\n\nAlso, you can print them anytime (i.e., see them in the Console) by executing (running) the object. For example,\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncharacter.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"blue\"\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\n\n# Object names and assignment - Good coding\n\nxxzane\n\n`=` and `<-` can both be used for assignment, but `<-` is better coding practice, because sometimes `=` doesn't work and we want to distinguish assignment from the logical operator `==`. 
We will talk about this more, later.\n\n## Lists\n\nList is a special data class, that can hold vectors, strings, matrices, models, list of other lists.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\n\n## Useful R Studio Shortcuts\n\nWill certainly save you time\n\n- `Cmd + Return` (iOS) OR `Ctrl + Enter` (Windows) in your script evaluates current line/selection\n - It's like copying and pasting the code into the console for it to run.\n- pressing Up/Down in the Console allows you to navigate command history\n\nSee for many more\n\n\n## RStudio helps with \"tab completion\"\n\nIf you start typing a object, RStudio will show you options that you can choose without typing out the whole object.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/tab.completion.png){width=90%}\n:::\n:::\n\n\n\n\n\n## Summary\n\n- RStudio makes working in R easier\n- The Editor is for static code like R Scripts\n- The Console is for testing code that can't be saved\n- Commenting is your new best friend\n- In R we create objects that can be viewed in the Environment pane and used anytime\n- An object is something that can be worked with in R\n- Use `<-` syntax to create objects\n\n\n## Mini Exercise\n\n1. Create a new number object and name it `my.object`\n2. Create a vector of 4 numbers and name it `my.vector` using the `c()` function\n3. 
Add `my.object` and `my.vector` together using an arithmetic operator\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- Some RStudio snapshots were pulled from \n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module02-Functions/execute-results/html.json b/_freeze/modules/Module02-Functions/execute-results/html.json index fe5023e..f66418b 100644 --- a/_freeze/modules/Module02-Functions/execute-results/html.json +++ b/_freeze/modules/Module02-Functions/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "0531e7ec69b41ee43083c73f617056fc", + "hash": "147c719ed518b56df6eee7d8e94ffde0", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 2: Functions\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 2, you should be able to...\n\n- Describe and execute functions in R\n- Modify default behavior of functions using arguments in R\n- Use R-specific sources of help to get more information about functions and packages \n- Differentiate between Base R functions and functions that come from other packages\n\n\n## Function - Basic term\n\n**Function** - Functions are \"self contained\" modules of code that accomplish specific tasks. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.\n\nA function might help you add numbers together, create a plot, or organize your data. In fact, we have already used three functions in the Module 1, including `c()`, `matrix()`, `list()`. 
Here is another one, `sum()`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum(1, 20234)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 20235\n```\n\n\n:::\n:::\n\n\n\n\n## Function\n\nThe general usage for a function is the name of the function followed by parentheses. Within the parentheses are **arguments**.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunction_name(argument1, argument2, ...)\n```\n:::\n\n\n\n\n## Arguments - Basic term\n\n**Arguments** are what you pass to the function and can include:\n\n1. the physical object on which the function carries out a task (e.g., can be data such as a number 1 or 20234)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum(1, 20234)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 20235\n```\n\n\n:::\n:::\n\n\n\n2. options that alter the way the function operates (e.g., such as the `base` argument in the function `log()`)\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10, base = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 1\n```\n\n\n:::\n\n```{.r .cell-code}\nlog(10, base = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.321928\n```\n\n\n:::\n\n```{.r .cell-code}\nlog(10, base=exp(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.302585\n```\n\n\n:::\n:::\n\n\n\n## Arguments\n\nMost functions are created with **default argument options**. The defaults represent standard values that the author of the function specified as being \"good enough in standard cases\". 
This means if you don't specify an argument when calling the function, it will use a default.\n\n- If you want something specific, simply change the argument yourself with a value of your choice.\n- If an argument is required but you did not specify it and there is no default argument specified when the function was created, you will receive an error.\n\n## Example\n\nWhat is the default in the `base` argument of the `log()` function?\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 2.302585\n```\n\n\n:::\n:::\n\n\n\n## Sure that is easy enough, but how do you know\n\n- the purpose of a function? \n- what arguments a function includes? \n- how to specify the arguments?\n\n## Seeking help for using functions\n\nThe best way of finding out this information is to use the `?` followed by the name of the function. Doing this will open up the help manual in the bottom RStudio Help panel. It provides a description of the function, usage, arguments, details, and examples. Lets look at the help file for the function `round()`\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?log\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nLogarithms and Exponentials\n\nDescription:\n\n 'log' computes logarithms, by default natural logarithms, 'log10'\n computes common (i.e., base 10) logarithms, and 'log2' computes\n binary (i.e., base 2) logarithms. The general form 'log(x, base)'\n computes logarithms with base 'base'.\n\n 'log1p(x)' computes log(1+x) accurately also for |x| << 1.\n\n 'exp' computes the exponential function.\n\n 'expm1(x)' computes exp(x) - 1 accurately also for |x| << 1.\n\nUsage:\n\n log(x, base = exp(1))\n logb(x, base = exp(1))\n log10(x)\n log2(x)\n \n log1p(x)\n \n exp(x)\n expm1(x)\n \nArguments:\n\n x: a numeric or complex vector.\n\n base: a positive or complex number: the base with respect to which\n logarithms are computed. 
Defaults to e='exp(1)'.\n\nDetails:\n\n All except 'logb' are generic functions: methods can be defined\n for them individually or via the 'Math' group generic.\n\n 'log10' and 'log2' are only convenience wrappers, but logs to\n bases 10 and 2 (whether computed _via_ 'log' or the wrappers) will\n be computed more efficiently and accurately where supported by the\n OS. Methods can be set for them individually (and otherwise\n methods for 'log' will be used).\n\n 'logb' is a wrapper for 'log' for compatibility with S. If (S3 or\n S4) methods are set for 'log' they will be dispatched. Do not set\n S4 methods on 'logb' itself.\n\n All except 'log' are primitive functions.\n\nValue:\n\n A vector of the same length as 'x' containing the transformed\n values. 'log(0)' gives '-Inf', and 'log(x)' for negative values\n of 'x' is 'NaN'. 'exp(-Inf)' is '0'.\n\n For complex inputs to the log functions, the value is a complex\n number with imaginary part in the range [-pi, pi]: which end of\n the range is used might be platform-specific.\n\nS4 methods:\n\n 'exp', 'expm1', 'log', 'log10', 'log2' and 'log1p' are S4 generic\n and are members of the 'Math' group generic.\n\n Note that this means that the S4 generic for 'log' has a signature\n with only one argument, 'x', but that 'base' can be passed to\n methods (but will not be used for method selection). On the other\n hand, if you only set a method for the 'Math' group generic then\n 'base' argument of 'log' will be ignored for your class.\n\nSource:\n\n 'log1p' and 'expm1' may be taken from the operating system, but if\n not available there then they are based on the Fortran subroutine\n 'dlnrel' by W. Fullerton of Los Alamos Scientific Laboratory (see\n ) and (for small x) a\n single Newton step for the solution of 'log1p(y) = x'\n respectively.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole. (for 'log', 'log10' and\n 'exp'.)\n\n Chambers, J. M. 
(1998) _Programming with Data. A Guide to the S\n Language_. Springer. (for 'logb'.)\n\nSee Also:\n\n 'Trig', 'sqrt', 'Arithmetic'.\n\nExamples:\n\n log(exp(3))\n log10(1e7) # = 7\n \n x <- 10^-(1+2*1:9)\n cbind(deparse.level=2, # to get nice column names\n x, log(1+x), log1p(x), exp(x)-1, expm1(x))\n\n\n\n## How to specify arguments\n\n1. Arguments are separated with a comma\n2. You can specify arguments by either including them in the correct order OR by assigning the argument within the function parentheses.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/log_args.png){width=70%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.321928\n```\n\n\n:::\n\n```{.r .cell-code}\nlog(base=2, x=10)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.321928\n```\n\n\n:::\n\n```{.r .cell-code}\nlog(x=10, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.321928\n```\n\n\n:::\n\n```{.r .cell-code}\nlog(10, base=2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 3.321928\n```\n\n\n:::\n:::\n\n\n\n## Package - Basic term\n\nWhen you download R, it has a \"base\" set of functions, that are associated with a \"base\" set of packages including: 'base', 'datasets', 'graphics', 'grDevices', 'methods', 'stats', 'methods' (typically just referred to as **Base R**).\n\n- e.g., the `log()` function comes from the 'base' package\n\n**Package** - a package in R is a bundle or \"package\" of code (and or possibly data) that can be loaded together for easy repeated use or for **sharing** with others.\n\nPackages are analogous to software applications like Microsoft Word. 
After installation, your operating system allows you to use it, just like having Word installed allows you to use it.\n\n## Packages\n\nThe Packages window in RStudio can help you identify what have been installed (listed), and which one have been called (check mark).\n\nLets go look at the Packages window, find the `base` package and find the `log()` function. It automatically loads the help file that we looked at earlier using `?log`.\n\n\n## Additional Packages\n\nYou can install additional packages for your uses from [CRAN](https://cran.r-project.org/) or [GitHub](https://github.com/). These additional packages are written by RStudio or R users/developers (like us)\n\n- Not all packages available on CRAN or GitHub are trustworthy\n- RStudio (the company) makes a lot of great packages\n- Who wrote it? **Hadley Wickham** is a major authority on R (Employee and Developer at RStudio)\n- How to [trust](https://simplystatistics.org/posts/2015-11-06-how-i-decide-when-to-trust-an-r-package/#:~:text=The%20first%20thing%20I%20do,I%20immediately%20trust%20the%20package.) an R package\n\n## **Installing** and calling packages\n\nTo use the bundle or \"package\" of code (and or possibly data) from a package, you need to install and also call the package.\n\nTo install a package you can \n\n1. go to Tools ---\\> Install Packages in the RStudio header\n\nOR\n\n2. use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(package_name)\n```\n:::\n\n\n\n\n## Installing and **calling** packages\n\nTo call (i.e., be able to use the package) you can use the following code:\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package_name)\n```\n:::\n\n\n\nMore on installing and calling packages later...\n\n\n## Mini Exercise\n\nFind and execute a **Base R** function that will round the number 0.86424 to two digits.\n\n\n## Functions from Module 1\n\nThe combine function `c()` collects/combines/joins single R objects into a vector of R objects. 
It is mostly used for creating vectors of numbers, character strings, and other data types. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?c\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nCombine Values into a Vector or List\n\nDescription:\n\n This is a generic function which combines its arguments.\n\n The default method combines its arguments to form a vector. All\n arguments are coerced to a common type which is the type of the\n returned value, and all attributes except names are removed.\n\nUsage:\n\n ## S3 Generic function\n c(...)\n \n ## Default S3 method:\n c(..., recursive = FALSE, use.names = TRUE)\n \nArguments:\n\n ...: objects to be concatenated. All 'NULL' entries are dropped\n before method dispatch unless at the very beginning of the\n argument list.\n\nrecursive: logical. If 'recursive = TRUE', the function recursively\n descends through lists (and pairlists) combining all their\n elements into a vector.\n\nuse.names: logical indicating if 'names' should be preserved.\n\nDetails:\n\n The output type is determined from the highest type of the\n components in the hierarchy NULL < raw < logical < integer <\n double < complex < character < list < expression. Pairlists are\n treated as lists, whereas non-vector components (such as 'name's /\n 'symbol's and 'call's) are treated as one-element 'list's which\n cannot be unlisted even if 'recursive = TRUE'.\n\n There is a 'c.factor' method which combines factors into a factor.\n\n 'c' is sometimes used for its side effect of removing attributes\n except names, for example to turn an 'array' into a vector.\n 'as.vector' is a more intuitive way to do this, but also drops\n names. Note that methods other than the default are not required\n to do this (and they will almost certainly preserve a class\n attribute).\n\n This is a primitive function.\n\nValue:\n\n 'NULL' or an expression or a vector of an appropriate mode. 
(With\n no arguments the value is 'NULL'.)\n\nS4 methods:\n\n This function is S4 generic, but with argument list '(x, ...)'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'unlist' and 'as.vector' to produce attribute-free vectors.\n\nExamples:\n\n c(1,7:9)\n c(1:5, 10.5, \"next\")\n \n ## uses with a single argument to drop attributes\n x <- 1:4\n names(x) <- letters[1:4]\n x\n c(x) # has names\n as.vector(x) # no names\n dim(x) <- c(2,2)\n x\n c(x)\n as.vector(x)\n \n ## append to a list:\n ll <- list(A = 1, c = \"C\")\n ## do *not* use\n c(ll, d = 1:3) # which is == c(ll, as.list(c(d = 1:3)))\n ## but rather\n c(ll, d = list(1:3)) # c() combining two lists\n \n c(list(A = c(B = 1)), recursive = TRUE)\n \n c(options(), recursive = TRUE)\n c(list(A = c(B = 1, C = 2), B = c(E = 7)), recursive = TRUE)\n```\n\n\n:::\n:::\n\n\n\n## Functions from Module 1\n\nThe `matrix()` function creates a matrix from the given set of values.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?matrix\n```\n:::\n\n\n\nxxamy - doesn't seem to work - may need to paste in a screen shot figure\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nNo documentation for 'matix' in specified packages and libraries\n```\n\n\n:::\n:::\n\n\n\n\n## Summary\n\n- Functions are \"self contained\" modules of code that accomplish specific tasks.\n- Arguments are what you pass to functions (e.g., objects on which you carry out the task or options for how to carry out the task)\n- Arguments may include defaults that the author of the function specified as being \"good enough in standard cases\", but that can be changed.\n- An R Package is a bundle or \"package\" of code (and or possibly data) that can be used by installing it once and calling it (using `library()`) each time R/Rstudio is opened\n- The Help window in RStudio is useful for to get more information about functions and packages \n\n\n## 
Acknowledgements\n\nThese are the materials I looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R - ARCHIVED\" from Harvard Chan Bioinformatics Core (HBC)](https://hbctraining.github.io/Intro-to-R/lessons/03_introR-functions-and-arguments.html#:\\~:text=A%20key%20feature%20of%20R,it%2C%20and%20return%20a%20result.)\n\n\n", + "markdown": "---\ntitle: \"Module 2: Functions\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 2, you should be able to...\n\n- Describe and execute functions in R\n- Modify default behavior of functions using arguments in R\n- Use R-specific sources of help to get more information about functions and packages \n- Differentiate between Base R functions and functions that come from other packages\n\n\n## Function - Basic term\n\n**Function** - Functions are \"self contained\" modules of code that **accomplish specific tasks**. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.\n\nA function might help you add numbers together, create a plot, or organize your data. In fact, we have already used three functions in the Module 1, including `c()`, `matrix()`, `list()`. Here is another one, `sum()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum(1, 20234)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 20235\n```\n:::\n:::\n\n\n\n## Function\n\nThe general usage for a function is the name of the function followed by parentheses (i.e., the function signature). Within the parentheses are **arguments**.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfunction_name(argument1, argument2, ...)\n```\n:::\n\n\n\n## Arguments - Basic term\n\n**Arguments** are what you pass to the function and can include:\n\n1. 
the physical object on which the function carries out a task (e.g., can be data such as a number 1 or 20234)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsum(1, 20234)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 20235\n```\n:::\n:::\n\n\n2. options that alter the way the function operates (e.g., such as the `base` argument in the function `log()`)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10, base = 10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1\n```\n:::\n\n```{.r .cell-code}\nlog(10, base = 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.321928\n```\n:::\n\n```{.r .cell-code}\nlog(10, base=exp(1))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.302585\n```\n:::\n:::\n\n\n## Arguments\n\nMost functions are created with **default argument options**. The defaults represent standard values that the author of the function specified as being \"good enough in standard cases\". This means if you don't specify an argument when calling the function, it will use a default.\n\n- If you want something specific, simply change the argument yourself with a value of your choice.\n- If an argument is required but you did not specify it and there is no default argument specified when the function was created, you will receive an error.\n\n## Example\n\nWhat is the default in the `base` argument of the `log()` function?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2.302585\n```\n:::\n:::\n\n\n## Sure that is easy enough, but how do you know\n\n- the purpose of a function? \n- what arguments a function includes? \n- how to specify the arguments?\n\n## Seeking help for using functions\n\nThe best way of finding out this information is to use the `?` followed by the name of the function. Doing this will open up the help manual in the bottom RStudio Help panel. It provides a description of the function, usage, arguments, details, and examples. 
Let's look at the help file for the function `log()`\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?log\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nLogarithms and Exponentials\n\nDescription:\n\n 'log' computes logarithms, by default natural logarithms, 'log10'\n computes common (i.e., base 10) logarithms, and 'log2' computes\n binary (i.e., base 2) logarithms. The general form 'log(x, base)'\n computes logarithms with base 'base'.\n\n 'log1p(x)' computes log(1+x) accurately also for |x| << 1.\n\n 'exp' computes the exponential function.\n\n 'expm1(x)' computes exp(x) - 1 accurately also for |x| << 1.\n\nUsage:\n\n log(x, base = exp(1))\n logb(x, base = exp(1))\n log10(x)\n log2(x)\n \n log1p(x)\n \n exp(x)\n expm1(x)\n \nArguments:\n\n x: a numeric or complex vector.\n\n base: a positive or complex number: the base with respect to which\n logarithms are computed. Defaults to e='exp(1)'.\n\nDetails:\n\n All except 'logb' are generic functions: methods can be defined\n for them individually or via the 'Math' group generic.\n\n 'log10' and 'log2' are only convenience wrappers, but logs to\n bases 10 and 2 (whether computed _via_ 'log' or the wrappers) will\n be computed more efficiently and accurately where supported by the\n OS. Methods can be set for them individually (and otherwise\n methods for 'log' will be used).\n\n 'logb' is a wrapper for 'log' for compatibility with S. If (S3 or\n S4) methods are set for 'log' they will be dispatched. Do not set\n S4 methods on 'logb' itself.\n\n All except 'log' are primitive functions.\n\nValue:\n\n A vector of the same length as 'x' containing the transformed\n values. 'log(0)' gives '-Inf', and 'log(x)' for negative values\n of 'x' is 'NaN'. 
'exp(-Inf)' is '0'.\n\n For complex inputs to the log functions, the value is a complex\n number with imaginary part in the range [-pi, pi]: which end of\n the range is used might be platform-specific.\n\nS4 methods:\n\n 'exp', 'expm1', 'log', 'log10', 'log2' and 'log1p' are S4 generic\n and are members of the 'Math' group generic.\n\n Note that this means that the S4 generic for 'log' has a signature\n with only one argument, 'x', but that 'base' can be passed to\n methods (but will not be used for method selection). On the other\n hand, if you only set a method for the 'Math' group generic then\n 'base' argument of 'log' will be ignored for your class.\n\nSource:\n\n 'log1p' and 'expm1' may be taken from the operating system, but if\n not available there then they are based on the Fortran subroutine\n 'dlnrel' by W. Fullerton of Los Alamos Scientific Laboratory (see\n ) and (for small x) a\n single Newton step for the solution of 'log1p(y) = x'\n respectively.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole. (for 'log', 'log10' and\n 'exp'.)\n\n Chambers, J. M. (1998) _Programming with Data. A Guide to the S\n Language_. Springer. (for 'logb'.)\n\nSee Also:\n\n 'Trig', 'sqrt', 'Arithmetic'.\n\nExamples:\n\n log(exp(3))\n log10(1e7) # = 7\n \n x <- 10^-(1+2*1:9)\n cbind(deparse.level=2, # to get nice column names\n x, log(1+x), log1p(x), exp(x)-1, expm1(x))\n\n\n## How to specify arguments\n\n1. Arguments are separated with a comma\n2. 
You can specify arguments by either including them in the correct order OR by assigning the argument within the function parentheses.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/log_args.png){width=70%}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlog(10, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.321928\n```\n:::\n\n```{.r .cell-code}\nlog(base=2, x=10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.321928\n```\n:::\n\n```{.r .cell-code}\nlog(x=10, 2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.321928\n```\n:::\n\n```{.r .cell-code}\nlog(10, base=2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3.321928\n```\n:::\n:::\n\n\n## Package - Basic term\n\nWhen you download R, it has a \"base\" set of functions, that are associated with a \"base\" set of packages including: 'base', 'datasets', 'graphics', 'grDevices', 'methods', 'stats' (typically just referred to as **Base R**).\n\n- e.g., the `log()` function comes from the 'base' package\n\n**Package** - a package in R is a bundle or \"package\" of code (and or possibly data) that can be loaded together for easy repeated use or for **sharing** with others.\n\nPackages are analogous to software applications like Microsoft Word. After installation, your operating system allows you to use it, just like having Word installed allows you to use it.\n\n## Packages\n\nThe Packages pane in RStudio can help you identify which packages have been installed (listed), and which ones have been attached (check mark).\n\nLet's go look at the Packages pane, find the `base` package and find the `log()` function. It automatically loads the help file that we looked at earlier using `?log`.\n\n\n## Additional Packages\n\nYou can install additional packages for your use from [CRAN](https://cran.r-project.org/) or [GitHub](https://github.com/). 
These additional packages are written by RStudio or R users/developers (like us).\n\n- Not all packages available on CRAN or GitHub are trustworthy\n- RStudio (the company) makes a lot of great packages\n- Who wrote it? **Hadley Wickham** is a major authority on R (Employee and Developer at RStudio)\n- How to [trust](https://simplystatistics.org/posts/2015-11-06-how-i-decide-when-to-trust-an-r-package/#:~:text=The%20first%20thing%20I%20do,I%20immediately%20trust%20the%20package.) an R package\n\n## **Installing** and attaching packages\n\nTo use the bundle or \"package\" of code (and possibly data) from a package, you need to install and also attach the package.\n\nTo install a package you can \n\n1. go to the Tools menu in the RStudio menu bar ---\\> Install Packages\n\nOR\n\n2. use the following code:\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"package_name\")\n```\n:::\n\n\n\n## Installing and **attaching** packages\n\nTo attach a package (i.e., to be able to use it), you can use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(package_name) #library(package_name) also works\n```\n:::\n\n\nMore on installing and attaching packages later...\n\n\n## Mini Exercise\n\nFind and execute a **Base R** function that will round the number 0.86424 to two digits.\n\n\n## Functions from Module 1\n\nThe combine function `c()` concatenates/collects/combines single R objects into a vector of R objects. It is mostly used for creating vectors of numbers, character strings, and other data types. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n?c\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\nCombine Values into a Vector or List\n\nDescription:\n\n This is a generic function which combines its arguments.\n\n The default method combines its arguments to form a vector. 
All\n arguments are coerced to a common type which is the type of the\n returned value, and all attributes except names are removed.\n\nUsage:\n\n ## S3 Generic function\n c(...)\n \n ## Default S3 method:\n c(..., recursive = FALSE, use.names = TRUE)\n \nArguments:\n\n ...: objects to be concatenated. All 'NULL' entries are dropped\n before method dispatch unless at the very beginning of the\n argument list.\n\nrecursive: logical. If 'recursive = TRUE', the function recursively\n descends through lists (and pairlists) combining all their\n elements into a vector.\n\nuse.names: logical indicating if 'names' should be preserved.\n\nDetails:\n\n The output type is determined from the highest type of the\n components in the hierarchy NULL < raw < logical < integer <\n double < complex < character < list < expression. Pairlists are\n treated as lists, whereas non-vector components (such as 'name's /\n 'symbol's and 'call's) are treated as one-element 'list's which\n cannot be unlisted even if 'recursive = TRUE'.\n\n There is a 'c.factor' method which combines factors into a factor.\n\n 'c' is sometimes used for its side effect of removing attributes\n except names, for example to turn an 'array' into a vector.\n 'as.vector' is a more intuitive way to do this, but also drops\n names. Note that methods other than the default are not required\n to do this (and they will almost certainly preserve a class\n attribute).\n\n This is a primitive function.\n\nValue:\n\n 'NULL' or an expression or a vector of an appropriate mode. (With\n no arguments the value is 'NULL'.)\n\nS4 methods:\n\n This function is S4 generic, but with argument list '(x, ...)'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'unlist' and 'as.vector' to produce attribute-free vectors.\n\nExamples:\n\n c(1,7:9)\n c(1:5, 10.5, \"next\")\n \n ## uses with a single argument to drop attributes\n x <- 1:4\n names(x) <- letters[1:4]\n x\n c(x) # has names\n as.vector(x) # no names\n dim(x) <- c(2,2)\n x\n c(x)\n as.vector(x)\n \n ## append to a list:\n ll <- list(A = 1, c = \"C\")\n ## do *not* use\n c(ll, d = 1:3) # which is == c(ll, as.list(c(d = 1:3)))\n ## but rather\n c(ll, d = list(1:3)) # c() combining two lists\n \n c(list(A = c(B = 1)), recursive = TRUE)\n \n c(options(), recursive = TRUE)\n c(list(A = c(B = 1, C = 2), B = c(E = 7)), recursive = TRUE)\n```\n:::\n:::\n\n\n## Functions from Module 1\n\nThe `matrix()` function creates a matrix from the given set of values.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?matrix\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\nMatrices\n\nDescription:\n\n 'matrix' creates a matrix from the given set of values.\n\n 'as.matrix' attempts to turn its argument into a matrix.\n\n 'is.matrix' tests if its argument is a (strict) matrix.\n\nUsage:\n\n matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,\n dimnames = NULL)\n \n as.matrix(x, ...)\n ## S3 method for class 'data.frame'\n as.matrix(x, rownames.force = NA, ...)\n \n is.matrix(x)\n \nArguments:\n\n data: an optional data vector (including a list or 'expression'\n vector). Non-atomic classed R objects are coerced by\n 'as.vector' and all attributes discarded.\n\n nrow: the desired number of rows.\n\n ncol: the desired number of columns.\n\n byrow: logical. If 'FALSE' (the default) the matrix is filled by\n columns, otherwise the matrix is filled by rows.\n\ndimnames: A 'dimnames' attribute for the matrix: 'NULL' or a 'list' of\n length 2 giving the row and column names respectively. An\n empty list is treated as 'NULL', and a list of length one as\n row names. 
The list can be named, and the list names will be\n used as names for the dimensions.\n\n x: an R object.\n\n ...: additional arguments to be passed to or from methods.\n\nrownames.force: logical indicating if the resulting matrix should have\n character (rather than 'NULL') 'rownames'. The default,\n 'NA', uses 'NULL' rownames if the data frame has 'automatic'\n row.names or for a zero-row data frame.\n\nDetails:\n\n If one of 'nrow' or 'ncol' is not given, an attempt is made to\n infer it from the length of 'data' and the other parameter. If\n neither is given, a one-column matrix is returned.\n\n If there are too few elements in 'data' to fill the matrix, then\n the elements in 'data' are recycled. If 'data' has length zero,\n 'NA' of an appropriate type is used for atomic vectors ('0' for\n raw vectors) and 'NULL' for lists.\n\n 'is.matrix' returns 'TRUE' if 'x' is a vector and has a '\"dim\"'\n attribute of length 2 and 'FALSE' otherwise. Note that a\n 'data.frame' is *not* a matrix by this test. The function is\n generic: you can write methods to handle specific classes of\n objects, see InternalMethods.\n\n 'as.matrix' is a generic function. The method for data frames\n will return a character matrix if there is only atomic columns and\n any non-(numeric/logical/complex) column, applying 'as.vector' to\n factors and 'format' to other non-character columns. Otherwise,\n the usual coercion hierarchy (logical < integer < double <\n complex) will be used, e.g., all-logical data frames will be\n coerced to a logical matrix, mixed logical-integer will give a\n integer matrix, etc.\n\n The default method for 'as.matrix' calls 'as.vector(x)', and hence\n e.g. coerces factors to character vectors.\n\n When coercing a vector, it produces a one-column matrix, and\n promotes the names (if any) of the vector to the rownames of the\n matrix.\n\n 'is.matrix' is a primitive function.\n\n The 'print' method for a matrix gives a rectangular layout with\n dimnames or indices. 
For a list matrix, the entries of length not\n one are printed in the form 'integer,7' indicating the type and\n length.\n\nNote:\n\n If you just want to convert a vector to a matrix, something like\n\n dim(x) <- c(nx, ny)\n dimnames(x) <- list(row_names, col_names)\n \n will avoid duplicating 'x' _and_ preserve 'class(x)' which may be\n useful, e.g., for 'Date' objects.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'data.matrix', which attempts to convert to a numeric matrix.\n\n A matrix is the special case of a two-dimensional 'array'.\n 'inherits(m, \"array\")' is true for a 'matrix' 'm'.\n\nExamples:\n\n is.matrix(as.matrix(1:10))\n !is.matrix(warpbreaks) # data.frame, NOT matrix!\n warpbreaks[1:10,]\n as.matrix(warpbreaks[1:10,]) # using as.matrix.data.frame(.) method\n \n ## Example of setting row and column names\n mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,\n dimnames = list(c(\"row1\", \"row2\"),\n c(\"C.1\", \"C.2\", \"C.3\")))\n mdat\n```\n:::\n:::\n\n\n\n## Summary\n\n- Functions are \"self contained\" modules of code that accomplish specific tasks.\n- Arguments are what you pass to functions (e.g., objects on which you carry out the task or options for how to carry out the task)\n- Arguments may include defaults that the author of the function specified as being \"good enough in standard cases\", but that can be changed.\n- An R Package is a bundle or \"package\" of code (and possibly data) that can be used by installing it once and attaching it (using `library()`) each time R/RStudio is opened\n- The Help window in RStudio is useful for getting more information about functions and packages \n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R - ARCHIVED\" from Harvard Chan Bioinformatics Core 
(HBC)](https://hbctraining.github.io/Intro-to-R/lessons/03_introR-functions-and-arguments.html#:\\~:text=A%20key%20feature%20of%20R,it%2C%20and%20return%20a%20result.)\n\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module03-WorkingDirectories/execute-results/html.json b/_freeze/modules/Module03-WorkingDirectories/execute-results/html.json index 2e6560c..3061fc1 100644 --- a/_freeze/modules/Module03-WorkingDirectories/execute-results/html.json +++ b/_freeze/modules/Module03-WorkingDirectories/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "cc0e87ad3f332df20d0071d6ad92faff", + "hash": "8434fd2c84bea4b8dd46c1e3247e7a9d", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 3: Working Directories\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 3, you should be able to...\n\n- Understand your own systems file structure and the purpose of the working directory\n- Determine the working directory\n- Change the working directory\n\n## File Structure\n\nxxzane slide(s)\n\n## Working Directory -- Basic term\n\n- R \"looks\" for files on your computer relative to the \"working\" directory\n- For example, if you want to load data into R or save a figure, you will need to tell R where/store the file\n- Many people recommend not setting a directory in the scripts, rather assume you're in the directory the script is in\n\n\n## Getting and setting the working directory using code\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## get the working directory\ngetwd()\nsetwd(\"~/\") \n```\n:::\n\n\n\n## Setting a working directory\n\n- Setting the directory can sometimes (almost always when new to R) be finicky\n - **Windows**: Default directory structure involves single backslashes (\"`\\`\"), but R interprets these as\"escape\" characters. 
So you must replace the backslash with forward slashes (\"/\") or two backslashes (\"`\\\\`\")\n - **Mac/Linux**: Default is forward slashes, so you are okay\n- Typical directory structure syntax applies\n - \"..\" - goes up one level\n - \"./\" - is the current directory\n - \"\\~\" - is your \"home\" directory\n\n\n## Absolute vs. relative paths\n\nFrom Wiki\n\n- An **absolute or full path** points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. Absolute path is specific to your system alone. This means if I try your code, and you use absolute paths, it won't work unless we have the exact same folder structure where R is looking (bad).\n\n- By contrast, a **relative path starts from some given working directory**, avoiding the need to provide the full absolute path.\n\n## Relative path\n\nYou want to set you code up based on relative paths. This allows sharing of code, and also, allows you to modify your own file structure (above the working directory) without breaking your own code.\n\n\n## Setting the working directory using your cursor\n\nRemember above \"Many people recommend not setting a directory in the scripts, rather assume you're in the directory the script is in.\" To do so, go to Session --\\> Set Working Directory --\\> To Source File Location\n\nRStudio will show the code in the Console for the action you took with your cursor. This is a good way to learn about your file system how to set a correct working directory!\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"~/Dropbox/Git/SISMID-2024\")\n```\n:::\n\n\n\n\n## Setting the Working Directory\n\nIf you have not yet saved a \"source\" file, it will set working directory to the default location. 
See RStudio -\\> Preferences -\\> General for default location.\n\nTo change the working directory to another location, go to Session --\\> Set Working Directory --\\> Choose Directory`\n\nAgain, RStudio will show the code in the Console for the action you took with your cursor.\n\n\n## Summary\n\n- R \"looks\" for files on your computer relative to the \"working\" directory\n- Absolute path points to the same location in a file system - it is specific to your system and your system alone\n- Relative path points is based on the current working directory \n- Two functions, `setwd()` and `getwd()`, are your new best friends.\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n", + "markdown": "---\ntitle: \"Module 3: Working Directories\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 3, you should be able to...\n\n- Understand your own systems' file structure and the purpose of the working directory\n- Determine the working directory\n- Change the working directory\n\n## File Structure\n\nxxzane slide(s)\n\n## Working Directory -- Basic term\n\n- R \"looks\" for files on your computer relative to the \"working\" directory\n- For example, if you want to load data into R or save a figure, you will need to tell R where to look for or store the file\n- Many people recommend not setting a directory in the scripts, rather assume you're in the directory the script is in\n\n\n## Getting and setting the working directory using code\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## get the working directory\ngetwd()\nsetwd(\"~/\") \n```\n:::\n\n\n## Setting a working directory\n\n- Setting the directory can sometimes (almost always when new to R) be finicky\n - **Windows**: Default directory structure involves 
single backslashes (\"`\\`\"), but R interprets these as \"escape\" characters. So you must replace the backslash with forward slashes (\"/\") or two backslashes (\"`\\\\`\")\n - **Mac/Linux**: Default is forward slashes, so you are okay\n- Typical directory structure syntax applies\n - \"..\" - goes up one level\n - \"./\" - is the current directory\n - \"\\~\" - is your \"home\" directory\n\n\n## Absolute vs. relative paths\n\nFrom Wiki\n\n- An **absolute or full path** points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory. An absolute path is specific to your system alone. This means if I try your code, and you use absolute paths, it won't work unless we have the exact same folder structure where R is looking (bad).\n\n- By contrast, a **relative path starts from some given working directory**, avoiding the need to provide the full absolute path.\n\n## Relative path\n\nYou want to set your code up based on relative paths. This allows sharing of code and also allows you to modify your own file structure (above the working directory) without breaking your own code.\n\n\n## Setting the working directory using your cursor\n\nRemember above \"Many people recommend not setting a directory in the scripts, rather assume you're in the directory the script is in.\" To do so, go to Session --\\> Set Working Directory --\\> To Source File Location\n\nRStudio will show the code in the Console for the action you took with your cursor. 
This is a good way to learn about your file system and how to set a correct working directory!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsetwd(\"~/Dropbox/Git/SISMID-2024\")\n```\n:::\n\n\n\n## Setting the Working Directory\n\nIf you have not yet saved a \"source\" file, RStudio will set the working directory to the default location. To see the default location, go to the Tools menu in the menu bar -\\> Global Options -\\> General.\n\nTo change the working directory to another location, go to the Session menu in the menu bar --\\> Set Working Directory --\\> Choose Directory\n\nAgain, RStudio will show the code in the Console for the action you took with your cursor.\n\n\n## Summary\n\n- R \"looks\" for files on your computer relative to the \"working\" directory\n- An absolute path points to the same location in a file system - it is specific to your system and your system alone\n- A relative path is based on the current working directory \n- Two functions, `setwd()` and `getwd()`, are useful for identifying and manipulating the working directory.\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module05-DataImportExport/execute-results/html.json b/_freeze/modules/Module05-DataImportExport/execute-results/html.json index 5833ba8..2e3766f 100644 --- a/_freeze/modules/Module05-DataImportExport/execute-results/html.json +++ b/_freeze/modules/Module05-DataImportExport/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "2ccda6bd4bed2b1d83f2f251ba9dd6ce", + "hash": "a0db8f0dfe70bceb90779858e46280be", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 5: Data Import and Export\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 5, 
you should be able to...\n\n- Use Base R functions to load data\n- Install and call external R Packages to extend R's functionality\n- Install any type of data into R\n- Find loaded data in the Global Environment window of RStudio\n- Reading and writing R .Rds and .Rda/.RData files\n\n\n## Import (read) Data\n\n- Importing or 'Reading in' data is the first step of any real project/analysis\n- R can read almost any file format, especially with external, non-Base R, packages\n- We are going to focus on simple delimited files first. \n - comma separated (e.g. '.csv')\n - tab delimited (e.g. '.txt')\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. Within each row, individual fields are separated by column **delimiters** (IBM.com definition)\n\n## Mini exercise\n\n1. Download Module 5 data from the website and save the data to your data subdirectory -- specifically `SISMID_IntroToR_RProject/data`\n\n2. Open the data files in a text editor application and familiarize you self with the data.\n\n3. 
Determine the delminiter of the two '.txt' files\n\n\n## Import delimited data\n\nWithin the Base R 'util' package we can find a handful of useful functions including `read.csv()` and `read.delim()` to importing data.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read.csv\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = \"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. 
If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. See 'scan' for the behaviour on quotes\n embedded in quotes. Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. 
This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input, so 'na.strings' values may need\n their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. 
If unnamed, recycled as necessary. If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? 
Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file (not a connection) so the character data can be\n re-encoded. See the 'Encoding' section of the help for\n 'file', the 'R Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should nuls be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. 
Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. 
Embedded nuls are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. 
If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")\n```\n\n\n:::\n:::\n\n\n\n## Import .csv files\n\nReminder\n```\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n`file` is the first argument and is the path to your file, in quotes \n\n\t- \t\tcan be path in your local computer -- absolute file path or relative file path \n\t- \t\tcan be path to a file on a website\n\n## Mini Exercise\n\nIf your R Project is not already open, open it so 
we can take advantage of the working directory it sets for us when importing data.\n\n\n## Import .csv files\n\nLet's import a new data file\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\ndf <- read.csv(file = \"~/Dropbox/Git/SISMID-2024/modules/data/serodata.csv\") #absolute path starting from my home directory\n```\n:::\n\n\n\n\nNote #1, I assigned the data frame to an object called `df`. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it to an object. \n\nNote #2, look in the Environment window and you will see the `df` object ready to be used.\n\n\n## Import .txt files\n\n`read.csv()` and `read.delim()` are both wrappers around `read.table()`, the general function to read a delimited file into a data frame; they differ only in their defaults \n\n```\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n- `file` is the path to your file, in quotes \n- `sep` is the delimiter that separates the fields within a record. The default for `read.delim()` is a tab; for `read.csv()` it is a comma\n\n## Import .txt files\n\nLet's first import 'serodata1.txt', which uses a tab delimiter, and then 'serodata2.txt', which uses a semicolon delimiter.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.delim(file = \"data/serodata1.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata2.txt\", sep = \";\")\n```\n:::\n\n\n\nThe data is now successfully read into your R workspace, **many times actually.** Notice that each time we imported the data we assigned it to the `df` object, meaning we replaced the previous contents each time we reassigned `df`. \n\n\n## What if we have a .xlsx file - what do we do?\n\n1. Google / Ask ChatGPT\n2. Find and vet function and package you want\n3. Install package\n4. Call package\n5. Use function\n\n\n## 1. 
Internet Search\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ChatGPT.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/GoogleSearch.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/StackOverflow.png){width=100%}\n:::\n:::\n\n\n\n## 2. Find and vet function and package you want\n\nI am getting a consistent message to use the `read_excel()` function found in the `readxl` package. This package was developed by Hadley Wickham, who we know is reputable. Also, you can check that the data was read in correctly, because this is a straightforward task. \n\n## 3. Install Package\n\nTo use the bundle or \"package\" of code (and possibly data) from a package, you need to install and also call the package.\n\nTo install a package you can \n\n1. go to Tools ---\\> Install Packages in the RStudio header\n\nOR\n\n2. use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"package_name\")\n```\n:::\n\n\n\n\nTherefore,\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"readxl\")\n```\n:::\n\n\n\n## 4. Call Package\n\nReminder -- Installing and calling packages\n\nTo call a package (i.e., to be able to use it), you can use the following code:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(package_name)\n```\n:::\n\n\n\nTherefore, \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(readxl)\n```\n:::\n\n\n\n## 5. Use Function\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read_excel\n```\n:::\n\nRead xls and xlsx files\n\nDescription:\n\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. 
Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\n\nUsage:\n\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\n\n path: Path to the xls/xlsx file.\n\n sheet: Sheet to read. Either a string (the name of a sheet), or an\n integer (the position of the sheet). Ignored if the sheet is\n specified via 'range'. If neither argument specifies the\n sheet, defaults to the first sheet.\n\n range: A cell range to read from, as described in\n cell-specification. Includes typical Excel ranges like\n \"B3:D87\", possibly including the sheet name like\n \"Budget!B2:G14\", and more. Interpreted strictly, even if the\n range forces the inclusion of leading or trailing empty rows\n or columns. Takes precedence over 'skip', 'n_max' and\n 'sheet'.\n\ncol_names: 'TRUE' to use the first row as column names, 'FALSE' to get\n default names, or a character vector giving a name for each\n column. If user provides 'col_types' as a vector, 'col_names'\n can have one entry per column, i.e. 
have the same length as\n 'col_types', or one entry per unskipped column.\n\ncol_types: Either 'NULL' to guess all from the spreadsheet or a\n character vector containing one entry per column from these\n options: \"skip\", \"guess\", \"logical\", \"numeric\", \"date\",\n \"text\" or \"list\". If exactly one 'col_type' is specified, it\n will be recycled. The content of a cell in a skipped column\n is never read and that column will not appear in the data\n frame output. A list cell loads a column as a list of length\n 1 vectors, which are typed using the type guessing logic from\n 'col_types = NULL', but on a cell-by-cell basis.\n\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\n\n trim_ws: Should leading and trailing whitespace be trimmed?\n\n skip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\n\n n_max: Maximum number of data rows to read. Trailing empty rows are\n automatically skipped, so this is an upper bound on the\n number of rows in the returned tibble. Ignored if 'range' is\n given.\n\nguess_max: Maximum number of data rows to use for guessing column\n types.\n\nprogress: Display a progress spinner? By default, the spinner appears\n only in an interactive session, outside the context of\n knitting a document, and when the call is likely to run for\n several seconds or more. See 'readxl_progress()' for more\n details.\n\n.name_repair: Handling of column names. Passed along to\n 'tibble::as_tibble()'. 
readxl's default is `.name_repair =\n \"unique\", which ensures column names are not empty and are\n unique.\n\nValue:\n\n A tibble\n\nSee Also:\n\n cell-specification for more details on targetting cells with the\n 'range' argument\n\nExamples:\n\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", 
nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )\n\n\n\n## 5. Use Function\n\nReminder\n```\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n\n\n## Mini exercise\n\nLet's make some mistakes\n\n1. What if we read in the data without assigning it to an object (i.e., `read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")`)?\n\n2. What if we forget to specify the sheet argument (i.e., `dd <- read_excel(path = \"data/serodata.xlsx\")`)?\n\n\n## Installing and calling packages - Common confusion\n\nYou only need to install a package once (unless you update R), but you will need to call or load a package each time you want to use it. 
\n\nThe exception to this rule is the \"base\" set of packages (i.e., **Base R**), which are installed automatically when you install R and are automatically loaded whenever you open R or RStudio.\n\n\n## Common Error\n\nBe prepared to see the error \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nError: could not find function \"some_function\"\n```\n:::\n\n\n\nThis usually means that either \n\n- you called the function by the wrong name \n- you have not installed a package that contains the function\n- you have installed a package but you forgot to call it (i.e., `library(package_name)`) -- **most likely**\n\n\n## Export (write) Data \n\n- Exporting or 'Writing out' data allows you to save modified files for future use or sharing\n- R can write almost any file format, especially with external, non-Base R, packages\n- We are going to focus again on writing delimited files\n\n\n## Export delimited data\n\nWithin the Base R 'utils' package we can find a handful of useful functions, including `write.csv()` and `write.table()`, for exporting data.\n\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. 
If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. 
See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. 
If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. Note that such CSV files can be read in R by\n\n read.csv(file = \"\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. 
OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n () for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)\n```\n\n\n:::\n:::\n\n\n\n## Export delimited data\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrite.csv(df, file=\"data/serodata_new.csv\", row.names = FALSE) #comma delimited\nwrite.table(df, file=\"data/serodata1_new.txt\", sep=\"\\t\", row.names = FALSE) #tab delimited\nwrite.table(df, file=\"data/serodata2_new.txt\", sep=\";\", row.names = FALSE) #semicolon delimited\n```\n:::\n\n\n\nNote, I wrote the data to new file names. 
Even though we didn't change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.\n\n## R .rds and .rda/RData files\n\nThere are two file extensions worth discussing.\n\nR has two native data formats: Rdata (sometimes shortened to Rda) and Rds. These formats are used when R objects are saved for later use. Rdata is used to save multiple R objects, while Rds is used to save a single R object. \n\n## .rds binary file\n\nSaving datasets in `.rds` format can save time if you have to read them back in later.\n\n`write_rds()` and `read_rds()` from the `readr` package can be used to write/read a single R object to/from a file.\n\n```\nlibrary(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")\n```\n\n\n## .rda/RData files \n\nThe Base R functions `save()` and `load()` can be used to save and load multiple R objects. \n\n`save()` writes an external representation of R objects to the specified file, which can be loaded back into the environment using `load()`. A nice feature of `save()` and `load()` is that the R objects are imported directly into the environment under their original names, so you don't have to assign them to an object. The files can be saved as `.RData` or `.rda` files.\n\n```\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\n```\n\nNote that when you load `.RData` files you don't need to assign the result to an object. `load()` simply reads in the objects as they were saved. 
Therefore, `load(\"filename.RData\")` will read in `object1` and `object2` directly into the Global Environment.\n\n\n\n## Summary\n\n- Importing or 'Reading in' data is the first step of any real project/analysis\n- The Base R 'utils' package contains a handful of useful functions, including `read.csv()` and `read.delim()` for importing/reading data and `write.csv()` and `write.table()` for exporting/writing data\n- When importing data (the exception is objects from .RData), you must assign it to an object, otherwise it cannot be called/used\n- Properly read data can be found in the Environment window of RStudio\n- You only need to install a package once (unless you update R), but you will need to call or load a package each time you want to use it. \n- To complete a task you don't know how to do (e.g., reading in an Excel data file) use the following steps: 1. Google / Ask ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Call package, 5. Use function\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "markdown": "---\ntitle: \"Module 5: Data Import and Export\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 5, you should be able to...\n\n- Use Base R functions to load data\n- Install and attach external R Packages to extend R's functionality\n- Load any type of data into R\n- Find loaded data in the Environment pane of RStudio\n- Read and write R .Rds and .Rda/.RData files\n\n\n## Import (read) Data\n\n- Importing or 'Reading in' data is the first step of any real project / data analysis\n- R can read almost any file format, especially with external, non-Base R, packages\n- We are going to focus on simple delimited files first. \n - comma separated (e.g. 
'.csv')\n - tab delimited (e.g. '.txt')\n\nA delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. Within each row, individual fields are separated by column **delimiters** (IBM.com definition)\n\n## Mini exercise\n\n1. Download Module 5 data from the website and save the data to your data subdirectory -- specifically `SISMID_IntroToR_RProject/data`\n\n1. Open the '.csv' and '.txt' data files in a text editor application and familiarize yourself with the data (e.g., Notepad for Windows and TextEdit for Mac)\n\n1. Open the '.xlsx' data file in Excel and familiarize yourself with the data\n\t\t-\t\tif you use a Mac **do not** open in Numbers, it can corrupt the file\n\t\t-\t\tif you do not have Excel, you can upload it to Google Sheets\n\n1. Determine the delimiter of the two '.txt' files\n\n\n## Import delimited data\n\nWithin the Base R 'utils' package we can find a handful of useful functions, including `read.csv()` and `read.delim()`, for importing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read.csv\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\nData Input\n\nDescription:\n\n Reads a file in table format and creates a data frame from it,\n with cases corresponding to lines and variables to fields in the\n file.\n\nUsage:\n\n read.table(file, header = FALSE, sep = \"\", quote = \"\\\"'\",\n dec = \".\", numerals = c(\"allow.loss\", \"warn.loss\", \"no.loss\"),\n row.names, col.names, as.is = !stringsAsFactors, tryLogical = TRUE,\n na.strings = \"NA\", colClasses = NA, nrows = -1,\n skip = 0, check.names = TRUE, fill = !blank.lines.skip,\n strip.white = FALSE, blank.lines.skip = TRUE,\n comment.char = \"#\",\n allowEscapes = FALSE, flush = FALSE,\n 
stringsAsFactors = FALSE,\n fileEncoding = \"\", encoding = \"unknown\", text, skipNul = FALSE)\n \n read.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.csv2(file, header = TRUE, sep = \";\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n \n read.delim2(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \",\", fill = TRUE, comment.char = \"\", ...)\n \nArguments:\n\n file: the name of the file which the data are to be read from.\n Each row of the table appears as one line of the file. If it\n does not contain an _absolute_ path, the file name is\n _relative_ to the current working directory, 'getwd()'.\n Tilde-expansion is performed where supported. This can be a\n compressed file (see 'file').\n\n Alternatively, 'file' can be a readable text-mode connection\n (which will be opened for reading if necessary, and if so\n 'close'd (and hence destroyed) at the end of the function\n call). (If 'stdin()' is used, the prompts for lines may be\n somewhat confusing. Terminate input with a blank line or an\n EOF signal, 'Ctrl-D' on Unix and 'Ctrl-Z' on Windows. Any\n pushback on 'stdin()' will be cleared before return.)\n\n 'file' can also be a complete URL. (For the supported URL\n schemes, see the 'URLs' section of the help for 'url'.)\n\n header: a logical value indicating whether the file contains the\n names of the variables as its first line. If missing, the\n value is determined from the file format: 'header' is set to\n 'TRUE' if and only if the first row contains one fewer field\n than the number of columns.\n\n sep: the field separator character. Values on each line of the\n file are separated by this character. 
If 'sep = \"\"' (the\n default for 'read.table') the separator is 'white space',\n that is one or more spaces, tabs, newlines or carriage\n returns.\n\n quote: the set of quoting characters. To disable quoting altogether,\n use 'quote = \"\"'. See 'scan' for the behaviour on quotes\n embedded in quotes. Quoting is only considered for columns\n read as character, which is all of them unless 'colClasses'\n is specified.\n\n dec: the character used in the file for decimal points.\n\nnumerals: string indicating how to convert numbers whose conversion to\n double precision would lose accuracy, see 'type.convert'.\n Can be abbreviated. (Applies also to complex-number inputs.)\n\nrow.names: a vector of row names. This can be a vector giving the\n actual row names, or a single number giving the column of the\n table which contains the row names, or character string\n giving the name of the table column containing the row names.\n\n If there is a header and the first row contains one fewer\n field than the number of columns, the first column in the\n input is used for the row names. Otherwise if 'row.names' is\n missing, the rows are numbered.\n\n Using 'row.names = NULL' forces row numbering. Missing or\n 'NULL' 'row.names' generate row names that are considered to\n be 'automatic' (and not preserved by 'as.matrix').\n\ncol.names: a vector of optional names for the variables. The default\n is to use '\"V\"' followed by the column number.\n\n as.is: controls conversion of character variables (insofar as they\n are not converted to logical, numeric or complex) to factors,\n if not otherwise specified by 'colClasses'. 
Its value is\n either a vector of logicals (values are recycled if\n necessary), or a vector of numeric or character indices which\n specify which columns should not be converted to factors.\n\n Note: to suppress all conversions including those of numeric\n columns, set 'colClasses = \"character\"'.\n\n Note that 'as.is' is specified per column (not per variable)\n and so includes the column of row names (if any) and any\n columns to be skipped.\n\ntryLogical: a 'logical' determining if columns consisting entirely of\n '\"F\"', '\"T\"', '\"FALSE\"', and '\"TRUE\"' should be converted to\n 'logical'; passed to 'type.convert', true by default.\n\nna.strings: a character vector of strings which are to be interpreted\n as 'NA' values. Blank fields are also considered to be\n missing values in logical, integer, numeric and complex\n fields. Note that the test happens _after_ white space is\n stripped from the input, so 'na.strings' values may need\n their own white space stripped in advance.\n\ncolClasses: character. A vector of classes to be assumed for the\n columns. If unnamed, recycled as necessary. If named, names\n are matched with unspecified values being taken to be 'NA'.\n\n Possible values are 'NA' (the default, when 'type.convert' is\n used), '\"NULL\"' (when the column is skipped), one of the\n atomic vector classes (logical, integer, numeric, complex,\n character, raw), or '\"factor\"', '\"Date\"' or '\"POSIXct\"'.\n Otherwise there needs to be an 'as' method (from package\n 'methods') for conversion from '\"character\"' to the specified\n formal class.\n\n Note that 'colClasses' is specified per column (not per\n variable) and so includes the column of row names (if any).\n\n nrows: integer: the maximum number of rows to read in. Negative and\n other invalid values are ignored.\n\n skip: integer: the number of lines of the data file to skip before\n beginning to read data.\n\ncheck.names: logical. 
If 'TRUE' then the names of the variables in the\n data frame are checked to ensure that they are syntactically\n valid variable names. If necessary they are adjusted (by\n 'make.names') so that they are, and also to ensure that there\n are no duplicates.\n\n fill: logical. If 'TRUE' then in case the rows have unequal length,\n blank fields are implicitly added. See 'Details'.\n\nstrip.white: logical. Used only when 'sep' has been specified, and\n allows the stripping of leading and trailing white space from\n unquoted 'character' fields ('numeric' fields are always\n stripped). See 'scan' for further details (including the\n exact meaning of 'white space'), remembering that the columns\n may include the row names.\n\nblank.lines.skip: logical: if 'TRUE' blank lines in the input are\n ignored.\n\ncomment.char: character: a character vector of length one containing a\n single character or an empty string. Use '\"\"' to turn off\n the interpretation of comments altogether.\n\nallowEscapes: logical. Should C-style escapes such as '\\n' be\n processed or read verbatim (the default)? Note that if not\n within quotes these could be interpreted as a delimiter (but\n not as a comment character). For more details see 'scan'.\n\n flush: logical: if 'TRUE', 'scan' will flush to the end of the line\n after reading the last of the fields requested. This allows\n putting comments after the last field.\n\nstringsAsFactors: logical: should character vectors be converted to\n factors? Note that this is overridden by 'as.is' and\n 'colClasses', both of which allow finer control.\n\nfileEncoding: character string: if non-empty declares the encoding used\n on a file (not a connection) so the character data can be\n re-encoded. See the 'Encoding' section of the help for\n 'file', the 'R Data Import/Export' manual and 'Note'.\n\nencoding: encoding to be assumed for input strings. 
It is used to mark\n character strings as known to be in Latin-1 or UTF-8 (see\n 'Encoding'): it is not used to re-encode the input, but\n allows R to handle encoded strings in their native encoding\n (if one of those two). See 'Value' and 'Note'.\n\n text: character string: if 'file' is not supplied and this is, then\n data are read from the value of 'text' via a text connection.\n Notice that a literal string can be used to include (small)\n data sets within R code.\n\n skipNul: logical: should nuls be skipped?\n\n ...: Further arguments to be passed to 'read.table'.\n\nDetails:\n\n This function is the principal means of reading tabular data into\n R.\n\n Unless 'colClasses' is specified, all columns are read as\n character columns and then converted using 'type.convert' to\n logical, integer, numeric, complex or (depending on 'as.is')\n factor as appropriate. Quotes are (by default) interpreted in all\n fields, so a column of values like '\"42\"' will result in an\n integer column.\n\n A field or line is 'blank' if it contains nothing (except\n whitespace if no separator is specified) before a comment\n character or the end of the field or line.\n\n If 'row.names' is not specified and the header line has one less\n entry than the number of columns, the first column is taken to be\n the row names. This allows data frames to be read in from the\n format in which they are printed. If 'row.names' is specified and\n does not refer to the first column, that column is discarded from\n such files.\n\n The number of data columns is determined by looking at the first\n five lines of input (or the whole input if it has less than five\n lines), or from the length of 'col.names' if it is specified and\n is longer. This could conceivably be wrong if 'fill' or\n 'blank.lines.skip' are true, so specify 'col.names' if necessary\n (as in the 'Examples').\n\n 'read.csv' and 'read.csv2' are identical to 'read.table' except\n for the defaults. 
They are intended for reading 'comma separated\n value' files ('.csv') or ('read.csv2') the variant used in\n countries that use a comma as decimal point and a semicolon as\n field separator. Similarly, 'read.delim' and 'read.delim2' are\n for reading delimited files, defaulting to the TAB character for\n the delimiter. Notice that 'header = TRUE' and 'fill = TRUE' in\n these variants, and that the comment character is disabled.\n\n The rest of the line after a comment character is skipped; quotes\n are not processed in comments. Complete comment lines are allowed\n provided 'blank.lines.skip = TRUE'; however, comment lines prior\n to the header must have the comment character in the first\n non-blank column.\n\n Quoted fields with embedded newlines are supported except after a\n comment character. Embedded nuls are unsupported: skipping them\n (with 'skipNul = TRUE') may work.\n\nValue:\n\n A data frame ('data.frame') containing a representation of the\n data in the file.\n\n Empty input is an error unless 'col.names' is specified, when a\n 0-row data frame is returned: similarly giving just a header line\n if 'header = TRUE' results in a 0-row data frame. Note that in\n either case the columns will be logical unless 'colClasses' was\n supplied.\n\n Character strings in the result (including factor levels) will\n have a declared encoding if 'encoding' is '\"latin1\"' or '\"UTF-8\"'.\n\nCSV files:\n\n See the help on 'write.csv' for the various conventions for '.csv'\n files. The commonest form of CSV file with row names needs to be\n read with 'read.csv(..., row.names = 1)' to use the names in the\n first column of the file as row names.\n\nMemory usage:\n\n These functions can use a surprising amount of memory when reading\n large files. There is extensive discussion in the 'R Data\n Import/Export' manual, supplementing the notes here.\n\n Less memory will be used if 'colClasses' is specified as one of\n the six atomic vector classes. 
This can be particularly so when\n reading a column that takes many distinct numeric values, as\n storing each distinct value as a character string can take up to\n 14 times as much memory as storing it as an integer.\n\n Using 'nrows', even as a mild over-estimate, will help memory\n usage.\n\n Using 'comment.char = \"\"' will be appreciably faster than the\n 'read.table' default.\n\n 'read.table' is not the right tool for reading large matrices,\n especially those with many columns: it is designed to read _data\n frames_ which may have columns of very different classes. Use\n 'scan' instead for matrices.\n\nNote:\n\n The columns referred to in 'as.is' and 'colClasses' include the\n column of row names (if any).\n\n There are two approaches for reading input that is not in the\n local encoding. If the input is known to be UTF-8 or Latin1, use\n the 'encoding' argument to declare that. If the input is in some\n other encoding, then it may be translated on input. The\n 'fileEncoding' argument achieves this by setting up a connection\n to do the re-encoding into the current locale. Note that on\n Windows or other systems not running in a UTF-8 locale, this may\n not be possible.\n\nReferences:\n\n Chambers, J. M. (1992) _Data for models._ Chapter 3 of\n _Statistical Models in S_ eds J. M. Chambers and T. J. 
Hastie,\n Wadsworth & Brooks/Cole.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'scan', 'type.convert', 'read.fwf' for reading _f_ixed _w_idth\n _f_ormatted input; 'write.table'; 'data.frame'.\n\n 'count.fields' can be useful to determine problems with reading\n files which result in reports of incorrect record lengths (see the\n 'Examples' below).\n\n for the IANA definition\n of CSV files (which requires comma as separator and CRLF line\n endings).\n\nExamples:\n\n ## using count.fields to handle unknown maximum number of fields\n ## when fill = TRUE\n test1 <- c(1:5, \"6,7\", \"8,9,10\")\n tf <- tempfile()\n writeLines(test1, tf)\n \n read.csv(tf, fill = TRUE) # 1 column\n ncol <- max(count.fields(tf, sep = \",\"))\n read.csv(tf, fill = TRUE, header = FALSE,\n col.names = paste0(\"V\", seq_len(ncol)))\n unlink(tf)\n \n ## \"Inline\" data set, using text=\n ## Notice that leading and trailing empty lines are auto-trimmed\n \n read.table(header = TRUE, text = \"\n a b\n 1 2\n 3 4\n \")\n```\n:::\n:::\n\n\n## Import .csv files\n\nFunction signature reminder\n```\nread.csv(file, header = TRUE, sep = \",\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\t\t-\t\t`file` is the first argument and is the path to your file, in quotes \n\t\t\n\t\t\t\t-\t\tcan be a path on your local computer -- absolute file path or relative file path \n\t\t\t\t-\t\tcan be a path to a file on a website\n\n## Mini exercise\n\nIf your R Project is not already open, open it so we can take advantage of the working directory it sets for us when importing data.\n\n\n## Import .csv files\n\nLet's import a new data file\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n\n\nNote #1, I assigned the data frame to an object called `df`. 
I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it to an object. \n\nNote #2, look at the Environment pane; you will see the `df` object ready to be used.\n\n\n## Import .txt files\n\n`read.csv()` is a special case of `read.delim()` -- a general function to read a delimited file into a data frame \n\nFunction signature reminder\n```\nread.delim(file, header = TRUE, sep = \"\\t\", quote = \"\\\"\",\n dec = \".\", fill = TRUE, comment.char = \"\", ...)\n```\n\n\t\t- `file` is the path to your file, in quotes \n\t\t- `sep` is what separates the fields within a record. The default for `read.delim()` is a tab; for `read.csv()` it is a comma\n\n## Import .txt files\n\nLet's first import 'serodata1.txt', which uses a tab delimiter, and then 'serodata2.txt', which uses a semicolon delimiter.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n## Examples\ndf <- read.delim(file = \"data/serodata1.txt\", sep = \"\\t\")\ndf <- read.delim(file = \"data/serodata2.txt\", sep = \";\")\n```\n:::\n\n\nThe dataset is now successfully read into your R workspace, **many times actually.** Notice that each time we imported the data we assigned it to the `df` object, so each reassignment replaced the previous contents of `df`. \n\n\n## What if we have a .xlsx file - what do we do?\n\n1. Google / Ask ChatGPT\n2. Find and vet function and package you want\n3. Install package\n4. Attach package\n5. Use function\n\n\n## 1. Internet Search\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ChatGPT.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/GoogleSearch.png){width=100%}\n:::\n\n::: {.cell-output-display}\n![](images/StackOverflow.png){width=100%}\n:::\n:::\n\n\n## 2. Find and vet function and package you want\n\nI am getting a consistent message to use the `read_excel()` function found in the `readxl` package. This package was developed by Hadley Wickham, who we know is reputable. 
Also, you can check that the data were read in correctly, because this is a straightforward task. \n\n## 3. Install Package\n\nTo use the bundle or \"package\" of code (and possibly data), you need to install it and also attach it.\n\nTo install a package you can \n\n1. go to Tools ---\\> Install Packages in the RStudio header\n\nOR\n\n2. use the following code:\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"package_name\")\n```\n:::\n\n\n\nTherefore,\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"readxl\")\n```\n:::\n\n\n## 4. Attach Package\n\nReminder - To attach a package (i.e., be able to use it) you can use the following code:\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(package_name)\n```\n:::\n\n\nTherefore, \n\n\n::: {.cell}\n\n```{.r .cell-code}\nrequire(readxl)\n```\n:::\n\n\n## 5. Use Function\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?read_excel\n```\n:::\n\nRead xls and xlsx files\n\nDescription:\n\n Read xls and xlsx files\n\n 'read_excel()' calls 'excel_format()' to determine if 'path' is\n xls or xlsx, based on the file extension and the file itself, in\n that order. 
Use 'read_xls()' and 'read_xlsx()' directly if you\n know better and want to prevent such guessing.\n\nUsage:\n\n read_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xls(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \n read_xlsx(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n )\n \nArguments:\n\n path: Path to the xls/xlsx file.\n\n sheet: Sheet to read. Either a string (the name of a sheet), or an\n integer (the position of the sheet). Ignored if the sheet is\n specified via 'range'. If neither argument specifies the\n sheet, defaults to the first sheet.\n\n range: A cell range to read from, as described in\n cell-specification. Includes typical Excel ranges like\n \"B3:D87\", possibly including the sheet name like\n \"Budget!B2:G14\", and more. Interpreted strictly, even if the\n range forces the inclusion of leading or trailing empty rows\n or columns. Takes precedence over 'skip', 'n_max' and\n 'sheet'.\n\ncol_names: 'TRUE' to use the first row as column names, 'FALSE' to get\n default names, or a character vector giving a name for each\n column. If user provides 'col_types' as a vector, 'col_names'\n can have one entry per column, i.e. 
have the same length as\n 'col_types', or one entry per unskipped column.\n\ncol_types: Either 'NULL' to guess all from the spreadsheet or a\n character vector containing one entry per column from these\n options: \"skip\", \"guess\", \"logical\", \"numeric\", \"date\",\n \"text\" or \"list\". If exactly one 'col_type' is specified, it\n will be recycled. The content of a cell in a skipped column\n is never read and that column will not appear in the data\n frame output. A list cell loads a column as a list of length\n 1 vectors, which are typed using the type guessing logic from\n 'col_types = NULL', but on a cell-by-cell basis.\n\n na: Character vector of strings to interpret as missing values.\n By default, readxl treats blank cells as missing data.\n\n trim_ws: Should leading and trailing whitespace be trimmed?\n\n skip: Minimum number of rows to skip before reading anything, be it\n column names or data. Leading empty rows are automatically\n skipped, so this is a lower bound. Ignored if 'range' is\n given.\n\n n_max: Maximum number of data rows to read. Trailing empty rows are\n automatically skipped, so this is an upper bound on the\n number of rows in the returned tibble. Ignored if 'range' is\n given.\n\nguess_max: Maximum number of data rows to use for guessing column\n types.\n\nprogress: Display a progress spinner? By default, the spinner appears\n only in an interactive session, outside the context of\n knitting a document, and when the call is likely to run for\n several seconds or more. See 'readxl_progress()' for more\n details.\n\n.name_repair: Handling of column names. Passed along to\n 'tibble::as_tibble()'. 
readxl's default is `.name_repair =\n \"unique\", which ensures column names are not empty and are\n unique.\n\nValue:\n\n A tibble\n\nSee Also:\n\n cell-specification for more details on targetting cells with the\n 'range' argument\n\nExamples:\n\n datasets <- readxl_example(\"datasets.xlsx\")\n read_excel(datasets)\n \n # Specify sheet either by position or by name\n read_excel(datasets, 2)\n read_excel(datasets, \"mtcars\")\n \n # Skip rows and use default column names\n read_excel(datasets, skip = 148, col_names = FALSE)\n \n # Recycle a single column type\n read_excel(datasets, col_types = \"text\")\n \n # Specify some col_types and guess others\n read_excel(datasets, col_types = c(\"text\", \"guess\", \"numeric\", \"guess\", \"guess\"))\n \n # Accomodate a column with disparate types via col_type = \"list\"\n df <- read_excel(readxl_example(\"clippy.xlsx\"), col_types = c(\"text\", \"list\"))\n df\n df$value\n sapply(df$value, class)\n \n # Limit the number of data rows read\n read_excel(datasets, n_max = 3)\n \n # Read from an Excel range using A1 or R1C1 notation\n read_excel(datasets, range = \"C1:E7\")\n read_excel(datasets, range = \"R1C2:R2C5\")\n \n # Specify the sheet as part of the range\n read_excel(datasets, range = \"mtcars!B1:D5\")\n \n # Read only specific rows or columns\n read_excel(datasets, range = cell_rows(102:151), col_names = FALSE)\n read_excel(datasets, range = cell_cols(\"B:D\"))\n \n # Get a preview of column names\n names(read_excel(readxl_example(\"datasets.xlsx\"), n_max = 0))\n \n # exploit full .name_repair flexibility from tibble\n \n # \"universal\" names are unique and syntactic\n read_excel(\n readxl_example(\"deaths.xlsx\"),\n range = \"arts!A5:F15\",\n .name_repair = \"universal\"\n )\n \n # specify name repair as a built-in function\n read_excel(readxl_example(\"clippy.xlsx\"), .name_repair = toupper)\n \n # specify name repair as a custom function\n my_custom_name_repair <- function(nms) tolower(gsub(\"[.]\", \"_\", 
nms))\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n .name_repair = my_custom_name_repair\n )\n \n # specify name repair as an anonymous function\n read_excel(\n readxl_example(\"datasets.xlsx\"),\n sheet = \"chickwts\",\n .name_repair = ~ substr(.x, start = 1, stop = 3)\n )\n\n\n## 5. Use Function\n\nReminder of function signature\n```\nread_excel(\n path,\n sheet = NULL,\n range = NULL,\n col_names = TRUE,\n col_types = NULL,\n na = \"\",\n trim_ws = TRUE,\n skip = 0,\n n_max = Inf,\n guess_max = min(1000, n_max),\n progress = readxl_progress(),\n .name_repair = \"unique\"\n)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")\n```\n:::\n\n\n\n## Let's make some mistakes\n\n1. What if we read in the data without assigning it to an object (i.e., `read_excel(path = \"data/serodata.xlsx\", sheet = \"Data\")`)?\n\n2. What if we forget to specify the sheet argument (i.e., `dd <- read_excel(path = \"data/serodata.xlsx\")`)?\n\n\n## Installing and calling packages - Common confusion\n\n
\n\nYou only need to install a package once (unless you update R or want to update the package), but you will need to call or load a package each time you want to use it. \n\n
 \n\nThe exception to this rule is the \"base\" set of packages (i.e., **Base R**), which are installed automatically when you install R and are attached automatically whenever you open R or RStudio.\n\n\n## Common Error\n\nBe prepared to see this error\n\n\n::: {.cell}\n\n```{.r .cell-code}\nError: could not find function \"some_function_name\"\n```\n:::\n\n\nThis usually means that either \n\n- you called the function by the wrong name \n- you have not installed a package that contains the function\n- you have installed a package but you forgot to attach it (i.e., `require(package_name)`) -- **most likely**\n\n\n## Export (write) Data \n\n- Exporting or 'Writing out' data allows you to save modified files for future use or sharing\n- R can write almost any file format, especially with external, non-Base R, packages\n- We are going to focus again on writing delimited files\n\n\n## Export delimited data\n\nWithin the Base R 'utils' package we can find a handful of useful functions, including `write.csv()` and `write.table()`, for exporting data.\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\nData Output\n\nDescription:\n\n 'write.table' prints its required argument 'x' (after converting\n it to a data frame if it is not one nor a matrix) to a file or\n connection.\n\nUsage:\n\n write.table(x, file = \"\", append = FALSE, quote = TRUE, sep = \" \",\n eol = \"\\n\", na = \"NA\", dec = \".\", row.names = TRUE,\n col.names = TRUE, qmethod = c(\"escape\", \"double\"),\n fileEncoding = \"\")\n \n write.csv(...)\n write.csv2(...)\n \nArguments:\n\n x: the object to be written, preferably a matrix or data frame.\n If not, it is attempted to coerce 'x' to a data frame.\n\n file: either a character string naming a file or a connection open\n for writing. '\"\"' indicates output to the console.\n\n append: logical. Only relevant if 'file' is a character string. If\n 'TRUE', the output is appended to the file. 
If 'FALSE', any\n existing file of the name is destroyed.\n\n quote: a logical value ('TRUE' or 'FALSE') or a numeric vector. If\n 'TRUE', any character or factor columns will be surrounded by\n double quotes. If a numeric vector, its elements are taken\n as the indices of columns to quote. In both cases, row and\n column names are quoted if they are written. If 'FALSE',\n nothing is quoted.\n\n sep: the field separator string. Values within each row of 'x'\n are separated by this string.\n\n eol: the character(s) to print at the end of each line (row). For\n example, 'eol = \"\\r\\n\"' will produce Windows' line endings on\n a Unix-alike OS, and 'eol = \"\\r\"' will produce files as\n expected by Excel:mac 2004.\n\n na: the string to use for missing values in the data.\n\n dec: the string to use for decimal points in numeric or complex\n columns: must be a single character.\n\nrow.names: either a logical value indicating whether the row names of\n 'x' are to be written along with 'x', or a character vector\n of row names to be written.\n\ncol.names: either a logical value indicating whether the column names\n of 'x' are to be written along with 'x', or a character\n vector of column names to be written. See the section on\n 'CSV files' for the meaning of 'col.names = NA'.\n\n qmethod: a character string specifying how to deal with embedded\n double quote characters when quoting strings. Must be one of\n '\"escape\"' (default for 'write.table'), in which case the\n quote character is escaped in C style by a backslash, or\n '\"double\"' (default for 'write.csv' and 'write.csv2'), in\n which case it is doubled. You can specify just the initial\n letter.\n\nfileEncoding: character string: if non-empty declares the encoding to\n be used on a file (not a connection) so the character data\n can be re-encoded as they are written. 
See 'file'.\n\n ...: arguments to 'write.table': 'append', 'col.names', 'sep',\n 'dec' and 'qmethod' cannot be altered.\n\nDetails:\n\n If the table has no columns the rownames will be written only if\n 'row.names = TRUE', and _vice versa_.\n\n Real and complex numbers are written to the maximal possible\n precision.\n\n If a data frame has matrix-like columns these will be converted to\n multiple columns in the result (_via_ 'as.matrix') and so a\n character 'col.names' or a numeric 'quote' should refer to the\n columns in the result, not the input. Such matrix-like columns\n are unquoted by default.\n\n Any columns in a data frame which are lists or have a class (e.g.,\n dates) will be converted by the appropriate 'as.character' method:\n such columns are unquoted by default. On the other hand, any\n class information for a matrix is discarded and non-atomic (e.g.,\n list) matrices are coerced to character.\n\n Only columns which have been converted to character will be quoted\n if specified by 'quote'.\n\n The 'dec' argument only applies to columns that are not subject to\n conversion to character because they have a class or are part of a\n matrix-like column (or matrix), in particular to columns protected\n by 'I()'. Use 'options(\"OutDec\")' to control such conversions.\n\n In almost all cases the conversion of numeric quantities is\n governed by the option '\"scipen\"' (see 'options'), but with the\n internal equivalent of 'digits = 15'. For finer control, use\n 'format' to make a character matrix/data frame, and call\n 'write.table' on that.\n\n These functions check for a user interrupt every 1000 lines of\n output.\n\n If 'file' is a non-open connection, an attempt is made to open it\n and then close it after use.\n\n To write a Unix-style file on Windows, use a binary connection\n e.g. 'file = file(\"filename\", \"wb\")'.\n\nCSV files:\n\n By default there is no column name for a column of row names. 
If\n 'col.names = NA' and 'row.names = TRUE' a blank column name is\n added, which is the convention used for CSV files to be read by\n spreadsheets. Note that such CSV files can be read in R by\n\n read.csv(file = \"\", row.names = 1)\n \n 'write.csv' and 'write.csv2' provide convenience wrappers for\n writing CSV files. They set 'sep' and 'dec' (see below), 'qmethod\n = \"double\"', and 'col.names' to 'NA' if 'row.names = TRUE' (the\n default) and to 'TRUE' otherwise.\n\n 'write.csv' uses '\".\"' for the decimal point and a comma for the\n separator.\n\n 'write.csv2' uses a comma for the decimal point and a semicolon\n for the separator, the Excel convention for CSV files in some\n Western European locales.\n\n These wrappers are deliberately inflexible: they are designed to\n ensure that the correct conventions are used to write a valid\n file. Attempts to change 'append', 'col.names', 'sep', 'dec' or\n 'qmethod' are ignored, with a warning.\n\n CSV files do not record an encoding, and this causes problems if\n they are not ASCII for many other applications. Windows Excel\n 2007/10 will open files (e.g., by the file association mechanism)\n correctly if they are ASCII or UTF-16 (use 'fileEncoding =\n \"UTF-16LE\"') or perhaps in the current Windows codepage (e.g.,\n '\"CP1252\"'), but the 'Text Import Wizard' (from the 'Data' tab)\n allows far more choice of encodings. Excel:mac 2004/8 can\n _import_ only 'Macintosh' (which seems to mean Mac Roman),\n 'Windows' (perhaps Latin-1) and 'PC-8' files. 
OpenOffice 3.x asks\n for the character set when opening the file.\n\n There is an IETF RFC4180\n () for CSV files, which\n mandates comma as the separator and CRLF line endings.\n 'write.csv' writes compliant files on Windows: use 'eol = \"\\r\\n\"'\n on other platforms.\n\nNote:\n\n 'write.table' can be slow for data frames with large numbers\n (hundreds or more) of columns: this is inevitable as each column\n could be of a different class and so must be handled separately.\n If they are all of the same class, consider using a matrix\n instead.\n\nSee Also:\n\n The 'R Data Import/Export' manual.\n\n 'read.table', 'write'.\n\n 'write.matrix' in package 'MASS'.\n\nExamples:\n\n x <- data.frame(a = I(\"a \\\" quote\"), b = pi)\n tf <- tempfile(fileext = \".csv\")\n \n ## To write a CSV file for input to Excel one might use\n write.table(x, file = tf, sep = \",\", col.names = NA,\n qmethod = \"double\")\n file.show(tf)\n ## and to read this file back into R one needs\n read.table(tf, header = TRUE, sep = \",\", row.names = 1)\n ## NB: you do need to specify a separator if qmethod = \"double\".\n \n ### Alternatively\n write.csv(x, file = tf)\n read.csv(tf, row.names = 1)\n ## or without row names\n write.csv(x, file = tf, row.names = FALSE)\n read.csv(tf)\n \n ## Not run:\n \n ## To write a file in Mac Roman for simple use in Mac Excel 2004/8\n write.csv(x, file = \"foo.csv\", fileEncoding = \"macroman\")\n ## or for Windows Excel 2007/10\n write.csv(x, file = \"foo.csv\", fileEncoding = \"UTF-16LE\")\n ## End(Not run)\n```\n:::\n:::\n\n\n## Export delimited data\n\nLet's practice exporting the data as three files with three different delimiters (comma, tab, semicolon)\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwrite.csv(df, file=\"data/serodata_new.csv\", row.names = FALSE) #comma delimited\nwrite.table(df, file=\"data/serodata1_new.txt\", sep=\"\\t\", row.names = FALSE) #tab delimited\nwrite.table(df, file=\"data/serodata2_new.txt\", sep=\";\", row.names = FALSE) 
#semicolon delimited\n```\n:::\n\n\nNote, I wrote the data to new file names. Even though we didn't change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.\n\n## R .rds and .rda/RData files\n\nThere are two file extensions worth discussing.\n\nR has two native data formats: 'Rdata' (sometimes shortened to 'Rda') and 'Rds'. These formats are used when R objects are saved for later use. 'Rdata' is used to save multiple R objects, while 'Rds' is used to save a single R object. 'Rds' files are fast to write and read and are compact on disk.\n\n## .rds binary file\n\nSaving datasets in `.rds` format can save time if you have to read them back in later.\n\n`write_rds()` and `read_rds()` from the `readr` package can be used to write/read a single R object to/from a file.\n\n```\nlibrary(readr)\nwrite_rds(object1, file = \"filename.rds\")\nobject1 <- read_rds(file = \"filename.rds\")\n```\n\n\n## .rda/RData files \n\nThe Base R functions `save()` and `load()` can be used to save and load multiple R objects. \n\n`save()` writes an external representation of R objects to the specified file, which can be loaded back into the environment using `load()`. A nice feature of using `save()` and `load()` is that the R objects are restored directly into the environment under their original names, so you don't have to assign them to a name yourself. 
The files can be saved as `.RData` or `.Rda` files.\n\nFunction signature\n```\nsave(object1, object2, file = \"filename.RData\")\nload(\"filename.RData\")\n```\n\nNote that you separate the objects you want to save with commas.\n\n\n\n## Summary\n\n- Importing or 'Reading in' data is the first step of any real project / data analysis\n- The Base R 'utils' package has useful functions including `read.csv()` and `read.delim()` for importing/reading data and `write.csv()` and `write.table()` for exporting/writing data\n- When importing data (the exception is objects loaded from .RData), you must assign it to an object; otherwise it cannot be used\n- If data are imported correctly, they can be found in the Environment pane of RStudio\n- You only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it. \n- To complete a task you don't know how to do (e.g., reading in an Excel data file) use the following steps: 1. Google / Ask ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Attach package, 5. 
Use function\n\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module06-DataSubset/execute-results/html.json b/_freeze/modules/Module06-DataSubset/execute-results/html.json index 9376191..f443d53 100644 --- a/_freeze/modules/Module06-DataSubset/execute-results/html.json +++ b/_freeze/modules/Module06-DataSubset/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "a55663183334bb6cd6f8411f5a7fd0e8", + "hash": "cb8299ad6bc8167b765b1cfd90875b0a", "result": { - "markdown": "---\ntitle: \"Module 6: Get to Know Your Data and Subsetting\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 6, you should be able to...\n\n- Use basic functions to get to know you data\n- Use three indexing approaches\n- Rely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Describe what logical operators are and how to use them\n- Use on the `subset()` function to subset data\n\n\n## Getting to know our data\n\nThe `dim()`, `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. 
\n\nLet's first read in the data from the previous module.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(df) # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 651 5\n```\n:::\n\n```{.r .cell-code}\nnrow(df) # number of rows\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 651\n```\n:::\n\n```{.r .cell-code}\nncol(df) # number of columns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Quick summary of data\n\nThe `colnames()`, `str()` and `summary()`functions from Base R are great functions to assess the data type and some summary statistics. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\nstr(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n```\n:::\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender \n Min. :5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. 
:15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n```\n:::\n:::\n\n\nNote that if you have a very large dataset with 15+ variables, `summary()` is not very efficient. \n\n## Description of data\n\nThese data come from a simulated pathogen X IgG antibody serological survey. Each row represents an individual. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for lectures throughout the Workshop.\n\n## View the data as a whole dataframe\n\nThe `View()` function, one of the few Base R functions with a capital letter, opens the data in a new tab so that you can view it as you would in Excel.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(df)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ViewTab.png){width=100%}\n:::\n:::\n\n\n## View the data as a whole dataframe\n\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment window.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/View.png){width=90%}\n:::\n:::\n\n\n## Indexing\n\nR contains several constructs which allow access to individual elements or subsets through indexing operations. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing syntaxes: `[ ]`, `[[ ]]`, and `$`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list\n```\n:::\n\n\n## Vectors and multi-dimensional objects\n\nTo index a vector, `vector[i]` selects the ith element. To index a multi-dimensional object such as a matrix, `matrix[i, j]` selects the element in row i and column j, whereas in a three-dimensional array, `array[i, j, k]` selects the element in row i and column j of matrix k. 
\n\nLet's practice by first creating the same objects as we did in Module 1.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\nHere is a reminder of what these objects look like.\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nFinally, let's use indexing to pull out elements of the objects. \n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1[2] #pulling the second element\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nmatrix.object[1,2] #pulling the element in row 1 column 2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n:::\n\n\n\n## List objects\n\nFor lists, one generally uses `list[[p]]` to select any single element p.\n\nLet's practice by creating the same list as we did in Module 1.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[[3]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\n## $ for indexing\n\n`$` allows only a literal character string or a symbol as the index.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 
6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 
1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 
1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 
8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 
1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n```\n:::\n:::\n\n\nNote, if you have spaces in your variable name, you will need to use back ticks `variable name` after the `$`. 
This is a good reason not to create variable / column names with spaces.\n\n## $ for indexing with lists\n\nList elements can be named.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nIf list elements are named, then you can reference data from the list using `$` or double square brackets, `[[ ]]`.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named$uga \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n:::\n\n```{.r .cell-code}\nlist.object.named[[\"uga\"]] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n:::\n:::\n\n\n\n## Using indexing to rename columns\n\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df) # just prints\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\ncolnames(df)[1:2] <- c(\"IgG_concentration_mIU/mL\", \"age_year\") # reassigns\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"IgG_concentration_mIU/mL\" \"age_year\" \n[3] \"age\" \"gender\" \n[5] \"slum\" \n```\n:::\n\n```{.r .cell-code}\ncolnames(df)[1:2] <- c(\"IgG_concentration\", \"age\") #reset\n```\n:::\n\n\n## Using indexing to subset by columns\n\nWe can also subset data frames and matrices (2-dimensional objects) using brackets, `[ row , column ]`. We can subset by columns and pull the `x` column using the index of the column or the column name. 
\n\nFor example, here I am pulling the 3rd column, which has the variable name `age`.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[ , \"age\"] #same as df[ , 3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 
4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 
3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 
3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 
1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n```\n:::\n:::\n\nWe can select multiple columns using multiple column names:\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n age gender\n1 3.176895e-01 Female\n2 3.436823e+00 Female\n3 3.000000e-01 Male\n4 1.432363e+02 Male\n5 4.476534e-01 Male\n6 2.527076e-02 Male\n7 6.101083e-01 Female\n8 3.000000e-01 Female\n9 2.916968e+00 Male\n10 1.649819e+00 Male\n11 4.574007e+00 Male\n12 
1.583904e+02 Female\n13 NA Male\n14 1.065068e+02 Male\n15 1.113870e+02 Male\n16 4.144893e+01 Male\n17 3.000000e-01 Male\n18 2.527076e-01 Female\n19 8.159247e+01 Female\n20 1.825342e+02 Male\n21 4.244656e+01 Male\n22 1.193493e+02 Female\n23 3.000000e-01 Male\n24 3.000000e-01 Female\n25 9.025271e-01 Female\n26 3.501805e-01 Male\n27 3.000000e-01 Male\n28 1.227437e+00 Female\n29 1.702055e+02 Female\n30 3.000000e-01 Female\n31 4.801444e-01 Male\n32 2.527076e-02 Male\n33 3.000000e-01 Female\n34 5.776173e-02 Male\n35 4.801444e-01 Female\n36 3.826715e-01 Female\n37 3.000000e-01 Male\n38 4.048558e+02 Male\n39 3.000000e-01 Male\n40 5.451264e-01 Male\n41 3.000000e-01 Female\n42 5.590753e+01 Male\n43 2.202166e-01 Female\n44 1.709760e+02 Male\n45 1.227437e+00 Male\n46 4.567527e+02 Male\n47 4.838480e+01 Male\n48 1.227437e-01 Female\n49 1.877256e-01 Female\n50 3.000000e-01 Female\n51 3.501805e-01 Male\n52 3.339350e+00 Male\n53 3.000000e-01 Female\n54 5.451264e-01 Female\n55 NA Male\n56 2.104693e+00 Male\n57 NA Male\n58 3.826715e-01 Female\n59 3.926366e+01 Female\n60 1.129964e+00 Male\n61 3.501805e+00 Female\n62 7.542808e+01 Female\n63 4.800475e+01 Female\n64 1.000000e+00 Male\n65 4.068884e+01 Male\n66 3.000000e-01 Female\n67 4.377672e+01 Female\n68 1.193493e+02 Male\n69 6.977740e+01 Male\n70 1.373288e+02 Female\n71 1.642979e+02 Male\n72 NA Female\n73 1.542808e+02 Male\n74 6.033058e-01 Male\n75 2.809917e-01 Male\n76 1.966942e+00 Male\n77 2.041322e+00 Male\n78 2.115702e+00 Female\n79 4.663043e+02 Male\n80 3.000000e-01 Male\n81 1.500796e+02 Male\n82 1.543790e+02 Female\n83 2.561983e-01 Female\n84 1.596338e+02 Male\n85 1.732484e+02 Female\n86 4.641304e+02 Female\n87 3.736364e+01 Male\n88 1.572452e+02 Female\n89 3.000000e-01 Male\n90 3.000000e-01 Male\n91 8.264463e-02 Male\n92 6.776859e-01 Female\n93 7.272727e-01 Male\n94 2.066116e-01 Female\n95 1.966942e+00 Male\n96 3.000000e-01 Male\n97 3.000000e-01 Male\n98 2.809917e-01 Female\n99 8.016529e-01 Female\n100 1.818182e-01 Female\n101 
1.818182e-01 Male\n102 8.264463e-02 Female\n103 3.422727e+01 Female\n104 8.743506e+00 Male\n105 3.000000e-01 Male\n106 1.641720e+02 Female\n107 4.049587e-01 Male\n108 1.001592e+02 Male\n109 4.489130e+02 Female\n110 1.101911e+02 Female\n111 4.440909e+01 Male\n112 1.288217e+02 Female\n113 2.840909e+01 Male\n114 1.003981e+02 Female\n115 8.512397e-01 Female\n116 1.322314e-01 Male\n117 1.297521e+00 Female\n118 1.570248e-01 Male\n119 1.966942e+00 Female\n120 1.536624e+02 Male\n121 3.000000e-01 Female\n122 3.000000e-01 Female\n123 1.074380e+00 Male\n124 1.099174e+00 Female\n125 3.057851e-01 Female\n126 3.000000e-01 Female\n127 5.785124e-02 Female\n128 4.391304e+02 Female\n129 6.130435e+02 Female\n130 1.074380e-01 Male\n131 7.125796e+01 Male\n132 4.222727e+01 Male\n133 1.620223e+02 Female\n134 3.750000e+01 Female\n135 1.534236e+02 Female\n136 6.239130e+02 Female\n137 5.521739e+02 Male\n138 5.785124e-02 Female\n139 6.547945e-01 Female\n140 8.767123e-02 Female\n141 3.000000e-01 Male\n142 2.849315e+00 Female\n143 3.835616e-02 Male\n144 2.849315e-01 Male\n145 4.649315e+00 Male\n146 1.369863e-01 Female\n147 3.589041e-01 Male\n148 1.049315e+00 Male\n149 4.668998e+01 Female\n150 1.473510e+02 Female\n151 4.589744e+01 Male\n152 2.109589e-01 Male\n153 1.741722e+02 Female\n154 2.496503e+01 Female\n155 1.850993e+02 Male\n156 1.863014e-01 Male\n157 1.863014e-01 Male\n158 4.589744e+01 Female\n159 1.942881e+02 Female\n160 5.079646e+02 Female\n161 8.767123e-01 Male\n162 2.750685e+00 Male\n163 1.503311e+02 Female\n164 3.000000e-01 Male\n165 3.095890e-01 Male\n166 3.000000e-01 Male\n167 6.371681e+02 Female\n168 6.054795e-01 Female\n169 1.955298e+02 Female\n170 1.786424e+02 Male\n171 1.120861e+02 Female\n172 1.331954e+02 Male\n173 2.159292e+02 Male\n174 5.628319e+02 Male\n175 1.900662e+02 Female\n176 6.547945e-01 Male\n177 1.665753e+00 Male\n178 1.739238e+02 Male\n179 9.991722e+01 Male\n180 9.321192e+01 Male\n181 8.767123e-02 Female\n182 NA Male\n183 6.794521e-01 Female\n184 5.808219e-01 
Male\n185 1.369863e-01 Female\n186 2.060274e+00 Female\n187 1.610099e+02 Male\n188 4.082192e-01 Female\n189 8.273973e-01 Male\n190 4.601770e+02 Female\n191 1.389073e+02 Female\n192 3.867133e+01 Female\n193 9.260274e-01 Female\n194 5.918874e+01 Female\n195 1.870861e+02 Female\n196 4.328767e-01 Male\n197 6.301370e-02 Male\n198 3.000000e-01 Female\n199 1.548013e+02 Male\n200 5.819536e+01 Female\n201 1.724338e+02 Female\n202 1.932401e+01 Female\n203 2.164420e+00 Female\n204 9.757412e-01 Female\n205 1.509434e-01 Male\n206 1.509434e-01 Female\n207 7.766571e+01 Male\n208 4.319563e+01 Female\n209 1.752022e-01 Male\n210 3.094775e+01 Female\n211 1.266846e-01 Male\n212 2.919806e+01 Male\n213 9.545455e+00 Female\n214 2.735115e+01 Female\n215 1.314841e+02 Female\n216 3.643985e+01 Male\n217 1.498559e+02 Female\n218 9.363636e+00 Female\n219 2.479784e-01 Male\n220 5.390836e-02 Female\n221 8.787062e-01 Female\n222 1.994609e-01 Male\n223 3.000000e-01 Female\n224 3.000000e-01 Male\n225 5.390836e-03 Female\n226 4.177898e-01 Female\n227 3.000000e-01 Female\n228 2.479784e-01 Male\n229 2.964960e-02 Male\n230 2.964960e-01 Male\n231 5.148248e+00 Female\n232 1.994609e-01 Male\n233 3.000000e-01 Male\n234 1.779539e+02 Male\n235 3.290210e+02 Female\n236 3.000000e-01 Male\n237 1.809798e+02 Female\n238 4.905660e-01 Male\n239 1.266846e-01 Male\n240 1.543948e+02 Female\n241 1.379683e+02 Female\n242 6.153846e+02 Male\n243 1.474784e+02 Male\n244 3.000000e-01 Female\n245 1.024259e+00 Male\n246 4.444056e+02 Female\n247 3.000000e-01 Male\n248 2.504043e+00 Female\n249 3.000000e-01 Female\n250 3.000000e-01 Female\n251 7.816712e-02 Female\n252 3.000000e-01 Female\n253 5.390836e-02 Male\n254 1.494236e+02 Female\n255 5.972622e+01 Male\n256 6.361186e-01 Female\n257 1.837896e+02 Female\n258 1.320809e+02 Female\n259 1.571906e-01 Male\n260 1.520231e+02 Male\n261 3.000000e-01 Female\n262 3.000000e-01 Female\n263 1.823699e+02 Male\n264 3.000000e-01 Male\n265 2.173913e+00 Male\n266 2.142202e+01 Male\n267 
3.000000e-01 Female\n268 3.408027e+00 Male\n269 4.155963e+01 Male\n270 9.698997e-02 Male\n271 1.238532e+01 Female\n272 9.528926e+00 Male\n273 1.916185e+02 Female\n274 1.060201e+00 Male\n275 3.679104e+02 Female\n276 4.288991e+01 Male\n277 9.971098e+01 Male\n278 3.000000e-01 Male\n279 1.208092e+02 Male\n280 3.000000e-01 Male\n281 6.688963e-03 Female\n282 2.505017e+00 Female\n283 1.481605e+00 Male\n284 3.000000e-01 Female\n285 5.183946e-01 Female\n286 3.000000e-01 Female\n287 1.872910e-01 Male\n288 3.678930e-01 Female\n289 3.000000e-01 Male\n290 4.529851e+02 Female\n291 3.169725e+01 Female\n292 3.000000e-01 Male\n293 4.922018e+01 Male\n294 2.548507e+02 Male\n295 1.661850e+02 Male\n296 9.164179e+02 Male\n297 3.678930e-01 Female\n298 1.236994e+02 Male\n299 6.705202e+01 Male\n300 3.834862e+01 Male\n301 1.963211e+00 Female\n302 3.000000e-01 Male\n303 2.474916e-01 Male\n304 3.000000e-01 Female\n305 2.173913e-01 Male\n306 8.193980e-01 Male\n307 2.444816e+00 Female\n308 3.000000e-01 Male\n309 1.571906e-01 Female\n310 1.849711e+02 Male\n311 6.119403e+02 Female\n312 3.000000e-01 Female\n313 4.280936e-01 Female\n314 9.698997e-02 Male\n315 3.678930e-02 Female\n316 4.832090e+02 Male\n317 1.390173e+02 Female\n318 3.000000e-01 Male\n319 6.555970e+02 Female\n320 1.526012e+02 Female\n321 3.000000e-01 Female\n322 7.222222e-01 Male\n323 7.724426e+01 Male\n324 3.000000e-01 Male\n325 6.111111e-01 Female\n326 1.555556e+00 Female\n327 3.055556e-01 Male\n328 1.500000e+00 Male\n329 1.470772e+02 Male\n330 1.694444e+00 Female\n331 3.138298e+02 Female\n332 1.414405e+02 Female\n333 1.990605e+02 Female\n334 4.212766e+02 Male\n335 3.000000e-01 Male\n336 3.000000e-01 Male\n337 6.478723e+02 Male\n338 3.000000e-01 Male\n339 2.222222e+00 Female\n340 3.000000e-01 Male\n341 2.055556e+00 Male\n342 2.777778e-02 Female\n343 8.333333e-02 Male\n344 1.032359e+02 Female\n345 1.611111e+00 Female\n346 8.333333e-02 Female\n347 2.333333e+00 Female\n348 5.755319e+02 Male\n349 1.686848e+02 Female\n350 1.111111e-01 
Male\n351 3.000000e-01 Male\n352 8.372340e+02 Female\n353 3.000000e-01 Male\n354 3.784504e+01 Male\n355 3.819149e+02 Male\n356 5.555556e-02 Female\n357 3.000000e+02 Female\n358 1.855950e+02 Male\n359 1.944444e-01 Female\n360 3.000000e-01 Male\n361 5.555556e-02 Female\n362 1.138889e+00 Male\n363 4.254237e+01 Female\n364 3.000000e-01 Male\n365 3.000000e-01 Male\n366 3.000000e-01 Female\n367 3.000000e-01 Female\n368 3.138298e+02 Female\n369 1.235908e+02 Male\n370 4.159574e+02 Male\n371 3.009685e+01 Female\n372 1.567850e+02 Female\n373 1.367432e+02 Female\n374 3.731235e+01 Female\n375 9.164927e+01 Male\n376 2.936170e+02 Female\n377 8.820459e+01 Female\n378 1.035491e+02 Male\n379 7.379958e+01 Female\n380 3.000000e-01 Male\n381 1.718750e+02 Male\n382 2.128527e+00 Male\n383 1.253918e+00 Female\n384 2.382445e-01 Male\n385 4.639498e-01 Female\n386 1.253918e-01 Male\n387 1.253918e-01 Male\n388 3.000000e-01 Female\n389 1.000000e+00 Male\n390 1.570043e+02 Male\n391 4.344086e+02 Female\n392 2.184953e+00 Male\n393 1.507837e+00 Female\n394 3.228840e-01 Female\n395 4.588024e+01 Male\n396 1.660560e+02 Male\n397 3.000000e-01 Male\n398 3.043011e+02 Male\n399 2.612903e+02 Female\n400 1.621767e+02 Male\n401 3.228840e-01 Male\n402 4.639498e-01 Female\n403 2.495298e+00 Female\n404 3.257053e+00 Female\n405 3.793103e-01 Female\n406 NA Male\n407 6.896552e-02 Female\n408 3.000000e-01 Male\n409 1.423197e+00 Female\n410 3.000000e-01 Female\n411 3.000000e-01 Female\n412 1.786638e+02 Male\n413 3.279570e+02 Male\n414 NA Female\n415 1.903017e+02 Male\n416 1.654095e+02 Female\n417 4.639498e-01 Female\n418 1.815733e+02 Male\n419 1.366771e+00 Male\n420 1.536050e-01 Female\n421 1.306587e+01 Male\n422 2.129032e+02 Female\n423 1.925647e+02 Male\n424 3.000000e-01 Female\n425 1.028213e+00 Female\n426 3.793103e-01 Female\n427 8.025078e-01 Female\n428 4.860215e+02 Female\n429 3.000000e-01 Female\n430 2.100313e-01 Male\n431 2.767665e+01 Female\n432 1.592476e+00 Male\n433 9.717868e-02 Female\n434 1.028213e+00 
Female\n435 3.793103e-01 Male\n436 1.292026e+02 Male\n437 4.425150e+01 Female\n438 3.193548e+02 Female\n439 1.860991e+02 Female\n440 6.614420e-01 Female\n441 5.203762e-01 Male\n442 1.330819e+02 Male\n443 1.673491e+02 Female\n444 3.000000e-01 Male\n445 1.117457e+02 Male\n446 3.045509e+01 Female\n447 3.000000e-01 Male\n448 8.280255e-02 Female\n449 3.000000e-01 Female\n450 1.200637e+00 Female\n451 1.687898e-01 Male\n452 7.367273e+02 Female\n453 8.280255e-02 Male\n454 5.127389e-01 Male\n455 1.974522e-01 Male\n456 7.993631e-01 Female\n457 3.000000e-01 Male\n458 3.298182e+02 Male\n459 9.736842e+01 Female\n460 3.000000e-01 Female\n461 3.000000e-01 Female\n462 4.214545e+02 Female\n463 3.000000e-01 Male\n464 2.578182e+02 Female\n465 2.261147e-01 Male\n466 3.000000e-01 Female\n467 1.883901e+02 Male\n468 9.458204e+01 Female\n469 3.000000e-01 Female\n470 3.000000e-01 Male\n471 7.707006e-01 Female\n472 5.032727e+02 Male\n473 1.544586e+00 Female\n474 1.431115e+02 Female\n475 3.000000e-01 Male\n476 1.458599e+00 Male\n477 1.247678e+02 Female\n478 NA Female\n479 4.334545e+02 Male\n480 3.000000e-01 Female\n481 6.156364e+02 Female\n482 9.574303e+01 Male\n483 1.928019e+02 Male\n484 1.888545e+02 Male\n485 1.598297e+02 Female\n486 5.127389e-01 Male\n487 1.171053e+02 Female\n488 NA Male\n489 2.547771e-02 Female\n490 1.707430e+02 Female\n491 3.000000e-01 Male\n492 1.869969e+02 Male\n493 4.731481e+01 Male\n494 1.988390e+02 Female\n495 3.000000e-01 Male\n496 8.808050e+01 Male\n497 2.003185e+00 Female\n498 3.000000e-01 Male\n499 3.509259e+01 Female\n500 9.365325e+01 Female\n501 3.000000e-01 Male\n502 3.736111e+01 Female\n503 1.674923e+02 Female\n504 8.808050e+01 Male\n505 1.656347e+02 Female\n506 3.722222e+01 Female\n507 6.756364e+02 Female\n508 3.000000e-01 Male\n509 1.698142e+02 Male\n510 1.628483e+02 Female\n511 5.985130e-01 Male\n512 1.903346e+00 Female\n513 3.000000e-01 Male\n514 3.000000e-01 Male\n515 8.996283e-01 Male\n516 3.977695e-01 Female\n517 3.000000e-01 Male\n518 3.000000e-01 
Male\n519 3.000000e-01 Male\n520 3.000000e-01 Female\n521 7.446809e+02 Male\n522 6.095745e+02 Female\n523 1.427445e+02 Male\n524 3.000000e-01 Female\n525 2.973978e-02 Male\n526 3.977695e-01 Female\n527 4.095745e+02 Female\n528 4.595745e+02 Male\n529 3.000000e-01 Female\n530 1.976341e+02 Female\n531 3.776596e+02 Female\n532 1.777603e+02 Female\n533 4.312268e-01 Male\n534 6.765957e+02 Female\n535 7.978723e+02 Male\n536 9.665427e-02 Male\n537 1.879338e+02 Male\n538 4.358670e+01 Female\n539 3.000000e-01 Female\n540 3.000000e-01 Male\n541 2.638955e+01 Male\n542 3.180523e+01 Female\n543 1.746845e+02 Male\n544 1.876972e+02 Male\n545 1.044164e+02 Male\n546 1.202681e+02 Male\n547 1.630915e+02 Female\n548 1.276025e+02 Female\n549 8.880126e+01 Male\n550 3.563830e+02 Male\n551 2.212766e+02 Male\n552 1.969121e+01 Female\n553 3.755319e+02 Female\n554 1.214511e+02 Male\n555 1.034700e+02 Female\n556 3.000000e-01 Female\n557 3.643123e-01 Female\n558 6.319703e-02 Female\n559 3.000000e-01 Male\n560 3.000000e-01 Male\n561 3.000000e-01 Female\n562 3.000000e-01 Female\n563 3.000000e-01 Male\n564 3.000000e-01 Male\n565 3.000000e-01 Female\n566 3.000000e-01 Male\n567 1.664038e+02 Female\n568 2.946809e+02 Female\n569 4.391924e+01 Male\n570 1.874606e+02 Female\n571 1.143533e+02 Male\n572 1.600158e+02 Male\n573 1.635688e-01 Male\n574 8.809148e+01 Female\n575 1.337539e+02 Male\n576 1.985804e+02 Male\n577 1.578864e+02 Female\n578 3.000000e-01 Female\n579 3.000000e-01 Male\n580 1.953642e-01 Female\n581 1.119205e+00 Male\n582 2.523636e+02 Male\n583 3.000000e-01 Male\n584 4.844371e+00 Female\n585 3.000000e-01 Male\n586 1.492553e+02 Female\n587 1.993617e+02 Male\n588 2.847682e-01 Female\n589 3.145695e-01 Female\n590 3.000000e-01 Male\n591 3.406429e+01 Female\n592 6.595745e+01 Male\n593 3.000000e-01 Male\n594 2.174545e+02 Male\n595 NA Female\n596 5.957447e+01 Female\n597 7.236364e+02 Female\n598 3.000000e-01 Male\n599 3.000000e-01 Female\n600 3.000000e-01 Male\n601 2.676364e+02 Male\n602 
1.891489e+02 Male\n603 3.036364e+02 Female\n604 3.000000e-01 Female\n605 3.000000e-01 Male\n606 3.000000e-01 Male\n607 3.000000e-01 Female\n608 3.000000e-01 Male\n609 1.447020e+00 Male\n610 2.130909e+02 Female\n611 1.357616e-01 Female\n612 3.000000e-01 Female\n613 3.000000e-01 Female\n614 5.534545e+02 Female\n615 1.891489e+02 Female\n616 7.202128e+01 Female\n617 3.250287e+01 Male\n618 1.655629e-02 Male\n619 3.123636e+02 Male\n620 3.000000e-01 Male\n621 7.138298e+01 Male\n622 3.000000e-01 Female\n623 6.946809e+01 Female\n624 4.012629e+01 Male\n625 1.629787e+02 Female\n626 1.508511e+02 Female\n627 1.655629e-02 Male\n628 3.000000e-01 Male\n629 4.635762e-02 Male\n630 3.000000e-01 Female\n631 3.000000e-01 Female\n632 3.000000e-01 Male\n633 1.942553e+02 Male\n634 3.690909e+02 Male\n635 3.000000e-01 Female\n636 3.000000e-01 Female\n637 2.847682e+00 Male\n638 1.435106e+02 Female\n639 3.000000e-01 Male\n640 4.752009e+01 Female\n641 2.621125e+01 Female\n642 1.055319e+02 Female\n643 3.000000e-01 Female\n644 1.149007e+00 Male\n645 2.927273e+02 Female\n646 3.000000e-01 Female\n647 3.000000e-01 Female\n648 4.839265e+01 Male\n649 3.000000e-01 Male\n650 3.000000e-01 Female\n651 2.251656e-01 Female\n```\n:::\n:::\n\nWe can also remove selected columns with negative indexing, or by assigning `NULL` to the column.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, -5] # remove column 5, the \"slum\" variable\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n IgG_concentration age age.1 gender\n1 5772 3.176895e-01 2 Female\n2 8095 3.436823e+00 4 Female\n3 9784 3.000000e-01 4 Male\n4 9338 1.432363e+02 4 Male\n5 6369 4.476534e-01 1 Male\n6 6885 2.527076e-02 4 Male\n7 6252 6.101083e-01 4 Female\n8 8913 3.000000e-01 NA Female\n9 7332 2.916968e+00 4 Male\n10 6941 1.649819e+00 2 Male\n11 5104 4.574007e+00 3 Male\n12 9078 1.583904e+02 15 Female\n13 9960 NA 8 Male\n14 9651 1.065068e+02 12 Male\n15 9229 1.113870e+02 15 Male\n16 5210 4.144893e+01 9 Male\n17 5105 3.000000e-01 8 Male\n18 7607 2.527076e-01 7 
Female\n19 7582 8.159247e+01 11 Female\n20 8179 1.825342e+02 10 Male\n21 5660 4.244656e+01 8 Male\n22 6696 1.193493e+02 11 Female\n23 7842 3.000000e-01 2 Male\n24 6578 3.000000e-01 2 Female\n25 9619 9.025271e-01 3 Female\n26 9838 3.501805e-01 5 Male\n27 6935 3.000000e-01 1 Male\n28 5885 1.227437e+00 3 Female\n29 9657 1.702055e+02 5 Female\n30 9146 3.000000e-01 5 Female\n31 7056 4.801444e-01 3 Male\n32 9144 2.527076e-02 1 Male\n33 8696 3.000000e-01 4 Female\n34 7042 5.776173e-02 3 Male\n35 5278 4.801444e-01 2 Female\n36 6541 3.826715e-01 11 Female\n37 6070 3.000000e-01 7 Male\n38 5490 4.048558e+02 8 Male\n39 6527 3.000000e-01 6 Male\n40 5389 5.451264e-01 6 Male\n41 9003 3.000000e-01 11 Female\n42 6682 5.590753e+01 10 Male\n43 7844 2.202166e-01 6 Female\n44 8257 1.709760e+02 12 Male\n45 7767 1.227437e+00 11 Male\n46 8391 4.567527e+02 10 Male\n47 8317 4.838480e+01 11 Male\n48 7397 1.227437e-01 13 Female\n49 8495 1.877256e-01 3 Female\n50 8093 3.000000e-01 4 Female\n51 7375 3.501805e-01 3 Male\n52 5255 3.339350e+00 1 Male\n53 8445 3.000000e-01 2 Female\n54 8959 5.451264e-01 2 Female\n55 8400 NA 4 Male\n56 7420 2.104693e+00 2 Male\n57 5206 NA 2 Male\n58 7431 3.826715e-01 3 Female\n59 7230 3.926366e+01 3 Female\n60 8208 1.129964e+00 4 Male\n61 8538 3.501805e+00 1 Female\n62 6125 7.542808e+01 13 Female\n63 5767 4.800475e+01 13 Female\n64 5487 1.000000e+00 6 Male\n65 5539 4.068884e+01 13 Male\n66 5759 3.000000e-01 5 Female\n67 6845 4.377672e+01 13 Female\n68 7170 1.193493e+02 14 Male\n69 6588 6.977740e+01 13 Male\n70 7939 1.373288e+02 8 Female\n71 5006 1.642979e+02 7 Male\n72 9180 NA 6 Female\n73 9638 1.542808e+02 13 Male\n74 7781 6.033058e-01 3 Male\n75 6932 2.809917e-01 4 Male\n76 8120 1.966942e+00 2 Male\n77 9292 2.041322e+00 NA Male\n78 9228 2.115702e+00 5 Female\n79 8185 4.663043e+02 3 Male\n80 6797 3.000000e-01 3 Male\n81 5970 1.500796e+02 14 Male\n82 7219 1.543790e+02 11 Female\n83 6870 2.561983e-01 7 Female\n84 7653 1.596338e+02 7 Male\n85 8824 1.732484e+02 11 
Female\n86 8311 4.641304e+02 9 Female\n87 9458 3.736364e+01 14 Male\n88 8275 1.572452e+02 13 Female\n89 6786 3.000000e-01 1 Male\n90 6595 3.000000e-01 1 Male\n91 5264 8.264463e-02 4 Male\n92 9188 6.776859e-01 1 Female\n93 6611 7.272727e-01 2 Male\n94 6840 2.066116e-01 3 Female\n95 5663 1.966942e+00 2 Male\n96 9611 3.000000e-01 1 Male\n97 7717 3.000000e-01 2 Male\n98 8374 2.809917e-01 2 Female\n99 5134 8.016529e-01 4 Female\n100 8122 1.818182e-01 5 Female\n101 6192 1.818182e-01 5 Male\n102 9668 8.264463e-02 6 Female\n103 9577 3.422727e+01 14 Female\n104 6403 8.743506e+00 14 Male\n105 9464 3.000000e-01 10 Male\n106 8157 1.641720e+02 6 Female\n107 9451 4.049587e-01 6 Male\n108 6615 1.001592e+02 8 Male\n109 9074 4.489130e+02 6 Female\n110 7479 1.101911e+02 12 Female\n111 8946 4.440909e+01 12 Male\n112 5296 1.288217e+02 14 Female\n113 6238 2.840909e+01 15 Male\n114 6303 1.003981e+02 12 Female\n115 6662 8.512397e-01 4 Female\n116 6251 1.322314e-01 4 Male\n117 9110 1.297521e+00 3 Female\n118 8480 1.570248e-01 NA Male\n119 5229 1.966942e+00 2 Female\n120 9173 1.536624e+02 3 Male\n121 9896 3.000000e-01 NA Female\n122 5057 3.000000e-01 3 Female\n123 7732 1.074380e+00 3 Male\n124 6882 1.099174e+00 2 Female\n125 9587 3.057851e-01 4 Female\n126 9930 3.000000e-01 10 Female\n127 6960 5.785124e-02 7 Female\n128 6335 4.391304e+02 11 Female\n129 6286 6.130435e+02 6 Female\n130 9035 1.074380e-01 11 Male\n131 5720 7.125796e+01 9 Male\n132 7368 4.222727e+01 6 Male\n133 5170 1.620223e+02 13 Female\n134 6691 3.750000e+01 10 Female\n135 6173 1.534236e+02 6 Female\n136 8170 6.239130e+02 11 Female\n137 9637 5.521739e+02 7 Male\n138 9482 5.785124e-02 6 Female\n139 7880 6.547945e-01 4 Female\n140 6307 8.767123e-02 4 Female\n141 8822 3.000000e-01 4 Male\n142 8190 2.849315e+00 4 Female\n143 7554 3.835616e-02 4 Male\n144 6519 2.849315e-01 4 Male\n145 9764 4.649315e+00 3 Male\n146 8792 1.369863e-01 4 Female\n147 6721 3.589041e-01 3 Male\n148 9042 1.049315e+00 3 Male\n149 7407 4.668998e+01 13 
Female\n150 7229 1.473510e+02 7 Female\n151 7532 4.589744e+01 10 Male\n152 6516 2.109589e-01 6 Male\n153 7941 1.741722e+02 10 Female\n154 8124 2.496503e+01 12 Female\n155 7869 1.850993e+02 10 Male\n156 5647 1.863014e-01 10 Male\n157 9120 1.863014e-01 13 Male\n158 6608 4.589744e+01 13 Female\n159 8635 1.942881e+02 5 Female\n160 9341 5.079646e+02 3 Female\n161 9982 8.767123e-01 4 Male\n162 6976 2.750685e+00 1 Male\n163 6008 1.503311e+02 3 Female\n164 5432 3.000000e-01 4 Male\n165 5749 3.095890e-01 4 Male\n166 6428 3.000000e-01 1 Male\n167 5947 6.371681e+02 5 Female\n168 6027 6.054795e-01 6 Female\n169 5064 1.955298e+02 14 Female\n170 5861 1.786424e+02 6 Male\n171 6702 1.120861e+02 13 Female\n172 7851 1.331954e+02 9 Male\n173 8310 2.159292e+02 11 Male\n174 5897 5.628319e+02 10 Male\n175 9249 1.900662e+02 5 Female\n176 9163 6.547945e-01 14 Male\n177 6550 1.665753e+00 7 Male\n178 5859 1.739238e+02 10 Male\n179 5607 9.991722e+01 6 Male\n180 8746 9.321192e+01 5 Male\n181 5274 8.767123e-02 3 Female\n182 9412 NA 4 Male\n183 5691 6.794521e-01 2 Female\n184 9016 5.808219e-01 3 Male\n185 9128 1.369863e-01 3 Female\n186 8539 2.060274e+00 2 Female\n187 5703 1.610099e+02 3 Male\n188 9573 4.082192e-01 5 Female\n189 5852 8.273973e-01 2 Male\n190 5971 4.601770e+02 3 Female\n191 7015 1.389073e+02 14 Female\n192 8221 3.867133e+01 9 Female\n193 6752 9.260274e-01 14 Female\n194 7436 5.918874e+01 9 Female\n195 6869 1.870861e+02 8 Female\n196 8947 4.328767e-01 7 Male\n197 7360 6.301370e-02 13 Male\n198 7494 3.000000e-01 8 Female\n199 8243 1.548013e+02 6 Male\n200 6176 5.819536e+01 12 Female\n201 6818 1.724338e+02 14 Female\n202 8083 1.932401e+01 15 Female\n203 6711 2.164420e+00 2 Female\n204 8890 9.757412e-01 4 Female\n205 5576 1.509434e-01 3 Male\n206 8396 1.509434e-01 3 Female\n207 5986 7.766571e+01 3 Male\n208 9758 4.319563e+01 4 Female\n209 5444 1.752022e-01 3 Male\n210 6394 3.094775e+01 14 Female\n211 5694 1.266846e-01 8 Male\n212 9604 2.919806e+01 7 Male\n213 7895 9.545455e+00 14 
Female\n214 5141 2.735115e+01 13 Female\n215 8034 1.314841e+02 13 Female\n216 6566 3.643985e+01 7 Male\n217 6827 1.498559e+02 8 Female\n218 7400 9.363636e+00 10 Female\n219 9094 2.479784e-01 9 Male\n220 9474 5.390836e-02 9 Female\n221 7984 8.787062e-01 3 Female\n222 9524 1.994609e-01 4 Male\n223 9598 3.000000e-01 4 Female\n224 9664 3.000000e-01 4 Male\n225 9910 5.390836e-03 2 Female\n226 9216 4.177898e-01 1 Female\n227 9706 3.000000e-01 3 Female\n228 5320 2.479784e-01 2 Male\n229 5256 2.964960e-02 3 Male\n230 9006 2.964960e-01 5 Male\n231 6413 5.148248e+00 2 Female\n232 8717 1.994609e-01 2 Male\n233 9873 3.000000e-01 9 Male\n234 6699 1.779539e+02 13 Male\n235 8228 3.290210e+02 10 Female\n236 6494 3.000000e-01 6 Male\n237 9294 1.809798e+02 13 Female\n238 7680 4.905660e-01 11 Male\n239 7534 1.266846e-01 10 Male\n240 9920 1.543948e+02 8 Female\n241 9814 1.379683e+02 9 Female\n242 5363 6.153846e+02 10 Male\n243 5842 1.474784e+02 14 Male\n244 7992 3.000000e-01 1 Female\n245 5565 1.024259e+00 2 Male\n246 5258 4.444056e+02 3 Female\n247 8200 3.000000e-01 2 Male\n248 8795 2.504043e+00 3 Female\n249 7676 3.000000e-01 2 Female\n250 7029 3.000000e-01 3 Female\n251 7535 7.816712e-02 5 Female\n252 5026 3.000000e-01 10 Female\n253 8630 5.390836e-02 7 Male\n254 6989 1.494236e+02 13 Female\n255 8454 5.972622e+01 15 Male\n256 9741 6.361186e-01 11 Female\n257 6418 1.837896e+02 10 Female\n258 9922 1.320809e+02 3 Female\n259 8504 1.571906e-01 2 Male\n260 6491 1.520231e+02 3 Male\n261 6002 3.000000e-01 3 Female\n262 7127 3.000000e-01 3 Female\n263 8540 1.823699e+02 4 Male\n264 7115 3.000000e-01 3 Male\n265 7268 2.173913e+00 2 Male\n266 8279 2.142202e+01 4 Male\n267 8880 3.000000e-01 2 Female\n268 8076 3.408027e+00 8 Male\n269 6250 4.155963e+01 11 Male\n270 8542 9.698997e-02 6 Male\n271 5393 1.238532e+01 14 Female\n272 9197 9.528926e+00 14 Male\n273 6651 1.916185e+02 5 Female\n274 7473 1.060201e+00 5 Male\n275 6589 3.679104e+02 10 Female\n276 6867 4.288991e+01 13 Male\n277 5413 
9.971098e+01 6 Male\n278 6765 3.000000e-01 5 Male\n279 8933 1.208092e+02 12 Male\n280 6294 3.000000e-01 2 Male\n281 8688 6.688963e-03 3 Female\n282 8108 2.505017e+00 1 Female\n283 6926 1.481605e+00 1 Male\n284 5880 3.000000e-01 1 Female\n285 5529 5.183946e-01 2 Female\n286 8963 3.000000e-01 5 Female\n287 9594 1.872910e-01 5 Male\n288 8075 3.678930e-01 4 Female\n289 5680 3.000000e-01 2 Male\n290 5617 4.529851e+02 NA Female\n291 5080 3.169725e+01 6 Female\n292 7719 3.000000e-01 8 Male\n293 6780 4.922018e+01 15 Male\n294 8768 2.548507e+02 11 Male\n295 7031 1.661850e+02 14 Male\n296 7740 9.164179e+02 6 Male\n297 8855 3.678930e-01 10 Female\n298 7241 1.236994e+02 12 Male\n299 8156 6.705202e+01 14 Male\n300 7333 3.834862e+01 10 Male\n301 6906 1.963211e+00 1 Female\n302 9511 3.000000e-01 3 Male\n303 9336 2.474916e-01 2 Male\n304 6644 3.000000e-01 3 Female\n305 5554 2.173913e-01 4 Male\n306 8094 8.193980e-01 3 Male\n307 8836 2.444816e+00 4 Female\n308 7147 3.000000e-01 4 Male\n309 7745 1.571906e-01 1 Female\n310 9345 1.849711e+02 7 Male\n311 5606 6.119403e+02 11 Female\n312 9766 3.000000e-01 7 Female\n313 6666 4.280936e-01 5 Female\n314 9965 9.698997e-02 10 Male\n315 7927 3.678930e-02 9 Female\n316 6266 4.832090e+02 13 Male\n317 9487 1.390173e+02 11 Female\n318 7089 3.000000e-01 13 Male\n319 5731 6.555970e+02 9 Female\n320 7962 1.526012e+02 15 Female\n321 9532 3.000000e-01 7 Female\n322 6687 7.222222e-01 4 Male\n323 6570 7.724426e+01 1 Male\n324 5781 3.000000e-01 1 Male\n325 8935 6.111111e-01 2 Female\n326 5780 1.555556e+00 2 Female\n327 9029 3.055556e-01 3 Male\n328 5668 1.500000e+00 2 Male\n329 8203 1.470772e+02 3 Male\n330 7381 1.694444e+00 4 Female\n331 7734 3.138298e+02 7 Female\n332 7257 1.414405e+02 11 Female\n333 8418 1.990605e+02 10 Female\n334 8259 4.212766e+02 5 Male\n335 5587 3.000000e-01 8 Male\n336 8499 3.000000e-01 15 Male\n337 7897 6.478723e+02 14 Male\n338 8300 3.000000e-01 2 Male\n339 9691 2.222222e+00 2 Female\n340 5873 3.000000e-01 2 Male\n341 6690 
2.055556e+00 5 Male\n342 9970 2.777778e-02 4 Female\n343 8978 8.333333e-02 3 Male\n344 6181 1.032359e+02 5 Female\n345 8218 1.611111e+00 4 Female\n346 5387 8.333333e-02 2 Female\n347 7850 2.333333e+00 1 Female\n348 7326 5.755319e+02 7 Male\n349 8448 1.686848e+02 8 Female\n350 7264 1.111111e-01 NA Male\n351 8361 3.000000e-01 9 Male\n352 7497 8.372340e+02 8 Female\n353 5559 3.000000e-01 5 Male\n354 7321 3.784504e+01 14 Male\n355 8372 3.819149e+02 14 Male\n356 5030 5.555556e-02 7 Female\n357 6936 3.000000e+02 13 Female\n358 9628 1.855950e+02 2 Male\n359 8558 1.944444e-01 1 Female\n360 7840 3.000000e-01 1 Male\n361 5100 5.555556e-02 4 Female\n362 8244 1.138889e+00 3 Male\n363 9115 4.254237e+01 4 Female\n364 5489 3.000000e-01 3 Male\n365 5766 3.000000e-01 1 Male\n366 5024 3.000000e-01 5 Female\n367 8599 3.000000e-01 4 Female\n368 8895 3.138298e+02 4 Female\n369 7708 1.235908e+02 4 Male\n370 7646 4.159574e+02 11 Male\n371 6640 3.009685e+01 15 Female\n372 8958 1.567850e+02 12 Female\n373 6477 1.367432e+02 11 Female\n374 7910 3.731235e+01 8 Female\n375 7829 9.164927e+01 13 Male\n376 7503 2.936170e+02 10 Female\n377 5209 8.820459e+01 10 Female\n378 6763 1.035491e+02 15 Male\n379 8976 7.379958e+01 8 Female\n380 9223 3.000000e-01 14 Male\n381 7692 1.718750e+02 4 Male\n382 7453 2.128527e+00 1 Male\n383 9775 1.253918e+00 5 Female\n384 9662 2.382445e-01 2 Male\n385 8733 4.639498e-01 2 Female\n386 5695 1.253918e-01 4 Male\n387 7714 1.253918e-01 4 Male\n388 9224 3.000000e-01 2 Female\n389 7635 1.000000e+00 3 Male\n390 7176 1.570043e+02 11 Male\n391 6102 4.344086e+02 10 Female\n392 7817 2.184953e+00 6 Male\n393 9719 1.507837e+00 12 Female\n394 9740 3.228840e-01 10 Female\n395 9528 4.588024e+01 8 Male\n396 7142 1.660560e+02 8 Male\n397 5689 3.000000e-01 13 Male\n398 5439 3.043011e+02 10 Male\n399 6718 2.612903e+02 13 Female\n400 6569 1.621767e+02 10 Male\n401 9444 3.228840e-01 2 Male\n402 6964 4.639498e-01 4 Female\n403 6420 2.495298e+00 3 Female\n404 9189 3.257053e+00 2 Female\n405 
9368 3.793103e-01 1 Female\n406 6360 NA 3 Male\n407 8196 6.896552e-02 3 Female\n408 8297 3.000000e-01 4 Male\n409 6674 1.423197e+00 5 Female\n410 5269 3.000000e-01 5 Female\n411 6599 3.000000e-01 1 Female\n412 7713 1.786638e+02 11 Male\n413 8644 3.279570e+02 6 Male\n414 9680 NA 14 Female\n415 6305 1.903017e+02 8 Male\n416 8493 1.654095e+02 8 Female\n417 5297 4.639498e-01 9 Female\n418 7723 1.815733e+02 7 Male\n419 7510 1.366771e+00 6 Male\n420 5102 1.536050e-01 12 Female\n421 7816 1.306587e+01 8 Male\n422 5143 2.129032e+02 11 Female\n423 7414 1.925647e+02 14 Male\n424 5127 3.000000e-01 3 Female\n425 5830 1.028213e+00 1 Female\n426 8929 3.793103e-01 5 Female\n427 7993 8.025078e-01 2 Female\n428 8092 4.860215e+02 3 Female\n429 9750 3.000000e-01 4 Female\n430 6660 2.100313e-01 2 Male\n431 8054 2.767665e+01 3 Female\n432 6086 1.592476e+00 4 Male\n433 6878 9.717868e-02 1 Female\n434 8125 1.028213e+00 7 Female\n435 9500 3.793103e-01 10 Male\n436 8105 1.292026e+02 11 Male\n437 9593 4.425150e+01 7 Female\n438 5202 3.193548e+02 10 Female\n439 7207 1.860991e+02 14 Female\n440 5518 6.614420e-01 7 Female\n441 9820 5.203762e-01 11 Male\n442 6958 1.330819e+02 12 Male\n443 9445 1.673491e+02 10 Female\n444 8774 3.000000e-01 6 Male\n445 9614 1.117457e+02 13 Male\n446 9810 3.045509e+01 8 Female\n447 7271 3.000000e-01 2 Male\n448 8031 8.280255e-02 3 Female\n449 7232 3.000000e-01 1 Female\n450 7452 1.200637e+00 2 Female\n451 5921 1.687898e-01 NA Male\n452 8136 7.367273e+02 NA Female\n453 6605 8.280255e-02 4 Male\n454 5125 5.127389e-01 4 Male\n455 5911 1.974522e-01 1 Male\n456 9644 7.993631e-01 2 Female\n457 5760 3.000000e-01 2 Male\n458 7055 3.298182e+02 12 Male\n459 9064 9.736842e+01 12 Female\n460 6925 3.000000e-01 8 Female\n461 7757 3.000000e-01 14 Female\n462 8527 4.214545e+02 13 Female\n463 8521 3.000000e-01 6 Male\n464 6260 2.578182e+02 11 Female\n465 9578 2.261147e-01 11 Male\n466 9570 3.000000e-01 10 Female\n467 6246 1.883901e+02 12 Male\n468 9622 9.458204e+01 14 Female\n469 
7661 3.000000e-01 11 Female\n470 9374 3.000000e-01 1 Male\n471 8446 7.707006e-01 2 Female\n472 8332 5.032727e+02 3 Male\n473 8008 1.544586e+00 3 Female\n474 9365 1.431115e+02 5 Female\n475 9819 3.000000e-01 3 Male\n476 5173 1.458599e+00 1 Male\n477 6722 1.247678e+02 4 Female\n478 7668 NA 4 Female\n479 8980 4.334545e+02 4 Male\n480 5204 3.000000e-01 2 Female\n481 6412 6.156364e+02 5 Female\n482 6404 9.574303e+01 7 Male\n483 5693 1.928019e+02 8 Male\n484 8100 1.888545e+02 10 Male\n485 9760 1.598297e+02 6 Female\n486 6377 5.127389e-01 7 Male\n487 6012 1.171053e+02 10 Female\n488 6224 NA 6 Male\n489 6561 2.547771e-02 6 Female\n490 8475 1.707430e+02 15 Female\n491 6629 3.000000e-01 5 Male\n492 7200 1.869969e+02 3 Male\n493 9453 4.731481e+01 5 Male\n494 6449 1.988390e+02 3 Female\n495 9452 3.000000e-01 5 Male\n496 7162 8.808050e+01 5 Male\n497 8962 2.003185e+00 1 Female\n498 7328 3.000000e-01 1 Male\n499 9097 3.509259e+01 7 Female\n500 9131 9.365325e+01 14 Female\n501 7280 3.000000e-01 9 Male\n502 5783 3.736111e+01 10 Female\n503 9895 1.674923e+02 10 Female\n504 7986 8.808050e+01 11 Male\n505 7146 1.656347e+02 11 Female\n506 8671 3.722222e+01 12 Female\n507 5273 6.756364e+02 11 Female\n508 5063 3.000000e-01 12 Male\n509 6729 1.698142e+02 12 Male\n510 9085 1.628483e+02 10 Female\n511 9929 5.985130e-01 1 Male\n512 8479 1.903346e+00 2 Female\n513 7395 3.000000e-01 4 Male\n514 6374 3.000000e-01 2 Male\n515 7878 8.996283e-01 3 Male\n516 9603 3.977695e-01 3 Female\n517 7994 3.000000e-01 2 Male\n518 5277 3.000000e-01 4 Male\n519 5054 3.000000e-01 3 Male\n520 5440 3.000000e-01 1 Female\n521 6551 7.446809e+02 4 Male\n522 5281 6.095745e+02 12 Female\n523 7145 1.427445e+02 6 Male\n524 5275 3.000000e-01 7 Female\n525 9542 2.973978e-02 7 Male\n526 9371 3.977695e-01 13 Female\n527 5598 4.095745e+02 8 Female\n528 7148 4.595745e+02 7 Male\n529 5624 3.000000e-01 8 Female\n530 6998 1.976341e+02 8 Female\n531 9286 3.776596e+02 11 Female\n532 7589 1.777603e+02 14 Female\n533 7095 
4.312268e-01 3 Male\n534 5455 6.765957e+02 2 Female\n535 6257 7.978723e+02 2 Male\n536 8627 9.665427e-02 3 Male\n537 9786 1.879338e+02 2 Male\n538 8176 4.358670e+01 2 Female\n539 9198 3.000000e-01 3 Female\n540 6586 3.000000e-01 2 Male\n541 8850 2.638955e+01 5 Male\n542 9560 3.180523e+01 10 Female\n543 7144 1.746845e+02 14 Male\n544 8230 1.876972e+02 9 Male\n545 7559 1.044164e+02 6 Male\n546 5312 1.202681e+02 7 Male\n547 6560 1.630915e+02 14 Female\n548 6091 1.276025e+02 7 Female\n549 5578 8.880126e+01 7 Male\n550 5837 3.563830e+02 9 Male\n551 8347 2.212766e+02 14 Male\n552 6453 1.969121e+01 10 Female\n553 5758 3.755319e+02 13 Female\n554 5569 1.214511e+02 5 Male\n555 8766 1.034700e+02 4 Female\n556 8002 3.000000e-01 4 Female\n557 7839 3.643123e-01 5 Female\n558 5434 6.319703e-02 4 Female\n559 7636 3.000000e-01 4 Male\n560 6164 3.000000e-01 4 Male\n561 9243 3.000000e-01 3 Female\n562 5872 3.000000e-01 1 Female\n563 8079 3.000000e-01 4 Male\n564 9762 3.000000e-01 1 Male\n565 9476 3.000000e-01 1 Female\n566 8345 3.000000e-01 7 Male\n567 8128 1.664038e+02 13 Female\n568 7956 2.946809e+02 10 Female\n569 8677 4.391924e+01 14 Male\n570 5881 1.874606e+02 12 Female\n571 7498 1.143533e+02 14 Male\n572 8134 1.600158e+02 8 Male\n573 7748 1.635688e-01 7 Male\n574 7990 8.809148e+01 11 Female\n575 6184 1.337539e+02 8 Male\n576 6339 1.985804e+02 12 Male\n577 5113 1.578864e+02 9 Female\n578 9449 3.000000e-01 5 Female\n579 8110 3.000000e-01 4 Male\n580 9307 1.953642e-01 3 Female\n581 5555 1.119205e+00 2 Male\n582 9152 2.523636e+02 2 Male\n583 7969 3.000000e-01 3 Male\n584 6116 4.844371e+00 4 Female\n585 8294 3.000000e-01 4 Male\n586 8938 1.492553e+02 4 Female\n587 9539 1.993617e+02 5 Male\n588 9470 2.847682e-01 3 Female\n589 6677 3.145695e-01 6 Female\n590 8752 3.000000e-01 3 Male\n591 5574 3.406429e+01 11 Female\n592 5989 6.595745e+01 11 Male\n593 9813 3.000000e-01 7 Male\n594 6150 2.174545e+02 8 Male\n595 5730 NA 6 Female\n596 8038 5.957447e+01 10 Female\n597 5964 7.236364e+02 8 
Female\n598 9043 3.000000e-01 8 Male\n599 5095 3.000000e-01 9 Female\n600 8922 3.000000e-01 8 Male\n601 5469 2.676364e+02 13 Male\n602 6726 1.891489e+02 11 Male\n603 7495 3.036364e+02 8 Female\n604 8159 3.000000e-01 2 Female\n605 6709 3.000000e-01 4 Male\n606 5855 3.000000e-01 2 Male\n607 6058 3.000000e-01 2 Female\n608 7292 3.000000e-01 4 Male\n609 6437 1.447020e+00 2 Male\n610 9326 2.130909e+02 4 Female\n611 8222 1.357616e-01 2 Female\n612 6789 3.000000e-01 4 Female\n613 6348 3.000000e-01 1 Female\n614 5958 5.534545e+02 4 Female\n615 9211 1.891489e+02 12 Female\n616 9450 7.202128e+01 7 Female\n617 6540 3.250287e+01 11 Male\n618 8796 1.655629e-02 6 Male\n619 7971 3.123636e+02 8 Male\n620 7549 3.000000e-01 14 Male\n621 9799 7.138298e+01 11 Male\n622 7013 3.000000e-01 7 Female\n623 5599 6.946809e+01 14 Female\n624 8601 4.012629e+01 6 Male\n625 7383 1.629787e+02 13 Female\n626 6656 1.508511e+02 13 Female\n627 5641 1.655629e-02 3 Male\n628 6222 3.000000e-01 1 Male\n629 7674 4.635762e-02 3 Male\n630 5293 3.000000e-01 1 Female\n631 6715 3.000000e-01 1 Female\n632 7057 3.000000e-01 2 Male\n633 7072 1.942553e+02 4 Male\n634 6380 3.690909e+02 4 Male\n635 6762 3.000000e-01 2 Female\n636 5799 3.000000e-01 4 Female\n637 6681 2.847682e+00 5 Male\n638 8755 1.435106e+02 3 Female\n639 6896 3.000000e-01 3 Male\n640 5945 4.752009e+01 6 Female\n641 5035 2.621125e+01 11 Female\n642 6776 1.055319e+02 9 Female\n643 7863 3.000000e-01 7 Female\n644 9836 1.149007e+00 8 Male\n645 7860 2.927273e+02 NA Female\n646 5248 3.000000e-01 8 Female\n647 5677 3.000000e-01 14 Female\n648 9576 4.839265e+01 10 Male\n649 5824 3.000000e-01 10 Male\n650 9184 3.000000e-01 11 Female\n651 5397 2.251656e-01 13 Female\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$slum <- NULL # drops the same column as above, but changes df itself\n```\n:::\n\nWe can also grab the `age` column using the `$` operator. 
\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01\n [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00\n [11] 4.574007e+00 1.583904e+02 NA 1.065068e+02 1.113870e+02\n [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02\n [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01\n [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01\n [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01\n [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01\n [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00\n [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01\n [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01 NA\n [56] 2.104693e+00 NA 3.826715e-01 3.926366e+01 1.129964e+00\n [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01\n [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02\n [71] 1.642979e+02 NA 1.542808e+02 6.033058e-01 2.809917e-01\n [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01\n [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02\n [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01\n [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00\n [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01\n[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01\n[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02\n[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01\n[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02\n[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01\n[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01\n[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 
1.534236e+02\n[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02\n[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00\n[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02\n[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02\n[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02\n[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01\n[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02\n[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02\n[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01\n[181] 8.767123e-02 NA 6.794521e-01 5.808219e-01 1.369863e-01\n[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02\n[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02\n[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01\n[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01\n[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01\n[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02\n[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02\n[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03\n[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01\n[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02\n[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02\n[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00\n[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01\n[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01\n[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02\n[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00\n[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02\n[271] 1.238532e+01 9.528926e+00 1.916185e+02 
1.060201e+00 3.679104e+02\n[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01\n[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01\n[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02\n[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02\n[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01\n[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01\n[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02\n[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02\n[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02\n[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01\n[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00\n[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01\n[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01\n[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00\n[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01\n[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02\n[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01\n[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01\n[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02\n[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01\n[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01\n[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01\n[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02\n[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01\n[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02\n[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01\n[406] NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01\n[411] 3.000000e-01 1.786638e+02 
3.279570e+02 NA 1.903017e+02\n[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01\n[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00\n[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01\n[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01\n[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01\n[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02\n[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00\n[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01\n[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01\n[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01\n[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01\n[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01\n[476] 1.458599e+00 1.247678e+02 NA 4.334545e+02 3.000000e-01\n[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02\n[486] 5.127389e-01 1.171053e+02 NA 2.547771e-02 1.707430e+02\n[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01\n[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01\n[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02\n[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02\n[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01\n[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02\n[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02\n[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02\n[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01\n[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02\n[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02\n[551] 2.212766e+02 1.969121e+01 3.755319e+02 
1.214511e+02 1.034700e+02\n[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01\n[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01\n[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02\n[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02\n[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01\n[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01\n[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01\n[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02 NA\n[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01\n[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01\n[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02\n[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02\n[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01\n[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02\n[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01\n[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01\n[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01\n[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02\n[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01\n[651] 2.251656e-01\n```\n:::\n:::\n\n\n\n## Using indexing to subset by rows\n\nWe can use indexing to also subset by rows. 
For example, here we pull the 100th observation/row.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n IgG_concentration age age gender slum\n100 8122 0.1818182 5 Female Non slum\n```\n:::\n:::\n\nAnd, here we pull the `age` of the 100th observation/row.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,\"age\"] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.1818182\n```\n:::\n:::\n\n \n\n## Logical operators\n\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\noperator | operator option |description\n-----|-----|-----:\n`<`|%l%|less than\n`<=`|%le%|less than or equal to\n`>`|%g%|greater than\n`>=`|%ge%|greater than or equal to\n`==`||equal to\n`!=`|not equal to\n`x&y`||x and y\n`x|y`||x or y\n`%in%`||match\n`%!in%`||do not match\n\n\n## Logical operators examples\n\nLet's practice. First, here is a reminder of what the number.object contains.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n:::\n\n\nNow, we will use logical operators to evaluate the object.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object<4\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object>=3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object!=5\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object %in% c(6,7,2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n\n## Using indexing and logical operators to rename columns\n\n1. 
We can assign the column names from data frame `df` to an object `cn`, then we can modify `cn` directly using indexing and logical operators, finally we reassign the column names, `cn`, back to the data frame `df`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncn <- colnames(df)\ncn\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"IgG_concentration\" \"age\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\n```\n:::\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset\n```\n:::\n\n\n\n## Using indexing and logical operators to subset data\n\n\nIn this example, we subset by rows and pull only observations with an age of less than or equal to 10 and then saved the subset data to `df_lt10`. Note that the logical operators `df$age<=10` is before the comma because I want to subset by rows (the first dimension).\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10 <- df[df$age<=10, ]\n```\n:::\n\nIn this example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n```\n:::\n\nLets check that my subsets worked using the `summary()` function. \n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n0.005391 0.300000 0.300000 0.724742 0.640788 9.545455 10 \n```\n:::\n\n```{.r .cell-code}\nsummary(df_lte5_gt10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. 
NA's \n 0.0054 0.3000 1.6018 87.9886 142.8362 916.4179 10 \n```\n:::\n:::\n\n\n\n## Missing values \n\nMissing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them, is the first step to data cleaning.\n\nTypes of \"missing\" values:\n\n- `NA` - general missing data\n- `NaN` - stands for \"**N**ot **a** **N**umber\", happens when you do\n 0/0.\n- `Inf` and `-Inf` - Infinity, happens when you divide a positive\n number (or negative number) by 0.\n- blank space - sometimes when data is read it, there is a blank space left\n\n## Logical operators to help identify and missing data\n\noperator | operator option |description\n-----|-----|-----:\n`is.na`||is NAN or NA\n`is.nan`||is NAN\n`!is.na`||is not NAN or NA\n`!is.nan`||is not NAN\n`is.infinite`||is infinite\n`any`||are any TRUE\n`which`||which are TRUE\n\n## More logical operators examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- c(0,NA, -1)/0\ntest\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN NA -Inf\n```\n:::\n\n```{.r .cell-code}\nis.na(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE TRUE FALSE\n```\n:::\n\n```{.r .cell-code}\nis.nan(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE FALSE\n```\n:::\n\n```{.r .cell-code}\nis.infinite(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE TRUE\n```\n:::\n:::\n\n\n## More logical operators examples\n\n`any(is.na(x))` means do we have any `NA`'s in the object `x`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n`which(is.na(x))` means which of the elements in object `x` are `NA`'s?\n\n\n::: {.cell}\n\n```{.r 
.cell-code}\nwhich(is.na(df$IgG_concentration)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\ninteger(0)\n```\n:::\n\n```{.r .cell-code}\nwhich(is.na(df$slum)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\ninteger(0)\n```\n:::\n:::\n\n\n## `subset()` function\n\nThe Base R `subset()` function is a slightly easier way to select variables and observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?subset\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nSubsetting Vectors, Matrices and Data Frames\n\nDescription:\n\n Return subsets of vectors, matrices or data frames which meet\n conditions.\n\nUsage:\n\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n\n x: object to be subsetted.\n\n subset: logical expression indicating elements or rows to keep:\n missing values are taken as false.\n\n select: expression, indicating columns to select from a data frame.\n\n drop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\n\nDetails:\n\n This is a generic function, with methods supplied for matrices,\n data frames and vectors (including lists). Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. 
It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. See 'droplevels' for a way to drop all\n unused levels from a data frame.\n\nValue:\n\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\n\nAuthor(s):\n\n Peter Dalgaard and Brian Ripley\n\nSee Also:\n\n '[', 'transform' 'droplevels'\n\nExamples:\n\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)\n\n\n## Subsetting use the `subset()` function\n\nHere are a few examples using the `subset()` function\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, 
df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))\n```\n:::\n\n\n## `subset()` function vs logical operators\n\n`subset()` automatically removes NAs, which is a different behavior from doing logical operations on NAs.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age)\n```\n\n::: {.cell-output-display}\n| Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.| NA's|\n|---------:|-------:|------:|---------:|---------:|--------:|----:|\n| 0.0053908| 0.3| 0.3| 0.7247421| 0.6407876| 9.545454| 10|\n:::\n\n```{.r .cell-code}\nsummary(df_lte10_v2$age)\n```\n\n::: {.cell-output-display}\n| Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.|\n|---------:|-------:|------:|---------:|---------:|--------:|\n| 0.0053908| 0.3| 0.3| 0.7247421| 0.6407876| 9.545454|\n:::\n:::\n\n\nWe can also see this by looking at the number or rows in each dataset.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnrow(df_lte10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 370\n```\n:::\n\n```{.r .cell-code}\nnrow(df_lte10_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 360\n```\n:::\n:::\n\n\n\n\n## Summary\n\n- `colnames()`, `str()` and `summary()`functions from Base R are great functions to assess the data type and some summary statistics\n- There are three basic indexing syntax: `[ ]`, `[[ ]]` and `$`\n- Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\n- There are 5 “types” of missing values, the most common being “NA”\n- Logical operators meant to determine missing values are very helpful for data cleaning\n- The Base R `subset()` function is a slightly easier way to select variables and observations.\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- 
[\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- [\"Indexing\" CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)\n- [\"Logical operators\" CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)\n\n", + "markdown": "---\ntitle: \"Module 6: Get to Know Your Data and Subsetting\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n#execute: \n# echo: true\n---\n\n\n## Learning Objectives\n\nAfter module 6, you should be able to...\n\n- Use basic functions to get to know your data\n- Use three indexing approaches\n- Rely on indexing to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Describe what logical operators are and how to use them\n- Use the `subset()` function to subset data\n\n\n## Getting to know our data\n\nThe `dim()`, `nrow()`, and `ncol()` functions are good options to check the dimensions of your data before moving forward. \n\nLet's first read in the data from the previous module.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndim(df) # rows, columns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 651 5\n```\n:::\n\n```{.r .cell-code}\nnrow(df) # number of rows\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 651\n```\n:::\n\n```{.r .cell-code}\nncol(df) # number of columns\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n\n## Quick summary of data\n\nThe `colnames()`, `str()` and `summary()` functions from Base R are great functions to assess the data type and some summary statistics. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\nstr(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t651 obs. of 5 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.318 3.437 0.3 143.236 0.448 ...\n $ age : int 2 4 4 4 1 4 4 NA 4 2 ...\n $ gender : chr \"Female\" \"Female\" \"Male\" \"Male\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n```\n:::\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender \n Min. :5006 Min. : 0.0054 Min. : 1.000 Length:651 \n 1st Qu.:6306 1st Qu.: 0.3000 1st Qu.: 3.000 Class :character \n Median :7495 Median : 1.6658 Median : 6.000 Mode :character \n Mean :7492 Mean : 87.3683 Mean : 6.606 \n 3rd Qu.:8749 3rd Qu.:141.4405 3rd Qu.:10.000 \n Max. :9982 Max. :916.4179 Max. :15.000 \n NA's :10 NA's :9 \n slum \n Length:651 \n Class :character \n Mode :character \n \n \n \n \n```\n:::\n:::\n\n\nNote, if you have a very large dataset with 15+ variables, `summary()` is not so efficient. \n\n## Description of data\n\nThis is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. 
We will use this dataset for modules throughout the Workshop.\n\n## View the data as a whole dataframe\n\nThe `View()` function, one of the few Base R functions with a capital letter, can be used to open a new tab in the Console and view the data as you would in Excel.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nView(df)\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/ViewTab.png){width=100%}\n:::\n:::\n\n\n## View the data as a whole dataframe\n\nYou can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/View.png){width=90%}\n:::\n:::\n\n\nYou can also hold down `Cmd` or `CTRL` and click on the name of a data frame in your code.\n\n## Indexing\n\nR contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: `[`, `[[` and `$`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nx[i] #if x is a vector\nx[i, j] #if x is a matrix/data frame\nx[[i]] #if x is a list\nx$a #if x is a data frame or list\nx$\"a\" #if x is a data frame or list\n```\n:::\n\n\n## Vectors and multi-dimensional objects\n\nTo index a vector, `vector[i]` selects the ith element. To index a multi-dimensional object such as a matrix, `matrix[i, j]` selects the element in row i and column j, whereas in a three-dimensional array, `array[i, j, k]` selects the element in row i and column j of matrix k. 
\n\nLet's practice by first creating the same objects as we did in Module 1.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object <- 3\ncharacter.object <- \"blue\"\nvector.object1 <- c(2,3,4,5)\nvector.object2 <- c(\"blue\", \"red\", \"yellow\")\nmatrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)\n```\n:::\n\n\nHere is a reminder of what these objects look like.\n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 2 3 4 5\n```\n:::\n\n```{.r .cell-code}\nmatrix.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nFinally, let's use indexing to pull out elements of the objects. \n\n::: {.cell}\n\n```{.r .cell-code}\nvector.object1[2] #pulling the second element\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n\n```{.r .cell-code}\nmatrix.object[1,2] #pulling the element in row 1 column 2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n:::\n\n\n\n## List objects\n\nFor lists, one generally uses `list[[p]]` to select any single element p.\n\nLet's practice by creating the same list as we did in Module 1.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object <- list(number.object, vector.object2, matrix.object)\nlist.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n[1] 3\n\n[[2]]\n[1] \"blue\" \"red\" \"yellow\"\n\n[[3]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nNow we use indexing to pull out the 3rd element in the list.\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[[3]]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nWhat happens if we use a single square bracket?\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object[3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nThe `[[` operator is called the \"extract\" operator and gives us the element\nfrom 
the list. The `[` operator is called the \"subset\" operator and gives\nus a subset of the list, which is still a list.\n\n## $ for indexing for data frame\n\n`$` allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration\n```\n:::\n\n\nNote, if you have spaces in your variable name, you will need to use back ticks \\` after the `$`. This is a good reason to not create variables / column names with spaces.\n\n## $ for indexing with lists\n\n`$` allows only a literal character string or a symbol as the index. For a list it extracts a named element.\n\nList elements can be named\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named <- list(\n emory = number.object,\n uga = vector.object2,\n gsu = matrix.object\n)\nlist.object.named\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n$emory\n[1] 3\n\n$uga\n[1] \"blue\" \"red\" \"yellow\"\n\n$gsu\n [,1] [,2]\n[1,] 2 3\n[2,] 4 5\n```\n:::\n:::\n\n\nIf list elements are named, then you can reference data from the list using `$` or using double square brackets, `[[`\n\n::: {.cell}\n\n```{.r .cell-code}\nlist.object.named$uga \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n:::\n\n```{.r .cell-code}\nlist.object.named[[\"uga\"]] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"blue\" \"red\" \"yellow\"\n```\n:::\n:::\n\n\n\n## Using indexing to rename columns\n\nAs mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration_IU/mL\", \"age_year\") # reassigns\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 
\"observation_id\" \"IgG_concentration_IU/mL\"\n[3] \"age_year\" \"gender\" \n[5] \"slum\" \n```\n:::\n:::\n\n\nFor the sake of the module, I am going to reassign them back to the original variable names\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[2:3] <- c(\"IgG_concentration\", \"age\") #reset\n```\n:::\n\n\n## Using indexing to subset by columns\n\nWe can also subset data frames and matrices (2-dimensional objects) using the bracket `[ row , column ]`. We can subset by columns and pull the `x` column using the index of the column or the column name. Leaving either row or column dimension blank means to select all of them.\n\nFor example, here I am pulling the 3rd column, which has the variable name `age`, for all of rows.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[ , \"age\"] #same as df[ , 3]\n```\n:::\n\nWe can select multiple columns using multiple column names, again this is selecting these variables for all of the rows.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, c(\"age\", \"gender\")] #same as df[ , c(3,4)]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n age gender\n1 2 Female\n2 4 Female\n3 4 Male\n4 4 Male\n5 1 Male\n6 4 Male\n7 4 Female\n8 NA Female\n9 4 Male\n10 2 Male\n11 3 Male\n12 15 Female\n13 8 Male\n14 12 Male\n15 15 Male\n16 9 Male\n17 8 Male\n18 7 Female\n19 11 Female\n20 10 Male\n21 8 Male\n22 11 Female\n23 2 Male\n24 2 Female\n25 3 Female\n26 5 Male\n27 1 Male\n28 3 Female\n29 5 Female\n30 5 Female\n31 3 Male\n32 1 Male\n33 4 Female\n34 3 Male\n35 2 Female\n36 11 Female\n37 7 Male\n38 8 Male\n39 6 Male\n40 6 Male\n41 11 Female\n42 10 Male\n43 6 Female\n44 12 Male\n45 11 Male\n46 10 Male\n47 11 Male\n48 13 Female\n49 3 Female\n50 4 Female\n51 3 Male\n52 1 Male\n53 2 Female\n54 2 Female\n55 4 Male\n56 2 Male\n57 2 Male\n58 3 Female\n59 3 Female\n60 4 Male\n61 1 Female\n62 13 Female\n63 13 Female\n64 6 Male\n65 13 Male\n66 5 Female\n67 13 Female\n68 14 Male\n69 13 Male\n70 8 Female\n71 7 Male\n72 6 Female\n73 13 Male\n74 3 Male\n75 4 
Male\n76 2 Male\n77 NA Male\n78 5 Female\n79 3 Male\n80 3 Male\n81 14 Male\n82 11 Female\n83 7 Female\n84 7 Male\n85 11 Female\n86 9 Female\n87 14 Male\n88 13 Female\n89 1 Male\n90 1 Male\n91 4 Male\n92 1 Female\n93 2 Male\n94 3 Female\n95 2 Male\n96 1 Male\n97 2 Male\n98 2 Female\n99 4 Female\n100 5 Female\n101 5 Male\n102 6 Female\n103 14 Female\n104 14 Male\n105 10 Male\n106 6 Female\n107 6 Male\n108 8 Male\n109 6 Female\n110 12 Female\n111 12 Male\n112 14 Female\n113 15 Male\n114 12 Female\n115 4 Female\n116 4 Male\n117 3 Female\n118 NA Male\n119 2 Female\n120 3 Male\n121 NA Female\n122 3 Female\n123 3 Male\n124 2 Female\n125 4 Female\n126 10 Female\n127 7 Female\n128 11 Female\n129 6 Female\n130 11 Male\n131 9 Male\n132 6 Male\n133 13 Female\n134 10 Female\n135 6 Female\n136 11 Female\n137 7 Male\n138 6 Female\n139 4 Female\n140 4 Female\n141 4 Male\n142 4 Female\n143 4 Male\n144 4 Male\n145 3 Male\n146 4 Female\n147 3 Male\n148 3 Male\n149 13 Female\n150 7 Female\n151 10 Male\n152 6 Male\n153 10 Female\n154 12 Female\n155 10 Male\n156 10 Male\n157 13 Male\n158 13 Female\n159 5 Female\n160 3 Female\n161 4 Male\n162 1 Male\n163 3 Female\n164 4 Male\n165 4 Male\n166 1 Male\n167 5 Female\n168 6 Female\n169 14 Female\n170 6 Male\n171 13 Female\n172 9 Male\n173 11 Male\n174 10 Male\n175 5 Female\n176 14 Male\n177 7 Male\n178 10 Male\n179 6 Male\n180 5 Male\n181 3 Female\n182 4 Male\n183 2 Female\n184 3 Male\n185 3 Female\n186 2 Female\n187 3 Male\n188 5 Female\n189 2 Male\n190 3 Female\n191 14 Female\n192 9 Female\n193 14 Female\n194 9 Female\n195 8 Female\n196 7 Male\n197 13 Male\n198 8 Female\n199 6 Male\n200 12 Female\n201 14 Female\n202 15 Female\n203 2 Female\n204 4 Female\n205 3 Male\n206 3 Female\n207 3 Male\n208 4 Female\n209 3 Male\n210 14 Female\n211 8 Male\n212 7 Male\n213 14 Female\n214 13 Female\n215 13 Female\n216 7 Male\n217 8 Female\n218 10 Female\n219 9 Male\n220 9 Female\n221 3 Female\n222 4 Male\n223 4 Female\n224 4 Male\n225 2 Female\n226 1 
Female\n227 3 Female\n228 2 Male\n229 3 Male\n230 5 Male\n231 2 Female\n232 2 Male\n233 9 Male\n234 13 Male\n235 10 Female\n236 6 Male\n237 13 Female\n238 11 Male\n239 10 Male\n240 8 Female\n241 9 Female\n242 10 Male\n243 14 Male\n244 1 Female\n245 2 Male\n246 3 Female\n247 2 Male\n248 3 Female\n249 2 Female\n250 3 Female\n251 5 Female\n252 10 Female\n253 7 Male\n254 13 Female\n255 15 Male\n256 11 Female\n257 10 Female\n258 3 Female\n259 2 Male\n260 3 Male\n261 3 Female\n262 3 Female\n263 4 Male\n264 3 Male\n265 2 Male\n266 4 Male\n267 2 Female\n268 8 Male\n269 11 Male\n270 6 Male\n271 14 Female\n272 14 Male\n273 5 Female\n274 5 Male\n275 10 Female\n276 13 Male\n277 6 Male\n278 5 Male\n279 12 Male\n280 2 Male\n281 3 Female\n282 1 Female\n283 1 Male\n284 1 Female\n285 2 Female\n286 5 Female\n287 5 Male\n288 4 Female\n289 2 Male\n290 NA Female\n291 6 Female\n292 8 Male\n293 15 Male\n294 11 Male\n295 14 Male\n296 6 Male\n297 10 Female\n298 12 Male\n299 14 Male\n300 10 Male\n301 1 Female\n302 3 Male\n303 2 Male\n304 3 Female\n305 4 Male\n306 3 Male\n307 4 Female\n308 4 Male\n309 1 Female\n310 7 Male\n311 11 Female\n312 7 Female\n313 5 Female\n314 10 Male\n315 9 Female\n316 13 Male\n317 11 Female\n318 13 Male\n319 9 Female\n320 15 Female\n321 7 Female\n322 4 Male\n323 1 Male\n324 1 Male\n325 2 Female\n326 2 Female\n327 3 Male\n328 2 Male\n329 3 Male\n330 4 Female\n331 7 Female\n332 11 Female\n333 10 Female\n334 5 Male\n335 8 Male\n336 15 Male\n337 14 Male\n338 2 Male\n339 2 Female\n340 2 Male\n341 5 Male\n342 4 Female\n343 3 Male\n344 5 Female\n345 4 Female\n346 2 Female\n347 1 Female\n348 7 Male\n349 8 Female\n350 NA Male\n351 9 Male\n352 8 Female\n353 5 Male\n354 14 Male\n355 14 Male\n356 7 Female\n357 13 Female\n358 2 Male\n359 1 Female\n360 1 Male\n361 4 Female\n362 3 Male\n363 4 Female\n364 3 Male\n365 1 Male\n366 5 Female\n367 4 Female\n368 4 Female\n369 4 Male\n370 11 Male\n371 15 Female\n372 12 Female\n373 11 Female\n374 8 Female\n375 13 Male\n376 10 Female\n377 
10 Female\n378 15 Male\n379 8 Female\n380 14 Male\n381 4 Male\n382 1 Male\n383 5 Female\n384 2 Male\n385 2 Female\n386 4 Male\n387 4 Male\n388 2 Female\n389 3 Male\n390 11 Male\n391 10 Female\n392 6 Male\n393 12 Female\n394 10 Female\n395 8 Male\n396 8 Male\n397 13 Male\n398 10 Male\n399 13 Female\n400 10 Male\n401 2 Male\n402 4 Female\n403 3 Female\n404 2 Female\n405 1 Female\n406 3 Male\n407 3 Female\n408 4 Male\n409 5 Female\n410 5 Female\n411 1 Female\n412 11 Male\n413 6 Male\n414 14 Female\n415 8 Male\n416 8 Female\n417 9 Female\n418 7 Male\n419 6 Male\n420 12 Female\n421 8 Male\n422 11 Female\n423 14 Male\n424 3 Female\n425 1 Female\n426 5 Female\n427 2 Female\n428 3 Female\n429 4 Female\n430 2 Male\n431 3 Female\n432 4 Male\n433 1 Female\n434 7 Female\n435 10 Male\n436 11 Male\n437 7 Female\n438 10 Female\n439 14 Female\n440 7 Female\n441 11 Male\n442 12 Male\n443 10 Female\n444 6 Male\n445 13 Male\n446 8 Female\n447 2 Male\n448 3 Female\n449 1 Female\n450 2 Female\n451 NA Male\n452 NA Female\n453 4 Male\n454 4 Male\n455 1 Male\n456 2 Female\n457 2 Male\n458 12 Male\n459 12 Female\n460 8 Female\n461 14 Female\n462 13 Female\n463 6 Male\n464 11 Female\n465 11 Male\n466 10 Female\n467 12 Male\n468 14 Female\n469 11 Female\n470 1 Male\n471 2 Female\n472 3 Male\n473 3 Female\n474 5 Female\n475 3 Male\n476 1 Male\n477 4 Female\n478 4 Female\n479 4 Male\n480 2 Female\n481 5 Female\n482 7 Male\n483 8 Male\n484 10 Male\n485 6 Female\n486 7 Male\n487 10 Female\n488 6 Male\n489 6 Female\n490 15 Female\n491 5 Male\n492 3 Male\n493 5 Male\n494 3 Female\n495 5 Male\n496 5 Male\n497 1 Female\n498 1 Male\n499 7 Female\n500 14 Female\n501 9 Male\n502 10 Female\n503 10 Female\n504 11 Male\n505 11 Female\n506 12 Female\n507 11 Female\n508 12 Male\n509 12 Male\n510 10 Female\n511 1 Male\n512 2 Female\n513 4 Male\n514 2 Male\n515 3 Male\n516 3 Female\n517 2 Male\n518 4 Male\n519 3 Male\n520 1 Female\n521 4 Male\n522 12 Female\n523 6 Male\n524 7 Female\n525 7 Male\n526 13 
Female\n527 8 Female\n528 7 Male\n529 8 Female\n530 8 Female\n531 11 Female\n532 14 Female\n533 3 Male\n534 2 Female\n535 2 Male\n536 3 Male\n537 2 Male\n538 2 Female\n539 3 Female\n540 2 Male\n541 5 Male\n542 10 Female\n543 14 Male\n544 9 Male\n545 6 Male\n546 7 Male\n547 14 Female\n548 7 Female\n549 7 Male\n550 9 Male\n551 14 Male\n552 10 Female\n553 13 Female\n554 5 Male\n555 4 Female\n556 4 Female\n557 5 Female\n558 4 Female\n559 4 Male\n560 4 Male\n561 3 Female\n562 1 Female\n563 4 Male\n564 1 Male\n565 1 Female\n566 7 Male\n567 13 Female\n568 10 Female\n569 14 Male\n570 12 Female\n571 14 Male\n572 8 Male\n573 7 Male\n574 11 Female\n575 8 Male\n576 12 Male\n577 9 Female\n578 5 Female\n579 4 Male\n580 3 Female\n581 2 Male\n582 2 Male\n583 3 Male\n584 4 Female\n585 4 Male\n586 4 Female\n587 5 Male\n588 3 Female\n589 6 Female\n590 3 Male\n591 11 Female\n592 11 Male\n593 7 Male\n594 8 Male\n595 6 Female\n596 10 Female\n597 8 Female\n598 8 Male\n599 9 Female\n600 8 Male\n601 13 Male\n602 11 Male\n603 8 Female\n604 2 Female\n605 4 Male\n606 2 Male\n607 2 Female\n608 4 Male\n609 2 Male\n610 4 Female\n611 2 Female\n612 4 Female\n613 1 Female\n614 4 Female\n615 12 Female\n616 7 Female\n617 11 Male\n618 6 Male\n619 8 Male\n620 14 Male\n621 11 Male\n622 7 Female\n623 14 Female\n624 6 Male\n625 13 Female\n626 13 Female\n627 3 Male\n628 1 Male\n629 3 Male\n630 1 Female\n631 1 Female\n632 2 Male\n633 4 Male\n634 4 Male\n635 2 Female\n636 4 Female\n637 5 Male\n638 3 Female\n639 3 Male\n640 6 Female\n641 11 Female\n642 9 Female\n643 7 Female\n644 8 Male\n645 NA Female\n646 8 Female\n647 14 Female\n648 10 Male\n649 10 Male\n650 11 Female\n651 13 Female\n```\n:::\n:::\n\nWe can remove select columns using indexing as well, OR by simply changing the column to `NULL`\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[, -5] #remove column 5, \"slum\" variable\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$slum <- NULL # this is the same as above\n```\n:::\n\nWe can also grab the `age` 
column using the `$` operator, again this is selecting the variable for all of the rows.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age\n```\n:::\n\n\n\n## Using indexing to subset by rows\n\nWe can use indexing to also subset by rows. For example, here we pull the 100th observation/row.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender slum\n100 8122 0.1818182 5 Female Non slum\n```\n:::\n:::\n\nAnd, here we pull the `age` of the 100th observation/row.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf[100,\"age\"] \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 5\n```\n:::\n:::\n\n \n\n## Logical operators\n\nLogical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE\n\noperator | operator option |description\n-----|-----|-----:\n`<`|%l%|less than\n`<=`|%le%|less than or equal to\n`>`|%g%|greater than\n`>=`|%ge%|greater than or equal to\n`==`||equal to\n`!=`||not equal to\n`x&y`||x and y\n`x|y`||x or y\n`%in%`||match\n`%!in%`||do not match\n\n\n## Logical operators examples\n\nLet's practice. First, here is a reminder of what the number.object contains.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 3\n```\n:::\n:::\n\n\nNow, we will use logical operators to evaluate the object.\n\n::: {.cell}\n\n```{.r .cell-code}\nnumber.object<4\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object>=3\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object!=5\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nnumber.object %in% c(6,7,2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n\n## Using indexing and logical operators to rename columns\n\n1. 
We can assign the column names from data frame `df` to an object `cn`, modify `cn` directly using indexing and logical operators, and finally reassign the column names, `cn`, back to the data frame `df`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncn <- colnames(df)\ncn\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n\n```{.r .cell-code}\ncn==\"IgG_concentration\"\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE TRUE FALSE FALSE FALSE\n```\n:::\n\n```{.r .cell-code}\ncn[cn==\"IgG_concentration\"] <-\"IgG_concentration_mIU\" #rename cn to \"IgG_concentration_mIU\" when cn is \"IgG_concentration\"\ncolnames(df) <- cn\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration_mIU\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n:::\n\n\nNote, I am resetting the column name back to the original name for the sake of the rest of the module.\n\n::: {.cell}\n\n```{.r .cell-code}\ncolnames(df)[colnames(df)==\"IgG_concentration_mIU\"] <- \"IgG_concentration\" #reset\n```\n:::\n\n\n\n## Using indexing and logical operators to subset data\n\n\nIn this example, we subset by rows, pull only observations with an age of less than or equal to 10, and save the subset data to `df_lte10`. Note that the logical expression `df$age<=10` comes before the comma because I want to subset by rows (the first dimension).\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10 <- df[df$age<=10, ]\n```\n:::\n\nLet's check that my subset worked using the `summary()` function. \n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.0 3.0 4.0 4.8 7.0 10.0 9 \n```\n:::\n:::\n\n\n
\n\nIn the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte5_gt10 <- df[df$age<=5 | df$age>10, ]\n```\n:::\n\nLet's check that my subset worked using the `summary()` function. \n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte5_gt10$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Min. 1st Qu. Median Mean 3rd Qu. Max. NA's \n 1.00 2.50 4.00 6.08 11.00 15.00 9 \n```\n:::\n:::\n\n\n\n## Missing values \n\nMissing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data and how you can identify them is the first step to data cleaning.\n\nTypes of \"missing\" values:\n\n- `NA` - stands for \"**N**ot **A**vailable\", general missing data\n- `NaN` - stands for \"**N**ot **a** **N**umber\", happens when you do 0/0.\n- `Inf` and `-Inf` - Infinity, happens when you divide a positive number (or negative number) by 0.\n- blank space - sometimes when data is read in, there is a blank space left\n- an empty string (e.g., `\"\"`) \n- `NULL` - undefined value that represents something that does not exist\n\n## Logical operators to help identify missing data\n\noperator |description\n-----|-----:\n`is.na`|is NaN or NA\n`is.nan`|is NaN\n`!is.na`|is not NaN or NA\n`!is.nan`|is not NaN\n`is.infinite`|is infinite\n`any`|are any TRUE\n`all`|all are TRUE\n`which`|which are TRUE\n\n## More logical operators examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntest <- c(0,NA, -1)/0\ntest\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NaN NA -Inf\n```\n:::\n\n```{.r .cell-code}\nis.na(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE TRUE FALSE\n```\n:::\n\n```{.r .cell-code}\nis.nan(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE FALSE\n```\n:::\n\n```{.r .cell-code}\nis.infinite(test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE FALSE 
TRUE\n```\n:::\n:::\n\n\n## More logical operators examples\n\n`any(is.na(x))` means do we have any `NA`'s in the object `x`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nany(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nany(is.na(df$slum)) # are there any NAs- NO/FALSE\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n:::\n\n\n`which(is.na(x))` means which of the elements in object `x` are `NA`'s?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwhich(is.na(df$IgG_concentration)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 13 55 57 72 182 406 414 478 488 595\n```\n:::\n\n```{.r .cell-code}\nwhich(is.na(df$slum)) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\ninteger(0)\n```\n:::\n:::\n\n\n## `subset()` function\n\nThe Base R `subset()` function is a slightly easier way to select variables and observations.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?subset\n```\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nSubsetting Vectors, Matrices and Data Frames\n\nDescription:\n\n Return subsets of vectors, matrices or data frames which meet\n conditions.\n\nUsage:\n\n subset(x, ...)\n \n ## Default S3 method:\n subset(x, subset, ...)\n \n ## S3 method for class 'matrix'\n subset(x, subset, select, drop = FALSE, ...)\n \n ## S3 method for class 'data.frame'\n subset(x, subset, select, drop = FALSE, ...)\n \nArguments:\n\n x: object to be subsetted.\n\n subset: logical expression indicating elements or rows to keep:\n missing values are taken as false.\n\n select: expression, indicating columns to select from a data frame.\n\n drop: passed on to '[' indexing operator.\n\n ...: further arguments to be passed to or from other methods.\n\nDetails:\n\n This is a generic function, with methods supplied for matrices,\n data frames and vectors (including lists). 
Packages and users can\n add further methods.\n\n For ordinary vectors, the result is simply 'x[subset &\n !is.na(subset)]'.\n\n For data frames, the 'subset' argument works on the rows. Note\n that 'subset' will be evaluated in the data frame, so columns can\n be referred to (by name) as variables in the expression (see the\n examples).\n\n The 'select' argument exists only for the methods for data frames\n and matrices. It works by first replacing column names in the\n selection expression with the corresponding column numbers in the\n data frame and then using the resulting integer vector to index\n the columns. This allows the use of the standard indexing\n conventions so that for example ranges of columns can be specified\n easily, or single columns can be dropped (see the examples).\n\n The 'drop' argument is passed on to the indexing method for\n matrices and data frames: note that the default for matrices is\n different from that for indexing.\n\n Factors may have empty levels after subsetting; unused levels are\n not automatically removed. 
See 'droplevels' for a way to drop all\n unused levels from a data frame.\n\nValue:\n\n An object similar to 'x' contain just the selected elements (for a\n vector), rows and columns (for a matrix or data frame), and so on.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n functions like '[', and in particular the non-standard evaluation\n of argument 'subset' can have unanticipated consequences.\n\nAuthor(s):\n\n Peter Dalgaard and Brian Ripley\n\nSee Also:\n\n '[', 'transform' 'droplevels'\n\nExamples:\n\n subset(airquality, Temp > 80, select = c(Ozone, Temp))\n subset(airquality, Day == 1, select = -Temp)\n subset(airquality, select = Ozone:Wind)\n \n with(airquality, subset(Ozone, Temp > 80))\n \n ## sometimes requiring a logical 'subset' argument is a nuisance\n nm <- rownames(state.x77)\n start_with_M <- nm %in% grep(\"^M\", nm, value = TRUE)\n subset(state.x77, start_with_M, Illiteracy:Murder)\n # but in recent versions of R this can simply be\n subset(state.x77, grepl(\"^M\", nm), Illiteracy:Murder)\n\n\n## Subsetting using the `subset()` function\n\nHere are a few examples using the `subset()` function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))\ndf_lt5_f <- subset(df, df$age<=5 & gender==\"Female\", select=c(IgG_concentration, slum))\n```\n:::\n\n\n## `subset()` function vs logical operators\n\n`subset()` automatically drops rows where the condition evaluates to `NA`, whereas logical indexing with `[` keeps them as rows of `NA`s.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df_lte10$age) #created with indexing\n```\n\n::: {.cell-output-display}\n| Min.| 1st Qu.| Median| Mean| 3rd Qu.| Max.| NA's|\n|----:|-------:|------:|----:|-------:|----:|----:|\n| 1| 3| 4| 4.8| 7| 10| 9|\n:::\n\n```{.r .cell-code}\nsummary(df_lte10_v2$age) #created with the subset function\n```\n\n::: {.cell-output-display}\n| Min.| 1st Qu.| Median| Mean| 
3rd Qu.| Max.|\n|----:|-------:|------:|----:|-------:|----:|\n| 1| 3| 4| 4.8| 7| 10|\n:::\n:::\n\n\nWe can also see this by looking at the number of rows in each dataset: the difference of 9 rows matches the 9 `NA`'s reported by `summary()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nnrow(df_lte10)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 504\n```\n:::\n\n```{.r .cell-code}\nnrow(df_lte10_v2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 495\n```\n:::\n:::\n\n\n\n\n## Summary\n\n- The `colnames()`, `str()` and `summary()` functions from Base R help assess the data type and some summary statistics\n- There are three basic indexing syntaxes: `[`, `[[` and `$`\n- Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)\n- Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing\n- There are 7 “types” of missing values, the most common being “NA”\n- Logical operators meant to determine missing values are very helpful for data cleaning\n- The Base R `subset()` function is a slightly easier way to select variables and observations.\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n- [\"Indexing\" CRAN Project](https://cran.r-project.org/doc/manuals/R-lang.html#Indexing)\n- [\"Logical operators\" CRAN Project](https://cran.r-project.org/web/packages/extraoperators/vignettes/logicals-vignette.html)\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json index 2996645..89c0109 100644 --- a/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json +++ 
b/_freeze/modules/Module07-VarCreationClassesSummaries/execute-results/html.json @@ -1,7 +1,7 @@ { - "hash": "5ecd3b27a4a72d2ba1db1285b9852998", + "hash": "d36a9161972c30d45b4350606c8bff8d", "result": { - "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct 1 variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n:::\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nReturn the First or Last Parts of an Object\n\nDescription:\n\n Returns the first or last parts of a vector, matrix, table, data\n frame or function. Since 'head()' and 'tail()' are generic\n functions, they may also have been extended to other classes.\n\nUsage:\n\n head(x, ...)\n ## Default S3 method:\n head(x, n = 6L, ...)\n \n ## S3 method for class 'matrix'\n head(x, n = 6L, ...) # is exported as head.matrix()\n ## NB: The methods for 'data.frame' and 'array' are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n head(x, n = 6L, ...)\n ## S3 method for class 'function'\n head(x, n = 6L, ...)\n \n \n tail(x, ...)\n ## Default S3 method:\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'matrix'\n tail(x, n = 6L, keepnums = TRUE, addrownums, ...) 
# exported as tail.matrix()\n ## NB: The methods for 'data.frame', 'array', and 'table'\n ## are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'function'\n tail(x, n = 6L, ...)\n \nArguments:\n\n x: an object\n\n n: an integer vector of length up to 'dim(x)' (or 1, for\n non-dimensioned objects). A 'logical' is silently coerced to\n integer. Values specify the indices to be selected in the\n corresponding dimension (or along the length) of the object.\n A positive value of 'n[i]' includes the first/last 'n[i]'\n indices in that dimension, while a negative value excludes\n the last/first 'abs(n[i])', including all remaining indices.\n 'NA' or non-specified values (when 'length(n) <\n length(dim(x))') select all indices in that dimension. Must\n contain at least one non-missing value.\n\nkeepnums: in each dimension, if no names in that dimension are present,\n create them using the indices included in that dimension.\n Ignored if 'dim(x)' is 'NULL' or its length 1.\n\naddrownums: deprecated - 'keepnums' should be used instead. Taken as\n the value of 'keepnums' if it is explicitly set when\n 'keepnums' is not.\n\n ...: arguments to be passed to or from other methods.\n\nDetails:\n\n For vector/array based objects, 'head()' ('tail()') returns a\n subset of the same dimensionality as 'x', usually of the same\n class. For historical reasons, by default they select the first\n (last) 6 indices in the first dimension (\"rows\") or along the\n length of a non-dimensioned vector, and the full extent (all\n indices) in any remaining dimensions. 'head.matrix()' and\n 'tail.matrix()' are exported.\n\n The default and array(/matrix) methods for 'head()' and 'tail()'\n are quite general. 
They will work as is for any class which has a\n 'dim()' method, a 'length()' method (only required if 'dim()'\n returns 'NULL'), and a '[' method (that accepts the 'drop'\n argument and can subset in all dimensions in the dimensioned\n case).\n\n For functions, the lines of the deparsed function are returned as\n character strings.\n\n When 'x' is an array(/matrix) of dimensionality two and more,\n 'tail()' will add dimnames similar to how they would appear in a\n full printing of 'x' for all dimensions 'k' where 'n[k]' is\n specified and non-missing and 'dimnames(x)[[k]]' (or 'dimnames(x)'\n itself) is 'NULL'. Specifically, the form of the added dimnames\n will vary for different dimensions as follows:\n\n 'k=1' (rows): '\"[n,]\"' (right justified with whitespace padding)\n\n 'k=2' (columns): '\"[,n]\"' (with _no_ whitespace padding)\n\n 'k>2' (higher dims): '\"n\"', i.e., the indices as _character_\n values\n\n Setting 'keepnums = FALSE' suppresses this behaviour.\n\n As 'data.frame' subsetting ('indexing') keeps 'attributes', so do\n the 'head()' and 'tail()' methods for data frames.\n\nValue:\n\n An object (usually) like 'x' but generally smaller. Hence, for\n 'array's, the result corresponds to 'x[.., drop=FALSE]'. For\n 'ftable' objects 'x', a transformed 'format(x)'.\n\nNote:\n\n For array inputs the output of 'tail' when 'keepnums' is 'TRUE',\n any dimnames vectors added for dimensions '>2' are the original\n numeric indices in that dimension _as character vectors_. This\n means that, e.g., for 3-dimensional array 'arr', 'tail(arr,\n c(2,2,-1))[ , , 2]' and 'tail(arr, c(2,2,-1))[ , , \"2\"]' may both\n be valid but have completely different meanings.\n\nAuthor(s):\n\n Patrick Burns, improved and corrected by R-Core. Negative argument\n added by Vincent Goulet. 
Multi-dimension support added by Gabriel\n Becker.\n\nExamples:\n\n head(letters)\n head(letters, n = -6L)\n \n head(freeny.x, n = 10L)\n head(freeny.y)\n \n head(iris3)\n head(iris3, c(6L, 2L))\n head(iris3, c(6L, -1L, 2L))\n \n tail(letters)\n tail(letters, n = -6L)\n \n tail(freeny.x)\n ## the bottom-right \"corner\" :\n tail(freeny.x, n = c(4, 2))\n tail(freeny.y)\n \n tail(iris3)\n tail(iris3, c(6L, 2L))\n tail(iris3, c(6L, -1L, 2L))\n \n ## iris with dimnames stripped\n a3d <- iris3 ; dimnames(a3d) <- NULL\n tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!\n tail(a3d, c(6, -1, 2), keepnums = FALSE)\n \n ## data frame w/ a (non-standard) attribute:\n treeS <- structure(trees, foo = \"bar\")\n (n <- nrow(treeS))\n stopifnot(exprs = { # attribute is kept\n identical(htS <- head(treeS), treeS[1:6, ])\n identical(attr(htS, \"foo\") , \"bar\")\n identical(tlS <- tail(treeS), treeS[(n-5):n, ])\n ## BUT if I use \"useAttrib(.)\", this is *not* ok, when n is of length 2:\n ## --- because [i,j]-indexing of data frames *also* drops \"other\" attributes ..\n identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )\n })\n \n tail(library) # last lines of function\n \n head(stats::ftable(Titanic))\n \n ## 1d-array (with named dim) :\n a1 <- array(1:7, 7); names(dim(a1)) <- \"O2\"\n stopifnot(exprs = {\n identical( tail(a1, 10), a1)\n identical( head(a1, 10), a1)\n identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] in R <= 3.6.x\n identical( tail(a1, 2), a1[6:7])\n identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x\n })\n\n\n\n## Adding new columns\n\nYou can add a new column to `df` using the `$` operator. Here we add a column called `log_IgG`:\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|\n|--------------:|-----------------:|---:|:------|:--------|---------:|\n| 5772| 0.3176895| 2|Female |Non slum | -1.146681|\n| 8095| 
3.4368231| 4|Female |Non slum | 1.234547|\n| 9784| 0.3000000| 4|Male |Non slum | -1.203973|\n:::\n:::\n\n\n## Creating conditional variables\n\nOne frequently-used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which \"returns a value depending on whether the element of test is `TRUE` or `FALSE`.\"\n\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n## `ifelse` example\n\nReminder of the first three arguments in the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|young |\n:::\n:::\n\n\n\n## Nesting `ifelse` statements example\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \n ifelse(df$age>10, \"old\", NA)))\nhead(df)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|young |\n| 
9338| 143.2363014| 4|Male |Non slum | 4.9644957|young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|young |\n:::\n:::\n\n\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). \n\n## \t`class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"numeric\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\nReturn the First or Last Parts of an Object\n\nDescription:\n\n Returns the first or last parts of a vector, matrix, table, data\n frame or function. Since 'head()' and 'tail()' are generic\n functions, they may also have been extended to other classes.\n\nUsage:\n\n head(x, ...)\n ## Default S3 method:\n head(x, n = 6L, ...)\n \n ## S3 method for class 'matrix'\n head(x, n = 6L, ...) # is exported as head.matrix()\n ## NB: The methods for 'data.frame' and 'array' are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n head(x, n = 6L, ...)\n ## S3 method for class 'function'\n head(x, n = 6L, ...)\n \n \n tail(x, ...)\n ## Default S3 method:\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'matrix'\n tail(x, n = 6L, keepnums = TRUE, addrownums, ...) 
# exported as tail.matrix()\n ## NB: The methods for 'data.frame', 'array', and 'table'\n ## are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'function'\n tail(x, n = 6L, ...)\n \nArguments:\n\n x: an object\n\n n: an integer vector of length up to 'dim(x)' (or 1, for\n non-dimensioned objects). A 'logical' is silently coerced to\n integer. Values specify the indices to be selected in the\n corresponding dimension (or along the length) of the object.\n A positive value of 'n[i]' includes the first/last 'n[i]'\n indices in that dimension, while a negative value excludes\n the last/first 'abs(n[i])', including all remaining indices.\n 'NA' or non-specified values (when 'length(n) <\n length(dim(x))') select all indices in that dimension. Must\n contain at least one non-missing value.\n\nkeepnums: in each dimension, if no names in that dimension are present,\n create them using the indices included in that dimension.\n Ignored if 'dim(x)' is 'NULL' or its length 1.\n\naddrownums: deprecated - 'keepnums' should be used instead. Taken as\n the value of 'keepnums' if it is explicitly set when\n 'keepnums' is not.\n\n ...: arguments to be passed to or from other methods.\n\nDetails:\n\n For vector/array based objects, 'head()' ('tail()') returns a\n subset of the same dimensionality as 'x', usually of the same\n class. For historical reasons, by default they select the first\n (last) 6 indices in the first dimension (\"rows\") or along the\n length of a non-dimensioned vector, and the full extent (all\n indices) in any remaining dimensions. 'head.matrix()' and\n 'tail.matrix()' are exported.\n\n The default and array(/matrix) methods for 'head()' and 'tail()'\n are quite general. 
They will work as is for any class which has a\n 'dim()' method, a 'length()' method (only required if 'dim()'\n returns 'NULL'), and a '[' method (that accepts the 'drop'\n argument and can subset in all dimensions in the dimensioned\n case).\n\n For functions, the lines of the deparsed function are returned as\n character strings.\n\n When 'x' is an array(/matrix) of dimensionality two and more,\n 'tail()' will add dimnames similar to how they would appear in a\n full printing of 'x' for all dimensions 'k' where 'n[k]' is\n specified and non-missing and 'dimnames(x)[[k]]' (or 'dimnames(x)'\n itself) is 'NULL'. Specifically, the form of the added dimnames\n will vary for different dimensions as follows:\n\n 'k=1' (rows): '\"[n,]\"' (right justified with whitespace padding)\n\n 'k=2' (columns): '\"[,n]\"' (with _no_ whitespace padding)\n\n 'k>2' (higher dims): '\"n\"', i.e., the indices as _character_\n values\n\n Setting 'keepnums = FALSE' suppresses this behaviour.\n\n As 'data.frame' subsetting ('indexing') keeps 'attributes', so do\n the 'head()' and 'tail()' methods for data frames.\n\nValue:\n\n An object (usually) like 'x' but generally smaller. Hence, for\n 'array's, the result corresponds to 'x[.., drop=FALSE]'. For\n 'ftable' objects 'x', a transformed 'format(x)'.\n\nNote:\n\n For array inputs the output of 'tail' when 'keepnums' is 'TRUE',\n any dimnames vectors added for dimensions '>2' are the original\n numeric indices in that dimension _as character vectors_. This\n means that, e.g., for 3-dimensional array 'arr', 'tail(arr,\n c(2,2,-1))[ , , 2]' and 'tail(arr, c(2,2,-1))[ , , \"2\"]' may both\n be valid but have completely different meanings.\n\nAuthor(s):\n\n Patrick Burns, improved and corrected by R-Core. Negative argument\n added by Vincent Goulet. 
Multi-dimension support added by Gabriel\n Becker.\n\nExamples:\n\n head(letters)\n head(letters, n = -6L)\n \n head(freeny.x, n = 10L)\n head(freeny.y)\n \n head(iris3)\n head(iris3, c(6L, 2L))\n head(iris3, c(6L, -1L, 2L))\n \n tail(letters)\n tail(letters, n = -6L)\n \n tail(freeny.x)\n ## the bottom-right \"corner\" :\n tail(freeny.x, n = c(4, 2))\n tail(freeny.y)\n \n tail(iris3)\n tail(iris3, c(6L, 2L))\n tail(iris3, c(6L, -1L, 2L))\n \n ## iris with dimnames stripped\n a3d <- iris3 ; dimnames(a3d) <- NULL\n tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!\n tail(a3d, c(6, -1, 2), keepnums = FALSE)\n \n ## data frame w/ a (non-standard) attribute:\n treeS <- structure(trees, foo = \"bar\")\n (n <- nrow(treeS))\n stopifnot(exprs = { # attribute is kept\n identical(htS <- head(treeS), treeS[1:6, ])\n identical(attr(htS, \"foo\") , \"bar\")\n identical(tlS <- tail(treeS), treeS[(n-5):n, ])\n ## BUT if I use \"useAttrib(.)\", this is *not* ok, when n is of length 2:\n ## --- because [i,j]-indexing of data frames *also* drops \"other\" attributes ..\n identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )\n })\n \n tail(library) # last lines of function\n \n head(stats::ftable(Titanic))\n \n ## 1d-array (with named dim) :\n a1 <- array(1:7, 7); names(dim(a1)) <- \"O2\"\n stopifnot(exprs = {\n identical( tail(a1, 10), a1)\n identical( head(a1, 10), a1)\n identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] in R <= 3.6.x\n identical( tail(a1, 2), a1[6:7])\n identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x\n })\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. 
\n\nIf even one element of a vector is a character string, R coerces the whole vector to character.\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\n\nHere, because the integers are in quotation marks, R reads them as character values.\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\n\nNote, this is the first time we have shown you nested functions. Here, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the `class()` function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses:\n\n1. `Double` is a special subset of `numeric` that can contain fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. 
The difference between the two will be more clear when we look at two dimensional classes below.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"numeric\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n:::\n\n\n\n## Logical\n\nReminder `logical` is a type that only has two possible elements: `TRUE` and `FALSE`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"logical\"\n```\n:::\n:::\n\n\nNote that `logical` elements are NOT in quotes. Putting R special classes (e.g., `NA` or `FALSE`) in quotations turns them into character value. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of certain class. For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce between classes** `x` from current `x` class into a certain class. For example, `as.integer` or `as.character` or `as.numeric`. 
This is particularly useful when, for example, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, coercion is seamless:\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"1\" \"4\" \"7\"\n```\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 4 7\n```\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIn some cases, coercion is not possible; the elements that cannot be converted become `NA` (an R constant representing \"**N**ot **A**vailable\", i.e., a missing value).\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 4 NA\n```\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE NA\n```\n:::\n:::\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"factor\"\n```\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"middle\" \"old\" \"young\" \n```\n:::\n:::\n\n\nNote that levels are, by default, set to **alphanumerical** order! And, the first is always the \"reference\" group. However, we often prefer a different reference group.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being <5 years old is associated with higher IgG antibody concentrations.\n\nBy default, `middle` is the reference group, so we would only generate beta coefficients comparing `middle` to `young` AND `middle` to `old`. But we want `young` to be the reference group, so that we generate beta coefficients comparing `young` to `middle` AND `young` to `old`.\n\n## Changing factor reference \n\nChanging the reference group of a factor variable.\n- If the object is already a factor, then use the `relevel()` function and the `ref` argument to specify the reference.\n- If the object is a character, then use the `factor()` function and the `levels` argument to specify the order of the values, the first being the reference.\n\n\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. 
This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. 
Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and the model-fitting functions treat the two classes\n quite differently.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. 
If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. 
Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. 
It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. (The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 \n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = \n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n## Changing factor reference examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"young\" \"middle\" \"old\" \n```\n:::\n:::\n\n\nOR\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"young\" \"middle\" \"old\" \n```\n:::\n:::\n\n\nArranging, tabulating, and plotting the data will reflect the new order\n\n\n## Two-dimensional data 
classes\n\nTwo-dimensional classes are those we would often use to store data read from a file \n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames, are composed of rows and columns. Unlike a `data.frame`, however, the entire matrix is composed of one R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class).\n\nYou can also create a matrix from scratch using `matrix()`. Use `?matrix` to see the arguments. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\nNotice that the first matrix filled in the numbers 1-6 by columns first and then rows, because the default value of the `byrow` argument is `FALSE`. In the second matrix, we changed the argument `byrow` to `TRUE`, and now the numbers 1-6 are filled by rows first and then columns.\n\n## Data Frame \n\nYou can transform an existing matrix into a data frame using `as.data.frame()`. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n-\t\t`mean()`: takes the mean of x\n-\t\t`sd()`: takes the standard deviation of x\n-\t\t`median()`: takes the median of x\n-\t\t`quantile()`: displays sample quantiles of x. Default is min, quartiles (25%, 50%, 75%), and max\n-\t\t`range()`: displays the range. 
Same as `c(min(), max())`\n-\t\t`sum()`: sum of x\n-\t\t`max()`: maximum value in x\n-\t\t`min()`: minimum value in x\n\nNote: **all have the** `na.rm =` **argument for missing data**\n\n\nArithmetic Mean\n\nDescription:\n\n Generic function for the (trimmed) arithmetic mean.\n\nUsage:\n\n mean(x, ...)\n \n ## Default S3 method:\n mean(x, trim = 0, na.rm = FALSE, ...)\n \nArguments:\n\n x: An R object. Currently there are methods for numeric/logical\n vectors and date, date-time and time interval objects.\n Complex vectors are allowed for 'trim = 0', only.\n\n trim: the fraction (0 to 0.5) of observations to be trimmed from\n each end of 'x' before the mean is computed. Values of trim\n outside that range are taken as the nearest endpoint.\n\n na.rm: a logical evaluating to 'TRUE' or 'FALSE' indicating whether\n 'NA' values should be stripped before the computation\n proceeds.\n\n ...: further arguments passed to or from other methods.\n\nValue:\n\n If 'trim' is zero (the default), the arithmetic mean of the values\n in 'x' is computed, as a numeric or complex vector of length one.\n If 'x' is not logical (coerced to numeric), numeric (including\n integer) or complex, 'NA_real_' is returned, with a warning.\n\n If 'trim' is non-zero, a symmetrically trimmed mean is computed\n with a fraction of 'trim' observations deleted from each end\n before the mean is computed.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'weighted.mean', 'mean.POSIXct', 'colMeans' for row and column\n means.\n\nExamples:\n\n x <- c(0:10, 50)\n xm <- mean(x)\n c(xm, mean(x, trim = 0.10))\n\n\n## Numeric variable data summary examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. :-5.2231 |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. 
: 6.8205 |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA NA\n```\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 15\n```\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.665753\n```\n:::\n:::\n\n\n\n## Character Variable Data Summaries\n\nData summarization on character or factor vectors/variables\n\t\t* `table()`\n\t\t\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. 
See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: The name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. 
Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? 
--- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n\nProportion of observations in each category\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n\n## 
Summary\n\n- Add (or modify) columns/variables in a data frame by using `$` \n- There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `FALSE` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n- ...\n\t\t\n\n## Acknowledgements\n\nThese are the materials I looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "markdown": "---\ntitle: \"Module 7: Variable Creation, Classes, and Summaries\"\nformat:\n revealjs:\n smaller: true\n scrollable: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 7, you should be able to...\n\n- Create new variables\n- Characterize variable classes\n- Manipulate the classes of variables\n- Conduct one-variable data summaries\n\n## Import data for this module\nLet's first read in the data from the previous module and look at it briefly with a new function `head()`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n:::\n:::\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nReturn the First or Last Parts of an Object\n\nDescription:\n\n Returns the first or last parts of a vector, matrix, table, data\n frame or function. 
Since 'head()' and 'tail()' are generic\n functions, they may also have been extended to other classes.\n\nUsage:\n\n head(x, ...)\n ## Default S3 method:\n head(x, n = 6L, ...)\n \n ## S3 method for class 'matrix'\n head(x, n = 6L, ...) # is exported as head.matrix()\n ## NB: The methods for 'data.frame' and 'array' are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n head(x, n = 6L, ...)\n ## S3 method for class 'function'\n head(x, n = 6L, ...)\n \n \n tail(x, ...)\n ## Default S3 method:\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'matrix'\n tail(x, n = 6L, keepnums = TRUE, addrownums, ...) # exported as tail.matrix()\n ## NB: The methods for 'data.frame', 'array', and 'table'\n ## are identical to the 'matrix' one\n \n ## S3 method for class 'ftable'\n tail(x, n = 6L, keepnums = FALSE, addrownums, ...)\n ## S3 method for class 'function'\n tail(x, n = 6L, ...)\n \nArguments:\n\n x: an object\n\n n: an integer vector of length up to 'dim(x)' (or 1, for\n non-dimensioned objects). A 'logical' is silently coerced to\n integer. Values specify the indices to be selected in the\n corresponding dimension (or along the length) of the object.\n A positive value of 'n[i]' includes the first/last 'n[i]'\n indices in that dimension, while a negative value excludes\n the last/first 'abs(n[i])', including all remaining indices.\n 'NA' or non-specified values (when 'length(n) <\n length(dim(x))') select all indices in that dimension. Must\n contain at least one non-missing value.\n\nkeepnums: in each dimension, if no names in that dimension are present,\n create them using the indices included in that dimension.\n Ignored if 'dim(x)' is 'NULL' or its length 1.\n\naddrownums: deprecated - 'keepnums' should be used instead. 
Taken as\n the value of 'keepnums' if it is explicitly set when\n 'keepnums' is not.\n\n ...: arguments to be passed to or from other methods.\n\nDetails:\n\n For vector/array based objects, 'head()' ('tail()') returns a\n subset of the same dimensionality as 'x', usually of the same\n class. For historical reasons, by default they select the first\n (last) 6 indices in the first dimension (\"rows\") or along the\n length of a non-dimensioned vector, and the full extent (all\n indices) in any remaining dimensions. 'head.matrix()' and\n 'tail.matrix()' are exported.\n\n The default and array(/matrix) methods for 'head()' and 'tail()'\n are quite general. They will work as is for any class which has a\n 'dim()' method, a 'length()' method (only required if 'dim()'\n returns 'NULL'), and a '[' method (that accepts the 'drop'\n argument and can subset in all dimensions in the dimensioned\n case).\n\n For functions, the lines of the deparsed function are returned as\n character strings.\n\n When 'x' is an array(/matrix) of dimensionality two and more,\n 'tail()' will add dimnames similar to how they would appear in a\n full printing of 'x' for all dimensions 'k' where 'n[k]' is\n specified and non-missing and 'dimnames(x)[[k]]' (or 'dimnames(x)'\n itself) is 'NULL'. Specifically, the form of the added dimnames\n will vary for different dimensions as follows:\n\n 'k=1' (rows): '\"[n,]\"' (right justified with whitespace padding)\n\n 'k=2' (columns): '\"[,n]\"' (with _no_ whitespace padding)\n\n 'k>2' (higher dims): '\"n\"', i.e., the indices as _character_\n values\n\n Setting 'keepnums = FALSE' suppresses this behaviour.\n\n As 'data.frame' subsetting ('indexing') keeps 'attributes', so do\n the 'head()' and 'tail()' methods for data frames.\n\nValue:\n\n An object (usually) like 'x' but generally smaller. Hence, for\n 'array's, the result corresponds to 'x[.., drop=FALSE]'. 
For\n 'ftable' objects 'x', a transformed 'format(x)'.\n\nNote:\n\n For array inputs the output of 'tail' when 'keepnums' is 'TRUE',\n any dimnames vectors added for dimensions '>2' are the original\n numeric indices in that dimension _as character vectors_. This\n means that, e.g., for 3-dimensional array 'arr', 'tail(arr,\n c(2,2,-1))[ , , 2]' and 'tail(arr, c(2,2,-1))[ , , \"2\"]' may both\n be valid but have completely different meanings.\n\nAuthor(s):\n\n Patrick Burns, improved and corrected by R-Core. Negative argument\n added by Vincent Goulet. Multi-dimension support added by Gabriel\n Becker.\n\nExamples:\n\n head(letters)\n head(letters, n = -6L)\n \n head(freeny.x, n = 10L)\n head(freeny.y)\n \n head(iris3)\n head(iris3, c(6L, 2L))\n head(iris3, c(6L, -1L, 2L))\n \n tail(letters)\n tail(letters, n = -6L)\n \n tail(freeny.x)\n ## the bottom-right \"corner\" :\n tail(freeny.x, n = c(4, 2))\n tail(freeny.y)\n \n tail(iris3)\n tail(iris3, c(6L, 2L))\n tail(iris3, c(6L, -1L, 2L))\n \n ## iris with dimnames stripped\n a3d <- iris3 ; dimnames(a3d) <- NULL\n tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!\n tail(a3d, c(6, -1, 2), keepnums = FALSE)\n \n ## data frame w/ a (non-standard) attribute:\n treeS <- structure(trees, foo = \"bar\")\n (n <- nrow(treeS))\n stopifnot(exprs = { # attribute is kept\n identical(htS <- head(treeS), treeS[1:6, ])\n identical(attr(htS, \"foo\") , \"bar\")\n identical(tlS <- tail(treeS), treeS[(n-5):n, ])\n ## BUT if I use \"useAttrib(.)\", this is *not* ok, when n is of length 2:\n ## --- because [i,j]-indexing of data frames *also* drops \"other\" attributes ..\n identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )\n })\n \n tail(library) # last lines of function\n \n head(stats::ftable(Titanic))\n \n ## 1d-array (with named dim) :\n a1 <- array(1:7, 7); names(dim(a1)) <- \"O2\"\n stopifnot(exprs = {\n identical( tail(a1, 10), a1)\n identical( head(a1, 10), a1)\n identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] 
in R <= 3.6.x\n identical( tail(a1, 2), a1[6:7])\n identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x\n })\n\n\n\n## Adding new columns\n\nYou can add a new column called `log_IgG` to `df` using the `$` operator:\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$log_IgG <- log(df$IgG_concentration)\nhead(df,3)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|\n|--------------:|-----------------:|---:|:------|:--------|---------:|\n| 5772| 0.3176895| 2|Female |Non slum | -1.146681|\n| 8095| 3.4368231| 4|Female |Non slum | 1.234547|\n| 9784| 0.3000000| 4|Male |Non slum | -1.203973|\n:::\n:::\n\n\nNote the use of an underscore in the variable name rather than a space. This is good coding practice and makes calling variables much less prone to error.\n\n## Adding new columns\n\nWe can also add a new column using the `transform()` function:\n\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\nTransform an Object, for Example a Data Frame\n\nDescription:\n\n 'transform' is a generic function, which-at least currently-only\n does anything useful with data frames. 'transform.default'\n converts its first argument to a data frame if possible and calls\n 'transform.data.frame'.\n\nUsage:\n\n transform(`_data`, ...)\n \nArguments:\n\n _data: The object to be transformed\n\n ...: Further arguments of the form 'tag=value'\n\nDetails:\n\n The '...' arguments to 'transform.data.frame' are tagged vector\n expressions, which are evaluated in the data frame '_data'. 
The\n tags are matched against 'names(_data)', and for those that match,\n the value replace the corresponding variable in '_data', and the\n others are appended to '_data'.\n\nValue:\n\n The modified value of '_data'.\n\nWarning:\n\n This is a convenience function intended for use interactively.\n For programming it is better to use the standard subsetting\n arithmetic functions, and in particular the non-standard\n evaluation of argument 'transform' can have unanticipated\n consequences.\n\nNote:\n\n If some of the values are not vectors of the appropriate length,\n you deserve whatever you get!\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'within' for a more flexible approach, 'subset', 'list',\n 'data.frame'\n\nExamples:\n\n transform(airquality, Ozone = -Ozone)\n transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)\n \n attach(airquality)\n transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...\n detach(airquality)\n```\n:::\n:::\n\n\nFor example, adding a binary column for seropositivity called `seropos`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- transform(df, seropos = IgG_concentration >= 10)\nhead(df)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |\n:::\n:::\n\n\n\n## Creating conditional variables\n\nOne frequently-used tool is creating variables with conditions. 
A general function for creating new variables based on existing variables is the Base R `ifelse()` function, which \"returns a value depending on whether the element of test is `TRUE` or `FALSE`.\"\n\n\nConditional Element Selection\n\nDescription:\n\n 'ifelse' returns a value with the same shape as 'test' which is\n filled with elements selected from either 'yes' or 'no' depending\n on whether the element of 'test' is 'TRUE' or 'FALSE'.\n\nUsage:\n\n ifelse(test, yes, no)\n \nArguments:\n\n test: an object which can be coerced to logical mode.\n\n yes: return values for true elements of 'test'.\n\n no: return values for false elements of 'test'.\n\nDetails:\n\n If 'yes' or 'no' are too short, their elements are recycled.\n 'yes' will be evaluated if and only if any element of 'test' is\n true, and analogously for 'no'.\n\n Missing values in 'test' give missing values in the result.\n\nValue:\n\n A vector of the same length and attributes (including dimensions\n and '\"class\"') as 'test' and data values from the values of 'yes'\n or 'no'. 
The mode of the answer will be coerced from logical to\n accommodate first any values taken from 'yes' and then any values\n taken from 'no'.\n\nWarning:\n\n The mode of the result may depend on the value of 'test' (see the\n examples), and the class attribute (see 'oldClass') of the result\n is taken from 'test' and may be inappropriate for the values\n selected from 'yes' and 'no'.\n\n Sometimes it is better to use a construction such as\n\n (tmp <- yes; tmp[!test] <- no[!test]; tmp)\n \n , possibly extended to handle missing values in 'test'.\n\n Further note that 'if(test) yes else no' is much more efficient\n and often much preferable to 'ifelse(test, yes, no)' whenever\n 'test' is a simple true/false result, i.e., when 'length(test) ==\n 1'.\n\n The 'srcref' attribute of functions is handled specially: if\n 'test' is a simple true result and 'yes' evaluates to a function\n with 'srcref' attribute, 'ifelse' returns 'yes' including its\n attribute (the same applies to a false 'test' and 'no' argument).\n This functionality is only for backwards compatibility, the form\n 'if(test) yes else no' should be used whenever 'yes' and 'no' are\n functions.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'if'.\n\nExamples:\n\n x <- c(6:-4)\n sqrt(x) #- gives warning\n sqrt(ifelse(x >= 0, x, NA)) # no warning\n \n ## Note: the following also gives the warning !\n ifelse(x >= 0, sqrt(x), NA)\n \n \n ## ifelse() strips attributes\n ## This is important when working with Dates and factors\n x <- seq(as.Date(\"2000-02-29\"), as.Date(\"2004-10-04\"), by = \"1 month\")\n ## has many \"yyyy-mm-29\", but a few \"yyyy-03-01\" in the non-leap years\n y <- ifelse(as.POSIXlt(x)$mday == 29, x, NA)\n head(y) # not what you expected ... 
==> need restore the class attribute:\n class(y) <- class(x)\n y\n ## This is a (not atypical) case where it is better *not* to use ifelse(),\n ## but rather the more efficient and still clear:\n y2 <- x\n y2[as.POSIXlt(x)$mday != 29] <- NA\n ## which gives the same as ifelse()+class() hack:\n stopifnot(identical(y2, y))\n \n \n ## example of different return modes (and 'test' alone determining length):\n yes <- 1:3\n no <- pi^(1:4)\n utils::str( ifelse(NA, yes, no) ) # logical, length 1\n utils::str( ifelse(TRUE, yes, no) ) # integer, length 1\n utils::str( ifelse(FALSE, yes, no) ) # double, length 1\n\n\n\n## `ifelse` example\n\nReminder: the three arguments of the `ifelse()` function are `ifelse(test, yes, no)`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \"old\")\nhead(df)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | log_IgG|seropos |age_group |\n|--------------:|-----------------:|---:|:------|:--------|----------:|:-------|:---------|\n| 5772| 0.3176895| 2|Female |Non slum | -1.1466807|FALSE |young |\n| 8095| 3.4368231| 4|Female |Non slum | 1.2345475|FALSE |young |\n| 9784| 0.3000000| 4|Male |Non slum | -1.2039728|FALSE |young |\n| 9338| 143.2363014| 4|Male |Non slum | 4.9644957|TRUE |young |\n| 6369| 0.4476534| 1|Male |Non slum | -0.8037359|FALSE |young |\n| 6885| 0.0252708| 4|Male |Non slum | -3.6781074|FALSE |young |\n:::\n:::\n\n\nLet's delve into what is actually happening, with a focus on the NA values in the `age` variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age <= 5\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE\n [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n [25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n [49] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 
TRUE TRUE TRUE\n [61] TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n [73] FALSE TRUE TRUE TRUE NA TRUE TRUE TRUE FALSE FALSE FALSE FALSE\n [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n [97] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[109] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE NA TRUE TRUE\n[121] NA TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[133] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[145] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[157] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[169] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE\n[181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE\n[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[205] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[217] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[229] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[241] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[253] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[265] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE\n[277] FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[289] TRUE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[313] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[325] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE\n[337] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE\n[349] FALSE NA FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE\n[361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE\n[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE\n[385] TRUE TRUE TRUE 
TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[397] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[409] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[421] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[433] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[445] FALSE FALSE TRUE TRUE TRUE TRUE NA NA TRUE TRUE TRUE TRUE\n[457] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[469] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[481] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE\n[493] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE\n[505] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE\n[517] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[529] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[541] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[553] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[565] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[577] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[589] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[601] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[613] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE\n[625] FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE\n[637] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE NA FALSE FALSE FALSE\n[649] FALSE FALSE FALSE\n```\n:::\n\n```{.r .cell-code}\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n|age/ | old| young| NA|\n|:----|---:|-----:|--:|\n|1 | 0| 44| 0|\n|2 | 0| 72| 0|\n|3 | 0| 79| 0|\n|4 | 0| 80| 0|\n|5 | 0| 41| 0|\n|6 | 38| 0| 0|\n|7 | 38| 0| 0|\n|8 | 39| 0| 0|\n|9 | 20| 0| 0|\n|10 | 44| 0| 0|\n|11 | 41| 0| 0|\n|12 | 23| 0| 0|\n|13 | 
35| 0| 0|\n|14 | 37| 0| 0|\n|15 | 11| 0| 0|\n|NA | 0| 0| 9|\n:::\n:::\n\n\n## Nesting `ifelse` statements example\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\ntable(df$age, df$age_group, useNA=\"always\", dnn=list(\"age\", \"\"))\n```\n\n::: {.cell-output-display}\n|age/ | middle| old| young| NA|\n|:----|------:|---:|-----:|--:|\n|1 | 0| 0| 44| 0|\n|2 | 0| 0| 72| 0|\n|3 | 0| 0| 79| 0|\n|4 | 0| 0| 80| 0|\n|5 | 0| 0| 41| 0|\n|6 | 38| 0| 0| 0|\n|7 | 38| 0| 0| 0|\n|8 | 39| 0| 0| 0|\n|9 | 20| 0| 0| 0|\n|10 | 44| 0| 0| 0|\n|11 | 0| 41| 0| 0|\n|12 | 0| 23| 0| 0|\n|13 | 0| 35| 0| 0|\n|14 | 0| 37| 0| 0|\n|15 | 0| 11| 0| 0|\n|NA | 0| 0| 0| 9|\n:::\n:::\n\n\nNote that R lists the variable levels in alphabetical order; we will show how to change this later.\n\n# Data Classes\n\n## Overview - Data Classes\n\n1. One dimensional types (i.e., vectors of characters, numeric, logical, or factor values)\n\n2. Two dimensional types (e.g., matrix, data frame, tibble)\n\n3. Special data classes (e.g., lists, dates). \n\n## `class()` function\n\nThe `class()` function allows you to evaluate the class of an object.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"numeric\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\n\n\n## One dimensional data types\n\n* Character: strings or individual characters, quoted\n* Numeric: any real number(s)\n - Double: contains fractional values (i.e., double precision) - default numeric\n - Integer: any integer(s)/whole numbers\n* Logical: variables composed of TRUE or FALSE\n* Factor: categorical/qualitative variables\n\n## Character and numeric\n\nThis can also be a bit tricky. 
\n\nIf even one element of a vector is a character, R treats the whole vector as character.\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(1, 2, \"tree\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\n\nHere, because the integers are in quotes, R reads them as character values.\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(\"1\", \"4\", \"7\")) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n:::\n\n\nNote, instead of creating a new vector object (e.g., `x <- c(\"1\", \"4\", \"7\")`) and then feeding the vector object `x` into the first argument of the `class()` function (e.g., `class(x)`), we combined the two steps and directly fed a vector object into the `class()` function.\n\n## Numeric Subclasses\n\nThere are two major numeric subclasses:\n\n1. `Double` is a special subset of `numeric` that contains fractional values. `Double` stands for [double-precision](https://en.wikipedia.org/wiki/Double-precision_floating-point_format)\n2. `Integer` is a special subset of `numeric` that contains only whole numbers. \n\n`typeof()` identifies the vector type (double, integer, logical, or character), whereas `class()` identifies the root class. 
The difference between the two will become clearer when we look at two dimensional classes below.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"numeric\"\n```\n:::\n\n```{.r .cell-code}\nclass(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n\n```{.r .cell-code}\ntypeof(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"double\"\n```\n:::\n\n```{.r .cell-code}\ntypeof(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"integer\"\n```\n:::\n:::\n\n\n\n## Logical\n\nReminder: `logical` is a type with only three possible values: `TRUE`, `FALSE`, and `NA`.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(c(TRUE, FALSE, TRUE, TRUE, FALSE))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"logical\"\n```\n:::\n:::\n\n\nNote that when creating a `logical` object, `TRUE` and `FALSE` are NOT in quotes. Putting R special values (e.g., `NA` or `FALSE`) in quotes turns them into character values. \n\n\n## Other useful functions for evaluating/setting classes\n\nThere are two useful functions associated with practically all R classes: \n\n- `is.CLASS_NAME(x)` to **logically check** whether or not `x` is of a certain class. For example, `is.integer` or `is.character` or `is.numeric`\n- `as.CLASS_NAME(x)` to **coerce** `x` from its current class into another class. For example, `as.integer` or `as.character` or `as.numeric`. 
This is particularly useful if, say, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).\n\n## Examples `is.CLASS_NAME(x)`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nis.numeric(df$IgG_concentration)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n\n```{.r .cell-code}\nis.character(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] FALSE\n```\n:::\n\n```{.r .cell-code}\nis.character(df$gender)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE\n```\n:::\n:::\n\n\n## Examples `as.CLASS_NAME(x)`\n\nIn some cases, coercing is seamless:\n\n::: {.cell}\n\n```{.r .cell-code}\nas.character(c(1, 4, 7))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"1\" \"4\" \"7\"\n```\n:::\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 4 7\n```\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"FALSE\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE FALSE\n```\n:::\n:::\n\n\nIn other cases, coercion is not possible; if executed, it will return `NA`:\n\n::: {.cell}\n\n```{.r .cell-code}\nas.numeric(c(\"1\", \"4\", \"7a\"))\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning: NAs introduced by coercion\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 4 NA\n```\n:::\n\n```{.r .cell-code}\nas.logical(c(\"TRUE\", \"FALSE\", \"UNKNOWN\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] TRUE FALSE NA\n```\n:::\n:::\n\n\n\n## Factors\n\nA `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables. Use the `factor()` function to create factors from character values. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nclass(df$age_group)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"character\"\n```\n:::\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group)\nclass(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"factor\"\n```\n:::\n\n```{.r .cell-code}\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"middle\" \"old\" \"young\" \n```\n:::\n:::\n\n\nNote 1: levels are, by default, set in **alphanumerical** order, and the first level is always the \"reference\" group. However, we often prefer a different reference group.\n\nNote 2: we can also make ordered factors using `factor(..., ordered=TRUE)`, but we won't talk more about that.\n\n## Reference Groups \n\n**Why do we care about reference groups?** \n\nGeneralized linear regression allows you to compare the outcome of two or more groups. Your reference group is the group that everything else is compared to. Say we want to assess whether being 5 years old or younger is associated with higher IgG antibody concentrations.\n\nBy default, `middle` is the reference group, so we will only generate beta coefficients comparing `middle` to `young` AND `middle` to `old`. But we want `young` to be the reference group, so that we generate beta coefficients comparing `young` to `middle` AND `young` to `old`.\n\n## Changing factor reference \n\nChanging the reference group of a factor variable.\n\n- If the object is already a factor, then use the `relevel()` function and the `ref` argument to specify the reference.\n- If the object is a character, then use the `factor()` function and the `levels` argument to specify the order of the values, the first being the reference.\n\n\nLet's look at the `relevel()` help file\n\nReorder Levels of Factor\n\nDescription:\n\n The levels of a factor are re-ordered so that the level specified\n by 'ref' is first and the others are moved down. 
This is useful\n for 'contr.treatment' contrasts which take the first level as the\n reference.\n\nUsage:\n\n relevel(x, ref, ...)\n \nArguments:\n\n x: an unordered factor.\n\n ref: the reference level, typically a string.\n\n ...: additional arguments for future methods.\n\nDetails:\n\n This, as 'reorder()', is a special case of simply calling\n 'factor(x, levels = levels(x)[....])'.\n\nValue:\n\n A factor of the same length as 'x'.\n\nSee Also:\n\n 'factor', 'contr.treatment', 'levels', 'reorder'.\n\nExamples:\n\n warpbreaks$tension <- relevel(warpbreaks$tension, ref = \"M\")\n summary(lm(breaks ~ wool + tension, data = warpbreaks))\n\n\n
\n\nLet's look at the `factor()` help file\n\nFactors\n\nDescription:\n\n The function 'factor' is used to encode a vector as a factor (the\n terms 'category' and 'enumerated type' are also used for factors).\n If argument 'ordered' is 'TRUE', the factor levels are assumed to\n be ordered. For compatibility with S there is also a function\n 'ordered'.\n\n 'is.factor', 'is.ordered', 'as.factor' and 'as.ordered' are the\n membership and coercion functions for these classes.\n\nUsage:\n\n factor(x = character(), levels, labels = levels,\n exclude = NA, ordered = is.ordered(x), nmax = NA)\n \n ordered(x = character(), ...)\n \n is.factor(x)\n is.ordered(x)\n \n as.factor(x)\n as.ordered(x)\n \n addNA(x, ifany = FALSE)\n \n .valid.factor(object)\n \nArguments:\n\n x: a vector of data, usually taking a small number of distinct\n values.\n\n levels: an optional vector of the unique values (as character\n strings) that 'x' might have taken. The default is the\n unique set of values taken by 'as.character(x)', sorted into\n increasing order _of 'x'_. Note that this set can be\n specified as smaller than 'sort(unique(x))'.\n\n labels: _either_ an optional character vector of labels for the\n levels (in the same order as 'levels' after removing those in\n 'exclude'), _or_ a character string of length 1. Duplicated\n values in 'labels' can be used to map different values of 'x'\n to the same factor level.\n\n exclude: a vector of values to be excluded when forming the set of\n levels. This may be factor with the same level set as 'x' or\n should be a 'character'.\n\n ordered: logical flag to determine if the levels should be regarded as\n ordered (in the order given).\n\n nmax: an upper bound on the number of levels; see 'Details'.\n\n ...: (in 'ordered(.)'): any of the above, apart from 'ordered'\n itself.\n\n ifany: only add an 'NA' level if it is used, i.e. 
if\n 'any(is.na(x))'.\n\n object: an R object.\n\nDetails:\n\n The type of the vector 'x' is not restricted; it only must have an\n 'as.character' method and be sortable (by 'order').\n\n Ordered factors differ from factors only in their class, but\n methods and the model-fitting functions treat the two classes\n quite differently.\n\n The encoding of the vector happens as follows. First all the\n values in 'exclude' are removed from 'levels'. If 'x[i]' equals\n 'levels[j]', then the 'i'-th element of the result is 'j'. If no\n match is found for 'x[i]' in 'levels' (which will happen for\n excluded values) then the 'i'-th element of the result is set to\n 'NA'.\n\n Normally the 'levels' used as an attribute of the result are the\n reduced set of levels after removing those in 'exclude', but this\n can be altered by supplying 'labels'. This should either be a set\n of new labels for the levels, or a character string, in which case\n the levels are that character string with a sequence number\n appended.\n\n 'factor(x, exclude = NULL)' applied to a factor without 'NA's is a\n no-operation unless there are unused levels: in that case, a\n factor with the reduced level set is returned. If 'exclude' is\n used, since R version 3.4.0, excluding non-existing character\n levels is equivalent to excluding nothing, and when 'exclude' is a\n 'character' vector, that _is_ applied to the levels of 'x'.\n Alternatively, 'exclude' can be factor with the same level set as\n 'x' and will exclude the levels present in 'exclude'.\n\n The codes of a factor may contain 'NA'. For a numeric 'x', set\n 'exclude = NULL' to make 'NA' an extra level (prints as '');\n by default, this is the last level.\n\n If 'NA' is a level, the way to set a code to be missing (as\n opposed to the code of the missing level) is to use 'is.na' on the\n left-hand-side of an assignment (as in 'is.na(f)[i] <- TRUE';\n indexing inside 'is.na' does not work). 
Under those circumstances\n missing values are currently printed as '', i.e., identical to\n entries of level 'NA'.\n\n 'is.factor' is generic: you can write methods to handle specific\n classes of objects, see InternalMethods.\n\n Where 'levels' is not supplied, 'unique' is called. Since factors\n typically have quite a small number of levels, for large vectors\n 'x' it is helpful to supply 'nmax' as an upper bound on the number\n of unique values.\n\n When using 'c' to combine a (possibly ordered) factor with other\n objects, if all objects are (possibly ordered) factors, the result\n will be a factor with levels the union of the level sets of the\n elements, in the order the levels occur in the level sets of the\n elements (which means that if all the elements have the same level\n set, that is the level set of the result), equivalent to how\n 'unlist' operates on a list of factor objects.\n\nValue:\n\n 'factor' returns an object of class '\"factor\"' which has a set of\n integer codes the length of 'x' with a '\"levels\"' attribute of\n mode 'character' and unique ('!anyDuplicated(.)') entries. If\n argument 'ordered' is true (or 'ordered()' is used) the result has\n class 'c(\"ordered\", \"factor\")'. Undocumentedly for a long time,\n 'factor(x)' loses all 'attributes(x)' but '\"names\"', and resets\n '\"levels\"' and '\"class\"'.\n\n Applying 'factor' to an ordered or unordered factor returns a\n factor (of the same type) with just the levels which occur: see\n also '[.factor' for a more transparent way to achieve this.\n\n 'is.factor' returns 'TRUE' or 'FALSE' depending on whether its\n argument is of type factor or not. Correspondingly, 'is.ordered'\n returns 'TRUE' when its argument is an ordered factor and 'FALSE'\n otherwise.\n\n 'as.factor' coerces its argument to a factor. 
It is an\n abbreviated (sometimes faster) form of 'factor'.\n\n 'as.ordered(x)' returns 'x' if this is ordered, and 'ordered(x)'\n otherwise.\n\n 'addNA' modifies a factor by turning 'NA' into an extra level (so\n that 'NA' values are counted in tables, for instance).\n\n '.valid.factor(object)' checks the validity of a factor, currently\n only 'levels(object)', and returns 'TRUE' if it is valid,\n otherwise a string describing the validity problem. This function\n is used for 'validObject()'.\n\nWarning:\n\n The interpretation of a factor depends on both the codes and the\n '\"levels\"' attribute. Be careful only to compare factors with the\n same set of levels (in the same order). In particular,\n 'as.numeric' applied to a factor is meaningless, and may happen by\n implicit coercion. To transform a factor 'f' to approximately its\n original numeric values, 'as.numeric(levels(f))[f]' is recommended\n and slightly more efficient than 'as.numeric(as.character(f))'.\n\n The levels of a factor are by default sorted, but the sort order\n may well depend on the locale at the time of creation, and should\n not be assumed to be ASCII.\n\n There are some anomalies associated with factors that have 'NA' as\n a level. It is suggested to use them sparingly, e.g., only for\n tabulation purposes.\n\nComparison operators and group generic methods:\n\n There are '\"factor\"' and '\"ordered\"' methods for the group generic\n 'Ops' which provide methods for the Comparison operators, and for\n the 'min', 'max', and 'range' generics in 'Summary' of\n '\"ordered\"'. 
(The rest of the groups and the 'Math' group\n generate an error as they are not meaningful for factors.)\n\n Only '==' and '!=' can be used for factors: a factor can only be\n compared to another factor with an identical set of levels (not\n necessarily in the same ordering) or to a character vector.\n Ordered factors are compared in the same way, but the general\n dispatch mechanism precludes comparing ordered and unordered\n factors.\n\n All the comparison operators are available for ordered factors.\n Collation is done by the levels of the operands: if both operands\n are ordered factors they must have the same level set.\n\nNote:\n\n In earlier versions of R, storing character data as a factor was\n more space efficient if there is even a small proportion of\n repeats. However, identical character strings now share storage,\n so the difference is small in most cases. (Integer values are\n stored in 4 bytes whereas each reference to a character string\n needs a pointer of 4 or 8 bytes.)\n\nReferences:\n\n Chambers, J. M. and Hastie, T. J. (1992) _Statistical Models in\n S_. Wadsworth & Brooks/Cole.\n\nSee Also:\n\n '[.factor' for subsetting of factors.\n\n 'gl' for construction of balanced factors and 'C' for factors with\n specified contrasts. 'levels' and 'nlevels' for accessing the\n levels, and 'unclass' to get integer codes.\n\nExamples:\n\n (ff <- factor(substring(\"statistics\", 1:10, 1:10), levels = letters))\n as.integer(ff) # the internal codes\n (f. 
<- factor(ff)) # drops the levels that do not occur\n ff[, drop = TRUE] # the same, more transparently\n \n factor(letters[1:20], labels = \"letter\")\n \n class(ordered(4:1)) # \"ordered\", inheriting from \"factor\"\n z <- factor(LETTERS[3:1], ordered = TRUE)\n ## and \"relational\" methods work:\n stopifnot(sort(z)[c(1,3)] == range(z), min(z) < max(z))\n \n \n ## suppose you want \"NA\" as a level, and to allow missing values.\n (x <- factor(c(1, 2, NA), exclude = NULL))\n is.na(x)[2] <- TRUE\n x # [1] 1 <NA> <NA>\n is.na(x)\n # [1] FALSE TRUE FALSE\n \n ## More rational, since R 3.4.0 :\n factor(c(1:2, NA), exclude = \"\" ) # keeps <NA> , as\n factor(c(1:2, NA), exclude = NULL) # always did\n ## exclude = <character>\n z # ordered levels 'A < B < C'\n factor(z, exclude = \"C\") # does exclude\n factor(z, exclude = \"B\") # ditto\n \n ## Now, labels maybe duplicated:\n ## factor() with duplicated labels allowing to \"merge levels\"\n x <- c(\"Man\", \"Male\", \"Man\", \"Lady\", \"Female\")\n ## Map from 4 different values to only two levels:\n (xf <- factor(x, levels = c(\"Male\", \"Man\" , \"Lady\", \"Female\"),\n labels = c(\"Male\", \"Male\", \"Female\", \"Female\")))\n #> [1] Male Male Male Female Female\n #> Levels: Male Female\n \n ## Using addNA()\n Month <- airquality$Month\n table(addNA(Month))\n table(addNA(Month, ifany = TRUE))\n\n\n\n## Changing factor reference examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- relevel(df$age_group_factor, ref=\"young\")\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"young\" \"middle\" \"old\" \n```\n:::\n:::\n\n\nOR\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group_factor <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\nlevels(df$age_group_factor)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"young\" \"middle\" \"old\" \n```\n:::\n:::\n\n\nArranging, tabulating, and plotting the data will reflect the new order\n\n\n## Two-dimensional data 
classes\n\nTwo-dimensional classes are those we would often use to store data read from a file:\n\n* a matrix (`matrix` class)\n* a data frame (`data.frame` or `tibble` classes)\n\n\n## Matrices\n\nMatrices, like data frames, are composed of rows and columns. Unlike a `data.frame`, however, an entire matrix is composed of a single R class. **For example: all entries are `numeric`, or all entries are `character`**\n\n`as.matrix()` creates a matrix from a data frame (where all values are the same class).\n\nYou can also create a matrix from scratch using `matrix()`. Use `?matrix` to see the arguments. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol = 2) \n```\n\n::: {.cell-output-display}\n| | |\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n\n```{.r .cell-code}\nmatrix(data=1:6, ncol=2, byrow=TRUE) \n```\n\n::: {.cell-output-display}\n| | |\n|--:|--:|\n| 1| 2|\n| 3| 4|\n| 5| 6|\n:::\n:::\n\n\nNote that the first matrix is filled with the numbers 1-6 column by column, because the default value of the `byrow` argument is `FALSE`. In the second matrix, we set `byrow` to `TRUE`, so the numbers 1-6 are filled row by row.\n\n## Data frame \n\nYou can transform an existing matrix into a data frame using `as.data.frame()` \n\n\n::: {.cell}\n\n```{.r .cell-code}\nas.data.frame(matrix(1:6, ncol = 2) ) \n```\n\n::: {.cell-output-display}\n| V1| V2|\n|--:|--:|\n| 1| 4|\n| 2| 5|\n| 3| 6|\n:::\n:::\n\n\n\n## Numeric variable data summary\n\nData summarization on numeric vectors/variables:\n\n-\t`mean()`: takes the mean of x\n-\t`sd()`: takes the standard deviation of x\n-\t`median()`: takes the median of x\n-\t`quantile()`: displays sample quantiles of x. Default is min, quartiles, max\n-\t`range()`: displays the range. 
Same as `c(min(), max())`\n-\t`sum()`: sum of x\n-\t`max()`: maximum value in x\n-\t`min()`: minimum value in x\n\nNote, **all have the** `na.rm` **argument for missing data**\n\n\nArithmetic Mean\n\nDescription:\n\n Generic function for the (trimmed) arithmetic mean.\n\nUsage:\n\n mean(x, ...)\n \n ## Default S3 method:\n mean(x, trim = 0, na.rm = FALSE, ...)\n \nArguments:\n\n x: An R object. Currently there are methods for numeric/logical\n vectors and date, date-time and time interval objects.\n Complex vectors are allowed for 'trim = 0', only.\n\n trim: the fraction (0 to 0.5) of observations to be trimmed from\n each end of 'x' before the mean is computed. Values of trim\n outside that range are taken as the nearest endpoint.\n\n na.rm: a logical evaluating to 'TRUE' or 'FALSE' indicating whether\n 'NA' values should be stripped before the computation\n proceeds.\n\n ...: further arguments passed to or from other methods.\n\nValue:\n\n If 'trim' is zero (the default), the arithmetic mean of the values\n in 'x' is computed, as a numeric or complex vector of length one.\n If 'x' is not logical (coerced to numeric), numeric (including\n integer) or complex, 'NA_real_' is returned, with a warning.\n\n If 'trim' is non-zero, a symmetrically trimmed mean is computed\n with a fraction of 'trim' observations deleted from each end\n before the mean is computed.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'weighted.mean', 'mean.POSIXct', 'colMeans' for row and column\n means.\n\nExamples:\n\n x <- c(0:10, 50)\n xm <- mean(x)\n c(xm, mean(x, trim = 0.10))\n\n\n## Numeric variable data summary examples\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(df)\n```\n\n::: {.cell-output-display}\n| |observation_id |IgG_concentration | age | gender | slum | log_IgG | seropos | age_group |age_group_factor |\n|:--|:--------------|:-----------------|:--------------|:----------------|:----------------|:---------------|:-------------|:----------------|:----------------|\n| |Min. :5006 |Min. : 0.0054 |Min. : 1.000 |Length:651 |Length:651 |Min. :-5.2231 |Mode :logical |Length:651 |young :316 |\n| |1st Qu.:6306 |1st Qu.: 0.3000 |1st Qu.: 3.000 |Class :character |Class :character |1st Qu.:-1.2040 |FALSE:360 |Class :character |middle:179 |\n| |Median :7495 |Median : 1.6658 |Median : 6.000 |Mode :character |Mode :character |Median : 0.5103 |TRUE :281 |Mode :character |old :147 |\n| |Mean :7492 |Mean : 87.3683 |Mean : 6.606 |NA |NA |Mean : 1.6074 |NA's :10 |NA |NA's : 9 |\n| |3rd Qu.:8749 |3rd Qu.:141.4405 |3rd Qu.:10.000 |NA |NA |3rd Qu.: 4.9519 |NA |NA |NA |\n| |Max. :9982 |Max. :916.4179 |Max. :15.000 |NA |NA |Max. 
: 6.8205 |NA |NA |NA |\n| |NA |NA's :10 |NA's :9 |NA |NA |NA's :10 |NA |NA |NA |\n:::\n\n```{.r .cell-code}\nrange(df$age)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA NA\n```\n:::\n\n```{.r .cell-code}\nrange(df$age, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1 15\n```\n:::\n\n```{.r .cell-code}\nmedian(df$IgG_concentration, na.rm=TRUE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 1.665753\n```\n:::\n:::\n\n\n\n## Character variable data summaries\n\nData summarization on character or factor vectors/variables using `table()`\n\t\t\n\nCross Tabulation and Table Creation\n\nDescription:\n\n 'table' uses cross-classifying factors to build a contingency\n table of the counts at each combination of factor levels.\n\nUsage:\n\n table(...,\n exclude = if (useNA == \"no\") c(NA, NaN),\n useNA = c(\"no\", \"ifany\", \"always\"),\n dnn = list.names(...), deparse.level = 1)\n \n as.table(x, ...)\n is.table(x)\n \n ## S3 method for class 'table'\n as.data.frame(x, row.names = NULL, ...,\n responseName = \"Freq\", stringsAsFactors = TRUE,\n sep = \"\", base = list(LETTERS))\n \nArguments:\n\n ...: one or more objects which can be interpreted as factors\n (including numbers or character strings), or a 'list' (such\n as a data frame) whose components can be so interpreted.\n (For 'as.table', arguments passed to specific methods; for\n 'as.data.frame', unused.)\n\n exclude: levels to remove for all factors in '...'. If it does not\n contain 'NA' and 'useNA' is not specified, it implies 'useNA\n = \"ifany\"'. See 'Details' for its interpretation for\n non-factor arguments.\n\n useNA: whether to include 'NA' values in the table. See 'Details'.\n Can be abbreviated.\n\n dnn: the names to be given to the dimensions in the result (the\n _dimnames names_).\n\ndeparse.level: controls how the default 'dnn' is constructed. 
See\n 'Details'.\n\n x: an arbitrary R object, or an object inheriting from class\n '\"table\"' for the 'as.data.frame' method. Note that\n 'as.data.frame.table(x, *)' may be called explicitly for\n non-table 'x' for \"reshaping\" 'array's.\n\nrow.names: a character vector giving the row names for the data frame.\n\nresponseName: The name to be used for the column of table entries,\n usually counts.\n\nstringsAsFactors: logical: should the classifying factors be returned\n as factors (the default) or character vectors?\n\nsep, base: passed to 'provideDimnames'.\n\nDetails:\n\n If the argument 'dnn' is not supplied, the internal function\n 'list.names' is called to compute the 'dimname names' as follows:\n If '...' is one 'list' with its own 'names()', these 'names' are\n used. Otherwise, if the arguments in '...' are named, those names\n are used. For the remaining arguments, 'deparse.level = 0' gives\n an empty name, 'deparse.level = 1' uses the supplied argument if\n it is a symbol, and 'deparse.level = 2' will deparse the argument.\n\n Only when 'exclude' is specified (i.e., not by default) and\n non-empty, will 'table' potentially drop levels of factor\n arguments.\n\n 'useNA' controls if the table includes counts of 'NA' values: the\n allowed values correspond to never ('\"no\"'), only if the count is\n positive ('\"ifany\"') and even for zero counts ('\"always\"'). Note\n the somewhat \"pathological\" case of two different kinds of 'NA's\n which are treated differently, depending on both 'useNA' and\n 'exclude', see 'd.patho' in the 'Examples:' below.\n\n Both 'exclude' and 'useNA' operate on an \"all or none\" basis. If\n you want to control the dimensions of a multiway table separately,\n modify each argument using 'factor' or 'addNA'.\n\n Non-factor arguments 'a' are coerced via 'factor(a,\n exclude=exclude)'. 
Since R 3.4.0, care is taken _not_ to count\n the excluded values (where they were included in the 'NA' count,\n previously).\n\n The 'summary' method for class '\"table\"' (used for objects created\n by 'table' or 'xtabs') which gives basic information and performs\n a chi-squared test for independence of factors (note that the\n function 'chisq.test' currently only handles 2-d tables).\n\nValue:\n\n 'table()' returns a _contingency table_, an object of class\n '\"table\"', an array of integer values. Note that unlike S the\n result is always an 'array', a 1D array if one factor is given.\n\n 'as.table' and 'is.table' coerce to and test for contingency\n table, respectively.\n\n The 'as.data.frame' method for objects inheriting from class\n '\"table\"' can be used to convert the array-based representation of\n a contingency table to a data frame containing the classifying\n factors and the corresponding entries (the latter as component\n named by 'responseName'). This is the inverse of 'xtabs'.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'tabulate' is the underlying function and allows finer control.\n\n Use 'ftable' for printing (and more) of multidimensional tables.\n 'margin.table', 'prop.table', 'addmargins'.\n\n 'addNA' for constructing factors with 'NA' as a level.\n\n 'xtabs' for cross tabulation of data frames with a formula\n interface.\n\nExamples:\n\n require(stats) # for rpois and xtabs\n ## Simple frequency distribution\n table(rpois(100, 5))\n ## Check the design:\n with(warpbreaks, table(wool, tension))\n table(state.division, state.region)\n \n # simple two-way contingency table\n with(airquality, table(cut(Temp, quantile(Temp)), Month))\n \n a <- letters[1:3]\n table(a, sample(a)) # dnn is c(\"a\", \"\")\n table(a, sample(a), deparse.level = 0) # dnn is c(\"\", \"\")\n table(a, sample(a), deparse.level = 2) # dnn is c(\"a\", \"sample(a)\")\n \n ## xtabs() <-> as.data.frame.table() :\n UCBAdmissions ## already a contingency table\n DF <- as.data.frame(UCBAdmissions)\n class(tab <- xtabs(Freq ~ ., DF)) # xtabs & table\n ## tab *is* \"the same\" as the original table:\n all(tab == UCBAdmissions)\n all.equal(dimnames(tab), dimnames(UCBAdmissions))\n \n a <- rep(c(NA, 1/0:3), 10)\n table(a) # does not report NA's\n table(a, exclude = NULL) # reports NA's\n b <- factor(rep(c(\"A\",\"B\",\"C\"), 10))\n table(b)\n table(b, exclude = \"B\")\n d <- factor(rep(c(\"A\",\"B\",\"C\"), 10), levels = c(\"A\",\"B\",\"C\",\"D\",\"E\"))\n table(d, exclude = \"B\")\n print(table(b, d), zero.print = \".\")\n \n ## NA counting:\n is.na(d) <- 3:4\n d. <- addNA(d)\n d.[1:7]\n table(d.) # \", exclude = NULL\" is not needed\n ## i.e., if you want to count the NA's of 'd', use\n table(d, useNA = \"ifany\")\n \n ## \"pathological\" case:\n d.patho <- addNA(c(1,NA,1:2,1:3))[-7]; is.na(d.patho) <- 3:4\n d.patho\n ## just 3 consecutive NA's ? 
--- well, have *two* kinds of NAs here :\n as.integer(d.patho) # 1 4 NA NA 1 2\n ##\n ## In R >= 3.4.0, table() allows to differentiate:\n table(d.patho) # counts the \"unusual\" NA\n table(d.patho, useNA = \"ifany\") # counts all three\n table(d.patho, exclude = NULL) # (ditto)\n table(d.patho, exclude = NA) # counts none\n \n ## Two-way tables with NA counts. The 3rd variant is absurd, but shows\n ## something that cannot be done using exclude or useNA.\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"ifany\"))\n with(airquality,\n table(OzHi = Ozone > 80, Month, useNA = \"always\"))\n with(airquality,\n table(OzHi = Ozone > 80, addNA(Month)))\n\n\n\n## Character variable data summary examples\n\nNumber of observations in each category\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)\n```\n\n::: {.cell-output-display}\n| Female| Male|\n|------:|----:|\n| 325| 326|\n:::\n\n```{.r .cell-code}\ntable(df$gender, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n| Female| Male| NA|\n|------:|----:|--:|\n| 325| 326| 0|\n:::\n\n```{.r .cell-code}\ntable(df$age_group, useNA=\"always\")\n```\n\n::: {.cell-output-display}\n| middle| old| young| NA|\n|------:|---:|-----:|--:|\n| 179| 147| 316| 9|\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntable(df$gender)/nrow(df) #if no NA values\n```\n\n::: {.cell-output-display}\n| Female| Male|\n|--------:|--------:|\n| 0.499232| 0.500768|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values\n```\n\n::: {.cell-output-display}\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n\n```{.r .cell-code}\ntable(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values\n```\n\n::: {.cell-output-display}\n| middle| old| young|\n|---------:|--------:|---------:|\n| 0.2788162| 0.228972| 0.4922118|\n:::\n:::\n\n\n\n## Summary\n\n- Adding (or modifying) columns/variable to a data frame by using `$` \n- 
There are two types of numeric class objects: integer and double\n- Logical class objects only have `TRUE` or `FALSE` (without quotes)\n- `is.CLASS_NAME(x)` can be used to test the class of an object x\n- `as.CLASS_NAME(x)` can be used to change the class of an object x\n- Factors are a special character class that has levels \n\t\t\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json b/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json new file mode 100644 index 0000000..90dab95 --- /dev/null +++ b/_freeze/modules/Module08-DataMergeReshape/execute-results/html.json @@ -0,0 +1,18 @@ +{ + "hash": "a098429eb85b4995ea1507646b4860d7", + "result": { + "markdown": "---\ntitle: \"Module 8: Data Merging and Reshaping\"\nformat:\n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 8, you should be able to...\n\n- Merge/join data together\n- Reshape data from wide to long\n- Reshape data from long to wide\n\n## Joining types\n\nPay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. 
This will depend on the type of merge:\n\n- 1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)\n- 1:m merge (one-to-many merge) – More complex (things often go wrong)\n - The \"one\" means that one dataset has each value of the merging variable (e.g., id) represented once, and the \"many\" means that the other dataset has values of the merging variable represented multiple times\n- m:m merge (many-to-many merge) – Danger zone (can be unpredictable)\n \n\n## one-to-one merge\n\n- This means that each row of data represents a unique unit of analysis that exists in another dataset (e.g., an id variable)\n- The other dataset will likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)\n- Each value of the merging variable (e.g., id) is represented a single time in each dataset\n- You should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.\n\n## `merge()` function\n\nWe will use the `merge()` function to conduct a one-to-one merge\n\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nMerge Two Data Frames\n\nDescription:\n\n Merge two data frames by common columns or row names, or do other\n versions of database _join_ operations.\n\nUsage:\n\n merge(x, y, ...)\n \n ## Default S3 method:\n merge(x, y, ...)\n \n ## S3 method for class 'data.frame'\n merge(x, y, by = intersect(names(x), names(y)),\n by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,\n sort = TRUE, suffixes = c(\".x\",\".y\"), no.dups = TRUE,\n incomparables = NULL, ...)\n \nArguments:\n\n x, y: data frames, or objects to be coerced to one.\n\nby, by.x, by.y: specifications of the columns used for merging. See\n 'Details'.\n\n all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =\n L', where 'L' is either 'TRUE' or 'FALSE'.\n\n all.x: logical; if 'TRUE', then extra rows will be added to the\n output, one for each row in 'x' that has no matching row in\n 'y'. 
These rows will have 'NA's in those columns that are\n usually filled with values from 'y'. The default is 'FALSE',\n so that only rows with data from both 'x' and 'y' are\n included in the output.\n\n all.y: logical; analogous to 'all.x'.\n\n sort: logical. Should the result be sorted on the 'by' columns?\n\nsuffixes: a character vector of length 2 specifying the suffixes to be\n used for making unique the names of columns in the result\n which are not used for merging (appearing in 'by' etc).\n\n no.dups: logical indicating that 'suffixes' are appended in more cases\n to avoid duplicated column names in the result. This was\n implicitly false before R version 3.5.0.\n\nincomparables: values which cannot be matched. See 'match'. This is\n intended to be used for merging on one column, so these are\n incomparable values of that column.\n\n ...: arguments to be passed to or from methods.\n\nDetails:\n\n 'merge' is a generic function whose principal method is for data\n frames: the default method coerces its arguments to data frames\n and calls the '\"data.frame\"' method.\n\n By default the data frames are merged on the columns with names\n they both have, but separate specifications of the columns can be\n given by 'by.x' and 'by.y'. The rows in the two data frames that\n match on the specified columns are extracted, and joined together.\n If there is more than one match, all possible matches contribute\n one row each. For the precise meaning of 'match', see 'match'.\n\n Columns to merge on can be specified by name, number or by a\n logical vector: the name '\"row.names\"' or the number '0' specifies\n the row names. 
If specified by name it must correspond uniquely\n to a named column in the input.\n\n If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero\n vector or 'NULL'), the result, 'r', is the _Cartesian product_ of\n 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +\n ncol(y))'.\n\n If 'all.x' is true, all the non matching cases of 'x' are appended\n to the result as well, with 'NA' filled in the corresponding\n columns of 'y'; analogously for 'all.y'.\n\n If the columns in the data frames not used in merging have any\n common names, these have 'suffixes' ('\".x\"' and '\".y\"' by default)\n appended to try to make the names of the result unique. If this\n is not possible, an error is thrown.\n\n If a 'by.x' column name matches one of 'y', and if 'no.dups' is\n true (as by default), the y version gets suffixed as well,\n avoiding duplicate column names in the result.\n\n The complexity of the algorithm used is proportional to the length\n of the answer.\n\n In SQL database terminology, the default value of 'all = FALSE'\n gives a _natural join_, a special case of an _inner join_.\n Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =\n TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)\n outer join_. DBMSes do not match 'NULL' records, equivalent to\n 'incomparables = NA' in R.\n\nValue:\n\n A data frame. The rows are by default lexicographically sorted on\n the common columns, but for 'sort = FALSE' are in an unspecified\n order. The columns are the common columns followed by the\n remaining columns in 'x' and then those in 'y'. 
If the matching\n involved row names, an extra character column called 'Row.names'\n is added at the left, and in all cases the result has 'automatic'\n row names.\n\nNote:\n\n This is intended to work with data frames with vector-like\n columns: some aspects work with data frames containing matrices,\n but not all.\n\n Currently long vectors are not accepted for inputs, which are thus\n restricted to less than 2^31 rows. That restriction also applies\n to the result for 32-bit platforms.\n\nSee Also:\n\n 'data.frame', 'by', 'cbind'.\n\n 'dendrogram' for a class which has a 'merge' method.\n\nExamples:\n\n authors <- data.frame(\n ## I(*) : use character columns of names to get sensible sort order\n surname = I(c(\"Tukey\", \"Venables\", \"Tierney\", \"Ripley\", \"McNeil\")),\n nationality = c(\"US\", \"Australia\", \"US\", \"UK\", \"Australia\"),\n deceased = c(\"yes\", rep(\"no\", 4)))\n authorN <- within(authors, { name <- surname; rm(surname) })\n books <- data.frame(\n name = I(c(\"Tukey\", \"Venables\", \"Tierney\",\n \"Ripley\", \"Ripley\", \"McNeil\", \"R Core\")),\n title = c(\"Exploratory Data Analysis\",\n \"Modern Applied Statistics ...\",\n \"LISP-STAT\",\n \"Spatial Statistics\", \"Stochastic Simulation\",\n \"Interactive Data Analysis\",\n \"An Introduction to R\"),\n other.author = c(NA, \"Ripley\", NA, NA, NA, NA,\n \"Venables & Smith\"))\n \n (m0 <- merge(authorN, books))\n (m1 <- merge(authors, books, by.x = \"surname\", by.y = \"name\"))\n m2 <- merge(books, authors, by.x = \"name\", by.y = \"surname\")\n stopifnot(exprs = {\n identical(m0, m2[, names(m0)])\n as.character(m1[, 1]) == as.character(m2[, 1])\n all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])\n identical(dim(merge(m1, m2, by = NULL)),\n c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))\n })\n \n ## \"R core\" is missing from authors and appears only here :\n merge(authors, books, by.x = \"surname\", by.y = \"name\", all = TRUE)\n \n \n ## example of using 'incomparables'\n x <- data.frame(k1 = 
c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)\n y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)\n merge(x, y, by = c(\"k1\",\"k2\")) # NA's match\n merge(x, y, by = \"k1\") # NA's match, so 6 rows\n merge(x, y, by = \"k2\", incomparables = NA) # 2 rows\n\n\n \n## Let's import the new data we want to merge and take a look\n\nThe new data `serodata_new.csv` represents a follow-up serological survey four years later. At this follow-up, individuals were retested for IgG antibody concentrations and their ages were collected.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_new <- read.csv(\"data/serodata_new.csv\")\nstr(df_new)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t636 obs. of 3 variables:\n $ observation_id : int 5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...\n $ IgG_concentration: num 0.261 2.981 0.282 136.638 0.381 ...\n $ age : int 6 8 8 8 5 8 8 NA 8 6 ...\n```\n:::\n\n```{.r .cell-code}\nsummary(df_new)\n```\n\n::: {.cell-output-display}\n| |observation_id |IgG_concentration | age |\n|:--|:--------------|:-----------------|:-------------|\n| |Min. :5006 |Min. : 0.0051 |Min. : 5.00 |\n| |1st Qu.:6328 |1st Qu.: 0.2751 |1st Qu.: 7.00 |\n| |Median :7494 |Median : 1.5477 |Median :10.00 |\n| |Mean :7490 |Mean : 82.7684 |Mean :10.63 |\n| |3rd Qu.:8736 |3rd Qu.:129.6389 |3rd Qu.:14.00 |\n| |Max. :9982 |Max. :950.6590 |Max. :19.00 |\n| |NA |NA |NA's :9 |\n:::\n:::\n\n\n\n## Merge the new data with the original data\n\nLet's load the old data as well and look for a variable, or variables, to merge by.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ncolnames(df)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] \"observation_id\" \"IgG_concentration\" \"age\" \n[4] \"gender\" \"slum\" \n```\n:::\n:::\n\n\nWe notice that `observation_id` seems to be the obvious variable by which to merge. However, we also realize that `IgG_concentration` and `age` have the exact same names in both data sets. 
If we merge now, we see that R appends the suffixes `.x` and `.y` to these shared column names to tell them apart.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(merge(df, df_new, all.x=T, all.y=T, by=c('observation_id')))\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration.x| age.x|gender |slum | IgG_concentration.y| age.y|\n|--------------:|-------------------:|-----:|:------|:--------|-------------------:|-----:|\n| 5006| 164.2979452| 7|Male |Non slum | 155.5811325| 11|\n| 5024| 0.3000000| 5|Female |Non slum | 0.2918605| 9|\n| 5026| 0.3000000| 10|Female |Non slum | 0.2542945| 14|\n| 5030| 0.0555556| 7|Female |Non slum | 0.0533262| 11|\n| 5035| 26.2112514| 11|Female |Non slum | 22.0159300| 15|\n| 5054| 0.3000000| 3|Male |Non slum | 0.2709671| 7|\n:::\n:::\n\n\n## Merge the new data with the original data\n\nThe first option is to rename the `IgG_concentration` and `age` variables before the merge, so that it is clear which is time point 1 and time point 2. \n\n::: {.cell}\n\n```{.r .cell-code}\ndf$IgG_concentration_time1 <- df$IgG_concentration\ndf$age_time1 <- df$age\ndf$IgG_concentration <- df$age <- NULL #remove the original variables\n\ndf_new$IgG_concentration_time2 <- df_new$IgG_concentration\ndf_new$age_time2 <- df_new$age\ndf_new$IgG_concentration <- df_new$age <- NULL #remove the original variables\n```\n:::\n\n\nNow, let's merge.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_wide <- merge(df, df_new, all.x=T, all.y=T, by=c('observation_id'))\nstr(df_all_wide)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t651 obs. 
of 7 variables:\n $ observation_id : int 5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...\n $ gender : chr \"Male\" \"Female\" \"Female\" \"Female\" ...\n $ slum : chr \"Non slum\" \"Non slum\" \"Non slum\" \"Non slum\" ...\n $ IgG_concentration_time1: num 164.2979 0.3 0.3 0.0556 26.2113 ...\n $ age_time1 : int 7 5 10 7 11 3 3 12 14 6 ...\n $ IgG_concentration_time2: num 155.5811 0.2919 0.2543 0.0533 22.0159 ...\n $ age_time2 : int 11 9 14 11 15 7 7 16 18 10 ...\n```\n:::\n:::\n\n\n## Merge the new data with the original data\n\nThe second option is to add a time variable to the two data sets and then merge by `observation_id`, `time`, `age`, and `IgG_concentration`. Note that I need to read in the data again because I removed the `IgG_concentration` and `age` variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(\"data/serodata.csv\")\ndf_new <- read.csv(\"data/serodata_new.csv\")\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$time <- 1 #you can put in one number and it will repeat it\ndf_new$time <- 2\nhead(df)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age|gender |slum | time|\n|--------------:|-----------------:|---:|:------|:--------|----:|\n| 5772| 0.3176895| 2|Female |Non slum | 1|\n| 8095| 3.4368231| 4|Female |Non slum | 1|\n| 9784| 0.3000000| 4|Male |Non slum | 1|\n| 9338| 143.2363014| 4|Male |Non slum | 1|\n| 6369| 0.4476534| 1|Male |Non slum | 1|\n| 6885| 0.0252708| 4|Male |Non slum | 1|\n:::\n\n```{.r .cell-code}\nhead(df_new)\n```\n\n::: {.cell-output-display}\n| observation_id| IgG_concentration| age| time|\n|--------------:|-----------------:|---:|----:|\n| 5772| 0.2612388| 6| 2|\n| 8095| 2.9809049| 8| 2|\n| 9784| 0.2819489| 8| 2|\n| 9338| 136.6382260| 8| 2|\n| 6369| 0.3810119| 5| 2|\n| 6885| 0.0245951| 8| 2|\n:::\n:::\n\n\nNow, let's merge. 
Note that \"By default the data frames are merged on the columns with names they both have\"; therefore, if I don't specify the `by` argument, it will merge on all matching variables.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf_all_long <- merge(df, df_new, all.x=T, all.y=T) \nstr(df_all_long)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t1287 obs. of 6 variables:\n $ observation_id : int 5006 5006 5024 5024 5026 5026 5030 5030 5035 5035 ...\n $ IgG_concentration: num 155.581 164.298 0.292 0.3 0.254 ...\n $ age : int 11 7 9 5 14 10 11 7 15 11 ...\n $ time : num 2 1 2 1 2 1 2 1 2 1 ...\n $ gender : chr NA \"Male\" NA \"Female\" ...\n $ slum : chr NA \"Non slum\" NA \"Non slum\" ...\n```\n:::\n:::\n\nNote that there are 1287 rows, which is the sum of the number of rows of `df` (651 rows) and `df_new` (636 rows).\n\n\n## What is wide/long data?\n\nAbove, we actually created both a wide and a long version of the data.\n\nWide: has many columns\n\n- multiple columns per individual, values spread across multiple columns \n- easier for humans to read\n \nLong: has many rows\n\n- column names become data\n- multiple rows per observation, a single column contains the values\n- easier for R to make plots & do analysis\n\n## `reshape()` function \n\nThe `reshape()` function allows you to toggle between wide and long data.\n\n\nReshape Grouped Data\n\nDescription:\n\n This function reshapes a data frame between 'wide' format (with\n repeated measurements in separate columns of the same row) and\n 'long' format (with the repeated measurements in separate rows).\n\nUsage:\n\n reshape(data, varying = NULL, v.names = NULL, timevar = \"time\",\n idvar = \"id\", ids = 1:NROW(data),\n times = seq_along(varying[[1]]),\n drop = NULL, direction, new.row.names = NULL,\n sep = \".\",\n split = if (sep == \"\") {\n list(regexp = \"[A-Za-z][0-9]\", include = TRUE)\n } else {\n list(regexp = sep, include = FALSE, fixed = TRUE)}\n )\n \n ### Typical usage for converting from long to wide 
format:\n \n # reshape(data, direction = \"wide\",\n # idvar = \"___\", timevar = \"___\", # mandatory\n # v.names = c(___), # time-varying variables\n # varying = list(___)) # auto-generated if missing\n \n ### Typical usage for converting from wide to long format:\n \n ### If names of wide-format variables are in a 'nice' format\n \n # reshape(data, direction = \"long\",\n # varying = c(___), # vector \n # sep) # to help guess 'v.names' and 'times'\n \n ### To specify long-format variable names explicitly\n \n # reshape(data, direction = \"long\",\n # varying = ___, # list / matrix / vector (use with care)\n # v.names = ___, # vector of variable names in long format\n # timevar, times, # name / values of constructed time variable\n # idvar, ids) # name / values of constructed id variable\n \nArguments:\n\n data: a data frame\n\n varying: names of sets of variables in the wide format that correspond\n to single variables in long format ('time-varying'). This is\n canonically a list of vectors of variable names, but it can\n optionally be a matrix of names, or a single vector of names.\n In each case, when 'direction = \"long\"', the names can be\n replaced by indices which are interpreted as referring to\n 'names(data)'. See 'Details' for more details and options.\n\n v.names: names of variables in the long format that correspond to\n multiple variables in the wide format. See 'Details'.\n\n timevar: the variable in long format that differentiates multiple\n records from the same group or individual. If more than one\n record matches, the first will be taken (with a warning).\n\n idvar: Names of one or more variables in long format that identify\n multiple records from the same group/individual. These\n variables may also be present in wide format.\n\n ids: the values to use for a newly created 'idvar' variable in\n long format.\n\n times: the values to use for a newly created 'timevar' variable in\n long format. 
See 'Details'.\n\n drop: a vector of names of variables to drop before reshaping.\n\ndirection: character string, partially matched to either '\"wide\"' to\n reshape to wide format, or '\"long\"' to reshape to long\n format.\n\nnew.row.names: character or 'NULL': a non-null value will be used for\n the row names of the result.\n\n sep: A character vector of length 1, indicating a separating\n character in the variable names in the wide format. This is\n used for guessing 'v.names' and 'times' arguments based on\n the names in 'varying'. If 'sep == \"\"', the split is just\n before the first numeral that follows an alphabetic\n character. This is also used to create variable names when\n reshaping to wide format.\n\n split: A list with three components, 'regexp', 'include', and\n (optionally) 'fixed'. This allows an extended interface to\n variable name splitting. See 'Details'.\n\nDetails:\n\n Although 'reshape()' can be used in a variety of contexts, the\n motivating application is data from longitudinal studies, and the\n arguments of this function are named and described in those terms.\n A longitudinal study is characterized by repeated measurements of\n the same variable(s), e.g., height and weight, on each unit being\n studied (e.g., individual persons) at different time points (which\n are assumed to be the same for all units). These variables are\n called time-varying variables. The study may include other\n variables that are measured only once for each unit and do not\n vary with time (e.g., gender and race); these are called\n time-constant variables.\n\n A 'wide' format representation of a longitudinal dataset will have\n one record (row) for each unit, typically with some time-constant\n variables that occupy single columns, and some time-varying\n variables that occupy multiple columns (one column for each time\n point). 
A 'long' format representation of the same dataset will\n have multiple records (rows) for each individual, with the\n time-constant variables being constant across these records and\n the time-varying variables varying across the records. The 'long'\n format dataset will have two additional variables: a 'time'\n variable identifying which time point each record comes from, and\n an 'id' variable showing which records refer to the same unit.\n\n The type of conversion (long to wide or wide to long) is\n determined by the 'direction' argument, which is mandatory unless\n the 'data' argument is the result of a previous call to 'reshape'.\n In that case, the operation can be reversed simply using\n 'reshape(data)' (the other arguments are stored as attributes on\n the data frame).\n\n Conversion from long to wide format with 'direction = \"wide\"' is\n the simpler operation, and is mainly useful in the context of\n multivariate analysis where data is often expected as a\n wide-format matrix. In this case, the time variable 'timevar' and\n id variable 'idvar' must be specified. All other variables are\n assumed to be time-varying, unless the time-varying variables are\n explicitly specified via the 'v.names' argument. A warning is\n issued if time-constant variables are not actually constant.\n\n Each time-varying variable is expanded into multiple variables in\n the wide format. The names of these expanded variables are\n generated automatically, unless they are specified as the\n 'varying' argument in the form of a list (or matrix) with one\n component (or row) for each time-varying variable. If 'varying' is\n a vector of names, it is implicitly converted into a matrix, with\n one row for each time-varying variable. 
Use this option with care\n if there are multiple time-varying variables, as the ordering (by\n column, the default in the 'matrix' constructor) may be\n unintuitive, whereas the explicit list or matrix form is\n unambiguous.\n\n Conversion from wide to long with 'direction = \"long\"' is the more\n common operation as most (univariate) statistical modeling\n functions expect data in the long format. In the simpler case\n where there is only one time-varying variable, the corresponding\n columns in the wide format input can be specified as the 'varying'\n argument, which can be either a vector of column names or the\n corresponding column indices. The name of the corresponding\n variable in the long format output combining these columns can be\n optionally specified as the 'v.names' argument, and the name of\n the time variables as the 'timevar' argument. The values to use as\n the time values corresponding to the different columns in the wide\n format can be specified as the 'times' argument. If 'v.names' is\n unspecified, the function will attempt to guess 'v.names' and\n 'times' from 'varying' (an explicitly specified 'times' argument\n is unused in that case). The default expects variable names like\n 'x.1', 'x.2', where 'sep = \".\"' specifies to split at the dot and\n drop it from the name. To have alphabetic followed by numeric\n times use 'sep = \"\"'.\n\n Multiple time-varying variables can be specified in two ways,\n either with 'varying' as an atomic vector as above, or as a list\n (or a matrix). The first form is useful (and mandatory) if the\n automatic variable name splitting as described above is used; this\n requires the names of all time-varying variables to be suitably\n formatted in the same manner, and 'v.names' to be unspecified. 
If\n 'varying' is a list (with one component for each time-varying\n variable) or a matrix (one row for each time-varying variable),\n variable name splitting is not attempted, and 'v.names' and\n 'times' will generally need to be specified, although they will\n default to, respectively, the first variable name in each set, and\n sequential times.\n\n Also, guessing is not attempted if 'v.names' is given explicitly,\n even if 'varying' is an atomic vector. In that case, the number of\n time-varying variables is taken to be the length of 'v.names', and\n 'varying' is implicitly converted into a matrix, with one row for\n each time-varying variable. As in the case of long to wide\n conversion, the matrix is filled up by column, so careful\n attention needs to be paid to the order of variable names (or\n indices) in 'varying', which is taken to be like 'x.1', 'y.1',\n 'x.2', 'y.2' (i.e., variables corresponding to the same time point\n need to be grouped together).\n\n The 'split' argument should not usually be necessary. The\n 'split$regexp' component is passed to either 'strsplit' or\n 'regexpr', where the latter is used if 'split$include' is 'TRUE',\n in which case the splitting occurs after the first character of\n the matched string. 
In the 'strsplit' case, the separator is not\n included in the result, and it is possible to specify fixed-string\n matching using 'split$fixed'.\n\nValue:\n\n The reshaped data frame with added attributes to simplify\n reshaping back to the original form.\n\nSee Also:\n\n 'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.\n 'xtabs' and 'as.data.frame.table' for creating contingency tables\n and converting them back to data frames.\n\nExamples:\n\n summary(Indometh) # data in long format\n \n ## long to wide (direction = \"wide\") requires idvar and timevar at a minimum\n reshape(Indometh, direction = \"wide\", idvar = \"Subject\", timevar = \"time\")\n \n ## can also explicitly specify name of combined variable\n wide <- reshape(Indometh, direction = \"wide\", idvar = \"Subject\",\n timevar = \"time\", v.names = \"conc\", sep= \"_\")\n wide\n \n ## reverse transformation\n reshape(wide, direction = \"long\")\n reshape(wide, idvar = \"Subject\", varying = list(2:12),\n v.names = \"conc\", direction = \"long\")\n \n ## times need not be numeric\n df <- data.frame(id = rep(1:4, rep(2,4)),\n visit = I(rep(c(\"Before\",\"After\"), 4)),\n x = rnorm(4), y = runif(4))\n df\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n ## warns that y is really varying\n reshape(df, timevar = \"visit\", idvar = \"id\", direction = \"wide\", v.names = \"x\")\n \n \n ## unbalanced 'long' data leads to NA fill in 'wide' form\n df2 <- df[1:7, ]\n df2\n reshape(df2, timevar = \"visit\", idvar = \"id\", direction = \"wide\")\n \n ## Alternative regular expressions for guessing names\n df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),\n dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))\n reshape(df3, direction = \"long\", varying = 3:5, sep = \"\")\n \n \n ## an example that isn't longitudinal data\n state.x77 <- as.data.frame(state.x77)\n long <- reshape(state.x77, idvar = \"state\", ids = row.names(state.x77),\n times = names(state.x77), 
timevar = \"Characteristic\",\n varying = list(names(state.x77)), direction = \"long\")\n \n reshape(long, direction = \"wide\")\n \n reshape(long, direction = \"wide\", new.row.names = unique(long$state))\n \n ## multiple id variables\n df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),\n time = rep(c(1,1,2,2), 3), score = rnorm(12))\n wide <- reshape(df3, idvar = c(\"school\", \"class\"), direction = \"wide\")\n wide\n ## transform back\n reshape(wide)\n\n\n\n## long to wide data\n\nxxzane - help\n\n\n## wide to long data\n\nxxzane - help\n\n\n## Let's get real\n\nUse the `pivot_wider()` and `pivot_longer()` from the tidyr package!\n\n\n\n## Summary\n\n- ...\n\t\t\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/modules/Module09-DataAnalysis/execute-results/html.json b/_freeze/modules/Module09-DataAnalysis/execute-results/html.json index 0497888..4af0d22 100644 --- a/_freeze/modules/Module09-DataAnalysis/execute-results/html.json +++ b/_freeze/modules/Module09-DataAnalysis/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "ebcf08f6d0a895a7c6ee1c74583797e5", + "hash": "662b02c140c1e96bfb158859db710b34", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 9: Data Analysis\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 9, you should be able to...\n\n-\t\tDescriptively assess association between two variables\n-\t\tCompute basic statistics \n-\t\tFit a generalized linear model\n\n## Import data for this 
module\n\nLet's read in our data (again) and take a quick look.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n## Prep data\n\nCreate `age_group`, a three-level factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \n ifelse(df$age>10, \"old\", NA)))\ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\n\nCreate `seropos`, a binary variable that indicates seropositivity when the antibody concentration is >=10 mIU/mL.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, \n\t\t\t\t\t\t\t\t\t\tifelse(df$IgG_concentration>=10, 1, NA))\n```\n:::\n\n\n\n\n## 2 variable contingency tables\n\nWe previously used `table()` to look at one variable; now we can generate frequency tables for two or more variables. To get cell percentages, `prop.table()` is useful. 
\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$age_group, df$seropos)\nprop <- prop.table(freq)\nfreq\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n young 254 57\n middle 70 105\n old 30 116\n```\n\n\n:::\n\n```{.r .cell-code}\nprop\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n \n 0 1\n young 0.40189873 0.09018987\n middle 0.11075949 0.16613924\n old 0.04746835 0.18354430\n```\n\n\n:::\n:::\n\n\n\n## Chi-Square test\n\nThe `chisq.test()` function from the `stats` package tests the independence of factor variables.\n\n\n\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\nPearson's Chi-squared Test for Count Data\n\nDescription:\n\n 'chisq.test' performs chi-squared contingency table tests and\n goodness-of-fit tests.\n\nUsage:\n\n chisq.test(x, y = NULL, correct = TRUE,\n p = rep(1/length(x), length(x)), rescale.p = FALSE,\n simulate.p.value = FALSE, B = 2000)\n \nArguments:\n\n x: a numeric vector or matrix. 'x' and 'y' can also both be\n factors.\n\n y: a numeric vector; ignored if 'x' is a matrix. If 'x' is a\n factor, 'y' should be a factor of the same length.\n\n correct: a logical indicating whether to apply continuity correction\n when computing the test statistic for 2 by 2 tables: one half\n is subtracted from all |O - E| differences; however, the\n correction will not be bigger than the differences\n themselves. No correction is done if 'simulate.p.value =\n TRUE'.\n\n p: a vector of probabilities of the same length as 'x'. An\n error is given if any entry of 'p' is negative.\n\nrescale.p: a logical scalar; if TRUE then 'p' is rescaled (if\n necessary) to sum to 1. 
If 'rescale.p' is FALSE, and 'p'\n does not sum to 1, an error is given.\n\nsimulate.p.value: a logical indicating whether to compute p-values by\n Monte Carlo simulation.\n\n B: an integer specifying the number of replicates used in the\n Monte Carlo test.\n\nDetails:\n\n If 'x' is a matrix with one row or column, or if 'x' is a vector\n and 'y' is not given, then a _goodness-of-fit test_ is performed\n ('x' is treated as a one-dimensional contingency table). The\n entries of 'x' must be non-negative integers. In this case, the\n hypothesis tested is whether the population probabilities equal\n those in 'p', or are all equal if 'p' is not given.\n\n If 'x' is a matrix with at least two rows and columns, it is taken\n as a two-dimensional contingency table: the entries of 'x' must be\n non-negative integers. Otherwise, 'x' and 'y' must be vectors or\n factors of the same length; cases with missing values are removed,\n the objects are coerced to factors, and the contingency table is\n computed from these. Then Pearson's chi-squared test is performed\n of the null hypothesis that the joint distribution of the cell\n counts in a 2-dimensional contingency table is the product of the\n row and column marginals.\n\n If 'simulate.p.value' is 'FALSE', the p-value is computed from the\n asymptotic chi-squared distribution of the test statistic;\n continuity correction is only used in the 2-by-2 case (if\n 'correct' is 'TRUE', the default). Otherwise the p-value is\n computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.\n The default 'B = 2000' implies a minimum p-value of about 0.0005\n (1/(B+1)).\n\n In the contingency table case, simulation is done by random\n sampling from the set of all contingency tables with given\n marginals, and works only if the marginals are strictly positive.\n Continuity correction is never used, and the statistic is quoted\n without it. 
Note that this is not the usual sampling situation\n assumed for the chi-squared test but rather that for Fisher's\n exact test.\n\n In the goodness-of-fit case simulation is done by random sampling\n from the discrete distribution specified by 'p', each sample being\n of size 'n = sum(x)'. This simulation is done in R and may be\n slow.\n\nValue:\n\n A list with class '\"htest\"' containing the following components:\n\nstatistic: the value the chi-squared test statistic.\n\nparameter: the degrees of freedom of the approximate chi-squared\n distribution of the test statistic, 'NA' if the p-value is\n computed by Monte Carlo simulation.\n\n p.value: the p-value for the test.\n\n method: a character string indicating the type of test performed, and\n whether Monte Carlo simulation or continuity correction was\n used.\n\ndata.name: a character string giving the name(s) of the data.\n\nobserved: the observed counts.\n\nexpected: the expected counts under the null hypothesis.\n\nresiduals: the Pearson residuals, '(observed - expected) /\n sqrt(expected)'.\n\n stdres: standardized residuals, '(observed - expected) / sqrt(V)',\n where 'V' is the residual cell variance (Agresti, 2007,\n section 2.4.5 for the case where 'x' is a matrix, 'n * p * (1\n - p)' otherwise).\n\nSource:\n\n The code for Monte Carlo simulation is a C translation of the\n Fortran algorithm of Patefield (1981).\n\nReferences:\n\n Hope, A. C. A. (1968). A simplified Monte Carlo significance test\n procedure. _Journal of the Royal Statistical Society Series B_,\n *30*, 582-598. doi:10.1111/j.2517-6161.1968.tb00759.x\n .\n\n Patefield, W. M. (1981). Algorithm AS 159: An efficient method of\n generating r x c tables with given row and column totals.\n _Applied Statistics_, *30*, 91-97. doi:10.2307/2346669\n .\n\n Agresti, A. (2007). _An Introduction to Categorical Data\n Analysis_, 2nd ed. New York: John Wiley & Sons. 
Page 38.\n\nSee Also:\n\n For goodness-of-fit testing, notably of continuous distributions,\n 'ks.test'.\n\nExamples:\n\n ## From Agresti(2007) p.39\n M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))\n dimnames(M) <- list(gender = c(\"F\", \"M\"),\n party = c(\"Democrat\",\"Independent\", \"Republican\"))\n (Xsq <- chisq.test(M)) # Prints test summary\n Xsq$observed # observed counts (same as M)\n Xsq$expected # expected counts under the null\n Xsq$residuals # Pearson residuals\n Xsq$stdres # standardized residuals\n \n \n ## Effect of simulating p-values\n x <- matrix(c(12, 5, 7, 7), ncol = 2)\n chisq.test(x)$p.value # 0.4233\n chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value\n # around 0.29!\n \n ## Testing for population probabilities\n ## Case A. Tabulated data\n x <- c(A = 20, B = 15, C = 25)\n chisq.test(x)\n chisq.test(as.table(x)) # the same\n x <- c(89,37,30,28,2)\n p <- c(40,20,20,15,5)\n try(\n chisq.test(x, p = p) # gives an error\n )\n chisq.test(x, p = p, rescale.p = TRUE)\n # works\n p <- c(0.40,0.20,0.20,0.19,0.01)\n # Expected count in category 5\n # is 1.86 < 5 ==> chi square approx.\n chisq.test(x, p = p) # maybe doubtful, but is ok!\n chisq.test(x, p = p, simulate.p.value = TRUE)\n \n ## Case B. Raw data\n x <- trunc(5 * runif(100))\n chisq.test(table(x)) # NOT 'chisq.test(x)'!\n\n\n\n\n## Chi-Square test\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchisq.test(freq)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n\tPearson's Chi-squared test\n\ndata: freq\nX-squared = 175.85, df = 2, p-value < 2.2e-16\n```\n\n\n:::\n:::\n\n\n\nWe reject the null hypothesis that the proportion of seropositive individuals is the same in the young (<=5 yo), middle (6-10 yo), and old (>10 yo) age groups.\n\n\n## Correlation\n\nFirst, we compute correlation by providing two vectors.\n\nLike other functions, if there are `NA`s, you get `NA` as the result. 
But if you specify `use = \"complete.obs\"`, the correlation is computed using only the non-missing observations.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncor(df$age, df$IgG_concentration, method=\"pearson\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] NA\n```\n\n\n:::\n\n```{.r .cell-code}\ncor(df$age, df$IgG_concentration, method=\"pearson\", use = \"complete.obs\") # use if you have missing data\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[1] 0.2604783\n```\n\n\n:::\n:::\n\n\n\nThere is a small positive correlation between IgG concentration and age.\n\n## T-test\n\nThe most commonly used t-tests are:\n\n- **one-sample t-test** -- used to test the mean of a variable in one group against the null hypothesis mean\n- **two-sample t-test** -- used to test the difference in means of a variable between two groups (null hypothesis - the group means are the *same*); if the \"two groups\" are data from the *same* individuals collected at 2 time points, we call it a paired two-sample t-test\n\n## T-test\n\nWe can use the `t.test()` function from the `stats` package.\n\n\n\nStudent's t-Test\n\nDescription:\n\n Performs one and two sample t-tests on vectors of data.\n\nUsage:\n\n t.test(x, ...)\n \n ## Default S3 method:\n t.test(x, y = NULL,\n alternative = c(\"two.sided\", \"less\", \"greater\"),\n mu = 0, paired = FALSE, var.equal = FALSE,\n conf.level = 0.95, ...)\n \n ## S3 method for class 'formula'\n t.test(formula, data, subset, na.action, ...)\n \nArguments:\n\n x: a (non-empty) numeric vector of data values.\n\n y: an optional (non-empty) numeric vector of data values.\n\nalternative: a character string specifying the alternative hypothesis,\n must be one of '\"two.sided\"' (default), '\"greater\"' or\n '\"less\"'. 
You can specify just the initial letter.\n\n mu: a number indicating the true value of the mean (or difference\n in means if you are performing a two sample test).\n\n paired: a logical indicating whether you want a paired t-test.\n\nvar.equal: a logical variable indicating whether to treat the two\n variances as being equal. If 'TRUE' then the pooled variance\n is used to estimate the variance otherwise the Welch (or\n Satterthwaite) approximation to the degrees of freedom is\n used.\n\nconf.level: confidence level of the interval.\n\n formula: a formula of the form 'lhs ~ rhs' where 'lhs' is a numeric\n variable giving the data values and 'rhs' either '1' for a\n one-sample or paired test or a factor with two levels giving\n the corresponding groups. If 'lhs' is of class '\"Pair\"' and\n 'rhs' is '1', a paired test is done.\n\n data: an optional matrix or data frame (or similar: see\n 'model.frame') containing the variables in the formula\n 'formula'. By default the variables are taken from\n 'environment(formula)'.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. Defaults to 'getOption(\"na.action\")'.\n\n ...: further arguments to be passed to or from methods.\n\nDetails:\n\n 'alternative = \"greater\"' is the alternative that 'x' has a larger\n mean than 'y'. For the one-sample case: that the mean is positive.\n\n If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and\n they must be the same length. Missing values are silently removed\n (in pairs if 'paired' is 'TRUE'). If 'var.equal' is 'TRUE' then\n the pooled estimate of the variance is used. 
By default, if\n 'var.equal' is 'FALSE' then the variance is estimated separately\n for both groups and the Welch modification to the degrees of\n freedom is used.\n\n If the input data are effectively constant (compared to the larger\n of the two means) an error is generated.\n\nValue:\n\n A list with class '\"htest\"' containing the following components:\n\nstatistic: the value of the t-statistic.\n\nparameter: the degrees of freedom for the t-statistic.\n\n p.value: the p-value for the test.\n\nconf.int: a confidence interval for the mean appropriate to the\n specified alternative hypothesis.\n\nestimate: the estimated mean or difference in means depending on\n whether it was a one-sample test or a two-sample test.\n\nnull.value: the specified hypothesized value of the mean or mean\n difference depending on whether it was a one-sample test or a\n two-sample test.\n\n stderr: the standard error of the mean (difference), used as\n denominator in the t-statistic formula.\n\nalternative: a character string describing the alternative hypothesis.\n\n method: a character string indicating what type of t-test was\n performed.\n\ndata.name: a character string giving the name(s) of the data.\n\nSee Also:\n\n 'prop.test'\n\nExamples:\n\n require(graphics)\n \n t.test(1:10, y = c(7:20)) # P = .00001855\n t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore\n \n ## Classical example: Student's sleep data\n plot(extra ~ group, data = sleep)\n ## Traditional interface\n with(sleep, t.test(extra[group == 1], extra[group == 2]))\n \n ## Formula interface\n t.test(extra ~ group, data = sleep)\n \n ## Formula interface to one-sample test\n t.test(extra ~ 1, data = sleep)\n \n ## Formula interface to paired test\n ## The sleep data are actually paired, so could have been in wide format:\n sleep2 <- reshape(sleep, direction = \"wide\", \n idvar = \"ID\", timevar = \"group\")\n t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)\n\n\n\n## Running two-sample 
t-test\n\nWe use the **base R** `t.test()` function from the `stats` package. It tests the difference in means of a variable between two groups. By default, it:\n\n- tests whether the difference in means of a variable is equal to 0 (default `mu=0`)\n- uses a \"two sided\" alternative (`alternative = \"two.sided\"`)\n- returns a result assuming confidence level 0.95 (`conf.level = 0.95`)\n- assumes the data are not paired (`paired = FALSE`)\n- assumes the true variances in the two groups are not equal (`var.equal = FALSE`)\n\n## Running two-sample t-test\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nIgG_young <- df$IgG_concentration[df$age_group==\"young\"]\nIgG_old <- df$IgG_concentration[df$age_group==\"old\"]\n\nt.test(IgG_young, IgG_old)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\n\tWelch Two Sample t-test\n\ndata: IgG_young and IgG_old\nt = -6.1969, df = 259.54, p-value = 2.25e-09\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -111.09281 -57.51515\nsample estimates:\nmean of x mean of y \n 45.05056 129.35454 \n```\n\n\n:::\n:::\n\n\n\nThe mean IgG concentration of young and old is 45.05 and 129.35 mIU/mL, respectively. 
We reject the null hypothesis that the difference in the mean IgG concentration of young and old is 0 mIU/mL.\n\n## Linear regression fit in R\n\nTo fit regression models in R, we use the function `glm()` (Generalized Linear Model).\n\n\n\n\nFitting Generalized Linear Models\n\nDescription:\n\n 'glm' is used to fit generalized linear models, specified by\n giving a symbolic description of the linear predictor and a\n description of the error distribution.\n\nUsage:\n\n glm(formula, family = gaussian, data, weights, subset,\n na.action, start = NULL, etastart, mustart, offset,\n control = list(...), model = TRUE, method = \"glm.fit\",\n x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)\n \n glm.fit(x, y, weights = rep.int(1, nobs),\n start = NULL, etastart = NULL, mustart = NULL,\n offset = rep.int(0, nobs), family = gaussian(),\n control = list(), intercept = TRUE, singular.ok = TRUE)\n \n ## S3 method for class 'glm'\n weights(object, type = c(\"prior\", \"working\"), ...)\n \nArguments:\n\n formula: an object of class '\"formula\"' (or one that can be coerced to\n that class): a symbolic description of the model to be\n fitted. The details of model specification are given under\n 'Details'.\n\n family: a description of the error distribution and link function to\n be used in the model. For 'glm' this can be a character\n string naming a family function, a family function or the\n result of a call to a family function. For 'glm.fit' only\n the third option is supported. (See 'family' for details of\n family functions.)\n\n data: an optional data frame, list or environment (or object\n coercible by 'as.data.frame' to a data frame) containing the\n variables in the model. If not found in 'data', the\n variables are taken from 'environment(formula)', typically\n the environment from which 'glm' is called.\n\n weights: an optional vector of 'prior weights' to be used in the\n fitting process. 
Should be 'NULL' or a numeric vector.\n\n subset: an optional vector specifying a subset of observations to be\n used in the fitting process.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is set by the 'na.action' setting\n of 'options', and is 'na.fail' if that is unset. The\n 'factory-fresh' default is 'na.omit'. Another possible value\n is 'NULL', no action. Value 'na.exclude' can be useful.\n\n start: starting values for the parameters in the linear predictor.\n\netastart: starting values for the linear predictor.\n\n mustart: starting values for the vector of means.\n\n offset: this can be used to specify an _a priori_ known component to\n be included in the linear predictor during fitting. This\n should be 'NULL' or a numeric vector of length equal to the\n number of cases. One or more 'offset' terms can be included\n in the formula instead or as well, and if more than one is\n specified their sum is used. See 'model.offset'.\n\n control: a list of parameters for controlling the fitting process.\n For 'glm.fit' this is passed to 'glm.control'.\n\n model: a logical value indicating whether _model frame_ should be\n included as a component of the returned value.\n\n method: the method to be used in fitting the model. The default\n method '\"glm.fit\"' uses iteratively reweighted least squares\n (IWLS): the alternative '\"model.frame\"' returns the model\n frame and does no fitting.\n\n User-supplied fitting functions can be supplied either as a\n function or a character string naming a function, with a\n function which takes the same arguments as 'glm.fit'. 
If\n specified as a character string it is looked up from within\n the 'stats' namespace.\n\n x, y: For 'glm': logical values indicating whether the response\n vector and model matrix used in the fitting process should be\n returned as components of the returned value.\n\n For 'glm.fit': 'x' is a design matrix of dimension 'n * p',\n and 'y' is a vector of observations of length 'n'.\n\nsingular.ok: logical; if 'FALSE' a singular fit is an error.\n\ncontrasts: an optional list. See the 'contrasts.arg' of\n 'model.matrix.default'.\n\nintercept: logical. Should an intercept be included in the _null_\n model?\n\n object: an object inheriting from class '\"glm\"'.\n\n type: character, partial matching allowed. Type of weights to\n extract from the fitted model object. Can be abbreviated.\n\n ...: For 'glm': arguments to be used to form the default 'control'\n argument if it is not supplied directly.\n\n For 'weights': further arguments passed to or from other\n methods.\n\nDetails:\n\n A typical predictor has the form 'response ~ terms' where\n 'response' is the (numeric) response vector and 'terms' is a\n series of terms which specifies a linear predictor for 'response'.\n For 'binomial' and 'quasibinomial' families the response can also\n be specified as a 'factor' (when the first level denotes failure\n and all others success) or as a two-column matrix with the columns\n giving the numbers of successes and failures. A terms\n specification of the form 'first + second' indicates all the terms\n in 'first' together with all the terms in 'second' with any\n duplicates removed.\n\n A specification of the form 'first:second' indicates the set of\n terms obtained by taking the interactions of all terms in 'first'\n with all terms in 'second'. The specification 'first*second'\n indicates the _cross_ of 'first' and 'second'. 
This is the same\n as 'first + second + first:second'.\n\n The terms in the formula will be re-ordered so that main effects\n come first, followed by the interactions, all second-order, all\n third-order and so on: to avoid this pass a 'terms' object as the\n formula.\n\n Non-'NULL' 'weights' can be used to indicate that different\n observations have different dispersions (with the values in\n 'weights' being inversely proportional to the dispersions); or\n equivalently, when the elements of 'weights' are positive integers\n w_i, that each response y_i is the mean of w_i unit-weight\n observations. For a binomial GLM prior weights are used to give\n the number of trials when the response is the proportion of\n successes: they would rarely be used for a Poisson GLM.\n\n 'glm.fit' is the workhorse function: it is not normally called\n directly but can be more efficient where the response vector,\n design matrix and family have already been calculated.\n\n If more than one of 'etastart', 'start' and 'mustart' is\n specified, the first in the list will be used. It is often\n advisable to supply starting values for a 'quasi' family, and also\n for families with unusual links such as 'gaussian(\"log\")'.\n\n All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are\n evaluated in the same way as variables in 'formula', that is first\n in 'data' and then in the environment of 'formula'.\n\n For the background to warning messages about 'fitted probabilities\n numerically 0 or 1 occurred' for binomial GLMs, see Venables &\n Ripley (2002, pp. 197-8).\n\nValue:\n\n 'glm' returns an object of class inheriting from '\"glm\"' which\n inherits from the class '\"lm\"'. See later in this section. 
If a\n non-standard 'method' is used, the object will also inherit from\n the class (if any) returned by that function.\n\n The function 'summary' (i.e., 'summary.glm') can be used to obtain\n or print a summary of the results and the function 'anova' (i.e.,\n 'anova.glm') to produce an analysis of variance table.\n\n The generic accessor functions 'coefficients', 'effects',\n 'fitted.values' and 'residuals' can be used to extract various\n useful features of the value returned by 'glm'.\n\n 'weights' extracts a vector of weights, one for each case in the\n fit (after subsetting and 'na.action').\n\n An object of class '\"glm\"' is a list containing at least the\n following components:\n\ncoefficients: a named vector of coefficients\n\nresiduals: the _working_ residuals, that is the residuals in the final\n iteration of the IWLS fit. Since cases with zero weights are\n omitted, their working residuals are 'NA'.\n\nfitted.values: the fitted mean values, obtained by transforming the\n linear predictors by the inverse of the link function.\n\n rank: the numeric rank of the fitted linear model.\n\n family: the 'family' object used.\n\nlinear.predictors: the linear fit on link scale.\n\ndeviance: up to a constant, minus twice the maximized log-likelihood.\n Where sensible, the constant is chosen so that a saturated\n model has deviance zero.\n\n aic: A version of Akaike's _An Information Criterion_, minus twice\n the maximized log-likelihood plus twice the number of\n parameters, computed via the 'aic' component of the family.\n For binomial and Poison families the dispersion is fixed at\n one and the number of parameters is the number of\n coefficients. For gaussian, Gamma and inverse gaussian\n families the dispersion is estimated from the residual\n deviance, and the number of parameters is the number of\n coefficients plus one. 
For a gaussian family the MLE of the\n dispersion is used so this is a valid value of AIC, but for\n Gamma and inverse gaussian families it is not. For families\n fitted by quasi-likelihood the value is 'NA'.\n\nnull.deviance: The deviance for the null model, comparable with\n 'deviance'. The null model will include the offset, and an\n intercept if there is one in the model. Note that this will\n be incorrect if the link function depends on the data other\n than through the fitted mean: specify a zero offset to force\n a correct calculation.\n\n iter: the number of iterations of IWLS used.\n\n weights: the _working_ weights, that is the weights in the final\n iteration of the IWLS fit.\n\nprior.weights: the weights initially supplied, a vector of '1's if none\n were.\n\ndf.residual: the residual degrees of freedom.\n\n df.null: the residual degrees of freedom for the null model.\n\n y: if requested (the default) the 'y' vector used. (It is a\n vector even for a binomial model.)\n\n x: if requested, the model matrix.\n\n model: if requested (the default), the model frame.\n\nconverged: logical. Was the IWLS algorithm judged to have converged?\n\nboundary: logical. 
Is the fitted value on the boundary of the\n attainable values?\n\n call: the matched call.\n\n formula: the formula supplied.\n\n terms: the 'terms' object used.\n\n data: the 'data argument'.\n\n offset: the offset vector used.\n\n control: the value of the 'control' argument used.\n\n method: the name of the fitter function used (when provided as a\n 'character' string to 'glm()') or the fitter 'function' (when\n provided as that).\n\ncontrasts: (where relevant) the contrasts used.\n\n xlevels: (where relevant) a record of the levels of the factors used\n in fitting.\n\nna.action: (where relevant) information returned by 'model.frame' on\n the special handling of 'NA's.\n\n In addition, non-empty fits will have components 'qr', 'R' and\n 'effects' relating to the final weighted linear fit.\n\n Objects of class '\"glm\"' are normally of class 'c(\"glm\", \"lm\")',\n that is inherit from class '\"lm\"', and well-designed methods for\n class '\"lm\"' will be applied to the weighted linear model at the\n final iteration of IWLS. However, care is needed, as extractor\n functions for class '\"glm\"' such as 'residuals' and 'weights' do\n *not* just pick out the component of the fit with the same name.\n\n If a 'binomial' 'glm' model was specified by giving a two-column\n response, the weights returned by 'prior.weights' are the total\n numbers of cases (factored by the supplied case weights) and the\n component 'y' of the result is the proportion of successes.\n\nFitting functions:\n\n The argument 'method' serves two purposes. One is to allow the\n model frame to be recreated with no fitting. The other is to\n allow the default fitting function 'glm.fit' to be replaced by a\n function which takes the same arguments and uses a different\n fitting algorithm. 
If 'glm.fit' is supplied as a character string\n it is used to search for a function of that name, starting in the\n 'stats' namespace.\n\n The class of the object return by the fitter (if any) will be\n prepended to the class returned by 'glm'.\n\nAuthor(s):\n\n The original R implementation of 'glm' was written by Simon Davies\n working for Ross Ihaka at the University of Auckland, but has\n since been extensively re-written by members of the R Core team.\n\n The design was inspired by the S function of the same name\n described in Hastie & Pregibon (1992).\n\nReferences:\n\n Dobson, A. J. (1990) _An Introduction to Generalized Linear\n Models._ London: Chapman and Hall.\n\n Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._\n Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.\n J. Hastie, Wadsworth & Brooks/Cole.\n\n McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._\n London: Chapman and Hall.\n\n Venables, W. N. and Ripley, B. D. (2002) _Modern Applied\n Statistics with S._ New York: Springer.\n\nSee Also:\n\n 'anova.glm', 'summary.glm', etc. 
for 'glm' methods, and the\n generic functions 'anova', 'summary', 'effects', 'fitted.values',\n and 'residuals'.\n\n 'lm' for non-generalized _linear_ models (which SAS calls GLMs,\n for 'general' linear models).\n\n 'loglin' and 'loglm' (package 'MASS') for fitting log-linear\n models (which binomial and Poisson GLMs are) to contingency\n tables.\n\n 'bigglm' in package 'biglm' for an alternative way to fit GLMs to\n large datasets (especially those with many cases).\n\n 'esoph', 'infert' and 'predict.glm' have examples of fitting\n binomial glms.\n\nExamples:\n\n ## Dobson (1990) Page 93: Randomized Controlled Trial :\n counts <- c(18,17,15,20,10,20,25,13,12)\n outcome <- gl(3,1,9)\n treatment <- gl(3,3)\n data.frame(treatment, outcome, counts) # showing data\n glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())\n anova(glm.D93)\n summary(glm.D93)\n ## Computing AIC [in many ways]:\n (A0 <- AIC(glm.D93))\n (ll <- logLik(glm.D93))\n A1 <- -2*c(ll) + 2*attr(ll, \"df\")\n A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +\n 2 * length(coef(glm.D93))\n stopifnot(exprs = {\n all.equal(A0, A1)\n all.equal(A1, A2)\n all.equal(A1, glm.D93$aic)\n })\n \n \n ## an example with offsets from Venables & Ripley (2002, p.189)\n utils::data(anorexia, package = \"MASS\")\n \n anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),\n family = gaussian, data = anorexia)\n summary(anorex.1)\n \n \n # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)\n clotting <- data.frame(\n u = c(5,10,15,20,30,40,60,80,100),\n lot1 = c(118,58,42,35,27,25,21,19,18),\n lot2 = c(69,35,26,21,18,16,13,12,12))\n summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))\n summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))\n ## Aliased (\"S\"ingular) -> 1 NA coefficient\n (fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))\n tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())\n ## -> .. 
\"singular fit encountered\"\n \n ## Not run:\n \n ## for an example of the use of a terms object as a formula\n demo(glm.vr)\n ## End(Not run)\n\n\n\n## Linear regression fit in R\n\nWe tend to focus on three arguments:\n\n- `formula` -- model formula written using names of columns in our data\n- `data` -- our data frame\n- `family` -- error distribution and link function\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())\nfit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = \"logit\"))\n```\n:::\n\n\n\n## `summary.glm()`\n\nWhen applied to a fit object returned by `glm()`, the `summary()` function dispatches to `summary.glm()`, which produces details of the model fit. This is an example of R's object-oriented (S3) method dispatch: the method that runs depends on the class of the object.\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/rstudio_script.png){width=200%}\n:::\n:::\n\nSummarizing Generalized Linear Model Fits\n\nDescription:\n\n These functions are all 'methods' for class 'glm' or 'summary.glm'\n objects.\n\nUsage:\n\n ## S3 method for class 'glm'\n summary(object, dispersion = NULL, correlation = FALSE,\n symbolic.cor = FALSE, ...)\n \n ## S3 method for class 'summary.glm'\n print(x, digits = max(3, getOption(\"digits\") - 3),\n symbolic.cor = x$symbolic.cor,\n signif.stars = getOption(\"show.signif.stars\"),\n show.residuals = FALSE, ...)\n \nArguments:\n\n object: an object of class '\"glm\"', usually, a result of a call to\n 'glm'.\n\n x: an object of class '\"summary.glm\"', usually, a result of a\n call to 'summary.glm'.\n\ndispersion: the dispersion parameter for the family used. Either a\n single numerical value or 'NULL' (the default), when it is\n inferred from 'object' (see 'Details').\n\ncorrelation: logical; if 'TRUE', the correlation matrix of the\n estimated parameters is returned and printed.\n\n digits: the number of significant digits to use when printing.\n\nsymbolic.cor: logical. 
If 'TRUE', print the correlations in a symbolic\n form (see 'symnum') rather than as numbers.\n\nsignif.stars: logical. If 'TRUE', 'significance stars' are printed for\n each coefficient.\n\nshow.residuals: logical. If 'TRUE' then a summary of the deviance\n residuals is printed at the head of the output.\n\n ...: further arguments passed to or from other methods.\n\nDetails:\n\n 'print.summary.glm' tries to be smart about formatting the\n coefficients, standard errors, etc. and additionally gives\n 'significance stars' if 'signif.stars' is 'TRUE'. The\n 'coefficients' component of the result gives the estimated\n coefficients and their estimated standard errors, together with\n their ratio. This third column is labelled 't ratio' if the\n dispersion is estimated, and 'z ratio' if the dispersion is known\n (or fixed by the family). A fourth column gives the two-tailed\n p-value corresponding to the t or z ratio based on a Student t or\n Normal reference distribution. (It is possible that the\n dispersion is not known and there are no residual degrees of\n freedom from which to estimate it. In that case the estimate is\n 'NaN'.)\n\n Aliased coefficients are omitted in the returned object but\n restored by the 'print' method.\n\n Correlations are printed to two decimal places (or symbolically):\n to see the actual correlations print 'summary(object)$correlation'\n directly.\n\n The dispersion of a GLM is not used in the fitting process, but it\n is needed to find standard errors. 
If 'dispersion' is not\n supplied or 'NULL', the dispersion is taken as '1' for the\n 'binomial' and 'Poisson' families, and otherwise estimated by the\n residual Chisquared statistic (calculated from cases with non-zero\n weights) divided by the residual degrees of freedom.\n\n 'summary' can be used with Gaussian 'glm' fits to handle the case\n of a linear regression with known error variance, something not\n handled by 'summary.lm'.\n\nValue:\n\n 'summary.glm' returns an object of class '\"summary.glm\"', a list\n with components\n\n call: the component from 'object'.\n\n family: the component from 'object'.\n\ndeviance: the component from 'object'.\n\ncontrasts: the component from 'object'.\n\ndf.residual: the component from 'object'.\n\nnull.deviance: the component from 'object'.\n\n df.null: the component from 'object'.\n\ndeviance.resid: the deviance residuals: see 'residuals.glm'.\n\ncoefficients: the matrix of coefficients, standard errors, z-values and\n p-values. Aliased coefficients are omitted.\n\n aliased: named logical vector showing if the original coefficients are\n aliased.\n\ndispersion: either the supplied argument or the inferred/estimated\n dispersion if the former is 'NULL'.\n\n df: a 3-vector of the rank of the model and the number of\n residual degrees of freedom, plus number of coefficients\n (including aliased ones).\n\ncov.unscaled: the unscaled ('dispersion = 1') estimated covariance\n matrix of the estimated coefficients.\n\ncov.scaled: ditto, scaled by 'dispersion'.\n\ncorrelation: (only if 'correlation' is true.) The estimated\n correlations of the estimated coefficients.\n\nsymbolic.cor: (only if 'correlation' is true.) 
The value of the\n argument 'symbolic.cor'.\n\nSee Also:\n\n 'glm', 'summary'.\n\nExamples:\n\n ## For examples see example(glm)\n\n\n\n\n## Linear regression fit in R\n\nLet's look at the output...\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(fit1)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(), \n data = df)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 46.132 16.774 2.750 0.00613 ** \nage 9.324 1.388 6.718 4.15e-11 ***\ngenderMale -9.655 11.543 -0.836 0.40321 \nslumNon slum -20.353 14.299 -1.423 0.15513 \nslumSlum -29.705 25.009 -1.188 0.23536 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for gaussian family taken to be 20918.39)\n\n Null deviance: 14141483 on 631 degrees of freedom\nResidual deviance: 13115831 on 627 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 8087.9\n\nNumber of Fisher Scoring iterations: 2\n```\n\n\n:::\n\n```{.r .cell-code}\nsummary(fit2)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\nCall:\nglm(formula = seropos ~ age_group + gender + slum, family = binomial(link = \"logit\"), \n data = df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.3220 0.2516 -5.254 1.49e-07 ***\nage_groupmiddle 1.9020 0.2133 8.916 < 2e-16 ***\nage_groupold 2.8443 0.2522 11.278 < 2e-16 ***\ngenderMale -0.1725 0.1895 -0.910 0.363 \nslumNon slum -0.1099 0.2329 -0.472 0.637 \nslumSlum -0.1073 0.4118 -0.261 0.794 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 866.98 on 631 degrees of freedom\nResidual deviance: 679.10 on 626 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 691.1\n\nNumber of Fisher Scoring iterations: 4\n```\n\n\n:::\n:::\n\n\n\n\n\n## Summary\n\n- Use `cor()` to calculate the correlation between two numeric vectors.\n- `corrplot()` and `ggpairs()` are nice for a quick visualization of correlations\n- `t.test()` or `t_test()` tests a mean against a null value, or the difference in means between two groups\n- ... xxamy more\n\n## Acknowledgements\n\nThese are the materials I looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n", + "markdown": "---\ntitle: \"Module 9: Data Analysis\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 9, you should be able to...\n\n-\tDescriptively assess association between two variables\n-\tCompute basic statistics \n-\tFit a generalized linear model\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n:::\n:::\n\n\n## Prep data\n\nCreate a three-level factor variable, `age_group`.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\"))\ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\nCreate a binary variable, `seropos`, representing seropositivity if antibody 
concentrations are at least 10 IU/mL.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n\n## 2 variable contingency tables\n\nWe previously used `table()` to look at one variable; now we can generate frequency tables for two or more variables. To get cell percentages, `prop.table()` is useful. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n?prop.table\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(printr)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n:::\n\n```{.r .cell-code}\n?prop.table\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nExpress Table Entries as Fraction of Marginal Table\n\nDescription:\n\n Returns conditional proportions given 'margins', i.e. entries of\n 'x', divided by the appropriate marginal sums.\n\nUsage:\n\n proportions(x, margin = NULL)\n prop.table(x, margin = NULL)\n \nArguments:\n\n x: table\n\n margin: a vector giving the margins to split by. E.g., for a matrix\n '1' indicates rows, '2' indicates columns, 'c(1, 2)'\n indicates rows and columns. When 'x' has named dimnames, it\n can be a character vector selecting dimension names.\n\nValue:\n\n Table like 'x' expressed relative to 'margin'\n\nNote:\n\n 'prop.table' is an earlier name, retained for back-compatibility.\n\nAuthor(s):\n\n Peter Dalgaard\n\nSee Also:\n\n 'marginSums'. 
'apply', 'sweep' are a more general mechanism for\n sweeping out marginal statistics.\n\nExamples:\n\n m <- matrix(1:4, 2)\n m\n proportions(m, 1)\n \n DF <- as.data.frame(UCBAdmissions)\n tbl <- xtabs(Freq ~ Gender + Admit, DF)\n \n proportions(tbl, \"Gender\")\n```\n:::\n:::\n\n\n## 2 variable contingency tables\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$age_group, df$seropos)\nfreq\n```\n\n::: {.cell-output-display}\n|/ | 0| 1|\n|:------|---:|---:|\n|young | 254| 57|\n|middle | 70| 105|\n|old | 30| 116|\n:::\n:::\n\n\nNow, let's move to percentages\n\n::: {.cell}\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nprop.cell.percentages\n```\n\n::: {.cell-output-display}\n|/ | 0| 1|\n|:------|---------:|---------:|\n|young | 0.4018987| 0.0901899|\n|middle | 0.1107595| 0.1661392|\n|old | 0.0474684| 0.1835443|\n:::\n\n```{.r .cell-code}\nprop.column.percentages <- prop.table(freq, margin=2)\nprop.column.percentages\n```\n\n::: {.cell-output-display}\n|/ | 0| 1|\n|:------|---------:|---------:|\n|young | 0.7175141| 0.2050360|\n|middle | 0.1977401| 0.3776978|\n|old | 0.0847458| 0.4172662|\n:::\n:::\n\n\n\n## Chi-Square test\n\nThe `chisq.test()` function from the `stats` package tests the independence of factor variables.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?chisq.test\n```\n:::\n\nPearson's Chi-squared Test for Count Data\n\nDescription:\n\n 'chisq.test' performs chi-squared contingency table tests and\n goodness-of-fit tests.\n\nUsage:\n\n chisq.test(x, y = NULL, correct = TRUE,\n p = rep(1/length(x), length(x)), rescale.p = FALSE,\n simulate.p.value = FALSE, B = 2000)\n \nArguments:\n\n x: a numeric vector or matrix. 'x' and 'y' can also both be\n factors.\n\n y: a numeric vector; ignored if 'x' is a matrix. 
If 'x' is a\n factor, 'y' should be a factor of the same length.\n\n correct: a logical indicating whether to apply continuity correction\n when computing the test statistic for 2 by 2 tables: one half\n is subtracted from all |O - E| differences; however, the\n correction will not be bigger than the differences\n themselves. No correction is done if 'simulate.p.value =\n TRUE'.\n\n p: a vector of probabilities of the same length as 'x'. An\n error is given if any entry of 'p' is negative.\n\nrescale.p: a logical scalar; if TRUE then 'p' is rescaled (if\n necessary) to sum to 1. If 'rescale.p' is FALSE, and 'p'\n does not sum to 1, an error is given.\n\nsimulate.p.value: a logical indicating whether to compute p-values by\n Monte Carlo simulation.\n\n B: an integer specifying the number of replicates used in the\n Monte Carlo test.\n\nDetails:\n\n If 'x' is a matrix with one row or column, or if 'x' is a vector\n and 'y' is not given, then a _goodness-of-fit test_ is performed\n ('x' is treated as a one-dimensional contingency table). The\n entries of 'x' must be non-negative integers. In this case, the\n hypothesis tested is whether the population probabilities equal\n those in 'p', or are all equal if 'p' is not given.\n\n If 'x' is a matrix with at least two rows and columns, it is taken\n as a two-dimensional contingency table: the entries of 'x' must be\n non-negative integers. Otherwise, 'x' and 'y' must be vectors or\n factors of the same length; cases with missing values are removed,\n the objects are coerced to factors, and the contingency table is\n computed from these. 
Then Pearson's chi-squared test is performed\n of the null hypothesis that the joint distribution of the cell\n counts in a 2-dimensional contingency table is the product of the\n row and column marginals.\n\n If 'simulate.p.value' is 'FALSE', the p-value is computed from the\n asymptotic chi-squared distribution of the test statistic;\n continuity correction is only used in the 2-by-2 case (if\n 'correct' is 'TRUE', the default). Otherwise the p-value is\n computed for a Monte Carlo test (Hope, 1968) with 'B' replicates.\n The default 'B = 2000' implies a minimum p-value of about 0.0005\n (1/(B+1)).\n\n In the contingency table case, simulation is done by random\n sampling from the set of all contingency tables with given\n marginals, and works only if the marginals are strictly positive.\n Continuity correction is never used, and the statistic is quoted\n without it. Note that this is not the usual sampling situation\n assumed for the chi-squared test but rather that for Fisher's\n exact test.\n\n In the goodness-of-fit case simulation is done by random sampling\n from the discrete distribution specified by 'p', each sample being\n of size 'n = sum(x)'. 
This simulation is done in R and may be\n slow.\n\nValue:\n\n A list with class '\"htest\"' containing the following components:\n\nstatistic: the value the chi-squared test statistic.\n\nparameter: the degrees of freedom of the approximate chi-squared\n distribution of the test statistic, 'NA' if the p-value is\n computed by Monte Carlo simulation.\n\n p.value: the p-value for the test.\n\n method: a character string indicating the type of test performed, and\n whether Monte Carlo simulation or continuity correction was\n used.\n\ndata.name: a character string giving the name(s) of the data.\n\nobserved: the observed counts.\n\nexpected: the expected counts under the null hypothesis.\n\nresiduals: the Pearson residuals, '(observed - expected) /\n sqrt(expected)'.\n\n stdres: standardized residuals, '(observed - expected) / sqrt(V)',\n where 'V' is the residual cell variance (Agresti, 2007,\n section 2.4.5 for the case where 'x' is a matrix, 'n * p * (1\n - p)' otherwise).\n\nSource:\n\n The code for Monte Carlo simulation is a C translation of the\n Fortran algorithm of Patefield (1981).\n\nReferences:\n\n Hope, A. C. A. (1968). A simplified Monte Carlo significance test\n procedure. _Journal of the Royal Statistical Society Series B_,\n *30*, 582-598. doi:10.1111/j.2517-6161.1968.tb00759.x\n .\n\n Patefield, W. M. (1981). Algorithm AS 159: An efficient method of\n generating r x c tables with given row and column totals.\n _Applied Statistics_, *30*, 91-97. doi:10.2307/2346669\n .\n\n Agresti, A. (2007). _An Introduction to Categorical Data\n Analysis_, 2nd ed. New York: John Wiley & Sons. 
Page 38.\n\nSee Also:\n\n For goodness-of-fit testing, notably of continuous distributions,\n 'ks.test'.\n\nExamples:\n\n ## From Agresti(2007) p.39\n M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))\n dimnames(M) <- list(gender = c(\"F\", \"M\"),\n party = c(\"Democrat\",\"Independent\", \"Republican\"))\n (Xsq <- chisq.test(M)) # Prints test summary\n Xsq$observed # observed counts (same as M)\n Xsq$expected # expected counts under the null\n Xsq$residuals # Pearson residuals\n Xsq$stdres # standardized residuals\n \n \n ## Effect of simulating p-values\n x <- matrix(c(12, 5, 7, 7), ncol = 2)\n chisq.test(x)$p.value # 0.4233\n chisq.test(x, simulate.p.value = TRUE, B = 10000)$p.value\n # around 0.29!\n \n ## Testing for population probabilities\n ## Case A. Tabulated data\n x <- c(A = 20, B = 15, C = 25)\n chisq.test(x)\n chisq.test(as.table(x)) # the same\n x <- c(89,37,30,28,2)\n p <- c(40,20,20,15,5)\n try(\n chisq.test(x, p = p) # gives an error\n )\n chisq.test(x, p = p, rescale.p = TRUE)\n # works\n p <- c(0.40,0.20,0.20,0.19,0.01)\n # Expected count in category 5\n # is 1.86 < 5 ==> chi square approx.\n chisq.test(x, p = p) # maybe doubtful, but is ok!\n chisq.test(x, p = p, simulate.p.value = TRUE)\n \n ## Case B. Raw data\n x <- trunc(5 * runif(100))\n chisq.test(table(x)) # NOT 'chisq.test(x)'!\n\n\n\n## Chi-Square test\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchisq.test(freq)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n\tPearson's Chi-squared test\n\ndata: freq\nX-squared = 175.85, df = 2, p-value < 2.2e-16\n```\n:::\n:::\n\n\nWe reject the null hypothesis that the proportion of seropositive individuals is the same in the young, middle, and old age groups.\n\n\n## Correlation\n\nFirst, we compute correlation by providing two vectors.\n\nLike other functions, if there are `NA`s, you get `NA` as the result. 
But if you specify `use = \"complete.obs\"`, `cor()` computes the correlation using only the non-missing data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncor(df$age, df$IgG_concentration, method=\"pearson\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] NA\n```\n:::\n\n```{.r .cell-code}\ncor(df$age, df$IgG_concentration, method=\"pearson\", use = \"complete.obs\") # use when there are missing data\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[1] 0.2604783\n```\n:::\n:::\n\n\nThere is a small positive correlation between IgG concentration and age.\n\n## Correlation confidence interval\n\nThe function `cor.test()` also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default. \n\n\n::: {.cell}\n\n```{.r .cell-code}\ncor.test(df$age, df$IgG_concentration, method=\"pearson\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n\tPearson's product-moment correlation\n\ndata: df$age and df$IgG_concentration\nt = 6.7717, df = 630, p-value = 2.921e-11\nalternative hypothesis: true correlation is not equal to 0\n95 percent confidence interval:\n 0.1862722 0.3317295\nsample estimates:\n cor \n0.2604783 \n```\n:::\n:::\n\n\n\n## T-test\n\nThe commonly used t-tests are:\n\n- **one-sample t-test** -- used to test the mean of a variable in one group against the null hypothesis mean\n- **two-sample t-test** -- used to test the difference in means of a variable between two groups (null hypothesis - the group means are the *same*)\n\n## T-test\n\nWe can use the `t.test()` function from the `stats` package.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?t.test\n```\n:::\n\nStudent's t-Test\n\nDescription:\n\n Performs one and two sample t-tests on vectors of data.\n\nUsage:\n\n t.test(x, ...)\n \n ## Default S3 method:\n t.test(x, y = NULL,\n alternative = c(\"two.sided\", \"less\", \"greater\"),\n mu = 0, paired = FALSE, var.equal = FALSE,\n conf.level = 0.95, ...)\n \n ## S3 method for class 'formula'\n t.test(formula, data, subset, na.action, 
...)\n \nArguments:\n\n x: a (non-empty) numeric vector of data values.\n\n y: an optional (non-empty) numeric vector of data values.\n\nalternative: a character string specifying the alternative hypothesis,\n must be one of '\"two.sided\"' (default), '\"greater\"' or\n '\"less\"'. You can specify just the initial letter.\n\n mu: a number indicating the true value of the mean (or difference\n in means if you are performing a two sample test).\n\n paired: a logical indicating whether you want a paired t-test.\n\nvar.equal: a logical variable indicating whether to treat the two\n variances as being equal. If 'TRUE' then the pooled variance\n is used to estimate the variance otherwise the Welch (or\n Satterthwaite) approximation to the degrees of freedom is\n used.\n\nconf.level: confidence level of the interval.\n\n formula: a formula of the form 'lhs ~ rhs' where 'lhs' is a numeric\n variable giving the data values and 'rhs' either '1' for a\n one-sample or paired test or a factor with two levels giving\n the corresponding groups. If 'lhs' is of class '\"Pair\"' and\n 'rhs' is '1', a paired test is done.\n\n data: an optional matrix or data frame (or similar: see\n 'model.frame') containing the variables in the formula\n 'formula'. By default the variables are taken from\n 'environment(formula)'.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. Defaults to 'getOption(\"na.action\")'.\n\n ...: further arguments to be passed to or from methods.\n\nDetails:\n\n 'alternative = \"greater\"' is the alternative that 'x' has a larger\n mean than 'y'. For the one-sample case: that the mean is positive.\n\n If 'paired' is 'TRUE' then both 'x' and 'y' must be specified and\n they must be the same length. Missing values are silently removed\n (in pairs if 'paired' is 'TRUE'). If 'var.equal' is 'TRUE' then\n the pooled estimate of the variance is used. 
By default, if\n 'var.equal' is 'FALSE' then the variance is estimated separately\n for both groups and the Welch modification to the degrees of\n freedom is used.\n\n If the input data are effectively constant (compared to the larger\n of the two means) an error is generated.\n\nValue:\n\n A list with class '\"htest\"' containing the following components:\n\nstatistic: the value of the t-statistic.\n\nparameter: the degrees of freedom for the t-statistic.\n\n p.value: the p-value for the test.\n\nconf.int: a confidence interval for the mean appropriate to the\n specified alternative hypothesis.\n\nestimate: the estimated mean or difference in means depending on\n whether it was a one-sample test or a two-sample test.\n\nnull.value: the specified hypothesized value of the mean or mean\n difference depending on whether it was a one-sample test or a\n two-sample test.\n\n stderr: the standard error of the mean (difference), used as\n denominator in the t-statistic formula.\n\nalternative: a character string describing the alternative hypothesis.\n\n method: a character string indicating what type of t-test was\n performed.\n\ndata.name: a character string giving the name(s) of the data.\n\nSee Also:\n\n 'prop.test'\n\nExamples:\n\n require(graphics)\n \n t.test(1:10, y = c(7:20)) # P = .00001855\n t.test(1:10, y = c(7:20, 200)) # P = .1245 -- NOT significant anymore\n \n ## Classical example: Student's sleep data\n plot(extra ~ group, data = sleep)\n ## Traditional interface\n with(sleep, t.test(extra[group == 1], extra[group == 2]))\n \n ## Formula interface\n t.test(extra ~ group, data = sleep)\n \n ## Formula interface to one-sample test\n t.test(extra ~ 1, data = sleep)\n \n ## Formula interface to paired test\n ## The sleep data are actually paired, so could have been in wide format:\n sleep2 <- reshape(sleep, direction = \"wide\", \n idvar = \"ID\", timevar = \"group\")\n t.test(Pair(extra.1, extra.2) ~ 1, data = sleep2)\n\n\n## Running two-sample t-test\n\nThe 
**base R** `t.test()` function comes from the `stats` package. It tests the difference in means of a variable between two groups. By default, it:\n\n- tests whether the difference in means of a variable is equal to 0 (default `mu=0`)\n- uses the \"two sided\" alternative (`alternative = \"two.sided\"`)\n- returns a result assuming confidence level 0.95 (`conf.level = 0.95`)\n- assumes the data are not paired (`paired = FALSE`)\n- does not assume the true variances in the two groups are equal (`var.equal = FALSE`)\n\n## Running two-sample t-test\n\n\n::: {.cell}\n\n```{.r .cell-code}\nIgG_young <- df$IgG_concentration[df$age_group==\"young\"]\nIgG_old <- df$IgG_concentration[df$age_group==\"old\"]\n\nt.test(IgG_young, IgG_old)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n\tWelch Two Sample t-test\n\ndata: IgG_young and IgG_old\nt = -6.1969, df = 259.54, p-value = 2.25e-09\nalternative hypothesis: true difference in means is not equal to 0\n95 percent confidence interval:\n -111.09281 -57.51515\nsample estimates:\nmean of x mean of y \n 45.05056 129.35454 \n```\n:::\n:::\n\n\nThe mean IgG concentrations of the young and old groups are 45.05 and 129.35 IU/mL, respectively. 
We reject the null hypothesis that the difference in mean IgG concentration between the young and old groups is 0 IU/mL.\n\n## Linear regression fit in R\n\nTo fit regression models in R, we use the function `glm()` (Generalized Linear Model).\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?glm\n```\n:::\n\nFitting Generalized Linear Models\n\nDescription:\n\n 'glm' is used to fit generalized linear models, specified by\n giving a symbolic description of the linear predictor and a\n description of the error distribution.\n\nUsage:\n\n glm(formula, family = gaussian, data, weights, subset,\n na.action, start = NULL, etastart, mustart, offset,\n control = list(...), model = TRUE, method = \"glm.fit\",\n x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = NULL, ...)\n \n glm.fit(x, y, weights = rep.int(1, nobs),\n start = NULL, etastart = NULL, mustart = NULL,\n offset = rep.int(0, nobs), family = gaussian(),\n control = list(), intercept = TRUE, singular.ok = TRUE)\n \n ## S3 method for class 'glm'\n weights(object, type = c(\"prior\", \"working\"), ...)\n \nArguments:\n\n formula: an object of class '\"formula\"' (or one that can be coerced to\n that class): a symbolic description of the model to be\n fitted. The details of model specification are given under\n 'Details'.\n\n family: a description of the error distribution and link function to\n be used in the model. For 'glm' this can be a character\n string naming a family function, a family function or the\n result of a call to a family function. For 'glm.fit' only\n the third option is supported. (See 'family' for details of\n family functions.)\n\n data: an optional data frame, list or environment (or object\n coercible by 'as.data.frame' to a data frame) containing the\n variables in the model. If not found in 'data', the\n variables are taken from 'environment(formula)', typically\n the environment from which 'glm' is called.\n\n weights: an optional vector of 'prior weights' to be used in the\n fitting process. 
Should be 'NULL' or a numeric vector.\n\n subset: an optional vector specifying a subset of observations to be\n used in the fitting process.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is set by the 'na.action' setting\n of 'options', and is 'na.fail' if that is unset. The\n 'factory-fresh' default is 'na.omit'. Another possible value\n is 'NULL', no action. Value 'na.exclude' can be useful.\n\n start: starting values for the parameters in the linear predictor.\n\netastart: starting values for the linear predictor.\n\n mustart: starting values for the vector of means.\n\n offset: this can be used to specify an _a priori_ known component to\n be included in the linear predictor during fitting. This\n should be 'NULL' or a numeric vector of length equal to the\n number of cases. One or more 'offset' terms can be included\n in the formula instead or as well, and if more than one is\n specified their sum is used. See 'model.offset'.\n\n control: a list of parameters for controlling the fitting process.\n For 'glm.fit' this is passed to 'glm.control'.\n\n model: a logical value indicating whether _model frame_ should be\n included as a component of the returned value.\n\n method: the method to be used in fitting the model. The default\n method '\"glm.fit\"' uses iteratively reweighted least squares\n (IWLS): the alternative '\"model.frame\"' returns the model\n frame and does no fitting.\n\n User-supplied fitting functions can be supplied either as a\n function or a character string naming a function, with a\n function which takes the same arguments as 'glm.fit'. 
If\n specified as a character string it is looked up from within\n the 'stats' namespace.\n\n x, y: For 'glm': logical values indicating whether the response\n vector and model matrix used in the fitting process should be\n returned as components of the returned value.\n\n For 'glm.fit': 'x' is a design matrix of dimension 'n * p',\n and 'y' is a vector of observations of length 'n'.\n\nsingular.ok: logical; if 'FALSE' a singular fit is an error.\n\ncontrasts: an optional list. See the 'contrasts.arg' of\n 'model.matrix.default'.\n\nintercept: logical. Should an intercept be included in the _null_\n model?\n\n object: an object inheriting from class '\"glm\"'.\n\n type: character, partial matching allowed. Type of weights to\n extract from the fitted model object. Can be abbreviated.\n\n ...: For 'glm': arguments to be used to form the default 'control'\n argument if it is not supplied directly.\n\n For 'weights': further arguments passed to or from other\n methods.\n\nDetails:\n\n A typical predictor has the form 'response ~ terms' where\n 'response' is the (numeric) response vector and 'terms' is a\n series of terms which specifies a linear predictor for 'response'.\n For 'binomial' and 'quasibinomial' families the response can also\n be specified as a 'factor' (when the first level denotes failure\n and all others success) or as a two-column matrix with the columns\n giving the numbers of successes and failures. A terms\n specification of the form 'first + second' indicates all the terms\n in 'first' together with all the terms in 'second' with any\n duplicates removed.\n\n A specification of the form 'first:second' indicates the set of\n terms obtained by taking the interactions of all terms in 'first'\n with all terms in 'second'. The specification 'first*second'\n indicates the _cross_ of 'first' and 'second'. 
This is the same\n as 'first + second + first:second'.\n\n The terms in the formula will be re-ordered so that main effects\n come first, followed by the interactions, all second-order, all\n third-order and so on: to avoid this pass a 'terms' object as the\n formula.\n\n Non-'NULL' 'weights' can be used to indicate that different\n observations have different dispersions (with the values in\n 'weights' being inversely proportional to the dispersions); or\n equivalently, when the elements of 'weights' are positive integers\n w_i, that each response y_i is the mean of w_i unit-weight\n observations. For a binomial GLM prior weights are used to give\n the number of trials when the response is the proportion of\n successes: they would rarely be used for a Poisson GLM.\n\n 'glm.fit' is the workhorse function: it is not normally called\n directly but can be more efficient where the response vector,\n design matrix and family have already been calculated.\n\n If more than one of 'etastart', 'start' and 'mustart' is\n specified, the first in the list will be used. It is often\n advisable to supply starting values for a 'quasi' family, and also\n for families with unusual links such as 'gaussian(\"log\")'.\n\n All of 'weights', 'subset', 'offset', 'etastart' and 'mustart' are\n evaluated in the same way as variables in 'formula', that is first\n in 'data' and then in the environment of 'formula'.\n\n For the background to warning messages about 'fitted probabilities\n numerically 0 or 1 occurred' for binomial GLMs, see Venables &\n Ripley (2002, pp. 197-8).\n\nValue:\n\n 'glm' returns an object of class inheriting from '\"glm\"' which\n inherits from the class '\"lm\"'. See later in this section. 
If a\n non-standard 'method' is used, the object will also inherit from\n the class (if any) returned by that function.\n\n The function 'summary' (i.e., 'summary.glm') can be used to obtain\n or print a summary of the results and the function 'anova' (i.e.,\n 'anova.glm') to produce an analysis of variance table.\n\n The generic accessor functions 'coefficients', 'effects',\n 'fitted.values' and 'residuals' can be used to extract various\n useful features of the value returned by 'glm'.\n\n 'weights' extracts a vector of weights, one for each case in the\n fit (after subsetting and 'na.action').\n\n An object of class '\"glm\"' is a list containing at least the\n following components:\n\ncoefficients: a named vector of coefficients\n\nresiduals: the _working_ residuals, that is the residuals in the final\n iteration of the IWLS fit. Since cases with zero weights are\n omitted, their working residuals are 'NA'.\n\nfitted.values: the fitted mean values, obtained by transforming the\n linear predictors by the inverse of the link function.\n\n rank: the numeric rank of the fitted linear model.\n\n family: the 'family' object used.\n\nlinear.predictors: the linear fit on link scale.\n\ndeviance: up to a constant, minus twice the maximized log-likelihood.\n Where sensible, the constant is chosen so that a saturated\n model has deviance zero.\n\n aic: A version of Akaike's _An Information Criterion_, minus twice\n the maximized log-likelihood plus twice the number of\n parameters, computed via the 'aic' component of the family.\n For binomial and Poison families the dispersion is fixed at\n one and the number of parameters is the number of\n coefficients. For gaussian, Gamma and inverse gaussian\n families the dispersion is estimated from the residual\n deviance, and the number of parameters is the number of\n coefficients plus one. 
For a gaussian family the MLE of the\n dispersion is used so this is a valid value of AIC, but for\n Gamma and inverse gaussian families it is not. For families\n fitted by quasi-likelihood the value is 'NA'.\n\nnull.deviance: The deviance for the null model, comparable with\n 'deviance'. The null model will include the offset, and an\n intercept if there is one in the model. Note that this will\n be incorrect if the link function depends on the data other\n than through the fitted mean: specify a zero offset to force\n a correct calculation.\n\n iter: the number of iterations of IWLS used.\n\n weights: the _working_ weights, that is the weights in the final\n iteration of the IWLS fit.\n\nprior.weights: the weights initially supplied, a vector of '1's if none\n were.\n\ndf.residual: the residual degrees of freedom.\n\n df.null: the residual degrees of freedom for the null model.\n\n y: if requested (the default) the 'y' vector used. (It is a\n vector even for a binomial model.)\n\n x: if requested, the model matrix.\n\n model: if requested (the default), the model frame.\n\nconverged: logical. Was the IWLS algorithm judged to have converged?\n\nboundary: logical. 
Is the fitted value on the boundary of the\n attainable values?\n\n call: the matched call.\n\n formula: the formula supplied.\n\n terms: the 'terms' object used.\n\n data: the 'data argument'.\n\n offset: the offset vector used.\n\n control: the value of the 'control' argument used.\n\n method: the name of the fitter function used (when provided as a\n 'character' string to 'glm()') or the fitter 'function' (when\n provided as that).\n\ncontrasts: (where relevant) the contrasts used.\n\n xlevels: (where relevant) a record of the levels of the factors used\n in fitting.\n\nna.action: (where relevant) information returned by 'model.frame' on\n the special handling of 'NA's.\n\n In addition, non-empty fits will have components 'qr', 'R' and\n 'effects' relating to the final weighted linear fit.\n\n Objects of class '\"glm\"' are normally of class 'c(\"glm\", \"lm\")',\n that is inherit from class '\"lm\"', and well-designed methods for\n class '\"lm\"' will be applied to the weighted linear model at the\n final iteration of IWLS. However, care is needed, as extractor\n functions for class '\"glm\"' such as 'residuals' and 'weights' do\n *not* just pick out the component of the fit with the same name.\n\n If a 'binomial' 'glm' model was specified by giving a two-column\n response, the weights returned by 'prior.weights' are the total\n numbers of cases (factored by the supplied case weights) and the\n component 'y' of the result is the proportion of successes.\n\nFitting functions:\n\n The argument 'method' serves two purposes. One is to allow the\n model frame to be recreated with no fitting. The other is to\n allow the default fitting function 'glm.fit' to be replaced by a\n function which takes the same arguments and uses a different\n fitting algorithm. 
If 'glm.fit' is supplied as a character string\n it is used to search for a function of that name, starting in the\n 'stats' namespace.\n\n The class of the object return by the fitter (if any) will be\n prepended to the class returned by 'glm'.\n\nAuthor(s):\n\n The original R implementation of 'glm' was written by Simon Davies\n working for Ross Ihaka at the University of Auckland, but has\n since been extensively re-written by members of the R Core team.\n\n The design was inspired by the S function of the same name\n described in Hastie & Pregibon (1992).\n\nReferences:\n\n Dobson, A. J. (1990) _An Introduction to Generalized Linear\n Models._ London: Chapman and Hall.\n\n Hastie, T. J. and Pregibon, D. (1992) _Generalized linear models._\n Chapter 6 of _Statistical Models in S_ eds J. M. Chambers and T.\n J. Hastie, Wadsworth & Brooks/Cole.\n\n McCullagh P. and Nelder, J. A. (1989) _Generalized Linear Models._\n London: Chapman and Hall.\n\n Venables, W. N. and Ripley, B. D. (2002) _Modern Applied\n Statistics with S._ New York: Springer.\n\nSee Also:\n\n 'anova.glm', 'summary.glm', etc. 
for 'glm' methods, and the\n generic functions 'anova', 'summary', 'effects', 'fitted.values',\n and 'residuals'.\n\n 'lm' for non-generalized _linear_ models (which SAS calls GLMs,\n for 'general' linear models).\n\n 'loglin' and 'loglm' (package 'MASS') for fitting log-linear\n models (which binomial and Poisson GLMs are) to contingency\n tables.\n\n 'bigglm' in package 'biglm' for an alternative way to fit GLMs to\n large datasets (especially those with many cases).\n\n 'esoph', 'infert' and 'predict.glm' have examples of fitting\n binomial glms.\n\nExamples:\n\n ## Dobson (1990) Page 93: Randomized Controlled Trial :\n counts <- c(18,17,15,20,10,20,25,13,12)\n outcome <- gl(3,1,9)\n treatment <- gl(3,3)\n data.frame(treatment, outcome, counts) # showing data\n glm.D93 <- glm(counts ~ outcome + treatment, family = poisson())\n anova(glm.D93)\n summary(glm.D93)\n ## Computing AIC [in many ways]:\n (A0 <- AIC(glm.D93))\n (ll <- logLik(glm.D93))\n A1 <- -2*c(ll) + 2*attr(ll, \"df\")\n A2 <- glm.D93$family$aic(counts, mu=fitted(glm.D93), wt=1) +\n 2 * length(coef(glm.D93))\n stopifnot(exprs = {\n all.equal(A0, A1)\n all.equal(A1, A2)\n all.equal(A1, glm.D93$aic)\n })\n \n \n ## an example with offsets from Venables & Ripley (2002, p.189)\n utils::data(anorexia, package = \"MASS\")\n \n anorex.1 <- glm(Postwt ~ Prewt + Treat + offset(Prewt),\n family = gaussian, data = anorexia)\n summary(anorex.1)\n \n \n # A Gamma example, from McCullagh & Nelder (1989, pp. 300-2)\n clotting <- data.frame(\n u = c(5,10,15,20,30,40,60,80,100),\n lot1 = c(118,58,42,35,27,25,21,19,18),\n lot2 = c(69,35,26,21,18,16,13,12,12))\n summary(glm(lot1 ~ log(u), data = clotting, family = Gamma))\n summary(glm(lot2 ~ log(u), data = clotting, family = Gamma))\n ## Aliased (\"S\"ingular) -> 1 NA coefficient\n (fS <- glm(lot2 ~ log(u) + log(u^2), data = clotting, family = Gamma))\n tools::assertError(update(fS, singular.ok=FALSE), verbose=interactive())\n ## -> .. 
\"singular fit encountered\"\n \n ## Not run:\n \n ## for an example of the use of a terms object as a formula\n demo(glm.vr)\n ## End(Not run)\n\n\n## Linear regression fit in R\n\nWe tend to focus on three arguments:\n\n- `formula` -- model formula written using names of columns in our data\n- `data` -- our data frame\n- `family` -- error distribution and link function\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())\nfit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = \"logit\"))\n```\n:::\n\n\n## `summary.glm()`\n\nThe `summary()` function when applied to a fit object based on a glm is technically the `summary.glm()` function and produces details of the model fit. Note on object oriented code.\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/rstudio_script.png){width=200%}\n:::\n:::\n\nSummarizing Generalized Linear Model Fits\n\nDescription:\n\n These functions are all 'methods' for class 'glm' or 'summary.glm'\n objects.\n\nUsage:\n\n ## S3 method for class 'glm'\n summary(object, dispersion = NULL, correlation = FALSE,\n symbolic.cor = FALSE, ...)\n \n ## S3 method for class 'summary.glm'\n print(x, digits = max(3, getOption(\"digits\") - 3),\n symbolic.cor = x$symbolic.cor,\n signif.stars = getOption(\"show.signif.stars\"),\n show.residuals = FALSE, ...)\n \nArguments:\n\n object: an object of class '\"glm\"', usually, a result of a call to\n 'glm'.\n\n x: an object of class '\"summary.glm\"', usually, a result of a\n call to 'summary.glm'.\n\ndispersion: the dispersion parameter for the family used. Either a\n single numerical value or 'NULL' (the default), when it is\n inferred from 'object' (see 'Details').\n\ncorrelation: logical; if 'TRUE', the correlation matrix of the\n estimated parameters is returned and printed.\n\n digits: the number of significant digits to use when printing.\n\nsymbolic.cor: logical. 
If 'TRUE', print the correlations in a symbolic\n form (see 'symnum') rather than as numbers.\n\nsignif.stars: logical. If 'TRUE', 'significance stars' are printed for\n each coefficient.\n\nshow.residuals: logical. If 'TRUE' then a summary of the deviance\n residuals is printed at the head of the output.\n\n ...: further arguments passed to or from other methods.\n\nDetails:\n\n 'print.summary.glm' tries to be smart about formatting the\n coefficients, standard errors, etc. and additionally gives\n 'significance stars' if 'signif.stars' is 'TRUE'. The\n 'coefficients' component of the result gives the estimated\n coefficients and their estimated standard errors, together with\n their ratio. This third column is labelled 't ratio' if the\n dispersion is estimated, and 'z ratio' if the dispersion is known\n (or fixed by the family). A fourth column gives the two-tailed\n p-value corresponding to the t or z ratio based on a Student t or\n Normal reference distribution. (It is possible that the\n dispersion is not known and there are no residual degrees of\n freedom from which to estimate it. In that case the estimate is\n 'NaN'.)\n\n Aliased coefficients are omitted in the returned object but\n restored by the 'print' method.\n\n Correlations are printed to two decimal places (or symbolically):\n to see the actual correlations print 'summary(object)$correlation'\n directly.\n\n The dispersion of a GLM is not used in the fitting process, but it\n is needed to find standard errors. 
If 'dispersion' is not\n supplied or 'NULL', the dispersion is taken as '1' for the\n 'binomial' and 'Poisson' families, and otherwise estimated by the\n residual Chisquared statistic (calculated from cases with non-zero\n weights) divided by the residual degrees of freedom.\n\n 'summary' can be used with Gaussian 'glm' fits to handle the case\n of a linear regression with known error variance, something not\n handled by 'summary.lm'.\n\nValue:\n\n 'summary.glm' returns an object of class '\"summary.glm\"', a list\n with components\n\n call: the component from 'object'.\n\n family: the component from 'object'.\n\ndeviance: the component from 'object'.\n\ncontrasts: the component from 'object'.\n\ndf.residual: the component from 'object'.\n\nnull.deviance: the component from 'object'.\n\n df.null: the component from 'object'.\n\ndeviance.resid: the deviance residuals: see 'residuals.glm'.\n\ncoefficients: the matrix of coefficients, standard errors, z-values and\n p-values. Aliased coefficients are omitted.\n\n aliased: named logical vector showing if the original coefficients are\n aliased.\n\ndispersion: either the supplied argument or the inferred/estimated\n dispersion if the former is 'NULL'.\n\n df: a 3-vector of the rank of the model and the number of\n residual degrees of freedom, plus number of coefficients\n (including aliased ones).\n\ncov.unscaled: the unscaled ('dispersion = 1') estimated covariance\n matrix of the estimated coefficients.\n\ncov.scaled: ditto, scaled by 'dispersion'.\n\ncorrelation: (only if 'correlation' is true.) The estimated\n correlations of the estimated coefficients.\n\nsymbolic.cor: (only if 'correlation' is true.) 
The value of the\n argument 'symbolic.cor'.\n\nSee Also:\n\n 'glm', 'summary'.\n\nExamples:\n\n ## For examples see example(glm)\n\n\n\n## Linear regression fit in R\n\nLet's look at the output...\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(fit1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nglm(formula = IgG_concentration ~ age + gender + slum, family = gaussian(), \n data = df)\n\nCoefficients:\n Estimate Std. Error t value Pr(>|t|) \n(Intercept) 46.132 16.774 2.750 0.00613 ** \nage 9.324 1.388 6.718 4.15e-11 ***\ngenderMale -9.655 11.543 -0.836 0.40321 \nslumNon slum -20.353 14.299 -1.423 0.15513 \nslumSlum -29.705 25.009 -1.188 0.23536 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1\n\n(Dispersion parameter for gaussian family taken to be 20918.39)\n\n Null deviance: 14141483 on 631 degrees of freedom\nResidual deviance: 13115831 on 627 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 8087.9\n\nNumber of Fisher Scoring iterations: 2\n```\n:::\n\n```{.r .cell-code}\nsummary(fit2)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\nCall:\nglm(formula = seropos ~ age_group + gender + slum, family = binomial(link = \"logit\"), \n data = df)\n\nCoefficients:\n Estimate Std. Error z value Pr(>|z|) \n(Intercept) -1.3220 0.2516 -5.254 1.49e-07 ***\nage_groupmiddle 1.9020 0.2133 8.916 < 2e-16 ***\nage_groupold 2.8443 0.2522 11.278 < 2e-16 ***\ngenderMale -0.1725 0.1895 -0.910 0.363 \nslumNon slum -0.1099 0.2329 -0.472 0.637 \nslumSlum -0.1073 0.4118 -0.261 0.794 \n---\nSignif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1\n\n(Dispersion parameter for binomial family taken to be 1)\n\n Null deviance: 866.98 on 631 degrees of freedom\nResidual deviance: 679.10 on 626 degrees of freedom\n (19 observations deleted due to missingness)\nAIC: 691.1\n\nNumber of Fisher Scoring iterations: 4\n```\n:::\n:::\n\n\n\n\n## Summary\n\n- Use `cor()` or `cor.test()` to calculate the correlation between two numeric vectors.\n- `t.test()` tests a mean against a null value, or the difference in means between two groups.\n- `glm()` fits regression models, and `summary()` shows the details of the model fit.\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Introduction to R for Public Health Researchers\" Johns Hopkins University](https://jhudatascience.org/intro_to_r/)\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/_freeze/modules/Module10-DataVisualization/execute-results/html.json b/_freeze/modules/Module10-DataVisualization/execute-results/html.json index 217dc3b..059f6ff 100644 --- a/_freeze/modules/Module10-DataVisualization/execute-results/html.json +++ b/_freeze/modules/Module10-DataVisualization/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "e546ac5cfa3fec481cca7f255f1ede69", + "hash": "f5db9a97f56b293b9a271bfe4464641d", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n scrollable: true\n smaller: true\n---\n\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n\n\n:::\n:::\n\n\n\n## Prep data\n\nCreate `age_group` 
as a three-level factor variable.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \n ifelse(df$age>10, \"old\", NA)))\ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\n\nCreate `seropos`, a binary variable indicating seropositivity, defined as an antibody concentration of at least 10 mIU/mL.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, \n\t\t\t\t\t\t\t\t\t\tifelse(df$IgG_concentration>=10, 1, NA))\n```\n:::\n\n\n\n## Base R data visualization functions\n\nThe Base R 'graphics' package has a ton of graphics options. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(help = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stdout}\n\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart 
Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical 
Units\nxspline Draw an X-spline\n```\n\n\n:::\n:::\n\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. The legend\n\n## 1. Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n- The `mfrow` attribute specifies that we have one row and two columns of plots, that is, two plots side by side.\n- The `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second the margin to the left of the plot (5), the third the margin above the plot (4), and the fourth the margin to the right of the plot (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/par.png){width=70%}\n:::\n:::\n\n\n\n\n## Lots of parameter options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or specific to a certain plot option. \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. 
Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. ('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. 
However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. 
(Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. 
by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. 
Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. 
If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. 
Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. 
The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. 
If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). 
Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. 
(The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). 
For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. 
Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. 
(1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n\n## Common parameter options\n\nEight useful arguments help improve the readability of the plot:\n\n- `xlab`: specifies the x-axis label of the plot\n- `ylab`: specifies the y-axis label\n- `main`: specifies the title of your graph\n- `pch`: specifies the plotting symbol used for points\n- `lty`: specifies the line type of your graph\n- `lwd`: specifies the line thickness\n- `cex`: specifies the size of text and symbols, as a magnification relative to the default\n- `col`: specifies the colors for your graph\n\nWe will explore the use of these arguments below.\n\n## Common parameter options\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot. This is where you specify which variables in the data frame you want to plot.\n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot\n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. 
This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. 
If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. 
Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... 
:\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n\n## `hist()` example\n\nReminder\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n\n## `plot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n\n## `plot()` example\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (mIU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n\n\n## `boxplot()` Help File\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 
0.5, outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. 
A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers _et\n al_, 1983, p. 62). See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). 
_Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n\n## `boxplot()` example\n\nReminder\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-18-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-18-2.png){width=960}\n:::\n:::\n\n\n\n\n## 
`barplot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted. 
\n\nReminder\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-21-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop <- prop.table(freq)\nbarplot(prop)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-21-2.png){width=960}\n:::\n:::\n\n\n\n## 3. Legend!\n\nIn base R plotting, the legend is not generated automatically. This is nice because it gives you a huge amount of control over how your legend looks, but it also makes it easy to mislabel your colors, symbols, line types, etc. So check that every legend entry matches what is actually drawn.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. 
Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. 
Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. 
If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n 
text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 
1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 
4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n\n\n:::\n:::\n\n\n\n\n\n## Add legend to the plot\n\nReminder\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop, col=c(\"darkblue\",\"red\"), ylim=c(0,0.7), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.7,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-24-1.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\ntot.per.age.group <- colSums(freq)\nage.seropos.matrix <- t(t(freq)/tot.per.age.group)\ncolnames(age.seropos.matrix) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(age.seropos.matrix, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` example\n\nNow, let's look at seropositivity by two individual-level characteristics in the same plot. \n\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(age.seropos.matrix, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=1, y=1.35, fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(slum.seropos.matrix, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=1, y=1.35, fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png){width=960}\n:::\n:::\n\n\n\n\n\n## Summary\n\n-\t\t\n\n## Acknowledgements\n\nThese are the materials I looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n-\t\t[\"Base R margins: a cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n", + "markdown": "---\ntitle: \"Module 10: Data Visualization\"\nformat: \n revealjs:\n 
scrollable: true\n smaller: true\n toc: false\n---\n\n\n## Learning Objectives\n\nAfter module 10, you should be able to:\n\n- Create Base R plots\n\n## Import data for this module\n\nLet's read in our data (again) and take a quick look.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndf <- read.csv(file = \"data/serodata.csv\") #relative path\nhead(x=df, n=3)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n observation_id IgG_concentration age gender slum\n1 5772 0.3176895 2 Female Non slum\n2 8095 3.4368231 4 Female Non slum\n3 9784 0.3000000 4 Male Non slum\n```\n:::\n:::\n\n\n## Prep data\n\nCreate `age_group`, a three-level factor variable.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$age_group <- ifelse(df$age <= 5, \"young\", \n ifelse(df$age<=10 & df$age>5, \"middle\", \"old\")) \ndf$age_group <- factor(df$age_group, levels=c(\"young\", \"middle\", \"old\"))\n```\n:::\n\n\nCreate `seropos`, a binary variable that is 1 (seropositive) if the antibody concentration is 10 IU/mL or higher and 0 otherwise.\n\n::: {.cell}\n\n```{.r .cell-code}\ndf$seropos <- ifelse(df$IgG_concentration<10, 0, 1)\n```\n:::\n\n\n## Base R data visualization functions\n\nThe Base R 'graphics' package has a ton of graphics options. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhelp(package = \"graphics\")\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stderr}\n```\nRegistered S3 method overwritten by 'printr':\n method from \n knit_print.data.frame rmarkdown\n```\n:::\n\n::: {.cell-output .cell-output-stdout}\n```\n\t\tInformation on package 'graphics'\n\nDescription:\n\nPackage: graphics\nVersion: 4.3.1\nPriority: base\nTitle: The R Graphics Package\nAuthor: R Core Team and contributors worldwide\nMaintainer: R Core Team \nContact: R-help mailing list \nDescription: R functions for base graphics.\nImports: grDevices\nLicense: Part of R 4.3.1\nNeedsCompilation: yes\nBuilt: R 4.3.1; aarch64-apple-darwin20; 2023-06-16\n 21:53:01 UTC; unix\n\nIndex:\n\nAxis Generic Function to Add an Axis to a Plot\nabline Add Straight Lines to a Plot\narrows Add Arrows to a Plot\nassocplot Association Plots\naxTicks Compute Axis Tickmark Locations\naxis Add an Axis to a Plot\naxis.POSIXct Date and Date-time Plotting Functions\nbarplot Bar Plots\nbox Draw a Box around a Plot\nboxplot Box Plots\nboxplot.matrix Draw a Boxplot for each Column (Row) of a\n Matrix\nbxp Draw Box Plots from Summaries\ncdplot Conditional Density Plots\nclip Set Clipping Region\ncontour Display Contours\ncoplot Conditioning Plots\ncurve Draw Function Plots\ndotchart Cleveland's Dot Plots\nfilled.contour Level (Contour) Plots\nfourfoldplot Fourfold Plots\nframe Create / Start a New Plot Frame\ngraphics-package The R Graphics Package\ngrconvertX Convert between Graphics Coordinate Systems\ngrid Add Grid to a Plot\nhist Histograms\nhist.POSIXt Histogram of a Date or Date-Time Object\nidentify Identify Points in a Scatter Plot\nimage Display a Color Image\nlayout Specifying Complex Plot Arrangements\nlegend Add Legends to Plots\nlines Add Connected Line Segments to a Plot\nlocator Graphical Input\nmatplot Plot Columns of Matrices\nmosaicplot Mosaic Plots\nmtext Write Text into the Margins of a Plot\npairs Scatterplot 
Matrices\npanel.smooth Simple Panel Plot\npar Set or Query Graphical Parameters\npersp Perspective Plots\npie Pie Charts\nplot.data.frame Plot Method for Data Frames\nplot.default The Default Scatterplot Function\nplot.design Plot Univariate Effects of a Design or Model\nplot.factor Plotting Factor Variables\nplot.formula Formula Notation for Scatterplots\nplot.histogram Plot Histograms\nplot.raster Plotting Raster Images\nplot.table Plot Methods for 'table' Objects\nplot.window Set up World Coordinates for Graphics Window\nplot.xy Basic Internal Plot Function\npoints Add Points to a Plot\npolygon Polygon Drawing\npolypath Path Drawing\nrasterImage Draw One or More Raster Images\nrect Draw One or More Rectangles\nrug Add a Rug to a Plot\nscreen Creating and Controlling Multiple Screens on a\n Single Device\nsegments Add Line Segments to a Plot\nsmoothScatter Scatterplots with Smoothed Densities Color\n Representation\nspineplot Spine Plots and Spinograms\nstars Star (Spider/Radar) Plots and Segment Diagrams\nstem Stem-and-Leaf Plots\nstripchart 1-D Scatter Plots\nstrwidth Plotting Dimensions of Character Strings and\n Math Expressions\nsunflowerplot Produce a Sunflower Scatter Plot\nsymbols Draw Symbols (Circles, Squares, Stars,\n Thermometers, Boxplots)\ntext Add Text to a Plot\ntitle Plot Annotation\nxinch Graphical Units\nxspline Draw an X-spline\n```\n:::\n:::\n\n\n\n\n## Base R Plotting\n\nTo make a plot you often need to specify the following features:\n\n1. Parameters\n2. Plot attributes\n3. The legend\n\n## 1. Parameters\n\nThe parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via `par()` before you call the plot creates ‘global’ settings for your plot.\n\nIn the example below, we have set two commonly used optional attributes in the global plot settings.\n\n-\tThe `mfrow` specifies that we have one row and two columns of plots — that is, two plots side by side. 
\n-\tThe `mar` attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second the margin to the left of the plot (5), the third the margin above the plot (4), and the fourth the margin to the right of the plot (1).\n\n```\npar(mfrow = c(1,2), mar = c(5,5,4,1))\n```\n\n\n## 1. Parameters\n\n\n::: {.cell figwidth='100%'}\n::: {.cell-output-display}\n![](images/par.png)\n:::\n:::\n\n\n\n## Lots of parameter options\n\nHowever, there are many more parameter options that can be specified in the 'global' settings or specific to a certain plot option. \n\n\n::: {.cell}\n\n```{.r .cell-code}\n?par\n```\n:::\n\nSet or Query Graphical Parameters\n\nDescription:\n\n 'par' can be used to set or query graphical parameters.\n Parameters can be set by specifying them as arguments to 'par' in\n 'tag = value' form, or by passing them as a list of tagged values.\n\nUsage:\n\n par(..., no.readonly = FALSE)\n \n (...., = )\n \nArguments:\n\n ...: arguments in 'tag = value' form, a single list of tagged\n values, or character vectors of parameter names. Supported\n parameters are described in the 'Graphical Parameters'\n section.\n\nno.readonly: logical; if 'TRUE' and there are no other arguments, only\n parameters are returned which can be set by a subsequent\n 'par()' call _on the same device_.\n\nDetails:\n\n Each device has its own set of graphical parameters. If the\n current device is the null device, 'par' will open a new device\n before querying/setting parameters. (What device is controlled by\n 'options(\"device\")'.)\n\n Parameters are queried by giving one or more character vectors of\n parameter names to 'par'.\n\n 'par()' (no arguments) or 'par(no.readonly = TRUE)' is used to get\n _all_ the graphical parameters (as a named list). Their names are\n currently taken from the unexported variable 'graphics:::.Pars'.\n\n _*R.O.*_ indicates _*read-only arguments*_: These may only be used\n in queries and cannot be set. 
('\"cin\"', '\"cra\"', '\"csi\"',\n '\"cxy\"', '\"din\"' and '\"page\"' are always read-only.)\n\n Several parameters can only be set by a call to 'par()':\n\n • '\"ask\"',\n\n • '\"fig\"', '\"fin\"',\n\n • '\"lheight\"',\n\n • '\"mai\"', '\"mar\"', '\"mex\"', '\"mfcol\"', '\"mfrow\"', '\"mfg\"',\n\n • '\"new\"',\n\n • '\"oma\"', '\"omd\"', '\"omi\"',\n\n • '\"pin\"', '\"plt\"', '\"ps\"', '\"pty\"',\n\n • '\"usr\"',\n\n • '\"xlog\"', '\"ylog\"',\n\n • '\"ylbias\"'\n\n The remaining parameters can also be set as arguments (often via\n '...') to high-level plot functions such as 'plot.default',\n 'plot.window', 'points', 'lines', 'abline', 'axis', 'title',\n 'text', 'mtext', 'segments', 'symbols', 'arrows', 'polygon',\n 'rect', 'box', 'contour', 'filled.contour' and 'image'. Such\n settings will be active during the execution of the function,\n only. However, see the comments on 'bg', 'cex', 'col', 'lty',\n 'lwd' and 'pch' which may be taken as _arguments_ to certain plot\n functions rather than as graphical parameters.\n\n The meaning of 'character size' is not well-defined: this is set\n up for the device taking 'pointsize' into account but often not\n the actual font family in use. Internally the corresponding pars\n ('cra', 'cin', 'cxy' and 'csi') are used only to set the\n inter-line spacing used to convert 'mar' and 'oma' to physical\n margins. (The same inter-line spacing multiplied by 'lheight' is\n used for multi-line strings in 'text' and 'strheight'.)\n\n Note that graphical parameters are suggestions: plotting functions\n and devices need not make use of them (and this is particularly\n true of non-default methods for e.g. 'plot').\n\nValue:\n\n When parameters are set, their previous values are returned in an\n invisible named list. Such a list can be passed as an argument to\n 'par' to restore the parameter values. 
Use 'par(no.readonly =\n TRUE)' for the full list of parameters that can be restored.\n However, restoring all of these is not wise: see the 'Note'\n section.\n\n When just one parameter is queried, the value of that parameter is\n returned as (atomic) vector. When two or more parameters are\n queried, their values are returned in a list, with the list names\n giving the parameters.\n\n Note the inconsistency: setting one parameter returns a list, but\n querying one parameter returns a vector.\n\nGraphical Parameters:\n\n 'adj' The value of 'adj' determines the way in which text strings\n are justified in 'text', 'mtext' and 'title'. A value of '0'\n produces left-justified text, '0.5' (the default) centered\n text and '1' right-justified text. (Any value in [0, 1] is\n allowed, and on most devices values outside that interval\n will also work.)\n\n Note that the 'adj' _argument_ of 'text' also allows 'adj =\n c(x, y)' for different adjustment in x- and y- directions.\n Note that whereas for 'text' it refers to positioning of text\n about a point, for 'mtext' and 'title' it controls placement\n within the plot or device region.\n\n 'ann' If set to 'FALSE', high-level plotting functions calling\n 'plot.default' do not annotate the plots they produce with\n axis titles and overall titles. The default is to do\n annotation.\n\n 'ask' logical. If 'TRUE' (and the R session is interactive) the\n user is asked for input, before a new figure is drawn. As\n this applies to the device, it also affects output by\n packages 'grid' and 'lattice'. It can be set even on\n non-screen devices but may have no effect there.\n\n This not really a graphics parameter, and its use is\n deprecated in favour of 'devAskNewPage'.\n\n 'bg' The color to be used for the background of the device region.\n When called from 'par()' it also sets 'new = FALSE'. See\n section 'Color Specification' for suitable values. 
For many\n devices the initial value is set from the 'bg' argument of\n the device, and for the rest it is normally '\"white\"'.\n\n Note that some graphics functions such as 'plot.default' and\n 'points' have an _argument_ of this name with a different\n meaning.\n\n 'bty' A character string which determined the type of 'box' which\n is drawn about plots. If 'bty' is one of '\"o\"' (the\n default), '\"l\"', '\"7\"', '\"c\"', '\"u\"', or '\"]\"' the resulting\n box resembles the corresponding upper case letter. A value\n of '\"n\"' suppresses the box.\n\n 'cex' A numerical value giving the amount by which plotting text\n and symbols should be magnified relative to the default.\n This starts as '1' when a device is opened, and is reset when\n the layout is changed, e.g. by setting 'mfrow'.\n\n Note that some graphics functions such as 'plot.default' have\n an _argument_ of this name which _multiplies_ this graphical\n parameter, and some functions such as 'points' and 'text'\n accept a vector of values which are recycled.\n\n 'cex.axis' The magnification to be used for axis annotation\n relative to the current setting of 'cex'.\n\n 'cex.lab' The magnification to be used for x and y labels relative\n to the current setting of 'cex'.\n\n 'cex.main' The magnification to be used for main titles relative\n to the current setting of 'cex'.\n\n 'cex.sub' The magnification to be used for sub-titles relative to\n the current setting of 'cex'.\n\n 'cin' _*R.O.*_; character size '(width, height)' in inches. These\n are the same measurements as 'cra', expressed in different\n units.\n\n 'col' A specification for the default plotting color. See section\n 'Color Specification'.\n\n Some functions such as 'lines' and 'text' accept a vector of\n values which are recycled and may be interpreted slightly\n differently.\n\n 'col.axis' The color to be used for axis annotation. Defaults to\n '\"black\"'.\n\n 'col.lab' The color to be used for x and y labels. 
Defaults to\n '\"black\"'.\n\n 'col.main' The color to be used for plot main titles. Defaults to\n '\"black\"'.\n\n 'col.sub' The color to be used for plot sub-titles. Defaults to\n '\"black\"'.\n\n 'cra' _*R.O.*_; size of default character '(width, height)' in\n 'rasters' (pixels). Some devices have no concept of pixels\n and so assume an arbitrary pixel size, usually 1/72 inch.\n These are the same measurements as 'cin', expressed in\n different units.\n\n 'crt' A numerical value specifying (in degrees) how single\n characters should be rotated. It is unwise to expect values\n other than multiples of 90 to work. Compare with 'srt' which\n does string rotation.\n\n 'csi' _*R.O.*_; height of (default-sized) characters in inches.\n The same as 'par(\"cin\")[2]'.\n\n 'cxy' _*R.O.*_; size of default character '(width, height)' in\n user coordinate units. 'par(\"cxy\")' is\n 'par(\"cin\")/par(\"pin\")' scaled to user coordinates. Note\n that 'c(strwidth(ch), strheight(ch))' for a given string 'ch'\n is usually much more precise.\n\n 'din' _*R.O.*_; the device dimensions, '(width, height)', in\n inches. See also 'dev.size', which is updated immediately\n when an on-screen device windows is re-sized.\n\n 'err' (_Unimplemented_; R is silent when points outside the plot\n region are _not_ plotted.) The degree of error reporting\n desired.\n\n 'family' The name of a font family for drawing text. The maximum\n allowed length is 200 bytes. This name gets mapped by each\n graphics device to a device-specific font description. The\n default value is '\"\"' which means that the default device\n fonts will be used (and what those are should be listed on\n the help page for the device). Standard values are\n '\"serif\"', '\"sans\"' and '\"mono\"', and the Hershey font\n families are also available. (Devices may define others, and\n some devices will ignore this setting completely. 
Names\n starting with '\"Hershey\"' are treated specially and should\n only be used for the built-in Hershey font families.) This\n can be specified inline for 'text'.\n\n 'fg' The color to be used for the foreground of plots. This is\n the default color used for things like axes and boxes around\n plots. When called from 'par()' this also sets parameter\n 'col' to the same value. See section 'Color Specification'.\n A few devices have an argument to set the initial value,\n which is otherwise '\"black\"'.\n\n 'fig' A numerical vector of the form 'c(x1, x2, y1, y2)' which\n gives the (NDC) coordinates of the figure region in the\n display region of the device. If you set this, unlike S, you\n start a new plot, so to add to an existing plot use 'new =\n TRUE' as well.\n\n 'fin' The figure region dimensions, '(width, height)', in inches.\n If you set this, unlike S, you start a new plot.\n\n 'font' An integer which specifies which font to use for text. If\n possible, device drivers arrange so that 1 corresponds to\n plain text (the default), 2 to bold face, 3 to italic and 4\n to bold italic. Also, font 5 is expected to be the symbol\n font, in Adobe symbol encoding. On some devices font\n families can be selected by 'family' to choose different sets\n of 5 fonts.\n\n 'font.axis' The font to be used for axis annotation.\n\n 'font.lab' The font to be used for x and y labels.\n\n 'font.main' The font to be used for plot main titles.\n\n 'font.sub' The font to be used for plot sub-titles.\n\n 'lab' A numerical vector of the form 'c(x, y, len)' which modifies\n the default way that axes are annotated. The values of 'x'\n and 'y' give the (approximate) number of tickmarks on the x\n and y axes and 'len' specifies the label length. The default\n is 'c(5, 5, 7)'. 
'len' _is unimplemented_ in R.\n\n 'las' numeric in {0,1,2,3}; the style of axis labels.\n\n 0: always parallel to the axis [_default_],\n\n 1: always horizontal,\n\n 2: always perpendicular to the axis,\n\n 3: always vertical.\n\n Also supported by 'mtext'. Note that string/character\n rotation _via_ argument 'srt' to 'par' does _not_ affect the\n axis labels.\n\n 'lend' The line end style. This can be specified as an integer or\n string:\n\n '0' and '\"round\"' mean rounded line caps [_default_];\n\n '1' and '\"butt\"' mean butt line caps;\n\n '2' and '\"square\"' mean square line caps.\n\n 'lheight' The line height multiplier. The height of a line of\n text (used to vertically space multi-line text) is found by\n multiplying the character height both by the current\n character expansion and by the line height multiplier.\n Default value is 1. Used in 'text' and 'strheight'.\n\n 'ljoin' The line join style. This can be specified as an integer\n or string:\n\n '0' and '\"round\"' mean rounded line joins [_default_];\n\n '1' and '\"mitre\"' mean mitred line joins;\n\n '2' and '\"bevel\"' mean bevelled line joins.\n\n 'lmitre' The line mitre limit. This controls when mitred line\n joins are automatically converted into bevelled line joins.\n The value must be larger than 1 and the default is 10. Not\n all devices will honour this setting.\n\n 'lty' The line type. Line types can either be specified as an\n integer (0=blank, 1=solid (default), 2=dashed, 3=dotted,\n 4=dotdash, 5=longdash, 6=twodash) or as one of the character\n strings '\"blank\"', '\"solid\"', '\"dashed\"', '\"dotted\"',\n '\"dotdash\"', '\"longdash\"', or '\"twodash\"', where '\"blank\"'\n uses 'invisible lines' (i.e., does not draw them).\n\n Alternatively, a string of up to 8 characters (from 'c(1:9,\n \"A\":\"F\")') may be given, giving the length of line segments\n which are alternatively drawn and skipped. 
See section 'Line\n Type Specification'.\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled.\n\n 'lwd' The line width, a _positive_ number, defaulting to '1'. The\n interpretation is device-specific, and some devices do not\n implement line widths less than one. (See the help on the\n device for details of the interpretation.)\n\n Functions such as 'lines' and 'segments' accept a vector of\n values which are recycled: in such uses lines corresponding\n to values 'NA' or 'NaN' are omitted. The interpretation of\n '0' is device-specific.\n\n 'mai' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the margin size specified in inches.\n\n 'mar' A numerical vector of the form 'c(bottom, left, top, right)'\n which gives the number of lines of margin to be specified on\n the four sides of the plot. The default is 'c(5, 4, 4, 2) +\n 0.1'.\n\n 'mex' 'mex' is a character size expansion factor which is used to\n describe coordinates in the margins of plots. Note that this\n does not change the font size, rather specifies the size of\n font (as a multiple of 'csi') used to convert between 'mar'\n and 'mai', and between 'oma' and 'omi'.\n\n This starts as '1' when the device is opened, and is reset\n when the layout is changed (alongside resetting 'cex').\n\n 'mfcol, mfrow' A vector of the form 'c(nr, nc)'. 
Subsequent\n figures will be drawn in an 'nr'-by-'nc' array on the device\n by _columns_ ('mfcol'), or _rows_ ('mfrow'), respectively.\n\n In a layout with exactly two rows and columns the base value\n of '\"cex\"' is reduced by a factor of 0.83: if there are three\n or more of either rows or columns, the reduction factor is\n 0.66.\n\n Setting a layout resets the base value of 'cex' and that of\n 'mex' to '1'.\n\n If either of these is queried it will give the current\n layout, so querying cannot tell you the order in which the\n array will be filled.\n\n Consider the alternatives, 'layout' and 'split.screen'.\n\n 'mfg' A numerical vector of the form 'c(i, j)' where 'i' and 'j'\n indicate which figure in an array of figures is to be drawn\n next (if setting) or is being drawn (if enquiring). The\n array must already have been set by 'mfcol' or 'mfrow'.\n\n For compatibility with S, the form 'c(i, j, nr, nc)' is also\n accepted, when 'nr' and 'nc' should be the current number of\n rows and number of columns. Mismatches will be ignored, with\n a warning.\n\n 'mgp' The margin line (in 'mex' units) for the axis title, axis\n labels and axis line. Note that 'mgp[1]' affects 'title'\n whereas 'mgp[2:3]' affect 'axis'. The default is 'c(3, 1,\n 0)'.\n\n 'mkh' The height in inches of symbols to be drawn when the value\n of 'pch' is an integer. _Completely ignored in R_.\n\n 'new' logical, defaulting to 'FALSE'. If set to 'TRUE', the next\n high-level plotting command (actually 'plot.new') should _not\n clean_ the frame before drawing _as if it were on a *_new_*\n device_. 
It is an error (ignored with a warning) to try to\n use 'new = TRUE' on a device that does not currently contain\n a high-level plot.\n\n 'oma' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in lines of text.\n\n 'omd' A vector of the form 'c(x1, x2, y1, y2)' giving the region\n _inside_ outer margins in NDC (= normalized device\n coordinates), i.e., as a fraction (in [0, 1]) of the device\n region.\n\n 'omi' A vector of the form 'c(bottom, left, top, right)' giving\n the size of the outer margins in inches.\n\n 'page' _*R.O.*_; A boolean value indicating whether the next call\n to 'plot.new' is going to start a new page. This value may\n be 'FALSE' if there are multiple figures on the page.\n\n 'pch' Either an integer specifying a symbol or a single character\n to be used as the default in plotting points. See 'points'\n for possible values and their interpretation. Note that only\n integers and single-character strings can be set as a\n graphics parameter (and not 'NA' nor 'NULL').\n\n Some functions such as 'points' accept a vector of values\n which are recycled.\n\n 'pin' The current plot dimensions, '(width, height)', in inches.\n\n 'plt' A vector of the form 'c(x1, x2, y1, y2)' giving the\n coordinates of the plot region as fractions of the current\n figure region.\n\n 'ps' integer; the point size of text (but not symbols). Unlike\n the 'pointsize' argument of most devices, this does not\n change the relationship between 'mar' and 'mai' (nor 'oma'\n and 'omi').\n\n What is meant by 'point size' is device-specific, but most\n devices mean a multiple of 1bp, that is 1/72 of an inch.\n\n 'pty' A character specifying the type of plot region to be used;\n '\"s\"' generates a square plotting region and '\"m\"' generates\n the maximal plotting region.\n\n 'smo' (_Unimplemented_) a value which indicates how smooth circles\n and circular arcs should be.\n\n 'srt' The string rotation in degrees. See the comment about\n 'crt'. 
Only supported by 'text'.\n\n 'tck' The length of tick marks as a fraction of the smaller of the\n width or height of the plotting region. If 'tck >= 0.5' it\n is interpreted as a fraction of the relevant side, so if 'tck\n = 1' grid lines are drawn. The default setting ('tck = NA')\n is to use 'tcl = -0.5'.\n\n 'tcl' The length of tick marks as a fraction of the height of a\n line of text. The default value is '-0.5'; setting 'tcl =\n NA' sets 'tck = -0.01' which is S' default.\n\n 'usr' A vector of the form 'c(x1, x2, y1, y2)' giving the extremes\n of the user coordinates of the plotting region. When a\n logarithmic scale is in use (i.e., 'par(\"xlog\")' is true, see\n below), then the x-limits will be '10 ^ par(\"usr\")[1:2]'.\n Similarly for the y-axis.\n\n 'xaxp' A vector of the form 'c(x1, x2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks when 'par(\"xlog\")' is false. Otherwise, when\n _log_ coordinates are active, the three values have a\n different meaning: For a small range, 'n' is _negative_, and\n the ticks are as in the linear case, otherwise, 'n' is in\n '1:3', specifying a case number, and 'x1' and 'x2' are the\n lowest and highest power of 10 inside the user coordinates,\n '10 ^ par(\"usr\")[1:2]'. (The '\"usr\"' coordinates are\n log10-transformed here!)\n\n n = 1 will produce tick marks at 10^j for integer j,\n\n n = 2 gives marks k 10^j with k in {1,5},\n\n n = 3 gives marks k 10^j with k in {1,2,5}.\n\n See 'axTicks()' for a pure R implementation of this.\n\n This parameter is reset when a user coordinate system is set\n up, for example by starting a new page or by calling\n 'plot.window' or setting 'par(\"usr\")': 'n' is taken from\n 'par(\"lab\")'. 
It affects the default behaviour of subsequent\n calls to 'axis' for sides 1 or 3.\n\n It is only relevant to default numeric axis systems, and not\n for example to dates.\n\n 'xaxs' The style of axis interval calculation to be used for the\n x-axis. Possible values are '\"r\"', '\"i\"', '\"e\"', '\"s\"',\n '\"d\"'. The styles are generally controlled by the range of\n data or 'xlim', if given.\n Style '\"r\"' (regular) first extends the data range by 4\n percent at each end and then finds an axis with pretty labels\n that fits within the extended range.\n Style '\"i\"' (internal) just finds an axis with pretty labels\n that fits within the original data range.\n Style '\"s\"' (standard) finds an axis with pretty labels\n within which the original data range fits.\n Style '\"e\"' (extended) is like style '\"s\"', except that it is\n also ensures that there is room for plotting symbols within\n the bounding box.\n Style '\"d\"' (direct) specifies that the current axis should\n be used on subsequent plots.\n (_Only '\"r\"' and '\"i\"' styles have been implemented in R._)\n\n 'xaxt' A character which specifies the x axis type. Specifying\n '\"n\"' suppresses plotting of the axis. The standard value is\n '\"s\"': for compatibility with S values '\"l\"' and '\"t\"' are\n accepted but are equivalent to '\"s\"': any value other than\n '\"n\"' implies plotting.\n\n 'xlog' A logical value (see 'log' in 'plot.default'). If 'TRUE',\n a logarithmic scale is in use (e.g., after 'plot(*, log =\n \"x\")'). For a new device, it defaults to 'FALSE', i.e.,\n linear scale.\n\n 'xpd' A logical value or 'NA'. If 'FALSE', all plotting is\n clipped to the plot region, if 'TRUE', all plotting is\n clipped to the figure region, and if 'NA', all plotting is\n clipped to the device region. 
See also 'clip'.\n\n 'yaxp' A vector of the form 'c(y1, y2, n)' giving the coordinates\n of the extreme tick marks and the number of intervals between\n tick-marks unless for log coordinates, see 'xaxp' above.\n\n 'yaxs' The style of axis interval calculation to be used for the\n y-axis. See 'xaxs' above.\n\n 'yaxt' A character which specifies the y axis type. Specifying\n '\"n\"' suppresses plotting.\n\n 'ylbias' A positive real value used in the positioning of text in\n the margins by 'axis' and 'mtext'. The default is in\n principle device-specific, but currently '0.2' for all of R's\n own devices. Set this to '0.2' for compatibility with R <\n 2.14.0 on 'x11' and 'windows()' devices.\n\n 'ylog' A logical value; see 'xlog' above.\n\nColor Specification:\n\n Colors can be specified in several different ways. The simplest\n way is with a character string giving the color name (e.g.,\n '\"red\"'). A list of the possible colors can be obtained with the\n function 'colors'. Alternatively, colors can be specified\n directly in terms of their RGB components with a string of the\n form '\"#RRGGBB\"' where each of the pairs 'RR', 'GG', 'BB' consist\n of two hexadecimal digits giving a value in the range '00' to\n 'FF'. Colors can also be specified by giving an index into a\n small table of colors, the 'palette': indices wrap round so with\n the default palette of size 8, '10' is the same as '2'. This\n provides compatibility with S. Index '0' corresponds to the\n background color. Note that the palette (apart from '0' which is\n per-device) is a per-session setting.\n\n Negative integer colours are errors.\n\n Additionally, '\"transparent\"' is _transparent_, useful for filled\n areas (such as the background!), and just invisible for things\n like lines or text. 
In most circumstances (integer) 'NA' is\n equivalent to '\"transparent\"' (but not for 'text' and 'mtext').\n\n Semi-transparent colors are available for use on devices that\n support them.\n\n The functions 'rgb', 'hsv', 'hcl', 'gray' and 'rainbow' provide\n additional ways of generating colors.\n\nLine Type Specification:\n\n Line types can either be specified by giving an index into a small\n built-in table of line types (1 = solid, 2 = dashed, etc, see\n 'lty' above) or directly as the lengths of on/off stretches of\n line. This is done with a string of an even number (up to eight)\n of characters, namely _non-zero_ (hexadecimal) digits which give\n the lengths in consecutive positions in the string. For example,\n the string '\"33\"' specifies three units on followed by three off\n and '\"3313\"' specifies three units on followed by three off\n followed by one on and finally three off. The 'units' here are\n (on most devices) proportional to 'lwd', and with 'lwd = 1' are in\n pixels or points or 1/96 inch.\n\n The five standard dash-dot line types ('lty = 2:6') correspond to\n 'c(\"44\", \"13\", \"1343\", \"73\", \"2262\")'.\n\n Note that 'NA' is not a valid value for 'lty'.\n\nNote:\n\n The effect of restoring all the (settable) graphics parameters as\n in the examples is hard to predict if the device has been resized.\n Several of them are attempting to set the same things in different\n ways, and those last in the alphabet will win. In particular, the\n settings of 'mai', 'mar', 'pin', 'plt' and 'pty' interact, as do\n the outer margin settings, the figure layout and figure region\n size.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot.default' for some high-level plotting parameters; 'colors';\n 'clip'; 'options' for other setup parameters; graphic devices\n 'x11', 'postscript' and setting up device regions by 'layout' and\n 'split.screen'.\n\nExamples:\n\n op <- par(mfrow = c(2, 2), # 2 x 2 pictures on one plot\n pty = \"s\") # square plotting region,\n # independent of device size\n \n ## At end of plotting, reset to previous settings:\n par(op)\n \n ## Alternatively,\n op <- par(no.readonly = TRUE) # the whole list of settable par's.\n ## do lots of plotting and par(.) calls, then reset:\n par(op)\n ## Note this is not in general good practice\n \n par(\"ylog\") # FALSE\n plot(1 : 12, log = \"y\")\n par(\"ylog\") # TRUE\n \n plot(1:2, xaxs = \"i\") # 'inner axis' w/o extra space\n par(c(\"usr\", \"xaxp\"))\n \n ( nr.prof <-\n c(prof.pilots = 16, lawyers = 11, farmers = 10, salesmen = 9, physicians = 9,\n mechanics = 6, policemen = 6, managers = 6, engineers = 5, teachers = 4,\n housewives = 3, students = 3, armed.forces = 1))\n par(las = 3)\n barplot(rbind(nr.prof)) # R 0.63.2: shows alignment problem\n par(las = 0) # reset to default\n \n require(grDevices) # for gray\n ## 'fg' use:\n plot(1:12, type = \"b\", main = \"'fg' : axes, ticks and box in gray\",\n fg = gray(0.7), bty = \"7\" , sub = R.version.string)\n \n ex <- function() {\n old.par <- par(no.readonly = TRUE) # all par settings which\n # could be changed.\n on.exit(par(old.par))\n ## ...\n ## ... do lots of par() settings and plots\n ## ...\n invisible() #-- now, par(old.par) will be executed\n }\n ex()\n \n ## Line types\n showLty <- function(ltys, xoff = 0, ...) 
{\n stopifnot((n <- length(ltys)) >= 1)\n op <- par(mar = rep(.5,4)); on.exit(par(op))\n plot(0:1, 0:1, type = \"n\", axes = FALSE, ann = FALSE)\n y <- (n:1)/(n+1)\n clty <- as.character(ltys)\n mytext <- function(x, y, txt)\n text(x, y, txt, adj = c(0, -.3), cex = 0.8, ...)\n abline(h = y, lty = ltys, ...); mytext(xoff, y, clty)\n y <- y - 1/(3*(n+1))\n abline(h = y, lty = ltys, lwd = 2, ...)\n mytext(1/8+xoff, y, paste(clty,\" lwd = 2\"))\n }\n showLty(c(\"solid\", \"dashed\", \"dotted\", \"dotdash\", \"longdash\", \"twodash\"))\n par(new = TRUE) # the same:\n showLty(c(\"solid\", \"44\", \"13\", \"1343\", \"73\", \"2262\"), xoff = .2, col = 2)\n showLty(c(\"11\", \"22\", \"33\", \"44\", \"12\", \"13\", \"14\", \"21\", \"31\"))\n\n\n## Common parameter options\n\nEight commonly used parameter arguments help improve the readability of a plot:\n\n- `xlab`: specifies the x-axis label\n- `ylab`: specifies the y-axis label\n- `main`: specifies the title of the graph\n- `pch`: specifies the plotting symbol used for points\n- `lty`: specifies the line type\n- `lwd`: specifies the line width\n- `cex`: specifies the size of text and symbols\n- `col`: specifies the colors used in the graph\n\nWe will explore the use of these arguments below.\n\n## Common parameter options\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](images/atrributes.png){width=200%}\n:::\n:::\n\n\n\n## 2. Plot Attributes\n\nPlot attributes are those that map your data to the plot: this is where you specify which variables in the data frame you want to plot.\n\nWe will only look at four types of plots today:\n\n- `hist()` displays a histogram of one variable\n- `plot()` displays an x-y plot of two variables\n- `boxplot()` displays a boxplot\n- `barplot()` displays a barplot\n\n\n## `hist()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?hist\n```\n:::\n\nHistograms\n\nDescription:\n\n The generic function 'hist' computes a histogram of the given data\n values. 
If 'plot = TRUE', the resulting object of class\n '\"histogram\"' is plotted by 'plot.histogram', before it is\n returned.\n\nUsage:\n\n hist(x, ...)\n \n ## Default S3 method:\n hist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n \nArguments:\n\n x: a vector of values for which the histogram is desired.\n\n breaks: one of:\n\n • a vector giving the breakpoints between histogram cells,\n\n • a function to compute the vector of breakpoints,\n\n • a single number giving the number of cells for the\n histogram,\n\n • a character string naming an algorithm to compute the\n number of cells (see 'Details'),\n\n • a function to compute the number of cells.\n\n In the last three cases the number is a suggestion only; as\n the breakpoints will be set to 'pretty' values, the number is\n limited to '1e6' (with a warning if it was larger). If\n 'breaks' is a function, the 'x' vector is supplied to it as\n the only argument (and the number of breaks is only limited\n by the amount of available memory).\n\n freq: logical; if 'TRUE', the histogram graphic is a representation\n of frequencies, the 'counts' component of the result; if\n 'FALSE', probability densities, component 'density', are\n plotted (so that the histogram has a total area of one).\n Defaults to 'TRUE' _if and only if_ 'breaks' are equidistant\n (and 'probability' is not specified).\n\nprobability: an _alias_ for '!freq', for S compatibility.\n\ninclude.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'\n value will be included in the first (or last, for 'right =\n FALSE') bar. 
This will be ignored (with a warning) unless\n 'breaks' is a vector.\n\n right: logical; if 'TRUE', the histogram cells are right-closed\n (left open) intervals.\n\n fuzz: non-negative number, for the case when the data is \"pretty\"\n and some observations 'x[.]' are close but not exactly on a\n 'break'. For counting fuzzy breaks proportional to 'fuzz'\n are used. The default is occasionally suboptimal.\n\n density: the density of shading lines, in lines per inch. The default\n value of 'NULL' means that no shading lines are drawn.\n Non-positive values of 'density' also inhibit the drawing of\n shading lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise).\n\n col: a colour to be used to fill the bars.\n\n border: the color of the border around the bars. The default is to\n use the standard foreground color.\n\nmain, xlab, ylab: main title and axis labels: these arguments to\n 'title()' get \"smart\" defaults here, e.g., the default 'ylab'\n is '\"Frequency\"' iff 'freq' is true.\n\nxlim, ylim: the range of x and y values with sensible defaults. Note\n that 'xlim' is _not_ used to define the histogram (breaks),\n but only for plotting (when 'plot = TRUE').\n\n axes: logical. If 'TRUE' (default), axes are draw if the plot is\n drawn.\n\n plot: logical. If 'TRUE' (default), a histogram is plotted,\n otherwise a list of breaks and counts is returned. In the\n latter case, a warning is used if (typically graphical)\n arguments are specified that only apply to the 'plot = TRUE'\n case.\n\n labels: logical or character string. Additionally draw labels on top\n of bars, if not 'FALSE'; see 'plot.histogram'.\n\n nclass: numeric (integer). For S(-PLUS) compatibility only, 'nclass'\n is equivalent to 'breaks' for a scalar or character argument.\n\nwarn.unused: logical. 
If 'plot = FALSE' and 'warn.unused = TRUE', a\n warning will be issued when graphical parameters are passed\n to 'hist.default()'.\n\n ...: further arguments and graphical parameters passed to\n 'plot.histogram' and thence to 'title' and 'axis' (if 'plot =\n TRUE').\n\nDetails:\n\n The definition of _histogram_ differs by source (with\n country-specific biases). R's default with equi-spaced breaks\n (also the default) is to plot the counts in the cells defined by\n 'breaks'. Thus the height of a rectangle is proportional to the\n number of points falling into the cell, as is the area _provided_\n the breaks are equally-spaced.\n\n The default with non-equi-spaced breaks is to give a plot of area\n one, in which the _area_ of the rectangles is the fraction of the\n data points falling in the cells.\n\n If 'right = TRUE' (default), the histogram cells are intervals of\n the form (a, b], i.e., they include their right-hand endpoint, but\n not their left one, with the exception of the first cell when\n 'include.lowest' is 'TRUE'.\n\n For 'right = FALSE', the intervals are of the form [a, b), and\n 'include.lowest' means '_include highest_'.\n\n A numerical tolerance of 1e-7 times the median bin size (for more\n than four bins, otherwise the median is substituted) is applied\n when counting entries on the edges of bins. This is not included\n in the reported 'breaks' nor in the calculation of 'density'.\n\n The default for 'breaks' is '\"Sturges\"': see 'nclass.Sturges'.\n Other names for which algorithms are supplied are '\"Scott\"' and\n '\"FD\"' / '\"Freedman-Diaconis\"' (with corresponding functions\n 'nclass.scott' and 'nclass.FD'). Case is ignored and partial\n matching is used. 
Alternatively, a function can be supplied which\n will compute the intended number of breaks or the actual\n breakpoints as a function of 'x'.\n\nValue:\n\n an object of class '\"histogram\"' which is a list with components:\n\n breaks: the n+1 cell boundaries (= 'breaks' if that was a vector).\n These are the nominal breaks, not with the boundary fuzz.\n\n counts: n integers; for each cell, the number of 'x[]' inside.\n\n density: values f^(x[i]), as estimated density values. If\n 'all(diff(breaks) == 1)', they are the relative frequencies\n 'counts/n' and in general satisfy sum[i; f^(x[i])\n (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.\n\n mids: the n cell midpoints.\n\n xname: a character string with the actual 'x' argument name.\n\nequidist: logical, indicating if the distances between 'breaks' are all\n the same.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Venables, W. N. and Ripley. B. D. (2002) _Modern Applied\n Statistics with S_. Springer.\n\nSee Also:\n\n 'nclass.Sturges', 'stem', 'density', 'truehist' in package 'MASS'.\n\n Typical plots with vertical bars are _not_ histograms. 
Consider\n 'barplot' or 'plot(*, type = \"h\")' for such bar plots.\n\nExamples:\n\n op <- par(mfrow = c(2, 2))\n hist(islands)\n utils::str(hist(islands, col = \"gray\", labels = TRUE))\n \n hist(sqrt(islands), breaks = 12, col = \"lightblue\", border = \"pink\")\n ##-- For non-equidistant breaks, counts should NOT be graphed unscaled:\n r <- hist(sqrt(islands), breaks = c(4*0:5, 10*3:5, 70, 100, 140),\n col = \"blue1\")\n text(r$mids, r$density, r$counts, adj = c(.5, -.5), col = \"blue3\")\n sapply(r[2:3], sum)\n sum(r$density * diff(r$breaks)) # == 1\n lines(r, lty = 3, border = \"purple\") # -> lines.histogram(*)\n par(op)\n \n require(utils) # for str\n str(hist(islands, breaks = 12, plot = FALSE)) #-> 10 (~= 12) breaks\n str(hist(islands, breaks = c(12,20,36,80,200,1000,17000), plot = FALSE))\n \n hist(islands, breaks = c(12,20,36,80,200,1000,17000), freq = TRUE,\n main = \"WRONG histogram\") # and warning\n \n ## Extreme outliers; the \"FD\" rule would take very large number of 'breaks':\n XXL <- c(1:9, c(-1,1)*1e300)\n hh <- hist(XXL, \"FD\") # did not work in R <= 3.4.1; now gives warning\n ## pretty() determines how many counts are used (platform dependently!):\n length(hh$breaks) ## typically 1 million -- though 1e6 was \"a suggestion only\"\n \n ## R >= 4.2.0: no \"*.5\" labels on y-axis:\n hist(c(2,3,3,5,5,6,6,6,7))\n \n require(stats)\n set.seed(14)\n x <- rchisq(100, df = 4)\n \n ## Histogram with custom x-axis:\n hist(x, xaxt = \"n\")\n axis(1, at = 0:17)\n \n \n ## Comparing data with a model distribution should be done with qqplot()!\n qqplot(x, qchisq(ppoints(x), df = 4)); abline(0, 1, col = 2, lty = 2)\n \n ## if you really insist on using hist() ... 
:\n hist(x, freq = FALSE, ylim = c(0, 0.2))\n curve(dchisq(x, df = 4), col = 2, lty = 2, lwd = 2, add = TRUE)\n\n\n## `hist()` example\n\nReminder of the function signature:\n```\nhist(x, breaks = \"Sturges\",\n freq = NULL, probability = !freq,\n include.lowest = TRUE, right = TRUE, fuzz = 1e-7,\n density = NULL, angle = 45, col = \"lightgray\", border = NULL,\n main = paste(\"Histogram of\" , xname),\n xlim = range(breaks), ylim = NULL,\n xlab = xname, ylab,\n axes = TRUE, plot = TRUE, labels = FALSE,\n nclass = NULL, warn.unused = TRUE, ...)\n```\n\nLet's practice:\n\n::: {.cell}\n\n```{.r .cell-code}\nhist(df$age)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-1.png){width=960}\n:::\n\n```{.r .cell-code}\nhist(\n\tdf$age, \n\tfreq=FALSE, \n\tmain=\"Histogram\", \n\txlab=\"Age (years)\"\n\t)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-12-2.png){width=960}\n:::\n:::\n\n\n\n## `plot()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?plot\n```\n:::\n\nGeneric X-Y Plotting\n\nDescription:\n\n Generic function for plotting of R objects.\n\n For simple scatter plots, 'plot.default' will be used. However,\n there are 'plot' methods for many R objects, including\n 'function's, 'data.frame's, 'density' objects, etc. Use\n 'methods(plot)' and the documentation for these. Most of these\n methods are implemented using traditional graphics (the 'graphics'\n package), but this is not mandatory.\n\n For more details about graphical parameter arguments used by\n traditional graphics, see 'par'.\n\nUsage:\n\n plot(x, y, ...)\n \nArguments:\n\n x: the coordinates of points in the plot. 
Alternatively, a\n single plotting structure, function or _any R object with a\n 'plot' method_ can be provided.\n\n y: the y coordinates of points in the plot, _optional_ if 'x' is\n an appropriate structure.\n\n ...: Arguments to be passed to methods, such as graphical\n parameters (see 'par'). Many methods will accept the\n following arguments:\n\n 'type' what type of plot should be drawn. Possible types are\n\n • '\"p\"' for *p*oints,\n\n • '\"l\"' for *l*ines,\n\n • '\"b\"' for *b*oth,\n\n • '\"c\"' for the lines part alone of '\"b\"',\n\n • '\"o\"' for both '*o*verplotted',\n\n • '\"h\"' for '*h*istogram' like (or 'high-density')\n vertical lines,\n\n • '\"s\"' for stair *s*teps,\n\n • '\"S\"' for other *s*teps, see 'Details' below,\n\n • '\"n\"' for no plotting.\n\n All other 'type's give a warning or an error; using,\n e.g., 'type = \"punkte\"' being equivalent to 'type = \"p\"'\n for S compatibility. Note that some methods, e.g.\n 'plot.factor', do not accept this.\n\n 'main' an overall title for the plot: see 'title'.\n\n 'sub' a subtitle for the plot: see 'title'.\n\n 'xlab' a title for the x axis: see 'title'.\n\n 'ylab' a title for the y axis: see 'title'.\n\n 'asp' the y/x aspect ratio, see 'plot.window'.\n\nDetails:\n\n The two step types differ in their x-y preference: Going from\n (x1,y1) to (x2,y2) with x1 < x2, 'type = \"s\"' moves first\n horizontal, then vertical, whereas 'type = \"S\"' moves the other\n way around.\n\nNote:\n\n The 'plot' generic was moved from the 'graphics' package to the\n 'base' package in R 4.0.0. It is currently re-exported from the\n 'graphics' namespace to allow packages importing it from there to\n continue working, but this may change in future versions of R.\n\nSee Also:\n\n 'plot.default', 'plot.formula' and other methods; 'points',\n 'lines', 'par'. 
For thousands of points, consider using\n 'smoothScatter()' instead of 'plot()'.\n\n For X-Y-Z plotting see 'contour', 'persp' and 'image'.\n\nExamples:\n\n require(stats) # for lowess, rpois, rnorm\n require(graphics) # for plot methods\n plot(cars)\n lines(lowess(cars))\n \n plot(sin, -pi, 2*pi) # see ?plot.function\n \n ## Discrete Distribution Plot:\n plot(table(rpois(100, 5)), type = \"h\", col = \"red\", lwd = 10,\n main = \"rpois(100, lambda = 5)\")\n \n ## Simple quantiles/ECDF, see ecdf() {library(stats)} for a better one:\n plot(x <- sort(rnorm(47)), type = \"s\", main = \"plot(x, type = \\\"s\\\")\")\n points(x, cex = .5, col = \"dark red\")\n\n\n\n## `plot()` example\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nplot(df$age, df$IgG_concentration)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-1.png){width=960}\n:::\n\n```{.r .cell-code}\nplot(\n\tdf$age, \n\tdf$IgG_concentration, \n\ttype=\"p\", \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age (years)\", \n\tylab=\"IgG Concentration (IU/mL)\", \n\tpch=16, \n\tcex=0.9,\n\tcol=\"lightblue\")\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png){width=960}\n:::\n:::\n\n\n\n## `boxplot()` Help File\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?boxplot\n```\n:::\n\nBox Plots\n\nDescription:\n\n Produce box-and-whisker plot(s) of the given (grouped) values.\n\nUsage:\n\n boxplot(x, ...)\n \n ## S3 method for class 'formula'\n boxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n \n ## Default S3 method:\n boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,\n notch = FALSE, outline = TRUE, names, plot = TRUE,\n border = par(\"fg\"), col = \"lightgray\", log = \"\",\n pars = list(boxwex = 0.8, staplewex = 0.5, 
outwex = 0.5),\n ann = !add, horizontal = FALSE, add = FALSE, at = NULL)\n \nArguments:\n\n formula: a formula, such as 'y ~ grp', where 'y' is a numeric vector\n of data values to be split into groups according to the\n grouping variable 'grp' (usually a factor). Note that '~ g1\n + g2' is equivalent to 'g1:g2'.\n\n data: a data.frame (or list) from which the variables in 'formula'\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used for plotting.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA's. The default is to ignore missing values in\n either the response or the group.\n\nxlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty\n default. Can be suppressed by 'ann=FALSE'.\n\n ann: 'logical' indicating if axes should be annotated (by 'xlab'\n and 'ylab').\n\ndrop, sep, lex.order: passed to 'split.default', see there.\n\n x: for specifying data from which the boxplots are to be\n produced. Either a numeric vector, or a single list\n containing such vectors. Additional unnamed arguments specify\n further data as separate vectors (each corresponding to a\n component boxplot). 'NA's are allowed in the data.\n\n ...: For the 'formula' method, named arguments to be passed to the\n default method.\n\n For the default method, unnamed arguments are additional data\n vectors (unless 'x' is a list when they are ignored), and\n named arguments are arguments and graphical parameters to be\n passed to 'bxp' in addition to the ones given by argument\n 'pars' (and override those in 'pars'). Note that 'bxp' may or\n may not make use of graphical parameters it is passed: see\n its documentation.\n\n range: this determines how far the plot whiskers extend out from the\n box. If 'range' is positive, the whiskers extend to the most\n extreme data point which is no more than 'range' times the\n interquartile range from the box. 
A value of zero causes the\n whiskers to extend to the data extremes.\n\n width: a vector giving the relative widths of the boxes making up\n the plot.\n\nvarwidth: if 'varwidth' is 'TRUE', the boxes are drawn with widths\n proportional to the square-roots of the number of\n observations in the groups.\n\n notch: if 'notch' is 'TRUE', a notch is drawn in each side of the\n boxes. If the notches of two plots do not overlap this is\n 'strong evidence' that the two medians differ (Chambers _et\n al_, 1983, p. 62). See 'boxplot.stats' for the calculations\n used.\n\n outline: if 'outline' is not true, the outliers are not drawn (as\n points whereas S+ uses lines).\n\n names: group labels which will be printed under each boxplot. Can\n be a character vector or an expression (see plotmath).\n\n boxwex: a scale factor to be applied to all boxes. When there are\n only a few groups, the appearance of the plot can be improved\n by making the boxes narrower.\n\nstaplewex: staple line width expansion, proportional to box width.\n\n outwex: outlier line width expansion, proportional to box width.\n\n plot: if 'TRUE' (the default) then a boxplot is produced. If not,\n the summaries which the boxplots are based on are returned.\n\n border: an optional vector of colors for the outlines of the\n boxplots. The values in 'border' are recycled if the length\n of 'border' is less than the number of plots.\n\n col: if 'col' is non-null it is assumed to contain colors to be\n used to colour the bodies of the box plots. 
By default they\n are in the background colour.\n\n log: character indicating if x or y or both coordinates should be\n plotted in log scale.\n\n pars: a list of (potentially many) more graphical parameters, e.g.,\n 'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is\n true); for details, see there.\n\nhorizontal: logical indicating if the boxplots should be horizontal;\n default 'FALSE' means vertical boxes.\n\n add: logical, if true _add_ boxplot to current plot.\n\n at: numeric vector giving the locations where the boxplots should\n be drawn, particularly when 'add = TRUE'; defaults to '1:n'\n where 'n' is the number of boxes.\n\nDetails:\n\n The generic function 'boxplot' currently has a default method\n ('boxplot.default') and a formula interface ('boxplot.formula').\n\n If multiple groups are supplied either as multiple arguments or\n via a formula, parallel boxplots will be plotted, in the order of\n the arguments or the order of the levels of the factor (see\n 'factor').\n\n Missing values are ignored when forming boxplots.\n\nValue:\n\n List with the following components:\n\n stats: a matrix, each column contains the extreme of the lower\n whisker, the lower hinge, the median, the upper hinge and the\n extreme of the upper whisker for one group/plot. If all the\n inputs have the same class attribute, so will this component.\n\n n: a vector with the number of (non-'NA') observations in each\n group.\n\n conf: a matrix where each column contains the lower and upper\n extremes of the notch.\n\n out: the values of any data points which lie beyond the extremes\n of the whiskers.\n\n group: a vector of the same length as 'out' whose elements indicate\n to which group the outlier belongs.\n\n names: a vector of names for the groups.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). _The New\n S Language_. Wadsworth & Brooks/Cole.\n\n Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.\n (1983). 
_Graphical Methods for Data Analysis_. Wadsworth &\n Brooks/Cole.\n\n Murrell, P. (2005). _R Graphics_. Chapman & Hall/CRC Press.\n\n See also 'boxplot.stats'.\n\nSee Also:\n\n 'boxplot.stats' which does the computation, 'bxp' for the plotting\n and more examples; and 'stripchart' for an alternative (with small\n data sets).\n\nExamples:\n\n ## boxplot on a formula:\n boxplot(count ~ spray, data = InsectSprays, col = \"lightgray\")\n # *add* notches (somewhat funny here <--> warning \"notches .. outside hinges\"):\n boxplot(count ~ spray, data = InsectSprays,\n notch = TRUE, add = TRUE, col = \"blue\")\n \n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"y\")\n ## horizontal=TRUE, switching y <--> x :\n boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\",\n log = \"x\", horizontal=TRUE)\n \n rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = \"bisque\")\n title(\"Comparing boxplot()s and non-robust mean +/- SD\")\n mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)\n sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)\n xi <- 0.3 + seq(rb$n)\n points(xi, mn.t, col = \"orange\", pch = 18)\n arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,\n code = 3, col = \"pink\", angle = 75, length = .1)\n \n ## boxplot on a matrix:\n mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),\n `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))\n boxplot(mat) # directly, calling boxplot.matrix()\n \n ## boxplot on a data frame:\n df. 
<- as.data.frame(mat)\n par(las = 1) # all axis labels horizontal\n boxplot(df., main = \"boxplot(*, horizontal = TRUE)\", horizontal = TRUE)\n \n ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :\n boxplot(len ~ dose, data = ToothGrowth,\n boxwex = 0.25, at = 1:3 - 0.2,\n subset = supp == \"VC\", col = \"yellow\",\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\",\n ylab = \"tooth length\",\n xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = \"i\")\n boxplot(len ~ dose, data = ToothGrowth, add = TRUE,\n boxwex = 0.25, at = 1:3 + 0.2,\n subset = supp == \"OJ\", col = \"orange\")\n legend(2, 9, c(\"Ascorbic acid\", \"Orange juice\"),\n fill = c(\"yellow\", \"orange\"))\n \n ## With less effort (slightly different) using factor *interaction*:\n boxplot(len ~ dose:supp, data = ToothGrowth,\n boxwex = 0.5, col = c(\"orange\", \"yellow\"),\n main = \"Guinea Pigs' Tooth Growth\",\n xlab = \"Vitamin C dose mg\", ylab = \"tooth length\",\n sep = \":\", lex.order = TRUE, ylim = c(0, 35), yaxs = \"i\")\n \n ## more examples in help(bxp)\n\n\n\n## `boxplot()` example\n\nReminder function signature\n```\nboxplot(formula, data = NULL, ..., subset, na.action = NULL,\n xlab = mklab(y_var = horizontal),\n ylab = mklab(y_var =!horizontal),\n add = FALSE, ann = !add, horizontal = FALSE,\n drop = FALSE, sep = \".\", lex.order = FALSE)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nboxplot(IgG_concentration~age_group, data=df)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-18-1.png){width=960}\n:::\n\n```{.r .cell-code}\nboxplot(\n\tlog(df$IgG_concentration)~df$age_group, \n\tmain=\"Age by IgG Concentrations\", \n\txlab=\"Age Group (years)\", \n\tylab=\"log IgG Concentration (mIU/mL)\", \n\tnames=c(\"1-5\",\"6-10\", \"11-15\"), \n\tvarwidth=T\n\t)\n```\n\n::: 
{.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-18-2.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` Help File\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?barplot\n```\n:::\n\nBar Plots\n\nDescription:\n\n Creates a bar plot with vertical or horizontal bars.\n\nUsage:\n\n barplot(height, ...)\n \n ## Default S3 method:\n barplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n \n ## S3 method for class 'formula'\n barplot(formula, data, subset, na.action,\n horiz = FALSE, xlab = NULL, ylab = NULL, ...)\n \nArguments:\n\n height: either a vector or matrix of values describing the bars which\n make up the plot. If 'height' is a vector, the plot consists\n of a sequence of rectangular bars with heights given by the\n values in the vector. If 'height' is a matrix and 'beside'\n is 'FALSE' then each bar of the plot corresponds to a column\n of 'height', with the values in the column giving the heights\n of stacked sub-bars making up the bar. If 'height' is a\n matrix and 'beside' is 'TRUE', then the values in each column\n are juxtaposed rather than stacked.\n\n width: optional vector of bar widths. Re-cycled to length the number\n of bars drawn. Specifying a single value will have no\n visible effect unless 'xlim' is specified.\n\n space: the amount of space (as a fraction of the average bar width)\n left before each bar. May be given as a single number or one\n number per bar. 
If 'height' is a matrix and 'beside' is\n 'TRUE', 'space' may be specified by two numbers, where the\n first is the space between bars in the same group, and the\n second the space between the groups. If not given\n explicitly, it defaults to 'c(0,1)' if 'height' is a matrix\n and 'beside' is 'TRUE', and to 0.2 otherwise.\n\nnames.arg: a vector of names to be plotted below each bar or group of\n bars. If this argument is omitted, then the names are taken\n from the 'names' attribute of 'height' if this is a vector,\n or the column names if it is a matrix.\n\nlegend.text: a vector of text used to construct a legend for the plot,\n or a logical indicating whether a legend should be included.\n This is only useful when 'height' is a matrix. In that case\n given legend labels should correspond to the rows of\n 'height'; if 'legend.text' is true, the row names of 'height'\n will be used as labels if they are non-null.\n\n beside: a logical value. If 'FALSE', the columns of 'height' are\n portrayed as stacked bars, and if 'TRUE' the columns are\n portrayed as juxtaposed bars.\n\n horiz: a logical value. If 'FALSE', the bars are drawn vertically\n with the first bar to the left. If 'TRUE', the bars are\n drawn horizontally with the first at the bottom.\n\n density: a vector giving the density of shading lines, in lines per\n inch, for the bars or bar components. The default value of\n 'NULL' means that no shading lines are drawn. Non-positive\n values of 'density' also inhibit the drawing of shading\n lines.\n\n angle: the slope of shading lines, given as an angle in degrees\n (counter-clockwise), for the bars or bar components.\n\n col: a vector of colors for the bars or bar components. By\n default, '\"grey\"' is used if 'height' is a vector, and a\n gamma-corrected grey palette if 'height' is a matrix; see\n 'grey.colors'.\n\n border: the color to be used for the border of the bars. Use 'border\n = NA' to omit borders. 
If there are shading lines, 'border =\n TRUE' means use the same colour for the border as for the\n shading lines.\n\nmain,sub: main title and subtitle for the plot.\n\n xlab: a label for the x axis.\n\n ylab: a label for the y axis.\n\n xlim: limits for the x axis.\n\n ylim: limits for the y axis.\n\n xpd: logical. Should bars be allowed to go outside region?\n\n log: string specifying if axis scales should be logarithmic; see\n 'plot.default'.\n\n axes: logical. If 'TRUE', a vertical (or horizontal, if 'horiz' is\n true) axis is drawn.\n\naxisnames: logical. If 'TRUE', and if there are 'names.arg' (see\n above), the other axis is drawn (with 'lty = 0') and labeled.\n\ncex.axis: expansion factor for numeric axis labels (see 'par('cex')').\n\ncex.names: expansion factor for axis names (bar labels).\n\n inside: logical. If 'TRUE', the lines which divide adjacent\n (non-stacked!) bars will be drawn. Only applies when 'space\n = 0' (which it partly is when 'beside = TRUE').\n\n plot: logical. If 'FALSE', nothing is plotted.\n\naxis.lty: the graphics parameter 'lty' (see 'par('lty')') applied to\n the axis and tick marks of the categorical (default\n horizontal) axis. Note that by default the axis is\n suppressed.\n\n offset: a vector indicating how much the bars should be shifted\n relative to the x axis.\n\n add: logical specifying if bars should be added to an already\n existing plot; defaults to 'FALSE'.\n\n ann: logical specifying if the default annotation ('main', 'sub',\n 'xlab', 'ylab') should appear on the plot, see 'title'.\n\nargs.legend: list of additional arguments to pass to 'legend()'; names\n of the list are used as argument names. Only used if\n 'legend.text' is supplied.\n\n formula: a formula where the 'y' variables are numeric data to plot\n against the categorical 'x' variables. 
The formula can have\n one of three forms:\n\n y ~ x\n y ~ x1 + x2\n cbind(y1, y2) ~ x\n \n (see the examples).\n\n data: a data frame (or list) from which the variables in formula\n should be taken.\n\n subset: an optional vector specifying a subset of observations to be\n used.\n\nna.action: a function which indicates what should happen when the data\n contain 'NA' values. The default is to ignore missing values\n in the given variables.\n\n ...: arguments to be passed to/from other methods. For the\n default method these can include further arguments (such as\n 'axes', 'asp' and 'main') and graphical parameters (see\n 'par') which are passed to 'plot.window()', 'title()' and\n 'axis'.\n\nValue:\n\n A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',\n giving the coordinates of _all_ the bar midpoints drawn, useful\n for adding to the graph.\n\n If 'beside' is true, use 'colMeans(mp)' for the midpoints of each\n _group_ of bars, see example.\n\nAuthor(s):\n\n R Core, with a contribution by Arni Magnusson.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot(..., type = \"h\")', 'dotchart'; 'hist' for bars of a\n _continuous_ variable. 
'mosaicplot()', more sophisticated to\n visualize _several_ categorical variables.\n\nExamples:\n\n # Formula method\n barplot(GNP ~ Year, data = longley)\n barplot(cbind(Employed, Unemployed) ~ Year, data = longley)\n \n ## 3rd form of formula - 2 categories :\n op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))\n summary(d.Titanic <- as.data.frame(Titanic))\n barplot(Freq ~ Class + Survived, data = d.Titanic,\n subset = Age == \"Adult\" & Sex == \"Male\",\n main = \"barplot(Freq ~ Class + Survived, *)\", ylab = \"# {passengers}\", legend.text = TRUE)\n # Corresponding table :\n (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age==\"Adult\"))\n # Alternatively, a mosaic plot :\n mosaicplot(xt[,,\"Male\"], main = \"mosaicplot(Freq ~ Class + Survived, *)\", color=TRUE)\n par(op)\n \n \n # Default method\n require(grDevices) # for colours\n tN <- table(Ni <- stats::rpois(100, lambda = 5))\n r <- barplot(tN, col = rainbow(20))\n #- type = \"h\" plotting *is* 'bar'plot\n lines(r, tN, type = \"h\", col = \"red\", lwd = 2)\n \n barplot(tN, space = 1.5, axisnames = FALSE,\n sub = \"barplot(..., space= 1.5, axisnames = FALSE)\")\n \n barplot(VADeaths, plot = FALSE)\n barplot(VADeaths, plot = FALSE, beside = TRUE)\n \n mp <- barplot(VADeaths) # default\n tot <- colMeans(VADeaths)\n text(mp, tot + 3, format(tot), xpd = TRUE, col = \"blue\")\n barplot(VADeaths, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\", \"lightcyan\",\n \"lavender\", \"cornsilk\"),\n legend.text = rownames(VADeaths), ylim = c(0, 100))\n title(main = \"Death Rates in Virginia\", font.main = 4)\n \n hh <- t(VADeaths)[, 5:1]\n mybarcol <- \"gray20\"\n mp <- barplot(hh, beside = TRUE,\n col = c(\"lightblue\", \"mistyrose\",\n \"lightcyan\", \"lavender\"),\n legend.text = colnames(VADeaths), ylim = c(0,100),\n main = \"Death Rates in Virginia\", font.main = 4,\n sub = \"Faked upper 2*sigma error bars\", col.sub = mybarcol,\n cex.names = 1.5)\n segments(mp, hh, mp, hh + 
2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)\n stopifnot(dim(mp) == dim(hh)) # corresponding matrices\n mtext(side = 1, at = colMeans(mp), line = -2,\n text = paste(\"Mean\", formatC(colMeans(hh))), col = \"red\")\n \n # Bar shading example\n barplot(VADeaths, angle = 15+10*1:5, density = 20, col = \"black\",\n legend.text = rownames(VADeaths))\n title(main = list(\"Death Rates in Virginia\", font = 4))\n \n # Border color\n barplot(VADeaths, border = \"dark blue\") \n \n \n # Log scales (not much sense here)\n barplot(tN, col = heat.colors(12), log = \"y\")\n barplot(tN, col = gray.colors(20), log = \"xy\")\n \n # Legend location\n barplot(height = cbind(x = c(465, 91) / 465 * 100,\n y = c(840, 200) / 840 * 100,\n z = c(37, 17) / 37 * 100),\n beside = FALSE,\n width = c(465, 840, 37),\n col = c(1, 2),\n legend.text = c(\"A\", \"B\"),\n args.legend = list(x = \"topleft\"))\n\n\n\n## `barplot()` example\n\nThe function takes a lot of arguments to control the way our data is plotted. 
\n\nReminder function signature\n```\nbarplot(height, width = 1, space = NULL,\n names.arg = NULL, legend.text = NULL, beside = FALSE,\n horiz = FALSE, density = NULL, angle = 45,\n col = NULL, border = par(\"fg\"),\n main = NULL, sub = NULL, xlab = NULL, ylab = NULL,\n xlim = NULL, ylim = NULL, xpd = TRUE, log = \"\",\n axes = TRUE, axisnames = TRUE,\n cex.axis = par(\"cex.axis\"), cex.names = par(\"cex.axis\"),\n inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,\n add = FALSE, ann = !add && par(\"ann\"), args.legend = NULL, ...)\n```\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nbarplot(freq)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-21-1.png){width=960}\n:::\n\n```{.r .cell-code}\nprop.cell.percentages <- prop.table(freq)\nbarplot(prop.cell.percentages)\n```\n\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-21-2.png){width=960}\n:::\n:::\n\n\n## 3. Legend!\n\nIn Base R plotting the legend is not automatically generated. This is nice because it gives you a huge amount of control over how your legend looks, but it is also easy to mislabel your colors, symbols, line types, etc. So, basically be careful.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?legend\n```\n:::\n\n::: {.cell}\n::: {.cell-output .cell-output-stdout}\n```\nAdd Legends to Plots\n\nDescription:\n\n This function can be used to add legends to plots. 
Note that a\n call to the function 'locator(1)' can be used in place of the 'x'\n and 'y' arguments.\n\nUsage:\n\n legend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n \nArguments:\n\n x, y: the x and y co-ordinates to be used to position the legend.\n They can be specified by keyword or in any way which is\n accepted by 'xy.coords': See 'Details'.\n\n legend: a character or expression vector of length >= 1 to appear in\n the legend. Other objects will be coerced by\n 'as.graphicsAnnot'.\n\n fill: if specified, this argument will cause boxes filled with the\n specified colors (or shaded in the specified colors) to\n appear beside the legend text.\n\n col: the color of points or lines appearing in the legend.\n\n border: the border color for the boxes (used only if 'fill' is\n specified).\n\nlty, lwd: the line types and widths for lines appearing in the legend.\n One of these two _must_ be specified for line drawing.\n\n pch: the plotting symbols appearing in the legend, as numeric\n vector or a vector of 1-character strings (see 'points').\n Unlike 'points', this can all be specified as a single\n multi-character string. _Must_ be specified for symbol\n drawing.\n\n angle: angle of shading lines.\n\n density: the density of shading lines, if numeric and positive. If\n 'NULL' or negative or 'NA' color filling is assumed.\n\n bty: the type of box to be drawn around the legend. 
The allowed\n values are '\"o\"' (the default) and '\"n\"'.\n\n bg: the background color for the legend box. (Note that this is\n only used if 'bty != \"n\"'.)\n\nbox.lty, box.lwd, box.col: the line type, width and color for the\n legend box (if 'bty = \"o\"').\n\n pt.bg: the background color for the 'points', corresponding to its\n argument 'bg'.\n\n cex: character expansion factor *relative* to current\n 'par(\"cex\")'. Used for text, and provides the default for\n 'pt.cex'.\n\n pt.cex: expansion factor(s) for the points.\n\n pt.lwd: line width for the points, defaults to the one for lines, or\n if that is not set, to 'par(\"lwd\")'.\n\n xjust: how the legend is to be justified relative to the legend x\n location. A value of 0 means left justified, 0.5 means\n centered and 1 means right justified.\n\n yjust: the same as 'xjust' for the legend y location.\n\nx.intersp: character interspacing factor for horizontal (x) spacing\n between symbol and legend text.\n\ny.intersp: vertical (y) distances (in lines of text shared above/below\n each legend entry). A vector with one element for each row\n of the legend can be used.\n\n adj: numeric of length 1 or 2; the string adjustment for legend\n text. Useful for y-adjustment when 'labels' are plotmath\n expressions.\n\ntext.width: the width of the legend text in x ('\"user\"') coordinates.\n (Should be positive even for a reversed x axis.) Can be a\n single positive numeric value (same width for each column of\n the legend), a vector (one element for each column of the\n legend), 'NULL' (default) for computing a proper maximum\n value of 'strwidth(legend)'), or 'NA' for computing a proper\n column wise maximum value of 'strwidth(legend)').\n\ntext.col: the color used for the legend text.\n\ntext.font: the font used for the legend text, see 'text'.\n\n merge: logical; if 'TRUE', merge points and lines but not filled\n boxes. 
Defaults to 'TRUE' if there are points and lines.\n\n trace: logical; if 'TRUE', shows how 'legend' does all its magical\n computations.\n\n plot: logical. If 'FALSE', nothing is plotted but the sizes are\n returned.\n\n ncol: the number of columns in which to set the legend items\n (default is 1, a vertical legend).\n\n horiz: logical; if 'TRUE', set the legend horizontally rather than\n vertically (specifying 'horiz' overrides the 'ncol'\n specification).\n\n title: a character string or length-one expression giving a title to\n be placed at the top of the legend. Other objects will be\n coerced by 'as.graphicsAnnot'.\n\n inset: inset distance(s) from the margins as a fraction of the plot\n region when legend is placed by keyword.\n\n xpd: if supplied, a value of the graphical parameter 'xpd' to be\n used while the legend is being drawn.\n\ntitle.col: color for 'title', defaults to 'text.col[1]'.\n\ntitle.adj: horizontal adjustment for 'title': see the help for\n 'par(\"adj\")'.\n\ntitle.cex: expansion factor(s) for the title, defaults to 'cex[1]'.\n\ntitle.font: the font used for the legend title, defaults to\n 'text.font[1]', see 'text'.\n\n seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd'\n (in units of character widths).\n\nDetails:\n\n Arguments 'x', 'y', 'legend' are interpreted in a non-standard way\n to allow the coordinates to be specified _via_ one or two\n arguments. If 'legend' is missing and 'y' is not numeric, it is\n assumed that the second argument is intended to be 'legend' and\n that the first argument specifies the coordinates.\n\n The coordinates can be specified in any way which is accepted by\n 'xy.coords'. If this gives the coordinates of one point, it is\n used as the top-left coordinate of the rectangle containing the\n legend. 
If it gives the coordinates of two points, these specify\n opposite corners of the rectangle (either pair of corners, in any\n order).\n\n The location may also be specified by setting 'x' to a single\n keyword from the list '\"bottomright\"', '\"bottom\"', '\"bottomleft\"',\n '\"left\"', '\"topleft\"', '\"top\"', '\"topright\"', '\"right\"' and\n '\"center\"'. This places the legend on the inside of the plot frame\n at the given location. Partial argument matching is used. The\n optional 'inset' argument specifies how far the legend is inset\n from the plot margins. If a single value is given, it is used for\n both margins; if two values are given, the first is used for 'x'-\n distance, the second for 'y'-distance.\n\n Attribute arguments such as 'col', 'pch', 'lty', etc, are recycled\n if necessary: 'merge' is not. Set entries of 'lty' to '0' or set\n entries of 'lwd' to 'NA' to suppress lines in corresponding legend\n entries; set 'pch' values to 'NA' to suppress points.\n\n Points are drawn _after_ lines in order that they can cover the\n line with their background color 'pt.bg', if applicable.\n\n See the examples for how to right-justify labels.\n\n Since they are not used for Unicode code points, values '-31:-1'\n are silently omitted, as are 'NA' and '\"\"' values.\n\nValue:\n\n A list with list components\n\n rect: a list with components\n\n 'w', 'h' positive numbers giving *w*idth and *h*eight of the\n legend's box.\n\n 'left', 'top' x and y coordinates of upper left corner of the\n box.\n\n text: a list with components\n\n 'x, y' numeric vectors of length 'length(legend)', giving the\n x and y coordinates of the legend's text(s).\n\n returned invisibly.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. Wadsworth & Brooks/Cole.\n\n Murrell, P. (2005) _R Graphics_. 
Chapman & Hall/CRC Press.\n\nSee Also:\n\n 'plot', 'barplot' which uses 'legend()', and 'text' for more\n examples of math expressions.\n\nExamples:\n\n ## Run the example in '?matplot' or the following:\n leg.txt <- c(\"Setosa Petals\", \"Setosa Sepals\",\n \"Versicolor Petals\", \"Versicolor Sepals\")\n y.leg <- c(4.5, 3, 2.1, 1.4, .7)\n cexv <- c(1.2, 1, 4/5, 2/3, 1/2)\n matplot(c(1, 8), c(0, 4.5), type = \"n\", xlab = \"Length\", ylab = \"Width\",\n main = \"Petal and Sepal Dimensions in Iris Blossoms\")\n for (i in seq(cexv)) {\n text (1, y.leg[i] - 0.1, paste(\"cex=\", formatC(cexv[i])), cex = 0.8, adj = 0)\n legend(3, y.leg[i], leg.txt, pch = \"sSvV\", col = c(1, 3), cex = cexv[i])\n }\n ## cex *vector* [in R <= 3.5.1 has 'if(xc < 0)' w/ length(xc) == 2]\n legend(\"right\", leg.txt, pch = \"sSvV\", col = c(1, 3),\n cex = 1+(-1:2)/8, trace = TRUE)# trace: show computed lengths & coords\n \n ## 'merge = TRUE' for merging lines & points:\n x <- seq(-pi, pi, length.out = 65)\n for(reverse in c(FALSE, TRUE)) { ## normal *and* reverse axes:\n F <- if(reverse) rev else identity\n plot(x, sin(x), type = \"l\", col = 3, lty = 2,\n xlim = F(range(x)), ylim = F(c(-1.2, 1.8)))\n points(x, cos(x), pch = 3, col = 4)\n lines(x, tan(x), type = \"b\", lty = 1, pch = 4, col = 6)\n title(\"legend('top', lty = c(2, -1, 1), pch = c(NA, 3, 4), merge = TRUE)\",\n cex.main = 1.1)\n legend(\"top\", c(\"sin\", \"cos\", \"tan\"), col = c(3, 4, 6),\n text.col = \"green4\", lty = c(2, -1, 1), pch = c(NA, 3, 4),\n merge = TRUE, bg = \"gray90\", trace=TRUE)\n \n } # for(..)\n \n ## right-justifying a set of labels: thanks to Uwe Ligges\n x <- 1:5; y1 <- 1/x; y2 <- 2/x\n plot(rep(x, 2), c(y1, y2), type = \"n\", xlab = \"x\", ylab = \"y\")\n lines(x, y1); lines(x, y2, lty = 2)\n temp <- legend(\"topright\", legend = c(\" \", \" \"),\n text.width = strwidth(\"1,000,000\"),\n lty = 1:2, xjust = 1, yjust = 1, inset = 1/10,\n title = \"Line Types\", title.cex = 0.5, trace=TRUE)\n 
text(temp$rect$left + temp$rect$w, temp$text$y,\n c(\"1,000\", \"1,000,000\"), pos = 2)\n \n \n ##--- log scaled Examples ------------------------------\n leg.txt <- c(\"a one\", \"a two\")\n \n par(mfrow = c(2, 2))\n for(ll in c(\"\",\"x\",\"y\",\"xy\")) {\n plot(2:10, log = ll, main = paste0(\"log = '\", ll, \"'\"))\n abline(1, 1)\n lines(2:3, 3:4, col = 2)\n points(2, 2, col = 3)\n rect(2, 3, 3, 2, col = 4)\n text(c(3,3), 2:3, c(\"rect(2,3,3,2, col=4)\",\n \"text(c(3,3),2:3,\\\"c(rect(...)\\\")\"), adj = c(0, 0.3))\n legend(list(x = 2,y = 8), legend = leg.txt, col = 2:3, pch = 1:2,\n lty = 1) #, trace = TRUE)\n } # ^^^^^^^ to force lines -> automatic merge=TRUE\n par(mfrow = c(1,1))\n \n ##-- Math expressions: ------------------------------\n x <- seq(-pi, pi, length.out = 65)\n plot(x, sin(x), type = \"l\", col = 2, xlab = expression(phi),\n ylab = expression(f(phi)))\n abline(h = -1:1, v = pi/2*(-6:6), col = \"gray90\")\n lines(x, cos(x), col = 3, lty = 2)\n ex.cs1 <- expression(plain(sin) * phi, paste(\"cos\", phi)) # 2 ways\n utils::str(legend(-3, .9, ex.cs1, lty = 1:2, plot = FALSE,\n adj = c(0, 0.6))) # adj y !\n legend(-3, 0.9, ex.cs1, lty = 1:2, col = 2:3, adj = c(0, 0.6))\n \n require(stats)\n x <- rexp(100, rate = .5)\n hist(x, main = \"Mean and Median of a Skewed Distribution\")\n abline(v = mean(x), col = 2, lty = 2, lwd = 2)\n abline(v = median(x), col = 3, lty = 3, lwd = 2)\n ex12 <- expression(bar(x) == sum(over(x[i], n), i == 1, n),\n hat(x) == median(x[i], i == 1, n))\n utils::str(legend(4.1, 30, ex12, col = 2:3, lty = 2:3, lwd = 2))\n \n ## 'Filled' boxes -- see also example(barplot) which may call legend(*, fill=)\n barplot(VADeaths)\n legend(\"topright\", rownames(VADeaths), fill = gray.colors(nrow(VADeaths)))\n \n ## Using 'ncol'\n x <- 0:64/64\n for(R in c(identity, rev)) { # normal *and* reverse x-axis works fine:\n xl <- R(range(x)); x1 <- xl[1]\n matplot(x, outer(x, 1:7, function(x, k) sin(k * pi * x)), xlim=xl,\n type = \"o\", col = 
1:7, ylim = c(-1, 1.5), pch = \"*\")\n op <- par(bg = \"antiquewhite1\")\n legend(x1, 1.5, paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", ncol = 4, cex = 0.8)\n legend(\"bottomright\", paste(\"sin(\", 1:7, \"pi * x)\"), col = 1:7, lty = 1:7,\n pch = \"*\", cex = 0.8)\n legend(x1, -.1, paste(\"sin(\", 1:4, \"pi * x)\"), col = 1:4, lty = 1:4,\n ncol = 2, cex = 0.8)\n legend(x1, -.4, paste(\"sin(\", 5:7, \"pi * x)\"), col = 4:6, pch = 24,\n ncol = 2, cex = 1.5, lwd = 2, pt.bg = \"pink\", pt.cex = 1:3)\n par(op)\n \n } # for(..)\n \n ## point covering line :\n y <- sin(3*pi*x)\n plot(x, y, type = \"l\", col = \"blue\",\n main = \"points with bg & legend(*, pt.bg)\")\n points(x, y, pch = 21, bg = \"white\")\n legend(.4,1, \"sin(c x)\", pch = 21, pt.bg = \"white\", lty = 1, col = \"blue\")\n \n ## legends with titles at different locations\n plot(x, y, type = \"n\")\n legend(\"bottomright\", \"(x,y)\", pch=1, title= \"bottomright\")\n legend(\"bottom\", \"(x,y)\", pch=1, title= \"bottom\")\n legend(\"bottomleft\", \"(x,y)\", pch=1, title= \"bottomleft\")\n legend(\"left\", \"(x,y)\", pch=1, title= \"left\")\n legend(\"topleft\", \"(x,y)\", pch=1, title= \"topleft, inset = .05\", inset = .05)\n legend(\"top\", \"(x,y)\", pch=1, title= \"top\")\n legend(\"topright\", \"(x,y)\", pch=1, title= \"topright, inset = .02\",inset = .02)\n legend(\"right\", \"(x,y)\", pch=1, title= \"right\")\n legend(\"center\", \"(x,y)\", pch=1, title= \"center\")\n \n # using text.font (and text.col):\n op <- par(mfrow = c(2, 2), mar = rep(2.1, 4))\n c6 <- terrain.colors(10)[1:6]\n for(i in 1:4) {\n plot(1, type = \"n\", axes = FALSE, ann = FALSE); title(paste(\"text.font =\",i))\n legend(\"top\", legend = LETTERS[1:6], col = c6,\n ncol = 2, cex = 2, lwd = 3, text.font = i, text.col = c6)\n }\n par(op)\n \n # using text.width for several columns\n plot(1, type=\"n\")\n legend(\"topleft\", c(\"This legend\", \"has\", \"equally sized\", \"columns.\"),\n pch = 1:4, ncol = 
4)\n legend(\"bottomleft\", c(\"This legend\", \"has\", \"optimally sized\", \"columns.\"),\n pch = 1:4, ncol = 4, text.width = NA)\n legend(\"right\", letters[1:4], pch = 1:4, ncol = 4,\n text.width = 1:4 / 50)\n```\n:::\n:::\n\n\n\n\n## Add legend to the plot\n\nReminder function signature\n```\nlegend(x, y = NULL, legend, fill = NULL, col = par(\"col\"),\n border = \"black\", lty, lwd, pch,\n angle = 45, density = NULL, bty = \"o\", bg = par(\"bg\"),\n box.lwd = par(\"lwd\"), box.lty = par(\"lty\"), box.col = par(\"fg\"),\n pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,\n xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,\n adj = c(0, 0.5), text.width = NULL, text.col = par(\"col\"),\n text.font = NULL, merge = do.lines && has.pch, trace = FALSE,\n plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,\n inset = 0, xpd, title.col = text.col[1], title.adj = 0.5,\n title.cex = cex[1], title.font = text.font[1],\n seg.len = 2)\n```\n\nLet's practice\n\n::: {.cell}\n\n```{.r .cell-code}\nbarplot(prop.cell.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,0.5), main=\"Seropositivity by Age Group\")\nlegend(x=2.5, y=0.5,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n## Add legend to the plot\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png){width=960}\n:::\n:::\n\n\n\n## `barplot()` example\n\nGetting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). 
Also, the age groups need more meaningful names.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfreq <- table(df$seropos, df$age_group)\nprop.column.percentages <- prop.table(freq, margin=2)\ncolnames(prop.column.percentages) <- c(\"1-5 yo\", \"6-10 yo\", \"11-15 yo\")\n\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(x=2.8, y=1.35,\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n## `barplot()` example\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png){width=960}\n:::\n:::\n\n\n\n\n## `barplot()` example\n\nNow, let's look at seropositivity by two individual-level characteristics in the same plot. \n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\npar(mfrow = c(1,2))\nbarplot(prop.column.percentages, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Age Group\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\",\n\t\t\t fill=c(\"darkblue\",\"red\"), \n\t\t\t legend = c(\"seronegative\", \"seropositive\"))\n\nbarplot(prop.column.percentages2, col=c(\"darkblue\",\"red\"), ylim=c(0,1.35), main=\"Seropositivity by Residence\")\naxis(2, at = c(0.2, 0.4, 0.6, 0.8,1))\nlegend(\"topright\", fill=c(\"darkblue\",\"red\"), legend = c(\"seronegative\", \"seropositive\"))\n```\n:::\n\n\n\n## `barplot()` example\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-30-1.png){width=960}\n:::\n:::\n\n\n\n## Summary\n\n-\t\t\n\n## Acknowledgements\n\nThese are the materials we looked through, modified, or extracted to complete this module's lecture.\n\n- [\"Base Plotting in R\" by Medium](https://towardsdatascience.com/base-plotting-in-r-eb365da06b22)\n-\t\t[\"Base R margins: a 
cheatsheet\"](https://r-graph-gallery.com/74-margin-and-oma-cheatsheet.html)\n", "supporting": [ "Module10-DataVisualization_files" ], diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png index 5656265..4e5c9c8 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-15-2.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-25-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-25-1.png index 232d44e..edfae88 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-25-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-25-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png index 1abfaa6..232d44e 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-26-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-27-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-27-1.png index 1abfaa6..232d44e 100644 Binary files a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-27-1.png and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-27-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png new file mode 100644 index 0000000..c6eb02c Binary files /dev/null and 
b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-28-1.png differ diff --git a/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-30-1.png b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-30-1.png new file mode 100644 index 0000000..c6eb02c Binary files /dev/null and b/_freeze/modules/Module10-DataVisualization/figure-revealjs/unnamed-chunk-30-1.png differ diff --git a/_freeze/modules/ModuleXX-Iteration/execute-results/html.json b/_freeze/modules/ModuleXX-Iteration/execute-results/html.json index e645b03..1afe07a 100644 --- a/_freeze/modules/ModuleXX-Iteration/execute-results/html.json +++ b/_freeze/modules/ModuleXX-Iteration/execute-results/html.json @@ -1,8 +1,7 @@ { - "hash": "08f7e544d2bf32fea09fb30b8607df0b", + "hash": "3038ecdd34c4713f40e08365819703e7", "result": { - "engine": "knitr", - "markdown": "---\ntitle: \"Iteration in R\"\nformat: revealjs\n---\n\n\n\n\n\n\n\n## Learning goals\n\n1. Replace repetitive code with a `for` loop\n1. Compare and contrast `for` loops and `*apply()` functions\n1. 
Use vectorization to replace unnecessary loops\n\n## What is iteration?\n\n* Whenever you repeat something, that's iteration.\n* In `R`, this means running the same code multiple times in a row.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Parts of a loop\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,9\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\n\n\nThe **header** declares how many times we will repeat the same code. The header\ncontains a **control variable** that changes in each repetition and a\n**sequence** of values for the control variable to take.\n\n## Parts of a loop\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2-8\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\n\n\nThe **body** of the loop contains code that will be repeated a number of times\nbased on the header instructions. 
In `R`, the body has to be surrounded by\ncurly braces.\n\n## Header parts\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n\n\n* `for`: keyword that declares we are doing a for loop.\n* `(...)`: parentheses after `for` declare the control variable and sequence.\n* `this_island`: the control variable.\n* `in`: keyword that separates the control variable and sequence.\n* `levels(penguins$island)`: the sequence.\n* `{}`: curly braces will contain the body code.\n\n## Header parts\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n\n\n* Since `levels(penguins$island)` evaluates to\n`c(\"Biscoe\", \"Dream\", \"Torgersen\")`, our loop will repeat 3 times.\n\n| Iteration | `this_island` |\n|-----------|---------------|\n| 1 | \"Biscoe\" |\n| 2 | \"Dream\" |\n| 3 | \"Torgersen\" |\n\n* Everything inside of `{...}` will be repeated three times.\n\n## Loop iteration 1\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Loop iteration 2\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Dream Island was 18.34 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Loop iteration 3\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) 
|>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Torgersen\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## The loop structure automates this process for us so we don't have to copy and paste our code!\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n\n\n:::\n:::\n\n\n\n\n## Remember: write DRY code!\n\n* DRY = \"Don't Repeat Yourself\"\n* Instead of copying and pasting, write loops and functions.\n* Easier to debug and change in the future!\n\n. . .\n\n* Of course, we all copy and paste code sometimes. If you are running on a\ntight deadline or can't get a loop or function to work, you might need to.\n**DRY code is good, but working code is best!**\n\n## {#tweet-slide data-menu-title=\"Hadley tweet\" .center}\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](../images/hadley-tweet.PNG)\n:::\n:::\n\n\n\n\n## You try it!\n\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints\nthe squared number.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n\tcat(i ^ 2, \"\\n\")\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100 \n```\n\n\n:::\n:::\n\n\n\n\n## Wait, did we need to do that? 
{.incremental}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n## Wait, did we need to do that?\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n\n\n:::\n:::\n\n\n\n\n## Wait, did we need to do that?\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n\n\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n [1] 1 3 5 7 9 11 13 15 17 19\n```\n\n\n:::\n:::\n\n\n\n\n* So you should really try vectorization first, then use loops only when\nyou can't use vectorization.\n\n## Loop walkthrough\n\n* Let's walk through a complex but useful example where we can't use\nvectorization.\n* Load the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n\tsubset(vaccine_antigen == \"MCV1\")\nstr(meas)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n'data.frame':\t7972 obs. 
of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...\n```\n\n\n:::\n:::\n\n\n\n\n## Loop walkthrough\n\n* First, make an empty `list`. This is where we'll store our results. Make it\nthe same length as the number of countries in the dataset.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n```\n:::\n\n\n\n\n* This is called *preallocation* and it can make your loops much faster.\n\n## Loop walkthrough\n\n* Loop through every country in the dataset, and get the median, first and third\nquartiles, and range for each country. Store those summary statistics in a data frame.\n* What should the header look like?\n\n. . .\n\n\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n```\n:::\n\n\n\n\n. . .\n\n* Note that we use the **index** as the control variable. When you need to\ndo complex operations inside a loop, this is easier than the **for-each**\nconstruction we used earlier.\n\n## Loop walkthrough {.scrollable}\n\n* Now write out the body of the code. First we need to subset the data, to get\nonly the data for the current country.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n}\n```\n:::\n\n\n\n\n. . .\n\n* Next we need to get the summary of the cases for that country.\n\n. . 
.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_med <- median(country_cases, na.rm = TRUE)\n\tcountry_iqr <- IQR(country_cases, na.rm = TRUE)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n}\n```\n:::\n\n\n\n\n. . .\n\n* Next we save the summary statistics into a data frame.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n}\n```\n:::\n\n\n\n\n. . 
.\n\n* And finally, we save the data frame as the next element in our storage list.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\t# Save the results to our container\n\tres[[i]] <- country_summary\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n\n\n:::\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n\n\n:::\n:::\n\n\n\n\n. . 
.\n\n* Let's take a look at the results.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(res)\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n```\n\n\n:::\n:::\n\n\n\n\n* How do we deal with this to get it into a nice form?\n\n. . .\n\n* We can use a *vectorization* trick: the function `do.call()` seems like\nancient computer science magic. And it is. But it will actually help us a\nlot.\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres_df <- do.call(rbind, res)\nhead(res_df)\n```\n\n::: {.cell-output-display}\n\n\n|country | min| Q1| median| Q3| max|\n|:--------------------|---:|-------:|------:|-------:|------:|\n|Afghanistan | 353| 1154.00| 2205| 5166.0| 31107|\n|Angola | 29| 700.00| 3271| 14474.0| 30067|\n|Albania | 0| 1.00| 12| 29.0| 136034|\n|Andorra | 0| 0.00| 1| 2.0| 5|\n|United Arab Emirates | 22| 89.75| 320| 1128.0| 2913|\n|Argentina | 0| 0.00| 17| 4591.5| 42093|\n:::\n:::\n\n\n\n\n* It combined our data frames together! Let's take a look at the `rbind()` and\n`do.call()` help pages to see what happened.\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?rbind\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. 
These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. 
If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. (Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' 
arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. 
Old-style categories (integer vectors with\n levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. (Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?do.call\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. 
The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n```\n\n\n:::\n:::\n\n\n\n\n. . .\n\n* OK, so basically what happened is that\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndo.call(rbind, list)\n```\n:::\n\n\n\n\n* Gets transformed into\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n```\n:::\n\n\n\n\n* That's vectorization magic!\n\n## You try it! (if we have time) {.smaller}\n\n* Use the code you wrote before to get the incidence per 1000 people on the\nentire measles data set (add a column for incidence to the full data).\n* Use the code `plot(NULL, NULL, ...)` to make a blank plot. 
You will need to\nset the `xlim` and `ylim` arguments to sensible values, and specify the axis\ntitles as \"Year\" and \"Incidence per 1000 people\".\n* Using a `for` loop and the `lines()` function, make a plot that shows all of\nthe incidence curves over time, overlapping on the plot.\n* HINT: use `col = adjustcolor(black, alpha.f = 0.25)` to make the curves\ntransparent, so you can see the others.\n* BONUS PROBLEM: using the function `cumsum()`, make a plot of the cumulative\nincidence per 1000 people over time for all of the countries. (Dealing with\nthe NA's here is tricky!!)\n\n## Main problem solution\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(0, 50),\n\txlab = \"Year\",\n\tylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cases_per_thousand,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n```\n:::\n\n\n\n\n## Main problem solution\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](ModuleXX-Iteration_files/figure-revealjs/unnamed-chunk-30-1.png){width=960}\n:::\n:::\n\n\n\n\n## Bonus problem solution\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n\tx = ifelse(is.na(meas$Cases), 0, meas$Cases),\n\tmeas$country,\n\tFUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(1, 6.2e6),\n\txlab = \"Year\",\n\tylab = \"Cumulative cases per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = 
country_data$cumulative_cases,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n\ntext(\n\tx = 2020,\n\ty = 6e6,\n\tlabels = \"China →\"\n)\n```\n:::\n\n\n\n\n## Bonus problem solution\n\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](ModuleXX-Iteration_files/figure-revealjs/unnamed-chunk-32-1.png){width=960}\n:::\n:::\n\n\n\n\n## More practice on your own {.smaller}\n\n* Merge the `countries-regions.csv` data with the `measles_final.Rds` data.\nReshape the measles data so that `MCV1` and `MCV2` vaccine coverage are two\nseparate columns. Then use a loop to fit a Poisson regression model for each\ncontinent where `Cases` is the outcome, and `MCV1 coverage` and `MCV2 coverage`\nare the predictors. Discuss your findings, and try adding an interaction term.\n* Assess the impact of `age_months` as a confounder in the diphtheria serology\ndata. First, write code to transform `age_months` into age ranges for each\nyear. Then, using a loop, calculate the crude odds ratio for the effect of\nvaccination on infection for each of the age ranges. How does the odds ratio\nchange as age increases? Can you formalize this analysis by fitting a logistic\nregression model with `age_months` and vaccination as predictors?\n\n\n", + "markdown": "---\ntitle: \"Iteration in R\"\nformat:\n revealjs:\n toc: false\n---\n\n\n\n\n\n## Learning goals\n\n1. Replace repetitive code with a `for` loop\n1. 
Use vectorization to replace unnecessary loops\n\n## What is iteration?\n\n* Whenever you repeat something, that's iteration.\n* In `R`, this means running the same code multiple times in a row.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(\"penguins\", package = \"palmerpenguins\")\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n:::\n:::\n\n\n## Parts of a loop\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,9\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\nThe **header** declares how many times we will repeat the same code. The header\ncontains a **control variable** that changes in each repetition and a\n**sequence** of values for the control variable to take.\n\n## Parts of a loop\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2-8\"}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n:::\n\n\nThe **body** of the loop contains code that will be repeated a number of times\nbased on the header instructions. 
In `R`, the body has to be surrounded by\ncurly braces.\n\n## Header parts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n* `for`: keyword that declares we are doing a for loop.\n* `(...)`: parentheses after `for` declare the control variable and sequence.\n* `this_island`: the control variable.\n* `in`: keyword that separates the control variable and sequence.\n* `levels(penguins$island)`: the sequence.\n* `{}`: curly braces will contain the body code.\n\n## Header parts\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {...}\n```\n:::\n\n\n* Since `levels(penguins$island)` evaluates to\n`c(\"Biscoe\", \"Dream\", \"Torgersen\")`, our loop will repeat 3 times.\n\n| Iteration | `this_island` |\n|-----------|---------------|\n| 1 | \"Biscoe\" |\n| 2 | \"Dream\" |\n| 3 | \"Torgersen\" |\n\n* Everything inside of `{...}` will be repeated three times.\n\n## Loop iteration 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Biscoe\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Biscoe\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\n```\n:::\n:::\n\n\n## Loop iteration 2\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Dream\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", \"Dream\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThe mean bill depth on Dream Island was 18.34 mm.\n```\n:::\n:::\n\n\n## Loop iteration 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nisland_mean <-\n\tpenguins$bill_depth_mm[penguins$island == \"Torgersen\"] |>\n\tmean(na.rm = TRUE) |>\n\tround(digits = 2)\n\ncat(paste(\"The mean bill depth on\", 
\"Torgersen\", \"Island was\", island_mean,\n\t\t\t\t\t\"mm.\\n\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n:::\n:::\n\n\n## The loop structure automates this process for us so we don't have to copy and paste our code!\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (this_island in levels(penguins$island)) {\n\tisland_mean <-\n\t\tpenguins$bill_depth_mm[penguins$island == this_island] |>\n\t\tmean(na.rm = TRUE) |>\n\t\tround(digits = 2)\n\t\n\tcat(paste(\"The mean bill depth on\", this_island, \"Island was\", island_mean,\n\t\t\t\t\t\t\t\"mm.\\n\"))\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nThe mean bill depth on Biscoe Island was 15.87 mm.\nThe mean bill depth on Dream Island was 18.34 mm.\nThe mean bill depth on Torgersen Island was 18.43 mm.\n```\n:::\n:::\n\n\n## Remember: write DRY code!\n\n* DRY = \"Don't Repeat Yourself\"\n* Instead of copying and pasting, write loops and functions.\n* Easier to debug and change in the future!\n\n. . .\n\n* Of course, we all copy and paste code sometimes. If you are running on a\ntight deadline or can't get a loop or function to work, you might need to.\n**DRY code is good, but working code is best!**\n\n## {#tweet-slide data-menu-title=\"Hadley tweet\" .center}\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](../images/hadley-tweet.PNG)\n:::\n:::\n\n\n## You try it!\n\nWrite a loop that goes from 1 to 10, squares each of the numbers, and prints\nthe squared number.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:10) {\n\tcat(i ^ 2, \"\\n\")\n}\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n1 \n4 \n9 \n16 \n25 \n36 \n49 \n64 \n81 \n100 \n```\n:::\n:::\n\n\n## Wait, did we need to do that? 
{.incremental}\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n## Wait, did we need to do that?\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n\n## Wait, did we need to do that?\n\n* Well, yes, because you need to practice loops!\n* But technically no, because we can use **vectorization**.\n* Almost all basic operations in R are **vectorized**: they work on a vector of\narguments all at the same time.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# No loop needed!\n(1:10)^2\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 4 9 16 25 36 49 64 81 100\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Get the first 10 odd numbers, a common CS 101 loop problem on exams\n(1:20)[which((1:20 %% 2) == 1)]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n [1] 1 3 5 7 9 11 13 15 17 19\n```\n:::\n:::\n\n\n* So you should really try vectorization first, then use loops only when\nyou can't use vectorization.\n\n## Loop walkthrough\n\n* Let's walk through a complex but useful example where we can't use\nvectorization.\n* Load the cleaned measles dataset, and subset it so you only have MCV1 records.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas <- readRDS(here::here(\"data\", \"measles_final.Rds\")) |>\n\tsubset(vaccine_antigen == \"MCV1\")\nstr(meas)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n'data.frame':\t7972 obs. 
of 7 variables:\n $ iso3c : chr \"AFG\" \"AFG\" \"AFG\" \"AFG\" ...\n $ time : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...\n $ country : chr \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" \"Afghanistan\" ...\n $ Cases : int 2792 5166 2900 640 353 2012 1511 638 1154 492 ...\n $ vaccine_antigen : chr \"MCV1\" \"MCV1\" \"MCV1\" \"MCV1\" ...\n $ vaccine_coverage: int 11 NA 8 9 14 14 14 31 34 22 ...\n $ total_pop : chr \"12486631\" \"11155195\" \"10088289\" \"9951449\" ...\n```\n:::\n:::\n\n\n## Loop walkthrough\n\n* First, make an empty `list`. This is where we'll store our results. Make it\nthe same length as the number of countries in the dataset.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- vector(mode = \"list\", length = length(unique(meas$country)))\n```\n:::\n\n\n* This is called *preallocation* and it can make your loops much faster.\n\n## Loop walkthrough\n\n* Loop through every country in the dataset, and get the median, first and third\nquartiles, and range for each country. Store those summary statistics in a data frame.\n* What should the header look like?\n\n. . .\n\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncountries <- unique(meas$country)\nfor (i in 1:length(countries)) {...}\n```\n:::\n\n\n. . .\n\n* Note that we use the **index** as the control variable. When you need to\ndo complex operations inside a loop, this is easier than the **for-each**\nconstruction we used earlier.\n\n## Loop walkthrough {.scrollable}\n\n* Now write out the body of the code. First we need to subset the data, to get\nonly the data for the current country.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n}\n```\n:::\n\n\n. . .\n\n* Next we need to get the summary of the cases for that country.\n\n. . 
.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_med <- median(country_cases, na.rm = TRUE)\n\tcountry_iqr <- IQR(country_cases, na.rm = TRUE)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n}\n```\n:::\n\n\n. . .\n\n* Next we save the summary statistics into a data frame.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n}\n```\n:::\n\n\n. . 
.\n\n* And finally, we save the data frame as the next element in our storage list.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfor (i in 1:length(countries)) {\n\t# Get the data for the current country only\n\tcountry_data <- subset(meas, country == countries[i])\n\t\n\t# Get the summary statistics for this country\n\tcountry_cases <- country_data$Cases\n\tcountry_quart <- quantile(\n\t\tcountry_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)\n\t)\n\tcountry_range <- range(country_cases, na.rm = TRUE)\n\t\n\t# Save the summary statistics into a data frame\n\tcountry_summary <- data.frame(\n\t\tcountry = countries[[i]],\n\t\tmin = country_range[[1]],\n\t\tQ1 = country_quart[[1]],\n\t\tmedian = country_quart[[2]],\n\t\tQ3 = country_quart[[3]],\n\t\tmax = country_range[[2]]\n\t)\n\t\n\t# Save the results to our container\n\tres[[i]] <- country_summary\n}\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in min(x): no non-missing arguments to min; returning Inf\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nWarning in max(x): no non-missing arguments to max; returning -Inf\n```\n:::\n:::\n\n\n. . 
.\n\n* Let's take a look at the results.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhead(res)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n country min Q1 median Q3 max\n1 Afghanistan 353 1154 2205 5166 31107\n\n[[2]]\n country min Q1 median Q3 max\n1 Angola 29 700 3271 14474 30067\n\n[[3]]\n country min Q1 median Q3 max\n1 Albania 0 1 12 29 136034\n\n[[4]]\n country min Q1 median Q3 max\n1 Andorra 0 0 1 2 5\n\n[[5]]\n country min Q1 median Q3 max\n1 United Arab Emirates 22 89.75 320 1128 2913\n\n[[6]]\n country min Q1 median Q3 max\n1 Argentina 0 0 17 4591.5 42093\n```\n:::\n:::\n\n\n* How do we deal with this to get it into a nice form?\n\n. . .\n\n* We can use a *vectorization* trick: the function `do.call()` seems like\nancient computer science magic. And it is. But it will actually help us a\nlot.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres_df <- do.call(rbind, res)\nhead(res_df)\n```\n\n::: {.cell-output-display}\n|country | min| Q1| median| Q3| max|\n|:--------------------|---:|-------:|------:|-------:|------:|\n|Afghanistan | 353| 1154.00| 2205| 5166.0| 31107|\n|Angola | 29| 700.00| 3271| 14474.0| 30067|\n|Albania | 0| 1.00| 12| 29.0| 136034|\n|Andorra | 0| 0.00| 1| 2.0| 5|\n|United Arab Emirates | 22| 89.75| 320| 1128.0| 2913|\n|Argentina | 0| 0.00| 17| 4591.5| 42093|\n:::\n:::\n\n\n* It combined our data frames together! Let's take a look at the `rbind` and\n`do.call()` help pages to see what happened.\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?rbind\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nCombine R Objects by Rows or Columns\n\nDescription:\n\n Take a sequence of vector, matrix or data-frame arguments and\n combine by _c_olumns or _r_ows, respectively. 
These are generic\n functions with methods for other R classes.\n\nUsage:\n\n cbind(..., deparse.level = 1)\n rbind(..., deparse.level = 1)\n ## S3 method for class 'data.frame'\n rbind(..., deparse.level = 1, make.row.names = TRUE,\n stringsAsFactors = FALSE, factor.exclude = TRUE)\n \nArguments:\n\n ...: (generalized) vectors or matrices. These can be given as\n named arguments. Other R objects may be coerced as\n appropriate, or S4 methods may be used: see sections\n 'Details' and 'Value'. (For the '\"data.frame\"' method of\n 'cbind' these can be further arguments to 'data.frame' such\n as 'stringsAsFactors'.)\n\ndeparse.level: integer controlling the construction of labels in the\n case of non-matrix-like arguments (for the default method):\n 'deparse.level = 0' constructs no labels;\n the default 'deparse.level = 1' typically and 'deparse.level\n = 2' always construct labels from the argument names, see the\n 'Value' section below.\n\nmake.row.names: (only for data frame method:) logical indicating if\n unique and valid 'row.names' should be constructed from the\n arguments.\n\nstringsAsFactors: logical, passed to 'as.data.frame'; only has an\n effect when the '...' arguments contain a (non-'data.frame')\n 'character'.\n\nfactor.exclude: if the data frames contain factors, the default 'TRUE'\n ensures that 'NA' levels of factors are kept, see PR#17562\n and the 'Data frame methods'. In R versions up to 3.6.x,\n 'factor.exclude = NA' has been implicitly hardcoded (R <=\n 3.6.0) or the default (R = 3.6.x, x >= 1).\n\nDetails:\n\n The functions 'cbind' and 'rbind' are S3 generic, with methods for\n data frames. The data frame method will be used if at least one\n argument is a data frame and the rest are vectors or matrices.\n There can be other methods; in particular, there is one for time\n series objects. See the section on 'Dispatch' for how the method\n to be used is selected. 
If some of the arguments are of an S4\n class, i.e., 'isS4(.)' is true, S4 methods are sought also, and\n the hidden 'cbind' / 'rbind' functions from package 'methods'\n maybe called, which in turn build on 'cbind2' or 'rbind2',\n respectively. In that case, 'deparse.level' is obeyed, similarly\n to the default method.\n\n In the default method, all the vectors/matrices must be atomic\n (see 'vector') or lists. Expressions are not allowed. Language\n objects (such as formulae and calls) and pairlists will be coerced\n to lists: other objects (such as names and external pointers) will\n be included as elements in a list result. Any classes the inputs\n might have are discarded (in particular, factors are replaced by\n their internal codes).\n\n If there are several matrix arguments, they must all have the same\n number of columns (or rows) and this will be the number of columns\n (or rows) of the result. If all the arguments are vectors, the\n number of columns (rows) in the result is equal to the length of\n the longest vector. Values in shorter arguments are recycled to\n achieve this length (with a 'warning' if they are recycled only\n _fractionally_).\n\n When the arguments consist of a mix of matrices and vectors the\n number of columns (rows) of the result is determined by the number\n of columns (rows) of the matrix arguments. Any vectors have their\n values recycled or subsetted to achieve this length.\n\n For 'cbind' ('rbind'), vectors of zero length (including 'NULL')\n are ignored unless the result would have zero rows (columns), for\n S compatibility. (Zero-extent matrices do not occur in S3 and are\n not ignored in R.)\n\n Matrices are restricted to less than 2^31 rows and columns even on\n 64-bit systems. So input vectors have the same length\n restriction: as from R 3.2.0 input matrices with more elements\n (but meeting the row and column restrictions) are allowed.\n\nValue:\n\n For the default method, a matrix combining the '...' 
arguments\n column-wise or row-wise. (Exception: if there are no inputs or\n all the inputs are 'NULL', the value is 'NULL'.)\n\n The type of a matrix result determined from the highest type of\n any of the inputs in the hierarchy raw < logical < integer <\n double < complex < character < list .\n\n For 'cbind' ('rbind') the column (row) names are taken from the\n 'colnames' ('rownames') of the arguments if these are matrix-like.\n Otherwise from the names of the arguments or where those are not\n supplied and 'deparse.level > 0', by deparsing the expressions\n given, for 'deparse.level = 1' only if that gives a sensible name\n (a 'symbol', see 'is.symbol').\n\n For 'cbind' row names are taken from the first argument with\n appropriate names: rownames for a matrix, or names for a vector of\n length the number of rows of the result.\n\n For 'rbind' column names are taken from the first argument with\n appropriate names: colnames for a matrix, or names for a vector of\n length the number of columns of the result.\n\nData frame methods:\n\n The 'cbind' data frame method is just a wrapper for\n 'data.frame(..., check.names = FALSE)'. This means that it will\n split matrix columns in data frame arguments, and convert\n character columns to factors unless 'stringsAsFactors = FALSE' is\n specified.\n\n The 'rbind' data frame method first drops all zero-column and\n zero-row arguments. (If that leaves none, it returns the first\n argument with columns otherwise a zero-column zero-row data\n frame.) It then takes the classes of the columns from the first\n data frame, and matches columns by name (rather than by position).\n Factors have their levels expanded as necessary (in the order of\n the levels of the level sets of the factors encountered) and the\n result is an ordered factor if and only if all the components were\n ordered factors. (The last point differs from S-PLUS.) 
Old-style\n categories (integer vectors with levels) are promoted to factors.\n\n Note that for result column 'j', 'factor(., exclude = X(j))' is\n applied, where\n\n X(j) := if(isTRUE(factor.exclude)) {\n if(!NA.lev[j]) NA # else NULL\n } else factor.exclude\n \n where 'NA.lev[j]' is true iff any contributing data frame has had\n a 'factor' in column 'j' with an explicit 'NA' level.\n\nDispatch:\n\n The method dispatching is _not_ done via 'UseMethod()', but by\n C-internal dispatching. Therefore there is no need for, e.g.,\n 'rbind.default'.\n\n The dispatch algorithm is described in the source file\n ('.../src/main/bind.c') as\n\n 1. For each argument we get the list of possible class\n memberships from the class attribute.\n\n 2. We inspect each class in turn to see if there is an\n applicable method.\n\n 3. If we find a method, we use it. Otherwise, if there was an\n S4 object among the arguments, we try S4 dispatch; otherwise,\n we use the default code.\n\n If you want to combine other objects with data frames, it may be\n necessary to coerce them to data frames first. (Note that this\n algorithm can result in calling the data frame method if all the\n arguments are either data frames or vectors, and this will result\n in the coercion of character vectors to factors.)\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'c' to combine vectors (and lists) as vectors, 'data.frame' to\n combine vectors and matrices as a data frame.\n\nExamples:\n\n m <- cbind(1, 1:7) # the '1' (= shorter vector) is recycled\n m\n m <- cbind(m, 8:14)[, c(1, 3, 2)] # insert a column\n m\n cbind(1:7, diag(3)) # vector is subset -> warning\n \n cbind(0, rbind(1, 1:3))\n cbind(I = 0, X = rbind(a = 1, b = 1:3)) # use some names\n xx <- data.frame(I = rep(0,2))\n cbind(xx, X = rbind(a = 1, b = 1:3)) # named differently\n \n cbind(0, matrix(1, nrow = 0, ncol = 4)) #> Warning (making sense)\n dim(cbind(0, matrix(1, nrow = 2, ncol = 0))) #-> 2 x 1\n \n ## deparse.level\n dd <- 10\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 0) # middle 2 rownames\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 1) # 3 rownames (default)\n rbind(1:4, c = 2, \"a++\" = 10, dd, deparse.level = 2) # 4 rownames\n \n ## cheap row names:\n b0 <- gl(3,4, labels=letters[1:3])\n bf <- setNames(b0, paste0(\"o\", seq_along(b0)))\n df <- data.frame(a = 1, B = b0, f = gl(4,3))\n df. <- data.frame(a = 1, B = bf, f = gl(4,3))\n new <- data.frame(a = 8, B =\"B\", f = \"1\")\n (df1 <- rbind(df , new))\n (df.1 <- rbind(df., new))\n stopifnot(identical(df1, rbind(df, new, make.row.names=FALSE)),\n identical(df1, rbind(df., new, make.row.names=FALSE)))\n```\n:::\n:::\n\n\n. . .\n\n\n::: {.cell}\n\n```{.r .cell-code}\n?do.call\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nExecute a Function Call\n\nDescription:\n\n 'do.call' constructs and executes a function call from a name or a\n function and a list of arguments to be passed to it.\n\nUsage:\n\n do.call(what, args, quote = FALSE, envir = parent.frame())\n \nArguments:\n\n what: either a function or a non-empty character string naming the\n function to be called.\n\n args: a _list_ of arguments to the function call. 
The 'names'\n attribute of 'args' gives the argument names.\n\n quote: a logical value indicating whether to quote the arguments.\n\n envir: an environment within which to evaluate the call. This will\n be most useful if 'what' is a character string and the\n arguments are symbols or quoted expressions.\n\nDetails:\n\n If 'quote' is 'FALSE', the default, then the arguments are\n evaluated (in the calling environment, not in 'envir'). If\n 'quote' is 'TRUE' then each argument is quoted (see 'quote') so\n that the effect of argument evaluation is to remove the quotes -\n leaving the original arguments unevaluated when the call is\n constructed.\n\n The behavior of some functions, such as 'substitute', will not be\n the same for functions evaluated using 'do.call' as if they were\n evaluated from the interpreter. The precise semantics are\n currently undefined and subject to change.\n\nValue:\n\n The result of the (evaluated) function call.\n\nWarning:\n\n This should not be used to attempt to evade restrictions on the\n use of '.Internal' and other non-API calls.\n\nReferences:\n\n Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S\n Language_. 
Wadsworth & Brooks/Cole.\n\nSee Also:\n\n 'call' which creates an unevaluated call.\n\nExamples:\n\n do.call(\"complex\", list(imaginary = 1:3))\n \n ## if we already have a list (e.g., a data frame)\n ## we need c() to add further arguments\n tmp <- expand.grid(letters[1:2], 1:3, c(\"+\", \"-\"))\n do.call(\"paste\", c(tmp, sep = \"\"))\n \n do.call(paste, list(as.name(\"A\"), as.name(\"B\")), quote = TRUE)\n \n ## examples of where objects will be found.\n A <- 2\n f <- function(x) print(x^2)\n env <- new.env()\n assign(\"A\", 10, envir = env)\n assign(\"f\", f, envir = env)\n f <- function(x) print(x)\n f(A) # 2\n do.call(\"f\", list(A)) # 2\n do.call(\"f\", list(A), envir = env) # 4\n do.call( f, list(A), envir = env) # 2\n do.call(\"f\", list(quote(A)), envir = env) # 100\n do.call( f, list(quote(A)), envir = env) # 10\n do.call(\"f\", list(as.name(\"A\")), envir = env) # 100\n \n eval(call(\"f\", A)) # 2\n eval(call(\"f\", quote(A))) # 2\n eval(call(\"f\", A), envir = env) # 4\n eval(call(\"f\", quote(A)), envir = env) # 100\n```\n:::\n:::\n\n\n. . .\n\n* OK, so basically what happened is that\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndo.call(rbind, list)\n```\n:::\n\n\n* Gets transformed into\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])\n```\n:::\n\n\n* That's vectorization magic!\n\n## You try it! (if we have time) {.smaller}\n\n* Use the code you wrote before to get the incidence per 1000 people on the\nentire measles data set (add a column for incidence to the full data).\n* Use the code `plot(NULL, NULL, ...)` to make a blank plot. 
You will need to\nset the `xlim` and `ylim` arguments to sensible values, and specify the axis\ntitles as \"Year\" and \"Incidence per 1000 people\".\n* Using a `for` loop and the `lines()` function, make a plot that shows all of\nthe incidence curves over time, overlapping on the plot.\n* HINT: use `col = adjustcolor(\"black\", alpha.f = 0.25)` to make the curves\ntransparent, so you can see the others.\n* BONUS PROBLEM: using the function `cumsum()`, make a plot of the cumulative\nincidence per 1000 people over time for all of the countries. (Dealing with\nthe NA's here is tricky!!)\n\n## Main problem solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\nmeas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000\ncountries <- unique(meas$country)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(0, 50),\n\txlab = \"Year\",\n\tylab = \"Incidence per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cases_per_thousand,\n\t\tcol = adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n```\n:::\n\n\n## Main problem solution\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](ModuleXX-Iteration_files/figure-revealjs/unnamed-chunk-30-1.png){width=960}\n:::\n:::\n\n\n## Bonus problem solution\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# First calculate the cumulative cases, treating NA as zeroes\ncumulative_cases <- ave(\n\tx = ifelse(is.na(meas$Cases), 0, meas$Cases),\n\tmeas$country,\n\tFUN = cumsum\n)\n\n# Now put the NAs back where they should be\nmeas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)\n\nplot(\n\tNULL, NULL,\n\txlim = c(1980, 2022),\n\tylim = c(1, 6.2e6),\n\txlab = \"Year\",\n\tylab = \"Cumulative cases per 1000 people\"\n)\n\nfor (i in 1:length(countries)) {\n\tcountry_data <- subset(meas, country == countries[[i]])\n\tlines(\n\t\tx = country_data$time,\n\t\ty = country_data$cumulative_cases,\n\t\tcol = 
adjustcolor(\"black\", alpha.f = 0.25)\n\t)\n}\n\ntext(\n\tx = 2020,\n\ty = 6e6,\n\tlabels = \"China →\"\n)\n```\n:::\n\n\n## Bonus problem solution\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](ModuleXX-Iteration_files/figure-revealjs/unnamed-chunk-32-1.png){width=960}\n:::\n:::\n\n\n## More practice on your own {.smaller}\n\n* Merge the `countries-regions.csv` data with the `measles_final.Rds` data.\nReshape the measles data so that `MCV1` and `MCV2` vaccine coverage are two\nseparate columns. Then use a loop to fit a Poisson regression model for each\ncontinent where `Cases` is the outcome, and `MCV1 coverage` and `MCV2 coverage`\nare the predictors. Discuss your findings, and try adding an interaction term.\n* Assess the impact of `age_months` as a confounder in the Diphtheria serology\ndata. First, write code to transform `age_months` into age ranges for each\nyear. Then, using a loop, calculate the crude odds ratio for the effect of\nvaccination on infection for each of the age ranges. How does the odds ratio\nchange as age increases? 
Can you formalize this analysis by fitting a logistic\nregression model with `age_months` and vaccination as predictors?\n\n\n", "supporting": [ "ModuleXX-Iteration_files" ], diff --git a/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-30-1.png b/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-30-1.png index 84a077e..d1c55ba 100644 Binary files a/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-30-1.png and b/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-30-1.png differ diff --git a/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-32-1.png b/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-32-1.png index 009fce5..66ca0eb 100644 Binary files a/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-32-1.png and b/_freeze/modules/ModuleXX-Iteration/figure-revealjs/unnamed-chunk-32-1.png differ diff --git a/archive/participants.xlsx b/archive/participants.xlsx new file mode 100644 index 0000000..9ec8bb6 Binary files /dev/null and b/archive/participants.xlsx differ diff --git a/docs/archive/CaseStudy01.html b/docs/archive/CaseStudy01.html new file mode 100644 index 0000000..81d333a --- /dev/null +++ b/docs/archive/CaseStudy01.html @@ -0,0 +1,1008 @@ + + + + + + + + + + + + + + + SISMID Module NUMBER Materials (2025) - Algorithmic Thinking Case Study 1 + + + + + + + + + + + + + + + +
+
+ +
+

Algorithmic Thinking Case Study 1

+

SISMID 2024 – Introduction to R

+ +
+ + +
+ +
+
+

Learning goals

+
    +
  • Use logical operators, subsetting functions, and math calculations in R
  • +
  • Translate human-understandable problem descriptions into instructions that R can understand.
  • +
+
+
+
+

Remember, R always does EXACTLY what you tell it to do!

+ +
+
+

Instructions

+
    +
  • Make a new R script for this case study, and save it to your code folder.
  • +
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.
  • +
+
+
+

Instructions

+
    +
  • Make a new R script for this case study, and save it to your code folder.
  • +
  • We’ll use the diphtheria serosample data from Exercise 1 for this case study. Load it into R and use the functions we’ve learned to look at it.
  • +
  • The str() of your dataset should look like this.
  • +
+
+
+
tibble [250 × 5] (S3: tbl_df/tbl/data.frame)
+ $ age_months  : num [1:250] 15 44 103 88 88 118 85 19 78 112 ...
+ $ group       : chr [1:250] "urban" "rural" "urban" "urban" ...
+ $ DP_antibody : num [1:250] 0.481 0.657 1.368 1.218 0.333 ...
+ $ DP_infection: num [1:250] 1 1 1 1 1 1 1 1 1 1 ...
+ $ DP_vacc     : num [1:250] 0 1 1 1 1 1 1 1 1 1 ...
+
+
+
+
+

Q1: Was the overall prevalence higher in urban or rural areas?

+
+
    +
  1. How do we calculate the prevalence from the data?
  2. +
  3. How do we calculate the prevalence separately for urban and rural areas?
  4. +
  5. How do we determine which prevalence is higher and if the difference is meaningful?
  6. +
+
+
+
+

Q1: How do we calculate the prevalence from the data?

+
+
    +
  • The variable DP_infection in our dataset is binary / dichotomous.
  • +
  • The prevalence is the number or percent of people who had the disease over some duration.
  • +
  • The average of a binary variable gives the prevalence!
  • +
+
+
+
+
mean(diph$DP_infection)
+
+
[1] 0.8
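To see why the mean of a 0/1 variable is the prevalence, here is a minimal sketch on a made-up indicator vector. The name `infected` and its values are hypothetical and are not taken from the diphtheria data. Counting the 1s and dividing by the number of records gives the same number as `mean()`.

```r
# Toy 0/1 infection indicator (hypothetical values, not the diph data)
infected <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1)

# Prevalence the long way: number of 1s divided by number of records
prevalence_by_count <- sum(infected == 1) / length(infected)

# Prevalence as the mean of the indicator
prevalence_by_mean <- mean(infected)

# Check that both approaches agree
all.equal(prevalence_by_count, prevalence_by_mean)
```

This shortcut only works when the variable is coded as 0 and 1; a variable coded 1 and 2, for example, would need recoding first.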
+
+
+
+
+
+

Q1: How do we calculate the prevalence separately for urban and rural areas?

+
+
+
mean(diph[diph$group == "urban", ]$DP_infection)
+
+
[1] 0.8235294
+
+
mean(diph[diph$group == "rural", ]$DP_infection)
+
+
[1] 0.778626
+
+
+
+
+
    +
  • There are many ways you could write this code! You can use subset() or you can write the indices many ways.
  • +
  • tbl_df objects from haven follow different [[ rules than a base R data frame.
  • +
+
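As a rough illustration of the "many ways" point, the sketch below computes the same group prevalence four ways. The data frame `toy` is hypothetical and only borrows the column names from the slides; it is not the real dataset.

```r
# Hypothetical toy data, standing in for the real dataset
toy <- data.frame(
	group = c("urban", "rural", "urban", "rural", "urban"),
	DP_infection = c(1, 0, 1, 1, 0)
)

# 1. Filter the rows, then pull out the column
p_rows_first <- mean(toy[toy$group == "urban", ]$DP_infection)

# 2. Pull out the column, then filter the vector
p_col_first <- mean(toy$DP_infection[toy$group == "urban"])

# 3. Let subset() do the row filtering
p_subset <- mean(subset(toy, group == "urban")$DP_infection)

# 4. Filter rows and name the column in one [ , ] call.
# This returns a plain vector for a base data frame; a tibble would return
# a one-column tibble instead, so use $ or [[ when you need the vector.
p_bracket <- mean(toy[toy$group == "urban", "DP_infection"])

c(p_rows_first, p_col_first, p_subset, p_bracket)
```

All four values should match for a base data frame. The fourth form is the one most likely to behave differently if the data are stored as a tibble.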
+
+
+

Q1: How do we calculate the prevalence separately for urban and rural areas?

+
    +
  • One easy way is to use the aggregate() function.
  • +
+
+
aggregate(DP_infection ~ group, data = diph, FUN = mean)
+
+
  group DP_infection
+1 rural    0.7786260
+2 urban    0.8235294
+
+
+
+
+

Q1: How do we determine which prevalence is higher and if the difference is meaningful?

+
+
    +
  • We probably need to include a confidence interval in our calculation.
  • +
  • This is actually not so easy without more advanced tools that we will learn in upcoming modules.
  • +
  • Right now the best options are to do it by hand or google a function.
  • +
+
+
+
+

Q1: By hand

+
+
p_urban <- mean(diph[diph$group == "urban", ]$DP_infection)
+p_rural <- mean(diph[diph$group == "rural", ]$DP_infection)
+se_urban <- sqrt(p_urban * (1 - p_urban) / nrow(diph[diph$group == "urban", ]))
+se_rural <- sqrt(p_rural * (1 - p_rural) / nrow(diph[diph$group == "rural", ])) 
+
+result_urban <- paste0(
+    "Urban: ", round(p_urban, 2), "; 95% CI: (",
+    round(p_urban - 1.96 * se_urban, 2), ", ",
+    round(p_urban + 1.96 * se_urban, 2), ")"
+)
+
+result_rural <- paste0(
+    "Rural: ", round(p_rural, 2), "; 95% CI: (",
+    round(p_rural - 1.96 * se_rural, 2), ", ",
+    round(p_rural + 1.96 * se_rural, 2), ")"
+)
+
+cat(result_urban, result_rural, sep = "\n")
+
+
Urban: 0.82; 95% CI: (0.76, 0.89)
+Rural: 0.78; 95% CI: (0.71, 0.85)
+
+
+
+
+

Q1: By hand

+
    +
  • We can see that the 95% CIs overlap, so the groups are probably not that different. To be sure, we need to do a 2-sample test! But this is not a statistics class.
  • +
  • Some people will tell you that coding like this is “bad”. But ‘bad’ code that gives you answers is better than broken code! We will learn techniques for writing this with less work and less repetition in upcoming modules.
  • +
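
One possible way to cut the repetition is to wrap the by-hand calculation in a small function. This is only a sketch: `wald_ci()` is a hypothetical helper name, the toy vectors are made up, and writing functions is covered later. The last call uses base R's `prop.test()` as one example of a 2-sample test of proportions.

```r
# Hypothetical helper: normal-approximation (Wald) 95% CI for a proportion,
# using the same formula as the by-hand code
wald_ci <- function(x, z = 1.96) {
  p  <- mean(x)                        # proportion of 1s
  se <- sqrt(p * (1 - p) / length(x))  # standard error of the proportion
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

# Made-up binary data standing in for two groups
urban_toy <- c(rep(1, 80), rep(0, 20))
rural_toy <- c(rep(1, 70), rep(0, 30))

round(wald_ci(urban_toy), 2)
round(wald_ci(rural_toy), 2)

# 2-sample test of equal proportions: x = number of 1s, n = group sizes
prop.test(
  x = c(sum(urban_toy), sum(rural_toy)),
  n = c(length(urban_toy), length(rural_toy))
)
```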
+
+
+

Q1: Googling a package

+
+
+
# install.packages("DescTools")
+library(DescTools)
+
+aggregate(DP_infection ~ group, data = diph, FUN = DescTools::MeanCI)
+
+
  group DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci
+1 rural         0.7786260           0.7065872           0.8506647
+2 urban         0.8235294           0.7540334           0.8930254
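
These limits are slightly different from the by-hand ones. As far as we can tell, that is because `MeanCI()` with default settings returns the classic t-based interval for a mean rather than the 1.96 normal approximation. A base R sketch of that calculation, on made-up data (`x_toy` is an assumption for this example):

```r
# Hypothetical binary data
x_toy <- c(rep(1, 80), rep(0, 20))
n <- length(x_toy)

# t-based 95% CI: mean +/- t quantile * sd / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
se <- sd(x_toy) / sqrt(n)

ci_t <- c(
  mean   = mean(x_toy),
  lwr.ci = mean(x_toy) - t_crit * se,
  upr.ci = mean(x_toy) + t_crit * se
)
ci_t
```

With samples of this size the two approaches usually agree to about two decimal places.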
+
+
+
+
+
+

You try it!

+
    +
  • Using any of the approaches you can think of, answer this question!
  • +
  • How many children under 5 were vaccinated? In children under 5, did vaccination lower the prevalence of infection?
  • +
+
+
+

You try it!

+
+
# How many children under 5 were vaccinated
+sum(diph$DP_vacc[diph$age_months < 60])
+
+
[1] 91
+
+
# Prevalence in both vaccine groups for children under 5
+aggregate(
+    DP_infection ~ DP_vacc,
+    data = subset(diph, age_months < 60),
+    FUN = DescTools::MeanCI
+)
+
+
  DP_vacc DP_infection.mean DP_infection.lwr.ci DP_infection.upr.ci
+1       0         0.4285714           0.1977457           0.6593972
+2       1         0.6373626           0.5366845           0.7380407
+
+
+

It appears that prevalence was HIGHER in the vaccine group? That is counterintuitive, but the sample size for the unvaccinated group is too small to be sure.
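
A quick way to check how many children are in each group is `table()`. The sketch below uses a made-up data frame (`toy_kids` and its columns are assumptions for illustration), so the counts are not the real ones.

```r
# Hypothetical toy data
toy_kids <- data.frame(
  age_months = c(12, 30, 48, 59, 60, 72, 20, 45),
  vacc       = c(1, 1, 0, 1, 1, 0, 1, 0),
  infected   = c(1, 0, 1, 1, 0, 1, 0, 0)
)

# Keep only children under 5 years (60 months)
under5 <- subset(toy_kids, age_months < 60)

# Group sizes: how many unvaccinated (0) and vaccinated (1)
table(under5$vacc)

# Because vacc is coded 0/1, sum() counts the vaccinated children
n_vacc <- sum(under5$vacc)
n_vacc
```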

+
+
+

Congratulations on finishing the first case study!

+
    +
  • What R functions and skills did you practice?
  • +
  • What other questions could you answer about the same dataset with the skills you know now?
  • +
+ + +
+
+
+ + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/modules/Module00-Welcome.html b/docs/modules/Module00-Welcome.html index 27ae87d..1601c48 100644 --- a/docs/modules/Module00-Welcome.html +++ b/docs/modules/Module00-Welcome.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Welcome to SISMID Workshop: Introduction to R + SISMID Module NUMBER Materials (2025) - Welcome to SISMID Workshop: Introduction to R @@ -157,8 +157,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -343,35 +342,16 @@

Welcome to SISMID Workshop: Introduction to R

-
-

Welcome to SISMID Workshop: Introduction to R!

-

Amy Winter (she/her) Assistant Professor, Department of Epidemiology and Biostatistics Email: awinter@uga.edu

-

Zane Billings (he/him) PhD Candidate, Department of Epidemiology and Biostatistics Email: Wesley.Billings@uga.edu

+

Amy Winter (she/her)

+

Assistant Professor, Department of Epidemiology and Biostatistics

+

Email: awinter@uga.edu

+


+

Zane Billings (he/him)

+

PhD Candidate, Department of Epidemiology and Biostatistics

+

Email: Wesley.Billings@uga.edu

Introductions

@@ -399,7 +379,7 @@

What is R?

  • R is both open source and open development

  • -R logo
    +R logo

    What is R?

      @@ -414,7 +394,7 @@

      What is R?

      What is R?

      • Program: R is a clear and accessible programming tool
      • -
      • Transform: R is made up of a collection of libraries designed specifically for data science
      • +
      • Transform: R is made up of a collection of packages/libraries designed specifically for statistical computing
      • Discover: Investigate the data, refine your hypotheses, and analyze them
      • Model: R provides a wide array of tools to capture the right model for your data
      • Communicate: Integrate codes, graphs, and outputs to a report with R Markdown or build Shiny apps to share with the world
      • @@ -439,26 +419,26 @@

        Why not R?

    -

    Is R DIfficult?

    +

    Is R Difficult?

      -
    • Short answer – It has a steep learning curve.
    • -
    • Years ago, R was a difficult language to master. The language was confusing and not as structured as the other programming tools.
    • +
    • Short answer – It has a steep learning curve, like all programming languages
    • +
    • Years ago, R was a difficult language to master.
    • Hadley Wickham developed a collection of packages called tidyverse. Data manipulation became trivial and intuitive. Creating a graph was not so difficult anymore.
    -
    +

    Overall Workshop Objectives

    By the end of this workshop, you should be able to

    1. start a new project, read in data, and conduct basic data manipulation, analysis, and visualization
    2. know how to use and find packages/functions that we did not specifically learn in class
    3. -
    4. troubleshoot errors (xxzane? – not included right now)
    5. +
    6. troubleshoot errors

    This workshop differs from “Introduction to Tidyverse”

    We will focus this class on using Base R functions and packages, i.e., those pre-installed into R and the basis for most other functions and packages! If you know Base R, then you will be better equipped to use all the other useful/pretty packages that exist.

    -

    the Tidyverse is one set of useful/pretty packages, designed to can make your code more intuitive as compared to the original older Base R. Tidyverse advantages:

    +

    The Tidyverse is one useful/pretty set of packages, designed to make your code more intuitive compared to the original, older Base R. Tidyverse advantages:

    • consistent structure - making it easier to learn how to use different packages
    • particularly good for wrangling (manipulating, cleaning, joining) data
      @@ -466,22 +446,27 @@

      This workshop differs from “Introduction to Tidyverse”

    • more flexible for visualizing data
    -Tidyverse hex sticker
    +Tidyverse hex sticker

    Workshop Overview

    -

    14 lecture blocks that will each: - Start with learning objectives - End with summary slides - Include mini-exercise(s) or a full exercise

    -

    Themes that will show up throughout the workshop: - Reproducibility - Good coding techniques - Thinking algorithmically - Basic terms / R jargon

    +

    14 lecture blocks that will each:

    +
      +
    • Start with learning objectives
    • +
    • End with summary slides
    • +
    • Include mini-exercise(s) or a full exercise
    • +
    +

    Themes that will show up throughout the workshop:

    +

    Reproducibility

    xxzane slides

    -
    -

    Good coding techniques

    -
    -
    -

    Thinking algorithmically

    -

    Useful (+ Free) Resources

    Want more?

    @@ -515,10 +500,8 @@

    Installing R

  • Install RStudio
  • -
    @@ -547,6 +530,7 @@

    Installing R

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -801,7 +785,18 @@

    Installing R

    } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -833,50 +828,11 @@

    Installing R

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iInstalling R interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } config['offset'] 
= [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -910,11 +857,7 @@

    Installing R

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module01-Intro.html b/docs/modules/Module01-Intro.html index a580ca7..ccc4114 100644 --- a/docs/modules/Module01-Intro.html +++ b/docs/modules/Module01-Intro.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 1: Introduction to RStudio and R Basics + SISMID Module NUMBER Materials (2025) - Module 1: Introduction to RStudio and R Basics @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; font-weight: bold; } /* 
Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,58 +407,17 @@

    Module 1: Introduction to RStudio and R Basics

    -
    -

    Learning Objectives

    After module 1, you should be able to…

    • Create and save an R script
    • -
    • Describe the utility and differences b/w the console and an R script
    • -
    • Modify R Studio windows
    • +
    • Describe the utility and differences b/w the Console and the Source panes
    • +
    • Modify R Studio panes
    • Create objects
    • Describe the difference b/w character, numeric, list, and matrix objects
    • -
    • Reference objects in the RStudio Global Environment
    • +
    • Reference objects in the RStudio Environment pane
    • Use basic arithmetic operators in R
    • Use comments within an R script to create header, sections, and make notes
    @@ -472,11 +430,11 @@

    Working with R – RStudio

  • Makes things easier
  • Is NOT a dropdown statistical tool (such as Stata)
  • -RStudio logo
    +RStudio logo

    RStudio

    Easier working with R

    @@ -495,20 +453,23 @@

    RStudio

    RStudio

    -RStudio
    +RStudio

    Getting the editor

    -
    +

    Working with R in RStudio - 2 major panes:

      -
    1. The Source/Editor: “Analysis” Script + Interactive Exploration +
    2. The Source/Editor: xxamy
    3. +
      +
    • “Analysis” Script
    • Static copy of what you did (reproducibility)
    • Top by default
    • -
    -
  • The R Console: “interprets” whatever you type + +
      +
    1. The R Console: “interprets” whatever you type:

      • Calculator
      • Try things out interactively, then add to your editor
      • @@ -533,18 +494,18 @@

        R Console

      • Code is not saved
      -
  • +

    RStudio

    Useful RStudio “cheat sheet”: https://github.com/rstudio/cheatsheets/blob/main/rstudio-ide.pdf

    -RStudio
    +RStudio

    RStudio Layout

    If RStudio doesn’t look the way you want (or like our RStudio), then do:

    -

    RStudio –> View –> Panes –> Pane Layout

    +

    In R Studio Menu Bar go to View Menu –> Panes –> Pane Layout

    -
    +

    Workspace/Environment

      @@ -557,7 +518,7 @@

      Workspace/Environment

      Workspace/History

      • Shows previous commands. Good to look at for debugging, but don’t rely on it.
      • -
      • Also type the “up” key in the Console to scroll through previous commands
      • +
      • You can also press the “up” and “down” keys in the Console to scroll through previous commands
    @@ -573,35 +534,36 @@

    Workspace/Other Panes

    Getting Started

      -
    • File –> New File –> R Script
    • +
    • In R Studio Menu Bar go to File Menu –> New File –> R Script
    • Save the blank R script as Module1.R

    Explaining output on slides

    -

    In slides, a command (we’ll also call them code or a code chunk) will look like this

    +

    In slides, the R command/code will be in a box, and then directly after it, will be the output of the code starting with [1]

    -
    print("I'm code")
    +
    print("I'm code")
    [1] "I'm code"
    -

    And then directly after it, will be the output of the code.
    -So print("I'm code") is the code chunk and [1] "I'm code" is the output.

    +

    So print("I'm code") is the command and [1] "I'm code" is the output.

    +


    +

    Commands/code and output written as inline text will be typewriter blue font. For example code

    R as a calculator

    You can do basic arithmetic in R, which I surprisingly use all the time.

    -
    2 + 2
    +
    2 + 2
    [1] 4
    -
    2 * 4
    +
    2 * 4
    [1] 8
    -
    2^3
    +
    2^3
    [1] 8
    @@ -611,18 +573,18 @@

    R as a calculator

    R as a calculator

    • The R console is a full calculator
    • -
    • Try to play around with it: +
    • Arithmetic operators:
        -
      • +, -, /, * are add, subtract, divide and multiply
      • -
      • ^ or ** is power
      • -
      • parentheses – ( and ) – work with order of operations
      • -
      • %% finds the remainder
      • +
      • +, -, /, * are add, subtract, divide and multiply
      • +
      • ^ or ** is power
      • +
      • parentheses – ( and ) – work with order of operations
      • +
      • %% finds the remainder
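
A few lines you could run in the Console to try these operators. The numbers are arbitrary.

```r
2 + 3 * 4     # multiplication happens before addition: 14
(2 + 3) * 4   # parentheses change the order: 20
2^3           # power: 8
2 ** 3        # same as 2^3: 8
17 %% 5       # remainder after dividing 17 by 5: 2
10 / 4        # division: 2.5
```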
    -
    +

    Execute / Run Code

    -

    To execute or run a line of code, you just put your cursor on line of code and then:

    +

    To execute or run a line of code (i.e., command), you just put your cursor on the command and then:

    1. Press Run (which you will find at the top of your window)
    @@ -630,13 +592,13 @@

    Execute / Run Code

    1. Press Cmd + Return (Mac) OR Ctrl + Enter (Windows).
    -

    To execute or run multiple lines of code, you just need to highlight the code you want to run and then follow option 1 or 2.

    +

    To execute or run multiple lines of code, you need to highlight the code you want to run and then follow option 1 or 2.

    Mini exercise

    Execute 5+4 from your .R file, and then find the answer 9 in the Console.

    -
    +

    Commenting in Scripts

    The syntax # creates a comment, which means anything to the right of # will not be executed / run

    Commenting is useful to:

    @@ -650,49 +612,46 @@

    Commenting in Scripts

    Commenting an R Script header

    Add a comment header to Module1.R. This is the one I typically use, but you may have your own preference. The goal is that you are consistent so that future you / collaborators can make sense of your code.

    -
    ### Title: Module 1
    -### Author: Amy Winter 
    -### Objective: Mini Exercise - Developing first R Script
    -### Date: 15 July 2024
    +
    ### Title: Module 1
    +### Author: Amy Winter 
    +### Objective: Mini Exercise - Developing first R Script
    +### Date: 15 July 2024

    Commenting to create sections

    -

    You can also create sections within your code by ending a comment with 4 hash marks. This is very useful for creating an outline of your R Script. The “Outline” can be found in the top right of the your source window.

    +

    You can also create sections within your code by ending a comment with 4 hash marks. This is very useful for creating an outline of your R Script. The “Outline” can be found in the top right of your Source pane.

    -
    # Section 1 Header ####
    -## Section 2 Sub-header ####
    -### Section 3 Sub-sub-header ####
    -#### Section 4 Sub-sub-sub-header ####
    +
    # Section 1 Header ####
    +## Section 2 Sub-header ####
    +### Section 3 Sub-sub-header ####
    +#### Section 4 Sub-sub-sub-header ####

    Commenting to explain code

    -
    ## this # is still a comment
    -### you can use many #'s as you want
    -
    -# sometimes you have a really long comment,
    -#    like explaining what you are doing
    -#    for a step in analysis. 
    -# Take it to another line
    +
    ## this # is still a comment
    +### you can use as many #'s as you want
    +
    +# sometimes you have a really long comment,
    +#    like explaining what you are doing
    +#    for a step in analysis. 
    +# Take it to another line
    -
    -
    -

    Commenting to explain code

    I tend to use:

      -
    • One hash tag with a space to describe what is happening in the following few lines of code
    • -
    • One hastag with no space after a command to list specifics
    • +
    • One hash mark with a space to describe what is happening in the following few lines of code
    • +
    • One hash mark with no space after a command to list specifics
    -
    # Practicing my arithmetic
    -5+2
    -3*5
    -9/8
    -
    -5+2 #5 plus 2 
    +
    # Practicing my arithmetic
    +5+2
    +3*5
    +9/8
    +
    +5+2 #5 plus 2 
    @@ -713,15 +672,15 @@

    Objects

    • You can create objects from within the R environment and from files on your computer
    • R uses <- to assign values to an object name
    • -
    • Note: Object names are case-sensitive, i.e. X and x are different
    • +
    • Note: Object names are case-sensitive, i.e. X and x are different
    • Here are examples of creating five different objects:
    -
    number.object <- 3
    -character.object <- "blue"
    -vector.object1 <- c(2,3,4,5)
    -vector.object2 <- c("blue", "red", "yellow")
    -matrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)
    +
    number.object <- 3
    +character.object <- "blue"
    +vector.object1 <- c(2,3,4,5)
    +vector.object2 <- c("blue", "red", "yellow")
    +matrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)

    Note, c() and matrix() are functions, which we will talk more about in module 2.
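
To see how these object types differ, you can ask R about them. This sketch uses a few base R functions that have not been introduced yet (`class()`, `length()`, `is.matrix()`, `dim()`).

```r
number.object <- 3
character.object <- "blue"
vector.object1 <- c(2, 3, 4, 5)
matrix.object <- matrix(data = vector.object1, nrow = 2, ncol = 2, byrow = TRUE)

class(number.object)      # "numeric"
class(character.object)   # "character"
length(vector.object1)    # 4 elements
is.matrix(matrix.object)  # TRUE
dim(matrix.object)        # 2 rows, 2 columns
```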

    @@ -729,26 +688,26 @@

    Objects

    Mini Exercise

    Try creating one or two of these objects in your R script

    -
    number.object <- 3
    -character.object <- "blue"
    -vector.object1 <- c(2,3,4,5)
    -vector.object2 <- c("blue", "red", "yellow")
    -matrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)
    +
    number.object <- 3
    +character.object <- "blue"
    +vector.object1 <- c(2,3,4,5)
    +vector.object2 <- c("blue", "red", "yellow")
    +matrix.object <- matrix(data=vector.object1, nrow=2, ncol=2, byrow=TRUE)

    Objects

    Note, you can find these objects now in the Global Environment.

    -

    Also, you can call them anytime (i.e, see them in the Console) by executing (running) the object. For example,

    +

    Also, you can print them anytime (i.e, see them in the Console) by executing (running) the object. For example,

    -
    character.object
    +
    character.object
    [1] "blue"
    -
    matrix.object
    +
    matrix.object
         [,1] [,2]
     [1,]    2    3
    @@ -756,16 +715,18 @@ 

    Objects

    -
    -

    Assignment - Good coding

    -

    = and <- can both be used for assignment, but <- is better coding practice, because == is a logical operator. We will talk about this more, later.

    +
    +
    +

    Object names and assignment - Good coding

    +

    xxzane

    +

    = and <- can both be used for assignment, but <- is better coding practice, because = does not work for assignment in every context and <- keeps assignment visually distinct from the logical operator ==. We will talk about this more, later.
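
One situation where the two differ is inside a function call. The sketch below is an illustration we added, and the object names are arbitrary.

```r
# Inside a function call, = names an argument; it does not create an object
mean(x = c(2, 4, 6))
exists("x", inherits = FALSE)  # FALSE in a fresh session: no object x was made

# <- inside a call both creates the object and passes it to the function
mean(y <- c(2, 4, 6))
y

# == is a comparison, not an assignment: it returns TRUE / FALSE
y == 4
```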

    Lists

    A list is a special data class that can hold vectors, strings, matrices, models, and even other lists.

    -
    list.object <- list(number.object, vector.object2, matrix.object)
    -list.object
    +
    list.object <- list(number.object, vector.object2, matrix.object)
    +list.object
    [[1]]
     [1] 3
    @@ -804,33 +765,31 @@ 

    Summary

  • The Editor is for static code like R Scripts
  • The Console is for testing code that can’t be saved
  • Commenting is your new best friend
  • -
  • In R we create objects that can be viewed in the Environment panel and called anytime
  • +
  • In R we create objects that can be viewed in the Environment pane and used anytime
  • An object is something that can be worked with in R
  • Use <- syntax to create objects
  • -
    +

    Mini Exercise

    1. Create a new number object and name it my.object
    2. Create a vector of 4 numbers and name it my.vector using the c() function
    3. -
    4. Add my.object and my.vector together use arithmatic operator
    5. +
    6. Add my.object and my.vector together using an arithmetic operator
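
One possible solution sketch, in case you get stuck. The values are arbitrary choices, and yours will differ.

```r
# 1. A number object
my.object <- 10

# 2. A vector of 4 numbers
my.vector <- c(1, 2, 3, 4)

# 3. Adding them: R adds the single number to every element of the vector
my.sum <- my.object + my.vector
my.sum   # 11 12 13 14
```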

    Acknowledgements

    -

    These are the materials I looked through, modified, or extracted to complete this module’s lecture.

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    -
    -
    -
    +
    @@ -857,6 +816,7 @@

    Acknowledgements

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -1111,7 +1071,18 @@

    Acknowledgements

    } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -1143,50 +1114,11 @@

    Acknowledgements

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iAcknowledgements interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } 
config['offset'] = [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -1220,11 +1143,7 @@

    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module02-Functions.html b/docs/modules/Module02-Functions.html index cde85c0..cc0cd42 100644 --- a/docs/modules/Module02-Functions.html +++ b/docs/modules/Module02-Functions.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 2: Functions + SISMID Module NUMBER Materials (2025) - Module 2: Functions @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; font-weight: bold; } /* Keyword */ + code span.kw { color: 
#003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,31 +407,6 @@

    Module 2: Functions

    -
    -

    Learning Objectives

    @@ -446,10 +420,10 @@

    Learning Objectives

    Function - Basic term

    -

    Function - Functions are “self contained” modules of code that accomplish specific tasks. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.

    +

    Function - Functions are “self contained” modules of code that accomplish specific tasks. Functions usually take in some sort of object (e.g., vector, list), process it, and return a result. You can write your own, use functions that come directly from installing R (i.e., Base R functions), or use functions from external packages.

    A function might help you add numbers together, create a plot, or organize your data. In fact, we have already used three functions in Module 1: c(), matrix(), and list(). Here is another one, sum()

    -
    sum(1, 20234)
    +
    sum(1, 20234)
    [1] 20235
    @@ -457,19 +431,19 @@

    Function - Basic term

    Function

    -

    The general usage for a function is the name of the function followed by parentheses. Within the parentheses are arguments.

    +

    The general usage for a function is the name of the function followed by parentheses (i.e., the function signature). Within the parentheses are arguments.

    -
    function_name(argument1, argument2, ...)
    +
    function_name(argument1, argument2, ...)
    -
    +

    Arguments - Basic term

    Arguments are what you pass to the function and can include:

    1. the physical object on which the function carries out a task (e.g., can be data such as a number 1 or 20234)
    -
    sum(1, 20234)
    +
    sum(1, 20234)
    [1] 20235
    @@ -478,15 +452,15 @@

    Arguments - Basic term

  • options that alter the way the function operates (e.g., such as the base argument in the function log())
  • -
    log(10, base = 10)
    +
    log(10, base = 10)
    [1] 1
    -
    log(10, base = 2)
    +
    log(10, base = 2)
    [1] 3.321928
    -
    log(10, base=exp(1))
    +
    log(10, base=exp(1))
    [1] 2.302585
    @@ -504,7 +478,7 @@

    Arguments

    Example

    What is the default in the base argument of the log() function?

    -
    log(10)
    +
    log(10)
    [1] 2.302585
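
One way to check a default without opening the help file is the base R function `args()`, which prints a function's arguments and their default values. A short sketch:

```r
# Show the arguments of log(), including the default for base
args(log)

# Leaving base out should match giving the default explicitly
same_result <- all.equal(log(10), log(10, base = exp(1)))
same_result
```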
    @@ -522,7 +496,7 @@

    Sure that is easy enough, but how do you know

    Seeking help for using functions

    The best way of finding out this information is to use the ? followed by the name of the function. Doing this will open up the help manual in the bottom RStudio Help panel. It provides a description of the function, usage, arguments, details, and examples. Let's look at the help file for the function log()

    -
    ?log
    +
    ?log
    Registered S3 method overwritten by 'printr':
       method                from     
    @@ -611,7 +585,7 @@ 

    Seeking help for using functions

    cbind(deparse.level=2, # to get nice column names
          x, log(1+x), log1p(x), exp(x)-1, expm1(x))
    -
    +

    How to specify arguments

    1. Arguments are separated with a comma
    2. @@ -619,19 +593,19 @@

      How to specify arguments

    -
    log(10, 2)
    +
    log(10, 2)
    [1] 3.321928
    -
    log(base=2, x=10)
    +
    log(base=2, x=10)
    [1] 3.321928
    -
    log(x=10, 2)
    +
    log(x=10, 2)
    [1] 3.321928
    -
    log(10, base=2)
    +
    log(10, base=2)
    [1] 3.321928
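The four calls above can be compared directly. This is a small sketch with illustrative variable names: named arguments can go in any order, while unnamed arguments are matched by position, so swapping them changes the answer.

```r
by_position     <- log(10, 2)            # x = 10, base = 2 (matched by position)
by_name_swapped <- log(base = 2, x = 10) # names given, so order does not matter
mixed           <- log(10, base = 2)     # first by position, second by name

all_match <- isTRUE(all.equal(by_position, by_name_swapped)) &&
  isTRUE(all.equal(by_position, mixed))
print(all_match)

# Swapping unnamed arguments computes log of 2 in base 10 instead.
swapped_position <- log(2, 10)
print(swapped_position)
```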
    @@ -639,7 +613,7 @@

    How to specify arguments

    Package - Basic term

    -

    When you download R, it has a “base” set of functions, that are associated with a “base” set of packages including: ‘base’, ‘datasets’, ‘graphics’, ‘grDevices’, ‘methods’, ‘stats’, ‘methods’ (typically just referred to as Base R).

    +

    When you download R, it has a “base” set of functions, that are associated with a “base” set of packages including: ‘base’, ‘datasets’, ‘graphics’, ‘grDevices’, ‘methods’, ‘stats’ (typically just referred to as Base R).

    • e.g., the log() function comes from the ‘base’ package
    @@ -648,12 +622,12 @@

    Package - Basic term

    Packages

    -

    The Packages window in RStudio can help you identify what have been installed (listed), and which one have been called (check mark).

    +

    The Packages pane in RStudio can help you identify which packages have been installed (listed), and which ones have been attached (check mark).

    Let's go look at the Packages window, find the base package and find the log() function. It automatically loads the help file that we looked at earlier using ?log.

    Additional Packages

    -

    You can install additional packages for your uses from CRAN or GitHub. These additional packages are written by RStudio or R users/developers (like us)

    +

    You can install additional packages for your use from CRAN or GitHub. These additional packages are written by RStudio or R users/developers (like us)

    • Not all packages available on CRAN or GitHub are trustworthy
    • RStudio (the company) makes a lot of great packages
    • @@ -661,28 +635,28 @@

      Additional Packages

    • How to trust an R package
    -
    -

    Installing and calling packages

    -

    To use the bundle or “package” of code (and or possibly data) from a package, you need to install and also call the package.

    +
    +

    Installing and attaching packages

    +

    To use the bundle or “package” of code (and possibly data) from a package, you need to install and also attach the package.

    To install a package you can

      -
    1. go to Tools —> Install Packages in the RStudio header
    2. +
    3. go to the Tools menu in the RStudio menu bar —> Install Packages

    OR

    1. use the following code:
    -
    install.packages(package_name)
    +
    install.packages("package_name")
    -
    -

    Installing and calling packages

    -

    To call (i.e., be able to use the package) you can use the following code:

    +
    +

    Installing and attaching packages

    +

    To attach (i.e., be able to use the package) you can use the following code:

    -
    library(package_name)
    +
    require(package_name) #library(package_name) also works
    -

    More on installing and calling packages later…

    +

    More on installing and attaching packages later…
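A minimal sketch of the install-once, attach-every-session pattern. It uses the "tools" package only because it ships with R and so needs no download; treat that package name as a stand-in for whichever package you actually want.

```r
pkg <- "tools"  # stand-in package name; ships with every R installation

# Install only if the package is not already available (needs internet access).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Attach the package; character.only = TRUE because the name is stored in a variable.
library(pkg, character.only = TRUE)

# An attached package shows up on the search path as "package:<name>".
attached <- paste0("package:", pkg) %in% search()
print(attached)
```

Checking with requireNamespace() first avoids reinstalling a package each time the script is run.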

    Mini Exercise

    @@ -690,9 +664,9 @@

    Mini Exercise

    Functions from Module 1

    -

    The combine function c() collects/combines/joins single R objects into a vector of R objects. It is mostly used for creating vectors of numbers, character strings, and other data types.

    +

    The combine function c() concatenates/collects/combines single R objects into a vector of R objects. It is mostly used for creating vectors of numbers, character strings, and other data types.

    -
    ?c
    +
    ?c
    @@ -798,12 +772,133 @@

    Functions from Module 1

    Functions from Module 1

    The matrix() function creates a matrix from the given set of values.

    -
    ?matrix
    +
    ?matrix
    -

    xxamy - doesn’t seem to work - may need to paste in a screen shot figure

    -
    No documentation for 'matix' in specified packages and libraries
    +
    Matrices
    +
    +Description:
    +
    +     'matrix' creates a matrix from the given set of values.
    +
    +     'as.matrix' attempts to turn its argument into a matrix.
    +
    +     'is.matrix' tests if its argument is a (strict) matrix.
    +
    +Usage:
    +
    +     matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
    +            dimnames = NULL)
    +     
    +     as.matrix(x, ...)
    +     ## S3 method for class 'data.frame'
    +     as.matrix(x, rownames.force = NA, ...)
    +     
    +     is.matrix(x)
    +     
    +Arguments:
    +
    +    data: an optional data vector (including a list or 'expression'
    +          vector).  Non-atomic classed R objects are coerced by
    +          'as.vector' and all attributes discarded.
    +
    +    nrow: the desired number of rows.
    +
    +    ncol: the desired number of columns.
    +
    +   byrow: logical. If 'FALSE' (the default) the matrix is filled by
    +          columns, otherwise the matrix is filled by rows.
    +
    +dimnames: A 'dimnames' attribute for the matrix: 'NULL' or a 'list' of
    +          length 2 giving the row and column names respectively.  An
    +          empty list is treated as 'NULL', and a list of length one as
    +          row names.  The list can be named, and the list names will be
    +          used as names for the dimensions.
    +
    +       x: an R object.
    +
    +     ...: additional arguments to be passed to or from methods.
    +
    +rownames.force: logical indicating if the resulting matrix should have
    +          character (rather than 'NULL') 'rownames'.  The default,
    +          'NA', uses 'NULL' rownames if the data frame has 'automatic'
    +          row.names or for a zero-row data frame.
    +
    +Details:
    +
    +     If one of 'nrow' or 'ncol' is not given, an attempt is made to
    +     infer it from the length of 'data' and the other parameter.  If
    +     neither is given, a one-column matrix is returned.
    +
    +     If there are too few elements in 'data' to fill the matrix, then
    +     the elements in 'data' are recycled.  If 'data' has length zero,
    +     'NA' of an appropriate type is used for atomic vectors ('0' for
    +     raw vectors) and 'NULL' for lists.
    +
    +     'is.matrix' returns 'TRUE' if 'x' is a vector and has a '"dim"'
    +     attribute of length 2 and 'FALSE' otherwise.  Note that a
    +     'data.frame' is *not* a matrix by this test.  The function is
    +     generic: you can write methods to handle specific classes of
    +     objects, see InternalMethods.
    +
    +     'as.matrix' is a generic function.  The method for data frames
    +     will return a character matrix if there is only atomic columns and
    +     any non-(numeric/logical/complex) column, applying 'as.vector' to
    +     factors and 'format' to other non-character columns.  Otherwise,
    +     the usual coercion hierarchy (logical < integer < double <
    +     complex) will be used, e.g., all-logical data frames will be
    +     coerced to a logical matrix, mixed logical-integer will give a
    +     integer matrix, etc.
    +
    +     The default method for 'as.matrix' calls 'as.vector(x)', and hence
    +     e.g. coerces factors to character vectors.
    +
    +     When coercing a vector, it produces a one-column matrix, and
    +     promotes the names (if any) of the vector to the rownames of the
    +     matrix.
    +
    +     'is.matrix' is a primitive function.
    +
    +     The 'print' method for a matrix gives a rectangular layout with
    +     dimnames or indices.  For a list matrix, the entries of length not
    +     one are printed in the form 'integer,7' indicating the type and
    +     length.
    +
    +Note:
    +
    +     If you just want to convert a vector to a matrix, something like
    +
    +       dim(x) <- c(nx, ny)
    +       dimnames(x) <- list(row_names, col_names)
    +     
    +     will avoid duplicating 'x' _and_ preserve 'class(x)' which may be
    +     useful, e.g., for 'Date' objects.
    +
    +References:
    +
    +     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
    +     Language_.  Wadsworth & Brooks/Cole.
    +
    +See Also:
    +
    +     'data.matrix', which attempts to convert to a numeric matrix.
    +
    +     A matrix is the special case of a two-dimensional 'array'.
    +     'inherits(m, "array")' is true for a 'matrix' 'm'.
    +
    +Examples:
    +
    +     is.matrix(as.matrix(1:10))
    +     !is.matrix(warpbreaks)  # data.frame, NOT matrix!
    +     warpbreaks[1:10,]
    +     as.matrix(warpbreaks[1:10,])  # using as.matrix.data.frame(.) method
    +     
    +     ## Example of setting row and column names
    +     mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,
    +                    dimnames = list(c("row1", "row2"),
    +                                    c("C.1", "C.2", "C.3")))
    +     mdat
    @@ -813,21 +908,19 @@

    Summary

  • Functions are “self contained” modules of code that accomplish specific tasks.
  • Arguments are what you pass to functions (e.g., objects on which you carry out the task or options for how to carry out the task)
  • Arguments may include defaults that the author of the function specified as being “good enough in standard cases”, but that can be changed.
  • -
  • An R Package is a bundle or “package” of code (and or possibly data) that can be used by installing it once and calling it (using library()) each time R/Rstudio is opened
  • +
  • An R Package is a bundle or “package” of code (and possibly data) that can be used by installing it once and attaching it (using library()) each time R/RStudio is opened
  • The Help window in RStudio is useful for getting more information about functions and packages
  • Acknowledgements

    -

    These are the materials I looked through, modified, or extracted to complete this module’s lecture.

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    -
    @@ -856,6 +949,7 @@

    Acknowledgements

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -1110,7 +1204,18 @@

    Acknowledgements

    } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -1142,50 +1247,11 @@

    Acknowledgements

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iAcknowledgements interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } 
config['offset'] = [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -1219,11 +1276,7 @@

    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module03-WorkingDirectories.html b/docs/modules/Module03-WorkingDirectories.html index 617e48f..ae3ee6b 100644 --- a/docs/modules/Module03-WorkingDirectories.html +++ b/docs/modules/Module03-WorkingDirectories.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 3: Working Directories + SISMID Module NUMBER Materials (2025) - Module 3: Working Directories @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; 
font-weight: bold; } /* Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,29 +407,12 @@

    Module 3: Working Directories

    -
    -

    Learning Objectives

    After module 3, you should be able to…

      -
    • Understand your own systems file structure and the purpose of the working directory
    • +
    • Understand your own system’s file structure and the purpose of the working directory
    • Determine the working directory
    • Change the working directory
    @@ -443,16 +425,16 @@

    File Structure

    Working Directory – Basic term

    • R “looks” for files on your computer relative to the “working” directory
    • -
    • For example, if you want to load data into R or save a figure, you will need to tell R where/store the file
    • +
    • For example, if you want to load data into R or save a figure, you will need to tell R where to look for or store the file
    • Many people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in

    Getting and setting the working directory using code

    -
    ## get the working directory
    -getwd()
    -setwd("~/") 
    +
    ## get the working directory
    +getwd()
    +setwd("~/") 
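A short sketch of a round trip with these two functions, assuming you want to leave the session where it started. tempdir() is used only because it is a folder that is guaranteed to exist; the variable names are illustrative.

```r
original_wd <- getwd()   # remember where we started

setwd(tempdir())         # move to a folder that is guaranteed to exist
moved_wd <- getwd()
print(moved_wd)

setwd(original_wd)       # go back to where we started
restored <- identical(getwd(), original_wd)
print(restored)
```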
    @@ -488,13 +470,13 @@

    Setting the working directory using your cursor

    Remember above “Many people recommend not setting a directory in the scripts, rather assume you’re in the directory the script is in.” To do so, go to Session –> Set Working Directory –> To Source File Location

    RStudio will show the code in the Console for the action you took with your cursor. This is a good way to learn about your file system and how to set a correct working directory!

    -
    setwd("~/Dropbox/Git/SISMID-2024")
    +
    setwd("~/Dropbox/Git/SISMID-2024")

    Setting the Working Directory

    -

    If you have not yet saved a “source” file, it will set working directory to the default location. See RStudio -> Preferences -> General for default location.

    -

    To change the working directory to another location, go to Session –> Set Working Directory –> Choose Directory`

    +

    If you have not yet saved a “source” file, it will set the working directory to the default location. Find the Tools menu in the menu bar -> Global Options -> General for the default location.

    +

    To change the working directory to another location, find the Session menu in the menu bar –> Set Working Directory –> Choose Directory

    Again, RStudio will show the code in the Console for the action you took with your cursor.

    @@ -503,7 +485,7 @@

    Summary

  • R “looks” for files on your computer relative to the “working” directory
  • An absolute path points to the same location in a file system regardless of the working directory - it is specific to your system and your system alone
  • A relative path is interpreted relative to the current working directory
  • -
  • Two functions, setwd() and getwd(), are your new best friends.
  • +
  • Two functions, setwd() and getwd(), are useful for identifying and manipulating the working directory.
  • @@ -513,10 +495,8 @@
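The difference between the two kinds of path can be sketched as follows. The folder and file names (path_demo_, notes.txt) are hypothetical and are created in a temporary location just for this illustration.

```r
original_wd <- getwd()

# Hypothetical throwaway folder containing one small file.
demo_dir <- tempfile("path_demo_")
dir.create(demo_dir)
writeLines("hello", file.path(demo_dir, "notes.txt"))

absolute_path <- file.path(demo_dir, "notes.txt")  # full location on this computer
relative_path <- "notes.txt"                       # only meaningful relative to the working directory

setwd(demo_dir)
found_by_relative <- file.exists(relative_path)    # resolved against the working directory
setwd(original_wd)

found_by_absolute <- file.exists(absolute_path)    # works regardless of the working directory
print(c(found_by_relative, found_by_absolute))
```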

    Acknowledgements

  • “Introduction to R for Public Health Researchers” Johns Hopkins University
  • -
    @@ -545,6 +525,7 @@

    Acknowledgements

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -799,7 +780,18 @@


    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module04-RProject.html b/docs/modules/Module04-RProject.html index b580b3d..212660a 100644 --- a/docs/modules/Module04-RProject.html +++ b/docs/modules/Module04-RProject.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 4: R Project + SISMID Module NUMBER Materials (2025) - Module 4: R Project @@ -157,8 +157,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -343,21 +342,6 @@

    Module 4: R Project

    -
    -

    Learning Objectives

    @@ -365,11 +349,11 @@

    Learning Objectives

    • Create an R Project
    • Check you are in the desired R Project
    • -
    • Reference the Files window in RStudio
    • +
    • Reference the Files pane in RStudio
    • Describe “good” R Project organization
    -
    +

    RStudio Project

    RStudio “Project” is one highly recommended strategy to build organized and reproducible code in R.

      @@ -381,8 +365,8 @@

      RStudio Project

      RStudio Project Creation

      Let’s create a new RStudio Project.

      -

      Go to File –> New Project –> New Directory –> New Project

      -

      Call your Project “IntroToR_RProject”

      +

      Find the File Menu in the Menu Bar –> New Project –> New Directory –> New Project

      +

      Name your Project “IntroToR_RProject”

      RStudio Project Organization

      @@ -394,36 +378,38 @@

      RStudio Project Organization

    1. output
    2. figures
    3. -

      We will be working from this directory for the remainder of the Workshop. Take a moment to move any R scripts you have already created to the ‘code’ sub-directories.

      +

      We will be working from this directory for the remainder of the Workshop. Take a moment to move any R scripts you have already created to the ‘code’ sub-directory.
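The sub-directories can also be created from code. This is a sketch that assumes the four folder names used in this workshop are code, data, output, and figures, and it builds them inside a temporary folder (a hypothetical location) rather than in your real project.

```r
# Hypothetical demo location; in practice this would be your R Project folder.
project_dir <- file.path(tempdir(), "IntroToR_RProject_demo")
sub_dirs <- c("code", "data", "output", "figures")

for (d in sub_dirs) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

created <- dir.exists(file.path(project_dir, sub_dirs))
print(created)
```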

      -
      +

      Some things to notice in an R Project

        -
      1. The name of the R Project will be shown at the top of the RStudio application
      2. +
      3. The name of the R Project will be shown at the top of the RStudio Window
      4. If you check the working directory using getwd() you will find the working directory is set to the location where the R Project was saved.
      5. -
      6. The Files window in RStudio is also set to the location where the R Project was saved, making it easy to navigate to sub-directories directly from RStudio.
      7. +
      8. The Files pane in RStudio is also set to the location where the R Project was saved, making it easy to navigate to sub-directories directly from RStudio.

      R Project - Common issues

      If you simply open RStudio, it will not automatically open your R Project. As a result, when you run a function to import data using a relative path based on your working directory, it won’t be able to find the data.

      -

      To open a previously created R Project, you need to open the R Project (i.e., SISMID_IntroToR_RProject.RProj)

      +

      To open a previously created R Project, you need to open the R Project (i.e., double click on SISMID_IntroToR_RProject.RProj)

      Summary

      • R Projects are really helpful for lots of reasons, including to improve the reproducibility of your work
      • Consistently set up your R Project’s sub-directories so that you can easily navigate the project
      • +
      • If you get an error that a file can’t be found, make sure you correctly opened the R Project by looking for the Project name at the top of the RStudio application window.
      -
      +

      Mini Exercise

      1. Close RStudio
      2. -
      3. Reopen you R Project
      4. +
      5. Reopen your R Project
      6. Check that you are actually in the R Project
      7. Create a new R script and save it in your ‘code’ subdirectory
      8. -
      9. Create a vector of numbers and then get a summary statistics of that vector (e.g., sum, mean, median)
      10. +
      11. Create a vector of numbers
      12. +
      13. Create a vector of character values
      14. Add comment(s) to your R script to explain your code.
      @@ -431,10 +417,8 @@

      Mini Exercise

      Acknowledgements

      These are the materials we looked through, modified, or extracted to complete this module’s lecture.

      -
      @@ -463,6 +447,7 @@

      Acknowledgements

      Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -679,7 +664,18 @@


      Acknowledgements

      try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module05-DataImportExport.html b/docs/modules/Module05-DataImportExport.html index a8eac77..abf6541 100644 --- a/docs/modules/Module05-DataImportExport.html +++ b/docs/modules/Module05-DataImportExport.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 5: Data Import and Export + SISMID Module NUMBER Materials (2025) - Module 5: Data Import and Export @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; 
font-weight: bold; } /* Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,55 +407,22 @@

      Module 5: Data Import and Export

      -
      -

      Learning Objectives

      After module 5, you should be able to…

      • Use Base R functions to load data
      • -
      • Install and call external R Packages to extend R’s functionality
      • -
      • Install any type of data into R
      • -
      • Find loaded data in the Global Environment window of RStudio
      • +
      • Install and attach external R Packages to extend R’s functionality
      • +
      • Load any type of data into R
      • +
      • Find loaded data in the Environment pane of RStudio
      • Reading and writing R .Rds and .Rda/.RData files

      Import (read) Data

        -
      • Importing or ‘Reading in’ data is the first step of any real project/analysis
      • +
      Importing or ‘Reading in’ data is the first step of any real project / data analysis
      • R can read almost any file format, especially with external, non-Base R, packages
      • We are going to focus on simple delimited files first.
          @@ -466,19 +432,20 @@

          Import (read) Data

        A delimited file is a sequential file with column delimiters. Each delimited file is a stream of records, which consists of fields that are ordered by column. Each record contains fields for one row. Within each row, individual fields are separated by column delimiters (IBM.com definition)

      -
      +

      Mini exercise

      1. Download Module 5 data from the website and save the data to your data subdirectory – specifically SISMID_IntroToR_RProject/data

      2. -
      3. Open the data files in a text editor application and familiarize you self with the data.

      4. -
      5. Determine the delminiter of the two ‘.txt’ files

      6. +
      Open the ‘.csv’ and ‘.txt’ data files in a text editor application (e.g., Notepad for Windows or TextEdit for Mac) and familiarize yourself with the data

      8. +
      Open the ‘.xlsx’ data file in Excel and familiarize yourself with the data. If you use a Mac, do not open it in Numbers, as it can corrupt the file. If you do not have Excel, you can upload it to Google Sheets

      10. +
      11. Determine the delimiter of the two ‘.txt’ files

      Import delimited data

      Within the Base R ‘utils’ package we can find a handful of useful functions, including read.csv() and read.delim(), for importing data.

      -
      ?read.csv
      +
      ?read.csv
      @@ -838,55 +805,54 @@

      Import delimited data

      Import .csv files

      -

      Reminder

      +

      Function signature reminder

      read.csv(file, header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "", ...)
      -

      file is the first argument and is the path to your file, in quotes

      -
      -       can be path in your local computer -- absolute file path or relative file path 
      --       can be path to a file on a website
      +
          -       `file` is the first argument and is the path to your file, in quotes 
      +    
      +            -       can be path in your local computer -- absolute file path or relative file path 
      +            -       can be path to a file on a website
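To make the difference between a relative and an absolute path concrete, here is a minimal sketch. It only builds and inspects paths, so it runs even if the file is missing; `data/serodata.csv` is the relative path used in these slides, and what gets printed depends on your own working directory.

```r
# Where is R currently "standing"? Relative paths start from here.
getwd()

# Build a relative path in a platform-independent way
rel_path <- file.path("data", "serodata.csv")
rel_path

# Check whether the file can be found from the current working directory
file.exists(rel_path)

# Show the absolute path that the relative path expands to
# (mustWork = FALSE avoids an error if the file is not there)
normalizePath(rel_path, mustWork = FALSE)
```

If `file.exists()` returns `FALSE`, the usual cause is that the R Project is not open, so the working directory is not the project folder.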
      -

      Mini Exercise

      +

      Mini exercise

      If your R Project is not already open, open it now. The project sets a useful working directory for us, which we will take advantage of when importing data.

      Import .csv files

      Let’s import a new data file

      -
      ## Examples
      -df <- read.csv(file = "data/serodata.csv") #relative path
      -df <- read.csv(file = "~/Dropbox/Git/SISMID-2024/modules/data/serodata.csv") #absolute path starting from my home directory
      +
      ## Examples
      +df <- read.csv(file = "data/serodata.csv") #relative path

      Note #1, I assigned the data frame to an object called df. I could have called the data anything, but in order to use the data (i.e., as an object we can find in the Environment), I need to assign it as an object.

      -

      Note #2, Look to the Environment window, you will see the df object ready to be used.

      +

      Note #2, look at the Environment pane; you will see the df object ready to be used.

      Import .txt files

      read.csv() is a special case of read.delim() – a general function to read a delimited file into a data frame

      +

      Reminder function signature

      read.delim(file, header = TRUE, sep = "\t", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "", ...)
      -
        -
      • file is the path to your file, in quotes
      • -
      • delim is what separates the fields within a record. The default for csv is comma
      • -
      +
          - `file` is the path to your file, in quotes 
      +    - `sep` is what separates the fields within a record. The default for read.delim() is a tab ("\t"); for read.csv() it is a comma
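As a small sketch of why the separator matters, the code below writes a tiny semicolon-delimited file to a temporary location and reads it back twice. The `toy` data frame, its column names and the temporary file are made up for illustration and are not part of the workshop data.

```r
# Hypothetical toy data, not the workshop dataset
toy <- data.frame(id = 1:3, age = c(2, 4, 6))

# Write a semicolon-delimited text file to a temporary location
tmp <- tempfile(fileext = ".txt")
write.table(toy, file = tmp, sep = ";", row.names = FALSE, quote = FALSE)

# Reading with the default tab separator lumps every row into one column
wrong <- read.delim(file = tmp)
ncol(wrong)

# Supplying the matching separator recovers both columns
right <- read.delim(file = tmp, sep = ";")
ncol(right)
names(right)
```

A single-column result after import is a common sign that the wrong separator was used.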

      Import .txt files

      -

      Lets first import ‘serodata1.txt’ which uses a tab delminiter and ‘serodata2.txt’ which uses a semicolon delminiter.

      +

      Let’s first import ‘serodata1.txt’, which uses a tab delimiter, and ‘serodata2.txt’, which uses a semicolon delimiter.

      -
      ## Examples
      -df <- read.delim(file = "data/serodata.txt", sep = "\t")
      -df <- read.delim(file = "data/serodata.txt", sep = ";")
      +
      ## Examples
      +df <- read.delim(file = "data/serodata1.txt", sep = "\t") #tab delimited
      +df <- read.delim(file = "data/serodata2.txt", sep = ";") #semicolon delimited
      -

      The data is now successfully read into your R workspace, many times actually. Notice, that each time we imported the data we assigned the data to the df object, meaning we replaced it each time we reassinged the df object.

      +

      The dataset is now successfully read into your R workspace, several times in fact. Notice that each time we imported the data we assigned it to the df object, meaning we replaced df each time we reassigned it.

      -
      +

      What if we have a .xlsx file - what do we do?

      1. Google / Ask ChatGPT
      2. Find and vet function and package you want
      3. Install package
      4. -
      5. Call package
      6. +
      7. Attach package
      8. Use function
      @@ -894,25 +860,13 @@

      What if we have a .xlsx file - what do we do?

      1. Internet Search

      -
      -

      -
      -
      -
      -

      -
      -
      -
      -

      -
      -
      @@ -920,7 +874,7 @@

      1. Internet Search

      2. Find and vet function and package you want

      I am getting a consistent message to use the read_excel() function found in the readxl package. This package was developed by Hadley Wickham, who we know is reputable. Also, you can check that the data were read in correctly, because this is a straightforward task.

      -
      +

      3. Install Package

      To use the bundle or “package” of code (and possibly data) from a package, you need to install and also attach the package.

      To install a package you can

      @@ -932,29 +886,28 @@

      3. Install Package

    4. use the following code:
    -
    install.packages("package_name")
    +
    install.packages("package_name")

    Therefore,

    -
    install.packages("readxl")
    +
    install.packages("readxl")
    -
    -

    4. Call Package

    -

    Reminder – Installing and calling packages

    -

    To call (i.e., be able to use the package) you can use the following code:

    +
    +

    4. Attach Package

    +

    Reminder - To attach (i.e., be able to use the package) you can use the following code:

    -
    library(package_name)
    +
    require(package_name)

    Therefore,

    -
    library(readxl)
    +
    require(readxl)

    5. Use Function

    -
    ?read_excel
    +
    ?read_excel

    Read xls and xlsx files

    Description:

    @@ -1100,7 +1053,7 @@

    5. Use Function

    5. Use Function

    -

    Reminder

    +

    Reminder of function signature

    read_excel(
       path,
       sheet = NULL,
    @@ -1117,12 +1070,11 @@ 

    5. Use Function

    )

    Let’s practice

    -
    df <- read_excel(path = "data/serodata.xlsx", sheet = "Data")
    +
    df <- read_excel(path = "data/serodata.xlsx", sheet = "Data")
    -
    -

    Mini exercise

    -

    Lets make some mistakes

    +
    +

    Let’s make some mistakes

    1. What if we read in the data without assigning it to an object (i.e., read_excel(path = "data/serodata.xlsx", sheet = "Data"))?

    What if we forget to specify the sheet argument (i.e., dd <- read_excel(path = "data/serodata.xlsx"))?

    3. @@ -1130,26 +1082,28 @@

      Mini exercise

    Installing and calling packages - Common confusion

    -

    You only need to install a package once (unless you update R), but you will need to call or load a package each time you want to use it.

    +


    +

    You only need to install a package once (unless you update R or want to update the package), but you will need to attach a package each time you want to use it.

    +


    The exception to this rule is the “base” set of packages (i.e., Base R), which are installed automatically when you install R and are automatically attached whenever you open R or RStudio.
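The sketch below illustrates the difference between a package being installed and being attached. To avoid needing an internet connection, it uses the `splines` package, which ships with R but is not normally attached at startup, as a stand-in for a package such as readxl; that choice is an assumption made for illustration.

```r
# search() lists everything currently attached in this R session
pkg_label <- "package:splines"
attached_before <- pkg_label %in% search()
attached_before

# Attaching makes the package's functions available by name
library(splines)
attached_after <- pkg_label %in% search()
attached_after
```

Restarting R clears the attached packages again, which is why the attach step belongs at the top of every script.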

    Common Error

    -

    Be prepared to see the error

    +

    Be prepared to see this error

    -
    Error: could not find function "some_function"
    +
    Error: could not find function "some_function_name"
    -

    This usually mean that either

    +

    This usually means that either

    • you called the function by the wrong name
    • you have not installed a package that contains the function
    • -
    • you have installed a package but you forgot to call it (i.e., library(package_name)) – most likely
    • +
    • you have installed a package but you forgot to attach it (i.e., require(package_name)) – most likely
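To see what this error looks like without stopping your script, here is a minimal sketch that calls a function that does not exist and captures the message. `some_function_name` is the same placeholder used above, not a real function.

```r
# Call a function R cannot find and capture the error message as text
msg <- tryCatch(
  some_function_name(),
  error = function(e) conditionMessage(e)
)
msg

# exists() is a quick way to check whether R can currently find a name
exists("some_function_name")
exists("read.csv")
```

When you hit this error with a real function, check the spelling first, then check that the package is installed and attached.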

    Export (write) Data

      -
    • Exporting or ‘Writing out’ data allows you to save modified files to future use or sharing
    • +
    • Exporting or ‘Writing out’ data allows you to save modified files for future use or sharing
    • R can write almost any file format, especially with external, non-Base R, packages
    • We are going to focus again on writing delimited files
    @@ -1367,17 +1321,18 @@

    Export delimited data

    Export delimited data

    +

    Let’s practice exporting the data as three files with three different delimiters (comma, tab, semicolon)

    -
    write.csv(df, file="data/serodata_new.csv", row.names = FALSE) #comma delimited
    -write.table(df, file="data/serodata1_new.txt", sep="\t", row.names = FALSE) #tab delimited
    -write.table(df, file="data/serodata2_new.txt", sep=";", row.names = FALSE) #semicolon delimited
    +
    write.csv(df, file="data/serodata_new.csv", row.names = FALSE) #comma delimited
    +write.table(df, file="data/serodata1_new.txt", sep="\t", row.names = FALSE) #tab delimited
    +write.table(df, file="data/serodata2_new.txt", sep=";", row.names = FALSE) #semicolon delimited

    Note, I wrote the data to new file names. Even though we didn’t change the data at all in this module, it is good practice to keep raw data raw, and not to write over it.
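A quick way to convince yourself that an export worked is to read the file straight back in. The sketch below does this with a made-up `toy` data frame and temporary files (both are assumptions for illustration), and also shows what happens when `row.names = FALSE` is left out.

```r
# Hypothetical toy data, not the workshop dataset
toy <- data.frame(id = 1:3, group = c("a", "b", "a"))

# Export without row names, then read the file back in
out_file <- tempfile(fileext = ".csv")
write.csv(toy, file = out_file, row.names = FALSE)
toy_back <- read.csv(file = out_file)
names(toy_back)

# Export with the default row names: an extra first column appears on import
out_file2 <- tempfile(fileext = ".csv")
write.csv(toy, file = out_file2)
toy_back2 <- read.csv(file = out_file2)
names(toy_back2)
```

The extra column in the second import is the reason the examples above set `row.names = FALSE`.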

    R .rds and .rda/RData files

    There are two file extensions worth discussing.

    -

    R has two native data formats—Rdata (sometimes shortened to Rda) and Rds. These formats are used when R objects are saved for later use. Rdata is used to save multiple R objects, while Rds is used to save a single R object.

    +

    R has two native data formats—‘Rdata’ (sometimes shortened to ‘Rda’) and ‘Rds’. These formats are used when R objects are saved for later use. ‘Rdata’ is used to save multiple R objects, while ‘Rds’ is used to save a single R object. ‘Rds’ is fast to write/read and is very small.

    .rds binary file

    @@ -1390,20 +1345,21 @@

    .rds binary file

    .rda/RData files

    The Base R functions save() and load() can be used to save and load multiple R objects.

    -

    save() writes an external representation of R objects to the specified file, and can by loaded back into the environment using load(). A nice feature about using save and load is that the R object is directly imported into the environment and you don’t have to assign it to an object. The files can be saved as .RData or .rda files.

    +

    save() writes an external representation of R objects to the specified file, which can be loaded back into the environment using load(). A nice feature of save() and load() is that the R objects are imported directly into the environment under their original names, so you don’t have to specify a name. The files can be saved as .RData or .Rda files.

    +

    Function signature

    save(object1, object2, file = "filename.RData")
     load("filename.RData")
    -

    Note, that when you read .RData files you don’t need to assign it to an abjecct. It simply reads in the objects as they were saved. Therefore, load("filename.RData") will read in object1 and object2 directly into the Global Environment.

    +

    Note that you separate the objects you want to save with commas.
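The following sketch walks through a full save and load cycle. The object names match the function signature above, while their contents and the temporary file are made up for illustration.

```r
# Two small objects to save (contents are arbitrary examples)
object1 <- c(1, 2, 3)
object2 <- "hello"

# Save both objects into one .RData file in a temporary location
rda_file <- tempfile(fileext = ".RData")
save(object1, object2, file = rda_file)

# Remove them from the environment to mimic a fresh session
rm(object1, object2)

# load() restores the objects under their original names;
# it also returns those names, which we keep here for inspection
loaded_names <- load(rda_file)
loaded_names
object1
```

Because load() restores objects under their saved names, it will silently replace any existing objects that share those names.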

    Summary

      -
    • Importing or ‘Reading in’ data is the first step of any real project/analysis
    • -
    • The Base R ‘util’ package we can find a handful of useful functions including read.csv() and read.delim() to importing/reading data or write.csv() and write.table() for exporti/writing data
    • -
    • When importing data (exception is object from .RData), you must assign it to an object, otherwise it cannot be called/used
    • -
    • Properly read data can be found in the Environment window of RStudio
    • -
    • You only need to install a package once (unless you update R), but you will need to call or load a package each time you want to use it.
    • -
    • To complete a tasek you don’t know how to do (e.g., reading in an excel data file) use the following steps: 1. Google / Ask ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Call package, 5. Use function
    • +
    Importing or ‘Reading in’ data is the first step of any real project / data analysis
    • +
    The Base R ‘utils’ package has useful functions including read.csv() and read.delim() for importing/reading data and write.csv() and write.table() for exporting/writing data
    • +
    • When importing data (exception is object from .RData), you must assign it to an object, otherwise it cannot be used
    • +
    • If data are imported correctly, they can be found in the Environment pane of RStudio
    • +
    • You only need to install a package once (unless you update R or the package), but you will need to attach a package each time you want to use it.
    • +
    To complete a task you don’t know how to do (e.g., reading in an Excel data file) use the following steps: 1. Google / Ask ChatGPT, 2. Find and vet function and package you want, 3. Install package, 4. Attach package, 5. Use function
    @@ -1413,10 +1369,8 @@

    Acknowledgements

  • “Introduction to R for Public Health Researchers” Johns Hopkins University
  • -
    @@ -1445,6 +1399,7 @@

    Acknowledgements

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -1699,7 +1654,18 @@

    Acknowledgements

    } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -1731,50 +1697,11 @@

    Acknowledgements

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iAcknowledgements interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } 
config['offset'] = [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -1808,11 +1726,7 @@

    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module06-DataSubset.html b/docs/modules/Module06-DataSubset.html index eba0b1f..d141d1a 100644 --- a/docs/modules/Module06-DataSubset.html +++ b/docs/modules/Module06-DataSubset.html @@ -483,11 +483,11 @@

    Quick summary of data

    Description of data

    -

    This is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for lectures throughout the Workshop.

    +

    This is data based on a simulated pathogen X IgG antibody serological survey. The rows represent individuals. Variables include IgG concentrations in IU/mL, age in years, gender, and residence based on slum characterization. We will use this dataset for modules throughout the Workshop.

    View the data as a whole dataframe

    -

    The View() function, one of the few Base R functions with a capital letter can be used to open a new tab in the Console and view the data as you would in excel.

    +

    The View() function, one of the few Base R functions with a capital letter, can be used to open a new tab in RStudio and view the data as you would in Excel.

    View(df)
    @@ -495,12 +495,13 @@

    View the data as a whole dataframe

    View the data as a whole dataframe

    -

    You can also open a new tab of the data by clicking on the data icon beside the object in the Environment window.

    +

    You can also open a new tab of the data by clicking on the data icon beside the object in the Environment pane.

    -
    +

    You can also hold down Cmd or CTRL and click on the name of a data frame in your code.

    +

    Indexing

    -

    R contains several constructs which allow access to individual elements or subsets through indexing operations. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing syntax: [ ], [[ ]] and $.

    +

    R contains several operators which allow access to individual elements or subsets through indexing. Indexing can be used both to extract part of an object and to replace parts of an object (or to add parts). There are three basic indexing operators: [, [[ and $.

    x[i] #if x is a vector
     x[i, j] #if x is a matrix/data frame
    @@ -511,7 +512,7 @@ 

    Indexing

    Vectors and multi-dimensional objects

    -

    To index a vector, vector[i] select the ith element. To index a multi-dimensional objects such as a matrix, matrix[i, j] selects the element in row i and column j, where as in a three dimensional array[k, i, i, j] selects the element in matrix k, row i, and column j.

    +

    To index a vector, vector[i] selects the ith element. To index a multi-dimensional object such as a matrix, matrix[i, j] selects the element in row i and column j, whereas for a three-dimensional array, array[i, j, k] selects the element in row i and column j of matrix k.
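As a quick sketch of the three cases, the code below indexes a vector, a matrix and a three-dimensional array. In R the array indices are ordered row, column, then matrix (slice). The object names and values here are made up for illustration and differ from the Module 1 objects.

```r
# Vector: one index
vec <- c(10, 20, 30)
vec[2]          # second element: 20

# Matrix: row index, then column index (filled column by column)
mat <- matrix(1:6, nrow = 2)
mat[2, 3]       # row 2, column 3: 6

# Three-dimensional array: row, column, then matrix (slice)
arr <- array(1:24, dim = c(2, 3, 4))
arr[1, 2, 3]    # row 1, column 2 of the third matrix: 15
arr[, , 3]      # leaving indices blank returns the whole third matrix
```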

    Let’s practice by first creating the same objects as we did in Module 1.

    number.object <- 3
    @@ -533,7 +534,7 @@ 

    Vectors and multi-dimensional objects

    [2,] 4 5
    -

    Finally, let’s use indexing to pull our elements of the objects.

    +

    Finally, let’s use indexing to pull out elements of the objects.

    vector.object1[2] #pulling the second element
    @@ -574,158 +575,37 @@

    List objects

    [2,] 4 5
    -
    -
    -

    $ for indexing

    -

    $ allows only a literal character string or a symbol as the index.

    +

    What happens if we use a single square bracket?

    -
    df$IgG_concentration
    +
    list.object[3]
    -
      [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01
    -  [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00
    - [11] 4.574007e+00 1.583904e+02           NA 1.065068e+02 1.113870e+02
    - [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02
    - [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01
    - [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01
    - [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01
    - [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01
    - [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00
    - [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01
    - [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01           NA
    - [56] 2.104693e+00           NA 3.826715e-01 3.926366e+01 1.129964e+00
    - [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01
    - [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02
    - [71] 1.642979e+02           NA 1.542808e+02 6.033058e-01 2.809917e-01
    - [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01
    - [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02
    - [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01
    - [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00
    - [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01
    -[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01
    -[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02
    -[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01
    -[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02
    -[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01
    -[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01
    -[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02
    -[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02
    -[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00
    -[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02
    -[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02
    -[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02
    -[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01
    -[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02
    -[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02
    -[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01
    -[181] 8.767123e-02           NA 6.794521e-01 5.808219e-01 1.369863e-01
    -[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02
    -[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02
    -[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01
    -[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01
    -[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01
    -[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02
    -[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02
    -[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03
    -[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01
    -[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02
    -[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02
    -[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00
    -[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01
    -[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01
    -[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02
    -[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00
    -[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02
    -[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02
    -[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01
    -[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01
    -[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02
    -[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02
    -[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01
    -[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01
    -[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02
    -[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02
    -[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02
    -[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01
    -[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00
    -[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01
    -[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01
    -[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00
    -[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01
    -[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02
    -[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01
    -[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01
    -[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02
    -[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01
    -[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01
    -[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01
    -[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02
    -[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01
    -[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02
    -[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01
    -[406]           NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01
    -[411] 3.000000e-01 1.786638e+02 3.279570e+02           NA 1.903017e+02
    -[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01
    -[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00
    -[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01
    -[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01
    -[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01
    -[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02
    -[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00
    -[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01
    -[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01
    -[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01
    -[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01
    -[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01
    -[476] 1.458599e+00 1.247678e+02           NA 4.334545e+02 3.000000e-01
    -[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02
    -[486] 5.127389e-01 1.171053e+02           NA 2.547771e-02 1.707430e+02
    -[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01
    -[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01
    -[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02
    -[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02
    -[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01
    -[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02
    -[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02
    -[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02
    -[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01
    -[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02
    -[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02
    -[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02
    -[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01
    -[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02
    -[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02
    -[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01
    -[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01
    -[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01
    -[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02           NA
    -[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01
    -[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01
    -[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02
    -[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02
    -[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01
    -[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02
    -[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01
    -[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01
    -[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01
    -[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02
    -[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01
    -[651] 2.251656e-01
    -
    -
    -

    Note, if you have spaces in your variable name, you will need to use back ticks variable name after the $. This is a good reason to not create variables / column names with spaces.

    +
    [[1]]
    +     [,1] [,2]
    +[1,]    2    3
    +[2,]    4    5
    + + +

    The [[ operator is called the “extract” operator and gives us a single element from the list. The [ operator is called the “subset” operator and gives us a subset of the list, which is itself still a list.
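The difference between the two operators can be sketched as follows. This is a minimal illustration: the values mirror the output printed in the slides, but the unnamed `list.object` and the helper names `extracted` and `subsetted` are assumptions made for this example.

```r
# Values mirror the slide output; list.object, extracted, and subsetted
# are illustrative names assumed for this sketch.
number.object  <- 3
vector.object2 <- c("blue", "red", "yellow")
matrix.object  <- matrix(2:5, nrow = 2, byrow = TRUE)
list.object    <- list(number.object, vector.object2, matrix.object)

extracted <- list.object[[3]]  # "extract": the matrix itself
subsetted <- list.object[3]    # "subset": a list of length 1 holding the matrix

is.matrix(extracted)  # is the extracted element a matrix?
is.list(subsetted)    # is the subset still a list?
```

Because `subsetted` is still a list, getting at the matrix inside it takes one more extraction step, `subsetted[[1]]`.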

    +
    +
    +

    $ for indexing for data frame

    +

    $ allows only a literal character string or a symbol as the index. For a data frame it extracts a variable.

    +
    +
    df$IgG_concentration
    +
    +

    Note, if you have spaces in your variable name, you will need to wrap the name in backticks (`) after the $. This is a good reason not to create variables / column names with spaces.
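As a small sketch of the backtick rule, the data frame `df_spaces` below is hypothetical and exists only to show a column name containing a space; it is not the course dataset.

```r
# Hypothetical data frame with a space in one column name.
df_spaces <- data.frame(
  `IgG concentration` = c(0.3, 1.2, 4.5),
  age = c(2, 4, 4),
  check.names = FALSE  # keep the space instead of converting it to a dot
)

# Backticks are required after $ because the name is not a valid symbol.
igg <- df_spaces$`IgG concentration`

# The same column, extracted with [[ and a quoted string.
igg_alt <- df_spaces[["IgG concentration"]]
```

Without `check.names = FALSE`, `data.frame()` would rename the column to `IgG.concentration`, which is another hint that spaces in names are more trouble than they are worth.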

    $ for indexing with lists

    +

    $ allows only a literal character string or a symbol as the index. For a list it extracts a named element.

    List elements can be named

    -
    list.object.named <- list(
    -  emory = number.object,
    -  uga = vector.object2,
    -  gsu = matrix.object
    -)
    -list.object.named
    +
    list.object.named <- list(
    +  emory = number.object,
    +  uga = vector.object2,
    +  gsu = matrix.object
    +)
    +list.object.named
    $emory
     [1] 3
    @@ -739,13 +619,13 @@ 

    $ for indexing with lists

    [2,] 4 5
    -

    If list elements are named, than you can reference data from list using $ or using double square brackets, [[ ]]

    +

    If list elements are named, then you can reference data from the list using $ or using double square brackets, [[

    -
    list.object.named$uga 
    +
    list.object.named$uga 
    [1] "blue"   "red"    "yellow"
    -
    list.object.named[["uga"]] 
    +
    list.object.named[["uga"]] 
    [1] "blue"   "red"    "yellow"
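One practical consequence of `$` accepting only a literal name can be sketched like this. The list is rebuilt here with the values shown in the slides so the example is self-contained; the variable `school` and the `by_*` names are assumptions for illustration.

```r
# Rebuild the named list with the values printed in the slides.
list.object.named <- list(
  emory = 3,
  uga   = c("blue", "red", "yellow"),
  gsu   = matrix(2:5, nrow = 2, byrow = TRUE)
)

school <- "uga"  # hypothetical variable holding an element name

by_dollar     <- list.object.named$uga        # literal name after $
by_bracket    <- list.object.named[[school]]  # [[ evaluates the variable
by_dollar_var <- list.object.named$school     # looks for an element named "school"
```

Here `by_dollar_var` is `NULL`, because `$` does not evaluate `school`; it looks for an element whose name is literally "school". When the name is stored in a variable, `[[ ]]` is the operator to reach for.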
    @@ -755,1633 +635,716 @@

    $ for indexing with lists

    Using indexing to rename columns

    As mentioned above, indexing can be used both to extract part of an object and to replace parts of an object (or to add parts).

    -
    colnames(df) # just prints
    +
    colnames(df) 
    [1] "observation_id"    "IgG_concentration" "age"              
     [4] "gender"            "slum"             
    -
    colnames(df)[1:2] <- c("IgG_concentration_mIU/mL", "age_year") # reassigns
    -colnames(df)
    +
    colnames(df)[2:3] <- c("IgG_concentration_IU/mL", "age_year") # reassigns
    +colnames(df)
    -
    [1] "IgG_concentration_mIU/mL" "age_year"                
    -[3] "age"                      "gender"                  
    -[5] "slum"                    
    +
    [1] "observation_id"          "IgG_concentration_IU/mL"
    +[3] "age_year"                "gender"                 
    +[5] "slum"                   
    -
    colnames(df)[1:2] <- c("IgG_concentration", "age") #reset
    +
    +

    For the sake of the module, I am going to reassign them back to the original variable names.

    +
    +
    colnames(df)[2:3] <- c("IgG_concentration", "age") #reset
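The rename-and-reset pattern can be tried on a small stand-in data frame. In this sketch, `df_demo` and its values are made up for illustration; only the column names and the `colnames(...)[2:3] <- ...` pattern come from the slides.

```r
# Hypothetical three-row stand-in for the course data frame.
df_demo <- data.frame(
  observation_id    = 1:3,
  IgG_concentration = c(0.3, 3.4, 0.3),
  age               = c(2, 4, 4)
)

original_names <- colnames(df_demo)  # save the names so they can be restored

# Replace only the 2nd and 3rd names; the 1st is left untouched.
colnames(df_demo)[2:3] <- c("IgG_concentration_IU/mL", "age_year")
renamed <- colnames(df_demo)

# Reset to the original names using the saved copy.
colnames(df_demo)[2:3] <- original_names[2:3]
```

Saving the names before overwriting them is a design choice for this sketch; it avoids retyping the originals when resetting.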

    Using indexing to subset by columns

    -

    We can also subset a data frames and matrices (2-dimensional objects) using the bracket [ row , column ]. We can subset by columns and pull the x column using the index of the column or the column name.

    -

    For example, here I am pulling the 3nd column, which has the variable name age

    +

    We can also subset data frames and matrices (2-dimensional objects) using the bracket [ row , column ]. We can subset by columns and pull a column using either its index or its name. Leaving the row or column dimension blank selects all rows or all columns, respectively.

    +

    For example, here I am pulling the 3rd column, which has the variable name age, for all of the rows.

    -
    df[ , "age"] #same as df[ , 3]
    -
    -
      [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01
    -  [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00
    - [11] 4.574007e+00 1.583904e+02           NA 1.065068e+02 1.113870e+02
    - [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02
    - [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01
    - [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01
    - [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01
    - [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01
    - [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00
    - [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01
    - [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01           NA
    - [56] 2.104693e+00           NA 3.826715e-01 3.926366e+01 1.129964e+00
    - [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01
    - [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02
    - [71] 1.642979e+02           NA 1.542808e+02 6.033058e-01 2.809917e-01
    - [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01
    - [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02
    - [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01
    - [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00
    - [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01
    -[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01
    -[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02
    -[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01
    -[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02
    -[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01
    -[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01
    -[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02
    -[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02
    -[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00
    -[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02
    -[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02
    -[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02
    -[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01
    -[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02
    -[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02
    -[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01
    -[181] 8.767123e-02           NA 6.794521e-01 5.808219e-01 1.369863e-01
    -[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02
    -[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02
    -[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01
    -[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01
    -[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01
    -[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02
    -[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02
    -[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03
    -[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01
    -[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02
    -[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02
    -[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00
    -[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01
    -[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01
    -[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02
    -[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00
    -[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02
    -[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02
    -[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01
    -[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01
    -[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02
    -[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02
    -[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01
    -[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01
    -[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02
    -[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02
    -[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02
    -[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01
    -[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00
    -[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01
    -[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01
    -[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00
    -[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01
    -[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02
    -[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01
    -[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01
    -[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02
    -[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01
    -[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01
    -[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01
    -[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02
    -[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01
    -[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02
    -[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01
    -[406]           NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01
    -[411] 3.000000e-01 1.786638e+02 3.279570e+02           NA 1.903017e+02
    -[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01
    -[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00
    -[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01
    -[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01
    -[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01
    -[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02
    -[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00
    -[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01
    -[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01
    -[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01
    -[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01
    -[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01
    -[476] 1.458599e+00 1.247678e+02           NA 4.334545e+02 3.000000e-01
    -[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02
    -[486] 5.127389e-01 1.171053e+02           NA 2.547771e-02 1.707430e+02
    -[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01
    -[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01
    -[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02
    -[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02
    -[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01
    -[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02
    -[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02
    -[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02
    -[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01
    -[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02
    -[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02
    -[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02
    -[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01
    -[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02
    -[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02
    -[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01
    -[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01
    -[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01
    -[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02           NA
    -[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01
    -[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01
    -[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02
    -[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02
    -[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01
    -[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02
    -[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01
    -[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01
    -[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01
    -[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02
    -[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01
    -[651] 2.251656e-01
    -
    -
    -

    We can select multiple columns using multiple column names:

    +
    df[ , "age"] #same as df[ , 3]
    + +
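The equivalence between selecting by name and selecting by index can be checked on a small stand-in data frame. `df_demo` and its values are hypothetical; the column order follows the `colnames(df)` output shown earlier, leaving out `slum`.

```r
# Hypothetical stand-in with the same column order as the course data frame.
df_demo <- data.frame(
  observation_id    = 1:3,
  IgG_concentration = c(0.3, 3.4, 0.3),
  age               = c(2, 4, 4),
  gender            = c("Female", "Female", "Male")
)

# Row dimension left blank: all rows are kept.
age_by_name  <- df_demo[ , "age"]
age_by_index <- df_demo[ , 3]

# Several columns at once, by name or by position.
two_by_name  <- df_demo[ , c("age", "gender")]
two_by_index <- df_demo[ , c(3, 4)]

identical(age_by_name, age_by_index)  # do name and index pull the same column?
identical(two_by_name, two_by_index)  # same check for the two-column selection
```

Selecting by name is usually the safer habit, since an index silently points at a different variable if the column order ever changes.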

    We can select multiple columns using multiple column names; again, this selects these variables for all of the rows.

    df[, c("age", "gender")] #same as df[ , c(3,4)]
    -
                 age gender
    -1   3.176895e-01 Female
    -2   3.436823e+00 Female
    -3   3.000000e-01   Male
    -4   1.432363e+02   Male
    -5   4.476534e-01   Male
    -6   2.527076e-02   Male
    -7   6.101083e-01 Female
    -8   3.000000e-01 Female
    -9   2.916968e+00   Male
    -10  1.649819e+00   Male
    -11  4.574007e+00   Male
    -12  1.583904e+02 Female
    -13            NA   Male
    -14  1.065068e+02   Male
    -15  1.113870e+02   Male
    -16  4.144893e+01   Male
    -17  3.000000e-01   Male
    -18  2.527076e-01 Female
    -19  8.159247e+01 Female
    -20  1.825342e+02   Male
    -21  4.244656e+01   Male
    -22  1.193493e+02 Female
    -23  3.000000e-01   Male
    -24  3.000000e-01 Female
    -25  9.025271e-01 Female
    -26  3.501805e-01   Male
    -27  3.000000e-01   Male
    -28  1.227437e+00 Female
    -29  1.702055e+02 Female
    -30  3.000000e-01 Female
    -31  4.801444e-01   Male
    -32  2.527076e-02   Male
    -33  3.000000e-01 Female
    -34  5.776173e-02   Male
    -35  4.801444e-01 Female
    -36  3.826715e-01 Female
    -37  3.000000e-01   Male
    -38  4.048558e+02   Male
    -39  3.000000e-01   Male
    -40  5.451264e-01   Male
    -41  3.000000e-01 Female
    -42  5.590753e+01   Male
    -43  2.202166e-01 Female
    -44  1.709760e+02   Male
    -45  1.227437e+00   Male
    -46  4.567527e+02   Male
    -47  4.838480e+01   Male
    -48  1.227437e-01 Female
    -49  1.877256e-01 Female
    -50  3.000000e-01 Female
    -51  3.501805e-01   Male
    -52  3.339350e+00   Male
    -53  3.000000e-01 Female
    -54  5.451264e-01 Female
    -55            NA   Male
    -56  2.104693e+00   Male
    -57            NA   Male
    -58  3.826715e-01 Female
    -59  3.926366e+01 Female
    -60  1.129964e+00   Male
    -61  3.501805e+00 Female
    -62  7.542808e+01 Female
    -63  4.800475e+01 Female
    -64  1.000000e+00   Male
    -65  4.068884e+01   Male
    -66  3.000000e-01 Female
    -67  4.377672e+01 Female
    -68  1.193493e+02   Male
    -69  6.977740e+01   Male
    -70  1.373288e+02 Female
    -71  1.642979e+02   Male
    -72            NA Female
    -73  1.542808e+02   Male
    -74  6.033058e-01   Male
    -75  2.809917e-01   Male
    -76  1.966942e+00   Male
    -77  2.041322e+00   Male
    -78  2.115702e+00 Female
    -79  4.663043e+02   Male
    -80  3.000000e-01   Male
    -81  1.500796e+02   Male
    -82  1.543790e+02 Female
    -83  2.561983e-01 Female
    -84  1.596338e+02   Male
    -85  1.732484e+02 Female
    -86  4.641304e+02 Female
    -87  3.736364e+01   Male
    -88  1.572452e+02 Female
    -89  3.000000e-01   Male
    -90  3.000000e-01   Male
    -91  8.264463e-02   Male
    -92  6.776859e-01 Female
    -93  7.272727e-01   Male
    -94  2.066116e-01 Female
    -95  1.966942e+00   Male
    -96  3.000000e-01   Male
    -97  3.000000e-01   Male
    -98  2.809917e-01 Female
    -99  8.016529e-01 Female
    -100 1.818182e-01 Female
    -101 1.818182e-01   Male
    -102 8.264463e-02 Female
    -103 3.422727e+01 Female
    -104 8.743506e+00   Male
    -105 3.000000e-01   Male
    -106 1.641720e+02 Female
    -107 4.049587e-01   Male
    -108 1.001592e+02   Male
    -109 4.489130e+02 Female
    -110 1.101911e+02 Female
    -111 4.440909e+01   Male
    -112 1.288217e+02 Female
    -113 2.840909e+01   Male
    -114 1.003981e+02 Female
    -115 8.512397e-01 Female
    -116 1.322314e-01   Male
    -117 1.297521e+00 Female
    -118 1.570248e-01   Male
    -119 1.966942e+00 Female
    -120 1.536624e+02   Male
    -121 3.000000e-01 Female
    -122 3.000000e-01 Female
    -123 1.074380e+00   Male
    -124 1.099174e+00 Female
    -125 3.057851e-01 Female
    -126 3.000000e-01 Female
    -127 5.785124e-02 Female
    -128 4.391304e+02 Female
    -129 6.130435e+02 Female
    -130 1.074380e-01   Male
    -131 7.125796e+01   Male
    -132 4.222727e+01   Male
    -133 1.620223e+02 Female
    -134 3.750000e+01 Female
    -135 1.534236e+02 Female
    -136 6.239130e+02 Female
    -137 5.521739e+02   Male
    -138 5.785124e-02 Female
    -139 6.547945e-01 Female
    -140 8.767123e-02 Female
    -141 3.000000e-01   Male
    -142 2.849315e+00 Female
    -143 3.835616e-02   Male
    -144 2.849315e-01   Male
    -145 4.649315e+00   Male
    -146 1.369863e-01 Female
    -147 3.589041e-01   Male
    -148 1.049315e+00   Male
    -149 4.668998e+01 Female
    -150 1.473510e+02 Female
    -151 4.589744e+01   Male
    -152 2.109589e-01   Male
    -153 1.741722e+02 Female
    -154 2.496503e+01 Female
    -155 1.850993e+02   Male
    -156 1.863014e-01   Male
    -157 1.863014e-01   Male
    -158 4.589744e+01 Female
    -159 1.942881e+02 Female
    -160 5.079646e+02 Female
    -161 8.767123e-01   Male
    -162 2.750685e+00   Male
    -163 1.503311e+02 Female
    -164 3.000000e-01   Male
    -165 3.095890e-01   Male
    -166 3.000000e-01   Male
    -167 6.371681e+02 Female
    -168 6.054795e-01 Female
    -169 1.955298e+02 Female
    -170 1.786424e+02   Male
    -171 1.120861e+02 Female
    -172 1.331954e+02   Male
    -173 2.159292e+02   Male
    -174 5.628319e+02   Male
    -175 1.900662e+02 Female
    -176 6.547945e-01   Male
    -177 1.665753e+00   Male
    -178 1.739238e+02   Male
    -179 9.991722e+01   Male
    -180 9.321192e+01   Male
    -181 8.767123e-02 Female
    -182           NA   Male
    -183 6.794521e-01 Female
    -184 5.808219e-01   Male
    -185 1.369863e-01 Female
    -186 2.060274e+00 Female
    -187 1.610099e+02   Male
    -188 4.082192e-01 Female
    -189 8.273973e-01   Male
    -190 4.601770e+02 Female
    -191 1.389073e+02 Female
    -192 3.867133e+01 Female
    -193 9.260274e-01 Female
    -194 5.918874e+01 Female
    -195 1.870861e+02 Female
    -196 4.328767e-01   Male
    -197 6.301370e-02   Male
    -198 3.000000e-01 Female
    -199 1.548013e+02   Male
    -200 5.819536e+01 Female
    -201 1.724338e+02 Female
    -202 1.932401e+01 Female
    -203 2.164420e+00 Female
    -204 9.757412e-01 Female
    -205 1.509434e-01   Male
    -206 1.509434e-01 Female
    -207 7.766571e+01   Male
    -208 4.319563e+01 Female
    -209 1.752022e-01   Male
    -210 3.094775e+01 Female
    -211 1.266846e-01   Male
    -212 2.919806e+01   Male
    -213 9.545455e+00 Female
    -214 2.735115e+01 Female
    -215 1.314841e+02 Female
    -216 3.643985e+01   Male
    -217 1.498559e+02 Female
    -218 9.363636e+00 Female
    -219 2.479784e-01   Male
    -220 5.390836e-02 Female
    -221 8.787062e-01 Female
    -222 1.994609e-01   Male
    -223 3.000000e-01 Female
    -224 3.000000e-01   Male
    -225 5.390836e-03 Female
    -226 4.177898e-01 Female
    -227 3.000000e-01 Female
    -228 2.479784e-01   Male
    -229 2.964960e-02   Male
    -230 2.964960e-01   Male
    -231 5.148248e+00 Female
    -232 1.994609e-01   Male
    -233 3.000000e-01   Male
    -234 1.779539e+02   Male
    -235 3.290210e+02 Female
    -236 3.000000e-01   Male
    -237 1.809798e+02 Female
    -238 4.905660e-01   Male
    -239 1.266846e-01   Male
    -240 1.543948e+02 Female
    -241 1.379683e+02 Female
    -242 6.153846e+02   Male
    -243 1.474784e+02   Male
    -244 3.000000e-01 Female
    -245 1.024259e+00   Male
    -246 4.444056e+02 Female
    -247 3.000000e-01   Male
    -248 2.504043e+00 Female
    -249 3.000000e-01 Female
    -250 3.000000e-01 Female
    -251 7.816712e-02 Female
    -252 3.000000e-01 Female
    -253 5.390836e-02   Male
    -254 1.494236e+02 Female
    -255 5.972622e+01   Male
    -256 6.361186e-01 Female
    -257 1.837896e+02 Female
    -258 1.320809e+02 Female
    -259 1.571906e-01   Male
    -260 1.520231e+02   Male
    -261 3.000000e-01 Female
    -262 3.000000e-01 Female
    -263 1.823699e+02   Male
    -264 3.000000e-01   Male
    -265 2.173913e+00   Male
    -266 2.142202e+01   Male
    -267 3.000000e-01 Female
    -268 3.408027e+00   Male
    -269 4.155963e+01   Male
    -270 9.698997e-02   Male
    -271 1.238532e+01 Female
    -272 9.528926e+00   Male
    -273 1.916185e+02 Female
    -274 1.060201e+00   Male
    -275 3.679104e+02 Female
    -276 4.288991e+01   Male
    -277 9.971098e+01   Male
    -278 3.000000e-01   Male
    -279 1.208092e+02   Male
    -280 3.000000e-01   Male
    -281 6.688963e-03 Female
    -282 2.505017e+00 Female
    -283 1.481605e+00   Male
    -284 3.000000e-01 Female
    -285 5.183946e-01 Female
    -286 3.000000e-01 Female
    -287 1.872910e-01   Male
    -288 3.678930e-01 Female
    -289 3.000000e-01   Male
    -290 4.529851e+02 Female
    -291 3.169725e+01 Female
    -292 3.000000e-01   Male
    -293 4.922018e+01   Male
    -294 2.548507e+02   Male
    -295 1.661850e+02   Male
    -296 9.164179e+02   Male
    -297 3.678930e-01 Female
    -298 1.236994e+02   Male
    -299 6.705202e+01   Male
    -300 3.834862e+01   Male
    -301 1.963211e+00 Female
    -302 3.000000e-01   Male
    -303 2.474916e-01   Male
    -304 3.000000e-01 Female
    -305 2.173913e-01   Male
    -306 8.193980e-01   Male
    -307 2.444816e+00 Female
    -308 3.000000e-01   Male
    -309 1.571906e-01 Female
    -310 1.849711e+02   Male
    -311 6.119403e+02 Female
    -312 3.000000e-01 Female
    -313 4.280936e-01 Female
    -314 9.698997e-02   Male
    -315 3.678930e-02 Female
    -316 4.832090e+02   Male
    -317 1.390173e+02 Female
    -318 3.000000e-01   Male
    -319 6.555970e+02 Female
    -320 1.526012e+02 Female
    -321 3.000000e-01 Female
    -322 7.222222e-01   Male
    -323 7.724426e+01   Male
    -324 3.000000e-01   Male
    -325 6.111111e-01 Female
    -326 1.555556e+00 Female
    -327 3.055556e-01   Male
    -328 1.500000e+00   Male
    -329 1.470772e+02   Male
    -330 1.694444e+00 Female
    -331 3.138298e+02 Female
    -332 1.414405e+02 Female
    -333 1.990605e+02 Female
    -334 4.212766e+02   Male
    -335 3.000000e-01   Male
    -336 3.000000e-01   Male
    -337 6.478723e+02   Male
    -338 3.000000e-01   Male
    -339 2.222222e+00 Female
    -340 3.000000e-01   Male
    -341 2.055556e+00   Male
    -342 2.777778e-02 Female
    -343 8.333333e-02   Male
    -344 1.032359e+02 Female
    -345 1.611111e+00 Female
    -346 8.333333e-02 Female
    -347 2.333333e+00 Female
    -348 5.755319e+02   Male
    -349 1.686848e+02 Female
    -350 1.111111e-01   Male
    -351 3.000000e-01   Male
    -352 8.372340e+02 Female
    -353 3.000000e-01   Male
    -354 3.784504e+01   Male
    -355 3.819149e+02   Male
    -356 5.555556e-02 Female
    -357 3.000000e+02 Female
    -358 1.855950e+02   Male
    -359 1.944444e-01 Female
    -360 3.000000e-01   Male
    -361 5.555556e-02 Female
    -362 1.138889e+00   Male
    -363 4.254237e+01 Female
    -364 3.000000e-01   Male
    -365 3.000000e-01   Male
    -366 3.000000e-01 Female
    -367 3.000000e-01 Female
    -368 3.138298e+02 Female
    -369 1.235908e+02   Male
    -370 4.159574e+02   Male
    -371 3.009685e+01 Female
    -372 1.567850e+02 Female
    -373 1.367432e+02 Female
    -374 3.731235e+01 Female
    -375 9.164927e+01   Male
    -376 2.936170e+02 Female
    -377 8.820459e+01 Female
    -378 1.035491e+02   Male
    -379 7.379958e+01 Female
    -380 3.000000e-01   Male
    -381 1.718750e+02   Male
    -382 2.128527e+00   Male
    -383 1.253918e+00 Female
    -384 2.382445e-01   Male
    -385 4.639498e-01 Female
    -386 1.253918e-01   Male
    -387 1.253918e-01   Male
    -388 3.000000e-01 Female
    -389 1.000000e+00   Male
    -390 1.570043e+02   Male
    -391 4.344086e+02 Female
    -392 2.184953e+00   Male
    -393 1.507837e+00 Female
    -394 3.228840e-01 Female
    -395 4.588024e+01   Male
    -396 1.660560e+02   Male
    -397 3.000000e-01   Male
    -398 3.043011e+02   Male
    -399 2.612903e+02 Female
    -400 1.621767e+02   Male
    -401 3.228840e-01   Male
    -402 4.639498e-01 Female
    -403 2.495298e+00 Female
    -404 3.257053e+00 Female
    -405 3.793103e-01 Female
    -406           NA   Male
    -407 6.896552e-02 Female
    -408 3.000000e-01   Male
    -409 1.423197e+00 Female
    -410 3.000000e-01 Female
    -411 3.000000e-01 Female
    -412 1.786638e+02   Male
    -413 3.279570e+02   Male
    -414           NA Female
    -415 1.903017e+02   Male
    -416 1.654095e+02 Female
    -417 4.639498e-01 Female
    -418 1.815733e+02   Male
    -419 1.366771e+00   Male
    -420 1.536050e-01 Female
    -421 1.306587e+01   Male
    -422 2.129032e+02 Female
    -423 1.925647e+02   Male
    -424 3.000000e-01 Female
    -425 1.028213e+00 Female
    -426 3.793103e-01 Female
    -427 8.025078e-01 Female
    -428 4.860215e+02 Female
    -429 3.000000e-01 Female
    -430 2.100313e-01   Male
    -431 2.767665e+01 Female
    -432 1.592476e+00   Male
    -433 9.717868e-02 Female
    -434 1.028213e+00 Female
    -435 3.793103e-01   Male
    -436 1.292026e+02   Male
    -437 4.425150e+01 Female
    -438 3.193548e+02 Female
    -439 1.860991e+02 Female
    -440 6.614420e-01 Female
    -441 5.203762e-01   Male
    -442 1.330819e+02   Male
    -443 1.673491e+02 Female
    -444 3.000000e-01   Male
    -445 1.117457e+02   Male
    -446 3.045509e+01 Female
    -447 3.000000e-01   Male
    -448 8.280255e-02 Female
    -449 3.000000e-01 Female
    -450 1.200637e+00 Female
    -451 1.687898e-01   Male
    -452 7.367273e+02 Female
    -453 8.280255e-02   Male
    -454 5.127389e-01   Male
    -455 1.974522e-01   Male
    -456 7.993631e-01 Female
    -457 3.000000e-01   Male
    -458 3.298182e+02   Male
    -459 9.736842e+01 Female
    -460 3.000000e-01 Female
    -461 3.000000e-01 Female
    -462 4.214545e+02 Female
    -463 3.000000e-01   Male
    -464 2.578182e+02 Female
    -465 2.261147e-01   Male
    -466 3.000000e-01 Female
    -467 1.883901e+02   Male
    -468 9.458204e+01 Female
    -469 3.000000e-01 Female
    -470 3.000000e-01   Male
    -471 7.707006e-01 Female
    -472 5.032727e+02   Male
    -473 1.544586e+00 Female
    -474 1.431115e+02 Female
    -475 3.000000e-01   Male
    -476 1.458599e+00   Male
    -477 1.247678e+02 Female
    -478           NA Female
    -479 4.334545e+02   Male
    -480 3.000000e-01 Female
    -481 6.156364e+02 Female
    -482 9.574303e+01   Male
    -483 1.928019e+02   Male
    -484 1.888545e+02   Male
    -485 1.598297e+02 Female
    -486 5.127389e-01   Male
    -487 1.171053e+02 Female
    -488           NA   Male
    -489 2.547771e-02 Female
    -490 1.707430e+02 Female
    -491 3.000000e-01   Male
    -492 1.869969e+02   Male
    -493 4.731481e+01   Male
    -494 1.988390e+02 Female
    -495 3.000000e-01   Male
    -496 8.808050e+01   Male
    -497 2.003185e+00 Female
    -498 3.000000e-01   Male
    -499 3.509259e+01 Female
    -500 9.365325e+01 Female
    -501 3.000000e-01   Male
    -502 3.736111e+01 Female
    -503 1.674923e+02 Female
    -504 8.808050e+01   Male
    -505 1.656347e+02 Female
    -506 3.722222e+01 Female
    -507 6.756364e+02 Female
    -508 3.000000e-01   Male
    -509 1.698142e+02   Male
    -510 1.628483e+02 Female
    -511 5.985130e-01   Male
    -512 1.903346e+00 Female
    -513 3.000000e-01   Male
    -514 3.000000e-01   Male
    -515 8.996283e-01   Male
    -516 3.977695e-01 Female
    -517 3.000000e-01   Male
    -518 3.000000e-01   Male
    -519 3.000000e-01   Male
    -520 3.000000e-01 Female
    -521 7.446809e+02   Male
    -522 6.095745e+02 Female
    -523 1.427445e+02   Male
    -524 3.000000e-01 Female
    -525 2.973978e-02   Male
    -526 3.977695e-01 Female
    -527 4.095745e+02 Female
    -528 4.595745e+02   Male
    -529 3.000000e-01 Female
    -530 1.976341e+02 Female
    -531 3.776596e+02 Female
    -532 1.777603e+02 Female
    -533 4.312268e-01   Male
    -534 6.765957e+02 Female
    -535 7.978723e+02   Male
    -536 9.665427e-02   Male
    -537 1.879338e+02   Male
    -538 4.358670e+01 Female
    -539 3.000000e-01 Female
    -540 3.000000e-01   Male
    -541 2.638955e+01   Male
    -542 3.180523e+01 Female
    -543 1.746845e+02   Male
    -544 1.876972e+02   Male
    -545 1.044164e+02   Male
    -546 1.202681e+02   Male
    -547 1.630915e+02 Female
    -548 1.276025e+02 Female
    -549 8.880126e+01   Male
    -550 3.563830e+02   Male
    -551 2.212766e+02   Male
    -552 1.969121e+01 Female
    -553 3.755319e+02 Female
    -554 1.214511e+02   Male
    -555 1.034700e+02 Female
    -556 3.000000e-01 Female
    -557 3.643123e-01 Female
    -558 6.319703e-02 Female
    -559 3.000000e-01   Male
    -560 3.000000e-01   Male
    -561 3.000000e-01 Female
    -562 3.000000e-01 Female
    -563 3.000000e-01   Male
    -564 3.000000e-01   Male
    -565 3.000000e-01 Female
    -566 3.000000e-01   Male
    -567 1.664038e+02 Female
    -568 2.946809e+02 Female
    -569 4.391924e+01   Male
    -570 1.874606e+02 Female
    -571 1.143533e+02   Male
    -572 1.600158e+02   Male
    -573 1.635688e-01   Male
    -574 8.809148e+01 Female
    -575 1.337539e+02   Male
    -576 1.985804e+02   Male
    -577 1.578864e+02 Female
    -578 3.000000e-01 Female
    -579 3.000000e-01   Male
    -580 1.953642e-01 Female
    -581 1.119205e+00   Male
    -582 2.523636e+02   Male
    -583 3.000000e-01   Male
    -584 4.844371e+00 Female
    -585 3.000000e-01   Male
    -586 1.492553e+02 Female
    -587 1.993617e+02   Male
    -588 2.847682e-01 Female
    -589 3.145695e-01 Female
    -590 3.000000e-01   Male
    -591 3.406429e+01 Female
    -592 6.595745e+01   Male
    -593 3.000000e-01   Male
    -594 2.174545e+02   Male
    -595           NA Female
    -596 5.957447e+01 Female
    -597 7.236364e+02 Female
    -598 3.000000e-01   Male
    -599 3.000000e-01 Female
    -600 3.000000e-01   Male
    -601 2.676364e+02   Male
    -602 1.891489e+02   Male
    -603 3.036364e+02 Female
    -604 3.000000e-01 Female
    -605 3.000000e-01   Male
    -606 3.000000e-01   Male
    -607 3.000000e-01 Female
    -608 3.000000e-01   Male
    -609 1.447020e+00   Male
    -610 2.130909e+02 Female
    -611 1.357616e-01 Female
    -612 3.000000e-01 Female
    -613 3.000000e-01 Female
    -614 5.534545e+02 Female
    -615 1.891489e+02 Female
    -616 7.202128e+01 Female
    -617 3.250287e+01   Male
    -618 1.655629e-02   Male
    -619 3.123636e+02   Male
    -620 3.000000e-01   Male
    -621 7.138298e+01   Male
    -622 3.000000e-01 Female
    -623 6.946809e+01 Female
    -624 4.012629e+01   Male
    -625 1.629787e+02 Female
    -626 1.508511e+02 Female
    -627 1.655629e-02   Male
    -628 3.000000e-01   Male
    -629 4.635762e-02   Male
    -630 3.000000e-01 Female
    -631 3.000000e-01 Female
    -632 3.000000e-01   Male
    -633 1.942553e+02   Male
    -634 3.690909e+02   Male
    -635 3.000000e-01 Female
    -636 3.000000e-01 Female
    -637 2.847682e+00   Male
    -638 1.435106e+02 Female
    -639 3.000000e-01   Male
    -640 4.752009e+01 Female
    -641 2.621125e+01 Female
    -642 1.055319e+02 Female
    -643 3.000000e-01 Female
    -644 1.149007e+00   Male
    -645 2.927273e+02 Female
    -646 3.000000e-01 Female
    -647 3.000000e-01 Female
    -648 4.839265e+01   Male
    -649 3.000000e-01   Male
    -650 3.000000e-01 Female
    -651 2.251656e-01 Female
    +
        age gender
    +1     2 Female
    +2     4 Female
    +3     4   Male
    +4     4   Male
    +5     1   Male
    +6     4   Male
    +7     4 Female
    +8    NA Female
    +9     4   Male
    +10    2   Male
    +11    3   Male
    +12   15 Female
    +13    8   Male
    +14   12   Male
    +15   15   Male
    +16    9   Male
    +17    8   Male
    +18    7 Female
    +19   11 Female
    +20   10   Male
    +21    8   Male
    +22   11 Female
    +23    2   Male
    +24    2 Female
    +25    3 Female
    +26    5   Male
    +27    1   Male
    +28    3 Female
    +29    5 Female
    +30    5 Female
    +31    3   Male
    +32    1   Male
    +33    4 Female
    +34    3   Male
    +35    2 Female
    +36   11 Female
    +37    7   Male
    +38    8   Male
    +39    6   Male
    +40    6   Male
    +41   11 Female
    +42   10   Male
    +43    6 Female
    +44   12   Male
    +45   11   Male
    +46   10   Male
    +47   11   Male
    +48   13 Female
    +49    3 Female
    +50    4 Female
    +51    3   Male
    +52    1   Male
    +53    2 Female
    +54    2 Female
    +55    4   Male
    +56    2   Male
    +57    2   Male
    +58    3 Female
    +59    3 Female
    +60    4   Male
    +61    1 Female
    +62   13 Female
    +63   13 Female
    +64    6   Male
    +65   13   Male
    +66    5 Female
    +67   13 Female
    +68   14   Male
    +69   13   Male
    +70    8 Female
    +71    7   Male
    +72    6 Female
    +73   13   Male
    +74    3   Male
    +75    4   Male
    +76    2   Male
    +77   NA   Male
    +78    5 Female
    +79    3   Male
    +80    3   Male
    +81   14   Male
    +82   11 Female
    +83    7 Female
    +84    7   Male
    +85   11 Female
    +86    9 Female
    +87   14   Male
    +88   13 Female
    +89    1   Male
    +90    1   Male
    +91    4   Male
    +92    1 Female
    +93    2   Male
    +94    3 Female
    +95    2   Male
    +96    1   Male
    +97    2   Male
    +98    2 Female
    +99    4 Female
    +100   5 Female
    +101   5   Male
    +102   6 Female
    +103  14 Female
    +104  14   Male
    +105  10   Male
    +106   6 Female
    +107   6   Male
    +108   8   Male
    +109   6 Female
    +110  12 Female
    +111  12   Male
    +112  14 Female
    +113  15   Male
    +114  12 Female
    +115   4 Female
    +116   4   Male
    +117   3 Female
    +118  NA   Male
    +119   2 Female
    +120   3   Male
    +121  NA Female
    +122   3 Female
    +123   3   Male
    +124   2 Female
    +125   4 Female
    +126  10 Female
    +127   7 Female
    +128  11 Female
    +129   6 Female
    +130  11   Male
    +131   9   Male
    +132   6   Male
    +133  13 Female
    +134  10 Female
    +135   6 Female
    +136  11 Female
    +137   7   Male
    +138   6 Female
    +139   4 Female
    +140   4 Female
    +141   4   Male
    +142   4 Female
    +143   4   Male
    +144   4   Male
    +145   3   Male
    +146   4 Female
    +147   3   Male
    +148   3   Male
    +149  13 Female
    +150   7 Female
    +151  10   Male
    +152   6   Male
    +153  10 Female
    +154  12 Female
    +155  10   Male
    +156  10   Male
    +157  13   Male
    +158  13 Female
    +159   5 Female
    +160   3 Female
    +161   4   Male
    +162   1   Male
    +163   3 Female
    +164   4   Male
    +165   4   Male
    +166   1   Male
    +167   5 Female
    +168   6 Female
    +169  14 Female
    +170   6   Male
    +171  13 Female
    +172   9   Male
    +173  11   Male
    +174  10   Male
    +175   5 Female
    +176  14   Male
    +177   7   Male
    +178  10   Male
    +179   6   Male
    +180   5   Male
    +181   3 Female
    +182   4   Male
    +183   2 Female
    +184   3   Male
    +185   3 Female
    +186   2 Female
    +187   3   Male
    +188   5 Female
    +189   2   Male
    +190   3 Female
    +191  14 Female
    +192   9 Female
    +193  14 Female
    +194   9 Female
    +195   8 Female
    +196   7   Male
    +197  13   Male
    +198   8 Female
    +199   6   Male
    +200  12 Female
    +201  14 Female
    +202  15 Female
    +203   2 Female
    +204   4 Female
    +205   3   Male
    +206   3 Female
    +207   3   Male
    +208   4 Female
    +209   3   Male
    +210  14 Female
    +211   8   Male
    +212   7   Male
    +213  14 Female
    +214  13 Female
    +215  13 Female
    +216   7   Male
    +217   8 Female
    +218  10 Female
    +219   9   Male
    +220   9 Female
    +221   3 Female
    +222   4   Male
    +223   4 Female
    +224   4   Male
    +225   2 Female
    +226   1 Female
    +227   3 Female
    +228   2   Male
    +229   3   Male
    +230   5   Male
    +231   2 Female
    +232   2   Male
    +233   9   Male
    +234  13   Male
    +235  10 Female
    +236   6   Male
    +237  13 Female
    +238  11   Male
    +239  10   Male
    +240   8 Female
    +241   9 Female
    +242  10   Male
    +243  14   Male
    +244   1 Female
    +245   2   Male
    +246   3 Female
    +247   2   Male
    +248   3 Female
    +249   2 Female
    +250   3 Female
    +251   5 Female
    +252  10 Female
    +253   7   Male
    +254  13 Female
    +255  15   Male
    +256  11 Female
    +257  10 Female
    +258   3 Female
    +259   2   Male
    +260   3   Male
    +261   3 Female
    +262   3 Female
    +263   4   Male
    +264   3   Male
    +265   2   Male
    +266   4   Male
    +267   2 Female
    +268   8   Male
    +269  11   Male
    +270   6   Male
    +271  14 Female
    +272  14   Male
    +273   5 Female
    +274   5   Male
    +275  10 Female
    +276  13   Male
    +277   6   Male
    +278   5   Male
    +279  12   Male
    +280   2   Male
    +281   3 Female
    +282   1 Female
    +283   1   Male
    +284   1 Female
    +285   2 Female
    +286   5 Female
    +287   5   Male
    +288   4 Female
    +289   2   Male
    +290  NA Female
    +291   6 Female
    +292   8   Male
    +293  15   Male
    +294  11   Male
    +295  14   Male
    +296   6   Male
    +297  10 Female
    +298  12   Male
    +299  14   Male
    +300  10   Male
    +301   1 Female
    +302   3   Male
    +303   2   Male
    +304   3 Female
    +305   4   Male
    +306   3   Male
    +307   4 Female
    +308   4   Male
    +309   1 Female
    +310   7   Male
    +311  11 Female
    +312   7 Female
    +313   5 Female
    +314  10   Male
    +315   9 Female
    +316  13   Male
    +317  11 Female
    +318  13   Male
    +319   9 Female
    +320  15 Female
    +321   7 Female
    +322   4   Male
    +323   1   Male
    +324   1   Male
    +325   2 Female
    +326   2 Female
    +327   3   Male
    +328   2   Male
    +329   3   Male
    +330   4 Female
    +331   7 Female
    +332  11 Female
    +333  10 Female
    +334   5   Male
    +335   8   Male
    +336  15   Male
    +337  14   Male
    +338   2   Male
    +339   2 Female
    +340   2   Male
    +341   5   Male
    +342   4 Female
    +343   3   Male
    +344   5 Female
    +345   4 Female
    +346   2 Female
    +347   1 Female
    +348   7   Male
    +349   8 Female
    +350  NA   Male
    +351   9   Male
    +352   8 Female
    +353   5   Male
    +354  14   Male
    +355  14   Male
    +356   7 Female
    +357  13 Female
    +358   2   Male
    +359   1 Female
    +360   1   Male
    +361   4 Female
    +362   3   Male
    +363   4 Female
    +364   3   Male
    +365   1   Male
    +366   5 Female
    +367   4 Female
    +368   4 Female
    +369   4   Male
    +370  11   Male
    +371  15 Female
    +372  12 Female
    +373  11 Female
    +374   8 Female
    +375  13   Male
    +376  10 Female
    +377  10 Female
    +378  15   Male
    +379   8 Female
    +380  14   Male
    +381   4   Male
    +382   1   Male
    +383   5 Female
    +384   2   Male
    +385   2 Female
    +386   4   Male
    +387   4   Male
    +388   2 Female
    +389   3   Male
    +390  11   Male
    +391  10 Female
    +392   6   Male
    +393  12 Female
    +394  10 Female
    +395   8   Male
    +396   8   Male
    +397  13   Male
    +398  10   Male
    +399  13 Female
    +400  10   Male
    +401   2   Male
    +402   4 Female
    +403   3 Female
    +404   2 Female
    +405   1 Female
    +406   3   Male
    +407   3 Female
    +408   4   Male
    +409   5 Female
    +410   5 Female
    +411   1 Female
    +412  11   Male
    +413   6   Male
    +414  14 Female
    +415   8   Male
    +416   8 Female
    +417   9 Female
    +418   7   Male
    +419   6   Male
    +420  12 Female
    +421   8   Male
    +422  11 Female
    +423  14   Male
    +424   3 Female
    +425   1 Female
    +426   5 Female
    +427   2 Female
    +428   3 Female
    +429   4 Female
    +430   2   Male
    +431   3 Female
    +432   4   Male
    +433   1 Female
    +434   7 Female
    +435  10   Male
    +436  11   Male
    +437   7 Female
    +438  10 Female
    +439  14 Female
    +440   7 Female
    +441  11   Male
    +442  12   Male
    +443  10 Female
    +444   6   Male
    +445  13   Male
    +446   8 Female
    +447   2   Male
    +448   3 Female
    +449   1 Female
    +450   2 Female
    +451  NA   Male
    +452  NA Female
    +453   4   Male
    +454   4   Male
    +455   1   Male
    +456   2 Female
    +457   2   Male
    +458  12   Male
    +459  12 Female
    +460   8 Female
    +461  14 Female
    +462  13 Female
    +463   6   Male
    +464  11 Female
    +465  11   Male
    +466  10 Female
    +467  12   Male
    +468  14 Female
    +469  11 Female
    +470   1   Male
    +471   2 Female
    +472   3   Male
    +473   3 Female
    +474   5 Female
    +475   3   Male
    +476   1   Male
    +477   4 Female
    +478   4 Female
    +479   4   Male
    +480   2 Female
    +481   5 Female
    +482   7   Male
    +483   8   Male
    +484  10   Male
    +485   6 Female
    +486   7   Male
    +487  10 Female
    +488   6   Male
    +489   6 Female
    +490  15 Female
    +491   5   Male
    +492   3   Male
    +493   5   Male
    +494   3 Female
    +495   5   Male
    +496   5   Male
    +497   1 Female
    +498   1   Male
    +499   7 Female
    +500  14 Female
    +501   9   Male
    +502  10 Female
    +503  10 Female
    +504  11   Male
    +505  11 Female
    +506  12 Female
    +507  11 Female
    +508  12   Male
    +509  12   Male
    +510  10 Female
    +511   1   Male
    +512   2 Female
    +513   4   Male
    +514   2   Male
    +515   3   Male
    +516   3 Female
    +517   2   Male
    +518   4   Male
    +519   3   Male
    +520   1 Female
    +521   4   Male
    +522  12 Female
    +523   6   Male
    +524   7 Female
    +525   7   Male
    +526  13 Female
    +527   8 Female
    +528   7   Male
    +529   8 Female
    +530   8 Female
    +531  11 Female
    +532  14 Female
    +533   3   Male
    +534   2 Female
    +535   2   Male
    +536   3   Male
    +537   2   Male
    +538   2 Female
    +539   3 Female
    +540   2   Male
    +541   5   Male
    +542  10 Female
    +543  14   Male
    +544   9   Male
    +545   6   Male
    +546   7   Male
    +547  14 Female
    +548   7 Female
    +549   7   Male
    +550   9   Male
    +551  14   Male
    +552  10 Female
    +553  13 Female
    +554   5   Male
    +555   4 Female
    +556   4 Female
    +557   5 Female
    +558   4 Female
    +559   4   Male
    +560   4   Male
    +561   3 Female
    +562   1 Female
    +563   4   Male
    +564   1   Male
    +565   1 Female
    +566   7   Male
    +567  13 Female
    +568  10 Female
    +569  14   Male
    +570  12 Female
    +571  14   Male
    +572   8   Male
    +573   7   Male
    +574  11 Female
    +575   8   Male
    +576  12   Male
    +577   9 Female
    +578   5 Female
    +579   4   Male
    +580   3 Female
    +581   2   Male
    +582   2   Male
    +583   3   Male
    +584   4 Female
    +585   4   Male
    +586   4 Female
    +587   5   Male
    +588   3 Female
    +589   6 Female
    +590   3   Male
    +591  11 Female
    +592  11   Male
    +593   7   Male
    +594   8   Male
    +595   6 Female
    +596  10 Female
    +597   8 Female
    +598   8   Male
    +599   9 Female
    +600   8   Male
    +601  13   Male
    +602  11   Male
    +603   8 Female
    +604   2 Female
    +605   4   Male
    +606   2   Male
    +607   2 Female
    +608   4   Male
    +609   2   Male
    +610   4 Female
    +611   2 Female
    +612   4 Female
    +613   1 Female
    +614   4 Female
    +615  12 Female
    +616   7 Female
    +617  11   Male
    +618   6   Male
    +619   8   Male
    +620  14   Male
    +621  11   Male
    +622   7 Female
    +623  14 Female
    +624   6   Male
    +625  13 Female
    +626  13 Female
    +627   3   Male
    +628   1   Male
    +629   3   Male
    +630   1 Female
    +631   1 Female
    +632   2   Male
    +633   4   Male
    +634   4   Male
    +635   2 Female
    +636   4 Female
    +637   5   Male
    +638   3 Female
    +639   3   Male
    +640   6 Female
    +641  11 Female
    +642   9 Female
    +643   7 Female
    +644   8   Male
    +645  NA Female
    +646   8 Female
    +647  14 Female
    +648  10   Male
    +649  10   Male
    +650  11 Female
    +651  13 Female

    We can also remove selected columns by indexing, or by simply setting the column to NULL.
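
As a minimal sketch of the two approaches, the toy data frame below uses made-up values; the object names `toy`, `by_index`, and `by_null` are hypothetical, and only the column names echo the ones in this section. Negative indexing returns a copy without the column, while assigning `NULL` removes the column from the object itself.

```r
# Toy data frame: column names echo this section, values are invented
toy <- data.frame(
  observation_id    = c(1, 2, 3),
  IgG_concentration = c(0.3, 3.4, NA),
  age               = c(2, 4, NA),
  gender            = c("Female", "Female", "Male"),
  slum              = c("Non slum", "Slum", "Mixed")
)

# Option 1: negative indexing returns a copy without column 5;
# `toy` itself is left unchanged
by_index <- toy[, -5]

# Option 2: assigning NULL deletes the column from the object in place
by_null <- toy
by_null$slum <- NULL

names(by_index)
names(by_null)
```

With either option the result has to be assigned if it is needed later. When negative indexing would leave a single column, adding `drop = FALSE` keeps the result a data frame rather than a vector.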

    df[, -5] # remove column 5, the "slum" variable
    -
    -
        IgG_concentration          age age.1 gender
    -1                5772 3.176895e-01     2 Female
    -2                8095 3.436823e+00     4 Female
    -3                9784 3.000000e-01     4   Male
    -4                9338 1.432363e+02     4   Male
    -5                6369 4.476534e-01     1   Male
    -6                6885 2.527076e-02     4   Male
    -7                6252 6.101083e-01     4 Female
    -8                8913 3.000000e-01    NA Female
    -9                7332 2.916968e+00     4   Male
    -10               6941 1.649819e+00     2   Male
    -11               5104 4.574007e+00     3   Male
    -12               9078 1.583904e+02    15 Female
    -13               9960           NA     8   Male
    -14               9651 1.065068e+02    12   Male
    -15               9229 1.113870e+02    15   Male
    -16               5210 4.144893e+01     9   Male
    -17               5105 3.000000e-01     8   Male
    -18               7607 2.527076e-01     7 Female
    -19               7582 8.159247e+01    11 Female
    -20               8179 1.825342e+02    10   Male
    -21               5660 4.244656e+01     8   Male
    -22               6696 1.193493e+02    11 Female
    -23               7842 3.000000e-01     2   Male
    -24               6578 3.000000e-01     2 Female
    -25               9619 9.025271e-01     3 Female
    -26               9838 3.501805e-01     5   Male
    -27               6935 3.000000e-01     1   Male
    -28               5885 1.227437e+00     3 Female
    -29               9657 1.702055e+02     5 Female
    -30               9146 3.000000e-01     5 Female
    -31               7056 4.801444e-01     3   Male
    -32               9144 2.527076e-02     1   Male
    -33               8696 3.000000e-01     4 Female
    -34               7042 5.776173e-02     3   Male
    -35               5278 4.801444e-01     2 Female
    -36               6541 3.826715e-01    11 Female
    -37               6070 3.000000e-01     7   Male
    -38               5490 4.048558e+02     8   Male
    -39               6527 3.000000e-01     6   Male
    -40               5389 5.451264e-01     6   Male
    -41               9003 3.000000e-01    11 Female
    -42               6682 5.590753e+01    10   Male
    -43               7844 2.202166e-01     6 Female
    -44               8257 1.709760e+02    12   Male
    -45               7767 1.227437e+00    11   Male
    -46               8391 4.567527e+02    10   Male
    -47               8317 4.838480e+01    11   Male
    -48               7397 1.227437e-01    13 Female
    -49               8495 1.877256e-01     3 Female
    -50               8093 3.000000e-01     4 Female
    -51               7375 3.501805e-01     3   Male
    -52               5255 3.339350e+00     1   Male
    -53               8445 3.000000e-01     2 Female
    -54               8959 5.451264e-01     2 Female
    -55               8400           NA     4   Male
    -56               7420 2.104693e+00     2   Male
    -57               5206           NA     2   Male
    -58               7431 3.826715e-01     3 Female
    -59               7230 3.926366e+01     3 Female
    -60               8208 1.129964e+00     4   Male
    -61               8538 3.501805e+00     1 Female
    -62               6125 7.542808e+01    13 Female
    -63               5767 4.800475e+01    13 Female
    -64               5487 1.000000e+00     6   Male
    -65               5539 4.068884e+01    13   Male
    -66               5759 3.000000e-01     5 Female
    -67               6845 4.377672e+01    13 Female
    -68               7170 1.193493e+02    14   Male
    -69               6588 6.977740e+01    13   Male
    -70               7939 1.373288e+02     8 Female
    -71               5006 1.642979e+02     7   Male
    -72               9180           NA     6 Female
    -73               9638 1.542808e+02    13   Male
    -74               7781 6.033058e-01     3   Male
    -75               6932 2.809917e-01     4   Male
    -76               8120 1.966942e+00     2   Male
    -77               9292 2.041322e+00    NA   Male
    -78               9228 2.115702e+00     5 Female
    -79               8185 4.663043e+02     3   Male
    -80               6797 3.000000e-01     3   Male
    -81               5970 1.500796e+02    14   Male
    -82               7219 1.543790e+02    11 Female
    -83               6870 2.561983e-01     7 Female
    -84               7653 1.596338e+02     7   Male
    -85               8824 1.732484e+02    11 Female
    -86               8311 4.641304e+02     9 Female
    -87               9458 3.736364e+01    14   Male
    -88               8275 1.572452e+02    13 Female
    -89               6786 3.000000e-01     1   Male
    -90               6595 3.000000e-01     1   Male
    -91               5264 8.264463e-02     4   Male
    -92               9188 6.776859e-01     1 Female
    -93               6611 7.272727e-01     2   Male
    -94               6840 2.066116e-01     3 Female
    -95               5663 1.966942e+00     2   Male
    -96               9611 3.000000e-01     1   Male
    -97               7717 3.000000e-01     2   Male
    -98               8374 2.809917e-01     2 Female
    -99               5134 8.016529e-01     4 Female
    -100              8122 1.818182e-01     5 Female
    -101              6192 1.818182e-01     5   Male
    -102              9668 8.264463e-02     6 Female
    -103              9577 3.422727e+01    14 Female
    -104              6403 8.743506e+00    14   Male
    -105              9464 3.000000e-01    10   Male
    -106              8157 1.641720e+02     6 Female
    -107              9451 4.049587e-01     6   Male
    -108              6615 1.001592e+02     8   Male
    -109              9074 4.489130e+02     6 Female
    -110              7479 1.101911e+02    12 Female
    -111              8946 4.440909e+01    12   Male
    -112              5296 1.288217e+02    14 Female
    -113              6238 2.840909e+01    15   Male
    -114              6303 1.003981e+02    12 Female
    -115              6662 8.512397e-01     4 Female
    -116              6251 1.322314e-01     4   Male
    -117              9110 1.297521e+00     3 Female
    -118              8480 1.570248e-01    NA   Male
    -119              5229 1.966942e+00     2 Female
    -120              9173 1.536624e+02     3   Male
    -121              9896 3.000000e-01    NA Female
    -122              5057 3.000000e-01     3 Female
    -123              7732 1.074380e+00     3   Male
    -124              6882 1.099174e+00     2 Female
    -125              9587 3.057851e-01     4 Female
    -126              9930 3.000000e-01    10 Female
    -127              6960 5.785124e-02     7 Female
    -128              6335 4.391304e+02    11 Female
    -129              6286 6.130435e+02     6 Female
    -130              9035 1.074380e-01    11   Male
    -131              5720 7.125796e+01     9   Male
    -132              7368 4.222727e+01     6   Male
    -133              5170 1.620223e+02    13 Female
    -134              6691 3.750000e+01    10 Female
    -135              6173 1.534236e+02     6 Female
    -136              8170 6.239130e+02    11 Female
    -137              9637 5.521739e+02     7   Male
    -138              9482 5.785124e-02     6 Female
    -139              7880 6.547945e-01     4 Female
    -140              6307 8.767123e-02     4 Female
    -141              8822 3.000000e-01     4   Male
    -142              8190 2.849315e+00     4 Female
    -143              7554 3.835616e-02     4   Male
    -144              6519 2.849315e-01     4   Male
    -145              9764 4.649315e+00     3   Male
    -146              8792 1.369863e-01     4 Female
    -147              6721 3.589041e-01     3   Male
    -148              9042 1.049315e+00     3   Male
    -149              7407 4.668998e+01    13 Female
    -150              7229 1.473510e+02     7 Female
    -151              7532 4.589744e+01    10   Male
    -152              6516 2.109589e-01     6   Male
    -153              7941 1.741722e+02    10 Female
    -154              8124 2.496503e+01    12 Female
    -155              7869 1.850993e+02    10   Male
    -156              5647 1.863014e-01    10   Male
    -157              9120 1.863014e-01    13   Male
    -158              6608 4.589744e+01    13 Female
    -159              8635 1.942881e+02     5 Female
    -160              9341 5.079646e+02     3 Female
    -161              9982 8.767123e-01     4   Male
    -162              6976 2.750685e+00     1   Male
    -163              6008 1.503311e+02     3 Female
    -164              5432 3.000000e-01     4   Male
    -165              5749 3.095890e-01     4   Male
    -166              6428 3.000000e-01     1   Male
    -167              5947 6.371681e+02     5 Female
    -168              6027 6.054795e-01     6 Female
    -169              5064 1.955298e+02    14 Female
    -170              5861 1.786424e+02     6   Male
    -171              6702 1.120861e+02    13 Female
    -172              7851 1.331954e+02     9   Male
    -173              8310 2.159292e+02    11   Male
    -174              5897 5.628319e+02    10   Male
    -175              9249 1.900662e+02     5 Female
    -176              9163 6.547945e-01    14   Male
    -177              6550 1.665753e+00     7   Male
    -178              5859 1.739238e+02    10   Male
    -179              5607 9.991722e+01     6   Male
    -180              8746 9.321192e+01     5   Male
    -181              5274 8.767123e-02     3 Female
    -182              9412           NA     4   Male
    -183              5691 6.794521e-01     2 Female
    -184              9016 5.808219e-01     3   Male
    -185              9128 1.369863e-01     3 Female
    -186              8539 2.060274e+00     2 Female
    -187              5703 1.610099e+02     3   Male
    -188              9573 4.082192e-01     5 Female
    -189              5852 8.273973e-01     2   Male
    -190              5971 4.601770e+02     3 Female
    -191              7015 1.389073e+02    14 Female
    -192              8221 3.867133e+01     9 Female
    -193              6752 9.260274e-01    14 Female
    -194              7436 5.918874e+01     9 Female
    -195              6869 1.870861e+02     8 Female
    -196              8947 4.328767e-01     7   Male
    -197              7360 6.301370e-02    13   Male
    -198              7494 3.000000e-01     8 Female
    -199              8243 1.548013e+02     6   Male
    -200              6176 5.819536e+01    12 Female
    -201              6818 1.724338e+02    14 Female
    -202              8083 1.932401e+01    15 Female
    -203              6711 2.164420e+00     2 Female
    -204              8890 9.757412e-01     4 Female
    -205              5576 1.509434e-01     3   Male
    -206              8396 1.509434e-01     3 Female
    -207              5986 7.766571e+01     3   Male
    -208              9758 4.319563e+01     4 Female
    -209              5444 1.752022e-01     3   Male
    -210              6394 3.094775e+01    14 Female
    -211              5694 1.266846e-01     8   Male
    -212              9604 2.919806e+01     7   Male
    -213              7895 9.545455e+00    14 Female
    -214              5141 2.735115e+01    13 Female
    -215              8034 1.314841e+02    13 Female
    -216              6566 3.643985e+01     7   Male
    -217              6827 1.498559e+02     8 Female
    -218              7400 9.363636e+00    10 Female
    -219              9094 2.479784e-01     9   Male
    -220              9474 5.390836e-02     9 Female
    -221              7984 8.787062e-01     3 Female
    -222              9524 1.994609e-01     4   Male
    -223              9598 3.000000e-01     4 Female
    -224              9664 3.000000e-01     4   Male
    -225              9910 5.390836e-03     2 Female
    -226              9216 4.177898e-01     1 Female
    -227              9706 3.000000e-01     3 Female
    -228              5320 2.479784e-01     2   Male
    -229              5256 2.964960e-02     3   Male
    -230              9006 2.964960e-01     5   Male
    -231              6413 5.148248e+00     2 Female
    -232              8717 1.994609e-01     2   Male
    -233              9873 3.000000e-01     9   Male
    -234              6699 1.779539e+02    13   Male
    -235              8228 3.290210e+02    10 Female
    -236              6494 3.000000e-01     6   Male
    -237              9294 1.809798e+02    13 Female
    -238              7680 4.905660e-01    11   Male
    -239              7534 1.266846e-01    10   Male
    -240              9920 1.543948e+02     8 Female
    -241              9814 1.379683e+02     9 Female
    -242              5363 6.153846e+02    10   Male
    -243              5842 1.474784e+02    14   Male
    -244              7992 3.000000e-01     1 Female
    -245              5565 1.024259e+00     2   Male
    -246              5258 4.444056e+02     3 Female
    -247              8200 3.000000e-01     2   Male
    -248              8795 2.504043e+00     3 Female
    -249              7676 3.000000e-01     2 Female
    -250              7029 3.000000e-01     3 Female
    -251              7535 7.816712e-02     5 Female
    -252              5026 3.000000e-01    10 Female
    -253              8630 5.390836e-02     7   Male
    -254              6989 1.494236e+02    13 Female
    -255              8454 5.972622e+01    15   Male
    -256              9741 6.361186e-01    11 Female
    -257              6418 1.837896e+02    10 Female
    -258              9922 1.320809e+02     3 Female
    -259              8504 1.571906e-01     2   Male
    -260              6491 1.520231e+02     3   Male
    -261              6002 3.000000e-01     3 Female
    -262              7127 3.000000e-01     3 Female
    -263              8540 1.823699e+02     4   Male
    -264              7115 3.000000e-01     3   Male
    -265              7268 2.173913e+00     2   Male
    -266              8279 2.142202e+01     4   Male
    -267              8880 3.000000e-01     2 Female
    -268              8076 3.408027e+00     8   Male
    -269              6250 4.155963e+01    11   Male
    -270              8542 9.698997e-02     6   Male
    -271              5393 1.238532e+01    14 Female
    -272              9197 9.528926e+00    14   Male
    -273              6651 1.916185e+02     5 Female
    -274              7473 1.060201e+00     5   Male
    -275              6589 3.679104e+02    10 Female
    -276              6867 4.288991e+01    13   Male
    -277              5413 9.971098e+01     6   Male
    -278              6765 3.000000e-01     5   Male
    -279              8933 1.208092e+02    12   Male
    -280              6294 3.000000e-01     2   Male
    -281              8688 6.688963e-03     3 Female
    -282              8108 2.505017e+00     1 Female
    -283              6926 1.481605e+00     1   Male
    -284              5880 3.000000e-01     1 Female
    -285              5529 5.183946e-01     2 Female
    -286              8963 3.000000e-01     5 Female
    -287              9594 1.872910e-01     5   Male
    -288              8075 3.678930e-01     4 Female
    -289              5680 3.000000e-01     2   Male
    -290              5617 4.529851e+02    NA Female
    -291              5080 3.169725e+01     6 Female
    -292              7719 3.000000e-01     8   Male
    -293              6780 4.922018e+01    15   Male
    -294              8768 2.548507e+02    11   Male
    -295              7031 1.661850e+02    14   Male
    -296              7740 9.164179e+02     6   Male
    -297              8855 3.678930e-01    10 Female
    -298              7241 1.236994e+02    12   Male
    -299              8156 6.705202e+01    14   Male
    -300              7333 3.834862e+01    10   Male
    -301              6906 1.963211e+00     1 Female
    -302              9511 3.000000e-01     3   Male
    -303              9336 2.474916e-01     2   Male
    -304              6644 3.000000e-01     3 Female
    -305              5554 2.173913e-01     4   Male
    -306              8094 8.193980e-01     3   Male
    -307              8836 2.444816e+00     4 Female
    -308              7147 3.000000e-01     4   Male
    -309              7745 1.571906e-01     1 Female
    -310              9345 1.849711e+02     7   Male
    -311              5606 6.119403e+02    11 Female
    -312              9766 3.000000e-01     7 Female
    -313              6666 4.280936e-01     5 Female
    -314              9965 9.698997e-02    10   Male
    -315              7927 3.678930e-02     9 Female
    -316              6266 4.832090e+02    13   Male
    -317              9487 1.390173e+02    11 Female
    -318              7089 3.000000e-01    13   Male
    -319              5731 6.555970e+02     9 Female
    -320              7962 1.526012e+02    15 Female
    -321              9532 3.000000e-01     7 Female
    -322              6687 7.222222e-01     4   Male
    -323              6570 7.724426e+01     1   Male
    -324              5781 3.000000e-01     1   Male
    -325              8935 6.111111e-01     2 Female
    -326              5780 1.555556e+00     2 Female
    -327              9029 3.055556e-01     3   Male
    -328              5668 1.500000e+00     2   Male
    -329              8203 1.470772e+02     3   Male
    -330              7381 1.694444e+00     4 Female
    -331              7734 3.138298e+02     7 Female
    -332              7257 1.414405e+02    11 Female
    -333              8418 1.990605e+02    10 Female
    -334              8259 4.212766e+02     5   Male
    -335              5587 3.000000e-01     8   Male
    -336              8499 3.000000e-01    15   Male
    -337              7897 6.478723e+02    14   Male
    -338              8300 3.000000e-01     2   Male
    -339              9691 2.222222e+00     2 Female
    -340              5873 3.000000e-01     2   Male
    -341              6690 2.055556e+00     5   Male
    -342              9970 2.777778e-02     4 Female
    -343              8978 8.333333e-02     3   Male
    -344              6181 1.032359e+02     5 Female
    -345              8218 1.611111e+00     4 Female
    -346              5387 8.333333e-02     2 Female
    -347              7850 2.333333e+00     1 Female
    -348              7326 5.755319e+02     7   Male
    -349              8448 1.686848e+02     8 Female
    -350              7264 1.111111e-01    NA   Male
    -351              8361 3.000000e-01     9   Male
    -352              7497 8.372340e+02     8 Female
    -353              5559 3.000000e-01     5   Male
    -354              7321 3.784504e+01    14   Male
    -355              8372 3.819149e+02    14   Male
    -356              5030 5.555556e-02     7 Female
    -357              6936 3.000000e+02    13 Female
    -358              9628 1.855950e+02     2   Male
    -359              8558 1.944444e-01     1 Female
    -360              7840 3.000000e-01     1   Male
    -361              5100 5.555556e-02     4 Female
    -362              8244 1.138889e+00     3   Male
    -363              9115 4.254237e+01     4 Female
    -364              5489 3.000000e-01     3   Male
    -365              5766 3.000000e-01     1   Male
    -366              5024 3.000000e-01     5 Female
    -367              8599 3.000000e-01     4 Female
    -368              8895 3.138298e+02     4 Female
    -369              7708 1.235908e+02     4   Male
    -370              7646 4.159574e+02    11   Male
    -371              6640 3.009685e+01    15 Female
    -372              8958 1.567850e+02    12 Female
    -373              6477 1.367432e+02    11 Female
    -374              7910 3.731235e+01     8 Female
    -375              7829 9.164927e+01    13   Male
    -376              7503 2.936170e+02    10 Female
    -377              5209 8.820459e+01    10 Female
    -378              6763 1.035491e+02    15   Male
    -379              8976 7.379958e+01     8 Female
    -380              9223 3.000000e-01    14   Male
    -381              7692 1.718750e+02     4   Male
    -382              7453 2.128527e+00     1   Male
    -383              9775 1.253918e+00     5 Female
    -384              9662 2.382445e-01     2   Male
    -385              8733 4.639498e-01     2 Female
    -386              5695 1.253918e-01     4   Male
    -387              7714 1.253918e-01     4   Male
    -388              9224 3.000000e-01     2 Female
    -389              7635 1.000000e+00     3   Male
    -390              7176 1.570043e+02    11   Male
    -391              6102 4.344086e+02    10 Female
    -392              7817 2.184953e+00     6   Male
    -393              9719 1.507837e+00    12 Female
    -394              9740 3.228840e-01    10 Female
    -395              9528 4.588024e+01     8   Male
    -396              7142 1.660560e+02     8   Male
    -397              5689 3.000000e-01    13   Male
    -398              5439 3.043011e+02    10   Male
    -399              6718 2.612903e+02    13 Female
    -400              6569 1.621767e+02    10   Male
    -401              9444 3.228840e-01     2   Male
    -402              6964 4.639498e-01     4 Female
    -403              6420 2.495298e+00     3 Female
    -404              9189 3.257053e+00     2 Female
    -405              9368 3.793103e-01     1 Female
    -406              6360           NA     3   Male
    -407              8196 6.896552e-02     3 Female
    -408              8297 3.000000e-01     4   Male
    -409              6674 1.423197e+00     5 Female
    -410              5269 3.000000e-01     5 Female
    -411              6599 3.000000e-01     1 Female
    -412              7713 1.786638e+02    11   Male
    -413              8644 3.279570e+02     6   Male
    -414              9680           NA    14 Female
    -415              6305 1.903017e+02     8   Male
    -416              8493 1.654095e+02     8 Female
    -417              5297 4.639498e-01     9 Female
    -418              7723 1.815733e+02     7   Male
    -419              7510 1.366771e+00     6   Male
    -420              5102 1.536050e-01    12 Female
    -421              7816 1.306587e+01     8   Male
    -422              5143 2.129032e+02    11 Female
    -423              7414 1.925647e+02    14   Male
    -424              5127 3.000000e-01     3 Female
    -425              5830 1.028213e+00     1 Female
    -426              8929 3.793103e-01     5 Female
    -427              7993 8.025078e-01     2 Female
    -428              8092 4.860215e+02     3 Female
    -429              9750 3.000000e-01     4 Female
    -430              6660 2.100313e-01     2   Male
    -431              8054 2.767665e+01     3 Female
    -432              6086 1.592476e+00     4   Male
    -433              6878 9.717868e-02     1 Female
    -434              8125 1.028213e+00     7 Female
    -435              9500 3.793103e-01    10   Male
    -436              8105 1.292026e+02    11   Male
    -437              9593 4.425150e+01     7 Female
    -438              5202 3.193548e+02    10 Female
    -439              7207 1.860991e+02    14 Female
    -440              5518 6.614420e-01     7 Female
    -441              9820 5.203762e-01    11   Male
    -442              6958 1.330819e+02    12   Male
    -443              9445 1.673491e+02    10 Female
    -444              8774 3.000000e-01     6   Male
    -445              9614 1.117457e+02    13   Male
    -446              9810 3.045509e+01     8 Female
    -447              7271 3.000000e-01     2   Male
    -448              8031 8.280255e-02     3 Female
    -449              7232 3.000000e-01     1 Female
    -450              7452 1.200637e+00     2 Female
    -451              5921 1.687898e-01    NA   Male
    -452              8136 7.367273e+02    NA Female
    -453              6605 8.280255e-02     4   Male
    -454              5125 5.127389e-01     4   Male
    -455              5911 1.974522e-01     1   Male
    -456              9644 7.993631e-01     2 Female
    -457              5760 3.000000e-01     2   Male
    -458              7055 3.298182e+02    12   Male
    -459              9064 9.736842e+01    12 Female
    -460              6925 3.000000e-01     8 Female
    -461              7757 3.000000e-01    14 Female
    -462              8527 4.214545e+02    13 Female
    -463              8521 3.000000e-01     6   Male
    -464              6260 2.578182e+02    11 Female
    -465              9578 2.261147e-01    11   Male
    -466              9570 3.000000e-01    10 Female
    -467              6246 1.883901e+02    12   Male
    -468              9622 9.458204e+01    14 Female
    -469              7661 3.000000e-01    11 Female
    -470              9374 3.000000e-01     1   Male
    -471              8446 7.707006e-01     2 Female
    -472              8332 5.032727e+02     3   Male
    -473              8008 1.544586e+00     3 Female
    -474              9365 1.431115e+02     5 Female
    -475              9819 3.000000e-01     3   Male
    -476              5173 1.458599e+00     1   Male
    -477              6722 1.247678e+02     4 Female
    -478              7668           NA     4 Female
    -479              8980 4.334545e+02     4   Male
    -480              5204 3.000000e-01     2 Female
    -481              6412 6.156364e+02     5 Female
    -482              6404 9.574303e+01     7   Male
    -483              5693 1.928019e+02     8   Male
    -484              8100 1.888545e+02    10   Male
    -485              9760 1.598297e+02     6 Female
    -486              6377 5.127389e-01     7   Male
    -487              6012 1.171053e+02    10 Female
    -488              6224           NA     6   Male
    -489              6561 2.547771e-02     6 Female
    -490              8475 1.707430e+02    15 Female
    -491              6629 3.000000e-01     5   Male
    -492              7200 1.869969e+02     3   Male
    -493              9453 4.731481e+01     5   Male
    -494              6449 1.988390e+02     3 Female
    -495              9452 3.000000e-01     5   Male
    -496              7162 8.808050e+01     5   Male
    -497              8962 2.003185e+00     1 Female
    -498              7328 3.000000e-01     1   Male
    -499              9097 3.509259e+01     7 Female
    -500              9131 9.365325e+01    14 Female
    -501              7280 3.000000e-01     9   Male
    -502              5783 3.736111e+01    10 Female
    -503              9895 1.674923e+02    10 Female
    -504              7986 8.808050e+01    11   Male
    -505              7146 1.656347e+02    11 Female
    -506              8671 3.722222e+01    12 Female
    -507              5273 6.756364e+02    11 Female
    -508              5063 3.000000e-01    12   Male
    -509              6729 1.698142e+02    12   Male
    -510              9085 1.628483e+02    10 Female
    -511              9929 5.985130e-01     1   Male
    -512              8479 1.903346e+00     2 Female
    -513              7395 3.000000e-01     4   Male
    -514              6374 3.000000e-01     2   Male
    -515              7878 8.996283e-01     3   Male
    -516              9603 3.977695e-01     3 Female
    -517              7994 3.000000e-01     2   Male
    -518              5277 3.000000e-01     4   Male
    -519              5054 3.000000e-01     3   Male
    -520              5440 3.000000e-01     1 Female
    -521              6551 7.446809e+02     4   Male
    -522              5281 6.095745e+02    12 Female
    -523              7145 1.427445e+02     6   Male
    -524              5275 3.000000e-01     7 Female
    -525              9542 2.973978e-02     7   Male
    -526              9371 3.977695e-01    13 Female
    -527              5598 4.095745e+02     8 Female
    -528              7148 4.595745e+02     7   Male
    -529              5624 3.000000e-01     8 Female
    -530              6998 1.976341e+02     8 Female
    -531              9286 3.776596e+02    11 Female
    -532              7589 1.777603e+02    14 Female
    -533              7095 4.312268e-01     3   Male
    -534              5455 6.765957e+02     2 Female
    -535              6257 7.978723e+02     2   Male
    -536              8627 9.665427e-02     3   Male
    -537              9786 1.879338e+02     2   Male
    -538              8176 4.358670e+01     2 Female
    -539              9198 3.000000e-01     3 Female
    -540              6586 3.000000e-01     2   Male
    -541              8850 2.638955e+01     5   Male
    -542              9560 3.180523e+01    10 Female
    -543              7144 1.746845e+02    14   Male
    -544              8230 1.876972e+02     9   Male
    -545              7559 1.044164e+02     6   Male
    -546              5312 1.202681e+02     7   Male
    -547              6560 1.630915e+02    14 Female
    -548              6091 1.276025e+02     7 Female
    -549              5578 8.880126e+01     7   Male
    -550              5837 3.563830e+02     9   Male
    -551              8347 2.212766e+02    14   Male
    -552              6453 1.969121e+01    10 Female
    -553              5758 3.755319e+02    13 Female
    -554              5569 1.214511e+02     5   Male
    -555              8766 1.034700e+02     4 Female
    -556              8002 3.000000e-01     4 Female
    -557              7839 3.643123e-01     5 Female
    -558              5434 6.319703e-02     4 Female
    -559              7636 3.000000e-01     4   Male
    -560              6164 3.000000e-01     4   Male
    -561              9243 3.000000e-01     3 Female
    -562              5872 3.000000e-01     1 Female
    -563              8079 3.000000e-01     4   Male
    -564              9762 3.000000e-01     1   Male
    -565              9476 3.000000e-01     1 Female
    -566              8345 3.000000e-01     7   Male
    -567              8128 1.664038e+02    13 Female
    -568              7956 2.946809e+02    10 Female
    -569              8677 4.391924e+01    14   Male
    -570              5881 1.874606e+02    12 Female
    -571              7498 1.143533e+02    14   Male
    -572              8134 1.600158e+02     8   Male
    -573              7748 1.635688e-01     7   Male
    -574              7990 8.809148e+01    11 Female
    -575              6184 1.337539e+02     8   Male
    -576              6339 1.985804e+02    12   Male
    -577              5113 1.578864e+02     9 Female
    -578              9449 3.000000e-01     5 Female
    -579              8110 3.000000e-01     4   Male
    -580              9307 1.953642e-01     3 Female
    -581              5555 1.119205e+00     2   Male
    -582              9152 2.523636e+02     2   Male
    -583              7969 3.000000e-01     3   Male
    -584              6116 4.844371e+00     4 Female
    -585              8294 3.000000e-01     4   Male
    -586              8938 1.492553e+02     4 Female
    -587              9539 1.993617e+02     5   Male
    -588              9470 2.847682e-01     3 Female
    -589              6677 3.145695e-01     6 Female
    -590              8752 3.000000e-01     3   Male
    -591              5574 3.406429e+01    11 Female
    -592              5989 6.595745e+01    11   Male
    -593              9813 3.000000e-01     7   Male
    -594              6150 2.174545e+02     8   Male
    -595              5730           NA     6 Female
    -596              8038 5.957447e+01    10 Female
    -597              5964 7.236364e+02     8 Female
    -598              9043 3.000000e-01     8   Male
    -599              5095 3.000000e-01     9 Female
    -600              8922 3.000000e-01     8   Male
    -601              5469 2.676364e+02    13   Male
    -602              6726 1.891489e+02    11   Male
    -603              7495 3.036364e+02     8 Female
    -604              8159 3.000000e-01     2 Female
    -605              6709 3.000000e-01     4   Male
    -606              5855 3.000000e-01     2   Male
    -607              6058 3.000000e-01     2 Female
    -608              7292 3.000000e-01     4   Male
    -609              6437 1.447020e+00     2   Male
    -610              9326 2.130909e+02     4 Female
    -611              8222 1.357616e-01     2 Female
    -612              6789 3.000000e-01     4 Female
    -613              6348 3.000000e-01     1 Female
    -614              5958 5.534545e+02     4 Female
    -615              9211 1.891489e+02    12 Female
    -616              9450 7.202128e+01     7 Female
    -617              6540 3.250287e+01    11   Male
    -618              8796 1.655629e-02     6   Male
    -619              7971 3.123636e+02     8   Male
    -620              7549 3.000000e-01    14   Male
    -621              9799 7.138298e+01    11   Male
    -622              7013 3.000000e-01     7 Female
    -623              5599 6.946809e+01    14 Female
    -624              8601 4.012629e+01     6   Male
    -625              7383 1.629787e+02    13 Female
    -626              6656 1.508511e+02    13 Female
    -627              5641 1.655629e-02     3   Male
    -628              6222 3.000000e-01     1   Male
    -629              7674 4.635762e-02     3   Male
    -630              5293 3.000000e-01     1 Female
    -631              6715 3.000000e-01     1 Female
    -632              7057 3.000000e-01     2   Male
    -633              7072 1.942553e+02     4   Male
    -634              6380 3.690909e+02     4   Male
    -635              6762 3.000000e-01     2 Female
    -636              5799 3.000000e-01     4 Female
    -637              6681 2.847682e+00     5   Male
    -638              8755 1.435106e+02     3 Female
    -639              6896 3.000000e-01     3   Male
    -640              5945 4.752009e+01     6 Female
    -641              5035 2.621125e+01    11 Female
    -642              6776 1.055319e+02     9 Female
    -643              7863 3.000000e-01     7 Female
    -644              9836 1.149007e+00     8   Male
    -645              7860 2.927273e+02    NA Female
    -646              5248 3.000000e-01     8 Female
    -647              5677 3.000000e-01    14 Female
    -648              9576 4.839265e+01    10   Male
    -649              5824 3.000000e-01    10   Male
    -650              9184 3.000000e-01    11 Female
    -651              5397 2.251656e-01    13 Female
    -
    -
    df$slum <- NULL # this is the same as above
    +
    df$slum <- NULL # this is the same as above
    -

    We can also grab the age column using the $ operator.

    +

    We can also grab the age column using the $ operator; again, this selects the variable for all of the rows.

    -
    df$age
    -
    -
      [1] 3.176895e-01 3.436823e+00 3.000000e-01 1.432363e+02 4.476534e-01
    -  [6] 2.527076e-02 6.101083e-01 3.000000e-01 2.916968e+00 1.649819e+00
    - [11] 4.574007e+00 1.583904e+02           NA 1.065068e+02 1.113870e+02
    - [16] 4.144893e+01 3.000000e-01 2.527076e-01 8.159247e+01 1.825342e+02
    - [21] 4.244656e+01 1.193493e+02 3.000000e-01 3.000000e-01 9.025271e-01
    - [26] 3.501805e-01 3.000000e-01 1.227437e+00 1.702055e+02 3.000000e-01
    - [31] 4.801444e-01 2.527076e-02 3.000000e-01 5.776173e-02 4.801444e-01
    - [36] 3.826715e-01 3.000000e-01 4.048558e+02 3.000000e-01 5.451264e-01
    - [41] 3.000000e-01 5.590753e+01 2.202166e-01 1.709760e+02 1.227437e+00
    - [46] 4.567527e+02 4.838480e+01 1.227437e-01 1.877256e-01 3.000000e-01
    - [51] 3.501805e-01 3.339350e+00 3.000000e-01 5.451264e-01           NA
    - [56] 2.104693e+00           NA 3.826715e-01 3.926366e+01 1.129964e+00
    - [61] 3.501805e+00 7.542808e+01 4.800475e+01 1.000000e+00 4.068884e+01
    - [66] 3.000000e-01 4.377672e+01 1.193493e+02 6.977740e+01 1.373288e+02
    - [71] 1.642979e+02           NA 1.542808e+02 6.033058e-01 2.809917e-01
    - [76] 1.966942e+00 2.041322e+00 2.115702e+00 4.663043e+02 3.000000e-01
    - [81] 1.500796e+02 1.543790e+02 2.561983e-01 1.596338e+02 1.732484e+02
    - [86] 4.641304e+02 3.736364e+01 1.572452e+02 3.000000e-01 3.000000e-01
    - [91] 8.264463e-02 6.776859e-01 7.272727e-01 2.066116e-01 1.966942e+00
    - [96] 3.000000e-01 3.000000e-01 2.809917e-01 8.016529e-01 1.818182e-01
    -[101] 1.818182e-01 8.264463e-02 3.422727e+01 8.743506e+00 3.000000e-01
    -[106] 1.641720e+02 4.049587e-01 1.001592e+02 4.489130e+02 1.101911e+02
    -[111] 4.440909e+01 1.288217e+02 2.840909e+01 1.003981e+02 8.512397e-01
    -[116] 1.322314e-01 1.297521e+00 1.570248e-01 1.966942e+00 1.536624e+02
    -[121] 3.000000e-01 3.000000e-01 1.074380e+00 1.099174e+00 3.057851e-01
    -[126] 3.000000e-01 5.785124e-02 4.391304e+02 6.130435e+02 1.074380e-01
    -[131] 7.125796e+01 4.222727e+01 1.620223e+02 3.750000e+01 1.534236e+02
    -[136] 6.239130e+02 5.521739e+02 5.785124e-02 6.547945e-01 8.767123e-02
    -[141] 3.000000e-01 2.849315e+00 3.835616e-02 2.849315e-01 4.649315e+00
    -[146] 1.369863e-01 3.589041e-01 1.049315e+00 4.668998e+01 1.473510e+02
    -[151] 4.589744e+01 2.109589e-01 1.741722e+02 2.496503e+01 1.850993e+02
    -[156] 1.863014e-01 1.863014e-01 4.589744e+01 1.942881e+02 5.079646e+02
    -[161] 8.767123e-01 2.750685e+00 1.503311e+02 3.000000e-01 3.095890e-01
    -[166] 3.000000e-01 6.371681e+02 6.054795e-01 1.955298e+02 1.786424e+02
    -[171] 1.120861e+02 1.331954e+02 2.159292e+02 5.628319e+02 1.900662e+02
    -[176] 6.547945e-01 1.665753e+00 1.739238e+02 9.991722e+01 9.321192e+01
    -[181] 8.767123e-02           NA 6.794521e-01 5.808219e-01 1.369863e-01
    -[186] 2.060274e+00 1.610099e+02 4.082192e-01 8.273973e-01 4.601770e+02
    -[191] 1.389073e+02 3.867133e+01 9.260274e-01 5.918874e+01 1.870861e+02
    -[196] 4.328767e-01 6.301370e-02 3.000000e-01 1.548013e+02 5.819536e+01
    -[201] 1.724338e+02 1.932401e+01 2.164420e+00 9.757412e-01 1.509434e-01
    -[206] 1.509434e-01 7.766571e+01 4.319563e+01 1.752022e-01 3.094775e+01
    -[211] 1.266846e-01 2.919806e+01 9.545455e+00 2.735115e+01 1.314841e+02
    -[216] 3.643985e+01 1.498559e+02 9.363636e+00 2.479784e-01 5.390836e-02
    -[221] 8.787062e-01 1.994609e-01 3.000000e-01 3.000000e-01 5.390836e-03
    -[226] 4.177898e-01 3.000000e-01 2.479784e-01 2.964960e-02 2.964960e-01
    -[231] 5.148248e+00 1.994609e-01 3.000000e-01 1.779539e+02 3.290210e+02
    -[236] 3.000000e-01 1.809798e+02 4.905660e-01 1.266846e-01 1.543948e+02
    -[241] 1.379683e+02 6.153846e+02 1.474784e+02 3.000000e-01 1.024259e+00
    -[246] 4.444056e+02 3.000000e-01 2.504043e+00 3.000000e-01 3.000000e-01
    -[251] 7.816712e-02 3.000000e-01 5.390836e-02 1.494236e+02 5.972622e+01
    -[256] 6.361186e-01 1.837896e+02 1.320809e+02 1.571906e-01 1.520231e+02
    -[261] 3.000000e-01 3.000000e-01 1.823699e+02 3.000000e-01 2.173913e+00
    -[266] 2.142202e+01 3.000000e-01 3.408027e+00 4.155963e+01 9.698997e-02
    -[271] 1.238532e+01 9.528926e+00 1.916185e+02 1.060201e+00 3.679104e+02
    -[276] 4.288991e+01 9.971098e+01 3.000000e-01 1.208092e+02 3.000000e-01
    -[281] 6.688963e-03 2.505017e+00 1.481605e+00 3.000000e-01 5.183946e-01
    -[286] 3.000000e-01 1.872910e-01 3.678930e-01 3.000000e-01 4.529851e+02
    -[291] 3.169725e+01 3.000000e-01 4.922018e+01 2.548507e+02 1.661850e+02
    -[296] 9.164179e+02 3.678930e-01 1.236994e+02 6.705202e+01 3.834862e+01
    -[301] 1.963211e+00 3.000000e-01 2.474916e-01 3.000000e-01 2.173913e-01
    -[306] 8.193980e-01 2.444816e+00 3.000000e-01 1.571906e-01 1.849711e+02
    -[311] 6.119403e+02 3.000000e-01 4.280936e-01 9.698997e-02 3.678930e-02
    -[316] 4.832090e+02 1.390173e+02 3.000000e-01 6.555970e+02 1.526012e+02
    -[321] 3.000000e-01 7.222222e-01 7.724426e+01 3.000000e-01 6.111111e-01
    -[326] 1.555556e+00 3.055556e-01 1.500000e+00 1.470772e+02 1.694444e+00
    -[331] 3.138298e+02 1.414405e+02 1.990605e+02 4.212766e+02 3.000000e-01
    -[336] 3.000000e-01 6.478723e+02 3.000000e-01 2.222222e+00 3.000000e-01
    -[341] 2.055556e+00 2.777778e-02 8.333333e-02 1.032359e+02 1.611111e+00
    -[346] 8.333333e-02 2.333333e+00 5.755319e+02 1.686848e+02 1.111111e-01
    -[351] 3.000000e-01 8.372340e+02 3.000000e-01 3.784504e+01 3.819149e+02
    -[356] 5.555556e-02 3.000000e+02 1.855950e+02 1.944444e-01 3.000000e-01
    -[361] 5.555556e-02 1.138889e+00 4.254237e+01 3.000000e-01 3.000000e-01
    -[366] 3.000000e-01 3.000000e-01 3.138298e+02 1.235908e+02 4.159574e+02
    -[371] 3.009685e+01 1.567850e+02 1.367432e+02 3.731235e+01 9.164927e+01
    -[376] 2.936170e+02 8.820459e+01 1.035491e+02 7.379958e+01 3.000000e-01
    -[381] 1.718750e+02 2.128527e+00 1.253918e+00 2.382445e-01 4.639498e-01
    -[386] 1.253918e-01 1.253918e-01 3.000000e-01 1.000000e+00 1.570043e+02
    -[391] 4.344086e+02 2.184953e+00 1.507837e+00 3.228840e-01 4.588024e+01
    -[396] 1.660560e+02 3.000000e-01 3.043011e+02 2.612903e+02 1.621767e+02
    -[401] 3.228840e-01 4.639498e-01 2.495298e+00 3.257053e+00 3.793103e-01
    -[406]           NA 6.896552e-02 3.000000e-01 1.423197e+00 3.000000e-01
    -[411] 3.000000e-01 1.786638e+02 3.279570e+02           NA 1.903017e+02
    -[416] 1.654095e+02 4.639498e-01 1.815733e+02 1.366771e+00 1.536050e-01
    -[421] 1.306587e+01 2.129032e+02 1.925647e+02 3.000000e-01 1.028213e+00
    -[426] 3.793103e-01 8.025078e-01 4.860215e+02 3.000000e-01 2.100313e-01
    -[431] 2.767665e+01 1.592476e+00 9.717868e-02 1.028213e+00 3.793103e-01
    -[436] 1.292026e+02 4.425150e+01 3.193548e+02 1.860991e+02 6.614420e-01
    -[441] 5.203762e-01 1.330819e+02 1.673491e+02 3.000000e-01 1.117457e+02
    -[446] 3.045509e+01 3.000000e-01 8.280255e-02 3.000000e-01 1.200637e+00
    -[451] 1.687898e-01 7.367273e+02 8.280255e-02 5.127389e-01 1.974522e-01
    -[456] 7.993631e-01 3.000000e-01 3.298182e+02 9.736842e+01 3.000000e-01
    -[461] 3.000000e-01 4.214545e+02 3.000000e-01 2.578182e+02 2.261147e-01
    -[466] 3.000000e-01 1.883901e+02 9.458204e+01 3.000000e-01 3.000000e-01
    -[471] 7.707006e-01 5.032727e+02 1.544586e+00 1.431115e+02 3.000000e-01
    -[476] 1.458599e+00 1.247678e+02           NA 4.334545e+02 3.000000e-01
    -[481] 6.156364e+02 9.574303e+01 1.928019e+02 1.888545e+02 1.598297e+02
    -[486] 5.127389e-01 1.171053e+02           NA 2.547771e-02 1.707430e+02
    -[491] 3.000000e-01 1.869969e+02 4.731481e+01 1.988390e+02 3.000000e-01
    -[496] 8.808050e+01 2.003185e+00 3.000000e-01 3.509259e+01 9.365325e+01
    -[501] 3.000000e-01 3.736111e+01 1.674923e+02 8.808050e+01 1.656347e+02
    -[506] 3.722222e+01 6.756364e+02 3.000000e-01 1.698142e+02 1.628483e+02
    -[511] 5.985130e-01 1.903346e+00 3.000000e-01 3.000000e-01 8.996283e-01
    -[516] 3.977695e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[521] 7.446809e+02 6.095745e+02 1.427445e+02 3.000000e-01 2.973978e-02
    -[526] 3.977695e-01 4.095745e+02 4.595745e+02 3.000000e-01 1.976341e+02
    -[531] 3.776596e+02 1.777603e+02 4.312268e-01 6.765957e+02 7.978723e+02
    -[536] 9.665427e-02 1.879338e+02 4.358670e+01 3.000000e-01 3.000000e-01
    -[541] 2.638955e+01 3.180523e+01 1.746845e+02 1.876972e+02 1.044164e+02
    -[546] 1.202681e+02 1.630915e+02 1.276025e+02 8.880126e+01 3.563830e+02
    -[551] 2.212766e+02 1.969121e+01 3.755319e+02 1.214511e+02 1.034700e+02
    -[556] 3.000000e-01 3.643123e-01 6.319703e-02 3.000000e-01 3.000000e-01
    -[561] 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01 3.000000e-01
    -[566] 3.000000e-01 1.664038e+02 2.946809e+02 4.391924e+01 1.874606e+02
    -[571] 1.143533e+02 1.600158e+02 1.635688e-01 8.809148e+01 1.337539e+02
    -[576] 1.985804e+02 1.578864e+02 3.000000e-01 3.000000e-01 1.953642e-01
    -[581] 1.119205e+00 2.523636e+02 3.000000e-01 4.844371e+00 3.000000e-01
    -[586] 1.492553e+02 1.993617e+02 2.847682e-01 3.145695e-01 3.000000e-01
    -[591] 3.406429e+01 6.595745e+01 3.000000e-01 2.174545e+02           NA
    -[596] 5.957447e+01 7.236364e+02 3.000000e-01 3.000000e-01 3.000000e-01
    -[601] 2.676364e+02 1.891489e+02 3.036364e+02 3.000000e-01 3.000000e-01
    -[606] 3.000000e-01 3.000000e-01 3.000000e-01 1.447020e+00 2.130909e+02
    -[611] 1.357616e-01 3.000000e-01 3.000000e-01 5.534545e+02 1.891489e+02
    -[616] 7.202128e+01 3.250287e+01 1.655629e-02 3.123636e+02 3.000000e-01
    -[621] 7.138298e+01 3.000000e-01 6.946809e+01 4.012629e+01 1.629787e+02
    -[626] 1.508511e+02 1.655629e-02 3.000000e-01 4.635762e-02 3.000000e-01
    -[631] 3.000000e-01 3.000000e-01 1.942553e+02 3.690909e+02 3.000000e-01
    -[636] 3.000000e-01 2.847682e+00 1.435106e+02 3.000000e-01 4.752009e+01
    -[641] 2.621125e+01 1.055319e+02 3.000000e-01 1.149007e+00 2.927273e+02
    -[646] 3.000000e-01 3.000000e-01 4.839265e+01 3.000000e-01 3.000000e-01
    -[651] 2.251656e-01
    -
    +
    df$age

    Using indexing to subset by rows

    We can use indexing to also subset by rows. For example, here we pull the 100th observation/row.

    -
    df[100,] 
    +
    df[100,] 
    -
        IgG_concentration       age age gender     slum
    -100              8122 0.1818182   5 Female Non slum
    +
        observation_id IgG_concentration age gender     slum
    +100           8122         0.1818182   5 Female Non slum

    And, here we pull the age of the 100th observation/row.

    -
    df[100,"age"] 
    +
    df[100,"age"] 
    -
    [1] 0.1818182
    +
    [1] 5
    @@ -2424,8 +1387,8 @@

    Logical operators

    != -not equal to - + +not equal to x&y @@ -2454,26 +1417,26 @@

    Logical operators

    Logical operators examples

    Let’s practice. First, here is a reminder of what the number.object contains.

    -
    number.object
    +
    number.object
    [1] 3

    Now, we will use logical operators to evaluate the object.

    -
    number.object<4
    +
    number.object<4
    [1] TRUE
    -
    number.object>=3
    +
    number.object>=3
    [1] TRUE
    -
    number.object!=5
    +
    number.object!=5
    [1] TRUE
    -
    number.object %in% c(6,7,2)
    +
    number.object %in% c(6,7,2)
    [1] FALSE
    @@ -2485,41 +1448,54 @@

    Using indexing and logical operators to rename columns

  • We can assign the column names from data frame df to an object cn, modify cn directly using indexing and logical operators, and finally reassign the column names, cn, back to the data frame df:
  • -
    cn <- colnames(df)
    -cn
    +
    cn <- colnames(df)
    +cn
    -
    [1] "IgG_concentration" "age"               "age"              
    +
    [1] "observation_id"    "IgG_concentration" "age"              
     [4] "gender"            "slum"             
    +
    cn=="IgG_concentration"
    +
    +
    [1] FALSE  TRUE FALSE FALSE FALSE
    +
    cn[cn=="IgG_concentration"] <-"IgG_concentration_mIU" #rename cn to "IgG_concentration_mIU" when cn is "IgG_concentration"
    -colnames(df) <- cn
    +colnames(df) <- cn +colnames(df)
    +
    +
    [1] "observation_id"        "IgG_concentration_mIU" "age"                  
    +[4] "gender"                "slum"                 
    +

    Note, I am resetting the column name back to the original name for the sake of the rest of the module.

    -
    colnames(df)[colnames(df)=="IgG_concentration_mIU"] <- "IgG_concentration" #reset
    +
    colnames(df)[colnames(df)=="IgG_concentration_mIU"] <- "IgG_concentration" #reset

    Using indexing and logical operators to subset data

    In this example, we subset by rows, pull only observations with an age of less than or equal to 10, and save the subset data to df_lte10. Note that the logical expression df$age<=10 comes before the comma because I want to subset by rows (the first dimension).

    -
    df_lte10 <- df[df$age<=10, ]
    -
    -

    In this example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.

    -
    -
    df_lte5_gt10 <- df[df$age<=5 | df$age>10, ]
    +
    df_lte10 <- df[df$age<=10, ]

    Let's check that my subsets worked using the summary() function.

    summary(df_lte10$age)
    -
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
    -0.005391 0.300000 0.300000 0.724742 0.640788 9.545455       10 
    +
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    +    1.0     3.0     4.0     4.8     7.0    10.0       9 
    -
    summary(df_lte5_gt10$age)
    +
    +


    +

    In the next example, we subset by rows and pull only observations with an age of less than or equal to 5 OR greater than 10.

    +
    +
    df_lte5_gt10 <- df[df$age<=5 | df$age>10, ]
    +
    +

    Let's check that my subsets worked using the summary() function.

    +
    +
    summary(df_lte5_gt10$age)
    -
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
    -  0.0054   0.3000   1.6018  87.9886 142.8362 916.4179       10 
    +
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    +   1.00    2.50    4.00    6.08   11.00   15.00       9 
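    The NA's count in both summaries reflects how R treats a missing value inside a logical index: the comparison returns NA for that row, and indexing with NA returns a row filled with NA instead of dropping it. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not the course data) and shows one way, wrapping the condition in which(), to keep only rows where the condition is TRUE.

```r
# Hypothetical toy data; `toy` and its values are made up for illustration
toy <- data.frame(id = 1:5, age = c(2, 12, NA, 7, NA))

# The comparison is NA wherever age is NA
toy$age <= 10

# An NA in the logical index yields a row filled with NA
kept_na <- toy[toy$age <= 10, ]
nrow(kept_na)      # two real matches plus two all-NA rows

# which() returns only the positions that are TRUE, so NA rows are dropped
dropped_na <- toy[which(toy$age <= 10), ]
nrow(dropped_na)   # only the two real matches
```

    Whether the all-NA rows should be kept depends on the analysis; the point is only that the bracket form does not remove them on its own.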
    @@ -2528,10 +1504,12 @@

    Missing values

    Missing data need to be carefully described and dealt with in data analysis. Understanding the different types of missing data, and how you can identify them, is the first step in data cleaning.

    Types of “missing” values:

      -
    • NA - general missing data
    • +
    • NA - "Not Available", general missing data
    • NaN - stands for “Not a Number”, happens when you do 0/0.
    • Inf and -Inf - Infinity, happens when you divide a positive number (or negative number) by 0.
    • blank space - sometimes when data is read in, there is a blank space left
    • +
    • an empty string (e.g., "")
    • +
    • NULL- undefined value that represents something that does not exist
    @@ -2540,45 +1518,50 @@

    Logical operators to help identify missing data

    -operator       operator option   description
    +operator       description
     is.na          is NaN or NA
     is.nan         is NaN
     !is.na         is not NaN or NA
     !is.nan        is not NaN
     is.infinite    is infinite
     any            are any TRUE
    +all            all are TRUE
     which          which are TRUE
    @@ -2586,20 +1569,20 @@

    Logical operators to help identify missing data

    More logical operators examples

    -
    test <- c(0,NA, -1)/0
    -test
    +
    test <- c(0,NA, -1)/0
    +test
    [1]  NaN   NA -Inf
    -
    is.na(test)
    +
    is.na(test)
    [1]  TRUE  TRUE FALSE
    -
    is.nan(test)
    +
    is.nan(test)
    [1]  TRUE FALSE FALSE
    -
    is.infinite(test)
    +
    is.infinite(test)
    [1] FALSE FALSE  TRUE
    @@ -2609,22 +1592,22 @@

    More logical operators examples

    More logical operators examples

    any(is.na(x)) means do we have any NA’s in the object x?

    -
    any(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE
    +
    any(is.na(df$IgG_concentration)) # are there any NAs - YES/TRUE
    -
    [1] FALSE
    +
    [1] TRUE
    -
    any(is.na(df$slum)) # are there any NAs- NO/FALSE
    +
    any(is.na(df$slum)) # are there any NAs- NO/FALSE
    [1] FALSE

    which(is.na(x)) means which of the elements in object x are NA’s?

    -
    which(is.na(df$IgG_concentration)) 
    +
    which(is.na(df$IgG_concentration)) 
    -
    integer(0)
    +
     [1]  13  55  57  72 182 406 414 478 488 595
    -
    which(is.na(df$slum)) 
    +
    which(is.na(df$slum)) 
    integer(0)
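
    The same helpers can also count missing values. A short sketch on a made-up vector (igg is a hypothetical name, not the course data):

```r
# Hypothetical measurements with two missing values
igg <- c(0.3, NA, 1.6, NA, 9.5)

any(is.na(igg))    # is at least one value missing?
all(is.na(igg))    # are all values missing?
which(is.na(igg))  # positions of the missing values
sum(is.na(igg))    # how many values are missing
mean(is.na(igg))   # proportion of values that are missing
```

    sum() and mean() work here because TRUE is treated as 1 and FALSE as 0.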
    @@ -2634,7 +1617,7 @@

    More logical operators examples

    subset() function

    The Base R subset() function is a slightly easier way to select variables and observations.

    -
    ?subset
    +
    ?subset
    Registered S3 method overwritten by 'printr':
       method                from     
    @@ -2720,15 +1703,15 @@ 

    subset() function

    Subsetting using the subset() function

    Here are a few examples using the subset() function

    -
    df_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))
    -df_lt5_f <- subset(df, df$age<=5 & gender=="Female", select=c(IgG_concentration, slum))
    +
    df_lte10_v2 <- subset(df, df$age<=10, select=c(IgG_concentration, age))
    +df_lt5_f <- subset(df, df$age<=5 & gender=="Female", select=c(IgG_concentration, slum))

    subset() function vs logical operators

    subset() automatically drops rows where the condition evaluates to NA, which is different from indexing with a logical vector, where those rows are kept as rows of NAs.

    -
    summary(df_lte10$age)
    +
    summary(df_lte10$age) #created with indexing
    @@ -2744,18 +1727,18 @@

    subset() function vs logical operators

    - - - - - - + + + + + +
    0.00539080.30.30.72474210.64078769.5454541344.87 109
    -
    summary(df_lte10_v2$age)
    +
    summary(df_lte10_v2$age) #created with the subset function
    @@ -2770,12 +1753,12 @@

    subset() function vs logical operators

    - - - - - - + + + + + +
    0.00539080.30.30.72474210.64078769.5454541344.8710
    @@ -2783,24 +1766,24 @@

    subset() function vs logical operators

    We can also see this by looking at the number of rows in each dataset.

    -
    nrow(df_lte10)
    +
    nrow(df_lte10)
    -
    [1] 370
    +
    [1] 504
    -
    nrow(df_lte10_v2)
    +
    nrow(df_lte10_v2)
    -
    [1] 360
    +
    [1] 495
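
    The row-count difference can be reproduced on a tiny example. This is a sketch with an invented data frame (toy), not the course data:

```r
# Toy data frame with one missing age
toy <- data.frame(id = 1:4, age = c(3, NA, 12, 7))

by_index  <- toy[toy$age <= 10, ]         # logical indexing keeps a row of NAs
by_subset <- subset(toy, age <= 10)       # subset() drops rows where the test is NA
by_which  <- toy[which(toy$age <= 10), ]  # which() also drops the NA

nrow(by_index)
nrow(by_subset)
nrow(by_which)
by_index
```

    Wrapping the condition in which() is one way to get the subset() behavior while still using bracket indexing.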

    Summary

      -
    • colnames(), str() and summary()functions from Base R are great functions to assess the data type and some summary statistics
    • -
    • There are three basic indexing syntax: [ ], [[ ]] and $
    • +
    • colnames(), str() and summary() are Base R functions for assessing the data type and getting some summary statistics
    • +
    • There are three basic indexing operators: [, [[ and $
    • Indexing can be used to extract part of an object (e.g., subset data) and to replace parts of an object (e.g., rename variables / columns)
    • Logical operators can be evaluated on object(s) in order to return a binary response of TRUE/FALSE, and are useful for decision rules for indexing
    • -
    • There are 5 “types” of missing values, the most common being “NA”
    • +
    • There are 7 “types” of missing values, the most common being “NA”
    • Logical operators meant to determine missing values are very helpful for data cleaning
    • The Base R subset() function is a slightly easier way to select variables and observations.
    diff --git a/docs/modules/Module07-VarCreationClassesSummaries.html b/docs/modules/Module07-VarCreationClassesSummaries.html index 5d1113b..76173e9 100644 --- a/docs/modules/Module07-VarCreationClassesSummaries.html +++ b/docs/modules/Module07-VarCreationClassesSummaries.html @@ -8,7 +8,7 @@ - + @@ -407,43 +407,6 @@

    Module 7: Variable Creation, Classes, and Summaries

    -
    -

    Learning Objectives

    @@ -628,7 +591,7 @@

    Import data for this module

    Adding new columns

    -

    You can add a new column, called newcol to df, using the $ operator:

    +

    You can add a new column, called log_IgG to df, using the $ operator:

    df$log_IgG <- log(df$IgG_concentration)
     head(df,3)
    @@ -673,6 +636,161 @@

    Adding new columns

    +

    Note the use of an underscore in the variable name rather than a space. This is good coding practice and makes calling variables much less prone to error.

    +
    +
    +

    Adding new columns

    +

    We can also add a new column using the transform() function:

    +
    +
    +
    Transform an Object, for Example a Data Frame
    +
    +Description:
    +
    +     'transform' is a generic function, which-at least currently-only
    +     does anything useful with data frames.  'transform.default'
    +     converts its first argument to a data frame if possible and calls
    +     'transform.data.frame'.
    +
    +Usage:
    +
    +     transform(`_data`, ...)
    +     
    +Arguments:
    +
    +   _data: The object to be transformed
    +
    +     ...: Further arguments of the form 'tag=value'
    +
    +Details:
    +
    +     The '...' arguments to 'transform.data.frame' are tagged vector
    +     expressions, which are evaluated in the data frame '_data'.  The
    +     tags are matched against 'names(_data)', and for those that match,
    +     the value replace the corresponding variable in '_data', and the
    +     others are appended to '_data'.
    +
    +Value:
    +
    +     The modified value of '_data'.
    +
    +Warning:
    +
    +     This is a convenience function intended for use interactively.
    +     For programming it is better to use the standard subsetting
    +     arithmetic functions, and in particular the non-standard
    +     evaluation of argument 'transform' can have unanticipated
    +     consequences.
    +
    +Note:
    +
    +     If some of the values are not vectors of the appropriate length,
    +     you deserve whatever you get!
    +
    +Author(s):
    +
    +     Peter Dalgaard
    +
    +See Also:
    +
    +     'within' for a more flexible approach, 'subset', 'list',
    +     'data.frame'
    +
    +Examples:
    +
    +     transform(airquality, Ozone = -Ozone)
    +     transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)
    +     
    +     attach(airquality)
    +     transform(Ozone, logOzone = log(Ozone)) # marginally interesting ...
    +     detach(airquality)
    +
    +
    +

    For example, adding a binary column for seropositivity called seropos:

    +
    +
    df <- transform(df, seropos = IgG_concentration >= 10)
    +head(df)
    +
    +
    observation_id  IgG_concentration  age  gender  slum      log_IgG     seropos
    5772                    0.3176895    2  Female  Non slum  -1.1466807  FALSE
    8095                    3.4368231    4  Female  Non slum   1.2345475  FALSE
    9784                    0.3000000    4  Male    Non slum  -1.2039728  FALSE
    9338                  143.2363014    4  Male    Non slum   4.9644957  TRUE
    6369                    0.4476534    1  Male    Non slum  -0.8037359  FALSE
    6885                    0.0252708    4  Male    Non slum  -3.6781074  FALSE
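
    As a rough comparison of the two approaches, here is a sketch on an invented data frame (toy, toy_dollar and toy_transform are hypothetical names; the cutoff of 10 is taken from the example above):

```r
# Toy data with one missing concentration
toy <- data.frame(IgG_concentration = c(0.3, 143.2, NA, 10))

# Approach 1: add a column with the $ operator
toy_dollar <- toy
toy_dollar$seropos <- toy_dollar$IgG_concentration >= 10

# Approach 2: transform() can add several columns in one call,
# and column names are used without the toy$ prefix
toy_transform <- transform(toy,
                           seropos = IgG_concentration >= 10,
                           log_IgG = log(IgG_concentration))

identical(toy_dollar$seropos, toy_transform$seropos)
toy_transform
```

    Both approaches give the same seropos column, and the missing concentration stays NA in the new columns.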
    +
    +

    Creating conditional variables

    @@ -768,18 +886,19 @@

    Creating conditional variables

    ifelse example

    Reminder: the first three arguments of the ifelse() function are ifelse(test, yes, no).

    -
    df$age_group <- ifelse(df$age <= 5, "young", "old")
    -head(df)
    +
    df$age_group <- ifelse(df$age <= 5, "young", "old")
    +head(df)
    ---++++++-- @@ -789,6 +908,7 @@

    ifelse example

    + @@ -800,6 +920,7 @@

    ifelse example

    + @@ -809,6 +930,7 @@

    ifelse example

    + @@ -818,6 +940,7 @@

    ifelse example

    + @@ -827,6 +950,7 @@

    ifelse example

    + @@ -836,6 +960,7 @@

    ifelse example

    + @@ -845,101 +970,321 @@

    ifelse example

    +
    gender  slum      log_IgG     seropos  age_group
    Female  Non slum  -1.1466807  FALSE    young
    Female  Non slum   1.2345475  FALSE    young
    Male    Non slum  -1.2039728  FALSE    young
    Male    Non slum   4.9644957  TRUE     young
    Male    Non slum  -0.8037359  FALSE    young
    Male    Non slum  -3.6781074  FALSE    young
    +

    Let’s delve into what is actually happening, with a focus on the NA values in the age variable.

    +
    +
    df$age <= 5
    +
    +
      [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE    NA  TRUE  TRUE  TRUE FALSE
    + [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    + [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    + [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    + [49]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    + [61]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    + [73] FALSE  TRUE  TRUE  TRUE    NA  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
    + [85] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    + [97]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[109] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE    NA  TRUE  TRUE
    +[121]    NA  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[133] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[145]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[157] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    +[169] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
    +[181]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
    +[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    +[205]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[217] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[229]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[241] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    +[253] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[265]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
    +[277] FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[289]  TRUE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[301]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    +[313]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
    +[325]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
    +[337] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
    +[349] FALSE    NA FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
    +[361]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
    +[373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
    +[385]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[397] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[409]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[421] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[433]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[445] FALSE FALSE  TRUE  TRUE  TRUE  TRUE    NA    NA  TRUE  TRUE  TRUE  TRUE
    +[457]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[469] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[481]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    +[493]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    +[505] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[517]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[529] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[541]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[553] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[565]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[577] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[589] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[601] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[613]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    +[625] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
    +[637]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE    NA FALSE FALSE FALSE
    +[649] FALSE FALSE FALSE
    +
    +
    table(df$age, df$age_group, useNA="always", dnn=list("age", ""))
    +
    +
    age/  old  young  NA
    1       0     44   0
    2       0     72   0
    3       0     79   0
    4       0     80   0
    5       0     41   0
    6      38      0   0
    7      38      0   0
    8      39      0   0
    9      20      0   0
    10     44      0   0
    11     41      0   0
    12     23      0   0
    13     35      0   0
    14     37      0   0
    15     11      0   0
    NA      0      0   9
    +
    +

    Nesting ifelse statements example

    -
    df$age_group <- ifelse(df$age <= 5, "young", 
    -                       ifelse(df$age<=10 & df$age>5, "middle", 
    -                              ifelse(df$age>10, "old", NA)))
    -head(df)
    +
    df$age_group <- ifelse(df$age <= 5, "young", 
    +                       ifelse(df$age<=10 & df$age>5, "middle", "old"))
    +table(df$age, df$age_group, useNA="always", dnn=list("age", ""))
    -
    observation_id  IgG_concentration  age  gender  slum      log_IgG     age_group
    5772                    0.3176895    2  Female  Non slum  -1.1466807  young
    8095                    3.4368231    4  Female  Non slum   1.2345475  young
    9784                    0.3000000    4  Male    Non slum  -1.2039728  young
    9338                  143.2363014    4  Male    Non slum   4.9644957  young
    6369                    0.4476534    1  Male    Non slum  -0.8037359  young
    6885                    0.0252708    4  Male    Non slum  -3.6781074  young
    +
    age/  middle  old  young  NA
    1          0    0     44   0
    2          0    0     72   0
    3          0    0     79   0
    4          0    0     80   0
    5          0    0     41   0
    6         38    0      0   0
    7         38    0      0   0
    8         39    0      0   0
    9         20    0      0   0
    10        44    0      0   0
    11         0   41      0   0
    12         0   23      0   0
    13         0   35      0   0
    14         0   37      0   0
    15         0   11      0   0
    NA         0    0      0   9
    +

    Note that the variable levels are put in alphabetical order; we will show how to change this later.
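
    To see why the missing ages end up as NA rather than in one of the groups, here is a small sketch on a made-up vector (age and age_group here are standalone toy objects, not columns of df):

```r
age <- c(2, 8, NA, 12)   # invented ages, one missing

# The test itself is NA for the missing age ...
age <= 5
# ... so ifelse() returns NA there instead of choosing "young" or "old"
ifelse(age <= 5, "young", "old")

# The nested version behaves the same way. The inner test only needs
# age <= 10, because ages of 5 or less were already labelled "young".
age_group <- ifelse(age <= 5, "young",
                    ifelse(age <= 10, "middle", "old"))
age_group
table(age_group, useNA = "always")
```

    Because the NA is carried through automatically, no explicit NA branch is needed in the nested call.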

    @@ -958,173 +1303,19 @@

    Overview - Data Classes

    class() function

    The class() function allows you to evaluate the class of an object.

    -
    class(df$IgG_concentration)
    +
    class(df$IgG_concentration)
    [1] "numeric"
    -
    class(df$age)
    +
    class(df$age)
    [1] "integer"
    -
    class(df$gender)
    +
    class(df$gender)
    [1] "character"
    -

    Return the First or Last Parts of an Object

    -

    Description:

    -
     Returns the first or last parts of a vector, matrix, table, data
    - frame or function.  Since 'head()' and 'tail()' are generic
    - functions, they may also have been extended to other classes.
    -

    Usage:

    -
     head(x, ...)
    - ## Default S3 method:
    - head(x, n = 6L, ...)
    - 
    - ## S3 method for class 'matrix'
    - head(x, n = 6L, ...) # is exported as head.matrix()
    - ## NB: The methods for 'data.frame' and 'array'  are identical to the 'matrix' one
    - 
    - ## S3 method for class 'ftable'
    - head(x, n = 6L, ...)
    - ## S3 method for class 'function'
    - head(x, n = 6L, ...)
    - 
    - 
    - tail(x, ...)
    - ## Default S3 method:
    - tail(x, n = 6L, keepnums = FALSE, addrownums, ...)
    - ## S3 method for class 'matrix'
    - tail(x, n = 6L, keepnums = TRUE, addrownums, ...) # exported as tail.matrix()
    - ## NB: The methods for 'data.frame', 'array', and 'table'
    - ##     are identical to the  'matrix'  one
    - 
    - ## S3 method for class 'ftable'
    - tail(x, n = 6L, keepnums = FALSE, addrownums, ...)
    - ## S3 method for class 'function'
    - tail(x, n = 6L, ...)
    - 
    -

    Arguments:

    -
       x: an object
    -
    -   n: an integer vector of length up to 'dim(x)' (or 1, for
    -      non-dimensioned objects).  A 'logical' is silently coerced to
    -      integer.  Values specify the indices to be selected in the
    -      corresponding dimension (or along the length) of the object.
    -      A positive value of 'n[i]' includes the first/last 'n[i]'
    -      indices in that dimension, while a negative value excludes
    -      the last/first 'abs(n[i])', including all remaining indices.
    -      'NA' or non-specified values (when 'length(n) <
    -      length(dim(x))') select all indices in that dimension. Must
    -      contain at least one non-missing value.
    -

    keepnums: in each dimension, if no names in that dimension are present, create them using the indices included in that dimension. Ignored if ‘dim(x)’ is ‘NULL’ or its length 1.

    -

    addrownums: deprecated - ‘keepnums’ should be used instead. Taken as the value of ‘keepnums’ if it is explicitly set when ‘keepnums’ is not.

    -
     ...: arguments to be passed to or from other methods.
    -

    Details:

    -
     For vector/array based objects, 'head()' ('tail()') returns a
    - subset of the same dimensionality as 'x', usually of the same
    - class. For historical reasons, by default they select the first
    - (last) 6 indices in the first dimension ("rows") or along the
    - length of a non-dimensioned vector, and the full extent (all
    - indices) in any remaining dimensions. 'head.matrix()' and
    - 'tail.matrix()' are exported.
    -
    - The default and array(/matrix) methods for 'head()' and 'tail()'
    - are quite general. They will work as is for any class which has a
    - 'dim()' method, a 'length()' method (only required if 'dim()'
    - returns 'NULL'), and a '[' method (that accepts the 'drop'
    - argument and can subset in all dimensions in the dimensioned
    - case).
    -
    - For functions, the lines of the deparsed function are returned as
    - character strings.
    -
    - When 'x' is an array(/matrix) of dimensionality two and more,
    - 'tail()' will add dimnames similar to how they would appear in a
    - full printing of 'x' for all dimensions 'k' where 'n[k]' is
    - specified and non-missing and 'dimnames(x)[[k]]' (or 'dimnames(x)'
    - itself) is 'NULL'.  Specifically, the form of the added dimnames
    - will vary for different dimensions as follows:
    -
    - 'k=1' (rows): '"[n,]"' (right justified with whitespace padding)
    -
    - 'k=2' (columns): '"[,n]"' (with _no_ whitespace padding)
    -
    - 'k>2' (higher dims): '"n"', i.e., the indices as _character_
    -      values
    -
    - Setting 'keepnums = FALSE' suppresses this behaviour.
    -
    - As 'data.frame' subsetting ('indexing') keeps 'attributes', so do
    - the 'head()' and 'tail()' methods for data frames.
    -

    Value:

    -
     An object (usually) like 'x' but generally smaller.  Hence, for
    - 'array's, the result corresponds to 'x[.., drop=FALSE]'.  For
    - 'ftable' objects 'x', a transformed 'format(x)'.
    -

    Note:

    -
     For array inputs the output of 'tail' when 'keepnums' is 'TRUE',
    - any dimnames vectors added for dimensions '>2' are the original
    - numeric indices in that dimension _as character vectors_.  This
    - means that, e.g., for 3-dimensional array 'arr', 'tail(arr,
    - c(2,2,-1))[ , , 2]' and 'tail(arr, c(2,2,-1))[ , , "2"]' may both
    - be valid but have completely different meanings.
    -

    Author(s):

    -
     Patrick Burns, improved and corrected by R-Core. Negative argument
    - added by Vincent Goulet.  Multi-dimension support added by Gabriel
    - Becker.
    -

    Examples:

    -
     head(letters)
    - head(letters, n = -6L)
    - 
    - head(freeny.x, n = 10L)
    - head(freeny.y)
    - 
    - head(iris3)
    - head(iris3, c(6L, 2L))
    - head(iris3, c(6L, -1L, 2L))
    - 
    - tail(letters)
    - tail(letters, n = -6L)
    - 
    - tail(freeny.x)
    - ## the bottom-right "corner" :
    - tail(freeny.x, n = c(4, 2))
    - tail(freeny.y)
    - 
    - tail(iris3)
    - tail(iris3, c(6L, 2L))
    - tail(iris3, c(6L, -1L, 2L))
    - 
    - ## iris with dimnames stripped
    - a3d <- iris3 ; dimnames(a3d) <- NULL
    - tail(a3d, c(6, -1, 2)) # keepnums = TRUE is default here!
    - tail(a3d, c(6, -1, 2), keepnums = FALSE)
    - 
    - ## data frame w/ a (non-standard) attribute:
    - treeS <- structure(trees, foo = "bar")
    - (n <- nrow(treeS))
    - stopifnot(exprs = { # attribute is kept
    -     identical(htS <- head(treeS), treeS[1:6, ])
    -     identical(attr(htS, "foo") , "bar")
    -     identical(tlS <- tail(treeS), treeS[(n-5):n, ])
    -     ## BUT if I use "useAttrib(.)", this is *not* ok, when n is of length 2:
    -     ## --- because [i,j]-indexing of data frames *also* drops "other" attributes ..
    -     identical(tail(treeS, 3:2), treeS[(n-2):n, 2:3] )
    - })
    - 
    - tail(library) # last lines of function
    - 
    - head(stats::ftable(Titanic))
    - 
    - ## 1d-array (with named dim) :
    - a1 <- array(1:7, 7); names(dim(a1)) <- "O2"
    - stopifnot(exprs = {
    -   identical( tail(a1, 10), a1)
    -   identical( head(a1, 10), a1)
    -   identical( head(a1, 1), a1 [1 , drop=FALSE] ) # was a1[1] in R <= 3.6.x
    -   identical( tail(a1, 2), a1[6:7])
    -   identical( tail(a1, 1), a1 [7 , drop=FALSE] ) # was a1[7] in R <= 3.6.x
    - })

    One dimensional data types

    @@ -1144,19 +1335,19 @@

    Character and numeric

    This can also be a bit tricky.

    If even one element of the vector is a character, the whole vector is coerced to the character class.

    -
    class(c(1, 2, "tree")) 
    +
    class(c(1, 2, "tree")) 
    [1] "character"

    Here, because the integers are in quotation marks, R reads them as the character class.

    -
    class(c("1", "4", "7")) 
    +
    class(c("1", "4", "7")) 
    [1] "character"
    -

    Note, this is the first time we have shown you nested functions. Here, instead of creating a new vector object (e.g., x <- c("1", "4", "7")) and then feeding the vector object x into the first argument of the class() function (e.g., class(x)), we combined the two steps and directly fed a vector object into the class function.

    +

    Note, instead of creating a new vector object (e.g., x <- c("1", "4", "7")) and then feeding the vector object x into the first argument of the class() function (e.g., class(x)), we combined the two steps and directly fed a vector object into the class function.

    Numeric Subclasses

    @@ -1167,19 +1358,19 @@

    Numeric Subclasses

    typeof() identifies the vector type (double, integer, logical, or character), whereas class() identifies the root class. The difference between the two will be clearer when we look at two-dimensional classes below.

    -
    class(df$IgG_concentration)
    +
    class(df$IgG_concentration)
    [1] "numeric"
    -
    class(df$age)
    +
    class(df$age)
    [1] "integer"
    -
    typeof(df$IgG_concentration)
    +
    typeof(df$IgG_concentration)
    [1] "double"
    -
    typeof(df$age)
    +
    typeof(df$age)
    [1] "integer"
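
    A quick sketch of where class() and typeof() agree and where they differ, using small made-up objects (all object names are illustrative):

```r
x_int <- 1:3                  # integer vector
x_dbl <- c(1.5, 2, 3)         # double vector
m <- matrix(1:6, ncol = 2)    # matrix of integers
d <- data.frame(a = 1:3)      # data frame
f <- factor(c("a", "b"))      # factor

class(x_int); typeof(x_int)   # "integer" and "integer"
class(x_dbl); typeof(x_dbl)   # "numeric" and "double"
class(m);     typeof(m)       # "matrix" "array" (in R >= 4.0) and "integer"
class(d);     typeof(d)       # "data.frame" and "list"
class(f);     typeof(f)       # "factor" and "integer"
```

    typeof() reports how the values are stored, while class() reports how R treats the object, which is why a data frame is stored as a list and a factor is stored as integers.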
    @@ -1187,35 +1378,35 @@

    Numeric Subclasses

    Logical

    -

    Reminder logical is a type that only has two possible elements: TRUE and FALSE.

    +

    Reminder: logical is a type that has only three possible elements: TRUE, FALSE, and NA.

    -
    class(c(TRUE, FALSE, TRUE, TRUE, FALSE))
    +
    class(c(TRUE, FALSE, TRUE, TRUE, FALSE))
    [1] "logical"
    -

    Note that logical elements are NOT in quotes. Putting R special classes (e.g., NA or FALSE) in quotations turns them into character value.

    +

    Note that when creating a logical object, TRUE and FALSE are NOT in quotes. Putting R special values (e.g., NA or FALSE) in quotation marks turns them into character values.

    Other useful functions for evaluating/setting classes

    There are two useful functions associated with practically all R classes:

    • is.CLASS_NAME(x) to logically check whether or not x is of certain class. For example, is.integer or is.character or is.numeric
    • -
    • as.CLASS_NAME(x) to coerce between classes x from current x class into a certain class. For example, as.integer or as.character or as.numeric. This is particularly useful is maybe integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).
    • +
    • as.CLASS_NAME(x) to coerce x from its current class into another class. For example, as.integer or as.character or as.numeric. This is particularly useful if, for example, an integer variable was read in as a character variable, or when you need to change a character variable to a factor variable (more on this later).

    Examples is.CLASS_NAME(x)

    -
    is.numeric(df$IgG_concentration)
    +
    is.numeric(df$IgG_concentration)
    [1] TRUE
    -
    is.character(df$age)
    +
    is.character(df$age)
    [1] FALSE
    -
    is.character(df$gender)
    +
    is.character(df$gender)
    [1] TRUE
    @@ -1225,29 +1416,29 @@

    Examples is.CLASS_NAME(x)

    Examples as.CLASS_NAME(x)

    In some cases, coercing is seamless

    -
    as.character(c(1, 4, 7))
    +
    as.character(c(1, 4, 7))
    [1] "1" "4" "7"
    -
    as.numeric(c("1", "4", "7"))
    +
    as.numeric(c("1", "4", "7"))
    [1] 1 4 7
    -
    as.logical(c("TRUE", "FALSE", "FALSE"))
    +
    as.logical(c("TRUE", "FALSE", "FALSE"))
    [1]  TRUE FALSE FALSE
    -

    In some cases the coercing is not possible; if executed, will return NA (an R constant representing “Not Available” i.e. missing value)

    +

    In some cases coercion is not possible; if executed, it will return NA

    -
    as.numeric(c("1", "4", "7a"))
    +
    as.numeric(c("1", "4", "7a"))
    Warning: NAs introduced by coercion
    [1]  1  4 NA
    -
    as.logical(c("TRUE", "FALSE", "UNKNOWN"))
    +
    as.logical(c("TRUE", "FALSE", "UNKNOWN"))
    [1]  TRUE FALSE    NA
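
    When coercion introduces NAs, it can help to find out which original values caused them. A minimal sketch, assuming an invented input vector (raw, num and failed are hypothetical names):

```r
raw <- c("1", "4", "7a", NA)   # the last value is truly missing

# Silence the coercion warning on purpose, since we check the result ourselves
num <- suppressWarnings(as.numeric(raw))

# Values that were present before coercion but are NA afterwards
failed <- !is.na(raw) & is.na(num)
which(failed)   # position of the problem value
raw[failed]     # the problem value itself
```

    This separates values that could not be converted from values that were already missing.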
    @@ -1257,21 +1448,22 @@

    Examples as.CLASS_NAME(x)

    Factors

    A factor is a special character vector where the elements have pre-defined groups or ‘levels’. You can think of these as qualitative or categorical variables. Use the factor() function to create factors from character values.

    -
    class(df$age_group)
    +
    class(df$age_group)
    [1] "character"
    -
    df$age_group_factor <- factor(df$age_group)
    -class(df$age_group_factor)
    +
    df$age_group_factor <- factor(df$age_group)
    +class(df$age_group_factor)
    [1] "factor"
    -
    levels(df$age_group_factor)
    +
    levels(df$age_group_factor)
    [1] "middle" "old"    "young" 
    -

    Note that levels are, by default, set to alphanumerical order! And, the first is always the “reference” group. However, we often prefer a different reference group.

    +

    Note 1: levels are, by default, set in alphanumerical order, and the first level is always the “reference” group. However, we often prefer a different reference group.

    +

    Note 2: we can also make ordered factors using factor(..., ordered=TRUE), but we won’t talk more about that.

    Reference Groups

    @@ -1281,7 +1473,12 @@

    Reference Groups

    Changing factor reference

    -

    Changing the reference group of a factor variable. - If the object is already a factor then use relevel() function and the ref argument to specify the reference. - If the object is a character then use factor() function and levels argument to specify the order of the values, the first being the reference.

    +

    Changing the reference group of a factor variable.

    +
      +
    • If the object is already a factor then use the relevel() function and the ref argument to specify the reference.
    • +
    • If the object is a character then use the factor() function and the levels argument to specify the order of the values, the first being the reference.
    • +
    +

    Let’s look at the relevel() help file

    Reorder Levels of Factor

    Description:

     The levels of a factor are re-ordered so that the level specified
    @@ -1307,6 +1504,8 @@ 

    Changing factor reference

    Examples:

     warpbreaks$tension <- relevel(warpbreaks$tension, ref = "M")
      summary(lm(breaks ~ wool + tension, data = warpbreaks))
    +


    +

    Let’s look at the factor() help file

    Factors

    Description:

     The function 'factor' is used to encode a vector as a factor (the
    @@ -1529,16 +1728,16 @@ 

    Changing factor reference

    Changing factor reference examples

    -
    df$age_group_factor <- relevel(df$age_group_factor, ref="young")
    -levels(df$age_group_factor)
    +
    df$age_group_factor <- relevel(df$age_group_factor, ref="young")
    +levels(df$age_group_factor)
    [1] "young"  "middle" "old"   

    OR

    -
    df$age_group_factor <- factor(df$age_group, levels=c("young", "middle", "old"))
    -levels(df$age_group_factor)
    +
    df$age_group_factor <- factor(df$age_group, levels=c("young", "middle", "old"))
    +levels(df$age_group_factor)
    [1] "young"  "middle" "old"   
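
    The two approaches can be compared side by side on a tiny made-up vector (grp and the f_* objects are illustrative names only):

```r
grp <- c("old", "young", "middle", "young")   # invented values

f_default <- factor(grp)
levels(f_default)          # alphabetical: "middle" "old" "young"

# relevel() moves one level to the front and keeps the rest in place
f_relevel <- relevel(f_default, ref = "young")
levels(f_relevel)

# factor(levels = ...) sets the complete order explicitly
f_levels <- factor(grp, levels = c("young", "middle", "old"))
levels(f_levels)

table(f_levels)            # counts are reported in level order
```

    With only three levels the two results happen to match; with more levels, relevel() changes only the reference group while factor(levels = ...) controls the whole order.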
    @@ -1559,7 +1758,7 @@

    Matrices

    as.matrix() creates a matrix from a data frame (where all values are the same class).

    You can also create a matrix from scratch using matrix(). Use ?matrix to see the arguments.

    -
    matrix(1:6, ncol = 2) 
    +
    matrix(data=1:6, ncol = 2) 
    @@ -1578,7 +1777,7 @@

    Matrices

    -
    matrix(1:6, ncol=2, byrow=TRUE) 
    +
    matrix(data=1:6, ncol=2, byrow=TRUE) 
    @@ -1598,13 +1797,13 @@

    Matrices

    -

    Notice, the first matrix filled in numbers 1-6 by columns first and then rows because default byrow argument is FALSE. In the second matrix, we changed the argument byrow to TRUE, and now numbers 1-6 are filled by rows first and then columns.

    +

    Note, the first matrix filled in numbers 1-6 by columns first and then rows because the default value of the byrow argument is FALSE. In the second matrix, we changed the argument byrow to TRUE, and now numbers 1-6 are filled by rows first and then columns.
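
    A short sketch that checks the fill order directly (m_col and m_row are hypothetical object names):

```r
m_col <- matrix(data = 1:6, ncol = 2)                # filled down the columns
m_row <- matrix(data = 1:6, ncol = 2, byrow = TRUE)  # filled across the rows

m_col[1, ]   # first row of the column-filled matrix
m_row[1, ]   # first row of the row-filled matrix
dim(m_col)   # both matrices have 3 rows and 2 columns

# Filling by row gives the same result as transposing a
# column-filled matrix with the dimensions swapped
identical(m_row, t(matrix(1:6, nrow = 2)))
```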

    -

    Data Frame

    -

    You can transform an existing matrix into data frames and tibble using as.data.frame().

    +

    Data frame

    +

    You can transform an existing matrix into a data frame using as.data.frame()

    -
    as.data.frame(matrix(1:6, ncol = 2) ) 
    +
    as.data.frame(matrix(1:6, ncol = 2) ) 
    @@ -1633,8 +1832,18 @@

    Data Frame

    Numeric variable data summary

    -

    Data summarization on numeric vectors/variables: - mean(): takes the mean of x - sd(): takes the standard deviation of x - median(): takes the median of x - quantile(): displays sample quantiles of x. Default is min, IQR, max - range(): displays the range. Same as c(min(), max()) - sum(): sum of x - max(): maximum value in x - min(): minimum value in x

    -

    Note, all have the na.rm = argument for missing data

    +

    Data summarization on numeric vectors/variables:

    +
      +
    • mean(): takes the mean of x
    • +
    • sd(): takes the standard deviation of x
    • +
    • median(): takes the median of x
    • +
    • quantile(): displays sample quantiles of x. Default is min, IQR, max
    • +
    • range(): displays the range. Same as c(min(), max())
    • +
    • sum(): sum of x
    • +
    • max(): maximum value in x
    • +
    • min(): minimum value in x
    • +
    +

    Note, all have the na.rm argument for missing data
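As a small illustration of what na.rm changes, here is a sketch with a made-up vector (age is hypothetical here, not the course data):

```r
# Hypothetical ages with one missing value
age <- c(2, 4, NA, 7, 11)

mean(age)                 # NA: missing values propagate by default
mean(age, na.rm = TRUE)   # 6: the NA is dropped before averaging
range(age, na.rm = TRUE)  # 2 11
```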

    Arithmetic Mean

    Description:

     Generic function for the (trimmed) arithmetic mean.
    @@ -1677,19 +1886,20 @@

    Numeric variable data summary

    Numeric variable data summary examples

    -
    summary(df)
    +
    summary(df)
    +++---+++-- @@ -1700,6 +1910,7 @@

    Numeric variable data summary examples

    + @@ -1713,6 +1924,7 @@

    Numeric variable data summary examples

    + @@ -1724,6 +1936,7 @@

    Numeric variable data summary examples

    + @@ -1735,6 +1948,7 @@

    Numeric variable data summary examples

    + @@ -1746,6 +1960,7 @@

    Numeric variable data summary examples

    + @@ -1759,6 +1974,7 @@

    Numeric variable data summary examples

    + @@ -1770,6 +1986,7 @@

    Numeric variable data summary examples

    + @@ -1781,27 +1998,28 @@

    Numeric variable data summary examples

    +
    gender            slum              log_IgG           seropos         age_group         age_group_factor
    Length:651        Length:651        Min.   :-5.2231   Mode :logical   Length:651        young :316
    Class :character  Class :character  1st Qu.:-1.2040   FALSE:360       Class :character  middle:179
    Mode  :character  Mode  :character  Median : 0.5103   TRUE :281       Mode  :character  old   :147
    NA                NA                Mean   : 1.6074   NA’s :10        NA                NA’s  :  9
    NA                NA                3rd Qu.: 4.9519   NA              NA                NA
    NA                NA                Max.   : 6.8205   NA              NA                NA
    NA                NA                NA’s   :10        NA              NA                NA
    -
    range(df$age)
    +
    range(df$age)
    [1] NA NA
    -
    range(df$age, na.rm=TRUE)
    +
    range(df$age, na.rm=TRUE)
    [1]  1 15
    -
    median(df$IgG_concentration, na.rm=TRUE)
    +
    median(df$IgG_concentration, na.rm=TRUE)
    [1] 1.665753
    -

    Character Variable Data Summaries

    -

    Data summarization on character or factor vectors/variables * table()

    +

    Character variable data summaries

    +

    Data summarization on character or factor vectors/variables using table()

    Cross Tabulation and Table Creation

    Description:

     'table' uses cross-classifying factors to build a contingency
    @@ -1965,7 +2183,7 @@ 

    Character Variable Data Summaries

    Character variable data summary examples

    Number of observations in each category

    -
    table(df$gender)
    +
    table(df$gender)
    @@ -1982,7 +2200,7 @@

    Character variable data summary examples

    -
    table(df$gender, useNA="always")
    +
    table(df$gender, useNA="always")
    @@ -2001,7 +2219,7 @@

    Character variable data summary examples

    -
    table(df$age_group, useNA="always")
    +
    table(df$age_group, useNA="always")
    @@ -2023,9 +2241,8 @@

    Character variable data summary examples

    -

    Percent of observations in each category (xxzane - better way in base r?)

    -
    table(df$gender)/nrow(df) #if no NA values
    +
    table(df$gender)/nrow(df) #if no NA values
    @@ -2042,7 +2259,7 @@

    Character variable data summary examples

    -
    table(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values
    +
    table(df$age_group)/nrow(df[!is.na(df$age_group),]) #if there are NA values
    @@ -2061,7 +2278,7 @@

    Character variable data summary examples

    -
    table(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values
    +
    table(df$age_group)/nrow(subset(df, !is.na(df$age_group),)) #if there are NA values
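One base R option for the proportions above is prop.table(), which divides a table by its own total, so the denominator always matches whatever table() counted. A small sketch with a made-up vector (this age_group is hypothetical, not the course data):

```r
age_group <- c("young", "young", "middle", "old", NA, "young")

# table() drops NA by default, so these proportions sum to 1 over the non-missing values
p <- prop.table(table(age_group))
p

# Keep NA as its own category in both the counts and the denominator
p_na <- prop.table(table(age_group, useNA = "always"))
p_na
```

Multiplying the result by 100 gives percentages.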
    @@ -2091,12 +2308,12 @@

    Summary

  • is.CLASS_NAME(x) can be used to test the class of an object x
  • as.CLASS_NAME(x) can be used to change the class of an object x
  • Factors are a special character class that has levels
  • -
  • +
  • …xxamy complete
  • Acknowledgements

    -

    These are the materials I looked through, modified, or extracted to complete this module’s lecture.

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    diff --git a/docs/modules/Module08-DataMergeReshape.html b/docs/modules/Module08-DataMergeReshape.html new file mode 100644 index 0000000..9b0bcad --- /dev/null +++ b/docs/modules/Module08-DataMergeReshape.html @@ -0,0 +1,1587 @@ + + + + + + + + + + + + + + + SISMID Module NUMBER Materials (2025) - Module 8: Data Merging and Reshaping + + + + + + + + + + + + + + + +
    +
    + +
    +

    Module 8: Data Merging and Reshaping

    + +
    + + +
    + +
    +
    +

    Learning Objectives

    +

    After module 8, you should be able to…

    +
      +
    • Merge/join data together
    • +
    • Reshape data from wide to long
    • +
    • Reshape data from long to wide
    • +
    +
    +
    +

    Joining types

    +

    Pay close attention to the number of rows in your data set before and after a join. This will help flag when an issue has arisen. This will depend on the type of merge:

    +
      +
    • 1:1 merge (one-to-one merge) – Simplest merge (sometimes things go wrong)
    • +
    • 1:m merge (one-to-many merge) – More complex (things often go wrong) +
        +
      • The “one” means that each value of the merging variable (e.g., id) appears only once in one dataset, and the “many” means that the same value can appear multiple times in the other dataset
      • +
    • +
    • m:m merge (many-to-many merge) – Danger zone (can be unpredictable)
    • +
    +
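A minimal sketch of the row-count check for a 1:m merge. The data frames people and visits are made up for illustration, and every id is assumed to have a match.

```r
# Hypothetical data: one row per person, and several visits per person
people <- data.frame(id = c(1, 2, 3), gender = c("F", "M", "F"))
visits <- data.frame(id = c(1, 1, 2, 3, 3, 3), visit = c(1, 2, 1, 1, 2, 3))

merged <- merge(people, visits, by = "id")

# Compare row counts before and after the join
nrow(people)  # 3
nrow(visits)  # 6
nrow(merged)  # 6: here the 1:m merge has as many rows as the "many" data set
```

If the merged row count is larger than the "many" data set, the "one" side probably contains repeated ids.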
    +
    +

    one-to-one merge

    +
      +
    • This means that each row of data represents a unique unit of analysis that exists in another dataset (e.g., id variable)
    • +
    • Will likely have variables that don’t exist in the current dataset (that’s why you are trying to merge it in)
    • +
    • Each value of the merging variable (e.g., id) is represented a single time in each dataset
    • +
    • You should try to structure your data so that a 1:1 merge or 1:m merge is possible so that fewer things can go wrong.
    • +
    +
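Before a 1:1 merge, it can help to confirm that the merging variable really is unique in each data set. A hedged sketch using base R's anyDuplicated() and duplicated(); d1 and d2 are hypothetical data frames:

```r
# Hypothetical data sets that should each have one row per id
d1 <- data.frame(id = c(101, 102, 103), age = c(5, 7, 9))
d2 <- data.frame(id = c(101, 102, 102), IgG = c(0.3, 1.2, 1.4))

anyDuplicated(d1$id)  # 0: every id appears once
anyDuplicated(d2$id)  # 3: position of the first repeated id, so d2 is not one row per id

# Which ids are repeated?
unique(d2$id[duplicated(d2$id)])  # 102
```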
    +
    +

    merge() function

    +

    We will use the merge() function to conduct a one-to-one merge.

    +
    +

    Merge Two Data Frames

    +

    Description:

    +
     Merge two data frames by common columns or row names, or do other
    + versions of database _join_ operations.
    +

    Usage:

    +
     merge(x, y, ...)
    + 
    + ## Default S3 method:
    + merge(x, y, ...)
    + 
    + ## S3 method for class 'data.frame'
    + merge(x, y, by = intersect(names(x), names(y)),
    +       by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
    +       sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
    +       incomparables = NULL, ...)
    + 
    +

    Arguments:

    +
    x, y: data frames, or objects to be coerced to one.
    +

    by, by.x, by.y: specifications of the columns used for merging. See ‘Details’.

    +
     all: logical; 'all = L' is shorthand for 'all.x = L' and 'all.y =
    +      L', where 'L' is either 'TRUE' or 'FALSE'.
    +

    all.x: logical; if ‘TRUE’, then extra rows will be added to the output, one for each row in ‘x’ that has no matching row in ‘y’. These rows will have ‘NA’s in those columns that are usually filled with values from ’y’. The default is ‘FALSE’, so that only rows with data from both ‘x’ and ‘y’ are included in the output.

    +

    all.y: logical; analogous to ‘all.x’.

    +
    sort: logical.  Should the result be sorted on the 'by' columns?
    +

    suffixes: a character vector of length 2 specifying the suffixes to be used for making unique the names of columns in the result which are not used for merging (appearing in ‘by’ etc).

    +

    no.dups: logical indicating that ‘suffixes’ are appended in more cases to avoid duplicated column names in the result. This was implicitly false before R version 3.5.0.

    +

    incomparables: values which cannot be matched. See ‘match’. This is intended to be used for merging on one column, so these are incomparable values of that column.

    +
     ...: arguments to be passed to or from methods.
    +

    Details:

    +
     'merge' is a generic function whose principal method is for data
    + frames: the default method coerces its arguments to data frames
    + and calls the '"data.frame"' method.
    +
    + By default the data frames are merged on the columns with names
    + they both have, but separate specifications of the columns can be
    + given by 'by.x' and 'by.y'.  The rows in the two data frames that
    + match on the specified columns are extracted, and joined together.
    + If there is more than one match, all possible matches contribute
    + one row each.  For the precise meaning of 'match', see 'match'.
    +
    + Columns to merge on can be specified by name, number or by a
    + logical vector: the name '"row.names"' or the number '0' specifies
    + the row names.  If specified by name it must correspond uniquely
    + to a named column in the input.
    +
    + If 'by' or both 'by.x' and 'by.y' are of length 0 (a length zero
    + vector or 'NULL'), the result, 'r', is the _Cartesian product_ of
    + 'x' and 'y', i.e., 'dim(r) = c(nrow(x)*nrow(y), ncol(x) +
    + ncol(y))'.
    +
    + If 'all.x' is true, all the non matching cases of 'x' are appended
    + to the result as well, with 'NA' filled in the corresponding
    + columns of 'y'; analogously for 'all.y'.
    +
    + If the columns in the data frames not used in merging have any
    + common names, these have 'suffixes' ('".x"' and '".y"' by default)
    + appended to try to make the names of the result unique.  If this
    + is not possible, an error is thrown.
    +
    + If a 'by.x' column name matches one of 'y', and if 'no.dups' is
    + true (as by default), the y version gets suffixed as well,
    + avoiding duplicate column names in the result.
    +
    + The complexity of the algorithm used is proportional to the length
    + of the answer.
    +
    + In SQL database terminology, the default value of 'all = FALSE'
    + gives a _natural join_, a special case of an _inner join_.
    + Specifying 'all.x = TRUE' gives a _left (outer) join_, 'all.y =
    + TRUE' a _right (outer) join_, and both ('all = TRUE') a _(full)
    + outer join_.  DBMSes do not match 'NULL' records, equivalent to
    + 'incomparables = NA' in R.
    +

    Value:

    +
     A data frame.  The rows are by default lexicographically sorted on
    + the common columns, but for 'sort = FALSE' are in an unspecified
    + order.  The columns are the common columns followed by the
    + remaining columns in 'x' and then those in 'y'.  If the matching
    + involved row names, an extra character column called 'Row.names'
    + is added at the left, and in all cases the result has 'automatic'
    + row names.
    +

    Note:

    +
     This is intended to work with data frames with vector-like
    + columns: some aspects work with data frames containing matrices,
    + but not all.
    +
    + Currently long vectors are not accepted for inputs, which are thus
    + restricted to less than 2^31 rows. That restriction also applies
    + to the result for 32-bit platforms.
    +

    See Also:

    +
     'data.frame', 'by', 'cbind'.
    +
    + 'dendrogram' for a class which has a 'merge' method.
    +

    Examples:

    +
     authors <- data.frame(
    +     ## I(*) : use character columns of names to get sensible sort order
    +     surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
    +     nationality = c("US", "Australia", "US", "UK", "Australia"),
    +     deceased = c("yes", rep("no", 4)))
    + authorN <- within(authors, { name <- surname; rm(surname) })
    + books <- data.frame(
    +     name = I(c("Tukey", "Venables", "Tierney",
    +              "Ripley", "Ripley", "McNeil", "R Core")),
    +     title = c("Exploratory Data Analysis",
    +               "Modern Applied Statistics ...",
    +               "LISP-STAT",
    +               "Spatial Statistics", "Stochastic Simulation",
    +               "Interactive Data Analysis",
    +               "An Introduction to R"),
    +     other.author = c(NA, "Ripley", NA, NA, NA, NA,
    +                      "Venables & Smith"))
    + 
    + (m0 <- merge(authorN, books))
    + (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
    +  m2 <- merge(books, authors, by.x = "name", by.y = "surname")
    + stopifnot(exprs = {
    +    identical(m0, m2[, names(m0)])
    +    as.character(m1[, 1]) == as.character(m2[, 1])
    +    all.equal(m1[, -1], m2[, -1][ names(m1)[-1] ])
    +    identical(dim(merge(m1, m2, by = NULL)),
    +              c(nrow(m1)*nrow(m2), ncol(m1)+ncol(m2)))
    + })
    + 
    + ## "R core" is missing from authors and appears only here :
    + merge(authors, books, by.x = "surname", by.y = "name", all = TRUE)
    + 
    + 
    + ## example of using 'incomparables'
    + x <- data.frame(k1 = c(NA,NA,3,4,5), k2 = c(1,NA,NA,4,5), data = 1:5)
    + y <- data.frame(k1 = c(NA,2,NA,4,5), k2 = c(NA,NA,3,4,5), data = 1:5)
    + merge(x, y, by = c("k1","k2")) # NA's match
    + merge(x, y, by = "k1") # NA's match, so 6 rows
    + merge(x, y, by = "k2", incomparables = NA) # 2 rows
    +
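The help page describes how all, all.x, and all.y map to join types. A small sketch with two made-up data frames (x and y are hypothetical) shows which ids survive each one:

```r
x <- data.frame(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- merge(x, y, by = "id")                # ids 2, 3 (only matches)
left  <- merge(x, y, by = "id", all.x = TRUE)  # ids 1, 2, 3 (all rows of x)
right <- merge(x, y, by = "id", all.y = TRUE)  # ids 2, 3, 4 (all rows of y)
full  <- merge(x, y, by = "id", all = TRUE)    # ids 1, 2, 3, 4 (all rows of both)

full  # unmatched rows are filled with NA in the other data set's columns
```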
    +
    +

    Let’s import the new data we want to merge and take a look.

    +

    The new data, serodata_new.csv, represents a follow-up serological survey four years later. At this follow-up, individuals were retested for IgG antibody concentrations and their ages were collected.

    +
    +
    df_new <- read.csv("data/serodata_new.csv")
    +str(df_new)
    +
    +
    'data.frame':   636 obs. of  3 variables:
    + $ observation_id   : int  5772 8095 9784 9338 6369 6885 6252 8913 7332 6941 ...
    + $ IgG_concentration: num  0.261 2.981 0.282 136.638 0.381 ...
    + $ age              : int  6 8 8 8 5 8 8 NA 8 6 ...
    +
    +
    summary(df_new)
    +
    +
    observation_id   IgG_concentration   age
    Min.   :5006     Min.   :  0.0051    Min.   : 5.00
    1st Qu.:6328     1st Qu.:  0.2751    1st Qu.: 7.00
    Median :7494     Median :  1.5477    Median :10.00
    Mean   :7490     Mean   : 82.7684    Mean   :10.63
    3rd Qu.:8736     3rd Qu.:129.6389    3rd Qu.:14.00
    Max.   :9982     Max.   :950.6590    Max.   :19.00
    NA               NA                  NA’s   :9
    +
    +
    +
    +
    +

    Merge the new data with the original data

    +

    Let’s load the old data as well and look for a variable, or variables, to merge by.

    +
    +
    df <- read.csv("data/serodata.csv")
    +colnames(df)
    +
    +
    [1] "observation_id"    "IgG_concentration" "age"              
    +[4] "gender"            "slum"             
    +
    +
    +

    We notice that observation_id is the obvious variable to merge by. However, IgG_concentration and age also have exactly the same names in both data sets. If we merge now, merge() keeps both versions and appends the suffixes .x and .y to tell them apart:

    +
    +
    head(merge(df, df_new, all.x=TRUE, all.y=TRUE, by=c('observation_id')))
    +
    observation_id  IgG_concentration.x  age.x  gender  slum      IgG_concentration.y  age.y
    5006            164.2979452          7      Male    Non slum  155.5811325          11
    5024            0.3000000            5      Female  Non slum  0.2918605            9
    5026            0.3000000            10     Female  Non slum  0.2542945            14
    5030            0.0555556            7      Female  Non slum  0.0533262            11
    5035            26.2112514           11     Female  Non slum  22.0159300           15
    5054            0.3000000            3      Male    Non slum  0.2709671            7
    +
    +
    +
    +
    +

    Merge the new data with the original data

    +

    The first option is to rename the IgG_concentration and age variables before the merge, so that it is clear which is time point 1 and time point 2.

    +
    +
    df$IgG_concentration_time1 <- df$IgG_concentration
    +df$age_time1 <- df$age
    +df$IgG_concentration <- df$age <- NULL #remove the original variables
    +
    +df_new$IgG_concentration_time2 <- df_new$IgG_concentration
    +df_new$age_time2 <- df_new$age
    +df_new$IgG_concentration <- df_new$age <- NULL #remove the original variables
    +
    +

    Now, let’s merge.

    +
    +
    df_all_wide <- merge(df, df_new, all.x=TRUE, all.y=TRUE, by=c('observation_id'))
    +str(df_all_wide)
    +
    +
    'data.frame':   651 obs. of  7 variables:
    + $ observation_id         : int  5006 5024 5026 5030 5035 5054 5057 5063 5064 5080 ...
    + $ gender                 : chr  "Male" "Female" "Female" "Female" ...
    + $ slum                   : chr  "Non slum" "Non slum" "Non slum" "Non slum" ...
    + $ IgG_concentration_time1: num  164.2979 0.3 0.3 0.0556 26.2113 ...
    + $ age_time1              : int  7 5 10 7 11 3 3 12 14 6 ...
    + $ IgG_concentration_time2: num  155.5811 0.2919 0.2543 0.0533 22.0159 ...
    + $ age_time2              : int  11 9 14 11 15 7 7 16 18 10 ...
    +
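As an alternative to renaming by hand, the suffixes argument documented in ?merge can label the shared columns during the merge. This is only a sketch with small made-up stand-ins (t1 and t2 are hypothetical, not the course files):

```r
# Small stand-ins for the two surveys (values are made up)
t1 <- data.frame(observation_id = c(1, 2, 3),
                 IgG_concentration = c(0.3, 1.5, 20.1),
                 age = c(5, 7, 9),
                 gender = c("Female", "Male", "Female"))
t2 <- data.frame(observation_id = c(1, 2),
                 IgG_concentration = c(0.2, 1.1),
                 age = c(9, 11))

# Shared non-key columns get "_time1" / "_time2" instead of ".x" / ".y"
wide <- merge(t1, t2, by = "observation_id", all = TRUE,
              suffixes = c("_time1", "_time2"))
names(wide)
```

Columns that exist in only one data set (gender here) keep their original names.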
    +
    +
    +
    +

    Merge the new data with the original data

    +

    The second option is to add a time variable to the two data sets and then merge by observation_id, time, age, and IgG_concentration. Note, I need to read in the data again because I removed the IgG_concentration and age variables.

    +
    +
    df <- read.csv("data/serodata.csv")
    +df_new <- read.csv("data/serodata_new.csv")
    +
    +
    +
    df$time <- 1 #you can put in one number and it will repeat it
    +df_new$time <- 2
    +head(df)
    +
    observation_id  IgG_concentration  age  gender  slum      time
    5772            0.3176895          2    Female  Non slum  1
    8095            3.4368231          4    Female  Non slum  1
    9784            0.3000000          4    Male    Non slum  1
    9338            143.2363014        4    Male    Non slum  1
    6369            0.4476534          1    Male    Non slum  1
    6885            0.0252708          4    Male    Non slum  1
    +
    +
    head(df_new)
    +
    observation_id  IgG_concentration  age  time
    5772            0.2612388          6    2
    8095            2.9809049          8    2
    9784            0.2819489          8    2
    9338            136.6382260        8    2
    6369            0.3810119          5    2
    6885            0.0245951          8    2
    +
    +
    +

    Now, let’s merge. Note that “By default the data frames are merged on the columns with names they both have”, so if I don’t specify the by argument, merge() will merge on all matching variables.

    +
    +
    df_all_long <- merge(df, df_new, all.x=TRUE, all.y=TRUE)
    +str(df_all_long)
    +
    +
    'data.frame':   1287 obs. of  6 variables:
    + $ observation_id   : int  5006 5006 5024 5024 5026 5026 5030 5030 5035 5035 ...
    + $ IgG_concentration: num  155.581 164.298 0.292 0.3 0.254 ...
    + $ age              : int  11 7 9 5 14 10 11 7 15 11 ...
    + $ time             : num  2 1 2 1 2 1 2 1 2 1 ...
    + $ gender           : chr  NA "Male" NA "Female" ...
    + $ slum             : chr  NA "Non slum" NA "Non slum" ...
    +
    +
    +

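    Note, there are 1287 rows, which is the sum of the number of rows of df (651 rows) and df_new (636 rows). Because time is one of the merging variables and differs between the two data sets, no rows match, so the full merge simply stacks them.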
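The same row-count check can be sketched on a toy example. The data frames t1 and t2 below are made up, and the by argument is left out on purpose so that merge() uses every shared column:

```r
t1 <- data.frame(observation_id = c(1, 2, 3),
                 IgG_concentration = c(0.3, 1.5, 20.1),
                 age = c(5, 7, 9),
                 time = 1)
t2 <- data.frame(observation_id = c(1, 2),
                 IgG_concentration = c(0.2, 1.1),
                 age = c(9, 11),
                 time = 2)

long <- merge(t1, t2, all = TRUE)  # merges on all four shared columns

nrow(long) == nrow(t1) + nrow(t2)  # TRUE: no row matches across time points
table(long$time)                   # 3 rows at time 1, 2 rows at time 2
```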

    +
    +
    +

    What is wide/long data?

    +

    Above, we actually created both a wide and a long version of the data.

    +

    Wide: has many columns

    +
      +
    • multiple columns per individual, values spread across multiple columns
    • +
    • easier for humans to read
    • +
    +

    Long: has many rows

    +
      +
    • column names become data
    • +
    • multiple rows per observation, a single column contains the values
    • +
    • easier for R to make plots & do analysis
    • +
    +
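To make the distinction concrete, here is a sketch of the same made-up measurements stored both ways (the column names IgG_time1, IgG_time2, and IgG are hypothetical):

```r
# The same two measurements per person, stored two ways (made-up values)
wide <- data.frame(id = c(1, 2),
                   IgG_time1 = c(0.3, 1.5),
                   IgG_time2 = c(0.2, 1.1))

long <- data.frame(id   = c(1, 1, 2, 2),
                   time = c(1, 2, 1, 2),
                   IgG  = c(0.3, 0.2, 1.5, 1.1))

dim(wide)  # 2 rows, 3 columns: one row per person
dim(long)  # 4 rows, 3 columns: one row per person per time point
```

In the long version, the time information that lived in the column names of the wide version has become values in the time column.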
    +
    +

    reshape() function

    +

    The reshape() function allows you to toggle between wide and long data

    +

    Reshape Grouped Data

    +

    Description:

    +
     This function reshapes a data frame between 'wide' format (with
    + repeated measurements in separate columns of the same row) and
    + 'long' format (with the repeated measurements in separate rows).
    +

    Usage:

    +
     reshape(data, varying = NULL, v.names = NULL, timevar = "time",
    +         idvar = "id", ids = 1:NROW(data),
    +         times = seq_along(varying[[1]]),
    +         drop = NULL, direction, new.row.names = NULL,
    +         sep = ".",
    +         split = if (sep == "") {
    +             list(regexp = "[A-Za-z][0-9]", include = TRUE)
    +         } else {
    +             list(regexp = sep, include = FALSE, fixed = TRUE)}
    +         )
    + 
    + ### Typical usage for converting from long to wide format:
    + 
    + # reshape(data, direction = "wide",
    + #         idvar = "___", timevar = "___", # mandatory
    + #         v.names = c(___),    # time-varying variables
    + #         varying = list(___)) # auto-generated if missing
    + 
    + ### Typical usage for converting from wide to long format:
    + 
    + ### If names of wide-format variables are in a 'nice' format
    + 
    + # reshape(data, direction = "long",
    + #         varying = c(___), # vector 
    + #         sep)              # to help guess 'v.names' and 'times'
    + 
    + ### To specify long-format variable names explicitly
    + 
    + # reshape(data, direction = "long",
    + #         varying = ___,  # list / matrix / vector (use with care)
    + #         v.names = ___,  # vector of variable names in long format
    + #         timevar, times, # name / values of constructed time variable
    + #         idvar, ids)     # name / values of constructed id variable
    + 
    +

    Arguments:

    +
    data: a data frame
    +

    varying: names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when ‘direction = “long”’, the names can be replaced by indices which are interpreted as referring to ‘names(data)’. See ‘Details’ for more details and options.

    +

    v.names: names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’.

    +

    timevar: the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning).

    +

    idvar: Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.

    +
     ids: the values to use for a newly created 'idvar' variable in
    +      long format.
    +

    times: the values to use for a newly created ‘timevar’ variable in long format. See ‘Details’.

    +
    drop: a vector of names of variables to drop before reshaping.
    +

    direction: character string, partially matched to either ‘“wide”’ to reshape to wide format, or ‘“long”’ to reshape to long format.

    +

    new.row.names: character or ‘NULL’: a non-null value will be used for the row names of the result.

    +
     sep: A character vector of length 1, indicating a separating
    +      character in the variable names in the wide format.  This is
    +      used for guessing 'v.names' and 'times' arguments based on
    +      the names in 'varying'.  If 'sep == ""', the split is just
    +      before the first numeral that follows an alphabetic
    +      character.  This is also used to create variable names when
    +      reshaping to wide format.
    +

    split: A list with three components, ‘regexp’, ‘include’, and (optionally) ‘fixed’. This allows an extended interface to variable name splitting. See ‘Details’.

    +

    Details:

    +
     Although 'reshape()' can be used in a variety of contexts, the
    + motivating application is data from longitudinal studies, and the
    + arguments of this function are named and described in those terms.
    + A longitudinal study is characterized by repeated measurements of
    + the same variable(s), e.g., height and weight, on each unit being
    + studied (e.g., individual persons) at different time points (which
    + are assumed to be the same for all units). These variables are
    + called time-varying variables. The study may include other
    + variables that are measured only once for each unit and do not
    + vary with time (e.g., gender and race); these are called
    + time-constant variables.
    +
    + A 'wide' format representation of a longitudinal dataset will have
    + one record (row) for each unit, typically with some time-constant
    + variables that occupy single columns, and some time-varying
    + variables that occupy multiple columns (one column for each time
    + point).  A 'long' format representation of the same dataset will
    + have multiple records (rows) for each individual, with the
    + time-constant variables being constant across these records and
    + the time-varying variables varying across the records.  The 'long'
    + format dataset will have two additional variables: a 'time'
    + variable identifying which time point each record comes from, and
    + an 'id' variable showing which records refer to the same unit.
    +
    + The type of conversion (long to wide or wide to long) is
    + determined by the 'direction' argument, which is mandatory unless
    + the 'data' argument is the result of a previous call to 'reshape'.
    + In that case, the operation can be reversed simply using
    + 'reshape(data)' (the other arguments are stored as attributes on
    + the data frame).
    +
    + Conversion from long to wide format with 'direction = "wide"' is
    + the simpler operation, and is mainly useful in the context of
    + multivariate analysis where data is often expected as a
    + wide-format matrix. In this case, the time variable 'timevar' and
    + id variable 'idvar' must be specified. All other variables are
    + assumed to be time-varying, unless the time-varying variables are
    + explicitly specified via the 'v.names' argument.  A warning is
    + issued if time-constant variables are not actually constant.
    +
    + Each time-varying variable is expanded into multiple variables in
    + the wide format.  The names of these expanded variables are
    + generated automatically, unless they are specified as the
    + 'varying' argument in the form of a list (or matrix) with one
    + component (or row) for each time-varying variable. If 'varying' is
    + a vector of names, it is implicitly converted into a matrix, with
    + one row for each time-varying variable. Use this option with care
    + if there are multiple time-varying variables, as the ordering (by
    + column, the default in the 'matrix' constructor) may be
    + unintuitive, whereas the explicit list or matrix form is
    + unambiguous.
    +
    + Conversion from wide to long with 'direction = "long"' is the more
    + common operation as most (univariate) statistical modeling
    + functions expect data in the long format. In the simpler case
    + where there is only one time-varying variable, the corresponding
    + columns in the wide format input can be specified as the 'varying'
    + argument, which can be either a vector of column names or the
    + corresponding column indices. The name of the corresponding
    + variable in the long format output combining these columns can be
    + optionally specified as the 'v.names' argument, and the name of
    + the time variables as the 'timevar' argument. The values to use as
    + the time values corresponding to the different columns in the wide
    + format can be specified as the 'times' argument.  If 'v.names' is
    + unspecified, the function will attempt to guess 'v.names' and
    + 'times' from 'varying' (an explicitly specified 'times' argument
    + is unused in that case).  The default expects variable names like
    + 'x.1', 'x.2', where 'sep = "."' specifies to split at the dot and
    + drop it from the name.  To have alphabetic followed by numeric
    + times use 'sep = ""'.
    +
    + Multiple time-varying variables can be specified in two ways,
    + either with 'varying' as an atomic vector as above, or as a list
    + (or a matrix). The first form is useful (and mandatory) if the
    + automatic variable name splitting as described above is used; this
    + requires the names of all time-varying variables to be suitably
    + formatted in the same manner, and 'v.names' to be unspecified. If
    + 'varying' is a list (with one component for each time-varying
    + variable) or a matrix (one row for each time-varying variable),
    + variable name splitting is not attempted, and 'v.names' and
    + 'times' will generally need to be specified, although they will
    + default to, respectively, the first variable name in each set, and
    + sequential times.
    +
    + Also, guessing is not attempted if 'v.names' is given explicitly,
    + even if 'varying' is an atomic vector. In that case, the number of
    + time-varying variables is taken to be the length of 'v.names', and
    + 'varying' is implicitly converted into a matrix, with one row for
    + each time-varying variable. As in the case of long to wide
    + conversion, the matrix is filled up by column, so careful
    + attention needs to be paid to the order of variable names (or
    + indices) in 'varying', which is taken to be like 'x.1', 'y.1',
    + 'x.2', 'y.2' (i.e., variables corresponding to the same time point
    + need to be grouped together).
    +
    + The 'split' argument should not usually be necessary.  The
    + 'split$regexp' component is passed to either 'strsplit' or
    + 'regexpr', where the latter is used if 'split$include' is 'TRUE',
    + in which case the splitting occurs after the first character of
    + the matched string.  In the 'strsplit' case, the separator is not
    + included in the result, and it is possible to specify fixed-string
    + matching using 'split$fixed'.
    +

    Value:

    +
     The reshaped data frame with added attributes to simplify
    + reshaping back to the original form.
    +

    See Also:

    +
     'stack', 'aperm'; 'relist' for reshaping the result of 'unlist'.
    + 'xtabs' and 'as.data.frame.table' for creating contingency tables
    + and converting them back to data frames.
    +

    Examples:

    +
     summary(Indometh) # data in long format
    + 
    + ## long to wide (direction = "wide") requires idvar and timevar at a minimum
    + reshape(Indometh, direction = "wide", idvar = "Subject", timevar = "time")
    + 
    + ## can also explicitly specify name of combined variable
    + wide <- reshape(Indometh, direction = "wide", idvar = "Subject",
    +                 timevar = "time", v.names = "conc", sep= "_")
    + wide
    + 
    + ## reverse transformation
    + reshape(wide, direction = "long")
    + reshape(wide, idvar = "Subject", varying = list(2:12),
    +         v.names = "conc", direction = "long")
    + 
    + ## times need not be numeric
    + df <- data.frame(id = rep(1:4, rep(2,4)),
    +                  visit = I(rep(c("Before","After"), 4)),
    +                  x = rnorm(4), y = runif(4))
    + df
    + reshape(df, timevar = "visit", idvar = "id", direction = "wide")
    + ## warns that y is really varying
    + reshape(df, timevar = "visit", idvar = "id", direction = "wide", v.names = "x")
    + 
    + 
    + ##  unbalanced 'long' data leads to NA fill in 'wide' form
    + df2 <- df[1:7, ]
    + df2
    + reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
    + 
    + ## Alternative regular expressions for guessing names
    + df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),
    +                   dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))
    + reshape(df3, direction = "long", varying = 3:5, sep = "")
    + 
    + 
    + ## an example that isn't longitudinal data
    + state.x77 <- as.data.frame(state.x77)
    + long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
    +                 times = names(state.x77), timevar = "Characteristic",
    +                 varying = list(names(state.x77)), direction = "long")
    + 
    + reshape(long, direction = "wide")
    + 
    + reshape(long, direction = "wide", new.row.names = unique(long$state))
    + 
    + ## multiple id variables
    + df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
    +                   time = rep(c(1,1,2,2), 3), score = rnorm(12))
    + wide <- reshape(df3, idvar = c("school", "class"), direction = "wide")
    + wide
    + ## transform back
    + reshape(wide)
    +
    +
    +

    long to wide data

    +

    xxzane - help

    +
    +
    +

    wide to long data

    +

    xxzane - help

    +
    +
    +

    Let’s get real

    +

    Use the pivot_wider() and pivot_longer() from the tidyr package!
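
As a minimal sketch of that recommendation, assuming the tidyr package is installed, the round trip between long and wide data could look like this. The data frame and its column names (`id`, `visit`, `score`) are made up for illustration and are not part of the course data.

```r
library(tidyr)

# Hypothetical toy data in long format: one row per subject per visit
long <- data.frame(
  id    = c(1, 1, 2, 2),
  visit = c("before", "after", "before", "after"),
  score = c(10, 12, 9, 15)
)

# Long to wide: each distinct value of `visit` becomes its own column
wide <- pivot_wider(long, names_from = visit, values_from = score)
wide

# Wide to long: stack the visit columns back into name/value pairs
long_again <- pivot_longer(wide, cols = c(before, after),
                           names_to = "visit", values_to = "score")
long_again
```

Unlike `reshape()`, the two functions name their inputs explicitly (`names_from`, `values_from`, `names_to`, `values_to`), which tends to make the direction of the transformation easier to read.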

    +
    +
    +

    Summary

    +
      +
    • +
    +
    +
    +

    Acknowledgements

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    + + + +
    + + + + + + + + + + + + + + + + + + + + + + + + + + + \ No newline at end of file diff --git a/docs/modules/Module09-DataAnalysis.html b/docs/modules/Module09-DataAnalysis.html index aca507b..e7487aa 100644 --- a/docs/modules/Module09-DataAnalysis.html +++ b/docs/modules/Module09-DataAnalysis.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 9: Data Analysis + SISMID Module NUMBER Materials (2025) - Module 9: Data Analysis @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; font-weight: bold; } /* Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* 
Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,45 +407,22 @@

    Module 9: Data Analysis

    -
    -

    Learning Objectives

    After module 9, you should be able to…

      -
    •   Descriptively assess association between two variables
    • -
    •   Compute basic statistics 
    • -
    •   Fit a generalized linear model
    • +
    • Descriptively assess association between two variables
    • +
    • Compute basic statistics
    • +
    • Fit a generalized linear model

    Import data for this module

    Let’s read in our data (again) and take a quick look.

    -
    df <- read.csv(file = "data/serodata.csv") #relative path
    -head(x=df, n=3)
    +
    df <- read.csv(file = "data/serodata.csv") #relative path
    +head(x=df, n=3)
      observation_id IgG_concentration age gender     slum
     1           5772         0.3176895   2 Female Non slum
    @@ -459,47 +435,186 @@ 

    Import data for this module

    Prep data

    Create age_group three level factor variable

    -
    df$age_group <- ifelse(df$age <= 5, "young", 
    -                       ifelse(df$age<=10 & df$age>5, "middle", 
    -                              ifelse(df$age>10, "old", NA)))
    -df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
    +
    df$age_group <- ifelse(df$age <= 5, "young", 
    +                       ifelse(df$age<=10 & df$age>5, "middle", "old"))
    +df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
    -

    Create seropos binary variable representing seropositivity if antibody concentrations are >10 mIUmL.

    +

    Create seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.

    -
    df$seropos <- ifelse(df$IgG_concentration<10, 0, 
    -                                        ifelse(df$IgG_concentration>=10, 1, NA))
    +
    df$seropos <- ifelse(df$IgG_concentration<10, 0, 1)
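
The shortened `ifelse()` calls no longer have an explicit `NA` branch. A small sketch on made-up vectors (not the course data) suggests why that is safe: a missing value in the test condition already yields `NA` in the result.

```r
# Hypothetical toy values, including one missing observation
age <- c(2, 7, 12, NA)
igg <- c(0.3, 25, 150, NA)

# Same nesting as above; NA in `age` propagates through both ifelse() calls
age_group <- ifelse(age <= 5, "young",
                    ifelse(age <= 10, "middle", "old"))
age_group <- factor(age_group, levels = c("young", "middle", "old"))
age_group

# NA in `igg` stays NA rather than being coded as 0 or 1
seropos <- ifelse(igg < 10, 0, 1)
seropos
```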

    2 variable contingency tables

    We previously used table() to look at one variable; now we can generate frequency tables for two or more variables. To get cell percentages, prop.table() is useful.

    -
    freq <- table(df$age_group, df$seropo)
    -prop <- prop.table(freq)
    -freq
    -
    -
            
    -           0   1
    -  young  254  57
    -  middle  70 105
    -  old     30 116
    +
    ?prop.table
    -
    prop
    +
    +
    library(printr)
    +
    +
    Registered S3 method overwritten by 'printr':
    +  method                from     
    +  knit_print.data.frame rmarkdown
    +
    +
    ?prop.table
    -
            
    -                  0          1
    -  young  0.40189873 0.09018987
    -  middle 0.11075949 0.16613924
    -  old    0.04746835 0.18354430
    +
    Express Table Entries as Fraction of Marginal Table
    +
    +Description:
    +
    +     Returns conditional proportions given 'margins', i.e. entries of
    +     'x', divided by the appropriate marginal sums.
    +
    +Usage:
    +
    +     proportions(x, margin = NULL)
    +     prop.table(x, margin = NULL)
    +     
    +Arguments:
    +
    +       x: table
    +
    +  margin: a vector giving the margins to split by.  E.g., for a matrix
    +          '1' indicates rows, '2' indicates columns, 'c(1, 2)'
    +          indicates rows and columns.  When 'x' has named dimnames, it
    +          can be a character vector selecting dimension names.
    +
    +Value:
    +
    +     Table like 'x' expressed relative to 'margin'
    +
    +Note:
    +
    +     'prop.table' is an earlier name, retained for back-compatibility.
    +
    +Author(s):
    +
    +     Peter Dalgaard
    +
    +See Also:
    +
    +     'marginSums'. 'apply', 'sweep' are a more general mechanism for
    +     sweeping out marginal statistics.
    +
    +Examples:
    +
    +     m <- matrix(1:4, 2)
    +     m
    +     proportions(m, 1)
    +     
    +     DF <- as.data.frame(UCBAdmissions)
    +     tbl <- xtabs(Freq ~ Gender + Admit, DF)
    +     
    +     proportions(tbl, "Gender")
    +
    +
    +
    +
    +

    2 variable contingency tables

    +

    Let’s practice

    +
    +
    freq <- table(df$age_group, df$seropos)
    +freq
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + +
    /01
    young25457
    middle70105
    old30116
    +
    +
    +

    Now, let’s move to percentages

    +
    +
    prop.cell.percentages <- prop.table(freq)
    +prop.cell.percentages
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + +
    /01
    young0.40189870.0901899
    middle0.11075950.1661392
    old0.04746840.1835443
    +
    +
    prop.column.percentages <- prop.table(freq, margin=2)
    +prop.column.percentages
    +
    + + + + + + + + + + + + + + + + + + + + + + + + + +
    /01
    young0.71751410.2050360
    middle0.19774010.3776978
    old0.08474580.4172662
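
Row percentages follow the same pattern with `margin = 1`. The sketch below rebuilds the frequency table from the counts shown above so that it runs without the course data; the object names are chosen for illustration only.

```r
# Counts copied from the age_group by seropos table above
freq <- as.table(matrix(
  c(254, 70, 30, 57, 105, 116), nrow = 3,
  dimnames = list(age_group = c("young", "middle", "old"),
                  seropos   = c("0", "1"))
))

# margin = 1 divides each cell by its row total,
# giving the proportion seropositive within each age group
prop.row.percentages <- prop.table(freq, margin = 1)
prop.row.percentages
```

Each row of the result sums to 1, whereas with `margin = 2` each column does.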

    Chi-Square test

    The chisq.test() function from the stats package tests whether factor variables are independent.

    -
    Registered S3 method overwritten by 'printr':
    -  method                from     
    -  knit_print.data.frame rmarkdown
    +
    +
    ?chisq.test
    +

    Pearson’s Chi-squared Test for Count Data

    Description:

     'chisq.test' performs chi-squared contingency table tests and
    @@ -633,7 +748,7 @@ 

    Chi-Square test

    Chi-Square test

    -
    chisq.test(freq)
    +
    chisq.test(freq)
    
         Pearson's Chi-squared test
    @@ -642,35 +757,58 @@ 

    Chi-Square test

    X-squared = 175.85, df = 2, p-value < 2.2e-16
    -

    We reject the null hypothesis that the proportion of seropositive individuals who are young (<5yo) is the same for individuals who are middle (5-10yo) or old (>10yo).

    +

    We reject the null hypothesis that the proportion of seropositive individuals is the same in the young, middle, and old age groups.
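
To see where the test statistic comes from, it can help to inspect the expected counts under independence. This sketch rebuilds the table from the counts reported earlier, so it does not need the course data file.

```r
# Counts copied from the age_group by seropos table
freq <- as.table(matrix(
  c(254, 70, 30, 57, 105, 116), nrow = 3,
  dimnames = list(age_group = c("young", "middle", "old"),
                  seropos   = c("0", "1"))
))

res <- chisq.test(freq)

# Expected cell counts if age group and serostatus were independent
res$expected

# The statistic is the sum over cells of (observed - expected)^2 / expected
by_hand <- sum((freq - res$expected)^2 / res$expected)
by_hand
res$statistic
```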

    Correlation

    First, we compute correlation by providing two vectors.

    Like other functions, if there are NAs, you get NA as the result. But if you specify that only complete observations should be used (use = "complete.obs"), it computes the correlation from the non-missing data.

    -
    cor(df$age, df$IgG_concentration, method="pearson")
    +
    cor(df$age, df$IgG_concentration, method="pearson")
    [1] NA
    -
    cor(df$age, df$IgG_concentration, method="pearson", use = "complete.obs") #IF have missing data
    +
    cor(df$age, df$IgG_concentration, method="pearson", use = "complete.obs") #IF have missing data
    [1] 0.2604783

    Small positive correlation between IgG concentration and age.
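
The effect of missing values on `cor()` can be checked on a tiny made-up example (the vectors below are illustrative, not from the serodata file):

```r
# Hypothetical vectors; x has one missing value
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, 8, 10)

# Default behaviour: any NA in the inputs makes the result NA
r_all <- cor(x, y, method = "pearson")
r_all

# Drop incomplete pairs first, then compute the correlation
r_complete <- cor(x, y, method = "pearson", use = "complete.obs")
r_complete
```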

    +
    +

    Correlation confidence interval

    +

    The function cor.test() also gives you the confidence interval of the correlation statistic. Note, it uses complete observations by default.

    +
    +
    cor.test(df$age, df$IgG_concentration, method="pearson")
    +
    +
    
    +    Pearson's product-moment correlation
    +
    +data:  df$age and df$IgG_concentration
    +t = 6.7717, df = 630, p-value = 2.921e-11
    +alternative hypothesis: true correlation is not equal to 0
    +95 percent confidence interval:
    + 0.1862722 0.3317295
    +sample estimates:
    +      cor 
    +0.2604783 
    +
    +
    +

    T-test

    The most commonly used t-tests are:

    • one-sample t-test – used to test mean of a variable in one group (to the null hypothesis mean)
    • -
    • two-sample t-test – used to test difference in means of a variable between two groups (null hypothesis - the group means are the same); if “two groups” are data of the same individuals collected at 2 time points, we say it is two-sample paired t-test
    • +
    • two-sample t-test – used to test difference in means of a variable between two groups (null hypothesis - the group means are the same)

    T-test

    We can use the t.test() function from the stats package.

    +
    +
    ?t.test
    +

    Student’s t-Test

    Description:

     Performs one and two sample t-tests on vectors of data.
    @@ -769,10 +907,10 @@

    Running two-sample t-test

    Running two-sample t-test

    -
    IgG_young <- df$IgG_concentration[df$age_group=="young"]
    -IgG_old <- df$IgG_concentration[df$age_group=="old"]
    -
    -t.test(IgG_young, IgG_old)
    +
    IgG_young <- df$IgG_concentration[df$age_group=="young"]
    +IgG_old <- df$IgG_concentration[df$age_group=="old"]
    +
    +t.test(IgG_young, IgG_old)
    
         Welch Two Sample t-test
    @@ -787,11 +925,14 @@ 

    Running two-sample t-test

    45.05056 129.35454
    -

    The mean IgG concenration of young and old is 45.05 and 129.35 mIU/mL, respectively. We reject null hypothesis that the difference in the mean IgG concentration of young and old is 0 mIU/mL.

    +

    The mean IgG concentration of young and old is 45.05 and 129.35 IU/mL, respectively. We reject the null hypothesis that the difference in the mean IgG concentration of young and old is 0 IU/mL.
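
The numbers quoted in an interpretation like this can be pulled out of the object that `t.test()` returns instead of being read off the printed output. A minimal sketch on simulated data (the group means and spreads below are invented for illustration):

```r
set.seed(1)

# Simulated stand-ins for the two groups; not the course data
IgG_young_sim <- rnorm(30, mean = 45,  sd = 20)
IgG_old_sim   <- rnorm(30, mean = 130, sd = 40)

res <- t.test(IgG_young_sim, IgG_old_sim)

res$method    # which test was run (Welch by default for two samples)
res$estimate  # the two group means
res$conf.int  # confidence interval for the difference in means
res$p.value   # p-value for the null hypothesis of no difference
```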

    Linear regression fit in R

    To fit regression models in R, we use the function glm() (Generalized Linear Model).

    +
    +
    ?glm
    +

    Fitting Generalized Linear Models

    Description:

     'glm' is used to fit generalized linear models, specified by
    @@ -1078,11 +1219,11 @@ 

    Linear regression fit in R

    • formula – model formula written using names of columns in our data
    • data – our data frame
    • -
    •   `family` -- error distribution and link function
    • +
    • family – error distribution and link function
    -
    fit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())
    -fit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = "logit"))
    +
    fit1 <- glm(IgG_concentration~age+gender+slum, data=df, family=gaussian())
    +fit2 <- glm(seropos~age_group+gender+slum, data=df, family = binomial(link = "logit"))
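
As a minimal sketch of the logistic case on a dataset that ships with R (`mtcars`, standing in for the course data), the coefficients of a binomial fit are on the log-odds scale, so exponentiating them gives odds ratios:

```r
# Binary outcome `am` (0 = automatic, 1 = manual) modelled on car weight
fit_demo <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Coefficients on the log-odds scale
coef(fit_demo)

# Exponentiate to read them as odds ratios
odds_ratios <- exp(coef(fit_demo))
odds_ratios
```

An odds ratio below 1 for `wt` means heavier cars in this dataset are less likely to have a manual transmission.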
    @@ -1178,7 +1319,7 @@

    summary.glm()

    Linear regression fit in R

    Let’s look at the output…

    -
    summary(fit1)
    +
    summary(fit1)
    
     Call:
    @@ -1204,7 +1345,7 @@ 

    Linear regression fit in R

    Number of Fisher Scoring iterations: 2
    -
    summary(fit2)
    +
    summary(fit2)
    
     Call:
    @@ -1236,23 +1377,20 @@ 

    Linear regression fit in R

    Summary

      -
    •   Use `cor()` to calculate correlation between two numeric vectors.
    • -
    • corrplot() and ggpairs() is nice for a quick visualization of correlations
    • -
    • t.test() or t_test() tests the mean compared to null or difference in means between two groups
    • +
    • Use cor() or cor.test() to calculate correlation between two numeric vectors.
    • +
    • t.test() tests the mean compared to null or difference in means between two groups
    •   ... xxamy more

    Acknowledgements

    -

    These are the materials I looked through, modified, or extracted to complete this module’s lecture.

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    -
    @@ -1281,6 +1419,7 @@

    Acknowledgements

    Reveal.initialize({ 'controlsAuto': true, 'previewLinksAuto': false, +'smaller': true, 'pdfSeparateFragments': false, 'autoAnimateEasing': "ease", 'autoAnimateDuration': 1, @@ -1535,7 +1674,18 @@

    Acknowledgements

    } return false; } - const onCopySuccess = function(e) { + const clipboard = new window.ClipboardJS('.code-copy-button', { + text: function(trigger) { + const codeEl = trigger.previousElementSibling.cloneNode(true); + for (const childEl of codeEl.children) { + if (isCodeAnnotation(childEl)) { + childEl.remove(); + } + } + return codeEl.innerText; + } + }); + clipboard.on('success', function(e) { // button target const button = e.trigger; // don't keep focus @@ -1567,50 +1717,11 @@

    Acknowledgements

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iAcknowledgements interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } 
config['offset'] = [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -1644,11 +1746,7 @@

    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module10-DataVisualization.html b/docs/modules/Module10-DataVisualization.html index 8e82f1d..157bea4 100644 --- a/docs/modules/Module10-DataVisualization.html +++ b/docs/modules/Module10-DataVisualization.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Module 10: Data Visualization + SISMID Module NUMBER Materials (2025) - Module 10: Data Visualization @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; font-weight: 
bold; } /* Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -408,36 +407,6 @@

    Module 10: Data Visualization

    -
    -

    Learning Objectives

    @@ -450,8 +419,8 @@

    Learning Objectives

    Import data for this module

    Let’s read in our data (again) and take a quick look.

    -
    df <- read.csv(file = "data/serodata.csv") #relative path
    -head(x=df, n=3)
    +
    df <- read.csv(file = "data/serodata.csv") #relative path
    +head(x=df, n=3)
      observation_id IgG_concentration age gender     slum
     1           5772         0.3176895   2 Female Non slum
    @@ -464,22 +433,20 @@ 

    Import data for this module

    Prep data

    Create age_group three level factor variable

    -
    df$age_group <- ifelse(df$age <= 5, "young", 
    -                       ifelse(df$age<=10 & df$age>5, "middle", 
    -                              ifelse(df$age>10, "old", NA)))
    -df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
    +
    df$age_group <- ifelse(df$age <= 5, "young", 
    +                       ifelse(df$age<=10 & df$age>5, "middle", "old")) 
    +df$age_group <- factor(df$age_group, levels=c("young", "middle", "old"))
    -

    Create seropos binary variable representing seropositivity if antibody concentrations are >10 mIUmL.

    +

    Create seropos binary variable representing seropositivity if antibody concentrations are >10 IU/mL.

    -
    df$seropos <- ifelse(df$IgG_concentration<10, 0, 
    -                                        ifelse(df$IgG_concentration>=10, 1, NA))
    +
    df$seropos <- ifelse(df$IgG_concentration<10, 0, 1)

    Base R data visualization functions

    The Base R ‘graphics’ package has a ton of graphics options.

    -
    library(help = "graphics")
    +
    help(package = "graphics")
    @@ -586,7 +553,7 @@

    Base R data visualization functions

    -
    +

    Base R Plotting

    To make a plot you often need to specify the following features:

      @@ -598,15 +565,22 @@

      Base R Plotting

      1. Parameters

      The parameter section fixes the settings for all your plots, basically the plot options. Adding attributes via par() before you call the plot creates ‘global’ settings for your plot.

      -

      In the example below, we have set two commonly used optional attributes in the global plot settings. - The mfrow specifies that we have one row and two columns of plots — that is, two plots side by side. - The mar attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second indicating the margin to the left of the plot (5), the third, the top of the plot(4), and the fourth to the left (1).

      +

      In the example below, we have set two commonly used optional attributes in the global plot settings.

      +
        +
      • The mfrow specifies that we have one row and two columns of plots — that is, two plots side by side.
      • +
      • The mar attribute is a vector of our margin widths, with the first value indicating the margin below the plot (5), the second the margin to the left of the plot (5), the third the margin above the plot (4), and the fourth the margin to the right of the plot (1).
      • +
      par(mfrow = c(1,2), mar = c(5,5,4,1))
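
Because `par()` changes settings globally, one common pattern is to save the previous values and restore them afterwards. The sketch below assumes a null PDF device so that it can run non-interactively, and uses the built-in `mtcars` data purely as a placeholder.

```r
pdf(NULL)  # null device: nothing is written to disk or shown on screen

# par() returns the previous values of the settings it changes
old_par <- par(mfrow = c(1, 2), mar = c(5, 5, 4, 1))

hist(mtcars$mpg, main = "Left panel")
plot(mtcars$wt, mtcars$mpg, main = "Right panel")

par(old_par)               # restore the earlier settings
restored <- par("mfrow")   # back to a single panel
restored

dev.off()
```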
      +
      +
      +

      1. Parameters

      -
      +

    Lots of parameter options

    However, there are many more parameter options that can be specified in the ‘global’ settings or specific to a certain plot option.

    -
    ?par
    +
    ?par

    Set or Query Graphical Parameters

    Description:

    @@ -1266,7 +1240,7 @@

    Lots of parameters options

    Common parameter options

    -

    Six useful parameter arguments help improve the readability of the plot:

    +

    Eight useful parameter arguments help improve the readability of the plot:

    • xlab: specifies the x-axis label of the plot
    • ylab: specifies the y-axis label
    • @@ -1297,7 +1271,7 @@

      2. Plot Attributes

      hist() Help File

      -
      ?hist
      +
      ?hist

      Histograms

      Description:

      @@ -1474,7 +1448,7 @@

      hist() Help File

      hist() example

      -

      Reminder

      +

      Reminder function signature

      hist(x, breaks = "Sturges",
            freq = NULL, probability = !freq,
            include.lowest = TRUE, right = TRUE, fuzz = 1e-7,
      @@ -1486,33 +1460,25 @@ 

      hist() example

      nclass = NULL, warn.unused = TRUE, ...)

      Let’s practice

      -
      hist(df$age)
      +
      hist(df$age)
      -
      -

      -
      -
      -
      hist(
      -    df$age, 
      -    freq=FALSE, 
      -    main="Histogram", 
      -    xlab="Age (years)"
      -    )
      +
      hist(
      +    df$age, 
      +    freq=FALSE, 
      +    main="Histogram", 
      +    xlab="Age (years)"
      +    )
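
The `breaks` argument from the signature above controls the bins, and `plot = FALSE` returns the binned counts without drawing. A small sketch on simulated ages (the vector is made up, not the course data):

```r
set.seed(42)

# Simulated ages between 1 and 15 years
age_sim <- sample(1:15, size = 200, replace = TRUE)

# Explicit breaks give the bins (0,5], (5,10], (10,15]
h <- hist(age_sim, breaks = seq(0, 15, by = 5), plot = FALSE)

h$breaks  # bin edges
h$counts  # number of observations in each bin
```

With these breaks the three bins line up with the young, middle, and old age groups defined earlier.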
      -
      -

      -
      -

      plot() Help File

      -
      ?plot
      +
      ?plot

      Generic X-Y Plotting

      Description:

      @@ -1612,37 +1578,29 @@

      plot() Help File

      plot() example

      -
      plot(df$age, df$IgG_concentration)
      +
      plot(df$age, df$IgG_concentration)
      -
      -

      -
      -
      -
      plot(
      -    df$age, 
      -    df$IgG_concentration, 
      -    type="p", 
      -    main="Age by IgG Concentrations", 
      -    xlab="Age (years)", 
      -    ylab="IgG Concentration (mIU/mL)", 
      -    pch=16, 
      -    cex=0.9,
      -    col="lightblue")
      +
      plot(
      +    df$age, 
      +    df$IgG_concentration, 
      +    type="p", 
      +    main="Age by IgG Concentrations", 
      +    xlab="Age (years)", 
      +    ylab="IgG Concentration (IU/mL)", 
      +    pch=16, 
      +    cex=0.9,
      +    col="lightblue")
      -
      -

      -
      -

      boxplot() Help File

      -
      ?boxplot
      +
      ?boxplot

      Box Plots

      Description:

      @@ -1813,7 +1771,7 @@

      boxplot() Help File

      boxplot() example

      -

      Reminder

      +

      Reminder function signature

      boxplot(formula, data = NULL, ..., subset, na.action = NULL,
               xlab = mklab(y_var = horizontal),
               ylab = mklab(y_var =!horizontal),
      @@ -1821,207 +1779,209 @@ 

      boxplot() example

      drop = FALSE, sep = ".", lex.order = FALSE)

      Let’s practice

      -
      boxplot(IgG_concentration~age_group, data=df)
      +
      boxplot(IgG_concentration~age_group, data=df)
      -
      -

      -
      -
      -
      boxplot(
      -    log(df$IgG_concentration)~df$age_group, 
      -    main="Age by IgG Concentrations", 
      -    xlab="Age Group (years)", 
      -    ylab="log IgG Concentration (mIU/mL)", 
      -    names=c("1-5","6-10", "11-15"), 
      -    varwidth=T
      -    )
      +
      boxplot(
      +    log(df$IgG_concentration)~df$age_group, 
      +    main="Age by IgG Concentrations", 
      +    xlab="Age Group (years)", 
      +    ylab="log IgG Concentration (mIU/mL)", 
      +    names=c("1-5","6-10", "11-15"), 
      +    varwidth=T
      +    )
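
The five-number summaries behind each box can be retrieved with `plot = FALSE`. This is a sketch on a made-up data frame (`df_toy`, with invented `group` and `value` columns), using the same formula interface and log transform as above:

```r
# Hypothetical data: two groups with values on very different scales
df_toy <- data.frame(
  group = factor(rep(c("a", "b"), each = 5)),
  value = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# plot = FALSE returns the statistics instead of drawing the boxes
b <- boxplot(log(value) ~ group, data = df_toy, plot = FALSE)

b$names  # group labels, in plotting order
b$stats  # one column per group: lower whisker, hinge, median, hinge, upper whisker
```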
      -
      -

      -
      -

      barplot() Help File

      -
      ?barplot
      +
      ?barplot
      -

      Box Plots

      +

      Bar Plots

      Description:

      -
       Produce box-and-whisker plot(s) of the given (grouped) values.
      +
       Creates a bar plot with vertical or horizontal bars.

      Usage:

      -
       boxplot(x, ...)
      - 
      - ## S3 method for class 'formula'
      - boxplot(formula, data = NULL, ..., subset, na.action = NULL,
      -         xlab = mklab(y_var = horizontal),
      -         ylab = mklab(y_var =!horizontal),
      -         add = FALSE, ann = !add, horizontal = FALSE,
      -         drop = FALSE, sep = ".", lex.order = FALSE)
      +
       barplot(height, ...)
        
        ## Default S3 method:
      - boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE,
      -         notch = FALSE, outline = TRUE, names, plot = TRUE,
      -         border = par("fg"), col = "lightgray", log = "",
      -         pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
      -          ann = !add, horizontal = FALSE, add = FALSE, at = NULL)
      + barplot(height, width = 1, space = NULL,
      +         names.arg = NULL, legend.text = NULL, beside = FALSE,
      +         horiz = FALSE, density = NULL, angle = 45,
      +         col = NULL, border = par("fg"),
      +         main = NULL, sub = NULL, xlab = NULL, ylab = NULL,
      +         xlim = NULL, ylim = NULL, xpd = TRUE, log = "",
      +         axes = TRUE, axisnames = TRUE,
      +         cex.axis = par("cex.axis"), cex.names = par("cex.axis"),
      +         inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0,
      +         add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)
      + 
      + ## S3 method for class 'formula'
      + barplot(formula, data, subset, na.action,
      +         horiz = FALSE, xlab = NULL, ylab = NULL, ...)
        

      Arguments:

      -

      formula: a formula, such as ‘y ~ grp’, where ‘y’ is a numeric vector of data values to be split into groups according to the grouping variable ‘grp’ (usually a factor). Note that ‘~ g1 + g2’ is equivalent to ‘g1:g2’.

      -
      data: a data.frame (or list) from which the variables in 'formula'
      +

      height: either a vector or matrix of values describing the bars which make up the plot. If ‘height’ is a vector, the plot consists of a sequence of rectangular bars with heights given by the values in the vector. If ‘height’ is a matrix and ‘beside’ is ‘FALSE’ then each bar of the plot corresponds to a column of ‘height’, with the values in the column giving the heights of stacked sub-bars making up the bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, then the values in each column are juxtaposed rather than stacked.

      +

      width: optional vector of bar widths. Re-cycled to length the number of bars drawn. Specifying a single value will have no visible effect unless ‘xlim’ is specified.

      +

      space: the amount of space (as a fraction of the average bar width) left before each bar. May be given as a single number or one number per bar. If ‘height’ is a matrix and ‘beside’ is ‘TRUE’, ‘space’ may be specified by two numbers, where the first is the space between bars in the same group, and the second the space between the groups. If not given explicitly, it defaults to ‘c(0,1)’ if ‘height’ is a matrix and ‘beside’ is ‘TRUE’, and to 0.2 otherwise.

      +

      names.arg: a vector of names to be plotted below each bar or group of bars. If this argument is omitted, then the names are taken from the ‘names’ attribute of ‘height’ if this is a vector, or the column names if it is a matrix.

      +

      legend.text: a vector of text used to construct a legend for the plot, or a logical indicating whether a legend should be included. This is only useful when ‘height’ is a matrix. In that case given legend labels should correspond to the rows of ‘height’; if ‘legend.text’ is true, the row names of ‘height’ will be used as labels if they are non-null.

      +

      beside: a logical value. If ‘FALSE’, the columns of ‘height’ are portrayed as stacked bars, and if ‘TRUE’ the columns are portrayed as juxtaposed bars.

      +

      horiz: a logical value. If ‘FALSE’, the bars are drawn vertically with the first bar to the left. If ‘TRUE’, the bars are drawn horizontally with the first at the bottom.

      +

      density: a vector giving the density of shading lines, in lines per inch, for the bars or bar components. The default value of ‘NULL’ means that no shading lines are drawn. Non-positive values of ‘density’ also inhibit the drawing of shading lines.

      +

      angle: the slope of shading lines, given as an angle in degrees (counter-clockwise), for the bars or bar components.

      +
       col: a vector of colors for the bars or bar components.  By
      +      default, '"grey"' is used if 'height' is a vector, and a
      +      gamma-corrected grey palette if 'height' is a matrix; see
      +      'grey.colors'.
      +

      border: the color to be used for the border of the bars. Use ‘border = NA’ to omit borders. If there are shading lines, ‘border = TRUE’ means use the same colour for the border as for the shading lines.

      +

      main,sub: main title and subtitle for the plot.

      +
      xlab: a label for the x axis.
      +
      +ylab: a label for the y axis.
      +
      +xlim: limits for the x axis.
      +
      +ylim: limits for the y axis.
      +
      + xpd: logical. Should bars be allowed to go outside region?
      +
      + log: string specifying if axis scales should be logarithmic; see
      +      'plot.default'.
      +
      +axes: logical.  If 'TRUE', a vertical (or horizontal, if 'horiz' is
      +      true) axis is drawn.
      +

      axisnames: logical. If ‘TRUE’, and if there are ‘names.arg’ (see above), the other axis is drawn (with ‘lty = 0’) and labeled.

      +

      cex.axis: expansion factor for numeric axis labels (see ‘par("cex")’).

      +

      cex.names: expansion factor for axis names (bar labels).

      +

      inside: logical. If ‘TRUE’, the lines which divide adjacent (non-stacked!) bars will be drawn. Only applies when ‘space = 0’ (which it partly is when ‘beside = TRUE’).

      +
      plot: logical.  If 'FALSE', nothing is plotted.
      +

      axis.lty: the graphics parameter ‘lty’ (see ‘par("lty")’) applied to the axis and tick marks of the categorical (default horizontal) axis. Note that by default the axis is suppressed.

      +

      offset: a vector indicating how much the bars should be shifted relative to the x axis.

      +
       add: logical specifying if bars should be added to an already
      +      existing plot; defaults to 'FALSE'.
      +
      + ann: logical specifying if the default annotation ('main', 'sub',
      +      'xlab', 'ylab') should appear on the plot, see 'title'.
      +

      args.legend: list of additional arguments to pass to ‘legend()’; names of the list are used as argument names. Only used if ‘legend.text’ is supplied.

      +

      formula: a formula where the ‘y’ variables are numeric data to plot against the categorical ‘x’ variables. The formula can have one of three forms:

      +
                  y ~ x
      +            y ~ x1 + x2
      +            cbind(y1, y2) ~ x
      +      
      +      (see the examples).
      +
      +data: a data frame (or list) from which the variables in formula
             should be taken.
      -

      subset: an optional vector specifying a subset of observations to be used for plotting.

      -

      na.action: a function which indicates what should happen when the data contain ‘NA’s. The default is to ignore missing values in either the response or the group.

      -

      xlab, ylab: x- and y-axis annotation, since R 3.6.0 with a non-empty default. Can be suppressed by ‘ann=FALSE’.

      -
       ann: 'logical' indicating if axes should be annotated (by 'xlab'
      -      and 'ylab').
      -

      drop, sep, lex.order: passed to ‘split.default’, see there.

      -
         x: for specifying data from which the boxplots are to be
      -      produced. Either a numeric vector, or a single list
      -      containing such vectors. Additional unnamed arguments specify
      -      further data as separate vectors (each corresponding to a
      -      component boxplot).  'NA's are allowed in the data.
      -
      - ...: For the 'formula' method, named arguments to be passed to the
      -      default method.
      -
      -      For the default method, unnamed arguments are additional data
      -      vectors (unless 'x' is a list when they are ignored), and
      -      named arguments are arguments and graphical parameters to be
      -      passed to 'bxp' in addition to the ones given by argument
      -      'pars' (and override those in 'pars'). Note that 'bxp' may or
      -      may not make use of graphical parameters it is passed: see
      -      its documentation.
      -

      range: this determines how far the plot whiskers extend out from the box. If ‘range’ is positive, the whiskers extend to the most extreme data point which is no more than ‘range’ times the interquartile range from the box. A value of zero causes the whiskers to extend to the data extremes.

      -

      width: a vector giving the relative widths of the boxes making up the plot.

      -

      varwidth: if ‘varwidth’ is ‘TRUE’, the boxes are drawn with widths proportional to the square-roots of the number of observations in the groups.

      -

      notch: if ‘notch’ is ‘TRUE’, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap this is ‘strong evidence’ that the two medians differ (Chambers et al, 1983, p. 62). See ‘boxplot.stats’ for the calculations used.

      -

      outline: if ‘outline’ is not true, the outliers are not drawn (as points whereas S+ uses lines).

      -

      names: group labels which will be printed under each boxplot. Can be a character vector or an expression (see plotmath).

      -

      boxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.

      -

      staplewex: staple line width expansion, proportional to box width.

      -

      outwex: outlier line width expansion, proportional to box width.

      -
      plot: if 'TRUE' (the default) then a boxplot is produced.  If not,
      -      the summaries which the boxplots are based on are returned.
      -

      border: an optional vector of colors for the outlines of the boxplots. The values in ‘border’ are recycled if the length of ‘border’ is less than the number of plots.

      -
       col: if 'col' is non-null it is assumed to contain colors to be
      -      used to colour the bodies of the box plots. By default they
      -      are in the background colour.
      -
      - log: character indicating if x or y or both coordinates should be
      -      plotted in log scale.
      -
      -pars: a list of (potentially many) more graphical parameters, e.g.,
      -      'boxwex' or 'outpch'; these are passed to 'bxp' (if 'plot' is
      -      true); for details, see there.
      -

      horizontal: logical indicating if the boxplots should be horizontal; default ‘FALSE’ means vertical boxes.

      -
       add: logical, if true _add_ boxplot to current plot.
      -
      -  at: numeric vector giving the locations where the boxplots should
      -      be drawn, particularly when 'add = TRUE'; defaults to '1:n'
      -      where 'n' is the number of boxes.
      -

      Details:

      -
       The generic function 'boxplot' currently has a default method
      - ('boxplot.default') and a formula interface ('boxplot.formula').
      -
      - If multiple groups are supplied either as multiple arguments or
      - via a formula, parallel boxplots will be plotted, in the order of
      - the arguments or the order of the levels of the factor (see
      - 'factor').
      -
      - Missing values are ignored when forming boxplots.
      +

      subset: an optional vector specifying a subset of observations to be used.

      +

      na.action: a function which indicates what should happen when the data contain ‘NA’ values. The default is to ignore missing values in the given variables.

      +
       ...: arguments to be passed to/from other methods.  For the
      +      default method these can include further arguments (such as
      +      'axes', 'asp' and 'main') and graphical parameters (see
      +      'par') which are passed to 'plot.window()', 'title()' and
      +      'axis'.

      Value:

      -
       List with the following components:
      -

      stats: a matrix, each column contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.

      -
         n: a vector with the number of (non-'NA') observations in each
      -      group.
      -
      -conf: a matrix where each column contains the lower and upper
      -      extremes of the notch.
      -
      - out: the values of any data points which lie beyond the extremes
      -      of the whiskers.
      -

      group: a vector of the same length as ‘out’ whose elements indicate to which group the outlier belongs.

      -

      names: a vector of names for the groups.

      +
       A numeric vector (or matrix, when 'beside = TRUE'), say 'mp',
      + giving the coordinates of _all_ the bar midpoints drawn, useful
      + for adding to the graph.
      +
      + If 'beside' is true, use 'colMeans(mp)' for the midpoints of each
      + _group_ of bars, see example.
      +
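      The returned midpoints are easiest to understand by using them. The following is a minimal sketch, not part of the help page, that uses the built-in `VADeaths` matrix; the label offset and `ylim` values are arbitrary choices for illustration.

```r
# Sketch: label each bar using the midpoints returned by barplot().
# VADeaths is a built-in matrix (5 age groups x 4 populations).
totals <- colSums(VADeaths)                 # height of each stacked bar
mp <- barplot(VADeaths, ylim = c(0, 250))   # mp: one midpoint per stacked bar
text(mp, totals + 8, labels = format(totals), xpd = TRUE)

# With beside = TRUE, mp is a matrix with the same shape as the data,
# so colMeans(mp) gives the centre of each group of bars.
mp_beside <- barplot(VADeaths, beside = TRUE)
group_centres <- colMeans(mp_beside)
```

      Because `mp` holds x coordinates, it can be passed to `text()`, `segments()`, or `mtext()` to annotate bars without guessing positions.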

      Author(s):

      +
       R Core, with a contribution by Arni Magnusson.

      References:

      -
       Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988).  _The New
      - S Language_.  Wadsworth & Brooks/Cole.
      -
      - Chambers, J. M., Cleveland, W. S., Kleiner, B. and Tukey, P. A.
      - (1983).  _Graphical Methods for Data Analysis_.  Wadsworth &
      - Brooks/Cole.
      -
      - Murrell, P. (2005).  _R Graphics_.  Chapman & Hall/CRC Press.
      +
       Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
      + Language_.  Wadsworth & Brooks/Cole.
       
      - See also 'boxplot.stats'.
      + Murrell, P. (2005) _R Graphics_. Chapman & Hall/CRC Press.

      See Also:

      -
       'boxplot.stats' which does the computation, 'bxp' for the plotting
      - and more examples; and 'stripchart' for an alternative (with small
      - data sets).
      +
       'plot(..., type = "h")', 'dotchart'; 'hist' for bars of a
      + _continuous_ variable.  'mosaicplot()', more sophisticated to
      + visualize _several_ categorical variables.

      Examples:

      -
       ## boxplot on a formula:
      - boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
      - # *add* notches (somewhat funny here <--> warning "notches .. outside hinges"):
      - boxplot(count ~ spray, data = InsectSprays,
      -         notch = TRUE, add = TRUE, col = "blue")
      +
       # Formula method
      + barplot(GNP ~ Year, data = longley)
      + barplot(cbind(Employed, Unemployed) ~ Year, data = longley)
        
      - boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
      -         log = "y")
      - ## horizontal=TRUE, switching  y <--> x :
      - boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque",
      -         log = "x", horizontal=TRUE)
      + ## 3rd form of formula - 2 categories :
      + op <- par(mfrow = 2:1, mgp = c(3,1,0)/2, mar = .1+c(3,3:1))
      + summary(d.Titanic <- as.data.frame(Titanic))
      + barplot(Freq ~ Class + Survived, data = d.Titanic,
      +         subset = Age == "Adult" & Sex == "Male",
      +         main = "barplot(Freq ~ Class + Survived, *)", ylab = "# {passengers}", legend.text = TRUE)
      + # Corresponding table :
      + (xt <- xtabs(Freq ~ Survived + Class + Sex, d.Titanic, subset = Age=="Adult"))
      + # Alternatively, a mosaic plot :
      + mosaicplot(xt[,,"Male"], main = "mosaicplot(Freq ~ Class + Survived, *)", color=TRUE)
      + par(op)
        
      - rb <- boxplot(decrease ~ treatment, data = OrchardSprays, col = "bisque")
      - title("Comparing boxplot()s and non-robust mean +/- SD")
      - mn.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, mean)
      - sd.t <- tapply(OrchardSprays$decrease, OrchardSprays$treatment, sd)
      - xi <- 0.3 + seq(rb$n)
      - points(xi, mn.t, col = "orange", pch = 18)
      - arrows(xi, mn.t - sd.t, xi, mn.t + sd.t,
      -        code = 3, col = "pink", angle = 75, length = .1)
        
      - ## boxplot on a matrix:
      - mat <- cbind(Uni05 = (1:100)/21, Norm = rnorm(100),
      -              `5T` = rt(100, df = 5), Gam2 = rgamma(100, shape = 2))
      - boxplot(mat) # directly, calling boxplot.matrix()
      + # Default method
      + require(grDevices) # for colours
      + tN <- table(Ni <- stats::rpois(100, lambda = 5))
      + r <- barplot(tN, col = rainbow(20))
      + #- type = "h" plotting *is* 'bar'plot
      + lines(r, tN, type = "h", col = "red", lwd = 2)
        
      - ## boxplot on a data frame:
      - df. <- as.data.frame(mat)
      - par(las = 1) # all axis labels horizontal
      - boxplot(df., main = "boxplot(*, horizontal = TRUE)", horizontal = TRUE)
      + barplot(tN, space = 1.5, axisnames = FALSE,
      +         sub = "barplot(..., space= 1.5, axisnames = FALSE)")
        
      - ## Using 'at = ' and adding boxplots -- example idea by Roger Bivand :
      - boxplot(len ~ dose, data = ToothGrowth,
      -         boxwex = 0.25, at = 1:3 - 0.2,
      -         subset = supp == "VC", col = "yellow",
      -         main = "Guinea Pigs' Tooth Growth",
      -         xlab = "Vitamin C dose mg",
      -         ylab = "tooth length",
      -         xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
      - boxplot(len ~ dose, data = ToothGrowth, add = TRUE,
      -         boxwex = 0.25, at = 1:3 + 0.2,
      -         subset = supp == "OJ", col = "orange")
      - legend(2, 9, c("Ascorbic acid", "Orange juice"),
      -        fill = c("yellow", "orange"))
      + barplot(VADeaths, plot = FALSE)
      + barplot(VADeaths, plot = FALSE, beside = TRUE)
        
      - ## With less effort (slightly different) using factor *interaction*:
      - boxplot(len ~ dose:supp, data = ToothGrowth,
      -         boxwex = 0.5, col = c("orange", "yellow"),
      -         main = "Guinea Pigs' Tooth Growth",
      -         xlab = "Vitamin C dose mg", ylab = "tooth length",
      -         sep = ":", lex.order = TRUE, ylim = c(0, 35), yaxs = "i")
      + mp <- barplot(VADeaths) # default
      + tot <- colMeans(VADeaths)
      + text(mp, tot + 3, format(tot), xpd = TRUE, col = "blue")
      + barplot(VADeaths, beside = TRUE,
      +         col = c("lightblue", "mistyrose", "lightcyan",
      +                 "lavender", "cornsilk"),
      +         legend.text = rownames(VADeaths), ylim = c(0, 100))
      + title(main = "Death Rates in Virginia", font.main = 4)
        
      - ## more examples in  help(bxp)
      + hh <- t(VADeaths)[, 5:1]
      + mybarcol <- "gray20"
      + mp <- barplot(hh, beside = TRUE,
      +         col = c("lightblue", "mistyrose",
      +                 "lightcyan", "lavender"),
      +         legend.text = colnames(VADeaths), ylim = c(0,100),
      +         main = "Death Rates in Virginia", font.main = 4,
      +         sub = "Faked upper 2*sigma error bars", col.sub = mybarcol,
      +         cex.names = 1.5)
      + segments(mp, hh, mp, hh + 2*sqrt(1000*hh/100), col = mybarcol, lwd = 1.5)
      + stopifnot(dim(mp) == dim(hh))  # corresponding matrices
      + mtext(side = 1, at = colMeans(mp), line = -2,
      +       text = paste("Mean", formatC(colMeans(hh))), col = "red")
      +
      + # Bar shading example
      + barplot(VADeaths, angle = 15+10*1:5, density = 20, col = "black",
      +         legend.text = rownames(VADeaths))
      + title(main = list("Death Rates in Virginia", font = 4))
      +
      + # Border color
      + barplot(VADeaths, border = "dark blue")
      +
      +
      + # Log scales (not much sense here)
      + barplot(tN, col = heat.colors(12), log = "y")
      + barplot(tN, col = gray.colors(20), log = "xy")
      +
      + # Legend location
      + barplot(height = cbind(x = c(465, 91) / 465 * 100,
      +                        y = c(840, 200) / 840 * 100,
      +                        z = c(37, 17) / 37 * 100),
      +         beside = FALSE,
      +         width = c(465, 840, 37),
      +         col = c(1, 2),
      +         legend.text = c("A", "B"),
      +         args.legend = list(x = "topleft"))

      barplot() example

      The function takes a lot of arguments to control the way our data is plotted.

      -

      Reminder

      +

      Reminder function signature

      barplot(height, width = 1, space = NULL,
               names.arg = NULL, legend.text = NULL, beside = FALSE,
               horiz = FALSE, density = NULL, angle = 45,
      @@ -2033,23 +1993,15 @@ 

      barplot() example

      inside = TRUE, plot = TRUE, axis.lty = 0, offset = 0, add = FALSE, ann = !add && par("ann"), args.legend = NULL, ...)
      -
      freq <- table(df$seropos, df$age_group)
      -barplot(freq)
      +
      freq <- table(df$seropos, df$age_group)
      +barplot(freq)
      -
      -

      -
      -
      -
      prop <- prop.table(freq)
      -barplot(prop)
      +
      prop.cell.percentages <- prop.table(freq)
      +barplot(prop.cell.percentages)
      -
      -

      -
      -
      @@ -2057,7 +2009,7 @@

      barplot() example

      3. Legend!

      In base R plotting, the legend is not generated automatically. This gives you a great deal of control over how your legend looks, but it also makes it easy to mislabel your colors, symbols, line types, etc. Always check that the legend matches what was actually plotted.

      -
      ?legend
      +
      ?legend
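      One way to reduce the risk of a mislabeled legend is to define the colors once, with names, and reuse that same object for both the bars and the legend. The sketch below assumes a made-up count matrix (`counts`) and a hypothetical helper vector `status_cols`; neither comes from the course data.

```r
# Sketch: define colours once, with names, and reuse them for bars and legend.
# The counts are made up for illustration only.
counts <- matrix(c(30, 10, 25, 15, 12, 28), nrow = 2,
                 dimnames = list(c("seronegative", "seropositive"),
                                 c("1-5 yo", "6-10 yo", "11-15 yo")))
status_cols <- c(seronegative = "darkblue", seropositive = "red")

# Index the colours by row name so bar order and colour order cannot drift apart.
barplot(counts, col = status_cols[rownames(counts)],
        ylim = c(0, 60), main = "Illustrative counts by age group")
legend("topright", fill = status_cols, legend = names(status_cols))
```

      Because the legend labels are taken from `names(status_cols)`, changing a color or a label in one place updates both the plot and the legend.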
      @@ -2422,7 +2374,7 @@

      3. Legend!

      Add legend to the plot

      -

      Reminder

      +

      Reminder function signature

      legend(x, y = NULL, legend, fill = NULL, col = par("col"),
              border = "black", lty, lwd, pch,
              angle = 45, density = NULL, bty = "o", bg = par("bg"),
      @@ -2437,45 +2389,55 @@ 

      Add legend to the plot

      seg.len = 2)

      Let’s practice

      -
      barplot(prop, col=c("darkblue","red"), ylim=c(0,0.7), main="Seropositivity by Age Group")
      -legend(x=2.5, y=0.7,
      -             fill=c("darkblue","red"), 
      -             legend = c("seronegative", "seropositive"))
      - +
      barplot(prop.cell.percentages, col=c("darkblue","red"), ylim=c(0,0.5), main="Seropositivity by Age Group")
      +legend(x=2.5, y=0.5,
      +             fill=c("darkblue","red"), 
      +             legend = c("seronegative", "seropositive"))
      -
      +
    +
    +

    Add legend to the plot

    + +

    barplot() example

    Getting closer, but what I really want is column proportions (i.e., the proportions should sum to one for each age group). Also, the age groups need more meaningful names.

    -
    freq <- table(df$seropos, df$age_group)
    -tot.per.age.group <- colSums(freq)
    -age.seropos.matrix <- t(t(freq)/tot.per.age.group)
    -colnames(age.seropos.matrix) <- c("1-5 yo", "6-10 yo", "11-15 yo")
    -
    -barplot(age.seropos.matrix, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
    -axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    -legend(x=2.8, y=1.35,
    -             fill=c("darkblue","red"), 
    -             legend = c("seronegative", "seropositive"))
    - +
    freq <- table(df$seropos, df$age_group)
    +prop.column.percentages <- prop.table(freq, margin=2)
    +colnames(prop.column.percentages) <- c("1-5 yo", "6-10 yo", "11-15 yo")
    +
    +barplot(prop.column.percentages, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
    +axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    +legend(x=2.8, y=1.35,
    +             fill=c("darkblue","red"), 
    +             legend = c("seronegative", "seropositive"))
    -
    +
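    The change above swaps a manual column-wise division for `prop.table(freq, margin = 2)`. A small check on a made-up table (standing in for `table(df$seropos, df$age_group)`, which is not reproduced here) suggests the two approaches give the same column proportions:

```r
# Made-up 2 x 3 table of counts, used only to compare the two approaches.
freq <- matrix(c(30, 10, 25, 15, 12, 28), nrow = 2,
               dimnames = list(c("0", "1"), c("young", "middle", "old")))

# Manual version: transpose, divide each row by its total, transpose back.
manual <- t(t(freq) / colSums(freq))

# Built-in version: margin = 2 means proportions are computed within columns.
built_in <- prop.table(freq, margin = 2)

same_result <- isTRUE(all.equal(manual, built_in))
column_totals <- colSums(built_in)   # each age group should sum to 1
```

    With `margin = 1` the proportions would instead sum to one within each row, and with no `margin` they sum to one over the whole table, which is what the earlier cell-proportion plot showed.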

    barplot() example

    + +
    +
    +

    barplot() example

    Now, let's look at seropositivity by two individual-level characteristics in the same plot.

    -
    par(mfrow = c(1,2))
    -barplot(age.seropos.matrix, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
    -axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    -legend(x=1, y=1.35, fill=c("darkblue","red"), legend = c("seronegative", "seropositive"))
    -
    -barplot(slum.seropos.matrix, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Residence")
    -axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    -legend(x=1, y=1.35, fill=c("darkblue","red"),  legend = c("seronegative", "seropositive"))
    - +
    par(mfrow = c(1,2))
    +barplot(prop.column.percentages, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Age Group")
    +axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    +legend("topright",
    +             fill=c("darkblue","red"), 
    +             legend = c("seronegative", "seropositive"))
    +
    +barplot(prop.column.percentages2, col=c("darkblue","red"), ylim=c(0,1.35), main="Seropositivity by Residence")
    +axis(2, at = c(0.2, 0.4, 0.6, 0.8,1))
    +legend("topright", fill=c("darkblue","red"),  legend = c("seronegative", "seropositive"))
    -
    +
    +
    +

    barplot() example

    + +

    Summary

      @@ -2484,16 +2446,14 @@

      Summary

    Acknowledgements

    -

    These are the materials I looked through, modified, or extracted to complete this module’s lecture.

    +

    These are the materials we looked through, modified, or extracted to complete this module’s lecture.

    -
    @@ -2522,6 +2482,7 @@

    Acknowledgements

    Reveal.initialize({
    'controlsAuto': true,
    'previewLinksAuto': false,
    +'smaller': true,
    'pdfSeparateFragments': false,
    'autoAnimateEasing': "ease",
    'autoAnimateDuration': 1,
    @@ -2776,7 +2737,18 @@

    Acknowledgements

    }
    return false;
    }
    - const onCopySuccess = function(e) {
    + const clipboard = new window.ClipboardJS('.code-copy-button', {
    +   text: function(trigger) {
    +     const codeEl = trigger.previousElementSibling.cloneNode(true);
    +     for (const childEl of codeEl.children) {
    +       if (isCodeAnnotation(childEl)) {
    +         childEl.remove();
    +       }
    +     }
    +     return codeEl.innerText;
    +   }
    + });
    + clipboard.on('success', function(e) {
    // button target
    const button = e.trigger;
    // don't keep focus
    @@ -2808,50 +2780,11 @@

    Acknowledgements

    }, 1000); // clear code selection e.clearSelection(); - } - const getTextToCopy = function(trigger) { - const codeEl = trigger.previousElementSibling.cloneNode(true); - for (const childEl of codeEl.children) { - if (isCodeAnnotation(childEl)) { - childEl.remove(); - } - } - return codeEl.innerText; - } - const clipboard = new window.ClipboardJS('.code-copy-button:not([data-in-quarto-modal])', { - text: getTextToCopy }); - clipboard.on('success', onCopySuccess); - if (window.document.getElementById('quarto-embedded-source-code-modal')) { - // For code content inside modals, clipBoardJS needs to be initialized with a container option - // TODO: Check when it could be a function (https://github.com/zenorocha/clipboard.js/issues/860) - const clipboardModal = new window.ClipboardJS('.code-copy-button[data-in-quarto-modal]', { - text: getTextToCopy, - container: window.document.getElementById('quarto-embedded-source-code-modal') - }); - clipboardModal.on('success', onCopySuccess); - } - var localhostRegex = new RegExp(/^(?:http|https):\/\/localhost\:?[0-9]*\//); - var mailtoRegex = new RegExp(/^mailto:/); - var filterRegex = new RegExp('/' + window.location.host + '/'); - var isInternal = (href) => { - return filterRegex.test(href) || localhostRegex.test(href) || mailtoRegex.test(href); - } - // Inspect non-navigation links and adorn them if external - var links = window.document.querySelectorAll('a[href]:not(.nav-link):not(.navbar-brand):not(.toc-action):not(.sidebar-link):not(.sidebar-item-toggle):not(.pagination-link):not(.no-external):not([aria-hidden]):not(.dropdown-item):not(.quarto-navigation-tool):not(.about-link)'); - for (var i=0; iAcknowledgements interactive: true, interactiveBorder: 10, theme: 'light-border', - placement: 'bottom-start', + placement: 'bottom-start' }; - if (contentFn) { - config.content = contentFn; - } - if (onTriggerFn) { - config.onTrigger = onTriggerFn; - } - if (onUntriggerFn) { - config.onUntrigger = onUntriggerFn; - } 
config['offset'] = [0,0]; config['maxWidth'] = 700; window.tippy(el, config); @@ -2885,11 +2809,7 @@

    Acknowledgements

    try { href = new URL(href).hash; } catch {} const id = href.replace(/^#\/?/, ""); const note = window.document.getElementById(id); - if (note) { - return note.innerHTML; - } else { - return ""; - } + return note.innerHTML; }); } const findCites = (el) => { diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png index 5656265..4e5c9c8 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-15-2.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-24-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-24-1.png deleted file mode 100644 index 5313a2a..0000000 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-24-1.png and /dev/null differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png index 232d44e..edfae88 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-25-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png index 1abfaa6..232d44e 100644 Binary files a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-27-1.png differ diff --git a/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-30-1.png 
b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-30-1.png new file mode 100644 index 0000000..c6eb02c Binary files /dev/null and b/docs/modules/Module10-DataVisualization_files/figure-revealjs/unnamed-chunk-30-1.png differ diff --git a/docs/modules/ModuleXX-Iteration.html b/docs/modules/ModuleXX-Iteration.html index adee8c8..8c66c94 100644 --- a/docs/modules/ModuleXX-Iteration.html +++ b/docs/modules/ModuleXX-Iteration.html @@ -8,11 +8,11 @@ - + - SISMID Module NUMBER Materials (2025) – Iteration in R + SISMID Module NUMBER Materials (2025) - Iteration in R @@ -32,7 +32,7 @@ } /* CSS for syntax highlighting */ pre > code.sourceCode { white-space: pre; position: relative; } - pre > code.sourceCode > span { line-height: 1.25; } + pre > code.sourceCode > span { display: inline-block; line-height: 1.25; } pre > code.sourceCode > span:empty { height: 1.2em; } .sourceCode { overflow: visible; } code.sourceCode > span { color: inherit; text-decoration: inherit; } @@ -43,7 +43,7 @@ } @media print { pre > code.sourceCode { white-space: pre-wrap; } - pre > code.sourceCode > span { display: inline-block; text-indent: -5em; padding-left: 5em; } + pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; } } pre.numberSource code { counter-reset: source-line 0; } @@ -71,7 +71,7 @@ code span.at { color: #657422; } /* Attribute */ code span.bn { color: #ad0000; } /* BaseN */ code span.bu { } /* BuiltIn */ - code span.cf { color: #003b4f; font-weight: bold; } /* ControlFlow */ + code span.cf { color: #003b4f; } /* ControlFlow */ code span.ch { color: #20794d; } /* Char */ code span.cn { color: #8f5902; } /* Constant */ code span.co { color: #5e5e5e; } /* Comment */ @@ -85,7 +85,7 @@ code span.fu { color: #4758ab; } /* Function */ code span.im { color: #00769e; } /* Import */ code span.in { color: #5e5e5e; } /* Information */ - code span.kw { color: #003b4f; font-weight: bold; } /* Keyword */ + code span.kw { color: #003b4f; } /* Keyword */ 
code span.op { color: #5e5e5e; } /* Operator */ code span.ot { color: #003b4f; } /* Other */ code span.pp { color: #ad0000; } /* Preprocessor */ @@ -222,8 +222,7 @@ } .callout.callout-titled .callout-body > .callout-content > :last-child { - padding-bottom: 0.5rem; - margin-bottom: 0; + margin-bottom: 0.5rem; } .callout.callout-titled .callout-icon::before { @@ -407,44 +406,11 @@

    Iteration in R

    -
    -

    Learning goals

    1. Replace repetitive code with a for loop
    2. -
    3. Compare and contrast for loops and *apply() functions
    4. Use vectorization to replace unnecessary loops
    @@ -455,16 +421,16 @@

    What is iteration?

  • In R, this means running the same code multiple times in a row.
  • -
    data("penguins", package = "palmerpenguins")
    -for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    data("penguins", package = "palmerpenguins")
    +for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}
    The mean bill depth on Biscoe Island was 15.87 mm.
     The mean bill depth on Dream Island was 18.34 mm.
    @@ -475,37 +441,37 @@ 

    What is iteration?

    Parts of a loop

    -
    for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}

    The header declares how many times we will repeat the same code. The header contains a control variable that changes in each repetition and a sequence of values for the control variable to take.

    Parts of a loop

    -
    for (this_island in levels(penguins$island)) {
    -    island_mean <-
    -        penguins$bill_depth_mm[penguins$island == this_island] |>
    -        mean(na.rm = TRUE) |>
    -        round(digits = 2)
    -    
    -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    -                            "mm.\n"))
    -}
    +
    for (this_island in levels(penguins$island)) {
    +    island_mean <-
    +        penguins$bill_depth_mm[penguins$island == this_island] |>
    +        mean(na.rm = TRUE) |>
    +        round(digits = 2)
    +    
    +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
    +                            "mm.\n"))
    +}

    The body of the loop contains code that will be repeated a number of times based on the header instructions. In R, the body has to be surrounded by curly braces.
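    Printing inside the body is fine for a demonstration, but often we want to keep the results. A minimal sketch of that pattern is below; it uses the built-in `iris` data instead of `penguins` so that it runs without extra packages, and the names `species` and `mean_petal` are made up for this example.

```r
# Sketch with the built-in iris data (an assumption; the slides use penguins).
species <- levels(iris$Species)

# Pre-allocate one slot per iteration and name the slots after the groups.
mean_petal <- numeric(length(species))
names(mean_petal) <- species

for (this_species in species) {
    # Body: everything between the braces runs once per value of this_species.
    mean_petal[this_species] <-
        mean(iris$Petal.Length[iris$Species == this_species])
}

mean_petal
```

    Creating the result vector before the loop, rather than growing it inside the body, keeps the code easy to read and avoids repeated copying.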

    Header parts

    -
    for (this_island in levels(penguins$island)) {...}
    +
    for (this_island in levels(penguins$island)) {...}
    • for: keyword that declares we are doing a for loop.
    • @@ -519,12 +485,12 @@

      Header parts

      Header parts

      -
      for (this_island in levels(penguins$island)) {...}
      +
      for (this_island in levels(penguins$island)) {...}
      • Since levels(penguins$island) evaluates to c("Biscoe", "Dream", "Torgersen"), our loop will repeat 3 times.
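      The number of repetitions is simply the length of the sequence in the header. The sketch below checks this with a small made-up factor (`island_like`, a stand-in for `penguins$island`) and a counter variable `n_runs` that is hypothetical and only there to count iterations.

```r
# Made-up factor standing in for penguins$island (illustration only).
island_like <- factor(c("Biscoe", "Dream", "Torgersen", "Dream", "Biscoe"))

# The header sequence: one value per distinct level, not one per observation.
loop_values <- levels(island_like)
expected_runs <- length(loop_values)

# Count how many times the body actually runs.
n_runs <- 0
for (this_level in loop_values) {
    n_runs <- n_runs + 1
}
```

      Here the factor has five observations but only three levels, so the body runs three times.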
      - +
      @@ -553,13 +519,13 @@

      Header parts

      Loop iteration 1

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Biscoe"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Biscoe", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Biscoe"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Biscoe", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Biscoe Island was 15.87 mm.
      @@ -568,13 +534,13 @@

      Loop iteration 1

      Loop iteration 2

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Dream"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Dream", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Dream"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Dream", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Dream Island was 18.34 mm.
      @@ -583,13 +549,13 @@

      Loop iteration 2

      Loop iteration 3

      -
      island_mean <-
      -    penguins$bill_depth_mm[penguins$island == "Torgersen"] |>
      -    mean(na.rm = TRUE) |>
      -    round(digits = 2)
      -
      -cat(paste("The mean bill depth on", "Torgersen", "Island was", island_mean,
      -                    "mm.\n"))
      +
      island_mean <-
      +    penguins$bill_depth_mm[penguins$island == "Torgersen"] |>
      +    mean(na.rm = TRUE) |>
      +    round(digits = 2)
      +
      +cat(paste("The mean bill depth on", "Torgersen", "Island was", island_mean,
      +                    "mm.\n"))
      The mean bill depth on Torgersen Island was 18.43 mm.
      @@ -598,15 +564,15 @@

      Loop iteration 3

      The loop structure automates this process for us so we don’t have to copy and paste our code!

      -
      for (this_island in levels(penguins$island)) {
      -    island_mean <-
      -        penguins$bill_depth_mm[penguins$island == this_island] |>
      -        mean(na.rm = TRUE) |>
      -        round(digits = 2)
      -    
      -    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
      -                            "mm.\n"))
      -}
      +
      for (this_island in levels(penguins$island)) {
      +    island_mean <-
      +        penguins$bill_depth_mm[penguins$island == this_island] |>
      +        mean(na.rm = TRUE) |>
      +        round(digits = 2)
      +    
      +    cat(paste("The mean bill depth on", this_island, "Island was", island_mean,
      +                            "mm.\n"))
      +}
      The mean bill depth on Biscoe Island was 15.87 mm.
       The mean bill depth on Dream Island was 18.34 mm.
      @@ -636,9 +602,9 @@ 

      You try it!

      Write a loop that goes from 1 to 10, squares each of the numbers, and prints the squared number.

      -
      for (i in 1:10) {
      -    cat(i ^ 2, "\n")
      -}
      +
      for (i in 1:10) {
      +    cat(i ^ 2, "\n")
      +}
      1 
       4 
      @@ -670,8 +636,8 @@ 

      Wait, did we need to do that?

    • Almost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.
    • -
      # No loop needed!
      -(1:10)^2
      +
      # No loop needed!
      +(1:10)^2
       [1]   1   4   9  16  25  36  49  64  81 100
      @@ -685,15 +651,15 @@

      Wait, did we need to do that?

    • Almost all basic operations in R are vectorized: they work on a vector of arguments all at the same time.
    • -
      # No loop needed!
      -(1:10)^2
      +
      # No loop needed!
      +(1:10)^2
       [1]   1   4   9  16  25  36  49  64  81 100
      -
      # Get the first 10 odd numbers, a common CS 101 loop problem on exams
      -(1:20)[which((1:20 %% 2) == 1)]
      +
      # Get the first 10 odd numbers, a common CS 101 loop problem on exams
      +(1:20)[which((1:20 %% 2) == 1)]
       [1]  1  3  5  7  9 11 13 15 17 19
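
As a rough illustration of the vectorization point above, the sketch below computes the same squares with a loop and with a single vectorized expression, then checks that the two agree. The names `squares_loop` and `squares_vec` are ours and do not appear in the slides.

```r
# Loop version: create the storage vector first, then fill it one element at a time
squares_loop <- numeric(10)
for (i in 1:10) {
    squares_loop[i] <- i ^ 2
}

# Vectorized version: `^` is applied to every element of the vector at once
squares_vec <- (1:10) ^ 2

# Check that both approaches produce the same values
identical(squares_loop, squares_vec)
```

The loop is not wrong here, but the vectorized form is shorter and leaves less room for indexing mistakes.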
      @@ -710,9 +676,9 @@

      Loop walkthrough

      -
      meas <- readRDS(here::here("data", "measles_final.Rds")) |>
      -    subset(vaccine_antigen == "MCV1")
      -str(meas)
      +
      meas <- readRDS(here::here("data", "measles_final.Rds")) |>
      +    subset(vaccine_antigen == "MCV1")
      +str(meas)
      'data.frame':   7972 obs. of  7 variables:
        $ iso3c           : chr  "AFG" "AFG" "AFG" "AFG" ...
      @@ -733,7 +699,7 @@ 

      Loop walkthrough

      -
      res <- vector(mode = "list", length = length(unique(meas$country)))
      +
      res <- vector(mode = "list", length = length(unique(meas$country)))
      • This is called preallocation and it can make your loops much faster.
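
A minimal sketch of what preallocation means in practice, using made-up values rather than the measles data. The names `n`, `res_grow`, and `res_pre` are hypothetical. Both loops build the same list; the second creates all of its slots up front instead of extending the list on every pass.

```r
n <- 5

# Growing: start empty and append, which can make R reallocate as the list grows
res_grow <- list()
for (i in 1:n) {
    res_grow[[length(res_grow) + 1]] <- i * 10
}

# Preallocating: create all n empty slots first, then fill each one by index
res_pre <- vector(mode = "list", length = n)
for (i in 1:n) {
    res_pre[[i]] <- i * 10
}

# The two containers hold the same results
identical(res_grow, res_pre)
```

For a handful of iterations the difference is negligible; it tends to matter more as the number of iterations grows.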
      • @@ -748,8 +714,8 @@

        Loop walkthrough

      -
      countries <- unique(meas$country)
      -for (i in 1:length(countries)) {...}
      +
      countries <- unique(meas$country)
      +for (i in 1:length(countries)) {...}
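
One caveat with the `1:length(countries)` header, sketched here under the assumption that the vector could ever be empty (it is not in the measles data): `1:0` counts down, so the loop body would run twice with invalid indices. `seq_along()` is a common alternative. The vector `countries_empty` and the counter `n_runs` are hypothetical names used only for this illustration.

```r
# Hypothetical empty input, e.g. after a filter that matched nothing
countries_empty <- character(0)

# 1:length(x) becomes 1:0 when x is empty, which is the sequence 1, 0
down_seq <- 1:length(countries_empty)
down_seq

# seq_along(x) returns an empty sequence, so a loop over it never runs
safe_seq <- seq_along(countries_empty)
safe_seq

# Count how many times the loop body actually executes
n_runs <- 0
for (i in seq_along(countries_empty)) {
    n_runs <- n_runs + 1
}
n_runs
```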
      @@ -765,10 +731,10 @@

      Loop walkthrough

      -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +}
      @@ -778,16 +744,16 @@

      Loop walkthrough

      -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_med <- median(country_cases, na.rm = TRUE)
      -    country_iqr <- IQR(country_cases, na.rm = TRUE)
      -    country_range <- range(country_cases, na.rm = TRUE)
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_med <- median(country_cases, na.rm = TRUE)
      +    country_iqr <- IQR(country_cases, na.rm = TRUE)
      +    country_range <- range(country_cases, na.rm = TRUE)
      +}
      @@ -795,27 +761,27 @@

      Loop walkthrough

    • Next we save the summary statistics into a data frame.
    • -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_quart <- quantile(
      -        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      -    )
      -    country_range <- range(country_cases, na.rm = TRUE)
      -    
      -    # Save the summary statistics into a data frame
      -    country_summary <- data.frame(
      -        country = countries[[i]],
      -        min = country_range[[1]],
      -        Q1 = country_quart[[1]],
      -        median = country_quart[[2]],
      -        Q3 = country_quart[[3]],
      -        max = country_range[[2]]
      -    )
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_quart <- quantile(
      +        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      +    )
      +    country_range <- range(country_cases, na.rm = TRUE)
      +    
      +    # Save the summary statistics into a data frame
      +    country_summary <- data.frame(
      +        country = countries[[i]],
      +        min = country_range[[1]],
      +        Q1 = country_quart[[1]],
      +        median = country_quart[[2]],
      +        Q3 = country_quart[[3]],
      +        max = country_range[[2]]
      +    )
      +}
      @@ -823,30 +789,30 @@

      Loop walkthrough

    • And finally, we save the data frame as the next element in our storage list.
    • -
      for (i in 1:length(countries)) {
      -    # Get the data for the current country only
      -    country_data <- subset(meas, country == countries[i])
      -    
      -    # Get the summary statistics for this country
      -    country_cases <- country_data$Cases
      -    country_quart <- quantile(
      -        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      -    )
      -    country_range <- range(country_cases, na.rm = TRUE)
      -    
      -    # Save the summary statistics into a data frame
      -    country_summary <- data.frame(
      -        country = countries[[i]],
      -        min = country_range[[1]],
      -        Q1 = country_quart[[1]],
      -        median = country_quart[[2]],
      -        Q3 = country_quart[[3]],
      -        max = country_range[[2]]
      -    )
      -    
      -    # Save the results to our container
      -    res[[i]] <- country_summary
      -}
      +
      for (i in 1:length(countries)) {
      +    # Get the data for the current country only
      +    country_data <- subset(meas, country == countries[i])
      +    
      +    # Get the summary statistics for this country
      +    country_cases <- country_data$Cases
      +    country_quart <- quantile(
      +        country_cases, na.rm = TRUE, probs = c(0.25, 0.5, 0.75)
      +    )
      +    country_range <- range(country_cases, na.rm = TRUE)
      +    
      +    # Save the summary statistics into a data frame
      +    country_summary <- data.frame(
      +        country = countries[[i]],
      +        min = country_range[[1]],
      +        Q1 = country_quart[[1]],
      +        median = country_quart[[2]],
      +        Q3 = country_quart[[3]],
      +        max = country_range[[2]]
      +    )
      +    
      +    # Save the results to our container
      +    res[[i]] <- country_summary
      +}
      Warning in min(x): no non-missing arguments to min; returning Inf
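
The warning above is the kind `range()` produces when no values remain after the missing ones are dropped. We assume, without having checked the data, that at least one country has only missing `Cases`. The sketch below reproduces the behavior on a made-up vector (`all_missing` is a hypothetical name) and shows one possible guard.

```r
# Hypothetical country whose case counts are all missing
all_missing <- c(NA_real_, NA_real_, NA_real_)

# With na.rm = TRUE nothing is left, so range() warns and returns c(Inf, -Inf)
rng_raw <- suppressWarnings(range(all_missing, na.rm = TRUE))
rng_raw

# One possible guard: return NA for both ends when every value is missing
if (all(is.na(all_missing))) {
    rng <- c(NA_real_, NA_real_)
} else {
    rng <- range(all_missing, na.rm = TRUE)
}
rng
```

Whether to keep the `Inf` values, replace them with `NA`, or skip such countries entirely is a judgment call that depends on how the summary table will be used.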
      @@ -872,7 +838,7 @@

      Loop walkthrough

    • Let’s take a look at the results.
    • -
      head(res)
      +
      head(res)
      [[1]]
             country min   Q1 median   Q3   max
      @@ -908,10 +874,10 @@ 

      Loop walkthrough

    • We can use a vectorization trick: the function do.call() seems like ancient computer science magic. And it is. But it will actually help us a lot.
    • -
      res_df <- do.call(rbind, res)
      -head(res_df)
      +
      res_df <- do.call(rbind, res)
      +head(res_df)
      -
      Iteration
      +
      @@ -981,7 +947,7 @@

      Loop walkthrough

      -
      ?rbind
      +
      ?rbind
      Combine R Objects by Rows or Columns
       
      @@ -1115,8 +1081,8 @@ 

      Loop walkthrough

      Factors have their levels expanded as necessary (in the order of the levels of the level sets of the factors encountered) and the result is an ordered factor if and only if all the components were
      - ordered factors. Old-style categories (integer vectors with
      - levels) are promoted to factors.
      + ordered factors. (The last point differs from S-PLUS.) Old-style
      + categories (integer vectors with levels) are promoted to factors.
      Note that for result column 'j', 'factor(., exclude = X(j))' is applied, where
      @@ -1200,7 +1166,7 @@


      Loop walkthrough

      -
      ?do.call
      +
      ?do.call
      Execute a Function Call
       
      @@ -1297,13 +1263,13 @@ 

      Loop walkthrough

    • OK, so basically what happened is that
    • -
      do.call(rbind, list)
      +
      do.call(rbind, list)
      • Gets transformed into
      -
      rbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])
      +
      rbind(list[[1]], list[[2]], list[[3]], ..., list[[length(list)]])
      • That’s vectorization magic!
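
To make the transformation above concrete, here is a small sketch with three made-up one-row data frames standing in for the per-country summaries. The object names `pieces`, `combined`, and `by_hand` are hypothetical. It compares `do.call(rbind, ...)` against writing out each argument by hand.

```r
# Three small hypothetical data frames standing in for per-country summaries
pieces <- list(
    data.frame(country = "A", median = 1),
    data.frame(country = "B", median = 2),
    data.frame(country = "C", median = 3)
)

# do.call() passes each list element to rbind() as a separate argument
combined <- do.call(rbind, pieces)

# The same call with every argument written out by hand
by_hand <- rbind(pieces[[1]], pieces[[2]], pieces[[3]])

# Both give one data frame with one row per list element
identical(combined, by_hand)
nrow(combined)
```

The advantage of `do.call()` is that it works for a list of any length, so the code does not change when the number of countries does.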
      • @@ -1323,25 +1289,25 @@

        You try it! (if we have time)

        Main problem solution

        -
        meas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000
        -countries <- unique(meas$country)
        -
        -plot(
        -    NULL, NULL,
        -    xlim = c(1980, 2022),
        -    ylim = c(0, 50),
        -    xlab = "Year",
        -    ylab = "Incidence per 1000 people"
        -)
        -
        -for (i in 1:length(countries)) {
        -    country_data <- subset(meas, country == countries[[i]])
        -    lines(
        -        x = country_data$time,
        -        y = country_data$cases_per_thousand,
        -        col = adjustcolor("black", alpha.f = 0.25)
        -    )
        -}
        +
        meas$cases_per_thousand <- meas$Cases / as.numeric(meas$total_pop) * 1000
        +countries <- unique(meas$country)
        +
        +plot(
        +    NULL, NULL,
        +    xlim = c(1980, 2022),
        +    ylim = c(0, 50),
        +    xlab = "Year",
        +    ylab = "Incidence per 1000 people"
        +)
        +
        +for (i in 1:length(countries)) {
        +    country_data <- subset(meas, country == countries[[i]])
        +    lines(
        +        x = country_data$time,
        +        y = country_data$cases_per_thousand,
        +        col = adjustcolor("black", alpha.f = 0.25)
        +    )
        +}
        @@ -1351,38 +1317,38 @@

        Main problem solution

        Bonus problem solution

        -
        # First calculate the cumulative cases, treating NA as zeroes
        -cumulative_cases <- ave(
        -    x = ifelse(is.na(meas$Cases), 0, meas$Cases),
        -    meas$country,
        -    FUN = cumsum
        -)
        -
        -# Now put the NAs back where they should be
        -meas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)
        -
        -plot(
        -    NULL, NULL,
        -    xlim = c(1980, 2022),
        -    ylim = c(1, 6.2e6),
        -    xlab = "Year",
        -    ylab = "Cumulative cases per 1000 people"
        -)
        -
        -for (i in 1:length(countries)) {
        -    country_data <- subset(meas, country == countries[[i]])
        -    lines(
        -        x = country_data$time,
        -        y = country_data$cumulative_cases,
        -        col = adjustcolor("black", alpha.f = 0.25)
        -    )
        -}
        -
        -text(
        -    x = 2020,
        -    y = 6e6,
        -    labels = "China →"
        -)
        +
        # First calculate the cumulative cases, treating NA as zeroes
        +cumulative_cases <- ave(
        +    x = ifelse(is.na(meas$Cases), 0, meas$Cases),
        +    meas$country,
        +    FUN = cumsum
        +)
        +
        +# Now put the NAs back where they should be
        +meas$cumulative_cases <- cumulative_cases + (meas$Cases * 0)
        +
        +plot(
        +    NULL, NULL,
        +    xlim = c(1980, 2022),
        +    ylim = c(1, 6.2e6),
        +    xlab = "Year",
        +    ylab = "Cumulative cases per 1000 people"
        +)
        +
        +for (i in 1:length(countries)) {
        +    country_data <- subset(meas, country == countries[[i]])
        +    lines(
        +        x = country_data$time,
        +        y = country_data$cumulative_cases,
        +        col = adjustcolor("black", alpha.f = 0.25)
        +    )
        +}
        +
        +text(
        +    x = 2020,
        +    y = 6e6,
        +    labels = "China →"
        +)
        @@ -1396,10 +1362,8 @@

        More practice on your own

      • Assess the impact of age_months as a confounder in the Diphtheria serology data. First, write code to transform age_months into age ranges for each year. Then, using a loop, calculate the crude odds ratio for the effect of vaccination on infection for each of the age ranges. How does the odds ratio change as age increases? Can you formalize this analysis by fitting a logistic regression model with age_months and vaccination as predictors?
      -
      @@ -1428,6 +1392,7 @@

      More practice on your own

      Reveal.initialize({
      'controlsAuto': true,
      'previewLinksAuto': false,
      +'smaller': false,
      'pdfSeparateFragments': false,
      'autoAnimateEasing': "ease",
      'autoAnimateDuration': 1,
      @@ -1613,81 +1578,43 @@

      More practice on your own

      }); - - -
      country