-
Notifications
You must be signed in to change notification settings - Fork 85
/
Copy pathmanipulate.Rmd
439 lines (280 loc) · 19.2 KB
/
manipulate.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
---
pagetitle: Manipulate
output:
html_document:
pandoc_args: [
"--number-offset=1"]
---
# Manipulate
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if (!require(librarian)){
install.packages("librarian")
library(librarian)
}
shelf(
htmltools, mitchelloharawild/icons)
# icons::download_fontawesome()
# devtools::load_all("~/github/bbest/icons")
# icons::download_octicons()
```
## Learning Objectives {.unnumbered .objectives}
1. **Read** online table.
1. **Download** a tabular file, in commonly used comma-separated value (**\*.csv**) format, from the URL of a dataset on an **ERDDAP** server.
2. **Read** a CSV file into a `data.frame()` using base R `utils::read.csv()`.
1. **Show** the table.
1. **Show** the `data.frame` in the R console.
2. **Render** the table into HTML with `DT::datatable()`.
1. **Wrangle** with `tibble`, `dplyr`, `tidyr` and `purrr`.
1. **Convert** the `data.frame` to a tidyverse `tibble()` and show it again, now with a more informative summary.
1. **Pipe** commands with `%>%` from `dplyr` for feeding the output of the function on the left as the first input argument to the function on the right.
1. **Select** columns of information with `select()`, optionally using select operators like `starts_with()`.
2. **Filter** rows of information based on values in columns with `filter()`.
3. **Mutate** a new column with `mutate()`, possibly derived from other column(s).
4. **Summarize** all rows with `summarize()`. Precede this with `group_by()` to summarize based on rows with common values of a given column.
5. **Nest** tables inside of tables with a `tibble` and tidyr's `nest()`. Build a linear model object (`lm()`) into another **list column** per grouping using purrr's `map()`. Extract values from this model object as double with `map_dbl()`.
## Tips for using RStudio {.unnumbered}
1. Always launch RStudio from an `*.Rproj` file so it sets the working directory.
2. Use the [Visual Markdown Editor](https://rstudio.github.io/visual-markdown-editing/#/) (version 1.4+; see menu RStudio \> About RStudio to check your version) to easily create Markdown in your Rmarkdown document.
## Read online table
In this first set of exercises we'll read marine ecological indicators from [California Current - Home \| Integrated Ecosystem Assessment](https://www.integratedecosystemassessment.noaa.gov/regions/california-current) (IEA). These datasets are also being served by an **ERDDAP** server (which is "a data server that gives you a simple, consistent way to download subsets of scientific datasets in common file formats and make graphs and maps.").
### Create `manipulate.Rmd` in r3-exercises
Open your Rstudio project for the r3-exercises by double-clicking on the `r3-exercises.Rproj` file (from Windows Explorer in Windows; Finder in Mac). This will set the working directory in the R Console to the same folder.
In RStudio menu, **File** -\> **New File** -\> **R Markdown...** I suggest giving it a title of "Manipulate" and saving it as `manipulate.Rmd`. Let's delete everything ***after*** the first R chunk, which is surrounded by *fenced backticks*:
```` ```{r setup, include=FALSE} ````
`knitr::opts_chunk$set(echo = TRUE)`
```` ``` ````
And create a new heading and subheading:
```md
## Read online table
### Download table (`*.csv`)
```
### Get URL to CSV
Please visit the ERDDAP server [https://oceanview.pfeg.noaa.gov/erddap](oceanview.pfeg.noaa.gov/erddap) and please do a *Full Text Search for Datasets* using "cciea" in the text box before clicking **Search**. These are all the California Current IEA datasets.
From the listing of datasets, please click on [**data**](https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.html) for the "CCIEA Anthropogenic Drivers" dataset.
Note the filtering options for `time` and other variables like `consumption_fish (Millions of metric tons)` and `cps_landings_coastwide (1000s metric tons)`. Set the time filter from being only the most recent time to the entire range of available time from `1945-01-01` to `2020-01-01`.
Scroll to the bottom and **Submit** with the default `.htmlTable` view. You get an web table of the data. Notice the many missing values in earlier years.
Go back in your browser to change the the **File type** to `.csv`. Now instead of clicking Submit, click on **Just generate the URL**. Although the generated URL lists all variables to include, the default is to do that, so we can strip off everything after the `.csv`, starting with the query parameters `?` .
### Download CSV
Let's use this URL to download a new file in a new R Chunk (RStudio menu **Code** -\> **Insert Chunk**). Paste the following inside your R Chunk:
```{r}
# set variables
csv_url <- "https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.csv"
# if ERDDAP server down (Error in download.file) with URL above, use this:
# csv_url <- "https://raw.githubusercontent.com/noaa-iea/r3-train/master/data/cciea_AC.csv"
dir_data <- "data"
# derived variables
csv <- file.path(dir_data, basename(csv_url))
# create directory
dir.create(dir_data, showWarnings = F)
# download file
if (!file.exists(csv))
download.file(csv_url, csv)
```
Then click the green Play button to execute the commands in the R Console. After the last command you should see something like:
trying URL 'https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.csv'
downloaded 53 KB
### Read table `read.csv()`
Now open the file by going into the **Files** RStudio pane, **More** -\> **Show Folder in New Window**. Then double click on `data/cciea_AC.csv` to open in your Spreadsheet program (like Microsoft Excel or Apple Pages or LibreOffice Calc). Mine looks like this:
`r img(src='figs/manipulate/cciea_AC_xls.png', width=600)`
Let's try to read in the csv by adding a new subsection in your Rmarkdown file:
```md
### Read table `read.csv()`
```
Then create a new R Chunk and add the following lines to the R chunk:
```{r, results='hide'}
# attempt to read csv
d <- read.csv(csv)
# show the data frame
d
```
Note how the presence of the 2nd line with units makes the values character `<chr>` data type:
`r img(src='figs/manipulate/cciea_AC_read.csv.png', width=600)`
But we want numeric values. So we could manually delete that second line of units or look at the **help** documentation for this function (`?read.csv` in Console pane; or `F1` key with cursor on the function in the code editor pane). Notice the `skip` argument, which we can implement like so:
```{r, results='hide'}
# read csv by skipping first two lines, so no header
d <- read.csv(csv, skip = 2, header = FALSE)
d
# update data frame to original column names
names(d) <- names(read.csv(csv))
d
# update for future reuse (NEW!)
write.csv(d, csv, row.names = F)
```
You can run above one line at a time with **Code** -\> **Run Selected Lines** (note the keyboard shortcut, which depends on your operating system). This will allow you to see how the data.frame `d` evolves. When running the w
### Show table `DT::datatable()`
Next, let's render your Rmarkdown to html by clicking on **Knit** button (with the blue ball of yarn). Notice how all those times of outputting `d` show the full dataset with each line prefixed by a comment. There are many prettier ways to show tables in Rmarkdown. We'll use the `DT` R package to start.
You'll need to install the R package if you don't already have it. Check by searching for it in the **Packages** pane and **Install** if missing.
Let's start by commenting the previous instances of d (lines 36, 40, 44 for me) by prefixing with a `#`. Then try adding this subsection:
```md
### Show table `DT::datatable()`
```
And in a new R chunk:
```{r}
# show table
DT::datatable(d)
```
Then **Knit** again to see the updated table display, which defaults to paging through results, showing alternating gray backgrounds and providing search and sort abilities. Check out the function help (`?DT::datatable` in R Console; or `F1` with cursor over function in editor) and R package website [DT](https://rstudio.github.io/DT) for more.
Also note how we can call a function explicitly with the R package like `DT::datatable(d)` or load the library and use the function without the explicit R package referencing:
```{r, results='hide'}
library(DT)
datatable(d)
```
## Wrangle data
Now that we've learned how to read a basic data table, let's "wrangle" with `dplyr`, `tidyr` and `purrr`:
![Data "wrangling" loosely encompasses the **Import**, **Tidy** and **Transform** steps. Source: [9 Introduction \| R for Data Science](https://r4ds.had.co.nz/wrangle-intro.html)](https://d33wubrfki0l68.cloudfront.net/e3f9e555d0035731c04642ceb58a03fb84b98a7d/4f070/diagrams/data-science-wrangle.png)
Let's create a new section and subsection in your Rmd:
```md
## Wrangle data
### Manipulate with `dplyr`
```
### Install R packages
Be sure you have the R packages `dplyr`, `tidyr` and `purrr` installed via the **Packages** pane. **Search** (magnifying glass) for them and if missing, use the **Install** button.
### Manipulate with `dplyr`
There's a few things we'll want to do next to make this data.frame more usable:
1. **Convert** the `data.frame` to a `tibble`, which provides a more useful summary in the default output of the R Console to include only the first 10 rows of data and the data types for each column.
2. **Transform** the `time` column to an actual `as.Date()` column using dplyr's `mutate()`.
3. **Column** select with dplyr's `select()` to those starting with "total_fisheries_revenue" using `starts_with()`.
4. **Row** select with dplyr's `filter()` to those starting within the last 40 years, i.e. `time >= as.Date("1981-01-01")`.
Create a new R Chunk (RStudio menu **Code** -\> **Insert Chunk**) and insert within the fenced code block the following:
```{r}
library(DT)
library(dplyr)
d <- d %>%
# tibble
tibble() %>%
# mutate time
mutate(
time = as.Date(substr(time, 1, 10))) %>%
# select columns
select(
time,
starts_with("total_fisheries_revenue")) %>%
# filter rows
filter(
time >= as.Date("1981-01-01"))
datatable(d)
```
Next, **Run** the the chunk with the green Play button. **Knit** your document to see the update output table.
### Tidy with `tidyr`
Now, we're ready to "tidy" up our data per the definition in the [Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf):
`r img(src='figs/manipulate/tidy_rule.png', width=300)`
We'll see how this transformation of the data structure from **wide** to **long** is especially useful in the next section when we want to summarize fisheries revenue by region.
Add a new subsection in your Rmd:
```md
### Tidy with `tidyr`
```
And a new R Chunk:
```{r}
library(tidyr)
d <- d %>%
pivot_longer(-time)
datatable(d)
```
### Summarize with `dplyr`
Now that we have the data in **long** format, we can:
1. **Mutate** a new `region` from the original full column `name` by using stringr's `str_replace()` function.
2. **Select** for just the columns of interest (we no longer need `name`).
3. **Summarize** by `region`, which we want to first `group_by()`, then `summarize()` the `mean()` of `value` to a new column `avg_revenue`).
4. **Format** the `avg_revenue` as a currency with DT's `formatCurrency()`).
Add a new subsection in your Rmd:
```md
### Summarize with `dplyr`
```
And a new R Chunk:
```{r}
library(stringr)
d <- d %>%
mutate(
region = str_replace(name, "total_fisheries_revenue_", "")) %>%
select(time, region, value)
datatable(d)
d_sum <- d %>%
group_by(region) %>%
summarize(
avg_revenue = mean(value))
datatable(d_sum) %>%
formatCurrency("avg_revenue")
```
### Apply functions with `purrr` on a `nest`'ed `tibble`
One of the major innovations of a `tibble` is the ability to store nearly any object in the cell of a table as a **list column**. This could be an entire table, a fitted model, plot, etc. Let's try out these features driven by the question: **What's the trend over time for fishing revenue by region?**
1. The creation of tables within cells is most commonly done with tidyr's `nest()` function based on a grouping of values in a column, i.e. dplyr's `group_by()`.
2. The `purrr` R package provides functions to operate on list objects, in this case the nested data. and application of functions on these data with purrr's `map` function. We can feed the `data` object into an anonymous function where we fit the linear model `lm()` and return a list object. To then extract the coefficient from the model `coef(summary())`, we want to return a type of **double** (not another list object), so we use the `map_dbl()` function.
Add a new subsection in your Rmd:
```md
### Apply functions with `purrr` on a `nest`'ed `tibble`
```
And a new R Chunk:
```{r}
library(purrr)
n <- d %>%
group_by(region) %>%
nest(
data = c(time, value))
n
n <- n %>%
mutate(
lm = map(data, function(d){
lm(value ~ time, d) } ),
trend = map_dbl(lm, function(m){
coef(summary(m))["time","Estimate"] }))
n
n %>%
select(region, trend) %>%
datatable()
```
## Rmarkdown output options
If you'd like to render your Rmarkdown into HTML with a table of contents (`toc`), showing up to 3rd level headings (i.e. `### Heading`; use `toc_depth: 3`) and have a floating menu (`toc_float`), along with the ability to Show (or Hide) your Code Chunks (`code_folding`), then replace this default front matter yaml:
```yaml
output: html_document
```
with:
```yaml
output:
html_document:
toc: true
toc_depth: 3
toc_float: true
code_folding: "show"
```
For details, see [3.1 HTML document | R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/html-document.html).
## Further Reading {.unnumbered}
### Cheatsheets {.unnumbered}
[RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets) (most also available via RStudio menu **Help** \> **Cheat Sheets**):
- [RStudio IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf): RStudio interface
- [Data Import](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf): `readr`, `tidyr`
- [Data Transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf): `dplyr`
- [Apply functions with `purrr`](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf)
- [Dates and Times Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf): `lubridate`
- [Work with Strings Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf): `stringr`
- [Regular Expressions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/regex.pdf): `stringr` + regex
- [Tidy Evaluation with rlang](https://github.com/rstudio/cheatsheets/raw/master/tidyeval.pdf): advanced topics on how to use non-standard evaluation (i.e. unquoted strings referencing columns or other variables), especially for creating functions to handle tidy data
## Further Learning Objectives {.unnumbered}
For another day...
1. **Pipe** commands with `%>%` from `dplyr` for feeding the output of the function on the left as the first input argument to the function on the right.\
2. **Read** online table.\
1. **Download** a tabular file, in commonly used comma-separated value (**\*.csv**) format, from the URL of a dataset on an **ERDDAP** server.
2. **Read** a CSV file into a `data.frame()` using base R `utils::read.csv()` and compare with the tidyverse R function `readr::read_csv()`. Understand the difference in how characters are handled as **factors** versus **character**.
3. **APIs**, or application programming languages, provide a consistent way to query an online resource (with query parameters in a GET URL or longer forms with a POST from a form) and return information (typically as JSON), on reading csv directly from URL and using `rerddap` R package to access ERDDAP, which is an **API** that you could directly read with `httr` R package. For building your own API, see [`plumber`](https://www.rplumber.io) (and [REST APIs with Plumber Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/plumber.pdf)).\
3. **Show** the table.
1. **Show** the `data.frame` in the R console. Display a `summary()`. Convert the `data.frame` to a tidyverse `tibble()` and show it again, now with a more informative summary. `View()` the table in RStudio.
2. **Render** the table into HTML with `DT::datatable()`. Render the Rmarkdown document to `pdf` and `docx` to see the shortcomings of this htmlwidget in these formats. Use `knitr::kable()` for simple tables and the `huxtable` R package for these non-HTML formats.\
4. **Manipulate** with `dplyr`.
See [Data Transformation Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf).
1. **Select** columns of information with `select()`, optionally using select operators like `starts_with()`, `ends_with()` and `everything()`. Use `rename()` to only rename existing columns.
2. **Filter** rows of information based on values in columns with `filter()`. **Slice** data based on row numbers.
3. **Arrange** rows by values in one or more columns with `arrange()`. Arrange in ascending (default) or descending (`desc()`) order. Use **factors** to customize the sort order.
4. **Pull** values with `pull()` into a single `vector` (versus the default `data.frame` output).
5. **Mutate** a new column with `mutate()`, possibly derived from other column(s). Use `glue::glue()` to combine character columns, `stringr` R package to manipulate character strings, and `lubridate` R package for dates.\
\
See [Dates and Times Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf), [Work with Strings Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf), [Regular Expressions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/regex.pdf).
6. **Summarize** all rows with `summarize()`. Precede this with `group_by()` to summarize based on rows with common values of a given column.
7. **Join** two tables on a common column with `left_join()` (all from left; matching from right) or `inner_join()` (matching from left and right). Get unmatched values with `anti_join()`.
5. **Nest** tables **Unnest**. Get climatology by group as data, save linear model in cell, plot in cell. List column concept. Purrr `purrr::map()` function.
See [Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf), [Apply Functions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) .
6. Create a function to use an argument.\
See [Tidy Evaluation with rlang Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/tidyeval.pdf)
7. **Pivot** tables between wide and long formats. How would you summarize by group without a long format? A: it would be a painful for loop.
8. **Relational database** concepts with sqlite. And using `tbl(con, "table")`, `sql_show()` and `collect()`. Migrate from CSV to sqlite to PostgreSQL to PostGIS to BigQuery. Dplyr as the middle layer (and dbplyr). See db.rstudio.com for more, especially on previewing SQL results with `conn`. Compare dplyr commands to SQL.
See [Databases using R](https://db.rstudio.com).
9. **Compare** all of the above tidyverse operations with **base R** functions and note the readability (i.e. cognitive load) for understanding the analyses.