manipulate.Rmd

---
pagetitle: Manipulate
output:
  html_document:
    pandoc_args: [
      "--number-offset=1"]
---

# Manipulate

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

if (!require(librarian)){
  install.packages("librarian")
  library(librarian)
}
shelf(
  htmltools, mitchelloharawild/icons)

# icons::download_fontawesome()
# devtools::load_all("~/github/bbest/icons")
# icons::download_octicons()
```

## Learning Objectives {.unnumbered .objectives}

1.  **Read** online table.

    1.  **Download** a tabular file, in commonly used comma-separated value (**\*.csv**) format, from the URL of a dataset on an **ERDDAP** server.

    2.  **Read** a CSV file into a `data.frame()` using base R `utils::read.csv()`.

1.  **Show** the table.

    1.  **Show** the `data.frame` in the R console.

    2.  **Render** the table into HTML with `DT::datatable()`.

1.  **Wrangle** with `tibble`, `dplyr`, `tidyr` and `purrr`.

    1. **Convert** the `data.frame` to a tidyverse `tibble()` and show it again, now with a more informative summary.
    
    1.  **Pipe** commands with `%>%` from `dplyr` for feeding the output of the function on the left as the first input argument to the function on the right.

    1.  **Select** columns of information with `select()`, optionally using select operators like `starts_with()`.

    2.  **Filter** rows of information based on values in columns with `filter()`.

    3.  **Mutate** a new column with `mutate()`, possibly derived from other column(s).

    4.  **Summarize** all rows with `summarize()`. Precede this with `group_by()` to summarize based on rows with common values of a given column.

    5.  **Nest** tables inside of tables with a `tibble` and tidyr's `nest()`. Build a linear model object (`lm()`) into another **list column** per grouping using purrr's `map()`. Extract values from this model object as double with `map_dbl()`.

## Tips for using RStudio {.unnumbered}

1.  Always launch RStudio from an `*.Rproj` file so it sets the working directory.

2.  Use the [Visual Markdown Editor](https://rstudio.github.io/visual-markdown-editing/#/) (version 1.4+; see menu RStudio \> About RStudio to check your version) to easily create Markdown in your Rmarkdown document.

## Read online table

In this first set of exercises we'll read marine ecological indicators from [California Current - Home \| Integrated Ecosystem Assessment](https://www.integratedecosystemassessment.noaa.gov/regions/california-current) (IEA). These datasets are also being served by an **ERDDAP** server (which is "a data server that gives you a simple, consistent way to download subsets of scientific datasets in common file formats and make graphs and maps.").

### Create `manipulate.Rmd` in r3-exercises

Open your Rstudio project for the r3-exercises by double-clicking on the `r3-exercises.Rproj` file (from Windows Explorer in Windows; Finder in Mac). This will set the working directory in the R Console to the same folder.

In RStudio menu, **File** -\> **New File** -\> **R Markdown...** I suggest giving it a title of "Manipulate" and saving it as `manipulate.Rmd`. Let's delete everything ***after*** the first R chunk, which is surrounded by *fenced backticks*:

```` ```{r setup, include=FALSE} ````

`knitr::opts_chunk$set(echo = TRUE)`

```` ``` ````

And create a new heading and subheading:

```md
## Read online table

### Download table (`*.csv`)
```

### Get URL to CSV

Please visit the ERDDAP server [https://oceanview.pfeg.noaa.gov/erddap](oceanview.pfeg.noaa.gov/erddap) and please do a *Full Text Search for Datasets* using "cciea" in the text box before clicking **Search**. These are all the California Current IEA datasets.

From the listing of datasets, please click on [**data**](https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.html) for the "CCIEA Anthropogenic Drivers" dataset.

Note the filtering options for `time` and other variables like `consumption_fish (Millions of metric tons)` and `cps_landings_coastwide (1000s metric tons)`. Set the time filter from being only the most recent time to the entire range of available time from `1945-01-01` to `2020-01-01`.

Scroll to the bottom and **Submit** with the default `.htmlTable` view. You get an web table of the data. Notice the many missing values in earlier years.

Go back in your browser to change the the **File type** to `.csv`. Now instead of clicking Submit, click on **Just generate the URL**. Although the generated URL lists all variables to include, the default is to do that, so we can strip off everything after the `.csv`, starting with the query parameters `?` .

### Download CSV

Let's use this URL to download a new file in a new R Chunk (RStudio menu **Code** -\> **Insert Chunk**). Paste the following inside your R Chunk:

```{r}
# set variables
csv_url  <- "https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.csv"
# if ERDDAP server down (Error in download.file) with URL above, use this:
#    csv_url <- "https://raw.githubusercontent.com/noaa-iea/r3-train/master/data/cciea_AC.csv"
dir_data <- "data"

# derived variables
csv <- file.path(dir_data, basename(csv_url))

# create directory
dir.create(dir_data, showWarnings = F)

# download file
if (!file.exists(csv))
  download.file(csv_url, csv)
```

Then click the green Play button to execute the commands in the R Console. After the last command you should see something like:

    trying URL 'https://oceanview.pfeg.noaa.gov/erddap/tabledap/cciea_AC.csv'
    downloaded 53 KB

### Read table `read.csv()`

Now open the file by going into the **Files** RStudio pane, **More** -\> **Show Folder in New Window**. Then double click on `data/cciea_AC.csv` to open in your Spreadsheet program (like Microsoft Excel or Apple Pages or LibreOffice Calc). Mine looks like this:

`r img(src='figs/manipulate/cciea_AC_xls.png', width=600)`

Let's try to read in the csv by adding a new subsection in your Rmarkdown file:

```md
### Read table `read.csv()`
```

Then create a new R Chunk and add the following lines to the R chunk:

```{r, results='hide'}
# attempt to read csv
d <- read.csv(csv)

# show the data frame
d
```

Note how the presence of the 2nd line with units makes the values character `<chr>` data type:

`r img(src='figs/manipulate/cciea_AC_read.csv.png', width=600)`

But we want numeric values. So we could manually delete that second line of units or look at the **help** documentation for this function (`?read.csv` in Console pane; or `F1` key with cursor on the function in the code editor pane). Notice the `skip` argument, which we can implement like so:

```{r, results='hide'}
# read csv by skipping first two lines, so no header
d <- read.csv(csv, skip = 2, header = FALSE)
d

# update data frame to original column names
names(d) <- names(read.csv(csv))
d

# update for future reuse (NEW!)
write.csv(d, csv, row.names = F)
```

You can run above one line at a time with **Code** -\> **Run Selected Lines** (note the keyboard shortcut, which depends on your operating system). This will allow you to see how the data.frame `d` evolves. When running the w

### Show table `DT::datatable()`

Next, let's render your Rmarkdown to html by clicking on **Knit** button (with the blue ball of yarn). Notice how all those times of outputting `d` show the full dataset with each line prefixed by a comment. There are many prettier ways to show tables in Rmarkdown. We'll use the `DT` R package to start.

You'll need to install the R package if you don't already have it. Check by searching for it in the **Packages** pane and **Install** if missing.

Let's start by commenting the previous instances of d (lines 36, 40, 44 for me) by prefixing with a `#`. Then try adding this subsection:

```md
### Show table `DT::datatable()`
```

And in a new R chunk:

```{r}
# show table
DT::datatable(d)
```

Then **Knit** again to see the updated table display, which defaults to paging through results, showing alternating gray backgrounds and providing search and sort abilities. Check out the function help (`?DT::datatable` in R Console; or `F1` with cursor over function in editor) and R package website [DT](https://rstudio.github.io/DT) for more.

Also note how we can call a function explicitly with the R package like `DT::datatable(d)` or load the library and use the function without the explicit R package referencing:

```{r, results='hide'}
library(DT)
datatable(d)
```

## Wrangle data

Now that we've learned how to read a basic data table, let's "wrangle" with `dplyr`, `tidyr` and `purrr`:

![Data "wrangling" loosely encompasses the **Import**, **Tidy** and **Transform** steps. Source: [9 Introduction \| R for Data Science](https://r4ds.had.co.nz/wrangle-intro.html)](https://d33wubrfki0l68.cloudfront.net/e3f9e555d0035731c04642ceb58a03fb84b98a7d/4f070/diagrams/data-science-wrangle.png)

Let's create a new section and subsection in your Rmd:

```md
## Wrangle data

### Manipulate with `dplyr`
```

### Install R packages

Be sure you have the R packages `dplyr`, `tidyr` and `purrr` installed via the **Packages** pane. **Search** (magnifying glass) for them and if missing, use the **Install** button.

### Manipulate with `dplyr`

There's a few things we'll want to do next to make this data.frame more usable:

1.  **Convert** the `data.frame` to a `tibble`, which provides a more useful summary in the default output of the R Console to include only the first 10 rows of data and the data types for each column.

2.  **Transform** the `time` column to an actual `as.Date()` column using dplyr's `mutate()`.

3.  **Column** select with dplyr's `select()` to those starting with "total_fisheries_revenue" using `starts_with()`.

4.  **Row** select with dplyr's `filter()` to those starting within the last 40 years, i.e. `time >= as.Date("1981-01-01")`.

Create a new R Chunk (RStudio menu **Code** -\> **Insert Chunk**) and insert within the fenced code block the following:

```{r}
library(DT)
library(dplyr)

d <- d %>% 
  # tibble
  tibble() %>% 
  # mutate time
  mutate(
    time = as.Date(substr(time, 1, 10))) %>% 
  # select columns
  select(
    time, 
    starts_with("total_fisheries_revenue")) %>% 
  # filter rows
  filter(
    time >= as.Date("1981-01-01"))

datatable(d)
```

Next, **Run** the the chunk with the green Play button. **Knit** your document to see the update output table.

### Tidy with `tidyr`

Now, we're ready to "tidy" up our data per the definition in the [Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf):

`r img(src='figs/manipulate/tidy_rule.png', width=300)`

We'll see how this transformation of the data structure from **wide** to **long** is especially useful in the next section when we want to summarize fisheries revenue by region.

Add a new subsection in your Rmd:

```md
### Tidy with `tidyr`
```

And a new R Chunk:

```{r}
library(tidyr)

d <- d %>% 
  pivot_longer(-time)

datatable(d)
```

### Summarize with `dplyr`

Now that we have the data in **long** format, we can:

1.  **Mutate** a new `region` from the original full column `name` by using stringr's `str_replace()` function.

2.  **Select** for just the columns of interest (we no longer need `name`).

3.  **Summarize** by `region`, which we want to first `group_by()`, then `summarize()` the `mean()` of `value` to a new column `avg_revenue`).

4.  **Format** the `avg_revenue` as a currency with DT's `formatCurrency()`).

Add a new subsection in your Rmd:

```md
### Summarize with `dplyr`
```

And a new R Chunk:

```{r}
library(stringr)

d <- d %>% 
  mutate(
    region = str_replace(name, "total_fisheries_revenue_", "")) %>% 
  select(time, region, value)
datatable(d)

d_sum <- d %>% 
  group_by(region) %>% 
  summarize(
    avg_revenue = mean(value))
datatable(d_sum) %>% 
  formatCurrency("avg_revenue")
```

### Apply functions with `purrr` on a `nest`'ed `tibble`

One of the major innovations of a `tibble` is the ability to store nearly any object in the cell of a table as a **list column**. This could be an entire table, a fitted model, plot, etc. Let's try out these features driven by the question: **What's the trend over time for fishing revenue by region?**

1.  The creation of tables within cells is most commonly done with tidyr's `nest()` function based on a grouping of values in a column, i.e. dplyr's `group_by()`.

2.  The `purrr` R package provides functions to operate on list objects, in this case the nested data. and application of functions on these data with purrr's `map` function. We can feed the `data` object into an anonymous function where we fit the linear model `lm()` and return a list object. To then extract the coefficient from the model `coef(summary())`, we want to return a type of **double** (not another list object), so we use the `map_dbl()` function.

Add a new subsection in your Rmd:

```md
### Apply functions with `purrr` on a `nest`'ed `tibble`
```

And a new R Chunk:

```{r}
library(purrr)

n <- d %>% 
  group_by(region) %>% 
  nest(
    data = c(time, value))
n

n <- n %>% 
  mutate(
    lm    = map(data, function(d){
      lm(value ~ time, d) } ),
    trend = map_dbl(lm, function(m){
      coef(summary(m))["time","Estimate"] }))
n
n %>% 
  select(region, trend) %>% 
  datatable()
```

## Rmarkdown output options

If you'd like to render your Rmarkdown into HTML with a table of contents (`toc`), showing up to 3rd level headings (i.e. `### Heading`; use `toc_depth: 3`) and have a floating menu (`toc_float`), along with the ability to Show (or Hide) your Code Chunks (`code_folding`), then replace this default front matter yaml:

```yaml
output: html_document
```

with:

```yaml
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    code_folding: "show"
```

For details, see [3.1 HTML document | R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/html-document.html).

## Further Reading {.unnumbered}

### Cheatsheets {.unnumbered}

[RStudio Cheatsheets](https://www.rstudio.com/resources/cheatsheets) (most also available via RStudio menu **Help** \> **Cheat Sheets**):

-   [RStudio IDE](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf): RStudio interface
-   [Data Import](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf): `readr`, `tidyr`
-   [Data Transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf): `dplyr`
-   [Apply functions with `purrr`](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf)
-   [Dates and Times Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf): `lubridate`
-   [Work with Strings Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf): `stringr`
-   [Regular Expressions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/regex.pdf): `stringr` + regex
-   [Tidy Evaluation with rlang](https://github.com/rstudio/cheatsheets/raw/master/tidyeval.pdf): advanced topics on how to use non-standard evaluation (i.e. unquoted strings referencing columns or other variables), especially for creating functions to handle tidy data

## Further Learning Objectives {.unnumbered}

For another day...

1.  **Pipe** commands with `%>%` from `dplyr` for feeding the output of the function on the left as the first input argument to the function on the right.\

2.  **Read** online table.\

    1.  **Download** a tabular file, in commonly used comma-separated value (**\*.csv**) format, from the URL of a dataset on an **ERDDAP** server.

    2.  **Read** a CSV file into a `data.frame()` using base R `utils::read.csv()` and compare with the tidyverse R function `readr::read_csv()`. Understand the difference in how characters are handled as **factors** versus **character**.

    3.  **APIs**, or application programming languages, provide a consistent way to query an online resource (with query parameters in a GET URL or longer forms with a POST from a form) and return information (typically as JSON), on reading csv directly from URL and using `rerddap` R package to access ERDDAP, which is an **API** that you could directly read with `httr` R package. For building your own API, see [`plumber`](https://www.rplumber.io) (and [REST APIs with Plumber Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/plumber.pdf)).\

3.  **Show** the table.

    1.  **Show** the `data.frame` in the R console. Display a `summary()`. Convert the `data.frame` to a tidyverse `tibble()` and show it again, now with a more informative summary. `View()` the table in RStudio.

    2.  **Render** the table into HTML with `DT::datatable()`. Render the Rmarkdown document to `pdf` and `docx` to see the shortcomings of this htmlwidget in these formats. Use `knitr::kable()` for simple tables and the `huxtable` R package for these non-HTML formats.\

4.  **Manipulate** with `dplyr`.

    See [Data Transformation Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf).

    1.  **Select** columns of information with `select()`, optionally using select operators like `starts_with()`, `ends_with()` and `everything()`. Use `rename()` to only rename existing columns.

    2.  **Filter** rows of information based on values in columns with `filter()`. **Slice** data based on row numbers.

    3.  **Arrange** rows by values in one or more columns with `arrange()`. Arrange in ascending (default) or descending (`desc()`) order. Use **factors** to customize the sort order.

    4.  **Pull** values with `pull()` into a single `vector` (versus the default `data.frame` output).

    5.  **Mutate** a new column with `mutate()`, possibly derived from other column(s). Use `glue::glue()` to combine character columns, `stringr` R package to manipulate character strings, and `lubridate` R package for dates.\
        \
        See [Dates and Times Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/lubridate.pdf), [Work with Strings Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf), [Regular Expressions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/regex.pdf).

    6.  **Summarize** all rows with `summarize()`. Precede this with `group_by()` to summarize based on rows with common values of a given column.

    7.  **Join** two tables on a common column with `left_join()` (all from left; matching from right) or `inner_join()` (matching from left and right). Get unmatched values with `anti_join()`.

5.  **Nest** tables **Unnest**. Get climatology by group as data, save linear model in cell, plot in cell. List column concept. Purrr `purrr::map()` function.

    See [Data Import Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf), [Apply Functions Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/purrr.pdf) .

6.  Create a function to use an argument.\
    See [Tidy Evaluation with rlang Cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/tidyeval.pdf)

7.  **Pivot** tables between wide and long formats. How would you summarize by group without a long format? A: it would be a painful for loop.

8.  **Relational database** concepts with sqlite. And using `tbl(con, "table")`, `sql_show()` and `collect()`. Migrate from CSV to sqlite to PostgreSQL to PostGIS to BigQuery. Dplyr as the middle layer (and dbplyr). See db.rstudio.com for more, especially on previewing SQL results with `conn`. Compare dplyr commands to SQL.

    See [Databases using R](https://db.rstudio.com).

9.  **Compare** all of the above tidyverse operations with **base R** functions and note the readability (i.e. cognitive load) for understanding the analyses.