Merge pull request #503 from jhudsl/manipulating-update
Addressing Manipulation updates ahead of Tues (tomorrow)
avahoffman authored Jan 16, 2024
2 parents 533d64a + 218d7a0 commit 2c61449
Showing 3 changed files with 113 additions and 108 deletions.
156 changes: 87 additions & 69 deletions modules/Manipulating_Data_in_R/Manipulating_Data_in_R.Rmd
@@ -26,11 +26,10 @@ library(tidyverse)

## Recap of Data Cleaning

- `recode()` can help with simple recoding (not based on condition but simple swap)
- `case_when()` can recode **entire values** based on conditions
- remember `case_when()` needs `TRUE ~ variable` to keep values that aren't specified by conditions, otherwise they will be `NA`
- `stringr` package has great functions for looking for specific **parts of values** especially `filter()` and `str_detect()` combined
- also has other useful string manipulation functions like `str_replace()` and more!
- `separate()` can split columns into additional columns
- `unite()` can combine columns
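
As a quick refresher, here is a minimal sketch of the `case_when()` fallback and the `filter()` + `str_detect()` combination. The `pets` data and its column names are made up for illustration and are not part of the course datasets.

```{r}
library(tidyverse)

# Hypothetical toy data (an assumption, not a course dataset)
pets <- tibble(animal = c("dog", "cat", "bird", "Dog"))

# Without a fallback, values that match no condition become NA
pets %>% mutate(animal2 = case_when(animal == "dog" ~ "canine"))

# `TRUE ~ animal` keeps the original value when no condition matches
recoded <- pets %>%
  mutate(animal2 = case_when(
    animal == "dog" ~ "canine",
    TRUE ~ animal
  ))
recoded

# filter() + str_detect() keeps rows whose value contains the pattern "og"
og_rows <- pets %>% filter(str_detect(animal, "og"))
og_rows
```

Note that matching is case-sensitive, so `"Dog"` is not recoded but is still found by the `"og"` pattern.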

Expand Down Expand Up @@ -152,52 +151,41 @@ You might see old functions `gather` and `spread` when googling. These are older

## Reshaping data from **wide to long**

```{r, echo = FALSE}
wide_data <- tibble(
June_vacc_rate = 0.516,
May_vacc_rate = 0.514,
April_vacc_rate = 0.511
)
```{r, message = FALSE}
wide_vacc <- read_csv(
file = "https://jhudatascience.org/intro_to_r/data/wide_vacc.csv")
```

```{r}
wide_data
long_data <- wide_data %>% pivot_longer(cols = everything())
long_data
wide_vacc
long_vacc <- wide_vacc %>% pivot_longer(cols = everything())
long_vacc
```

## Reshaping data from **wide to long** {.codesmall}
## Reshaping wide to long: Better column names {.codesmall}

`pivot_longer()` - puts column data into rows (`tidyr` package)

- First describe which columns we want to "pivot_longer"
- `names_to =` gives a new name to the pivoted columns
- `values_to =` gives a new name to the values that used to be in those columns
- `names_to =` new name for old columns
- `values_to =` new name for old cell values

<div class = "codeexample">
```{r, eval=FALSE}
{long_data} <- {wide_data} %>% pivot_longer(cols = {columns to pivot},
names_to = {New column name: contains old column names},
values_to = {New column name: contains cell values})
names_to = {name for old columns},
values_to = {name for cell values})
```
</div>

## Reshaping data from **wide to long**

```{r, echo = FALSE}
wide_data <- tibble(
June_vacc_rate = 0.516,
May_vacc_rate = 0.514,
April_vacc_rate = 0.511
)
```

```{r}
wide_data
long_data <- wide_data %>% pivot_longer(cols = everything(),
wide_vacc
long_vacc <- wide_vacc %>% pivot_longer(cols = everything(),
names_to = "Month",
values_to = "Rate")
long_data
long_vacc
```

Newly created column names are enclosed in quotation marks.
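
`cols = everything()` pivots every column. When an identifier column should stay put, the columns to pivot can be described with a selection helper instead. The sketch below assumes a small made-up table (`wide_demo`) whose values echo the vaccination examples in these slides. It is an illustration, not the contents of `wide_vacc.csv`.

```{r}
library(tidyverse)

# Hypothetical toy table: one identifier column plus two rate columns
wide_demo <- tibble(
  State = c("Alabama", "Alaska"),
  June_vacc_rate = c(0.516, 0.627),
  May_vacc_rate = c(0.514, 0.626)
)

# Pivot only the columns whose names end in "vacc_rate"; `State` is left alone
long_demo <- wide_demo %>%
  pivot_longer(cols = ends_with("vacc_rate"),
               names_to = "Month",
               values_to = "Rate")
long_demo
```

Each state now has one row per pivoted column, and `State` is repeated alongside the new `Month` and `Rate` columns.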
@@ -212,13 +200,13 @@ circ <- read_circulator()
head(circ, 5)
```

## Mission: Taking the averages by line
## Mission: Taking the average boardings by line

Let's imagine we want to create a table of average ridership by route/line. Results should look something like:
Let's imagine we want to create a table of average boardings by route/line. Results should look something like:

```{r, message = FALSE}
```{r, message = FALSE, echo = FALSE}
example <- tibble(line = c("orange","purple","green","banner"),
avg = c("600(?)", "700(?)", "500(?)", "400(?)")
avg_boardings = c("600(?)", "700(?)", "500(?)", "400(?)")
)
example
```
@@ -233,11 +221,11 @@ long

## Reshaping data from **wide to long**

Un-pivoted columns appear the same as before (`day`, `date`, `daily`)
Un-pivoted columns (`day`, `date`, `daily`) are similar

```{r}
head(circ, n = 2)
head(long, n = 2)
circ %>% select(day, date, daily) %>% head()
long %>% select(day, date, daily) %>% head()
```

## Cleaning up long data
@@ -246,9 +234,8 @@ We will use `str_replace` from the `stringr` package to put `_` in the names

```{r}
long <- long %>% mutate(
name = str_replace(name, "Board", "_Board"),
name = str_replace(name, "Alight", "_Alight"),
name = str_replace(name, "Average", "_Average")
name = str_replace(string = name, pattern = "B", replacement = "_B"),
name = str_replace(string = name, pattern = "A", replacement = "_A")
)
long
```
@@ -265,14 +252,24 @@ long <- long %>%
long
```

## Mission: Taking the averages by line
## Mission: Taking the average boardings by line

Filter to keep Boardings only.

```{r}
long <- long %>%
filter(type == "Boardings")
long
```

## Mission: Taking the average boardings by line

Now our data is more tidy, and we can take the averages easily!

```{r}
long %>%
group_by(line) %>%
summarize("avg" = mean(value, na.rm = TRUE))
summarize("avg_boardings" = mean(value, na.rm = TRUE))
```
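
To see what `na.rm = TRUE` is doing in the summary, here is a small sketch on made-up data. The `toy_long` table and its values are assumptions for illustration and are not the Circulator data.

```{r}
library(tidyverse)

# Hypothetical long data with one missing value
toy_long <- tibble(
  line = c("orange", "orange", "purple", "purple"),
  type = c("Boardings", "Boardings", "Boardings", "Alightings"),
  value = c(600, NA, 700, 650)
)

toy_avg <- toy_long %>%
  filter(type == "Boardings") %>%   # keep boardings only
  group_by(line) %>%                # one group per line
  summarize(avg_boardings = mean(value, na.rm = TRUE))  # ignore the NA
toy_avg
```

Without `na.rm = TRUE`, the average for the orange line would be `NA` because one of its values is missing.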

## Reshaping data from **wide to long**
Expand Down Expand Up @@ -304,10 +301,10 @@ circ %>%
## Reshaping data from **long to wide**

```{r}
long_data
wide_data <- long_data %>% pivot_wider(names_from = "Month",
long_vacc
wide_vacc <- long_vacc %>% pivot_wider(names_from = "Month",
values_from = "Rate")
wide_data
wide_vacc
```
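
`pivot_wider()` and `pivot_longer()` undo each other. A minimal round-trip sketch follows, using a made-up table (`long_rt`) with values taken from the earlier vaccination example rather than read from the CSV file.

```{r}
library(tidyverse)

# Hypothetical long table: one row per month
long_rt <- tibble(
  Month = c("June_vacc_rate", "May_vacc_rate", "April_vacc_rate"),
  Rate = c(0.516, 0.514, 0.511)
)

# Long to wide: values in `Month` become column names
wide_rt <- long_rt %>%
  pivot_wider(names_from = "Month", values_from = "Rate")
wide_rt

# Wide back to long: column names go back into `Month`
back_rt <- wide_rt %>%
  pivot_longer(cols = everything(), names_to = "Month", values_to = "Rate")
back_rt
```

The wide version has a single row with one column per month, and pivoting it back recovers the original three rows.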

## Reshaping Charm City Circulator
Expand Down Expand Up @@ -343,7 +340,7 @@ wide

"Combining datasets"

```{r, fig.alt="Inner, outer, left, and right joins represented with venn diagrams", out.width = "100%", echo = FALSE, align = "center"}
```{r, fig.alt="Inner, outer, left, and right joins represented with venn diagrams", out.width = "70%", echo = FALSE, align = "center"}
knitr::include_graphics("images/joins.png")
```

@@ -359,16 +356,11 @@ knitr::include_graphics("images/joins.png")

## Merging: Simple Data

```{r echo=FALSE}
data_As <- tibble(
State = c("Alabama", "Alaska"),
June_vacc_rate = c(0.516, 0.627),
May_vacc_rate = c(0.514, 0.626)
)
data_cold <- tibble(
State = c("Maine", "Alaska"),
April_vacc_rate = c(0.795, 0.623)
)
```{r message=FALSE}
data_As <- read_csv(
file = "https://jhudatascience.org/intro_to_r/data/data_As_1.csv")
data_cold <- read_csv(
file = "https://jhudatascience.org/intro_to_r/data/data_cold_1.csv")
```

```{r}
Expand Down Expand Up @@ -401,6 +393,8 @@ knitr::include_graphics("images/left_join.gif")

## Left Join

"Everything to the left of the comma"

```{r left_join}
lj <- left_join(data_As, data_cold)
lj
Expand Down Expand Up @@ -429,6 +423,8 @@ knitr::include_graphics("images/right_join.gif")

## Right Join

"Everything to the right of the comma"

```{r right_join}
rj <- right_join(data_As, data_cold)
rj
Expand Down Expand Up @@ -458,12 +454,11 @@ fj

## Watch out for "`includes duplicates`"

```{r echo=FALSE}
data_As <- tibble(State = c("Alabama", "Alaska"),
state_bird = c("wild turkey", "willow ptarmigan"))
data_cold <- tibble(State = c("Maine", "Alaska", "Alaska"),
vacc_rate = c(0.795, 0.623, 0.626),
month = c("April", "April", "May"))
```{r message=FALSE}
data_As <- read_csv(
file = "https://jhudatascience.org/intro_to_r/data/data_As_2.csv")
data_cold <- read_csv(
file = "https://jhudatascience.org/intro_to_r/data/data_cold_2.csv")
```
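
The effect of duplicate keys can also be sketched without downloading anything. The tables below (`birds` and `rates`) are hypothetical stand-ins built in memory. Whether the CSV files contain exactly these values is an assumption.

```{r}
library(tidyverse)

# Hypothetical stand-in tables; "Alaska" appears twice in `rates`
birds <- tibble(State = c("Alabama", "Alaska"),
                state_bird = c("wild turkey", "willow ptarmigan"))
rates <- tibble(State = c("Maine", "Alaska", "Alaska"),
                vacc_rate = c(0.795, 0.623, 0.626),
                month = c("April", "April", "May"))

lj_dup <- left_join(birds, rates, by = "State")
lj_dup

nrow(birds)   # 2 rows going in
nrow(lj_dup)  # 3 rows coming out: the Alaska row is repeated once per match
```

A left join keeps every row of the left table, but a left row that matches several right rows is repeated, so the result can have more rows than the left table.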

```{r}
Expand Down Expand Up @@ -499,6 +494,8 @@ knitr::include_graphics("images/left_join_extra.gif")

## Stop `tidylog`

`unloadNamespace()` does the opposite of `library()`.

```{r}
unloadNamespace("tidylog")
```
@@ -521,29 +518,44 @@ If the datasets have two different names for the same data, use:
full_join(x, y, by = c("a" = "b"))
```

## Using "`setdiff`"
## Getting the set difference with `setdiff`

We might want to determine what indexes ARE in the first dataset that AREN'T in the second.

We might want to determine what indexes ARE in the first dataset that AREN'T in the second:
For this to work, the datasets need the same columns.

We'll just select the index using `select()`.

```{r}
data_As
data_cold
A_states <- data_As %>% select(State)
cold_states <- data_cold %>% select(State)
```

## Using "`setdiff`"
## Getting the set difference with `setdiff`

Use `setdiff` to determine what indexes ARE in the first dataset that AREN'T in the second:
States in `A_states` but not in `cold_states`

```{r}
A_states <- data_As %>% pull(State)
cold_states <- data_cold %>% pull(State)
dplyr::setdiff(A_states, cold_states)
```

States in `cold_states` but not in `A_states`

```{r}
setdiff(A_states, cold_states)
setdiff(cold_states, A_states)
dplyr::setdiff(cold_states, A_states)
```

## Getting the set difference with `setdiff`

Why did we use `dplyr::setdiff`?

There is also a base R function called `setdiff`, which requires vectors.

In other words, we use `dplyr::` to be specific about the package we want to use.

More set operations can be found here:
https://dplyr.tidyverse.org/reference/setops.html
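
To make the difference concrete, here is a small sketch comparing the two on made-up one-column tables. `A_toy` and `cold_toy` are hypothetical names, not the objects created earlier.

```{r}
library(tidyverse)

# Hypothetical one-column tables with the same column name
A_toy <- tibble(State = c("Alabama", "Alaska"))
cold_toy <- tibble(State = c("Maine", "Alaska"))

# dplyr::setdiff() works on data frames and returns a data frame
only_A <- dplyr::setdiff(A_toy, cold_toy)
only_A
only_cold <- dplyr::setdiff(cold_toy, A_toy)
only_cold

# base::setdiff() works on vectors, so the column is pulled out first
only_A_vec <- base::setdiff(pull(A_toy, State), pull(cold_toy, State))
only_A_vec
```

The order of the arguments matters: the result contains what is in the first argument but not in the second.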

## Summary

* Merging/joining data sets together - assumes all column names that overlap
@@ -591,3 +603,9 @@ wide2 <- long2 %>% pivot_wider(names_from = "type",
values_from = "value")
wide2
```

## Fast manipulation using `collapse` package

https://sebkrantz.github.io/collapse/

The `collapse` package might be helpful if your data is very large. However, `dplyr` and `tidyr` functions are great for most applications.
29 changes: 12 additions & 17 deletions modules/Manipulating_Data_in_R/lab/Manipulating_Data_in_R_Lab.Rmd
@@ -1,5 +1,5 @@
---
title: "Manipulating Data in R Lab"
title: "Manipulating Data in R Lab - Key"
output: html_document
editor_options:
chunk_output_type: console
@@ -32,7 +32,6 @@ library(tidyr)
2\. Look at the column names using `colnames` - do you notice any patterns?

```{r}
colnames(vacc)
```

@@ -73,7 +72,7 @@ new_data <- old_data %>% pivot_longer(cols = colname(s))
```


6\. Filter the "Entity" column so it only includes values in the following list: "Maryland","Virginia","Florida","Massachusetts", "United States". **Hint**: use `filter` and `%in%`.
6\. Using `vacc_long`, filter the "Entity" column so it only includes values in the following list: "Maryland","Virginia","Florida","Massachusetts", "United States". **Hint**: use `filter` and `%in%`.

```
# General format
@@ -99,22 +98,14 @@ new_data <- old_data %>% pivot_wider(names_from = column1, values_from = column2

**Bonus / Extra practice**:

A\. Why is using `read.csv()` in Question 1 problematic? Use `colnames()` to examine the two different methods and datasets (vacc1 and vacc2) below.

```{r}
vacc1 <- read_csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
vacc2 <- read.csv("http://jhudatascience.org/intro_to_r/data/USA_covid19_vaccinations.csv")
```


B\. Take the code from Questions 1 and 3-7. Chain all of this code together using the pipe ` %>% `. Call your data `vacc_compare`.
A\. Take the code from Questions 1 and 3-7. Chain all of this code together using the pipe ` %>% `. Call your data `vacc_compare`.

```{r}
```


C\. Modify the code from Question B:
B\. Modify the code from Question A:

- Look for columns that start with "Total" (instead of "Percent") and
- Select different states/Entities to compare
@@ -176,25 +167,29 @@ new_data <- full_join(x, y, by = c("colname1" = "colname2"))

**Bonus / Extra practice**:

D\. Do a left join of "vacc" and "gdp". Call the output "left". How many observations are there?
C\. Do a left join of "vacc" and "gdp". Call the output "left". How many observations are there?

```{r}
```


E\. Copy your code from Question D and change it to a `right_join` with the same order of the arguments. Call the output "right". How many observations are there?
D\. Copy your code from Question C and change it to a `right_join` with the same order of the arguments. Call the output "right". How many observations are there?

```{r}
```


F\. Perform a `setdiff` on "vacc" and "gdp" to determine what Entities are missing from the GDP data and which are missing from the vaccine data. Remember you need to `pull()` the columns you want to compare.
E\. Perform a `setdiff` on "vacc" and "gdp" to determine what Entities are missing from the GDP data and which are missing from the vaccine data.

- First, `select()` only the columns you want to compare. Create new objects for each dataset.
- Rename the selected column for *one* of the datasets, so that both datasets have the same column name. You can use `rename()`.
- Then use `setdiff()`

```
# General format
setdiff(PULLED_COL_1, PULLED_COL_2)
dplyr::setdiff(DATA_1, DATA_2)
```

```{r}