# Other purrr functions
```{r message = FALSE, warning = FALSE}
library(tidyverse)
```
In this reading, you'll learn about two more map variants, `map_dfr()` and `map_dfc()`. Then, you'll learn about `walk()`, as well as some useful purrr functions that work with functions that return either `TRUE` or `FALSE`.
The purrr package contains more functions than we can cover. The [purrr cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) is a great way to find helpful functions when you encounter a new type of iteration problem.
## Map functions that output tibbles
Instead of creating an atomic vector or list, the map variants `map_dfr()` and `map_dfc()` create a tibble.
With these map functions, the assembly line worker creates a tibble for each input element, and the output conveyor belt ends up with a collection of tibbles.
```{r echo=FALSE}
knitr::include_graphics("images/map_df.png", dpi = image_dpi)
```
The worker then combines all the small tibbles into a single, larger tibble. There are multiple ways to combine smaller tibbles into a larger tibble. `map_dfr()` (*r* for *rows*) stacks the smaller tibbles on top of each other.
```{r echo=FALSE}
knitr::include_graphics("images/map_dfr.png", dpi = image_dpi)
```
`map_dfc()` (*c* for *columns*) stacks them side-by-side.
```{r echo=FALSE}
knitr::include_graphics("images/map_dfc.png", dpi = image_dpi)
```
There are `_dfr` and `_dfc` variants of `pmap()` and `map2()` as well. In the following sections, we'll cover `map_dfr()` and `map_dfc()` in more detail.
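As a minimal sketch of one of these variants, `map2_dfr()` iterates over two vectors in parallel and row-binds the tibbles that the function returns. The vectors `means` and `sds` and the helper `summarize_pair()` are hypothetical names used only for illustration.

```{r}
library(tidyverse)

# Hypothetical inputs: one mean and one standard deviation per row
means <- c(0, 10)
sds <- c(1, 5)

# Hypothetical helper: returns a one-row tibble for each (mean, sd) pair
summarize_pair <- function(m, s) {
  tibble(mean = m, sd = s, upper = m + 2 * s)
}

# map2_dfr() calls summarize_pair() once per pair and stacks the rows
summaries <- map2_dfr(means, sds, summarize_pair)
summaries
```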
### `_dfr`
`map_dfr()` is useful when reading in data from multiple files. The following code reads in several very simple CSV files, each of which contains the name of a different dinosaur genus.
```{r, message=FALSE}
read_csv("data/purrr-extras/file_001.csv")
read_csv("data/purrr-extras/file_002.csv")
read_csv("data/purrr-extras/file_003.csv")
```
`read_csv()` produces a tibble, and so we can use `map_dfr()` to map over all three file names and bind the resulting individual tibbles into a single tibble.
```{r, message=FALSE}
files <- str_glue("data/purrr-extras/file_00{1:3}.csv")
files
files %>% 
  map_dfr(read_csv)
```
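When stacking files, it is often helpful to record which file each row came from. The sketch below uses the `.id` argument of `map_dfr()`, which adds a column holding the names of the input vector. To stay self-contained, it writes a few hypothetical CSV files to a temporary directory instead of using the files above; the file names and the vector `genera` are assumptions made for illustration.

```{r, message=FALSE}
library(tidyverse)

# Write three small, hypothetical CSV files to a temporary directory
demo_dir <- tempfile("genera_")
dir.create(demo_dir)
genera <- c("Hadrosaurus", "Plateosaurus", "Triceratops")
paths <- file.path(demo_dir, paste0("demo_", seq_along(genera), ".csv"))
walk2(genera, paths, ~ write_csv(tibble(genus = .x), .y))

# Name each path by its file name, then let .id store that name in a column
combined <-
  paths %>% 
  set_names(basename(paths)) %>% 
  map_dfr(read_csv, show_col_types = FALSE, .id = "file")
combined
```

Because the input vector is named, the `file` column contains the file names rather than positions.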
The result is a tibble with three rows and two columns, because `map_dfr()` aligns the columns of the individual tibbles by name.
The individual tibbles can have different numbers of rows or columns. `map_dfr()` just creates a column for each unique column name. If some of the individual tibbles lack a column that others have, `map_dfr()` fills in with `NA` values.
```{r, message=FALSE}
read_csv("data/purrr-extras/file_004.csv")
c(files, "data/purrr-extras/file_004.csv") %>% 
  map_dfr(read_csv)
```
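The same filling behavior can be seen without any files. The following sketch builds a hypothetical list of tibbles in memory, one of which has an extra `diet` column, and stacks them with `map_dfr()`. The values are made up for illustration and are not the contents of the CSV files above.

```{r}
library(tidyverse)

# Hypothetical tibbles: only the third has a `diet` column
pieces <-
  list(
    tibble(id = 1, genus = "Hadrosaurus"),
    tibble(id = 2, genus = "Plateosaurus"),
    tibble(id = 3, genus = "Triceratops", diet = "herbivore")
  )

# identity() returns each tibble unchanged; map_dfr() then row-binds them
stacked <-
  pieces %>% 
  map_dfr(identity)
stacked
```

The rows that came from tibbles without a `diet` column contain `NA` in that column.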
### `_dfc`
`map_dfc()` is typically less useful than `map_dfr()` because it relies on row position to stack the tibbles side-by-side. Relying on row position is error-prone, and it is often difficult to check whether the data in each row is aligned correctly. However, if the variables you need are spread across several sources and you are positive the rows are aligned, `map_dfc()` may be appropriate.
Unfortunately, even if the individual tibbles contain a unique identifier for each row, `map_dfc()` doesn't use the identifiers to verify that the rows are aligned correctly, nor does it combine identically named columns.
```{r, message=FALSE}
read_csv("data/purrr-extras/file_005.csv")
c("data/purrr-extras/file_001.csv", "data/purrr-extras/file_005.csv") %>% 
  map_dfc(read_csv)
```
Instead, you end up with a duplicated column (`id...1` and `id...3`).
If you have a unique identifier for each row, it is much better to join on that identifier.
```{r, message=FALSE}
left_join(
  read_csv("data/purrr-extras/file_001.csv"),
  read_csv("data/purrr-extras/file_005.csv"),
  by = "id"
)
```
Also, because `map_dfc()` combines tibbles by row position, the tibbles can have different numbers of columns, but they should have the same number of rows.
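A small sketch of the row-count requirement, using hypothetical in-memory data: the helper `make_col()` returns a three-row tibble with a uniquely named column, so the pieces bind cleanly. The second call produces tibbles with three and two rows, which `map_dfc()` cannot bind; the error is caught with `tryCatch()` so the example keeps running.

```{r}
library(tidyverse)

# Hypothetical helper: a three-row tibble whose column is named after the power
make_col <- function(p) {
  tibble(x = (1:3)^p) %>% 
    set_names(paste0("pow_", p))
}

# Both pieces have three rows, so they bind side-by-side
powers <- map_dfc(1:2, make_col)
powers

# Pieces with three and two rows cannot be aligned by position
mismatch <-
  tryCatch(
    map_dfc(list(1:3, 1:2), ~ tibble(x = .)),
    error = function(e) "error: row counts differ"
  )
mismatch
```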
## Walk
The walk functions work similarly to the map functions, but you use them when you're interested in applying a function that performs an action instead of producing data (e.g., `print()`).
The walk functions are useful for performing actions like writing files and printing plots. For example, say we used purrr to generate a list of plots.
```{r message=FALSE}
set.seed(745)
plot_rnorm <- function(sd) {
  tibble(x = rnorm(n = 5000, mean = 0, sd = sd)) %>% 
    ggplot(aes(x)) +
    geom_histogram(bins = 40) +
    geom_vline(xintercept = 0, color = "blue")
}
plots <-
  c(5, 1, 9) %>% 
  map(plot_rnorm)
```
We can now use `walk()` to print them out.
```{r}
plots %>% 
  walk(print)
```
The walk functions look like they don't return anything, but they actually return their input *invisibly*. When functions return something invisibly, it just means they don't print their return value out when you call them. This functionality makes the walk functions useful in pipes. You can call a walk function to perform an action, get your input back, and continue operating on that input.
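As a minimal sketch of this pattern, the pipe below uses `walk()` to print a message for each element and then passes the unchanged input on to `map_dbl()`. The vector and the message text are hypothetical.

```{r}
library(tidyverse)

doubled <-
  c(1, 2, 3) %>% 
  # Side effect only: print a message for each element
  walk(~ cat("Processing value", ., "\n")) %>% 
  # walk() returned its input invisibly, so the pipe continues with c(1, 2, 3)
  map_dbl(~ . * 2)
doubled
```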
## Predicate functions
In Chapter 7, we introduced _predicate functions_, which are functions that return a single `TRUE` or `FALSE`. purrr includes several useful functions that work with predicate functions.
`keep()` and `discard()` iterate over a vector and apply the predicate function to each element. `keep()` retains the elements for which the predicate returns `TRUE`, and `discard()` drops them.
```{r}
x <-
  list(
    a = c(1, 2),
    b = c(4, 5, 6),
    c = c("a", "z")
  )
x %>% 
  discard(is.character)
```
```{r}
x %>% 
  keep(~ length(.) == 2)
```
With tibbles, you can use `keep()` and `discard()` to select columns that meet a certain condition.
```{r}
mpg %>% 
  keep(is.numeric)
```
`some()` looks at the entire input vector and returns `TRUE` if the predicate is true for any element of the vector and `FALSE` otherwise.
```{r}
mpg %>% 
  some(is.numeric)
```
For `every()` to return `TRUE`, every element of the vector must meet the predicate.
```{r}
mpg %>% 
  every(is.numeric)
```
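The contrast between `some()` and `every()` is easier to see on a small list with a custom predicate. The list `v` below is hypothetical. The last two lines show the edge case of an empty input, where no element can satisfy the predicate and no element can fail it.

```{r}
library(tidyverse)

# Hypothetical list: two elements have length 3, one has length 2
v <- list(a = 1:3, b = 4:6, c = 7:8)

# At least one element has length 3
any_len_3 <- v %>% some(~ length(.) == 3)
# Not every element has length 3
all_len_3 <- v %>% every(~ length(.) == 3)

# Empty input: nothing matches, and nothing fails
empty_some <- list() %>% some(is.numeric)
empty_every <- list() %>% every(is.numeric)

c(any_len_3, all_len_3, empty_some, empty_every)
```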
Other useful purrr functions that use predicate functions include `head_while()`, `compact()`, `has_element()`, and `detect()`. Take a look at the [purrr cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) for details.
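As a brief sketch of the four functions just mentioned, the following calls use small hypothetical inputs. Each comment describes what the function does, not a guaranteed printed form.

```{r}
library(tidyverse)

# compact() drops elements that are NULL or empty
compacted <- compact(list(3, NULL, "a", character(0)))

# detect() returns the first element for which the predicate is TRUE
first_big <- detect(c(3, 8, 12), ~ . > 5)

# head_while() keeps leading elements until the predicate first fails
leading <- head_while(c(3, 8, 12, 1), ~ . < 10)

# has_element() checks whether a list contains a given object
found <- has_element(list(1, "a"), "a")

compacted
first_big
leading
found
```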