---
title: "06_tidyr"
output: html_document
---
class: middle, center, inverse
layout: false
# 4.4 `tidyr`:<br><br>Tidy Messy Data
---
background-image: url(https://raw.githubusercontent.com/tidyverse/tidyr/master/man/figures/logo.png)
background-position: 97.5% 2.5%
background-size: 7.5%
layout: true
---
## 4.4 `tidyr`: Tidy Messy Data
`tidyr` provides several functions that help you bring your data into the *tidy data* format (e.g., reshaping data, splitting columns, handling missing values or nesting data).
```{r}
penguins
```
???
- Let's again start with our `penguins` data set, which is already in *tidy data* format
- in the following, I highlight the dimensionality of the data to show you what happens
DIM: 344 x 8
---
## 4.4 `tidyr`: Tidy Messy Data
**Pivoting:** Converts data between long and wide format using `pivot_longer()` and `pivot_wider()`.
.panelset[
.panel[.panel-name[pivot_longer()]
```{r}
long_penguins <- penguins %>%
  pivot_longer(
    cols = c(species, island),
    names_to = "variable", values_to = "value"
  )
long_penguins %>% glimpse()
```
]
.panel[.panel-name[pivot_wider()]
```{r}
long_penguins %>%
  pivot_wider(
    names_from = "variable", values_from = "value"
  ) %>%
  glimpse()
```
]
]
???
`pivot_longer()`:
- each observation now spans two rows, one per pivoted variable -> the data are no longer in tidy format
- DIM: 688 x 8
`pivot_wider()`:
- inverts `pivot_longer()`
- DIM: 344 x 8
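The round trip can be sketched on a small toy table, which makes the change in dimensions easy to check by hand. The `toy` tibble and its columns are made up for illustration and only stand in for `penguins`.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for `penguins`
toy <- tibble(
  id      = 1:3,
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island  = c("Torgersen", "Biscoe", "Dream"),
  mass_g  = c(3750, 5000, 3700)
)

# Long format: two rows per observation, one per pivoted column
toy_long <- toy %>%
  pivot_longer(
    cols = c(species, island),
    names_to = "variable", values_to = "value"
  )
dim(toy_long)  # 6 rows, 4 columns

# Wide format: back to one row per observation
toy_wide <- toy_long %>%
  pivot_wider(names_from = "variable", values_from = "value")
dim(toy_wide)  # 3 rows, 4 columns
```

Note that the column order after the round trip may differ from the original, because the pivoted columns are appended at the end.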
---
## 4.4 `tidyr`: Tidy Messy Data
.right[
```{r, echo=F, out.height='80%', out.width='80%'}
knitr::include_graphics("https://raw.githubusercontent.com/apreshill/teachthat/master/pivot/pivot_longer_smaller.gif")
```
]
.footnote[.pull-left[
*Source: [Allison Hill](https://github.com/apreshill/teachthat/blob/master/pivot/pivot_longer_smaller.gif)*
<i>Note: Find more information about `pivot_*()` in the [pivoting vignette](https://tidyr.tidyverse.org/articles/pivot.html).</i>
]]
---
name: tidyr_nest
## 4.4 `tidyr`: Tidy Messy Data
**Nesting:** Groups similar data such that each group becomes a single row in a data frame.
```{r}
nested_penguins <-
  penguins %>%
  nest(nested_data = c(island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex))
nested_penguins
```
???
- note that `nest()` produces a nested data frame with one row per species and year
- note that the `nested_data` column contains tibbles with six columns each and a varying number of observations
- working with nested data is particularly helpful if you would like to apply a function to each subset of the data (e.g., fit a model for each year or for each species)
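A minimal sketch of the "one model per group" idea, assuming `purrr` is available for iterating over the list column. The `toy` data, the column names, and the choice of `lm()` are illustrative assumptions, not part of the slides.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy data: two species with four observations each
toy <- tibble(
  species   = rep(c("Adelie", "Gentoo"), each = 4),
  flipper   = c(181, 186, 195, 193, 211, 230, 210, 218),
  body_mass = c(3750, 3800, 3250, 3450, 4500, 5700, 4400, 5200)
)

models <- toy %>%
  # One row per species; the measurements move into a list column
  nest(nested_data = c(flipper, body_mass)) %>%
  mutate(
    # Fit one linear model per nested tibble
    fit   = map(nested_data, ~ lm(body_mass ~ flipper, data = .x)),
    # Pull the slope out of each fitted model
    slope = map_dbl(fit, ~ coef(.x)[["flipper"]])
  )
models
```

The models stay next to the data they were fitted on, so later steps (predictions, diagnostics) can be added as further list columns.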
---
name: nested-data
## 4.4 `tidyr`: Tidy Messy Data
**Rectangling:** Disentangles nested data structures (e.g., JSON, HTML) and brings them into *tidy data* format.
.panelset[
.panel[.panel-name[pluck()]
Extract individual objects from a nested data structure via `purrr::pluck()`.
```{r}
nested_penguins %>% purrr::pluck("nested_data", 1)
```
]
.panel[.panel-name[unnest()]
Flatten nested data structures via `tidyr::unnest()`.
```{r}
nested_penguins %>% unnest(cols = c(nested_data))
```
]
.panel[.panel-name[hoist()]
Selectively extract individual components from an object in a nested data structure via `tidyr::hoist()`.
```{r}
nested_penguins %>% hoist(nested_data, hoisted_col = "bill_length_mm")
```
]
]
???
Alternatively use `unnest_wider()` or `unnest_longer()` for more control over the rectangling operation.
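A short sketch of these two functions on JSON-like input. The `records` tibble and its fields are hypothetical and only mimic what a parsed JSON response might look like.

```r
library(dplyr)
library(tidyr)

# Hypothetical JSON-like records: one named list per row
records <- tibble(
  id   = 1:2,
  info = list(
    list(species = "Adelie", islands = c("Torgersen", "Dream")),
    list(species = "Gentoo", islands = "Biscoe")
  )
)

# unnest_wider(): each component of the list becomes its own column
wide <- records %>% unnest_wider(info)

# unnest_longer(): each element of a list column becomes its own row
long <- wide %>% unnest_longer(islands)
long
```

`unnest_wider()` leaves `islands` as a list column here because its elements differ in length, which is why `unnest_longer()` is applied afterwards.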
---
## 4.4 `tidyr`: Tidy Messy Data
**Splitting** and **Combining:** Transforms a single character column into multiple columns and vice versa.
.panelset[
.panel[.panel-name[unite()]
Collapse multiple columns into a single column.
```{r}
penguins %>% unite(col = "species_gender", c(species, sex), sep = "_", remove = TRUE)
```
]
.panel[.panel-name[separate()]
Separate a single column, containing multiple values, into multiple columns.
```{r}
penguins %>% separate(bill_length_mm, sep = 2, into = c("cm", "mm"))
```
]
.panel[.panel-name[separate_rows()]
Separate a single column, containing multiple values, into multiple rows.
```{r}
penguins %>% separate_rows(island, sep = "s", convert = TRUE)
```
]
]
???
`separate()` can also split on a character match (a regular expression passed to `sep`) instead of a fixed position
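A small sketch of splitting on a character match, using a made-up `label` column in which the two values are joined by an underscore.

```r
library(dplyr)
library(tidyr)

# Hypothetical column that combines two values with "_"
toy <- tibble(label = c("Adelie_male", "Gentoo_female"))

# A character `sep` is treated as a regular expression to split on
split_toy <- toy %>%
  separate(label, into = c("species", "sex"), sep = "_")
split_toy
```

This is the inverse of the `unite()` call with `sep = "_"`.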
---
## 4.4 `tidyr`: Tidy Messy Data
**Handling missing values:** Drop or replace explicit or implicit missing values (`NA`).
```{r, echo=F}
incompl_penguins <- tibble(
  species = c(rep("Adelie", 2), rep("Gentoo", 1), rep("Chinstrap", 1)),
  year = c(2007, 2008, 2008, 2007),
  measurement = c(rnorm(3, mean = 50, sd = 15), NA)
)
```
.panelset[
.panel[.panel-name[Base Case]
```{r}
incompl_penguins
```
]
.panel[.panel-name[complete()]
Make implicit missing values explicit.
```{r}
incompl_penguins %>%
  complete(species, year, fill = list(measurement = NA))
```
.pull-right[.pull-right[.footnote[
]]]
]
.panel[.panel-name[drop_na()]
Make explicit missing values implicit.
```{r}
incompl_penguins %>%
  drop_na(measurement)
```
]
.panel[.panel-name[fill()]
Replace missing values with the next/previous value.
```{r}
incompl_penguins %>%
  fill(measurement, .direction = "down")
```
]
.panel[.panel-name[replace_na()]
Replace missing values with a pre-defined value.
```{r}
incompl_penguins %>%
  replace_na(replace = list(measurement = mean(.$measurement, na.rm = TRUE)))
```
]
]
.footnote[
*Note: Find more information and functions on the `tidyr` [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-import.pdf).*
]
???
Note: function arguments preceded by a dot in the tidyverse (e.g., `.direction` in `fill()`) usually have one of two reasons:
- the function is still maturing, i.e., the developers are still working out the best way to implement and name it
- the function is regularly called within another function or accepts further arguments via `...`, and the dot prefix keeps its own arguments from being confused with those passed through
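A minimal sketch of why the dot prefix helps, using a made-up tibble that happens to contain a column named `direction`. The column is selected through `...`, while `.direction` remains the function's own argument, so the two cannot be confused.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a column called `direction`
compass <- tibble(
  step      = 1:4,
  direction = c("north", NA, "south", NA)
)

# `direction` is the column to fill, `.direction` controls how it is filled
filled <- compass %>%
  fill(direction, .direction = "down")
filled
```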