forked from UBC-STAT/stat-545-guidebook
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathcm007.Rmd
361 lines (247 loc) · 11.8 KB
/
cm007.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
# Intro to data wrangling, Part II
```{r, warning = FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tsibble))
suppressPackageStartupMessages(library(here))
suppressPackageStartupMessages(library(DT))
```
## Orientation (5 min)
### Worksheet
You can find a worksheet template for today [here](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm007-exercise.Rmd).
### Announcements
- Due tonight:
- Peer review 1 due tonight.
- Homework 2 due tonight.
- Reminder about setting up your own hw repo.
- stat545.com
### Follow-up
- By the way, STAT 545 jumps into the tidyverse way of doing things in R, instead of the base R way of doing things. Lecture 2 was about "just enough" base R to get you started. If you feel that you want more practice here, take a look at [the R intro stat videos by MarinStatsLectures](https://www.youtube.com/playlist?list=PLqzoL9-eJTNARFXxgwbqGo56NtbJnB37A) (link is now in cm002 notes too).
- `if` statement on its own won't work within `mutate()` because it's not vectorized. Would need to vectorize it with `sapply()` or, better, the `purrr::map` family:
```{r}
tibble(a = 1:4) %>%
mutate(b = (if (a < 3) "small" else "big"),
c = sapply(a, function(x) if (x < 3) "small" else "big"),
d = purrr::map_chr(a, ~ (if(.x < 3) "small" else "big")))
```
### Today's Lessons
Where we are with `dplyr`:
- [x] `select()`
- [x] `filter()`
- [x] `arrange()`
- [x] `mutate()`
- [ ] `summarize()`
- [ ] `group_by()`
- [ ] grouped `mutate()`
- [ ] grouped `summarize()`
Today: the unchecked boxes.
We'll then look to a special type of tibble called a __tsibble__, useful for handling data with a time component. In doing so, we will touch on its older cousin, __lubridate__, which makes handling dates and times easier.
### Resources
Concepts from today's class are closely mirrored by the following resources.
- Jenny's tutorial on [Single table dplyr functions](https://stat545.com/dplyr-single.html)
Other resources:
- Like learning from a textbook? Check out all of [r4ds: transform](http://r4ds.had.co.nz/transform.html).
- The [intro to `dplyr` vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html) is also a great resource.
Resources for specific concepts:
- To learn more about window functions and how dplyr handles them, see the [window-functions](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html) vignette for the `dplyr` package.
[tsibble demo](https://tsibble.tidyverts.org/)
## `summarize()` (3 min)
Like `mutate()`, the `summarize()` function also creates new columns, but the calculations that make the new columns must reduce down to a single number.
For example, let's compute the mean and standard deviation of life expectancy in the gapminder data set:
```{r}
gapminder %>%
summarize(mu = mean(lifeExp),
sigma = sd(lifeExp))
```
Notice that all other columns were dropped. This is necessary, because there's no obvious way to compress the other columns down to a single row. This is unlike `mutate()`, which keeps all columns, and more like `transmute()`, which drops all other columns.
As it is, this is hardly useful. But that's outside of the context of _grouping_, coming up next.
## `group_by()` (20 min)
The true power of `dplyr` lies in its ability to group a tibble, with the `group_by()` function. As usual, this function takes in a tibble and returns a (grouped) tibble.
Let's group the gapminder dataset by continent and year:
```{r}
gapminder %>%
group_by(continent, year)
```
The only thing different from a regular tibble is the indication of grouping variables above the tibble. This means that the tibble is recognized as having "chunks" defined by unique combinations of continent and year:
- Asia in 1952 is one chunk.
- Asia in 1957 is another chunk.
- Europe in 1952 is another chunk.
- etc...
Notice that the data frame isn't rearranged by chunk!
Now that the tibble is grouped, operations that you do on a grouped tibble _will be done independently within each chunk_, as if no other chunks exist.
You can also create new variables and group by that variable simultaneously. Try splitting life expectancy by "small" and "large" using 60 as a threshold:
```{r}
gapminder %>%
group_by(smallLifeExp = lifeExp < 60)
```
### Grouped `summarize()` (10 min)
Want to compute the mean and standard deviation for each year for every continent? No problem:
```{r}
gapminder %>%
group_by(continent, year) %>%
summarize(mu = mean(lifeExp),
sigma = sd(lifeExp))
```
Notice:
- The grouping variables are kept in the tibble, because their values are unique within in chunk (by definition of the chunk!)
- With each call to `summarize()`, the grouping variables are "peeled back" from last grouping variable to first.
This means the above tibble is now only grouped by continent. What happens when we reverse the grouping?
```{r}
gapminder %>%
group_by(year, continent) %>% # Different order
summarize(mu = mean(lifeExp),
sigma = sd(lifeExp))
```
The grouping columns are switched, and now the tibble is grouped by year instead of continent.
`dplyr` has a bunch of convenience functions that help us write code more eloquently. We could use `group_by()` and `summarize()` with `length()` to find the number of entries each country has:
```{r}
gapminder %>%
group_by(country) %>%
transmute(n = length(country))
```
Or, we can use `dplyr::n()` to count the number of rows in each group:
```{r}
gapminder %>%
group_by(country) %>%
summarize(n = n())
```
Or better yet, just use `dplyr::count()`:
```{r}
gapminder %>%
count(country)
```
### Grouped `mutate()` (3 min)
Want to get the increase in GDP per capita for each country? No problem:
```{r}
gap_inc <- gapminder %>%
arrange(year) %>%
group_by(country) %>%
mutate(gdpPercap_inc = gdpPercap - lag(gdpPercap))
DT::datatable(gap_inc)
```
You can't see it here (because of the `datatable()` output), but the tibble is still grouped by country.
Drop the `NA`s with another convenience function, this time supplied by the `tidyr` package (another tidyverse package that we'll see soon):
```{r}
gap_inc %>%
tidyr::drop_na()
```
## Function types (5 min)
We've seen cases of transforming variables using `mutate()` and `summarize()`, both with and without `group_by()`. How can you know what combination to use? Here's a summary based on one of three types of functions.
| Function type | Explanation | Examples | In `dplyr` |
|------|-----|----|----|
| Vectorized functions | These take a vector, and operate on each component independently to return a vector of the same length. In other words, they work element-wise. | `cos()`, `sin()`, `log()`, `exp()`, `round()` | `mutate()` |
| Aggregate functions | These take a vector, and return a vector of length 1 | `mean()`, `sd()`, `length()` | `summarize()`, esp with `group_by()`. |
| Window Functions | these take a vector, and return a vector of the same length that depends on the vector as a whole. | `lag()`, `rank()`, `cumsum()` | `mutate()`, esp with `group_by()` |
## `dplyr` Exercises (20 min)
## Dates and Times (5 min)
The `lubridate` package is great for identifying dates and times. You can also do arithmetic with dates and times with the package, but we won't be discussing that.
Make an object of class `"Date"` using a function that's some permutation of `y`, `m`, and `d` (for year, month, and date, respectively). These functions are more flexible than your yoga instructor:
```{r}
lubridate::mdy("September 24, 2019")
lubridate::mdy("Sep 24 2019")
lubridate::mdy("9-24-19")
lubridate::dym(c("24-2019, September", "25 2019 Sep"))
```
Notice that they display the dates all in `ymd` format, which is best for computing because the dates sort nicely this way.
This is not just a character!
```{r}
lubridate::ymd("2019 September 24") %>%
class()
```
You can tag on `hms`, too:
```{r}
lubridate::ymd_hms("2019 September 24, 23:59:59")
```
We can also extract information from these objects. Day of the week:
```{r}
today <- lubridate::ymd("2019 September 24")
lubridate::wday(today, label = TRUE)
```
Day:
```{r}
lubridate::day(today)
```
Number of days into the year:
```{r}
lubridate::yday(today)
```
Is it a leap year this year?
```{r}
lubridate::leap_year(today)
```
The newer `tsibble` package gives these `lubridate` functions some friends. What's the year and month? Year and week?
```{r}
tsibble::yearmonth(today)
tsibble::yearweek(today)
```
## Tsibbles (15 min)
A `tsibble` (from the package of the same name) is a special type of `tibble`, useful for handling data where a column indicates a time variable.
As an example, here are daily records of a household's electricity usage:
```{r}
energy <- here::here("data", "daily_consumption.csv") %>%
read_csv()
energy
```
Let's make this a `tsibble` in the same way we'd convert a data frame to a `tibble`: with the `as_tsibble()` function. The conversion requires you to specify which column contains the time index, using the `index` argument.
```{r}
(energy <- as_tsibble(energy, index = date))
```
We already see an improvement vis-a-vis the sorted dates!
This is an example of _time series_ data, because the time interval has a regular spacing. A `tsibble` cleverly determines and stores this interval. With the energy consumption data, the interval is one day ("1D" means "1 day", not "1 dimension"!):
```{r}
interval(energy)
```
Notice that there is no record for December 21, 2006, in what would be Row 5. Such records are called _implicit NA's_, because they're actually missing, but aren't explicitly shown as missing in your data set. If you don't make these explicit, you could mess up your analysis if it's anticipating your data to be equally spaced in time. Just `full_gaps()` to bring them out of hiding:
```{r}
(energy <- fill_gaps(energy))
```
Already, it's better to plot the data now that these gaps are filled in. Let's check out 2010. See how the plot without NA's can be a little misleading? Moral: always be as honest as possible with your data.
```{r}
small_energy <- filter(energy, year(date) == 2010)
cowplot::plot_grid(
ggplot(small_energy, aes(date, intensity)) +
geom_line() +
theme_bw() +
xlab("Date (in 2010)") +
ggtitle("NA's made explicit"),
ggplot(drop_na(small_energy), aes(date, intensity)) +
geom_line() +
theme_bw() +
xlab("Date (in 2010)") +
ggtitle("NA's in hiding (implicit)"),
nrow = 2
)
```
How would we convert `gapminder` to a `tsibble`, since it has a time series per country? Use the `key` argument to specify the grouping:
```{r}
gapminder %>%
as_tsibble(index = year, key = country)
```
### `index_by()` instead of `group_by()` (5 min)
It looks like there's seasonality in intensity across the year:
```{r}
ggplot(energy, aes(yday(date), intensity)) +
geom_point() +
theme_bw() +
labs(x = "Day of the Year")
```
Let's get a mean estimate of intensity on each day of the year. We'd like to `group_by(yday(date))`, but because we're grouping on the index variable, we use `index_by()` instead.
```{r}
energy %>%
tsibble::index_by(day_of_year = yday(date)) %>%
dplyr::summarize(mean_intensity = mean(intensity, na.rm = TRUE))
```
What if we wanted to make the time series less granular? Instead of total daily consumption, how about total weekly consumption? Note the convenience function `summarize_all()` given to us by `dplyr`!
```{r}
energy %>%
tsibble::index_by(yearweek = yearweek(date)) %>%
dplyr::summarize_all(sum)
```
By the way, there's no need to worry about "truncated weeks" at the beginning and end of the year. For example, December 31, 2019 is a Tuesday, and is Week 53, but its "yearmonth" is Week 1 in 2020:
```{r}
dec31 <- "2019-12-31"
wday(dec31, label = TRUE)
week(dec31)
yearweek(dec31)
```