---
title: "Chapter 25 - Exercises - R for Data Science"
author: "Francisco Yira Albornoz"
date: "February 8th, 2019"
output:
  github_document:
    toc: true
    toc_depth: 4
    df_print: tibble
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(modelr)
library(tidyverse)
library(gapminder)
```
## 25.2 gapminder
### 25.2.5 Exercises
1. A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform `year` so that it has mean zero.)
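```{r}
# Center year (mean zero), then nest one data frame per country
by_country <- gapminder %>%
  mutate(year_centered = year - mean(year)) %>%
  group_by(country, continent) %>%
  nest()

simple_model <- function(df) {
  lm(lifeExp ~ year, data = df)
}

# poly() builds orthogonal polynomial terms, so the fitted values are the same
# whether `year` or `year_centered` is used
quadratic_model <- function(df) {
  lm(lifeExp ~ poly(year, 2), data = df)
}

# Fit both models to every country
by_country <- by_country %>%
  mutate(
    simple_model = map(data, simple_model),
    quadratic_model = map(data, quadratic_model)
  )

# gather_residuals() stacks the residuals of both models, labelled by `model`
by_country <- by_country %>%
  mutate(
    resids = pmap(
      list(data, simple_model = simple_model, quadratic_model = quadratic_model),
      gather_residuals
    )
  )

resids <- unnest(by_country, resids)
```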
```{r plot residuals}
resids %>%
  group_by(year, model, continent) %>%
  summarise(resid = mean(resid)) %>%
  ggplot(aes(year, resid, color = model)) +
  geom_line(aes(group = model)) +
  facet_wrap(~continent)
```
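Most of the time, the mean residuals of the quadratic model are closer to zero than those of the simple linear model, so the quadratic captures the overall trend better.
A quadratic model has three coefficients: the intercept, the linear coefficient, and the quadratic coefficient. When `year` is centered at its mean and the model uses raw polynomial terms, the intercept is the expected life expectancy at the mean year, the linear coefficient is the yearly rate of change at that point, and the quadratic coefficient acts as a "modifier" on the linear trend: it describes how the rate of change itself shifts with the passage of time (a negative value means the gains are slowing down). Note that `poly(year, 2)` fits orthogonal polynomials by default, so the coefficients it reports do not have this direct reading, even though the fitted values are the same.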
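
To make this interpretation concrete, here is a minimal sketch on simulated, noise-free data. The values 60, 0.3 and -0.005 are made up for illustration and are not estimates from `gapminder`. The model uses raw polynomial terms on a centered year, so each coefficient can be read directly.

```r
# Hypothetical, noise-free data: an exact quadratic in centered year
year <- seq(1952, 2007, by = 5)
year_centered <- year - mean(year)
life_exp <- 60 + 0.3 * year_centered - 0.005 * year_centered^2

# Raw (non-orthogonal) polynomial terms keep the coefficients on the original scale
fit <- lm(life_exp ~ year_centered + I(year_centered^2))
coefs <- unname(coef(fit))

# coefs[1]: predicted value at the mean year
# coefs[2]: slope (change per year) at the mean year
# coefs[3]: curvature; the slope changes by 2 * coefs[3] per year
slope_at <- function(x) coefs[2] + 2 * coefs[3] * x
slope_at(c(-20, 0, 20))  # slope 20 years before, at, and 20 years after the mean year
```

In this made-up example the slope falls from 0.5 to 0.1 years of life expectancy per calendar year across a 40-year window, which is the kind of slowing trend a negative quadratic coefficient describes.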
2. Explore other methods for visualising the distribution of $R^2$ per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.
```{r}
glance_models <-
  by_country %>%
  mutate(glance = map(simple_model, broom::glance)) %>%
  unnest(glance, .drop = TRUE)

library(ggbeeswarm)

glance_models %>%
  ggplot(aes(continent, r.squared)) +
  geom_beeswarm()
```
```{r}
glance_models %>%
  ggplot(aes(continent, r.squared)) +
  geom_quasirandom()
```
3. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It’s possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`. How?
```{r}
glance_models <-
  by_country %>%
  mutate(glance = map(simple_model, broom::glance)) %>%
  unnest(glance)

glance_models %>%
  filter(r.squared < 0.25) %>%
  unnest(data) %>%
  ggplot(aes(year, lifeExp, color = country)) +
  geom_line()
```
## 25.4 Creating list-columns
### 25.4.5 Exercises
1. List all the functions that you can think of that take an atomic vector and return a list.
```
map()
list()
read_csv()
DBI::dbGetQuery()
```
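
A few base R functions show the same pattern. This is a small sketch, and the functions below are additions to the list above, not part of the original answer.

```r
# Each call takes an atomic vector and returns a list
words   <- strsplit(c("a b", "c d e"), " ")              # character vector -> list of character vectors
as_list <- as.list(1:3)                                   # integer vector -> list of length-1 integers
groups  <- split(1:6, c("x", "y", "x", "y", "x", "y"))    # values split by a grouping vector
squares <- lapply(1:3, function(i) i^2)                   # base R counterpart of purrr::map()

# All four results are lists
sapply(list(words, as_list, groups, squares), is.list)
```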
2. Brainstorm useful summary functions that, like `quantile()`, return multiple values.
```
summary()
unique()
seq_range()
confint() # for models
range()
```
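
As a quick check, the sketch below counts how many values some of these functions return for a single numeric vector. `fivenum()` is an extra base R example that is not in the list above.

```r
x <- mtcars$mpg

multi_value <- list(
  range    = range(x),
  quantile = quantile(x),
  fivenum  = fivenum(x),
  summary  = summary(x)
)

# Number of values each summary returns
lengths(multi_value)
```

Because each result has more than one value, it has to be wrapped in `list()` to fit into a single cell when used inside `summarise()`.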
3. What’s missing in the following data frame? How does `quantile()` return that missing piece? Why isn’t that helpful here?
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(q = list(quantile(mpg))) %>%
  unnest(q)
```
The names of the quantiles are missing. `quantile()` returns them as the names of a named vector, and `unnest()` does not carry those names over into the result. We can fix this by converting the `quantile()` output into a data frame that has a column holding the quantile names, for example with `enframe()`.
```{r}
mtcars %>%
  group_by(cyl) %>%
  summarise(q = list(enframe(quantile(mpg)))) %>%
  unnest(q)
```
4. What does this code do? Why might it be useful?
```{r}
# Note: summarise_each() and funs() are deprecated in recent dplyr versions
mtcars %>%
  group_by(cyl) %>%
  summarise_each(funs(list))
```
It creates a data frame where each row represents a group in `mtcars` (defined by its `cyl` value) and where each variable is stored as a list-column holding that group's vector of values. This layout allows us to use functions like `map()` to easily compute summaries of each variable for each group.
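
A minimal sketch of that idea follows. It assumes a dplyr version that provides `across()` (1.0.0 or later) and uses it in place of `summarise_each(funs(list))`. The column names `n` and `mean_mpg` are made up for illustration.

```r
library(dplyr)
library(purrr)

# One row per cyl group; every other variable becomes a list-column of vectors
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(everything(), list))

# Each cell of `mpg` holds the full vector for that group,
# so map_dbl() can summarise it row by row
mpg_summary <- by_cyl %>%
  mutate(
    n        = lengths(mpg),
    mean_mpg = map_dbl(mpg, mean)
  ) %>%
  select(cyl, n, mean_mpg)

mpg_summary
```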
## 25.5 Simplifying list-columns
### 25.5.5 Exercises
1. Why might the `lengths()` function be useful for creating atomic vector columns from list-columns?
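`lengths()` returns an integer vector with the length of each element of a list, so it turns a list-column into an ordinary atomic column. One possible use case is to compare the lengths of two list-columns and keep only the rows where they match, which makes it possible to unnest both columns together later on.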
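
The use case can be sketched as follows, on a small made-up tibble (the columns `id`, `x`, `y`, `n_x` and `n_y` are hypothetical). It assumes tidyr 1.0.0 or later for the `unnest(c(x, y))` syntax.

```r
library(dplyr)
library(tidyr)

# Hypothetical tibble with two list-columns; row 2 has mismatched lengths
df <- tibble(
  id = 1:3,
  x  = list(1:2, 1:3, 4:5),
  y  = list(c("a", "b"), c("c", "d"), c("e", "f"))
)

same_len <- df %>%
  mutate(n_x = lengths(x), n_y = lengths(y)) %>%  # atomic integer columns
  filter(n_x == n_y)                               # keep rows that can be unnested together

unnested <- same_len %>%
  unnest(c(x, y))

unnested
```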
2. List the most common types of vector found in a data frame. What makes lists different?
i. Numeric vectors
ii. Character vectors
iii. Logical vectors
Lists are different because their elements are not restricted to length 1, and the elements can be of different types (including other lists).
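
The difference is easy to see in a short base R sketch (the object names are made up for illustration):

```r
# Atomic vectors hold a single type, so mixed input is coerced
atomic <- c(1, "a", TRUE)
typeof(atomic)  # "character"

# A list keeps each element's own type and length, and can hold other lists
mixed <- list(1:3, "a", TRUE, list(x = 1, y = "z"))
sapply(mixed, typeof)
lengths(mixed)
```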