-
Notifications
You must be signed in to change notification settings - Fork 15
/
regression.qmd
190 lines (137 loc) · 5.1 KB
/
regression.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
---
title: "Regression"
date-modified: 'today'
date-format: long
format:
html:
footer: "CC BY 4.0 John R Little"
license: CC BY
bibliography: references.bib
---
## Load library packages
```{r}
#| message: false
#| warning: false
library(dplyr)
library(ggplot2)
library(moderndive)
library(broom)
library(skimr)
```
## Data
Data are from the {`moderndive`} package[^1], [*Modern dive*](https://moderndive.com/) [@kim2021], and the {dplyr} package [@Wickham2023].
[^1]: <https://moderndive.github.io/moderndive/>
```{r}
evals_ch5 <- evals %>%
select(ID, score, bty_avg, age)
evals
evals_ch5
```
```{r}
evals_ch5 %>%
summary()
```
```{r}
skimr::skim(evals_ch5)
```
## Correlation
### strong correlation
Using the `cor` function to show correlation. For example see the **strong correlation** between `starwars$mass` to `starwars$height`
```{r}
my_cor_df <- starwars %>%
filter(mass < 500) %>%
summarise(my_cor = cor(height, mass))
my_cor_df
```
The `cor()` function shows a positive correlation of `r my_cor_df |> pull(my_cor)`. This indicates a positive correlation between height and mass.
### weak correlation
By contrast, see here a regression line that visualizes the weak correlation between `evals_ch5$score` and `evals_ch5$age`
```{r}
evals_ch5 %>%
ggplot(aes(score, age)) +
geom_jitter() +
geom_smooth(method = lm, formula = y ~ x, se = FALSE)
```
## Linear model
::: callout-tip
## Interpretation
For every increase of 1 unit increase in bty_avg, there is an associated increase of, on average, 0.067 units of score. from [*ModenDive*](https://moderndive.com/5-regression.html) [@kim2021]
:::
```{r}
# Fit regression model:
score_model <- lm(score ~ bty_avg, data = evals_ch5)
score_model
```
```{r}
summary(score_model)
```
## Broom
The {`broom`} package is useful for containing the outcomes of some models as data frames. A more holistic approach to tidy modeling is to use the {[`tidymodels`](https://tidymodels.org)} package/approach
Tidy the model fit into a data frame with `broom::tidy()`, then we can use dplyr functions for data wrangling.
```{r}
broom::tidy(score_model)
```
get evaluative measure into a data frame
```{r}
broom::glance(score_model)
```
### More model data
predicted scores can be found in the `.fitted` variable below
```{r}
broom::augment(score_model)
```
## Example of iterative modeling with nested categories of data
In this next example we nest data by the `gender` category, then iterate over those categories using the {purrr} package to `map` [*anonymous functions*](purrr.html#anonymous-functions-1) over our data frames that is [nested](purrr.html#nest) by our category. Look closely and you'll see correlations, linear model regression, and visualizations --- iterated over the `gender` category. [purr::map](purrr.html#map) iteration methods are beyond what we've learned so far, but you can notice how tidy-data and tidyverse principles are leveraged in data wrangling and analysis. In future lessons we'll learn how to employ these techniques along with [writing custom functions](functions.html).
```{r}
#| message: false
#| warning: false
library(tidyverse)
my_iterations <- evals |>
janitor::clean_names() |>
nest(data = -gender) |>
mutate(cor_age = map_dbl(data, \(data) cor(data$score, data$age))) |>
mutate(cor_bty = map_dbl(data, \(data) cor(data$score, data$bty_avg))) |>
mutate(my_fit_bty = map(data, \(data) lm(score ~ bty_avg, data = data) |>
broom::tidy())) |>
mutate(my_plot = map(data, \(data) ggplot(data, aes(bty_avg, score)) +
geom_point(aes(color = age)) +
geom_smooth(method = lm,
se = FALSE,
formula = y ~ x))) |>
mutate(my_plot = map2(my_plot, gender, \(my_plot, gender) my_plot +
labs(title = str_to_title(gender))))
```
This generates a data frame consisting of lists columns such as `my_fit_bty` and `my_plot`
```{{r}}
my_iterations
```
```{r}
#| echo: false
my_iterations
```
`my_terations$my_fit_bty` is a list column consisting of tibble-style data frames. We can unnest those data frames.
```{{r}}
my_iterations |>
unnest(my_fit_bty)
```
```{r}
#| echo: false
my_iterations |>
unnest(my_fit_bty)
```
`my_iterations$my_plot` is a list column consisting of ggplot2 objects. We can pull those out of the data frame.
```{{r}}
my_iterations |>
pull(my_plot)
```
```{r}
#| echo: false
#| layout-ncol: 2
#| fig-cap:
#| - "Female"
#| - "Male"
my_iterations$my_plot[[1]] # or `my_iterations$my_plot`
my_iterations$my_plot[[2]]
```
## Next steps
The example above introduces how multiple models can be fitted through the nesting of data. Of course, modeling can be much more complex. A good next step is to gain basic introductions about [tidymodels](tidymodels.html). You'll gain tips on integrating tidyverse principles with modeling, machine learning, feature selection, and tuning. Alternatively, endeavor to increase your skills in iteration using the [purrr](purrr.html) package so you can leverage iteration with [custom functions](functions.html).