---
title: "Example EDA"
author: "John Little"
date-modified: 'today'
date-format: long
license: CC BY-NC
bibliography: references.bib
---
```{r}
#| message: false
#| warning: false
library(tidyverse)
library(skimr)
```
Basic steps taken in this rough exploration of the data:
- import data
- wrangle data
- join data
- visualize data
A brief discussion of packages which purport to perform EDA for you can be found in the [Get Started section of this site](eda.html "selective summary of eda packages").
Note: This page is not intended to teach formal EDA. What happens on this page is merely a brutal re-enactment of some informal explorations that a person *might* take as they familiarize themselves with new data. If you're like most people, you might want to [skip to the visualizations](#visualize-wrangle-and-summarize). Meanwhile, the sections and code-chunks preceding visualization are worth a glance.
::: column-margin
[![Age Distribution of runners.](images/age_ultra_runners.svg)](images/age_ultra_runners.svg)
:::
## Import data
The data come from [TidyTuesday](https://github.com/rfordatascience/tidytuesday), a weekly social learning project dedicated to gaining practical experience with R and data science. In this case the TidyTuesday data are based on [International Trail Running Association (ITRA)](https://itra.run/Races/FindRaceResults) data and were inspired by Benjamin Nowak. We will use the [TidyTuesday data that are on GitHub](https://github.com/rfordatascience/tidytuesday/tree/master/data/2021/2021-10-26). Nowak's data are [also available on GitHub](https://github.com/BjnNowak/UltraTrailRunning).
```{r}
#| message: false
#| warning: false
race_df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/race.csv")
rank_df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/ultra_rankings.csv")
```
```{r}
glimpse(race_df)
```
```{r}
glimpse(rank_df)
```
## EDA with skimr
```{r}
skim(race_df)
```
Read more about automagic [EDA packages](eda.html).
## Freewheelin' EDA
```{r}
race_df |>
  count(country, sort = TRUE) |>
  filter(str_detect(country, regex("Ke", ignore_case = TRUE)))
```
```{r}
race_df |>
  filter(country == "Turkey")
```
```{r}
race_df |>
  count(participation, sort = TRUE)
```
```{r}
race_df |>
  count(participants, sort = TRUE)
```
```{r}
skim(rank_df)
```
```{r}
rank_df |>
  filter(str_detect(nationality, regex("ken", ignore_case = TRUE)))
```
```{r}
rank_df |>
  arrange(rank)

rank_df |>
  count(rank, sort = TRUE)

rank_df |>
  drop_na(rank) |>
  count(rank, gender, age, sort = TRUE)

race_df |>
  count(distance, sort = TRUE)
```
```{r}
rank_df |>
  filter(race_year_id == 41449)

race_df |>
  filter(race_year_id == 41449)

race_df |>
  filter(distance == 161)
```
```{r}
race_df
```
```{r}
race_df |>
  count(race, city, sort = TRUE)

race_df |>
  filter(race == "Centurion North Downs Way 100")
```
```{r}
race_df |>
  filter(race_year_id == 68140)

race_df |>
  filter(race == "Millstone 100")

race_df |>
  filter(event == "Peak District Ultras")

race_df |>
  count(race, sort = TRUE)

race_df |>
  count(city, sort = TRUE)

race_df |>
  count(event, sort = TRUE)
```
```{r}
race_df |>
  filter(event == "Burning River Endurance Run")
```
## Visualize, wrangle, and summarize {#visualize-wrangle-and-summarize}
Here I'm using [this *State of Ultra Running* report](https://runrepeat.com/state-of-ultra-running) as a model to demonstrate **some** of the capabilities of R / Tidyverse.
### join datasets
#### Join, Assign, and Pipe
In this case I want to **join** the two data frames `rank_df` and `race_df` using the `left_join()` function.
I can **assign** the output of a "data pipe" (i.e. data sentence) to a name so that I can use it in subsequent code-chunks. The usual R / Tidyverse assignment operator is `<-`. You can read this as "gets value from".
Additionally, I'm using a **pipe** operator (`|>`) as a conjunction to connect functions. In this way I can form a *data sentence*. Many people call the data sentence a *data pipe*, or just a *pipe*. You may see another common pipe operator: `%>%`. For most purposes, `|>` and `%>%` are interchangeable.
Using `dplyr::left_join()` I combine the two data sets and derive a `my_year` column from `date`, so that later code-chunks can use {`ggplot2`} to create a line graph of participants by year.
```{r}
my_df_joined <- rank_df |>
  left_join(race_df, by = "race_year_id") |>
  mutate(my_year = lubridate::year(date))
```
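To see join, assign, and pipe at a small scale, here is a minimal sketch on toy data. The two tibbles and their values are made up for illustration; only the column names mirror the real data.

```r
library(dplyr)

# Hypothetical toy tables; only the column names mirror the real data
toy_rank <- tibble(
  race_year_id = c(1, 1, 2, 3),
  runner       = c("A", "B", "A", "C")
)
toy_race <- tibble(
  race_year_id = c(1, 2),
  date         = as.Date(c("2019-06-01", "2020-06-06"))
)

# Assign (<-) the result of a pipe (|>) that joins, then adds a year column
toy_joined <- toy_rank |>
  left_join(toy_race, by = "race_year_id") |>
  mutate(my_year = lubridate::year(date))

toy_joined
```

Race id 3 has no match in `toy_race`, so runner C's `date` and `my_year` are `NA`. A left join keeps that row rather than dropping it.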
### Viz participants
Let's make a quick line plot showing how many people participate in races each year. Here we have a `date` field that is also a date data-type. Data types are important, and in this example using a date data-type means {`ggplot2`} will simplify our x-axis labels.
Here we use the {`lubridate`} package to help manage date data-types, and {`ggplot2`} with a `geom_line()` *layer* to draw the line graph as a *time series*. Note that {`ggplot2`} uses the '`+`' as the conjunction, or *pipe*, between layers.
```{r}
rank_df |>
  left_join(race_df |> select(race_year_id, date), by = "race_year_id") |>
  mutate(my_year = lubridate::year(date)) |>
  count(my_year, sort = TRUE) |>
  ggplot(aes(my_year, n)) +
  geom_line()
```
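The same pattern can be sketched on a few invented dates. Everything named `toy_*` below is hypothetical; the point is only to show `lubridate::year()` working on a true `Date` column, and the switch from `|>` (data steps) to `+` (plot layers).

```r
library(dplyr)
library(ggplot2)

# Hypothetical race dates stored as a true Date vector, not as text
toy_dates <- tibble(
  date = as.Date(c("2018-05-12", "2018-09-01", "2019-05-11"))
)

# lubridate::year() pulls the numeric year out of each Date
toy_yearly <- toy_dates |>
  mutate(my_year = lubridate::year(date)) |>
  count(my_year)

# Data steps are chained with |>, plot layers are added with +
toy_plot <- ggplot(toy_yearly, aes(my_year, n)) +
  geom_line() +
  geom_point()

toy_plot
```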
### by distance
Here I use `count()` in different ways to see what I can see. I comment out each attempt before settling on a summary table of total participants per country by year.
```{r}
my_df_joined |>
  mutate(participation = str_to_lower(participation)) |>
  # count(participation, sort = TRUE)
  # count(city) |>
  # count(race) |>
  count(my_year, country, sort = TRUE)

my_df_joined |>
  mutate(participation = str_to_lower(participation)) |>
  count(my_year, country, sort = TRUE) |>
  drop_na(country) |>
  mutate(country = fct_lump_prop(country, prop = .03)) |>
  ggplot(aes(my_year, n)) +
  geom_line(aes(color = country))
```
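As a rough illustration of how the result of `count()` changes with the variables you hand it, here is a sketch on a tiny invented table (`toy_results` and its values are hypothetical).

```r
library(dplyr)

# Hypothetical finisher records; values are invented for illustration
toy_results <- tibble(
  my_year = c(2018, 2018, 2018, 2019, 2019),
  country = c("France", "France", "Spain", "France", NA)
)

# One variable: rows per country, largest first
toy_results |>
  count(country, sort = TRUE)

# Two variables: rows per year-and-country combination
toy_by_year <- toy_results |>
  count(my_year, country, sort = TRUE)

toy_by_year
```

`count()` keeps `NA` as its own group, which is why the plotting pipeline above follows `count()` with `drop_na(country)`.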
### by country
I used `fct_lump_prop()` in the previous code-chunk to lump the `country` variable into categories by frequency. Here we refine the categories into specific **levels**. We are still *mutating* the `country` variable as a categorical factor; this time using the `fct_other()` function of {`forcats`} with some pre-defined levels (see the `my_levels` vector in the code-chunk below).
```{r}
my_levels <- c("United States", "France", "United Kingdom", "Spain")

my_df_joined |>
  mutate(country = fct_other(country, keep = my_levels)) |>
  count(my_year, country, sort = TRUE) |>
  drop_na(country) |>
  ggplot(aes(my_year, n, color = country)) +
  geom_line() +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
```
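The difference between lumping by frequency and lumping by name is easier to see on a small factor. This is a sketch with invented values (`toy_country` is hypothetical), and the 20% threshold is chosen only to suit the toy data.

```r
library(forcats)

# Hypothetical host countries; invented for illustration
toy_country <- factor(c(
  rep("France", 5), rep("Spain", 3), "Chile", "Nepal"
))

# Lump by frequency: levels that do not exceed 20% of rows become "Other"
lumped <- fct_lump_prop(toy_country, prop = 0.2)
levels(lumped)

# Lump by name: keep only the listed levels, whatever their frequency
kept <- fct_other(toy_country, keep = c("France", "Chile"))
levels(kept)
```

With `fct_other()` a rare level such as Chile survives because it is named in `keep`, while a common one such as Spain is folded into "Other".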
### Country race-host
```{r}
my_df_joined |>
  drop_na(country) |>
  mutate(country = fct_lump_n(country, n = 7)) |>
  count(country, sort = TRUE) |>
  ggplot(aes(x = n, y = fct_reorder(country, n))) +
  geom_col()
```
### Nationality of runner
```{r}
my_df_joined |>
  mutate(nationality = fct_lump_n(nationality, n = 7)) |>
  count(nationality, sort = TRUE) |>
  ggplot(aes(n, fct_reorder(nationality, n))) +
  geom_col()
```
### Unique participants
```{r}
my_df_joined |>
  distinct(my_year, runner) |>
  count(my_year) |>
  ggplot(aes(my_year, n)) +
  geom_line()
```
### Participant frequency separated by gender
Note the use of `count`, `if_else`, `as.character`, and `group_by` to transform the data for visualizing. Meanwhile, the bar graph is a proportional graph with the y-axis labeled by percentage. We do this by manipulating the plot *scales*. Scales are also used to choose colors from a predefined palette (i.e. "Dark2"). Finally, we facet the plot by gender (see `facet_wrap()`).
```{r}
my_df_joined |>
  count(my_year, gender, runner, sort = TRUE) |>
  mutate(n_category = if_else(n >= 5, "more", as.character(n))) |>
  group_by(my_year) |>
  mutate(total_races = sum(n)) |>
  ungroup() |>
  ggplot(aes(my_year, total_races)) +
  geom_col(aes(fill = fct_rev(n_category)), position = "fill") +
  scale_fill_brewer(palette = "Dark2") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(vars(gender))
```
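Here is a stripped-down sketch of the recode-then-fill idea on invented entries. `toy_entries` is hypothetical, and the cut-off of 2 is arbitrary (the chunk above uses 5).

```r
library(dplyr)
library(ggplot2)

# Hypothetical race entries, one row per runner per race (invented values)
toy_entries <- tibble(
  my_year = c(2018, 2018, 2018, 2018, 2019, 2019),
  runner  = c("A", "A", "B", "C", "A", "B")
)

# Races per runner per year, recoded as a categorical label
toy_freq <- toy_entries |>
  count(my_year, runner) |>
  mutate(n_category = if_else(n >= 2, "more", as.character(n)))

# position = "fill" stretches each bar to 100%; the percent labeller
# relabels the 0-1 axis as percentages
toy_freq_plot <- ggplot(toy_freq, aes(my_year)) +
  geom_bar(aes(fill = n_category), position = "fill") +
  scale_y_continuous(labels = scales::percent)

toy_freq_plot
```

`as.character(n)` is needed because `if_else()` insists that both outcomes share a type, and "more" is a character string.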
### Pace per mile
We want to calculate a value for each runner's pace (i.e. `minute_miles`). The `time` variable is a character data-type, so we have to convert it into a numeric floating point (or `dbl`) data-type before we can calculate pace (i.e. race-minutes divided by distance). Note that `distance` appears to be recorded in kilometers (the 100-mile races explored above show up as 161), so without a conversion the result is minutes per kilometer rather than per mile. These data transformations required a lot of manipulation as I was thinking through my goal. I could optimize this code, perhaps. However, it works and I've got other things to do. Do I care if the CPU works extra hard? No, not in this case.
```{r, dev='svg'}
my_df_joined |>
  # Strip the H, M, S unit letters, then split the time into its parts
  mutate(time_hms = str_remove_all(time, "[HMS]"), .after = time) |>
  mutate(time_hms = str_replace_all(time_hms, "\\s", ":")) |>
  separate(time_hms, into = c("h", "m", "s"), sep = ":") |>
  # Total race time in minutes (seconds are divided by 60)
  mutate(bigminutes = (
    (as.numeric(h) * 60) + as.numeric(m) + (as.numeric(s) / 60)
  ), .before = h) |>
  mutate(pace = bigminutes / distance, .before = bigminutes) |>
  drop_na(pace, distance, my_year) |>
  filter(distance > 0,
         pace > 0) |>
  group_by(my_year, gender) |>
  summarise(avg_pace = mean(pace), max_pace = max(pace), min_pace = min(pace)) |>
  pivot_longer(-c(my_year, gender), names_to = "pace_type") |>
  # Split decimal minutes into whole minutes and the fractional part
  separate(value, into = c("m", "s"), remove = FALSE) |>
  mutate(h = "00", .before = m) |>
  mutate(m = str_pad(as.numeric(m), width = 2, pad = "0")) |>
  mutate(s = str_pad(round(as.numeric(str_c("0.", s)) * 60), width = 2, pad = "0")) |>
  unite(minute_miles, h:s, sep = ":") |>
  mutate(minute_miles = hms::as_hms(minute_miles)) |>
  # drop_na(gender) |>
  ggplot(aes(my_year, minute_miles)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(aes(color = pace_type), linewidth = 1) +
  scale_color_brewer(palette = "Dark2") +
  theme_classic() +
  facet_wrap(vars(gender))
```
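The arithmetic at the heart of that chunk can be sketched on its own. I am assuming the `time` strings look like "26H 35M 25S"; the two example times and distances below are invented, and `str_match()` is swapped in for the `separate()` approach purely to keep the sketch short.

```r
library(stringr)

# Assumed format of the `time` strings; both values are made up
toy_time     <- c("26H 35M 25S", "9H 5M 0S")
toy_distance <- c(161, 50)  # hypothetical distances, same unit as `distance`

# Capture the hour, minute, and second digits as separate columns
parts <- str_match(toy_time, "(\\d+)H\\s*(\\d+)M\\s*(\\d+)S")
h <- as.numeric(parts[, 2])
m <- as.numeric(parts[, 3])
s <- as.numeric(parts[, 4])

# Total minutes: hours * 60 + minutes + seconds / 60
total_minutes <- h * 60 + m + s / 60

# Pace is minutes per unit of distance
pace <- total_minutes / toy_distance
pace
```

Whatever unit `distance` is in carries through to the pace, so a per-mile figure would need the distances converted to miles first.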
### Age trends
In this code-chunk we use a {`ggplot2`} function, `cut_width()`, to generate rough categories by age. `dplyr::case_when()` is a more thorough and sophisticated way to make some cuts in my data, but `ggplot2::cut_width()` works well for a quick visualization.
Note the use of labels, scales, themes, and guides in the last visualization. A good plot will need refinement with some or all of these functions.
```{r}
my_df_joined |>
  mutate(age_cut = cut_width(age, width = 10, boundary = 0), .after = age) |>
  count(age_cut, gender, sort = TRUE)

my_df_joined |>
  filter(age < 80) |>
  drop_na(gender) |>
  ggplot(aes(y = cut_width(age, width = 10, boundary = 0))) +
  geom_bar(aes(fill = gender)) +
  facet_wrap(vars(gender))

my_df_joined |>
  filter(age < 70, age >= 20) |>
  drop_na(gender) |>
  ggplot(aes(my_year)) +
  geom_bar(aes(fill = fct_rev(cut_width(age, width = 10, boundary = 0))), position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Age", title = "Age distribution of ultra runners",
       caption = "Source: ITRA > Benjamin Nowak > Tidy Tuesday",
       x = NULL, y = NULL) +
  theme_classic() +
  theme(legend.position = "top", plot.title.position = "plot") +
  guides(fill = guide_legend(reverse = TRUE))
```
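For comparison, here is a small sketch of the two binning approaches side by side. The ages and the `case_when()` labels are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical ages, invented for illustration
toy_ages <- tibble(age = c(25, 30, 47))

toy_ages <- toy_ages |>
  mutate(
    # Quick, automatic bins that are 10 years wide and aligned to 0
    age_cut = cut_width(age, width = 10, boundary = 0),
    # Hand-written bins: more typing, but full control of labels and edges
    age_group = case_when(
      age < 30 ~ "under 30",
      age < 40 ~ "30-39",
      TRUE     ~ "40 and over"
    )
  )

toy_ages
```

`cut_width()` closes its intervals on the right by default, so a 30-year-old lands in the same bin as a 25-year-old, whereas the hand-written rules put them in "30-39". That kind of edge decision is where `case_when()` earns its extra typing.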
```{r}
#| echo: false
#| warning: false
#| message: false
my_plot <-
  my_df_joined |>
  filter(age < 70, age >= 20) |>
  drop_na(gender) |>
  ggplot(aes(my_year)) +
  geom_bar(aes(fill = fct_rev(cut_width(age, width = 10, boundary = 0))), position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Age", title = "Age distribution of ultra runners",
       caption = "Source: ITRA > Benjamin Nowak > Tidy Tuesday",
       x = NULL, y = NULL) +
  theme_classic() +
  theme(legend.position = "top", plot.title.position = "plot") +
  guides(fill = guide_legend(reverse = TRUE))

# Name the plot explicitly rather than relying on the last plot created
ggsave(here::here("images/age_ultra_runners.svg"), plot = my_plot)
```