---
title: "Manipulating and analyzing data with dplyr"
output: html_document
---
------------
> ### Learning Objectives
>
> * Select certain columns in a data frame with the **`dplyr`** function `select`.
> * Select certain rows in a data frame according to filtering conditions with the **`dplyr`** function `filter`.
> * Link the output of one **`dplyr`** function to the input of another function with the 'pipe' operator `%>%`.
> * Add new columns to a data frame that are functions of existing columns with `mutate`.
> * Use `summarize` and `group_by` to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
----
# Data Manipulation using **`dplyr`**
* Bracket subsetting `[,]` (with logical operators) is handy, but it can be cumbersome and difficult to read, especially for complicated operations.
* **`dplyr`** is a package for making tabular data manipulation easier.
* It pairs nicely with **`tidyr`** which enables you to swiftly convert between different data formats for plotting and analysis.
* To learn more about **`dplyr`** and **`tidyr`**, you may want to check out this [handy data transformation with **`dplyr`** cheatsheet](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf) and this [one about **`tidyr`**](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf).
```{r, message = FALSE}
## load dplyr
library("dplyr")
```
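
As a rough illustration of the readability point above, the sketch below performs the same row-and-column subset twice: once with bracket subsetting and once with **`dplyr`**. The small `toy` data frame and its values are made up for this example; they are not part of the course dataset. The **`dplyr`** functions used here (`filter()` and `select()`) are introduced step by step later in this worksheet.

```{r}
library(dplyr)

# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Year  = c(2010, 2011, 2010, 2011),
  NUTS0 = c("SI", "SI", "AT", "AT"),
  Pop   = c(100, 110, 200, 210)
)

# Base R: bracket subsetting with a logical condition
base_res <- toy[toy$NUTS0 == "SI" & toy$Year <= 2010, c("Year", "Pop")]

# dplyr: the same subset, written with named verbs
dplyr_res <- select(filter(toy, NUTS0 == "SI", Year <= 2010), Year, Pop)

# Compare the values (row-name attributes are ignored)
all.equal(base_res, dplyr_res, check.attributes = FALSE)
```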
## Sample data
```{r, message = FALSE}
## load sample data
NUTS2.DF <- read.csv("datasets/NUTS2data.csv")
# summary(NUTS2.DF)
str(NUTS2.DF)
```
| Variable | Description |
|-------------|-----------------------------------------------------------------|
| Year | time identification of the observation 2010 - 2016 |
| NUTS2 | NUTS2 geographic identification of the observation |
| NUTS0 | State-level identification (AT BE CZ DE DK HU LU NL PL SI SK) |
| GDP_MIO_EUR | GDP in Mio EUR per NUTS2 per Year |
| TotPopNr | Number of inhabitants |
| Area | geographic area in km sq. |
---
# Basic **`dplyr`** functionality
## `select()` - subset columns
* The first argument to this function is the data frame (here `NUTS2.DF`).
* Subsequent arguments are the columns to keep.
```{r}
DF2 <- select(NUTS2.DF, Year, NUTS2, TotPopNr)
head(DF2,10)
```
* To select all columns *except* certain ones, put a "-" in front of the variable to exclude it.
```{r}
DF3 <- select(NUTS2.DF, -TotPopNr, -Area)
head(DF3,10)
```
---
## `filter()` - subset rows on conditions
* To choose rows based on specific criteria, use `filter()`:
```{r, purl = FALSE}
# Choose all records for Slovenia
filter(NUTS2.DF, NUTS0 == "SI")
```
```{r, purl = FALSE}
# Choose all records for Slovenia AND year 2011
filter(NUTS2.DF, Year == 2011, NUTS0 == "SI")
```
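
In `filter()`, conditions separated by commas are combined with AND, so they behave like a single condition joined with `&`. An OR condition has to be written explicitly with `|`. A minimal sketch on a made-up `toy` data frame (not the course data):

```{r}
library(dplyr)

# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Year  = c(2010, 2011, 2010, 2011),
  NUTS0 = c("SI", "SI", "AT", "AT")
)

# Comma-separated conditions are combined with AND ...
and_comma <- filter(toy, Year == 2011, NUTS0 == "SI")
# ... which gives the same rows as joining them with `&`
and_amp <- filter(toy, Year == 2011 & NUTS0 == "SI")
all.equal(and_comma, and_amp)

# OR must be spelled out with `|`: rows from 2011 or from Slovenia
either <- filter(toy, Year == 2011 | NUTS0 == "SI")
either
```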
---
## Pipes `%>%`
What if you want to select and filter at the same time? There are three
ways to do this:
+ use intermediate steps,
+ nested functions,
+ pipes.
---
##### Sample exercise: Get TotPopNr data (*plus id info* Year and NUTS2) for Slovenia, for years up to and including 2011
With **intermediate steps**, you create a temporary data frame and use
that as input to the next function, like this:
```{r}
DF4 <- filter(NUTS2.DF, Year <= 2011, NUTS0 == "SI")
DF5 <- select(DF4, Year, NUTS2, TotPopNr)
DF5
```
* This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
---
You can also **nest functions** (i.e. one function inside of another), like this:
```{r}
select(filter(NUTS2.DF, Year <= 2011, NUTS0 == "SI"), Year, NUTS2, TotPopNr)
```
* This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
---
The last option, **pipes**, is a relatively recent addition to R. Pipes let you take
the output of one function and send it directly to the next, which is useful
when you need to do many things to the same dataset.
Pipes in R look like `%>%`. The operator comes from the `{magrittr}` package and becomes available when you load `{dplyr}` (as well as several other packages).
* If you use RStudio, you can type the pipe with <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> on a PC, or <kbd>Cmd</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd> on a Mac.
```{r}
NUTS2.DF %>%
  filter(Year <= 2011, NUTS0 == "SI") %>%
  select(Year, NUTS2, TotPopNr)
```
* The pipe operator `%>%` takes the object on its left and passes it as the first argument to the function on its right, so we no longer need to explicitly include the data frame as an argument to the `filter()` and `select()` functions.
* In the above code, we use the pipe to send the `NUTS2.DF` dataset first through `filter()` to keep rows where `Year` is $\leq$ 2011 AND `NUTS0 == "SI"`, then through `select()` to keep only the `Year`, `NUTS2` and `TotPopNr` columns.
* Some may find it helpful to read the pipe like the word **then**.
* The **`dplyr`** functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames.
* If we want to create a new object with this smaller version of the data, we can assign it a new name:
```{r}
NUTS.SI <- NUTS2.DF %>%
  filter(Year <= 2011, NUTS0 == "SI") %>%
  select(Year, NUTS2, TotPopNr)
NUTS.SI
```
* Note that the name of the final data frame (`NUTS.SI`) is the leftmost part of the piping expression.
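
The "first argument" rule can be checked directly: `x %>% f(y)` is evaluated as `f(x, y)`. The sketch below uses a made-up `toy` data frame (not the course data) to show that the nested and the piped forms give the same result.

```{r}
library(dplyr)

# Hypothetical toy data (values invented for illustration only)
toy <- data.frame(
  Year  = c(2010, 2011, 2012),
  NUTS2 = c("R1", "R1", "R1"),
  Pop   = c(100, 110, 120)
)

# Nested call: the data frame is written out as the first argument
nested <- select(filter(toy, Year <= 2011), Year, Pop)

# Piped call: `%>%` supplies the left-hand side as the first argument
piped <- toy %>%
  filter(Year <= 2011) %>%
  select(Year, Pop)

all.equal(nested, piped)
```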
---
**Quick exercise 1:**
Complete the `R` script below as follows, using the `NUTS2.DF` data frame:
* Filter GDP data: restrict only to Austria (`"AT"`) AND Years 2015 or later,
* Select columns: Year, NUTS2, GDP_MIO_EUR,
* Use the pipe syntax
```{r}
# Uncomment and complete the task
#NUTS2.DF %>%
```
---
## `mutate()` - create new columns using information in other columns
```{r}
# For Slovenia, calculate GDP per capita (in EUR)
NUTS2.DF %>%
  filter(NUTS0 == "SI") %>%
  mutate(GDPpc = (GDP_MIO_EUR / TotPopNr) * 1000000) %>%
  head(10) # pipes work with non-dplyr functions as well (once dplyr is loaded)
```
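
To make the unit conversion explicit: a region with a GDP of 50 million EUR and 2,000 inhabitants has 50 / 2000 * 1,000,000 = 25,000 EUR per person. The sketch below reproduces this on a made-up `toy` data frame (the values are invented; `GDPpc_ths` is a hypothetical extra column) and also shows that a column created inside `mutate()` can be reused later in the same call.

```{r}
library(dplyr)

# Hypothetical toy data: GDP in million EUR, population in persons
toy <- data.frame(
  NUTS2       = c("R1", "R2"),
  GDP_MIO_EUR = c(50, 120),
  TotPopNr    = c(2000, 4000)
)

toy2 <- toy %>%
  mutate(
    GDPpc     = GDP_MIO_EUR / TotPopNr * 1e6, # million EUR -> EUR per person
    GDPpc_ths = GDPpc / 1000                  # reuses the column created just above
  )
toy2
```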
---
**Quick exercise 2:**
Complete the `R` script below as follows:
Show population density by NUTS2 region.
* Use (filter for) year 2016,
* Calculate population density `PopDens` (TotPopNr / Area),
* Show columns: NUTS2, `PopDens`,
* Use the pipe syntax,
* Show the first 15 rows in your Rmd output.
```{r}
# Uncomment and complete the task
#NUTS2.DF %>%
```
---
## `group_by()` and `summarize()` - summary on grouped data
```{r}
# Calculate average value of GDP per capita at the State level, year 2016
# .. serves for illustration only - NUTS2 to NUTS0 averages are not weighted by population
NUTS2.DF %>%
  filter(Year == 2016) %>%
  mutate(GDPpc = (GDP_MIO_EUR / TotPopNr) * 1000000) %>%
  group_by(NUTS0) %>%
  summarize(mean_GDPpc = mean(GDPpc, na.rm = TRUE))
```
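* Note that the output is a `tibble`, the `{tidyverse}` flavour of a data frame used by `{dplyr}`, rather than a plain `data.frame` (a tibble still inherits from `data.frame`, so it works in most places a data frame does).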
* Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
* `group_by()` is often used together with `summarize()`, which collapses each group into a single-row summary of that group. `group_by()` takes as arguments the column names that contain the **categorical** variables for which you want to calculate the summary statistics.
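
The split-apply-combine steps can be followed on a small made-up example. The `toy` data frame below is hypothetical (two invented states, `AA` and `BB`, with regions of unequal size). It also illustrates why an unweighted mean of regional GDP per capita is only an approximation: dividing a state's total GDP by its total population gives a different number when regions differ in size.

```{r}
library(dplyr)

# Hypothetical toy data: state AA has two regions of unequal size, BB has one
toy <- data.frame(
  NUTS0       = c("AA", "AA", "BB"),
  GDP_MIO_EUR = c(10, 90, 40),
  TotPopNr    = c(1000, 3000, 2000)
)

res <- toy %>%
  mutate(GDPpc = GDP_MIO_EUR / TotPopNr * 1e6) %>%
  group_by(NUTS0) %>% # split: one group per state
  summarize(          # apply + combine: one summary row per state
    mean_GDPpc  = mean(GDPpc),                            # unweighted mean of regional values
    state_GDPpc = sum(GDP_MIO_EUR) / sum(TotPopNr) * 1e6  # state total GDP / state total population
  )
res
```

For `AA`, the regional values are 10,000 and 30,000 EUR, so the unweighted mean is 20,000 EUR, while total GDP over total population is 100 / 4000 * 1,000,000 = 25,000 EUR.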
---
## `group_by()` and `mutate()`
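* `mutate()` on grouped data computes values within each group while keeping every row (unlike `summarize()`, which collapses each group into one row).
* This makes it easy to calculate lags and individual (per-unit) means for panel data.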
```{r}
# For Slovenia, calculate first lag of GDP and individual means (over time) for TotPopNr
NUTS2.DF %>%
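  select(-Area) %>%
  filter(NUTS0 == "SI") %>%
  group_by(NUTS2) %>%
  # dplyr::lag() takes the number of positions as `n` (not `k`, as stats::lag() does);
  # the lag follows the current row order, so rows are assumed to be sorted by Year
  mutate(GDP_lag1 = lag(GDP_MIO_EUR, n = 1), PopAvg = mean(TotPopNr))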
```
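
Why the grouping matters for lags can be seen on a tiny made-up panel. In the sketch below (hypothetical `toy` data, two regions observed in two years), the ungrouped lag carries the last value of region `R1` over into the first year of `R2`, while the grouped lag starts each region with `NA`. The `order_by = Year` argument is an optional safeguard that makes the lag follow the time ordering even if the rows are not sorted.

```{r}
library(dplyr)

# Hypothetical toy panel: two regions, two years each
toy <- data.frame(
  NUTS2 = c("R1", "R1", "R2", "R2"),
  Year  = c(2010, 2011, 2010, 2011),
  GDP   = c(10, 11, 50, 52)
)

# Without grouping, the lag runs across the region boundary
ungrouped <- toy %>%
  mutate(GDP_lag1 = lag(GDP, n = 1))

# With grouping, the lag is computed separately within each region
grouped <- toy %>%
  group_by(NUTS2) %>%
  mutate(GDP_lag1 = lag(GDP, n = 1, order_by = Year)) %>%
  ungroup()

ungrouped$GDP_lag1 # NA 10 11 50
grouped$GDP_lag1   # NA 10 NA 50
```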
---
## `arrange()` - sort results
```{r}
# For Slovenia, calculate first lag of GDP and sort: first by region, then by time
NUTS2.DF %>%
  select(-Area, -TotPopNr) %>%
  filter(NUTS0 == "SI") %>%
  group_by(NUTS2) %>%
  mutate(GDP_lag1 = lag(GDP_MIO_EUR, n = 1)) %>% # dplyr::lag() uses `n`, not `k`
  arrange(NUTS2, Year) # sorts by NUTS2, then by Year - both ascending
```
* To sort in descending order, use `desc()`.
* e.g. `arrange(desc(NUTS2),Year)`
* You can use `ungroup()` in the pipe for removing the grouping (e.g. for subsequent analysis).
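
A short sketch of `desc()` and `ungroup()` together, on a made-up `toy` data frame (values invented for illustration): a group-wise mean is computed first, the grouping is then removed, and the rows are sorted with regions in descending and years in ascending order.

```{r}
library(dplyr)

# Hypothetical toy data, deliberately unsorted
toy <- data.frame(
  NUTS2 = c("R1", "R2", "R1", "R2"),
  Year  = c(2011, 2010, 2010, 2011),
  GDP   = c(11, 50, 10, 52)
)

sorted <- toy %>%
  group_by(NUTS2) %>%
  mutate(GDP_mean = mean(GDP)) %>% # computed within each region
  ungroup() %>%                    # drop the grouping before further steps
  arrange(desc(NUTS2), Year)       # regions descending, years ascending

sorted
is_grouped_df(sorted) # check whether any grouping is still attached
```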
----
# Joining data from multiple datasets
Read in an additional dataset:
| Variable    | Description                                                      |
|-------------|------------------------------------------------------------------|
| Year        | time identification of the observation 2011 - 2016 **(no 2010)** |
| NUTS2       | NUTS2 id **(same 113 regions)**                                  |
| Unem        | Unemployment rate in %                                           |
```{r, message = FALSE}
## load sample data
Unem <- read.csv("datasets/NUTS2data2.csv")
str(Unem)
```
---
## `left_join()` - joins two datasets
##### Start with `NUTS2.DF` and *append* `Unem` dataset
```{r, message = FALSE}
# Note the missing 2010 Unem values
NewDF <- left_join(NUTS2.DF, Unem, by = c("Year", "NUTS2"))
str(NewDF)
# Show output - head of the table only
NewDF %>%
  arrange(NUTS2, Year) %>%
  head(12)
```
* All observations in the `left` dataset (`NUTS2.DF`) are preserved.
* `NA` is generated wherever no `Unem` observation is available for a given `Year` / `NUTS2` combination (here, all 2010 rows).
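
The same behaviour can be reproduced on a minimal made-up example. The `gdp` and `unem` data frames below are hypothetical stand-ins for the two datasets: `gdp` covers 2010 and 2011, `unem` only 2011.

```{r}
library(dplyr)

# Hypothetical toy data: `gdp` covers 2010-2011, `unem` only 2011
gdp <- data.frame(
  Year  = c(2010, 2011),
  NUTS2 = c("R1", "R1"),
  GDP   = c(10, 11)
)
unem <- data.frame(
  Year  = 2011,
  NUTS2 = "R1",
  Unem  = 5.2
)

joined <- left_join(gdp, unem, by = c("Year", "NUTS2"))
joined             # both rows of `gdp` are kept
is.na(joined$Unem) # the 2010 row has no match, so Unem is NA there
```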
---
**Alternative ordering of data frames to join:** Start with `Unem` and *append* `NUTS2.DF` dataset
```{r, message = FALSE}
# Note that there are no 2010 rows at all: Unem has no 2010 observations
# Show output - head of the table only
Unem %>%
  left_join(NUTS2.DF, by = c("Year", "NUTS2")) %>%
  arrange(NUTS2, Year) %>%
  head(12)
```
* Note the changed ordering of columns.
* All observations in the `left` dataset (`Unem`) are preserved.
* Observations for `2010` in `NUTS2.DF` are NOT *imported*, as there is no such combination of `"Year", "NUTS2"` in the `Unem` dataset.
* `inner_join()`, `right_join()` and `full_join()` are also available in the **`dplyr`** package.
* Sometimes, `merge()` from the `base` package may be a reasonable alternative.
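
How the join types differ is easiest to see from row counts. The sketch below uses two made-up data frames (`gdp` and `unem`, hypothetical stand-ins for the course datasets), each containing one year that has no match in the other, and compares `inner_join()`, `left_join()`, `right_join()` and a base-R `merge()` configured as a left join.

```{r}
library(dplyr)

# Hypothetical toy data: each table has one Year without a match in the other
gdp  <- data.frame(Year = c(2010, 2011), NUTS2 = "R1", GDP  = c(10, 11))
unem <- data.frame(Year = c(2011, 2012), NUTS2 = "R1", Unem = c(5.2, 4.9))

keys <- c("Year", "NUTS2")

inner <- inner_join(gdp, unem, by = keys) # only 2011 (present in both tables)
left  <- left_join(gdp, unem, by = keys)  # 2010 and 2011 (all rows of gdp)
right <- right_join(gdp, unem, by = keys) # 2011 and 2012 (all rows of unem)

# Base R counterpart of the left join
base_left <- merge(gdp, unem, by = keys, all.x = TRUE)

c(inner = nrow(inner), left = nrow(left), right = nrow(right), base = nrow(base_left))
```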
---
This worksheet draws from
* [Manipulating, analyzing and exporting data with tidyverse](https://github.com/datacarpentry/R-ecology-lesson/blob/master/03-dplyr.Rmd)
* [Data wrangling webinar](http://ucsb-bren.github.io/env-info/wk03_dplyr/wrangling-webinar.pdf)