-
Notifications
You must be signed in to change notification settings - Fork 4
/
R05_tidyr.Rmd
216 lines (135 loc) · 6.54 KB
/
R05_tidyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
---
title: "Wide and long data format - use of `{tidyr}`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
suppressPackageStartupMessages({
library(dplyr)
library(tidyr)
library(ggplot2)
})
```
## Overview
> Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information.
This tutorial covers the *very basics* of `tidyr` package. Usage:
* General data handling \& processing
* `ggplot2` plots (long format required)
* Eurostat: data in long format, wide format often required for regression analysis
* Panel data analysis
---
## Wide format
* For **cross sections**, row corresponds to an individual (person, firm, etc.) and each column corresponds to a variable (age, height, weight).
* For **time series**, row corresponds to a time period (year, quarter, day, etc.) and each column corresponds to a variable (GDP per capita, Unemployment rate).
* Wide format is used in many `R`-based applications: linear regression, VAR models, etc.
## Long format
* For **cross sections**, data is stretched so that a single individual may occupy multiple rows.
* The same applies to **time series**
---
## Time Series Example - wide format
```{r}
GDPwide <- data.frame(
Year = c(2009:2018),
GER = c(411,723,325,456,579,612,709,513,527,379),
FRA = c(123,300,400,500,600,654,789,906,413,567),
USA = c(957,1000,569,896,956,1345,780,599,1023,678)
)
GDPwide # artifical/made-up data used here
```
---
* Above is the **wide format** - GDP data are shown for different states and years
* In **long format** (below), the same data are shown, yet each `Year` and `Country` combination is on a separate row of the dataset.
+ Basically, columns (*variables* in R) `Year`, `Country` and `GDP` are used in the long format to display the same data.
+ Note that `Country` and `GDP` *variables* need to be created to get the long format (such *variables* do not exist in the wide format dataset)
```{r, echo=F}
GDPlong <- gather(GDPwide, key = "State", value = "GDP", 2:4)
print(GDPlong)
```
---
## `gather()` and `spread()` commands
+ `gather()` converts wide format to long
+ `spread()` converts long format to wide
![](datasets/tidyr.png)
---
### Convert `GDPwide` to long format
`gather(GDPwide, key = "State", value = "GDP", 2:4)`
* `GDPwide` dataframe to reshape - convert to long format
* `"State"` is the name of the new key column - a new *variable* (can be any character string you supply). The keys are generated based on column names of the columns being transformed (`2:4`)
+ *variable*-names in the wide format dataframe (`GER`,`FRA`,`USA`) are used as "entries" (keys) in the new column (i.e. variable `State`).
* `"GDP"` name of the new value column (any character string you supply)
* `2:4` numeric indexes (or names) of columns to collapse.
* Please note that column `Year` (1st column) is left out (Year is the grey column as in the illustration).
```{r}
GDPlong <- gather(GDPwide, key = "State", value = "GDP", 2:4)
print(GDPlong)
```
Please note that `GDPlong` is suitable for panel data analysis - both `id` and `time` identification is provided in each row.
---
### Convert `GDPlong` to wide format
`spread(GDPlong, key = "State", value = "GDP")`
* `GDPlong` dataframe to reshape - convert to wide format
* `key = "State"` column to use for keys (new columns names, i.e. new *variables*)
* `value = "GDP"` column to use for values (data-cells in the new (wide) dataframe)
* Please note that column `Year` is left out from the syntax (Year is the grey column as in the illustration).
```{r}
GDPwide_2 <- spread(GDPlong, key = "State", value = "GDP")
GDPwide_2
GDPwide # for comparison
```
Note that `GDPwide_2` columns are alphabetically organized now.
---
Alternatively, we may choose a *"transposed"* wide format:
```{r}
spread(GDPlong, key = "Year", value = "GDP")
```
* However, this formatting is not very useful for most econometric applications.
---
#### Data in long format are used in panel data regression, plotting by `{ggplot2}`, etc.
```{r}
Plot1 <- ggplot(data = GDPlong, aes(x = Year, y = GDP))+
geom_line(aes(color=State))+
xlab("Year")+ylab("Artificial GDP data")+
facet_grid(State~.)
Plot1
Plot2 <- ggplot(data = GDPlong, aes(x = Year, y = GDP))+
geom_line(aes(color=State))+
xlab("Year")+ylab("Artificial GDP data")+
theme_light() # remove faceting, include theme "light"
Plot2
```
---
> Advanced data reshaping tools
`{tidyr}` provides sufficient functionality for many empirical tasks.
If necessary, additional control over reshaping data may be obtained throught the `{reshape2}` package:
* `melt()` from `{reshape2}` expands the functionality of `gather()`
* `dcast()` from `{reshape2}` expands the functionality of `spread()`
For additional information on `{reshape2}`, see:
+ [Tutorial by Timothy Carsel](https://ademos.people.uic.edu/Chapter8.html)
+ [Tutorial by Sean C. Anderson](https://seananderson.ca/2013/10/19/reshape/)
+ [reshape2 pdf](https://cran.r-project.org/web/packages/reshape2/reshape2.pdf)
+ `R09_Eurostat_in_depth_self_study_material.R` (in your R working directory)
---
## `{tidyr}` Excercise
The following wide-format dataset contains actual GDP per capita in Spain (from Eurostat)
* Euro per inhabitant in percentage of the EU average (100 = EU's average in a given year)
* 2005 - 2016 data
* Measured at the [NUTS2 level](https://en.wikipedia.org/wiki/NUTS_statistical_regions_of_Spain)
```{r}
GDPSpain <- read.csv("datasets/GDPSpain.csv")
head(GDPSpain[,1:8],12) # only columns 1 to 8 are shown
str(GDPSpain)
dim(GDPSpain)
```
Using the `gather()` command, convert `GDPSpain` to a long format
```{r}
# uncomment the following line and comlete the command
# GDPSpain_L <- gather()
```
Use the `spread()` command to transform `GDPSpain_L` (long format) back to the original wide format
```{r}
# uncomment the following line and comlete the command
# GDPSpain_W2 <- spread()
```
As a bonus task, you may use `ggplot2` to plot the data (selected regions).
----
This is just a quick introduction to `tidyr`. For detailed discussion on the topic, you may have a look [here](https://github.com/rstudio/webinars/blob/master/05-Data-Wrangling-with-R-and-RStudio/wrangling-webinar.pdf)