-
Notifications
You must be signed in to change notification settings - Fork 0
/
02-manipulating.Rmd
executable file
·568 lines (404 loc) · 12.6 KB
/
02-manipulating.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: 'Week 04: Examples for Data Manipulating'
---
# Load tidyverse
```{r, warning=F, message=F}
library(tidyverse)
options(dplyr.print_max = 1e9)
```
# Check the current directory
```{r}
getwd()
```
# Read in the data on EPI2020 Results
# Country scores, 10-year changes, and ranks
# The data is originally available at https://epi.yale.edu/downloads/epi2020resultsregions20200604.csv
# Download file from web to your local.
"
```{r, eval=F}
# download the file into your local computer
download.file("https://epi.yale.edu/downloads/epi2020resultsregions20200604.csv", "data/epi2020resultsregions20200604.csv")
```
# Read in the data into R environment
```{r}
epi_data <- read_csv("data/epi2020resultsregions20200604.csv")
```
# Click on the epi_data variable in the "Environment" tab or
# Run the View() function.
```{r}
View(epi_data)
```
# or structure
```{r}
str(epi_data)
```
# For the interpretation of indicators,
# have a look at the epi2020indicatortla20200604.csv file.
#### Filter() ####
# filter() lets us to subset data via based on some filter conditions.
# We can filter character variables based on their values.
# Let's filter Eastern European countries from the data.
# region is a character variable
```{r}
epi_data %>%
filter(region == "Eastern Europe") %>%
View()
```
# Remember %>% operator does not save the object, you need
# to assign RHS to an object at the LHS.
# Similarly we can use the other "comparison operators":
# Exclude Eastern European countries from the data.
```{r}
epi_data %>%
filter(region != "Eastern Europe") %>%
View()
```
# If we want to select "more than one region" we can use the %in% operator.
# Filter regions Eastern Europe and Asia-Pacific
```{r}
epi_data %>%
filter(region %in% c("Asia-Pacific", "Eastern Europe")) %>%
View()
```
# We can use the %in% operator to deselect certain groups as well,
# in this case we have to negate by adding an exclamation mark (Boolean operator)
# at the beginning of our filter condition.
```{r}
#not works with -
epi_data %>%
filter(!region %in% c("Asia-Pacific", "Eastern Europe")) %>%
View()
```
# We can filter numeric variables based on their values too.
# The most used operators for this are >, >=, <, <=, == and !=
# Filter countries whose EPI.new score is less than
# or equal to 50!..
```{r}
epi_data %>%
filter(EPI.new <= 50) %>%
View()
```
# Multiple conditions
# filter(condition1, condition2) will return rows where "both conditions" are met.
# Let's combine multiple conditions in the same call.
# Filter countries whose EPI.new score is between 50 and 60!..
# 50 <= EPI.new & EPI.new <= 60.
```{r}
epi_data %>%
filter(50 <= EPI.new & EPI.new <= 60) %>%
View()
```
# or use "helper function" between():
```{r}
epi_data %>%
filter(between(EPI.new, 50, 60)) %>%
View()
```
# Filter Eastern Eruropean countries whose EPI.new score is less than
# or equal to 50!..
```{r}
epi_data %>%
filter(region == "Eastern Europe" & EPI.new <= 50) %>%
View()
```
# You can use , and & interchangeably in filter().
```{r}
epi_data %>%
filter(region == "Eastern Europe", EPI.new <= 50) %>%
View()
```
# Filter Asia-Pacific countries whose EPI score is greater than the grand mean.
```{r}
epi_data %>%
filter(region == "Asia-Pacific", EPI.new > mean(EPI.new)) %>%
View()
```
# Multiple AND, OR and NOT conditions can be combined.
# Filter out Eastern Eropean countries whose EPI.new score is less than
# or equal to 50 and AIR.new score is greater 40.
```{r}
epi_data %>%
filter(region == "Eastern Europe", EPI.new <= 50 , AIR.new > 40) %>%
View()
```
# Make sure to use parentheses when combining several logical operators
# to indicate which logical operation should be performed first.
# Practice at home!
# To filter out empty rows, you negate the is.na() function inside a filter:
# The sample code will remove any rows where the specified column is NA.
```{r}
cth_data <- read_csv("data/CTH_raw_na.csv")
View(cth_data)
```
```{r}
dim(cth_data)
```
### Caution!!! Discuss with students
Eğer cth_raw_1950 kolonunda missing varsa, veriden çıkar
```{r}
cth_data %>%
filter(!is.na(CTH.raw.1950)) %>%
View()
```
```{r}
cth_data_filtered <- cth_data %>%
filter(!is.na(CTH.raw.1950))
View(cth_data_filtered)
```
```{r}
dim(cth_data_filtered)
```
#### Select() ####
# select() allows us to select a subset of the columns in the data frame.
# To select a few columns just add their names in the select statement.
# The order in which you add them, will determine the order in which
# they appear in the output.
```{r}
epi_data %>%
select(code, iso, country, region, EPI.new, AIR.new) %>%
View()
```
# We can select multiple columns using the (from:to) operator:
```{r}
epi_data %>%
select(code:AIR.new) %>%
View()
```
# A side note:
# ! for taking the complement of a set of variables.
```{r}
epi_data %>%
select(!code:AIR.new) %>%
View()
```
# For selecting non-consecutive variables, we can use the followings:
# & and | for selecting the intersection or the union of two sets of variables.
# c() for combining selections.
```{r}
epi_data %>%
select(code:iso & iso:EPI.new) %>%
View()
```
```{r}
epi_data %>%
select(code:iso | iso:EPI.new) %>%
View()
```
# We can also deselect (multiple) columns
# by adding a minus sign in front of the column name.
# You can also deselect blocks of columns.
```{r}
# negative sign works with !
epi_data %>%
select(-code) %>%
View()
```
```{r}
epi_data %>%
select(-HLT.new,-(PMD.new:WWT.rgn.rank)) %>%
View()
```
# If we have a lot of columns with a similar structure,
# we can use partial matching by adding
# starts_with(): starts with a prefix,
# ends_with(): ends with a prefix or
# contains(): contains a literal string in your select statement.
```{r}
epi_data %>%
select(code:region, starts_with("AIR")) %>%
View()
```
```{r}
epi_data %>%
select(code:region, ends_with("new")) %>%
View()
```
```{r}
my_Reduced <- epi_data %>%
select(code:region, contains("EPI"))
View(my_Reduced)
```
# Or use a variable for column selection:
# You can set up column names upfront, and
# then refer to them inside a select() statement by either wrapping
# them inside one_of() or by using the !! operator.
```{r}
columns <- c("code", "iso", "country", "region", "EPI.new", "AIR.new")
epi_data %>%
select(one_of(columns)) %>%
#select(!!(columns)) %>% #the same with previous command
View()
```
#everything() within select(), selects everything
```{r, eval=F}
### caution: did not work!!!
columns <- c("code", "iso", "country", "region", "EPI.new", "AIR.new")
epi_data %>%
select(everything(columns)) %>%
View()
```
# If you will be using a select() statement, you can rename straight in the select function.
```{r}
epi_data %>%
select(code:region, EPI = EPI.new, AIR = AIR.new) %>%
View()
```
# Alternatively:
# rename() renames column names.
# The new variable names goes on the LHS of the = sign,
# and the old name on the RHS.
```{r}
epi_data %>%
select(code:region, EPI.new, AIR.new) %>%
rename(EPI = EPI.new, AIR = AIR.new) %>%
View()
```
# More:
# To rename all variables at the same time use set_names():
#### Arrange() ####
# arrange() allows us to sort the values in a data frame by one or more column entries.
# Sort the data according to the EPI.new variable in ascending order
```{r}
epi_data %>%
arrange(desc(EPI.new)) %>%
View()
```
```{r}
epi_data %>%
arrange(EPI.new) %>%
View()
```
# Sort the data according to the EPI.new variable in descending order
# or use desc() within arrange()
```{r}
epi_data %>%
arrange(desc(EPI.new))
View()
```
# Sort the data according to the EPI.new variable in descending order
```{r}
## caution: did not work!!!
epi_data %>%
arrange(-EPI.new)
View()
```
#### Mutate() ####
# We can make new columns with the mutate() function.
# The options inside mutate are "almost endless": pretty much anything that
# you can do to normal vectors, can be done inside a mutate() function.
# Anything inside mutate can either be a new column (by giving mutate a new column name),
# or can replace the current column (by keeping the same column name).
# Let's standardize the EPI.new variable
```{r}
epi_data %>%
mutate(EPI.stand = (EPI.new-mean(EPI.new))/sd(EPI.new)) %>%
select(code:country, EPI.stand) %>%
View()
```
# Note: We can add multiple columns as part of one mutate call.
# The ifelse() statement can be used to turn a numeric column
# into a categoric one with two levels.
# ifelse() takes a logical expression, then what to do if the expression
# returns TRUE and lastly what to do when it returns FALSE.
```{r}
epi_data %>%
mutate(EPI.cat = ifelse(EPI.new >=50,"Good","Bad")) %>%
select(country, EPI.new, EPI.cat) %>%
View
```
# The ifelse() can be nested but if you want more than two levels,
# then it might be even easier to use case_when() which allows as many
# statements as you like and is easier to read than many nested ifelse statements.
# "The arguments are evaluated in order", so only the rows where the first statement
# is not true will continue to be evaluated for the next statement.
# For everything that is left at the end just use the TRUE ~ "newname".
```{r}
epi_data %>%
mutate(EPI.cat = case_when(EPI.new >= 75 ~ "Very Good",
EPI.new >= 50 ~ "Good",
TRUE ~ "Not Good")) %>%
select(EPI.new, country, EPI.cat) %>%
View()
```
#### Summarize() ####
# If we just want to produce summaries of
# the average, sum, standard deviation, minimum, maximum of the data,
# we need summarise().
# To use the function we just add our new column name,
# and after the equal sign the mathematics of what needs to happen:
# column_name = function(variable).
# Useful functions (from https://dplyr.tidyverse.org/reference/summarise.html)
# Center: mean(), median()
# Spread: sd() (standard deviation)
# Range: min(), max()
# Position: first(), last(), nth()
# Count: n(), n_distinct()
# n(): gives the number of observations
# n_distinct(): gives the numbers of unique values of the variable.
# You can add multiple summary functions behind each other.
```{r}
epi_data %>%
summarize(nds = n_distinct(region)) %>%
View()
```
```{r}
## can you guess the dimension of the output?
epi_data %>%
summarize(nrow = n(), avgEPI = mean(EPI.new), sdHLT = sd(HLT.new), maxAIR = max(AIR.new)) %>%
View()
```
#### Group_by() / Summarise() ####
# In most cases, we don’t just want to summarise the whole data table,
# but we want to get summaries by a group.
# To do this, you first need to specify by which variable(s)
# you want to divide the data using group_by().
# You can add one of more variables as arguments in group_by().
# These two verbs always go together. group_by() doesn’t do anything by itself,
# output wise. But what it does is split our data frame into a number of
# smaller data.frames split out by category.
# Then we use summarize to do some sort of summary calculation
# (such as counting or calculating the mean) on these smaller data.frames:
# Calculate the average EPI.new scores across regions and the number of
# countries in each region.
```{r}
epi_data %>%
group_by(region) %>%
summarize(n = n()) %>%
View()
```
# shorcut: group_by()%>%summarize(n = n()) = count()
```{r}
epi_data %>%
count(region) %>%
View()
```
# Then sort the data according to
# average EPI score in descending order.
```{r}
epi_data %>%
group_by(region) %>%
summarize(n = n(), avgEPIreg = round(mean(EPI.new), 2)) %>%
arrange(-avgEPIreg) %>%
View()
```
# Calculate the proportion of countries in each region.
```{r}
epi_data %>%
count(region) %>%
mutate(prop = n /sum(n) * 100) %>%
View()
```
# Side note: we can also group by **more than one** variable .
# Final notes:
# The package **`dplyr`** provides convenient tools for the most common data manipulation
# tasks. It is built to work directly with data frames, with many common tasks
# optimized by being written in a compiled language (C++). An additional feature is the
# ability to work directly with data stored in an external database. The benefits of
# doing this are that the data can be managed natively in a relational database,
# queries can be conducted on that database, and only the results of the query are
# returned.
# This addresses a common problem with R in that all operations are conducted
# in-memory and thus the amount of data you can work with is limited by available
# memory. The database connections essentially remove that limitation in that you
# can have a database of many 100s GB, conduct queries on it directly, and pull
# back into R only what you need for analysis.