---
title: "Welcome to the tidyverse!"
output:
  html_document:
    theme: cosmo
    highlight: tango
    toc: true
    toc_float: true
---
**Part of the DGfS PhD students' forum, 23 February 2021**
Instructors: Kyla McConnell and Julia Müller
Contact us on Twitter (@McConnellKyla, @JuliaMuellerFr)
Unless indicated, artwork is by the wonderful @allison_horst - find her on [github](https://github.com/allisonhorst/stats-illustrations).
# (1) What is this "tidyverse"?
![The tidyverse](img/tidyverse_celestial.png){width=50%}
Let's jump right in and load the package:
```{r}
library(tidyverse)
```
The tidyverse is an extremely useful collection of R *packages* (i.e. add-ons to *base-R*) that help you get your data into a useful format (among other things). They all share "an underlying design philosophy, grammar, and data structures", according to the [tidyverse website](https://www.tidyverse.org/). In other words, its commands/functions all have a similar structure and descriptive names to make them easier to remember and use.
**The following packages are included in the tidyverse:**
- *ggplot2*: for data visualisation
- *tibble*: for tweaked dataframes
- *tidyr*: for data wrangling
- *readr*: for reading in files
- *purrr*: for functional programming
- *dplyr*: for data manipulation
- *stringr*: for string manipulation
- *forcats*: for working with categorical variables (factors)
## What is tidy data, and why do we use it?
**Characteristics of tidy data:**
![What does tidy data look like?](img/tidydata_1.jpg){width=50%}
**Why this format?**
- Best format for most operations in R, especially plotting and running statistical models
- Hadley Wickham: "Tidy datasets are all alike, but every messy dataset is messy in its own way"
- Tools/commands that work for one tidy dataset will also work for another tidy dataset
- By getting the data in a tidy format, you can spend more time on the actual analysis
![Why use tidy data?](img/tidydata_3.jpg){width=50%}
# (2) Communicating with R
## Understanding warnings and errors
R will often "talk" to you when you're running code. For example, when you install a package, it'll tell you e.g. where it is downloading a package from, and when it's done. Similarly, when you loaded the tidyverse collection of packages, R listed them all. That's nothing to worry about!
When there's a mistake in the code (e.g. misspelling a variable name, forgetting to close quotation marks or brackets), R will give an *error* and be unable to run the line. The error message will give you important information about what went wrong.
```{r}
hello <- "hi"
# Hello   # uncommenting this line gives an error: object 'Hello' not found
```
In contrast, *warnings* are shown when R thinks there could be an issue with your code or input, but it still runs the line. R is generally just telling you that you MIGHT be making a logical error, not that the code is impossible.
```{r}
c(1, 2, 3) + c(1, 2)
```
It's normal to encounter both warnings and errors while coding! This "debugging bingo" (debugging = finding and fixing errors in code) gives a few suggestions of what might have gone wrong.
![Debugging bingo, by Dr Ji Son, @cogscimom on Twitter](img/debugging_bingo.jpg)
## Reading function documentation
We'll get to know a number of R *functions* today. These functions can take one or more `arguments`. As an example, let's try out the `say()` function from the cowsay package.
First, (install and) load the cowsay package:
```{r}
# install.packages("cowsay")
library(cowsay)
```
Try the following code:
```{r}
say(
what = "Good luck learning about the tidyverse!",
by = "rabbit")
```
We can see that this function has the `what` argument (what should be said?) and the `by` argument (which animal should say it?). But what other options are there for this command - which other animals, for example, or can you change the colour? To see the documentation for the `say` command, you can either run this line of code:
```{r}
?say
```
...or type in `say` in the Help tab on the bottom right.
This will show you the documentation for the command.
- *Usage* shows an overview of the function arguments and their defaults (e.g. if you typed in `say()` without any arguments in the brackets, you'd get the defaults, i.e. a cat saying "Hello world!")
```{r}
say()
```
- *Arguments* provides more information on each argument. Arguments are the options you can use within a function:
  - what
  - by
  - type
  - what_color
  - etc.
Each of these can be fed to the `say()` function to slightly alter what it does.
- *Examples* at the bottom of the help page lists a few examples you can copy-paste into your code to better understand how a function works.
Don't worry if you don't understand everything in the documentation when you're first starting out. Just try to get an idea for which arguments there are and which options for those arguments. It's good practice to look at help documents often -- this will also help you get more efficient at extracting the info you need from them.
**-- Please work on Exercise block 1 now --**
![say() pumpkin, by @E_Brownlie](img/brownlie.jpg)
# (3) Reading in and exploring data
Reading in data tends to follow this pattern:
```{r eval=FALSE}
name_of_data_in_R <- read_csv("data_file.csv") # equivalent to
name_of_data_in_R <- read_delim("data_file.csv", delim = ",")
name_of_data_in_R <- read_csv2("data_file.csv") # equivalent to
name_of_data_in_R <- read_delim("data_file.csv", delim = ";")
name_of_data_in_R <- read_tsv("data_file.txt") # tab-separated file
```
This works as long as the data file is saved in the same location as this file! You can add e.g. "data/" before the file name if it is located in a subfolder.
One useful additional argument is `na`. This controls how missing values in the file are recognised. By default, these functions assume that missing values are represented by either empty cells or "NA", but if that's not the case, you can tell R that. For example, if missing values are represented by -9 (as can be the case for questionnaire data), you could add `na = "-9"` to a read command.
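For example, here's a sketch using inline toy data (the values below are made up and stand in for a real file) where missing reaction times are coded as -9:

```{r}
# toy data standing in for a file where missing values are coded as -9
toy <- read_csv(I("participant,RT\np1,250\np2,-9\np3,300"), na = "-9")
toy # the -9 now shows up as NA
```

The `na` argument also accepts a vector, e.g. `na = c("", "NA", "-9")`, if several codes mark missing data.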
## Data we'll use today
**During the workshop:**
- Self-paced reading (dgfs_spr.csv)
- Corpus (cat_dog_corpus_data.csv, corpus_frequencies.csv)
- Questionnaire (animal_survey.csv)
- Acceptability judgments (acceptability_judgements.csv)
We'll introduce each dataset in more detail when we start using it.
**In the exercise blocks:**
- Lexical decision experiment (dgfs_lexdec.csv)
- Questionnaire data - information on the participants in that lexical decision experiment (dgfs_pars_lexdec.csv)
- Phonetic data (wh_split.txt)
Here, you can often choose which data you'd like to focus on - lexical decision or phonetic data!
## Introducing our first dataset
Let's read in a self-paced reading dataset (saved in the data folder, so we need to add `data/` to tell R that):
```{r}
spr <- read_csv("data/dgfs_spr.csv")
```
The current dataframe comes from a self-paced reading experiment in which 12 participants read 20 sentences each, plus 3 practice sentences to get them warmed up. In an SPR experiment, participants see one word at a time and need to press a button to be shown the next word in the sentence. Their reaction times (RTs, i.e. how long it takes them to press that button) are recorded.
Half the sentences were about dogs and half were about cats. In one condition (A), cat/dog was paired with an appropriate adjectival collocate according to the BNC (lap dog vs. tortoiseshell cat); in the other (B), these pairings were reversed (lap cat vs. tortoiseshell dog). All items were otherwise natural-sounding sentences.
## Exploring our data
Now you have a data file read in, but how do you see what's in it?
```{r}
head(spr)
```
You can change the number of rows displayed with the `n` *argument*:
```{r}
head(spr, n=3)
```
Or: click on the name of the dataframe in the Environment tab (top right) to open a preview in a new tab. There, you can also sort columns and filter rows (just for viewing purposes). If the dataframe is large, however, this can get very slow.
There's also an easy way to see what the columns are:
```{r}
colnames(spr)
```
`summary()`: call it on a dataframe to get useful information on each column based on its data type. For example, numeric columns will show the min, median, max and the quartiles (25% increments).
```{r}
summary(spr)
summary(spr$RT)
```
## Character vs. factor columns
In the environment panel (or using `str()`), you can see that all the variables in this data are read in as either numeric or character data. However, some variables should be treated as factors because they represent categories, not text data. Let's convert them.
```{r}
spr$participant <- as_factor(spr$participant)
spr$item_type <- as_factor(spr$item_type)
spr$cond <- as_factor(spr$cond)
```
This way, the summary output also works as expected. For example, we can have a look at how many different participants and conditions we have:
```{r}
summary(spr$participant)
summary(spr$cond)
```
If any other columns in your dataframe are read in wrong (for example, if you have a numeric column that looks like: "43", "18" and is being read as a character column) you can convert them with similar syntax: `as.numeric()`, `as.character()` etc.
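For example, a quick sketch with a made-up character vector:

```{r}
# a numeric column accidentally read in as character
ages <- c("43", "18")
as.numeric(ages) # now usable for calculations
```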
# (4) The pipe %>%
One of the most noticeable features of the tidyverse is the pipe `%>%` (keyboard shortcut: Ctrl/Cmd + Shift + M).
The pipe takes the item before it and feeds it to the following command as the first argument. Since all tidyverse (and many non-tidyverse) functions take the dataframe as the first argument, this can be used to string together multiple functions. It also makes it easier to read sequential code in a more natural order.
Compare the following lines of pseudocode (courtesy of @andrewheiss), which would produce the same output:
![Pipe as explained by Andrew Heiss](img/pipe_andrewheiss.jpg)
You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe! In a chain of commands like this one, you can think of the pipe as meaning "and then".
Another example - maybe @hadleywickham's version of a typical morning speaks more to your own experience (it certainly does for us):
```{r eval=FALSE}
I %>%
  tumble(out_of = "bed") %>%
  stumble(to = "the kitchen") %>%
  pour(who = "myself", unit = "cup", what = "ambition") %>%
  yawn() %>%
  stretch() %>%
  try(come_to_live())
```
So to re-write the `head()` function with the pipe:
```{r}
spr %>%
head()
```
This produces the exact same output as `head(spr)`.
Here are some more examples:
```{r}
# Equivalent to summary(spr)
spr %>%
summary()
# Equivalent to colnames(spr), which returns all column names
spr %>%
colnames()
# Equivalent to nrow(spr), which returns the number of rows in the df
spr %>%
nrow()
```
You can also stack commands by ending each row (except the last one) with a pipe. The line breaks are optional, but they make the code more readable.
```{r}
spr %>%
colnames() %>% #extracts column names
nchar() #counts the number of letters
```
# (5) Renaming and rearranging data
## rename()
You can rename columns with the `rename()` function. The syntax is `new_name = old_name`. Let's rename the `cond` variable:
```{r}
spr %>%
rename(condition = cond)
```
### Preview vs. saving
This is just a preview because we didn't assign the changed dataframe to any name. If you look at the spr dataframe, for example in the Environment panel on the upper-right, the dataframe hasn't changed. This is useful for testing code and making sure it does what you expect and want it to do. To save the changes, we need to assign the call back to the variable name, i.e.
```{r eval=FALSE}
dataframe <- dataframe %>%
  some_operations_here()
```
### Renaming multiple columns
You can also rename multiple columns at once (no need to wrap them in `c()` here):
```{r}
spr <- spr %>%
rename(condition = cond,
sentence = full_sentence)
```
Notice above that we've saved output over the spr dataframe to make the changes 'permanent'.
There is no output when you save the changes, but the spr dataframe has been permanently updated (within the R session, not in your file system). To save the changes, but also show the output of a function, we can put brackets around the code.
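For example (a toy assignment, not our spr data), wrapping the whole line in brackets saves and prints in one go:

```{r}
(x <- head(letters, 3)) # assigns to x AND prints the result
```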
If you make a mistake, you can recover by re-running earlier code: the arrow with a line under it on an R Markdown code block runs all blocks above it (but not the current one).
## arrange()
This command lets you sort by row values: `arrange()`. By default, this sorts by lowest to highest value, but you can add `desc()` to reverse that.
```{r}
spr %>%
arrange(RT)
spr %>%
arrange(desc(RT))
```
Or in alphabetical order if the column is a character type:
```{r}
spr %>%
arrange(word)
```
Wrapping `desc()` around a character or factor variable reverses the alphabetical sorting:
```{r}
spr %>%
arrange(desc(word))
```
It's also possible to sort by several variables:
```{r}
spr %>%
arrange(word, RT)
```
And again, `desc()` can be used to reverse the order:
```{r}
spr %>%
arrange(word, desc(RT))
```
**-- Please work on exercise block 2 now --**
# (6) Subsets
Taking a subset allows you to look at only a part of your data: the tidyverse uses `select()` for columns and `filter()` for rows.
![Source: Kai Arzheimer](img/kai_arzheimer.jpg){width=50%}
## select()
You may have seen the traditional syntax `dataframe$column`.
A useful step in using pipes and tidyverse calls is the ability to *select* specific columns. That is, instead of writing `spr$RT`, we can write:
```{r}
spr %>%
select(RT)
```
### Select multiple columns
You can also use `select()` to take multiple columns.
```{r}
spr %>%
select(participant, word, RT)
```
You can see that these columns are presented in the order you gave them to the select call, too:
```{r}
spr %>%
select(RT, word, participant)
```
You can also use this to reorder columns: give the name of the column(s) you want first, then finish with `everything()` to have all other columns follow:
```{r}
spr %>%
select(RT, everything())
```
Or have a look at the documentation for a command we won't have time to discuss today: `relocate()`.
### Remove columns with select
You can also remove columns using select if you use the minus sign. For example, the item_type column is a factor with only one level - it always says "DashedSentence". So let's get rid of it:
```{r}
spr %>%
select(-item_type)
```
You can also remove multiple columns at once by writing them in a vector with `c()`. We'd like to remove the item type column and also the first column (X1), which seems to be just a counter.
```{r}
spr <- spr %>%
select(-c(item_type, X1))
```
This overwrites the data as it is saved in R. It does **not** overwrite the file that is saved on your computer.
### Leveling up select()
Until now, we've used `select()` in combination with full column names, but there are helper functions that let you select columns based on other criteria.
For example, here's how we can select both the `sentence_num` and the `word_num` column - by specifying `ends_with("_num")` in the `select()` call:
```{r}
spr %>%
select(ends_with("_num"))
```
The opposite is also possible using `starts_with()`.
`contains()` is another helper function. Here, we're using it to show all columns that contain an underscore:
```{r}
spr %>%
select(contains("_"))
```
We can also select a range of columns using a colon. This works both with variables and (a range of) numbers:
```{r}
spr %>%
select(condition:RT)
spr %>%
select(1:3)
```
Here, the order of the columns matters!
Other helper functions are:
- `matches()`: similar to `contains()`, but can use regular expressions
- `num_range()`: in a dataset with the variables X1, X2, X3, X4, Y1, and Y2, `select(num_range("X", 1:3))` returns X1, X2, and X3
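As a quick sketch of `num_range()` (with a made-up tibble, since our spr data doesn't have numbered columns like these):

```{r}
toy <- tibble(X1 = 1, X2 = 2, X3 = 3, X4 = 4, Y1 = 5, Y2 = 6)
toy %>%
  select(num_range("X", 1:3)) # keeps X1, X2 and X3
```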
## filter()
### Filter based on a condition
While with `select()`, you can pick **columns** by name or if they fulfill conditions, `filter()` lets you look for **rows** that fulfill certain conditions.
![Filter](img/dplyr_filter.jpg){width=50%}
Use `filter()` to return all items that fit a certain condition. For example, you can use:
- equals to: ==
- not equal to: !=
- greater than: >
- greater than or equal to: >=
- less than: <
- less than or equal to: <=
- in (i.e. in a vector): %in%
**Syntax:**

```{r eval=FALSE}
filter(data, columnname logical_operator condition)
```

or, using the pipe:

```{r eval=FALSE}
data %>%
  filter(columnname logical_operator condition)
```
Let's look at reaction times that are shorter than 200 ms:
```{r}
spr %>%
filter(RT < 200)
```
...reaction times longer than or equal to 250 ms:
```{r}
spr %>%
filter(RT >= 250)
```
Or you can use it to select all items in a given category. Notice here that you have to use quotation marks to show you're matching a character string.
Look at the error below:
```{r}
# spr %>%
# filter(condition == condA_cat)
```
Because you're matching to a string, the correct syntax is:
```{r}
spr %>%
filter(condition == "condA_cat")
```
You can also use filter to easily drop rows. Let's drop all practice rows and save the output.
```{r}
nrow(spr)
spr <- spr %>%
filter(condition != "practice")
nrow(spr)
```
It's useful to check `nrow()` before and after dropping rows to see how much data has been lost.
To use `%in%`, give a vector of options (formatted in the correct way based on whether the column is character or numeric):
```{r}
spr %>%
filter(word %in% c("cat", "dog"))
spr %>%
filter(sentence_num %in% c(4, 14))
```
Note that filter is case-sensitive, so capitalization matters.
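A quick illustration with made-up data:

```{r}
toy <- tibble(word = c("cat", "Cat"))
toy %>%
  filter(word == "cat") # matches only the lower-case "cat"
```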
We can also specify several conditions in one `filter()` call:
- Use & for "and" (both conditions must be true)
- Use | for "or" (at least one condition must be true)
```{r}
spr %>%
filter(word == "cat" & RT > 300)
spr %>%
filter(RT > 600 | RT < 100)
```
We can also "chain" different functions, which is one of the things that makes the pipe so useful. For example, we could filter for the condB_dog condition and only look at the words and their response times:
```{r}
spr %>%
filter(condition == "condB_dog") %>%
select(word, RT)
```
One useful function that can be chained to `filter()` is `distinct()`, which will return only the unique rows. Without an argument, it returns all rows that are unique in all columns.
You can also add a column name as an argument to return only the unique values in a certain column (useful with factors):
```{r}
spr %>%
filter(condition == "condB_dog") %>%
distinct(sentence)
```
You can also use it on its own to return unique values or combinations of values.
```{r}
spr %>%
distinct(sentence, condition)
```
You may also have NAs in your data at some point -- these appear when a cell is left empty for any reason. `is.na()` is a useful companion to `filter()`, especially paired with the not operator `!`:
```{r}
spr %>%
filter(!is.na(RT) & word == "cat")
```
Alternatively, the tidyverse has its own function for removing NAs: `drop_na()`, which can be included in pipe sequences:
```{r}
spr %>%
drop_na(RT)
```
However, before you drop NAs, think about why they are there and if there is some better way to deal with them. Dropping NAs removes THE ENTIRE ROW that includes an NA (there is no way around this, if you think about it). Before permanently removing NAs, think about if they are indicative of an error you don't want to ignore, if they can be replaced by a usable value (check the `replace_na()` command), and if you want to permanently remove them or just discard those rows for some parts of the analysis. Always check the number of rows before and after removing NAs (`nrow()`).
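A quick sketch of `replace_na()` on made-up data (not our spr dataframe):

```{r}
toy <- tibble(answer = c("yes", NA, "no"))
toy %>%
  mutate(answer = replace_na(answer, "no response"))
```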
**-- Please work on Exercise block 3 now --**
# (7) Separating and uniting columns
For the next two examples, we'll use a different dataset called `animal_corpus`. Read it in and familiarise yourself with its contents.
```{r}
animal_corpus <- read_csv2("data/cat_dog_corpus_data.csv")
```
This contains corpus data of "cat" and "dog" together with the words that precede "cat" and "dog". These are called "collocates" and, in our example, also include part of speech tags (the format is word_tag).
This data is not tidy. Why?
The collocates column contains two variables (word and tag) although according to tidy data principles, each variable should be saved in its own column.
## separate()
Luckily, the tidyverse has a command for that: `separate()`. It takes the following arguments:
- data: our dataframe, we'll pipe it
- col: which column needs to be separated
- into: a vector that contains the names of the new columns
- sep: which symbol separates the values
- remove (optional): by default, the original column will be deleted. Set `remove` to FALSE to keep it.
```{r}
(animal_corpus <- animal_corpus %>%
separate(col = coll,
into = c("coll", "coll_tag"),
sep = "_"))
```
Another example: in the SPR data, the condition column contains two pieces of information: which condition the item was in (condA or condB) and which animal was being read about (cat or dog):
```{r}
(spr <- spr %>%
separate(condition,
into = c("condition", "animal"),
sep = "_"))
```
## unite()
The opposite of `separate()`. This lets you glue columns together.
- col is the name of the new column
- the next argument, a vector, lists the columns that should be united
- sep, as above, lets you specify how the values should be separated
Let's say we'd like our data to be in the format "collocate cat/dog", so without the tag, but in one column, separated by a space.
```{r}
animal_corpus %>%
unite(col = "coll_word",
c("coll", "animal"),
sep = " ")
```
This leads to a lot of repetition - some bigrams appear several times in just our preview. To remedy this, we can use `distinct()`, which only keeps unique rows. Since we want unique collocate cat/dog combinations, we can put `coll_word` into `distinct()` to indicate that this is the relevant column and the tags should be ignored.
```{r}
animal_corpus %>%
unite(col = "coll_word",
c("coll", "animal"),
sep = " ") %>%
distinct(coll_word)
```
# (8) Creating and changing columns with mutate()
With the `mutate()` function, you can change existing columns and add new ones.
The syntax is:

```{r eval=FALSE}
mutate(data, col_name = some_operation)
```

or, with the pipe:

```{r eval=FALSE}
data %>%
  mutate(col_name = some_operation)
```
![Mutate](img/dplyr_mutate.png){width=50%}
The response times are measured in ms. Let's convert them to seconds by dividing by 1000:
```{r}
spr %>%
mutate(RT_s = RT / 1000)
```
Now, there's a new column called RT_s (it's at the very end by default).
You can also save the new column with the same name, and this will update all the items in that column (see below, where I divide response times by 1000, but note that I don't save the output):
```{r}
spr %>%
mutate(RT = RT / 1000)
```
You can also perform operations on character columns - for example, converting all words to lower case:
```{r}
(spr <- spr %>%
mutate(word = tolower(word)))
```
We can also easily calculate the length of each word (as number of characters):
```{r}
(spr <- spr %>%
mutate(word_length = nchar(word)))
```
## Change data type in a column
We can also change data types using `mutate()`. Instead of the code we used earlier to convert participant and condition to factors, we could write:
```{r}
(spr <- spr %>%
mutate(participant = as_factor(participant),
condition = as_factor(condition)))
```
As you can see, we can change several variables within one `mutate()` call. In the same way, we could create several new columns at the same time.
## Relabel factors
In our SPR experiment, condition A represents a match (i.e. cat/dog is presented with a matching collocate: purring cat, guide dog) and condition B is a mismatch (e.g. guide cat, purring dog). To make this clear in the data, we should label this explicitly. Within a `mutate()` command, we can use `recode()` to change the factor labels. The format for this is "old label" = "new label".
```{r}
(spr <- spr %>%
mutate(condition = recode(condition,
"condA" = "match", "condB" = "mismatch")))
```
Let's look at our third dataset, animal_survey. It contains data from the same participants, who answered a few sociodemographic questions and also indicated how cute they think animals are and how much they('d) like to look at, pet, and own an animal (scale: 1-5). Besides that, they were also asked to rate cats and dogs (scale: 1-7).
```{r}
animal_survey <- read_csv("data/animal_survey.csv")
animal_survey %>%
head()
```
In our animal_survey data, education is represented by the numbers 1-4. We can most easily change these labels using `recode()`:
```{r}
(animal_survey <- animal_survey %>%
mutate(education = recode(education,
"1" = "Elementary school",
"2" = "High school",
"3" = "Bachelor's degree",
"4" = "Master's degree or higher"
)))
```
Because these numbers should be treated as characters, we need to put them in quotation marks!
## If-else-statements
Now for something fancy. You can also make new columns based on "if" conditions using the call `ifelse()`.
The syntax of ifelse is: `ifelse(condition, value_if_true, value_if_false)`.
For example, we could create a column called "RT_cat" that contains "short" if the response time is faster than 100 ms and "long" if it isn't:
```{r}
spr %>%
mutate(RT_cat = ifelse(RT < 100, "short", "long"))
```
You can also use ifelse on categorical / character columns:
```{r}
(spr <- spr %>%
mutate(position = ifelse(word %in% c("cat", "dog"), "critical", "not critical")))
```
## Several conditions: case_when()
What if you have several conditions? For example, if RTs are shorter than 100 ms, they should be labelled "short", if they're longer than 500 ms, "long", and "normal" for all other RTs. While it's possible to chain several `ifelse()` statements, it gets confusing and hard to read. Instead, we should use `case_when()`.
![Case when: an extension of if-else](img/dplyr_case_when.png){width=50%}
The syntax within `case_when()` is:

```{r eval=FALSE}
case_when(
  condition ~ what_to_do_if_true, # can be used as often as you want, and can even refer to different variables!
  TRUE ~ what_to_do_in_all_other_cases
)
```
```{r}
spr %>%
mutate(RT_cat = case_when(
RT < 100 ~ "short",
RT > 500 ~ "long",
TRUE ~ "normal"
))
```
Another example: In the questionnaire, participants were asked to indicate how much they like cats and dogs, respectively. We can convert this information into a category. To create a variable called "preference", we can use the following `case_when()` statement:
```{r}
head(animal_survey)
(animal_survey <- animal_survey %>%
mutate(preference = case_when(
cats > dogs ~ "cats",
dogs > cats ~ "dogs",
TRUE ~ "undecided"
)))
```
If we look at the output, we see that the "else/TRUE" condition wasn't necessary, strictly speaking, but it's good to account for all the potential cases in the data so you don't end up with superfluous NAs.
**-- Please work on Exercise block 4 now --**
# (9) Summary tables with groupings
## summarize()
Let's extract some information on the reaction times in the self-paced reading data. If we want to do this the tidy way, we can use `summarise(operation(variable))`.
You can use multiple different operations in the summarize part, including:
- mean(col_name)
- median(col_name)
- max(col_name)
- min(col_name)
Note that if you have NAs in your dataset, you would have to at least temporarily drop them to calculate the mean: either pipe in `drop_na()` first, or use `mean(col_name, na.rm = TRUE)`, which ignores the NAs without dropping any rows.
```{r}
spr %>%
summarise(mean(RT))
```
...and the median RTs:
```{r}
spr %>%
summarise(median(RT))
```
By default, the column is labelled "mean(RT)" or "median(RT)", respectively. We can set our own names, though:
```{r}
spr %>%
summarise(average_RT = mean(RT))
```
## group_by()
In our SPR experiment, we showed matching and mismatching collocate + cat/dog combinations, so we might assume that the mismatched combinations were read more slowly. We'd like to see the average RTs for each condition.
To look at summary statistics for specific groupings like these, we use a three-step process:
1. Group by your grouping variable (here: condition) using `group_by()`.
2. Summarise, which returns a summary statistic over multiple rows, using `summarize()` or `summarise()`.
3. Ungroup with `ungroup()`, so that R forgets the grouping and carries on as normal.
Usually ungrouping makes no difference to what you see, but it's an important fail-safe if you're continuing in the code block. For example, if you need to drop the former grouping variable and don't use `ungroup()`, R will refuse to drop it.
![Group and ungroup](img/group_by_ungroup.png){width=50%}
For example, to return the average RT in each condition:
```{r}
spr %>%
group_by(condition) %>%
summarize(mean(RT)) %>%
ungroup()
```
You can also give a name to your new summary column:
```{r}
spr %>%
group_by(condition) %>%
summarize(average_RT = mean(RT)) %>%
ungroup()
```
Or further manipulate it, e.g. converting ms to seconds:
```{r}
spr %>%
group_by(condition) %>%
summarize(average_RT_s = mean(RT) / 1000) %>%
ungroup()
```
You can also group by more than one column to get one summary row per unique combination of the variables.
```{r}
spr %>%
group_by(condition, animal) %>%
summarize(mean(RT)) %>%
ungroup()
```
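Note that with several grouping variables, `summarize()` only removes the innermost grouping, which is why the explicit `ungroup()` matters. Recent dplyr versions also offer a `.groups` argument that drops all groupings in one step; a sketch on made-up data (`demo` is hypothetical):

```{r}
library(dplyr)

demo <- tibble(
  condition = c("match", "match", "mismatch", "mismatch"),
  animal    = c("cat", "dog", "cat", "dog"),
  RT        = c(300, 320, 400, 380)
)

demo %>%
  group_by(condition, animal) %>%
  summarize(average_RT = mean(RT), .groups = "drop") # fully ungrouped, no ungroup() needed
```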
## count()
For categorical columns, you can also count how many rows fall into each category using `count()` instead of `summarize()`. Here, we're counting how often each word appeared in the experiment. The count is displayed in the new "n" column:
```{r}
spr %>%
group_by(word) %>%
count() %>%
ungroup()
```
This is sorted alphabetically, but we can pipe it into the `arrange()` call we discussed earlier to sort by frequency, in descending order.
```{r}
spr %>%
group_by(word) %>%
count() %>%
ungroup() %>%
arrange(desc(n))
```
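`count()` can also do the grouping and the sorting in one step: `count(word, sort = TRUE)` is equivalent to the pipeline above. A minimal sketch on a made-up tibble (`word_demo` is hypothetical):

```{r}
library(dplyr)

word_demo <- tibble(word = c("the", "cat", "the", "dog", "the"))

word_demo %>%
  count(word, sort = TRUE) # group, count and arrange in a single call
```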
# (10) across()
`across()` is a helper for `mutate()` and `summarise()`. It lets you easily apply a change or create a summary for several variables at once, because you can use `select()` helpers (`starts_with()`, `ends_with()`, `contains()`, etc.) inside `across()`.
For example, if we'd like to see the mean for columns that end with "num":
```{r}
spr %>%
summarise(across(ends_with("num"), mean))
```
This also works with a list of functions, e.g. the mean and the standard deviation:
```{r}
spr %>%
summarise(across(ends_with("num"), c(mean, sd)))
```
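With an unnamed vector of functions like this, the summary columns get generic suffixes ("_1", "_2"). Passing a named list instead gives the columns readable names. A sketch with made-up columns (`demo` is not part of the workshop data):

```{r}
library(dplyr)

demo <- tibble(trial_num = c(1, 2, 3), item_num = c(10, 20, 30))

demo %>%
  summarise(across(ends_with("num"), list(mean = mean, sd = sd)))
# columns are now called trial_num_mean, trial_num_sd, item_num_mean, item_num_sd
```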
Here's an example of `across()` in a `mutate()` function. This lets you easily convert all character variables into factors:
```{r}
spr %>%
mutate(across(where(is.character), as.factor))
```
![Across](img/dplyr_across.png)
## Row-wise operations
Back to our questionnaire. For each participant, we'd like to calculate the average of the columns that start with "animals" to get one measure of their interest in animals. We might try something like this:
```{r}
animal_survey %>%
summarise(across(starts_with("animals"), mean))
```
As you can see, R gives us the averages for each of the columns, across all participants - not what we're looking for! The reason is that dplyr computes over columns by default. To solve our problem, we need to add `rowwise()`. This is similar to the grouping idea (`group_by()`) we came across a little earlier: here, we're telling R to treat every row as its own "group". We can then use `c_across()`, the version of `across()` that works with row-wise operations. We also need `mutate()` because we want to create a new column rather than see a summary:
```{r}
(animal_survey <- animal_survey %>%
rowwise() %>%
mutate(
pet_interest = mean(c_across(starts_with("animals")))
))
```
**-- Please work on exercise block 5 now--**
# (11) Reshaping data
Now, let's look into how to change the shape of your data. There are two options here:
- making your data "longer", i.e. increase the number of rows and decrease the number of columns
- useful for tidying data, especially common with "wild-caught" data
- command: `pivot_longer()` (older command: `gather()`)
- making your data "wider", i.e. decrease the number of rows and increase the number of columns
- not (as) common for tidying but for creating summary tables
- command: `pivot_wider()` (older command: `spread()`)
## pivot_longer()
Let's take another look at the questionnaire data. Its format is typical for questionnaire data as you would export it from survey platforms: the data from each participant is represented on one line. This can be useful for some operations (like the `case_when()` we used to create the "preference" variable) but a problem for other applications, so let's tidy it using `pivot_longer()`.
Specifically, we have an issue with the dog and cat ratings being in separate columns. Instead, we should have a column called "animal", which contains either "cats" or "dogs", and another column that contains the rating.
The basic `pivot_longer()` arguments are:
- `cols`: which columns should be reshaped?
- `names_to`: the name of the new column that will store the original column names ("cats" and "dogs") as values
- `values_to`: the name of the new column that will store the cell contents (here: the ratings)
```{r}
animal_survey %>%
pivot_longer(cols = c("cats", "dogs"),
names_to = "animal",
values_to = "rating")
```
Another example: This next file contains the averages of acceptability judgements for the sentences we used in the experiment (we'd expect that sentences which contain mismatching collocates such as "barking cat" are rated as less acceptable than sentences with matching collocates). Let's read it in:
```{r}
acc_judge_original <- read_csv("data/acceptability_judgements.csv")
```
We can see that this data is in a very wide format - each column name contains one of the sentences, and the only row consists of the average acceptability judgements.
What we'd like instead is a column that contains all the sentences and another column that contains all the averages.
```{r}
(acc_judge <- acc_judge_original %>%
pivot_longer(
cols = everything(), # we want to reshape the entire data, so all variables are concerned
names_to = "sentence",
values_to = "rating"
))
```
## pivot_wider()
While `pivot_longer()` is often used to tidy data, its opposite `pivot_wider()` is more common when creating and reformatting summary tables. For example, let's first count how often "cat" and "dog" occur with each collocate:
```{r}
animal_corpus %>%
group_by(animal, coll) %>%
count() %>%
ungroup()
```
Looks like it's working! However, we'd like to reshape this data into a more typical contingency table-like format, with "cat" and "dog" as row labels, and the collocates as column labels.
We can use `pivot_wider()` to achieve this. Its main arguments are:
- `names_from`: where should the new column names come from?
- `values_from`: where should the corresponding values come from?
```{r}
animal_corpus %>%
group_by(animal, coll) %>%
count() %>%
ungroup() %>%
pivot_wider(
names_from = coll,
values_from = n
)
```
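If a combination never occurs in the data, `pivot_wider()` fills its cell with NA by default; for a contingency table, `values_fill = 0` is usually more appropriate. A sketch with made-up counts (`counts_demo` is hypothetical):

```{r}
library(dplyr)
library(tidyr)

counts_demo <- tibble(
  animal = c("cat", "cat", "dog"),
  coll   = c("purring", "barking", "barking"),
  n      = c(5, 1, 6)
)

counts_demo %>%
  pivot_wider(names_from = coll, values_from = n, values_fill = 0)
# the unobserved dog/purring combination becomes 0 instead of NA
```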
# (12) Joining several data sets
Let's add the acceptability judgement data to the SPR data: we want to "join" the two datasets into one, so each sentence should be matched up with its average rating. There are several join commands, which differ in how they match up datasets and which cases are kept or discarded.
Their syntax is:
`xxxx_join(dataframe1, dataframe2, by = "column that is present in both dfs")`
![Join options, from the data wrangling cheat sheet](img/dplyr-joins.png)
The column "sentence" is what we need to match up acceptability judgements with the SPR data. When joining, it's often a good idea to create a new dataframe instead of overwriting any of the old ones.
```{r}
spr_acc <- full_join(spr, acc_judge, by = "sentence")
```
A quick glance at the Environment tab confirms that we now have a dataframe with the same number of rows ("observations") but one more column, just as we want. But when joining data, plenty of things can go wrong, so let's look at the data:
```{r}
spr_acc %>%
select(sentence, rating) %>%
distinct()
```
Looks fine! Let's follow the same procedure to add participant information to this new dataset:
```{r}
spr_acc_participants <- full_join(spr_acc, animal_survey, by = "participant")
```
As a final example, we'll read in one last, very simple, dataset. It contains the raw frequencies of each of the words participants read in the SPR experiment (word frequency tends to be a good predictor of reading times: more common words are usually read faster). Let's have a look:
```{r}
corpus_frequencies <- read_tsv("data/corpus_frequencies.csv")
```
We'd like to add the word frequency to our data, but the column to match by is called "word" in the SPR data and "token" in the corpus data. To join these, we can either rename one of the columns or pass a named vector to `by` which maps the two column names as they appear in the dataframes:
```{r}
spr_complete <- full_join(spr_acc_participants, corpus_frequencies, by = c("word" = "token"))
rm(spr_acc, spr_acc_participants) #remove dfs to keep the environment clean
```
Other options:
- `semi_join()`
- `anti_join()`
- `bind_rows()`
- `bind_cols()`
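A quick sketch of some of these on two tiny made-up dataframes (`a` and `b` are hypothetical):

```{r}
library(dplyr)

a <- tibble(id = c(1, 2, 3), x = c("a", "b", "c"))
b <- tibble(id = c(2, 3, 4))

semi_join(a, b, by = "id") # rows of a that have a match in b (keeps only a's columns)
anti_join(a, b, by = "id") # rows of a without a match in b
bind_rows(a, a)            # stack dataframes with the same columns on top of each other
```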
# (13) Writing files to disk