---
title: "STAT 545 @ UBC: Class Meeting Guide 2019/20"
author: "Vincenzo Coia and Firas Moosvi"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output: bookdown::gitbook
documentclass: book
biblio-style: apalike
link-citations: yes
#github-repo: org/repo
cover-image: "stat-545.png"
---
# About This Guide {-}
Welcome to the class meeting (or "lecture") guide for STAT 545A and STAT 547M at UBC for the 2019/20 academic year! This guide organizes what we will be doing in each class meeting. So you can expect it to be updated regularly -- in fact, the date listed above is the last time this guide was updated.
If you're looking for the list of tutorials written by Jenny Bryan, you can find those in the bookdown book found at stat545.com.
## Other Contributors {-}
Various people have contributed to the current state of this guidebook in various ways.
- [Jenny Bryan](https://jennybryan.org/) founded the course, and the big-picture organization of material in this guidebook originated from her work.
- [Victor Yuan](https://twitter.com/victor2wy) delivered and contributed to the [Intro to data wrangling, Part I] lecture (cm006) in 2019.
- [Rashedul Islam](https://www.linkedin.com/in/rashedul-islam-12170432/) delivered and contributed to the [Tibble Joins] lecture (cm010) in 2018.
- [Giulio Valentino Dalla Riva](https://www.gvdallariva.net/) contributed in the 2017 version of the course.
- [Joey Bernhardt](https://www.zoology.ubc.ca/~jbernhar/home/) made a [singer R package](https://github.com/JoeyBernhardt/singer/) that's being used in the [Tibble Joins] lecture (cm010).
<!--chapter:end:index.Rmd-->
# Introduction to STAT 545 and GitHub
## Outline
We'll cover three topics today:
- Syllabus. (~20 min)
- GitHub. (~35 min)
- Getting help (~10 min)
We'll end class with a to-do list before next class.
## Learning Objectives
By the end of today's class, students are expected to be able to:
- Distinguish and navigate between GitHub repositories, Organization accounts, and user accounts.
- Edit plain text files on GitHub.
- Navigate the commit history of a repository and a file on GitHub.
- Contribute to GitHub Issues, especially for STAT 545.
- Identify whether a software-related question has a reproducible example.
## Resources
If you want to learn more about today's topics, check out:
- The [GitHub guide](https://guides.github.com/) has lots of info about GitHub. If you do go here, I recommend you start with ["Hello, World!"](https://guides.github.com/activities/hello-world/). You'll see stuff about branching there -- we'll be discussing that next Thursday.
- Jenny's ["How to get unstuck"](http://stat545.com/help-general.html) page is useful for getting help online (even outside of STAT 545).
## Topic 1: Syllabus (20 min)
The course syllabus can be found on the STAT 545 @ UBC homepage, [stat545.stat.ubc.ca](https://stat545.stat.ubc.ca). We'll cover:
- Guidebook (2 min)
- Learning Objectives and Course Structure (8 min)
- Example of an analysis: `Interpreting-Regression` book.
- Teaching Team and Contact (5 min)
- Resources, especially what's going on with stat545.com (5 min)
## Topic 2: GitHub (35 min)
(2 min)
We will be using [GitHub](https://github.com) a lot in this course:
- All course-related work will go on GitHub.
- Discussion will happen on GitHub.
- Even this guidebook and the STAT 545 website files are on GitHub.
But why GitHub? Because it's tremendously effective for developing a project. Examples:
- [Apple](https://github.com/apple) uses it.
- [Uber](https://github.com/uber) uses it.
- [Netflix](https://github.com/Netflix) uses it.
- [This Guidebook](https://github.com/STAT545-UBC/Classroom) and [the STAT545 @ UBC website](https://github.com/STAT545-UBC/STAT545-home) use it.
- Prominent R packages like [`ggplot2`](https://github.com/tidyverse/ggplot2) use it.
Today, we'll check out:
1. GitHub as cloud storage;
2. GitHub for collaboration; and
3. GitHub for version control with git.
### Register a GitHub account - Activity (4 min)
Your turn:
1. Register for a free account on [github.com](https://github.com).
- You'll be using this account for the duration of the course.
- Give your username [some thought](https://happygitwithr.com/github-acct.html#username-advice) -- ideally, it should include your name.
2. Tell us what your username is by filling out [this survey](https://ubc.ca1.qualtrics.com/jfe/form/SV_8jKz3FaT7w5EHfT).
### GitHub as cloud storage (4 min)
At the very least, GitHub allows for cloud storage, like Google Drive and Dropbox do. There's a bit more structure than just storing files under your account:
- __Repositories (aka "repo")__: All files must be organized into _repositories_. Think of these as self-contained projects. These can either be _public_ or _private_.
- __User Accounts__ vs. __Organization Accounts (aka "Org")__: All repositories belong to an account:
- A _user account_ is the account you just made, and typically holds repositories related to your own work.
- An _Organization_ account can be owned by multiple people, and typically holds repositories relevant to a group (like STAT 545).
Examples:
- The [`ggplot2`](https://github.com/tidyverse/ggplot2) repo, within its corresponding `tidyverse` Org.
- My [website](https://github.com/vincenzocoia/website) repo, within my own user account.
Want to read more about GitHub accounts? [Check out this help page on GitHub](https://help.github.com/en/articles/types-of-github-accounts).
### GitHub as cloud storage - Activity (10 min)
__Together: Make a participation repo__
- Follow the [setup instructions](https://stat545.stat.ubc.ca/evaluation/participation/#setup) on the participation page.
__Navigating GitHub__
1. Together: Make a new file on your participation repository:
- Click on the "Create New File" button on your repository's home page.
- Call it `navigating_github.md`
- Leave it blank, and commit ("save") the file by clicking on the green "commit new file" button at the bottom of the page.
2. Together: Add the following URLs to your `navigating_github.md` file (click on the pen button to edit), together with some commentary:
- The repository for the STAT 545 home page, called `STAT545-home` (use this if the site ever goes down!)
- The account it's under.
- Whether the account is a _user account_ or an _Org_.
3. Together: Commit the changes.
4. Your turn: Continue the exercise, and add more URLs (with more commentary):
- The URL to your participation repo
- The URL to your user account page
5. Your turn: Commit the changes.
### GitHub for collaboration (4 min)
The "traditional" way to collaborate involves sending files over email. Problems:
- Easily lose track of who has the most recent version.
- Emails get buried.
Addressed by GitHub:
- GitHub repository treated as the "master version".
- Use [_GitHub Issues_](https://guides.github.com/features/issues/) instead of email.
_Issues_ are a discussion board corresponding to a particular repository. One "thread" is called an Issue. Some features:
- Tag other GitHub users using `@username`.
- Get email notifications if you are tagged, or are `Watch`ing a repository.
As an example, check out the Issues in the [`ggplot2`](https://github.com/tidyverse/ggplot2) repository.
More on collaboration next Thursday.
### GitHub for collaboration - Activity (1 min)
__Together: `Watch`ing the `Announcements` repo__
1. Navigate to the STAT 545 [Announcements](https://github.com/STAT545-UBC-hw-2019-20/Announcements) repository.
2. Click `Watch` on the upper-right corner of the repo.
You should now get an email notification whenever an Issue is posted.
### GitHub for version control with git (5 min)
GitHub uses a program called `git` to keep track of the project's history (more about `git` next Thursday).
- Users make "commits" to form a _commit history_.
- Conceptually, each commit captures the _changes_ made since the previous commit; under the hood, `git` stores snapshots efficiently, so unchanged files are not copied each time.
- The actual changes are called a _diff_.
Demonstration:
- View commit history of the [STAT545-home](https://github.com/STAT545-UBC/STAT545-home) repository by clicking on the "commits" button on the repo home page.
- View a recent diff by clicking on the button with the _SHA_ or _hash_ code (something like `6c0a5f1`).
- This is also useful for collaborators to see exactly what you changed.
- View the repository from a while back with the `<>` button.
- View the history of a file by clicking on the file, then clicking "History".
Why version control?
- Don't fret about removing stuff
- Leave a breadcrumb trail for troubleshooting
- "Undo" and navigate a previous state
- Helps you define your work
- ...
### GitHub for version control with git - Activity (5 min)
__Your turn: History of the [`STAT545-UBC/Classroom`](https://github.com/STAT545-UBC/Classroom) repository.__
1. Use the commit history of the [`STAT545-UBC/Classroom`](https://github.com/STAT545-UBC/Classroom) repository to find Assignment 01 that was delivered last year in STAT 545A (Note: the course ended in mid October 2018, and the assignments were held in a folder called `assignments`).
2. Add the URL of this assignment to your `navigating_github.md` file in your participation repository. Keep up with the commentary within the file, too. When was the assignment due?
Note: the layout and content of the assignments are changing this year.
## Topic 3: Asking effective questions online (10 min)
(5 min)
We all get stuck sometimes. If [preliminary measures](http://stat545.com/help-general.html) such as googling don't get you unstuck, you may have to turn to writing a question on a discussion board. Making your question _effective_ is an art.
To make your question effective, the idea is to make things as easy as possible for someone to answer.
- Will they have to dig to find a resource you're talking about, or do you provide links?
- If your code isn't doing what you expect, or you don't know how to obtain an output, do you provide a [__reproducible example__](https://stackoverflow.com/help/minimal-reproducible-example) (aka "reprex")?
- Ideally, someone should be able to copy and paste a chunk of code to reproduce the problem you are talking about.
- Is your reproducible example _minimal_, meaning you've removed all the unnecessary parts to reproduce the problem?
You'll probably find that the act of writing an effective question causes you to answer your own question!
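As a rough sketch of what a minimal reproducible example might look like in R, consider the snippet below. The question, the variable names `short` and `long`, and their values are all made up for illustration. The point is that someone can copy and paste these few lines and see the puzzling result directly, with nothing extra to dig up.

```r
# Hypothetical question: "Why is there no error when I add vectors of
# different lengths?" Everything needed to reproduce it is right here.
short <- c(10, 20)        # made-up values
long  <- c(1, 2, 3, 4)    # made-up values

result <- long + short
result
#> [1] 11 22 13 24
```

Notice what is absent: no file paths, no private data, and no unrelated code.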
### Asking questions - Activity (5 min)
__Commenting on some online questions__
1. My turn: Start an Issue on the [Announcements repo](https://github.com/STAT545-UBC-hw-2019-20/Announcements/issues) called `Asking effective questions`.
2. Your turn: Find a question/issue or two that someone has posed online. Check out [Stack Overflow](https://stackoverflow.com/) for inspiration.
3. Your turn: Add a comment to the newly opened Issue with the following:
- The URL to the thread/question
- A few brief points on how the question is worded effectively or ineffectively. What would make it better, if anything?
We'll talk about some examples after you're done.
## To do before next class
- Please fill out [this survey](https://ubc.ca1.qualtrics.com/jfe/form/SV_8jKz3FaT7w5EHfT), so that we can match you to your GitHub account.
- Be sure to complete the in-class activities listed in today's section of the guidebook.
- Please put up a profile photo on GitHub -- it makes the STAT 545 community more personable.
- Install the software stack for this course, as indicated below. Having trouble? Our wonderful TAs are here to help you during office hours.
Optionally, register for the [Student Developer Pack](https://education.github.com/pack) with GitHub for a bunch of free perks for students!
And remember: bring your laptop to every class, as we will always have live-coding activities.
### Software Stack Installation
1. Install R and RStudio.
- R here: <https://cloud.r-project.org>
- RStudio here: <https://www.rstudio.com/products/rstudio/download/preview/>
- Commentary on installing this stuff can be found at [stat545.com: r-rstudio-install](http://stat545.com/block000_r-rstudio-install.html)
2. Install git (this is different from GitHub!). See [happygitwithr: Section 7](http://happygitwithr.com/install-git.html)
- You'll need to work with the command line.
<!--chapter:end:cm001.Rmd-->
# Introduction to R
Today, we'll get you up to speed with a minimum "need to know" about using R and RStudio. We're going to assume you know nothing, but aren't covering the breadth of the R/RStudio landscape.
Today's notes aim to teach R by exploration, so they are essentially an activity guide with prompts for exploration. Most of these are exercises we'll be doing together in class.
To participate in today's lecture, you should have:
- R and RStudio installed
- A [participation repo](https://stat545.stat.ubc.ca/evaluation/participation/#setup) on GitHub to put in-class work.
__Announcements__: (10 min)
- Assignment 1 is launched!
- Class meeting schedule is fixed
## Learning Objectives
By the end of today's class, students are expected to be able to:
- Write an R script to perform simple calculations
- Access the R documentation on an as-needed basis
- Use functions and operators in R
- Subset vectors in R
- Explore a data frame in R
- Load packages in R
## Participation
Start a new R script in RStudio, and add your exploratory code to the script as we work through the exercises. What you write on this script doesn't have to be exactly the same as what I write -- we're just looking for some exploration of coding in R.
## Resources
Here are some useful resources for getting oriented with R.
- Jenny's [stat545.com: hello r](http://stat545.com/block002_hello-r-workspace-wd-project.html) page for exploring R roughly follows today's outline.
- Want to practice programming in R? Check out [R Swirl](https://swirlstats.com/) for interactive lessons.
- For a list of R "vocabulary", see [Advanced R - Vocabulary](http://adv-r.had.co.nz/Vocabulary.html); for a list of R operators, see [Quick-R](https://www.statmethods.net/management/operators.html).
Today, we'll be learning just enough base R so that we can dive in to the tidyverse side of R. If you want to learn even more about base R, take a look at [Mike Marin's R playlist on YouTube](https://www.youtube.com/playlist?list=PLqzoL9-eJTNBlVXxWvJkq0dtVut2sICUW).
## Why R?
Why R? Some points taken from [adv-r: intro](http://adv-r.had.co.nz/Introduction.html):
- Free, cross-platform
- Open source
- Comprehensive set of "add on" packages for analysis
- Huge community
- ...
Alternatives exist for data analysis. Python is another excellent tool, especially these days, as more and more R-like functionality is added to it. Python is often faster and has broader support for machine learning models. For the sake of streamlining, both STAT 545A and STAT 547M focus only on R.
## Orientation to R
### Using R and RStudio (5 min)
Let's try these exercises as our first steps.
1. Try some arithmetic from a script vs. the console.
- Notice that your commands appear in the "History" tab. Do not rely on this! What do you think is better than relying on the history?
2. Store a number in a variable called `number` using `<-` (read this arrow as "gets").
- Notice that the object appears in the "Environment" tab in the top-right of RStudio.
3. Try some arithmetic on the variable.
4. Try some arithmetic on an undefined variable.
5. Try some arithmetic on the variable on a line of code above the variable definition (do you think we'll get an error?)
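The steps above might look something like the following sketch. The value stored in `number` and the name `undefined_number` are arbitrary choices for illustration; `tryCatch()` is used only so that the error can be inspected without stopping the script.

```r
# Arithmetic works the same way from a script or the console.
2 + 3 * 4
#> [1] 14

# Store a value in a variable; read `<-` as "gets".
number <- 5

# Arithmetic on the variable does not change the variable itself.
number * 2
#> [1] 10
number
#> [1] 5

# Using a variable that was never defined raises an error.
# Here the error message is captured as text instead of halting the script.
msg <- tryCatch(undefined_number + 1, error = function(e) conditionMessage(e))
msg
```

R runs a script from top to bottom, so a variable must be defined on an earlier line than the one that uses it.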
### Vectors (3 min)
_Vectors_ store multiple entries of a data type, like numbers. You'll discover that they show up just about everywhere in R.
Let's collect some data and store this in a vector called `times`. How long was your commute this morning, in minutes? Here's starter code:
```
times <- c()
```
Operations happen component-wise. Let's calculate those times in hours. How can we "save" the results?
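Here is a minimal sketch of component-wise arithmetic, assuming some made-up commute times (swap in the values collected in class). The name `times_hours` is just one possible choice.

```r
# Made-up commute times, in minutes.
times <- c(25, 40, 10, 55, 32)

# Division is applied to every entry of the vector.
times / 60

# The line above only printed the result. To "save" it, assign it to a variable.
times_hours <- times / 60
times_hours
```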
### Functions, Part I (3 min)
What's the average travel time? Instead of computing this manually, let's use a _function_ called `mean`. Notice the syntax of using a function: the function name comes first, followed by the _input_ inside parentheses.
We _input_ `times`, and got some _output_. Did this function change the input? No -- aside from some bizarre functions, a function never changes its input.
Functions don't always return a single value. Try the `range()` function, for example. What's the output? What about the `sqrt()` function?
Much of R is about becoming familiar with R's "vocabulary". A nice list can be found in [Advanced R - Vocabulary](http://adv-r.had.co.nz/Vocabulary.html).
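A short sketch of these function calls, again assuming made-up commute times:

```r
times <- c(25, 40, 10, 55, 32)  # made-up commute times, in minutes

# Function name first, then the input in parentheses.
avg <- mean(times)
avg
#> [1] 32.4

# The input is left untouched.
times
#> [1] 25 40 10 55 32

# range() returns two values: the minimum and the maximum.
range(times)
#> [1] 10 55

# sqrt() returns one value per entry of its input.
sqrt(c(4, 9, 16))
#> [1] 2 3 4
```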
### Comparisons (7 min)
We'll now introduce _logicals_.
Which of our travel times are less than (say) 30 minutes? Use `<`.
Which of our travel times are equal to ... (pick something)? What about _not_ equal to it? Notice the use of `==` as opposed to `=` -- why do you think that is?
Which of our travel times are greater than ...(lower)... _and_ less than ...(upper)...? What about less than ...(lower)... _or_ greater than ...(upper)...?
Some functions expect logical inputs. Try using the `which()` function on one of the above. What about `any()`? `all()`?
Logicals can be explicitly specified in R with `TRUE` and `FALSE`.
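The comparisons above could be sketched as follows. The commute times and the cutoffs (30, 40, 20, 45) are arbitrary values chosen for illustration.

```r
times <- c(25, 40, 10, 55, 32)  # made-up commute times, in minutes

# Each comparison returns one logical value per entry.
under_30 <- times < 30
under_30
#> [1]  TRUE FALSE  TRUE FALSE FALSE

times == 40   # equal to
times != 40   # not equal to

# `&` means "and"; `|` means "or".
times > 20 & times < 45
#> [1]  TRUE  TRUE FALSE FALSE  TRUE
times < 20 | times > 45
#> [1] FALSE FALSE  TRUE  TRUE FALSE

# Functions that expect logical input.
which(under_30)   # positions of the TRUE entries
#> [1] 1 3
any(under_30)     # is at least one entry TRUE?
#> [1] TRUE
all(under_30)     # are all entries TRUE?
#> [1] FALSE
```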
### Subsetting (10 min)
Use `[]` to subset the vector of times:
1. Extract the third entry.
2. Extract everything except the third entry.
3. Extract the second and fourth entry. The fourth and second entry.
4. Extract the second through fifth entry -- make use of `:` to construct sequential vectors.
5. Extract all entries that are less than 30 minutes. Why does this work? Logical subsetting!
After all of that, did our `times` object change at all?
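As a sketch, the extraction exercises might look like this with made-up commute times:

```r
times <- c(25, 40, 10, 55, 32)  # made-up commute times, in minutes

times[3]            # third entry
#> [1] 10
times[-3]           # everything except the third entry
#> [1] 25 40 55 32
times[c(2, 4)]      # second and fourth entries
#> [1] 40 55
times[c(4, 2)]      # fourth and second entries: order matters
#> [1] 55 40
times[2:5]          # second through fifth entries
#> [1] 40 10 55 32
times[times < 30]   # logical subsetting: keep entries where the test is TRUE
#> [1] 25 10

# None of the lines above used `<-`, so `times` is unchanged.
times
#> [1] 25 40 10 55 32
```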
We can use `[]` in conjunction with `<-` to change the `times` object:
1. Replace two entries with new travel times.
2. "Cap" entries that are "too large" at some set value. If this is more than one value, why don't we need to match the number of values? Recycling!
3. Remove an entry, by overwriting `times`.
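One possible way to work through these three steps is sketched below. The commute times, the replacement values, and the cap of 45 minutes are all made up for illustration.

```r
times <- c(25, 40, 10, 55, 32)  # made-up commute times, in minutes

# 1. Replace the first and third entries with new values.
times[c(1, 3)] <- c(20, 15)
times
#> [1] 20 40 15 55 32

# 2. Cap entries above 45 at 45. The single value 45 is recycled
#    to fill however many entries are selected.
times[times > 45] <- 45
times
#> [1] 20 40 15 45 32

# 3. Remove the second entry by overwriting `times` with a subset of itself.
times <- times[-2]
times
#> [1] 20 15 45 32
```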
### NA (2 min)
Sometimes we have missing data. Those entries are replaced with `NA` in R. Be careful with these!
1. Add `NA` to the vector of times.
2. What's the mean of this new vector of times?
Let's expand our view of functions in order to solve this problem.
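To see why `NA` needs care, here is a small sketch (with made-up commute times; the name `times_na` is an arbitrary choice) showing how a single missing value propagates through calculations:

```r
times <- c(25, 40, 10, 55, 32)  # made-up commute times, in minutes
times_na <- c(times, NA)        # append a missing value

# Arithmetic involving NA gives NA for that entry.
times_na + 5
#> [1] 30 45 15 60 37 NA

# A summary of the whole vector becomes NA as well.
mean(times_na)
#> [1] NA

# is.na() flags which entries are missing.
is.na(times_na)
sum(is.na(times_na))   # number of missing entries
#> [1] 1
```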
### Functions, Part II (10 min)
Functions often take more than one _argument_ as input, separated by commas. You can find out what these arguments are by accessing the function's _documentation_:
Access the documentation of the `mean()` function by executing `?mean`.
- There are four arguments.
- All the arguments have names, except for the `...` argument (more on `...` later). This is always the case.
- Under "Usage", some of the arguments are of the form `name = value`.
- These are default values, in case you don't specify these arguments.
- This is a sure sign that these arguments are _optional_.
- `x` is "on its own". This typically means that it has no default, and often (but not always) means that the argument is _required_.
We can specify an argument in one of two ways:
- specifying `argument name = value` in the function parentheses; or
- matching the ordering of the input with the ordering of the arguments.
- For readability, this is not recommended beyond the first or sometimes second argument!
Input `TRUE` for the `na.rm` argument in both ways.
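Both ways of supplying `na.rm` are sketched below, using made-up commute times. Matching by position relies on the argument order shown in `?mean` (`x`, then `trim`, then `na.rm`), so `trim` has to be filled in as well, here with its default of `0`.

```r
times_na <- c(25, 40, 10, 55, 32, NA)  # made-up commute times with one missing

# By name: clear, and the order does not matter.
by_name <- mean(times_na, na.rm = TRUE)

# By position: the second argument is `trim` and the third is `na.rm`.
by_position <- mean(times_na, 0, TRUE)

by_name
#> [1] 32.4
identical(by_name, by_position)
#> [1] TRUE
```

The positional version is harder to read, which is why naming the argument is usually preferred.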
### Data frames (12 min)
Living in a vector-only world would be nice if all data analyses involved one variable. When we have more than one variable, _data frames_ come to the rescue. Basically, a data frame holds data in tabular format.
R has some data frames "built in". For example, motor car data is attached to the variable name `mtcars`.
Print `mtcars` to screen. Notice the tabular format.
__Your turn__ (5 min): Finish the exercises of this section:
1. Use some of these built-in R functions to explore `mtcars`, without printing the whole thing to screen:
- `head()`, `tail()`, `str()`, `nrow()`, `ncol()`, `summary()`, `row.names()` (yuck), `names()`.
Notice that `names()` and `row.names()` output a _character vector_ (we've already seen numeric and logical vectors). These are useful for characterizing categorical data in R.
2. What's the first column name in the `mtcars` dataset?
3. Which column number is named `"wt"`?
Each column is its own vector that can be extracted using `$`. For example, we can extract the `cyl` column with `mtcars$cyl`.
4. Extract the vector of `mpg` values. What's the mean `mpg` of all cars in the dataset?
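A sketch of how these exercises might be approached, using only functions from base R on the built-in `mtcars` data frame:

```r
# Peek at the data without printing all of it.
head(mtcars, 3)

nrow(mtcars)
#> [1] 32
ncol(mtcars)
#> [1] 11

# names() returns a character vector, so it can be subset like any vector.
names(mtcars)[1]
#> [1] "mpg"

# Compare each name to "wt", then ask which position is TRUE.
which(names(mtcars) == "wt")
#> [1] 6

# `$` extracts one column as a vector, which can be passed to mean().
mean(mtcars$mpg)
#> [1] 20.09062
```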
### R packages (13 min)
Usually, the suite of functions that "comes with" R is just not enough to do an analysis, or is just not very convenient.
In come R _packages_ to the rescue. These are "add ons", each coming with their own suite of functions and objects, usually designed to do one type of task. [CRAN](https://cran.r-project.org/) stores packages that, for all intents and purposes, can be considered "official" R packages. It's easy to install packages from CRAN! Just use the `install.packages()` function.
Run the following lines of code to install the `tibble` and `gapminder` packages. (But don't include this in your scripts -- it's not very nice to others!)
```
install.packages("tibble")
install.packages("gapminder")
```
- `tibble`: a data frame with some useful "bells and whistles"
- `gapminder`: a package that makes the gapminder dataset available (as a `tibble`!)
Installing a package is not enough! To access its functions, you have to _load_ it. Use the `library()` function to load a package. (Note: ironically, it's not libraries we load with the `library()` function, but packages).
Run the following lines of code to load the packages. (Do put these in your scripts, and near the top)
```
library(tibble)
library(gapminder)
```
Take a look at the packages under the "Global Environment" tab to see the new objects that have just been made available to us. PS: you'll notice `mtcars` is not in our workspace/environment, yet we can still access it -- where does `mtcars` live?
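One way to investigate that last question is sketched below. It assumes a default R session, in which the `datasets` package is attached automatically.

```r
# Is `mtcars` among the objects defined in the current workspace?
"mtcars" %in% ls()

# find() reports where on the search path an object with this name lives.
find("mtcars")
#> [1] "package:datasets"
```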
Try the following two approaches to access information about the `tibble` package. Run the lines one-at-a-time. Vignettes are your friend, but do not always exist.
```
?tibble
browseVignettes(package = "tibble")
```
Print out the `gapminder` object to screen. It's a tibble -- how does it differ from a data frame in terms of how it's printed?
Because a tibble is a data frame, our exploration functions still work on it. Try some.
### Two slogans to understand computations in R (6 min)
(We probably won't have time to cover this, and that's OK -- I'm leaving it here for you to peruse if you are interested).
John Chambers eloquently sums up using R:
> To understand computations in R, two slogans are helpful:
>
> - Everything that exists is an object.
> - Everything that happens is a function call.
These are useful to remember to prevent us from getting confused.
1. Everything that exists is an object.
This is not obvious when we look at the output of, say, `str()`:
``` r
str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp: num 160 160 108 258 360 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ qsec: num 16.5 17 18.6 19.4 17 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```
The stuff you see is simply printed to screen; it is not the object that `str()` returns! The returned object is actually `NULL`:
``` r
foo <- str(mtcars)
#> 'data.frame': 32 obs. of 11 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
# ...(snip)...
foo
#> NULL
```
The output of `summary()` is actually a "table" object (something not often used in R). Let's coerce it to character data:
``` r
foo <- summary(mtcars)
as.character(foo)
#> [1] "Min. :10.40 " "1st Qu.:15.43 " "Median :19.20 "
#> [4] "Mean :20.09 " "3rd Qu.:22.80 " "Max. :33.90 "
#> [7] "Min. :4.000 " "1st Qu.:4.000 " "Median :6.000 "
#> [10] "Mean :6.188 " "3rd Qu.:8.000 " "Max. :8.000 "
#> [13] "Min. : 71.1 " "1st Qu.:120.8 " "Median :196.3 "
#> [16] "Mean :230.7 " "3rd Qu.:326.0 " "Max. :472.0 "
# ...(snip)...
```
2. Everything that happens is a function call.
Did you know that operators like `+` are actually functions? The "plus" function is literally `` `+`() ``, and accepts two arguments.
Here is what's actually happening when we call `5 + 2`:
``` r
`+`(5, 2)
#> [1] 7
```
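The same idea applies to other operators. The sketch below uses an arbitrary vector `x`; because operators are functions, they can also be handed to other functions such as `sapply()`.

```r
x <- c(10, 20, 30)  # arbitrary values for illustration

# Subsetting with [] is a call to the `[` function.
`[`(x, 2)
#> [1] 20

# Multiplication is a call to the `*` function.
`*`(3, 4)
#> [1] 12

# Operators can be passed as arguments: subtract 1 from each of 1, 2, 3.
sapply(1:3, `-`, 1)
#> [1] 0 1 2
```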
Want a challenge? What's the difference between the `` `(`() `` function and the `` `{`() `` function? Hint: check the documentation with `` ?`{` ``.
## Finishing up (5 min)
1. Highly recommended: [Don't save your workspace](https://www.r-bloggers.com/using-r-dont-save-your-workspace/) when you quit RStudio. Make this a default:
- Go to "RStudio" -> "Preferences..." -> "General"
- Uncheck "restore .RData into workspace on startup"
- Select: "Save workspace to RData on exit:" Never
2. Push your final script to GitHub (you can do this in a simple way by dragging the file onto your repository homepage).
Don't forget! There's an office hour after every class, held upstairs in ESB 3174.
<!--chapter:end:cm002.Rmd-->
# Authoring
Communication of a data analysis is just as important as the analysis itself. Today, we'll be looking at tools for _writing_ about your analysis.
__Announcements__:
- The add/drop deadline for Stat 545A is on Wednesday Sep. 11
- Hang tight -- the canvas slot for Assignment 1 is coming shortly.
## Learning Objectives
By the end of today's class, students are expected to be able to:
- Write documents in markdown on GitHub and RStudio, and render these documents to html and pdf with RStudio.
- Choose whether html or pdf is an appropriate output
- Style an Rmd document by editing the YAML header
- Demonstrate at least two Rmd code chunk options
- Make presentation slides using one of the R Markdown presentation formats.
## Resources
Cheat sheets for "quick reference":
- [GitHub's markdown cheatsheet](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf)
- [RStudio's R markdown cheatsheet](http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf)
Further reading:
- The [Rmd website](https://rmarkdown.rstudio.com/) has a fantastic walk-through [tutorial](https://rmarkdown.rstudio.com/lesson-1.html) that gives a great overview of R Markdown. There's also a nice [overview video](https://rmarkdown.rstudio.com/authoring_quick_tour.html) on the site, too.
- Yihui's [Rmd book](https://bookdown.org/yihui/rmarkdown/) for lots more on R Markdown.
Other explorations of this content:
- Interactive [tutorial](https://commonmark.org/help/tutorial/) for learning markdown.
- The [stat545: Rmd test drive](http://stat545.com/block007_first-use-rmarkdown.html).
## Topic 1: Output Formats (5 min)
There are generally two prominent non-proprietary file types to display manuscripts of various types:
1. __pdf__: This is useful if you intend to print your work onto a physical sheet of paper, or for presentation slides. If this is not the primary purpose, then avoid at all costs, because formatting things so that they fit the page is way more effort than it's worth (unless you're making presentation slides).
- Example: The [concession template](https://stat545.stat.ubc.ca/concession_template.pdf).
2. __html__: This is what you see when you visit a webpage. Content does not need to be partitioned to pages.
- Example: My [website main page](https://vincenzocoia.com), and its corresponding [html file](https://github.com/vincenzocoia/website/blob/hugo/public/index.html).
- Example: html [slides using ioslides](https://rpubs.com/cheyu/ioslideDemo).
We won't be using proprietary file types, like MS Word. Amongst [many reasons](http://www.antipope.org/charlie/blog-static/2013/10/why-microsoft-word-must-die.html), it just doesn't make sense for integrating reproducible code into the document and for a dynamic analysis.
Others that we won't be covering:
- Jupyter notebooks (actually a JSON file)
- LaTeX
We'll be treating pdf and html files as _output_ that should not be edited. In fact, pdf documents are not even easy to edit, and even if you do pay for the Adobe add-on to edit the files, this is not a reproducible workflow.
What's the source, then? (R) __Markdown__! We'll be discussing this next.
## Topic 2: Markdown
(3 min)
Markdown is plain text with an easy, readable way of marking up your text. Let's see [GitHub's cheat sheet](https://guides.github.com/pdfs/markdown-cheatsheet-online.pdf). Various software tools convert markdown to either pdf or html.
File extension: `.md`
### Activity: Modify `navigating_github.md` (5 min)
Together:
1. Open your `navigating_github.md` file that we made in the first class.
2. Mark up the text with some markdown features.
3. Commit your changes.
Notice that GitHub automatically displays markdown files nicely, but not HTML files.
### Activity: Render `navigating_github.md` (5 min)
N.B.: this exercise employs an effective _local_ workflow, which we will address next class.
Together:
1. Download the contents of your GitHub participation repository as a zip file.
2. In RStudio, open the file `navigating_github.md`.
- Yes! RStudio also acts as a plain text editor!
3. Convert the `.md` file to both pdf and html by clicking the appropriate button under the "Preview" tab.
4. Push the two new files to GitHub (by dragging and dropping the files onto your participation repo).
## Topic 3: R Markdown
(2 min)
R Markdown (Rmd) is a "beefed up" version of markdown -- it has many more features built in to it, two important ones being:
- We can specify more features in a _YAML header_.
- This contains metadata about the document to guide how the Rmd document is rendered.
- We can integrate code into a document.
Here's [RStudio's cheat sheet](http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) on Rmd. You can see that it certainly has more features than "regular" markdown!
### Activity: getting set up with R packages (5 min)
(Includes what we missed from last class)
To get started with using R Markdown, you'll need to install the `rmarkdown` R package. The activity we have also depends on the `gapminder`, `tibble`, and `DT` packages.
Together:
1. To install these packages, in any R console, run the following:
```
install.packages('rmarkdown')
install.packages('gapminder')
install.packages('tibble')
install.packages('DT')
```
"Official" R packages are stored on and retrieved from [CRAN](https://cran.r-project.org/).
2. Check out vignettes for the tibble package by running `browseVignettes(package = "tibble")`.
### Activity: exploring code chunks (15 min)
Last class, we explored data frames. This time, we'll explore tibbles, but within code chunks in an R Markdown document.
Together:
1. Open RStudio's Rmd boilerplate by going to "File" -> "New File" -> "R Markdown" in RStudio. Explore!
2. Scrap everything below the YAML header.
3. Add a code chunk below the YAML header via "Insert" -> "R". Or, by:
- Mac: `Cmd + Option + I`
- Windows: `Ctrl + Alt + I`
4. Load the `gapminder`, `tibble`, and `DT` packages using the `library()` function, by adding the following code to your code chunk:
```
library(gapminder)
library(tibble)
library(DT)
```
5. Print out the `gapminder` data frame to explore the output. Then, in a new code chunk, convert the `mtcars` data frame to a tibble using the `tibble::as_tibble()` function. Try out the `DT::datatable()` function on a data frame!
6. Add some markdown commentary to this comparative analysis.
7. Add an in-line code chunk specifying the number of rows of the `mtcars` dataset.
8. "Knit" to html and pdf.
Note: `knitr` integrates the code into the document. The actual conversion here is Rmd -> md -> pdf/html.
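Here's a minimal sketch of what the code chunks from steps 4 to 7 might contain. The column name `model` is just a choice made for illustration: tibbles don't keep row names, so `as_tibble()` needs to be told where to put them if you want to keep them.

```r
library(gapminder)
library(tibble)

# Printing a tibble shows only the first 10 rows and the columns that fit.
gapminder

# mtcars is a base R data frame. Tibbles do not store row names,
# so move them into a column (the name "model" is an assumption).
mtcars_tbl <- as_tibble(mtcars, rownames = "model")
mtcars_tbl

# The expression you would put in in-line code to report the row count.
nrow(mtcars)

# An interactive, searchable table; html widgets like this one are
# meant for html output.
DT::datatable(mtcars)
```

Compare how `mtcars` and `mtcars_tbl` print: the tibble reports its dimensions and column types, and doesn't flood the console.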
### Activity: exploring the YAML header (10 min)
(Note: If you've "fallen off the bus" from the last exercise, here's a "bus stop" for you to get back on -- just start a new Rmd file and use the boilerplate content while we work through this exercise.)
Now, we'll modify the metadata via the YAML header. Check out a bunch of YAML options [from the R Markdown book](https://bookdown.org/yihui/rmarkdown/html-document.html).
Together, in an Rmd file (ideally the one from the previous exercise):
1. Change the output to `html_document`. We'll be specifying settings for the html document, so this needs to go on a new line after the `output:` field:
```
output:
  html_document:
    SETTINGS
    GO
    HERE
```
2. Add the following settings:
- Keep the `md` intermediate file with `keep_md: true`
- Add a theme. My favourite is cerulean: `theme: cerulean`
- Add a table of contents with `toc: true`
- Make the toc float: `toc_float: true`.
3. Knit the results (you may have to delete the pdf, because it is no longer up to date!)
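The same settings can also be supplied from the R console instead of the YAML header, which can be handy for quick experiments. This is only a sketch, assuming pandoc is available (RStudio bundles it); the file name `yaml-demo.Rmd` and its contents are made up for illustration.

```r
library(rmarkdown)

# A throwaway Rmd file so the example is self-contained.
rmd <- file.path(tempdir(), "yaml-demo.Rmd")
writeLines(c(
  "---",
  "title: \"YAML demo\"",
  "---",
  "",
  "# First section",
  "",
  "Some text."
), rmd)

# The settings from the activity, passed as arguments to html_document()
# rather than written under `output:` in the YAML header.
out <- render(
  rmd,
  output_format = html_document(
    keep_md = TRUE,      # keep the intermediate md file
    theme = "cerulean",  # bootswatch theme
    toc = TRUE,          # table of contents
    toc_float = TRUE     # ...that floats beside the text
  ),
  quiet = TRUE
)

# Path of the rendered html file.
out
```

For a document you'll knit more than once, the YAML header is the better home for these settings, because they travel with the file.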
### Activity: exploring chunk options (5 min)
(Bus stop! Couldn't get previous exercises to work? No problem, just start a fresh R Markdown document with File -> New File -> R Markdown)
Just like YAML is metadata for the Rmd document, _code chunk options_ are metadata for the code chunk. Specify them within the `{r}` at the top of a code chunk, separated by commas.
Together, in an Rmd file (ideally the same one we've been working on):
1. Hide the code from the output with `echo = FALSE`.
2. Prevent warnings from the chunk that loads packages with `warning = FALSE`.
3. Knit the results.
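Chunk options can also be set once for the whole document, rather than chunk by chunk, using `knitr::opts_chunk$set()`. Here's a sketch of what a setup chunk near the top of the Rmd might contain. Note that `message = FALSE` is an addition not covered in the activity: it hides package start-up messages, which are messages rather than warnings.

```r
# Applies to every chunk that comes after this one in the document.
knitr::opts_chunk$set(
  echo = FALSE,     # hide the code in the output
  warning = FALSE,  # hide warnings
  message = FALSE   # hide messages, such as package start-up notes
)

# Check the current default for one option.
knitr::opts_chunk$get("echo")
```

Options written in an individual chunk's `{r}` still win over these document-wide defaults.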
## Topic 4: Rmd Presentations
(3 min)
You can also make presentation slides using Rmd. A great resource is Yihui's [Rmd book, "Presentations" section](https://bookdown.org/yihui/rmarkdown/presentations.html).
Some types of formats:
- ioslides
- [xaringan](https://slides.yihui.name/xaringan/#1)
- [slidy](https://www.w3.org/Talks/Tools/Slidy2/#(1))
- [reveal.js](https://revealjs.com/#/)
- ...
### Activity: exploring ioslides (10 min)
Let's turn the file we've been working on into slides.
Together:
1. In RStudio, go to "File" -> "New File" -> "R Markdown" -> "Presentation" -> "ioslides". Explore!
2. Clear everything below the YAML header.
3. Copy and paste the tibble exploration we've been working on (without the YAML header), and turn it into slides.
## Wrap-up (3 min)
Push the following files to your GitHub repo:
1. `navigating_github.md` and its output formats.
2. The Rmd exploration and its output formats.
3. The Rmd presentation slides exploration and its output formats.
<!--chapter:end:cm003.Rmd-->
# The version control workflow
Today's topic is version control.
## Learning Objectives
From this lesson, students are expected to be able to demonstrate each piece of git/GitHub functionality listed here.
## Working with git and GitHub
Before we dive into concepts, it's important to distinguish between a __local__ repo and __remote__ repo.
- __Local__ refers to things on your own computer. A local repo is a repo found on your hard drive.
- __Remote__ refers to things on the internet. A remote repo lives on GitHub (and possibly other places).
Note that you can have more than one remote repo! Because of this, git names the remote repos. We will only ever be using one remote in STAT 545, and by default, this remote is named __origin__.
### Preliminary: configuring git (3 min)
You'll need to [configure git](http://happygitwithr.com/hello-git.html) using the command line.
Your RStudio will probably be able to "find" git. But if it can't, you'll encounter errors. See [happygitwithr: see-git](http://happygitwithr.com/rstudio-see-git.html) for help.
__Optional__ (but recommended): After class, you might want to [cache](http://happygitwithr.com/credential-caching.html) your credentials so that you don't have to keep inserting your password.
### The typical workflow (8 min)
The majority of your interaction with version control will be a pull/stage/commit/push workflow, explained here. For another resource on this, check out [happygitwithr: rstudio-git-github](http://happygitwithr.com/rstudio-git-github.html).
0. __Clone__ your repository if you don't have a local copy.
Once you have a local copy of the repository, then working on a project involves frequent use of these three steps:
1. __Pull__ the remote repo.
2. Make changes to your files, __stage__ and __commit__ them.
3. __Push__ the changes (after perhaps pulling again).
Git treats the remote repository as the "official" version of the repository. This means that your local copy is a second class citizen -- the repository you have locally must be up-to-date with the remote repository before you are allowed to push your work. If there are commits on the remote repository that are not present locally, git will throw an error if you try to push your changes.
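To see the workflow end to end without touching your real repositories, here's a sketch that runs the underlying git commands from R with `system2()`. It assumes git is installed and on your PATH. The helper `git()`, the name and email, and the temporary folders are all made up for illustration; a bare repository in a temporary folder plays the role of the remote on GitHub.

```r
# Hypothetical helper: run one git command inside the folder `dir`.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}

# Stand-ins for GitHub and for your computer.
remote <- tempfile("remote")  # plays the role of "origin"
local  <- tempfile("local")   # your local copy
system2("git", c("init", "--bare", shQuote(remote)), stdout = TRUE, stderr = TRUE)

# 0. Clone the remote repo, and set an identity for this repo only.
system2("git", c("clone", shQuote(remote), shQuote(local)),
        stdout = TRUE, stderr = TRUE)
git(local, "config", "user.name", shQuote("Jane Doe"))
git(local, "config", "user.email", "jane@example.org")

# 2. Make a change, then stage and commit it.
writeLines("# My participation repo", file.path(local, "README.md"))
git(local, "add", "README.md")
git(local, "commit", "-m", shQuote("Add a README"))

# 3. Push the commit to the remote named "origin".
git(local, "push", "-u", "origin", "HEAD")

# 1. Pull. In a real session this comes first, but the remote
#    was empty until the push above, so it is shown last here.
git(local, "pull")

# One line per commit.
git(local, "log", "--oneline")
```

In a terminal, the same steps are `git clone`, `git add`, `git commit -m`, `git push`, and `git pull`.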
Integrate version control as you do work!
- Workflow without version control: save your files spontaneously.
- Workflow with version control: save your files spontaneously, commit your changes after every "step" in your work, and push your changes [in case of fire](https://github.com/louim/in-case-of-fire).
Committing often ensures that you can trace back all the work you did. This results in transparency with the way your project has developed, which is a very effective workflow. But, from my experience at least, making half-baked work viewable might leave you feeling vulnerable. I encourage you to push past this.
### The typical workflow: Activity (5 min)
Let's make a change to our repository from local.
1. Clone your participation repo.
- In RStudio, File -> New Project -> Version Control -> Git.
- You should see a `Git` tab in RStudio, upper-right corner window. If not, see [happygitwithr: see-git](http://happygitwithr.com/rstudio-see-git.html) for help.
- Take a look at the files you just downloaded!
2. Make your README a little nicer. Maybe fix up the title.
3. Stage and commit the changes:
- In the Git tab in RStudio, click the checkboxes for the files that you want to commit. This is called "staging".
- Click the "Commit" button.
- Enter a commit message.
- Click "commit".
4. Push to your remote repository (which is named "origin")
- Click the up arrow in the Git panel in RStudio.
### Git Clients (3 min)
We just saw that RStudio can "talk to" git. But there are other ways we can use git locally. To "directly" interact with git, we type commands in the terminal (or bash) like `git clone`, `git commit`, etc. If you want to access the full functionality of git, you'll need to use the terminal.
Alternatively, there are _git clients_ that provide a visual dashboard for interacting with git. RStudio is one example. Others:
- GitHub Desktop
- Sourcetree
- GitKraken
- ...
In STAT 545, we'll be using the RStudio git client, but you can use whichever method you prefer.
### Merge conflicts (5 min)
If you change a file locally, and that same file (_and_ the same lines) get changed on the remote repository in a different way, you'll end up with a _merge conflict_ that you'll need to resolve. Remember that your local copy is a second class citizen compared to the remote version, so you'll have to resolve things locally before pushing to the remote.
### Merge conflicts: Activity (5 min)
Let's make a merge conflict, and fix it.
1. Edit a line of your README both locally and remotely to something different in both cases. Commit both changes.
2. Try pulling your remote changes. You'll get a _merge conflict_.
3. Update the file that has the conflict, commit your changes, and push.
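If you'd like to rehearse a merge conflict somewhere harmless first, here's a sketch that manufactures one in temporary folders. It assumes git is on your PATH; the helpers, the identity, and the folder names are made up for illustration, and a second clone stands in for an edit made on GitHub. When the pull hits the conflict, git exits with an error status, so R prints a warning, and git writes markers starting with `<<<<<<<`, `=======`, and `>>>>>>>` into the file.

```r
# Hypothetical helpers for this demo.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}
clone <- function(from, to) {
  system2("git", c("clone", shQuote(from), shQuote(to)),
          stdout = TRUE, stderr = TRUE)
  git(to, "config", "user.name", shQuote("Jane Doe"))
  git(to, "config", "user.email", "jane@example.org")
}
commit_readme <- function(dir, text, msg) {
  writeLines(text, file.path(dir, "README.md"))
  git(dir, "add", "README.md")
  git(dir, "commit", "-m", shQuote(msg))
}

remote <- tempfile("remote")  # stands in for GitHub
mine   <- tempfile("mine")    # your local copy
other  <- tempfile("other")   # stands in for editing on GitHub
system2("git", c("init", "--bare", shQuote(remote)), stdout = TRUE, stderr = TRUE)

# Starting point: one README, pushed to the remote.
clone(remote, mine)
commit_readme(mine, "Title", "Add README")
git(mine, "push", "-u", "origin", "HEAD")

# The "remote" edit: change the line elsewhere and push it.
clone(remote, other)
commit_readme(other, "Title edited remotely", "Edit title remotely")
git(other, "push")

# The local edit to the same line, then a pull that cannot merge cleanly.
commit_readme(mine, "Title edited locally", "Edit title locally")
git(mine, "pull", "--no-rebase")
conflicted <- readLines(file.path(mine, "README.md"))
conflicted  # both versions, separated by conflict markers

# Resolve: write the version you want, then stage, commit, and push.
commit_readme(mine, "Title agreed on", "Resolve merge conflict")
git(mine, "push")
```

Resolving always comes down to the same thing: edit the file until the markers are gone and the content is what you want, then commit.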
### Branching (8 min)
Branching is the idea of making commits somewhere other than your main repository on GitHub. Even cloning a repo is a type of branch.
Git and GitHub allow us to make branches _within a repository_, and we can do this both locally and on GitHub (although it seems the RStudio git client doesn't provide functionality for local merging). Let's check out the [STAT545-UBC.github.io](https://github.com/STAT545-UBC/STAT545-UBC.github.io/) repo (which is actually being phased out) to see some branches.
Eventually, you may want to bring a branch back into its predecessor. This is called __merging__. On GitHub specifically, a merge is proposed through a __pull request__ -- the idea here being that you'd like the predecessor branch to _pull_ the commits from the child branch, and you're _requesting_ it from the collaborators on your repository (which is sometimes just yourself). [Example from STAT545-UBC.github.io](https://github.com/STAT545-UBC/STAT545-UBC.github.io/pull/64) again. For more info on pull requests, see this [GitHub tutorial](https://help.github.com/articles/about-pull-requests/).
There are many reasons you may want to branch. Here are some:
- A collaborator wants to make a change to the repo, but the end product of the change requires review from collaborators.
- You want to make changes, but don't want to "deploy" the changes until later (such as if pushing to github triggers a website build).
- If you want to try something "risky", it's just safer to work on a branch.
### Branching: Activity (5 min)
Let's organize our participation repo in a branch.
1. Create a new branch locally, called "organizing" (we could have also made this on GitHub):
- Click the "Git" tab in the upper-right panel of RStudio
- Within that window's option bar, click ![](./img/branch.png).
- Name your branch and create!
2. Stage and commit the new files.
3. Restructure your repository in a more sensible way, using folders (locally).
4. Stage and commit the changes; push to GitHub.
5. Explore:
- switch between branches to see that the repo structure is different.
6. Merge the branch to "master" via GitHub by making a pull request.
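Here's a sketch of the same idea done entirely locally, by running git commands from R. It assumes git is on your PATH; the helper, the identity, and the file `cm001.md` are made up for illustration. Unlike the activity, the merge at the end happens locally rather than through a pull request on GitHub.

```r
# Hypothetical helper: run one git command inside the folder `dir`.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}

# A throwaway repo with a single committed file.
repo <- tempfile("repo")
system2("git", c("init", shQuote(repo)), stdout = TRUE, stderr = TRUE)
git(repo, "config", "user.name", shQuote("Jane Doe"))
git(repo, "config", "user.email", "jane@example.org")
writeLines("Notes from class 1", file.path(repo, "cm001.md"))
git(repo, "add", "cm001.md")
git(repo, "commit", "-m", shQuote("Add class notes"))

# Remember the name of the starting branch ("master" or "main").
main_branch <- git(repo, "rev-parse", "--abbrev-ref", "HEAD")

# Create the "organizing" branch, switch to it, and restructure there.
git(repo, "checkout", "-b", "organizing")
dir.create(file.path(repo, "cm001"))
git(repo, "mv", "cm001.md", "cm001/cm001.md")
git(repo, "commit", "-m", shQuote("Move notes into a folder"))

# Switch back: the old structure is still in place on the starting branch.
git(repo, "checkout", main_branch)
on_main <- file.exists(file.path(repo, "cm001.md"))
on_main

# Merge the branch, which brings the new structure over.
git(repo, "merge", "organizing")
merged <- file.exists(file.path(repo, "cm001", "cm001.md"))
merged
```

Switching branches rewrites the files in your folder, which is why the repo structure looks different depending on which branch is checked out.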
### Undoing Changes (5 min)
There are many ways that work can be "undone" in git. We will only investigate two of the simpler methods. For more advanced methods, like reverting to a previous commit, check out these resources by [bitbucket](https://www.atlassian.com/git/tutorials/undoing-changes) and [GitHub](https://blog.github.com/2015-06-08-how-to-undo-almost-anything-with-git/) -- you'll need to use the command line.
The two most useful "undo"s are:
1. Discarding your (uncommitted) work to return to the previous commit.
2. Browsing the repo at previous states, and taking files from there.
We'll demonstrate (1) in an activity.
### Undoing Changes: Activity (2 min)
Here's how to go back to the most recent commit.
1. First, make and save a change to (say) a README file in your participation repo.
2. In the Git panel of RStudio, stage the file that you want to return to the previous commit. Click "More" -> "Revert..." -> "Yes"
That's it!
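On the command line, discarding uncommitted changes to a file is done with `git checkout -- <file>`. Here's a sketch that tries it out on a throwaway repo, run from R. It assumes git is on your PATH; the helper, the identity, and the file contents are made up for illustration.

```r
# Hypothetical helper: run one git command inside the folder `dir`.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}

# A throwaway repo with one committed README.
repo <- tempfile("repo")
system2("git", c("init", shQuote(repo)), stdout = TRUE, stderr = TRUE)
git(repo, "config", "user.name", shQuote("Jane Doe"))
git(repo, "config", "user.email", "jane@example.org")
readme <- file.path(repo, "README.md")
writeLines("Original text", readme)
git(repo, "add", "README.md")
git(repo, "commit", "-m", shQuote("Add README"))

# Make and save a change, but do not commit it.
writeLines("A change I regret", readme)
git(repo, "status", "--short")  # README.md is listed as modified

# Discard the change: the file returns to its state at the last commit.
git(repo, "checkout", "--", "README.md")
readLines(readme)
```

Be careful with this one: the discarded change was never committed, so git has no record of it and cannot bring it back.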
### Getting errors? (3 min)
It's not unusual to experience some errors in git, especially if you're first learning how to use it. Try to get yourself unstuck with the concepts we've discussed here first.
But, you might find yourself stuck. The git documentation is full of jargon, making it difficult to read and therefore difficult to debug things. There's even a [parody](https://git-man-page-generator.lokaltog.net/) on it. If you are in this position, it's best to just [burn it all down](http://happygitwithr.com/burn.html). There's even an [xkcd comic](https://xkcd.com/1597/) on this.
### Tagging a Release (5 min)
Tagging a release on GitHub is like putting a "star" next to a particular commit. It highlights a particular point in time of your repository that is noteworthy, typically after achieving some milestone. It's just easier than having to manually keep track of noteworthy points in your commit history.
Examples:
- After every year of finishing STAT 545A/547M, we tag a release so that we can easily navigate to earlier versions of the course.
- After sufficient development of an R package like [ggplot2](https://github.com/tidyverse/ggplot2/releases), a new release is tagged corresponding to the version of the package.
### Tagging a Release: Activity (3 min)
Congratulations! We finished the first two weeks of STAT 545A, which focussed on _tools_. To mark this milestone, let's tag a release on our participation repositories.
1. On your GitHub repo, click "releases"
2. Click "Create a new release"
3. Fill in the fields:
- It probably makes sense to use a versioning system like `cm004` here.
4. "Publish Release".
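Under the hood, a GitHub release is attached to a git tag. Here's a sketch of creating such a tag locally, run from R. It assumes git is on your PATH; the helper, the identity, and the tag message are made up for illustration. A tag made this way would still need to be pushed (`git push origin cm004`) before GitHub could build a release from it.

```r
# Hypothetical helper: run one git command inside the folder `dir`.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}

# A throwaway repo with one commit to tag.
repo <- tempfile("repo")
system2("git", c("init", shQuote(repo)), stdout = TRUE, stderr = TRUE)
git(repo, "config", "user.name", shQuote("Jane Doe"))
git(repo, "config", "user.email", "jane@example.org")
writeLines("# Participation", file.path(repo, "README.md"))
git(repo, "add", "README.md")
git(repo, "commit", "-m", shQuote("Finish cm004"))

# An annotated tag marks the current commit as noteworthy.
git(repo, "tag", "-a", "cm004", "-m", shQuote("End of the tools unit"))

# List the tags in the repo.
tags <- git(repo, "tag", "--list")
tags
```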
<!--chapter:end:cm004.Rmd-->
# Intro to plotting with `ggplot2`, Part I
__Announcements__:
- Homework 1 is due tonight. Your work should be stored in your _homework_ repository, not your _participation_ repo! Also, please put your URL on Canvas.
__Recap__:
- Previous two weeks: software for data analytic work: git & GitHub, markdown, and R.
- Next three weeks: fundamental methods in exploratory data analysis: R tidyverse.
- Last two weeks (and STAT 547M): special topics in exploratory data analysis.
__Today__: Introduction to plotting with `ggplot2` (to be continued next Thursday).
__Worksheet__: You can find a worksheet template for today [here](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm005-exercise.Rmd).
Set up the workspace:
```{r, warning = FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(scales))
knitr::opts_chunk$set(fig.width = 5, fig.height = 2, fig.align = "center")
```
## Learning Objectives
By the end of this lesson, students are expected to be able to:
- Identify the plotting frameworks available in R
- Have a sense of why we're learning the `ggplot2` tool
- Have a sense of the importance of statistical graphics in communicating information
- Identify the seven components of the grammar of graphics underlying `ggplot2`
- Use different geometric objects and aesthetics to explore various plot types.
## Resources (2 min)
I learned `ggplot2` from Stack Overflow by googling error messages or "how to ... in ggplot2" queries, together with persistence. It might take you a bit longer to make a graph using `ggplot2` if you're unfamiliar with it, but persistence pays off.
Here are some good walk-throughs that introduce `ggplot2`, in a similar way to today's lesson:
- [r4ds: data-vis](http://r4ds.had.co.nz/data-visualisation.html) chapter.
- Perhaps the most compact "walk-through" style resource.
- The [ggplot2 book](http://webcat2.library.ubc.ca/vwebv/holdingsInfo?bibId=8489511), Chapter 2.
- A bit more comprehensive "walk-through" style resource.
- Section 1.2 introduces the actual grammar components.
- [Jenny's ggplot2 tutorial](https://github.com/jennybc/ggplot2-tutorial).
- Has a lot of examples, but less dialogue.
Here are some good resources to use as a reference:
- [`ggplot2` cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
- [R Graphics Cookbook](http://www.cookbook-r.com/Graphs/)
- Good as a reference if you want to learn how to make a specific type of plot.
## Orientation to plotting in R (7 min)
TL;DR: We're using `ggplot2` in STAT 545, and a little bit of plotly.
Traditionally, plots in R are produced using "base R" methods, the crown function here being `plot()`. This method tends to be quite involved, and requires a lot of "coding by hand".