forked from UBC-STAT/stat-545-guidebook
# Intro to plotting with `ggplot2`, Part I
__Announcements__:
- Homework 1 is due tonight. Your work should be stored in your _homework_ repository, not your _participation_ repo! Also, please put your URL on Canvas.
__Recap__:
- Previous two weeks: software for data analytic work: git & GitHub, markdown, and R.
- Next three weeks: fundamental methods in exploratory data analysis: R tidyverse.
- Last two weeks (and STAT 547M): special topics in exploratory data analysis.
__Today__: Introduction to plotting with `ggplot2` (to be continued next Thursday).
__Worksheet__: You can find a worksheet template for today [here](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm005-exercise.Rmd).
Set up the workspace:
```{r, warning = FALSE}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(scales))
knitr::opts_chunk$set(fig.width = 5, fig.height = 2, fig.align = "center")
```
## Learning Objectives
By the end of this lesson, students are expected to be able to:
- Identify the plotting frameworks available in R
- Have a sense of why we're learning the `ggplot2` tool
- Have a sense of the importance of statistical graphics in communicating information
- Identify the seven components of the grammar of graphics underlying `ggplot2`
- Use different geometric objects and aesthetics to explore various plot types.
## Resources (2 min)
I learned `ggplot2` from Stack Overflow by googling error messages or "how to ... in ggplot2" queries, together with persistence. It might take you a bit longer to make a graph using `ggplot2` if you're unfamiliar with it, but persistence pays off.
Here are some good walk-throughs that introduce `ggplot2`, in a similar way to today's lesson:
- [r4ds: data-vis](http://r4ds.had.co.nz/data-visualisation.html) chapter.
    - Perhaps the most compact "walk-through" style resource.
- The [ggplot2 book](http://webcat2.library.ubc.ca/vwebv/holdingsInfo?bibId=8489511), Chapter 2.
    - A bit more comprehensive "walk-through" style resource.
    - Section 1.2 introduces the actual grammar components.
- [Jenny's ggplot2 tutorial](https://github.com/jennybc/ggplot2-tutorial).
    - Has a lot of examples, but less dialogue.
Here are some good resources to use as a reference:
- [`ggplot2` cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)
- [R Graphics Cookbook](http://www.cookbook-r.com/Graphs/)
    - Good as a reference if you want to learn how to make a specific type of plot.
## Orientation to plotting in R (7 min)
TL;DR: We're using `ggplot2` in STAT 545, and a little bit of plotly.
Traditionally, plots in R are produced using "base R" methods, the crown function here being `plot()`. This method tends to be quite involved, and requires a lot of "coding by hand".
Then, an R package called `lattice` was created that aimed to make it easier to create multiple "panels" of plots. It seems to have fallen by the wayside in the R community. Personally, I found that using this package often involved several lines of code to set up a plot, which then needed to get overridden by "special cases".
After `lattice` came `ggplot2`, which provides a very powerful and relatively simple framework for making plots. It has a theoretical underpinning, too, based on the Grammar of Graphics, first described by Leland Wilkinson in his ["Grammar of Graphics" book](http://resolve.library.ubc.ca/cgi-bin/catsearch?bid=5507286). With `ggplot2`, you can make a great many types of plots with minimal code. It's been a hit in and outside of the R community.
Check out [this comparison of the three](http://www.jvcasillas.com/base_lattice_ggplot/) by Joseph V. Casillas.
A newer tool is called [plotly](https://plot.ly/), which was actually developed outside of R, but the `plotly` R package accesses the plotly functionality. Plotly graphs allow for interactive exploration of a plot. You can convert ggplot2 graphics to a plotly graph, too.
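As a rough sketch of that conversion, assuming the `plotly` package is installed (it is not loaded in the setup chunk above): `plotly::ggplotly()` takes a ggplot object and returns an interactive version of it. The object names `p_static` and `p_interactive` are placeholders chosen for this illustration.

```{r}
library(ggplot2)
library(gapminder)

# An ordinary static ggplot2 scatterplot.
p_static <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1)

# ggplotly() converts the ggplot object into an interactive plotly widget.
# Printing p_interactive in an HTML document gives hover tooltips and zooming.
p_interactive <- plotly::ggplotly(p_static)
```

Interactive widgets only make sense in HTML output, so this is something to try in a knitted HTML document or the RStudio viewer rather than a PDF.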
## Just plot it (7 min)
The human visual cortex is a powerful thing. If you want to draw someone's attention to a bunch of numbers, I can assure you that you won't elicit any "aha" moments by displaying a large table [like this](https://i.stack.imgur.com/2JdLt.png), either in a report or (especially!) a presentation. Make a plot to communicate your message.
If you really feel the need to tell your audience exactly what every quantity evaluates to, consider putting your table in an appendix. Because chances are, the reader doesn't care about the exact numeric values. Or, perhaps you just want to point out one or a few numbers, in which case you can put that number directly on a plot.
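One way to put a single number directly on a plot might look like the sketch below. Which observation to highlight (here, the highest life expectancy in `gapminder`), the label format, and the names `top` and `p_label` are all illustrative choices, not part of the lesson.

```{r}
library(ggplot2)
library(gapminder)

# Pick out the single observation worth calling attention to:
# here, the row with the highest life expectancy in the data set.
top <- gapminder[which.max(gapminder$lifeExp), ]

p_label <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  # Highlight that one point ...
  geom_point(data = top, colour = "red") +
  # ... and print its value next to it (hjust > 1 pushes the text to the left).
  geom_text(
    data = top,
    aes(label = paste0(country, ", ", year, ": ", round(lifeExp, 1))),
    hjust = 1.1, size = 3
  ) +
  scale_x_log10()

p_label
```

The reader gets the one exact value that matters, while the rest of the data stays visual.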
[Challenger example from Jenny Bryan](https://speakerdeck.com/jennybc/ggplot2-tutorial?slide=7).
## The grammar of graphics (15 min)
You can think of the grammar of graphics as a systematic approach for describing the components of a graph. It has seven components (the ones in bold are required to be specified explicitly in `ggplot2`):
- __Data__
    - Exactly as it sounds: the data that you're feeding into a plot.
- __Aesthetic mappings__
    - This is a specification of how you will connect variables (columns) from your data to a visual dimension. These visual dimensions are called "aesthetics", and can be (for example) horizontal positioning, vertical positioning, size, colour, shape, etc.
- __Geometric objects__
    - This is a specification of what object will actually be drawn on the plot. This could be a point, a line, a bar, etc.
- Scales
    - This is a specification of how a variable is mapped to its aesthetic. Will it be mapped linearly? On a log scale? Something else?
- Statistical transformations
    - This is a specification of whether and how the data are combined/transformed before being plotted. For example, in a bar chart, data are transformed into their frequencies; in a box-plot, data are transformed to a five-number summary.
- Coordinate system
    - This is a specification of how the position aesthetics (x and y) are depicted on the plot. For example, rectangular/cartesian, or polar coordinates.
- Facet
    - This is a specification of data variables that partition the data into smaller "sub plots", or panels.
These components are like parameters of statistical graphics, defining the "space" of statistical graphics. In theory, there is a one-to-one mapping between a plot and its grammar components, making this a useful way to specify graphics.
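To make the seven components concrete, here is a sketch that writes each one out as its own piece of `ggplot2` code. Faceting by `continent` is an arbitrary choice for illustration, and `stat = "identity"` and `coord_cartesian()` are already the defaults, spelled out here only so that every component is visible. The name `p_grammar` is a placeholder.

```{r, fig.height = 4}
library(ggplot2)
library(gapminder)

p_grammar <- ggplot(
  data = gapminder,                              # data
  mapping = aes(x = gdpPercap, y = lifeExp)      # aesthetic mappings
) +
  geom_point(stat = "identity", alpha = 0.1) +   # geometric object + statistical transformation
  scale_x_log10() +                              # scale
  coord_cartesian() +                            # coordinate system
  facet_wrap(~ continent)                        # facet

p_grammar
```

Only the first three pieces (data, aesthetic mappings, geometric object) have to be supplied; dropping any of the other lines still gives a valid plot, because `ggplot2` falls back on defaults.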
### Example: Scatterplot grammar
For example, consider the following plot from the `gapminder` data set. For now, don't focus on the code, just the graph itself.
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")
```
This scatterplot has the following components of the grammar of graphics.
| Grammar Component | Specification |
|-----------------------|---------------|
| __data__ | `gapminder` |
| __aesthetic mapping__ | __x__: `gdpPercap`, __y__: `lifeExp` |
| __geometric object__ | points |
| scale | x: log10, y: linear |
| statistical transform | none |
| coordinate system | rectangular |
| faceting | none |
Note that `x` and `y` aesthetics are required for scatterplots (or "point" geometric objects). In general, each geometric object has its own required set of aesthetics.
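If you are curious which aesthetics a geom requires, one way to check (a sketch, not something you need for the exercises) is to look at the `required_aes` field of the object behind the geom, such as `GeomPoint` or `GeomText`. The second half shows what happens when a required aesthetic is left out; the strings stored in `result` are made up for this illustration.

```{r}
library(ggplot2)
library(gapminder)

# Each geom is backed by an object that lists its required aesthetics.
GeomPoint$required_aes
GeomText$required_aes

# Leaving out a required aesthetic (here, y) is an error,
# raised when the plot is built rather than when it is defined.
missing_y <- ggplot(gapminder, aes(x = gdpPercap)) + geom_point()
result <- tryCatch(
  {
    ggplot_build(missing_y)
    "built without error"
  },
  error = function(e) "error: a required aesthetic is missing"
)
result
```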
### Activity: Bar chart grammar
Fill out __Exercise 1: Bar Chart Grammar (Together)__ in your worksheet.
Click [here](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm005-exercise.Rmd) if you don't have it yet.
## Working with `ggplot2` (40 min)
First, the `ggplot2` package comes with the `tidyverse` meta-package. So, loading that is enough.
There are two main ways to interact with `ggplot2`:
1. The `qplot()` or `quickplot()` functions (the two are identical): useful for making a quick plot if you have vectors stored in your workspace that you'd like to plot. Usually not worth using, and deprecated as of `ggplot2` 3.4.0.
2. The `ggplot()` function: use to access the full power of `ggplot2`.
Let's use the above scatterplot as an example to see how to use the `ggplot()` function.
First, the `ggplot()` function takes two main arguments:
- `data`: the data frame containing your plotting data.
- `mapping`: aesthetic mappings applying to the entire plot. Expecting the output of the `aes()` function.
Notice that the `aes()` function has `x` and `y` as its first two arguments, so we don't need to explicitly name these aesthetics.
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp))
```
This just _initializes_ the plot. You'll notice that the aesthetic mappings are already in place. Now, we need to add components by adding layers, literally using the `+` sign. These layers are functions that have further specifications.
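A small sketch to confirm this, using the same call with both arguments and both aesthetics named explicitly. It assumes the plot object's pieces can be inspected with `$`, which is an implementation detail and not something the lesson relies on; `p_init` is a placeholder name.

```{r}
library(ggplot2)
library(gapminder)

# Same call as above, with both arguments and both aesthetics named.
p_init <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# The aesthetic mappings are stored on the plot object ...
p_init$mapping

# ... but no layers have been added yet, so nothing is drawn inside the axes.
length(p_init$layers)
```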
For our next layer, let's add a geometric object to the plot. These functions have the syntax `geom_SOMETHING()`. There's a bit of overplotting, so we can specify some alpha transparency using the `alpha` argument (you can interpret `alpha` as needing `1/alpha` points overlaid to achieve an opaque point).
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1)
```
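Note that `alpha` here is _set_ to a constant, outside of `aes()`, rather than _mapped_ to a variable. As an aside, the sketch below contrasts the two within a single `geom_point()` layer; mapping colour to `continent` is just an illustrative choice, and `p_set_vs_map` is a placeholder name.

```{r}
library(ggplot2)
library(gapminder)

p_set_vs_map <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(
    aes(colour = continent),  # mapped: colour varies with a column of the data
    alpha = 0.1               # set: the same constant for every point
  )

p_set_vs_map
```

Anything inside `aes()` is looked up in the data; anything outside it is taken literally.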
That's the only `geom` that we want to add. Now, let's specify a scale transformation, because the plot would really benefit if the x-axis were on a log scale. These functions take the form `scale_AESTHETIC_TRANSFORM()`. As usual, you can tweak this layer, too, using this function's arguments. In this example, we're re-naming the x-axis (the first argument), and changing the labels to have a dollar format (a handy function thanks to the `scales` package).
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format())
```
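To see what the scale is doing to the data, here is a sketch that uses `layer_data()` to pull out the values `ggplot2` actually plots. The names `p_log` and `built_x` are placeholders; the values are sorted before comparing so that the check does not depend on row order.

```{r}
library(ggplot2)
library(gapminder)

p_log <- ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10()

# The x positions that ggplot2 draws are log10(gdpPercap) ...
built_x <- layer_data(p_log)$x
all.equal(sort(built_x), sort(log10(gapminder$gdpPercap)))

# ... while y, which has no transformation, is left as-is.
all.equal(sort(layer_data(p_log)$y), sort(gapminder$lifeExp))
```

The axis labels are still shown in the original units, which is why a log scale is usually easier to read than plotting a pre-computed `log10(gdpPercap)` column.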
I'm tired of seeing the grey background, so I'll add a `theme()` layer. I like `theme_bw()`. Then, I'll re-label the y-axis using the `ylab()` function. Et voilà!
```{r}
ggplot(gapminder, aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.1) +
  scale_x_log10("GDP per capita", labels = scales::dollar_format()) +
  theme_bw() +
  ylab("Life Expectancy")
```
### Activity: Plotting
1. Go to your worksheet
2. Set up the workspace by following the instructions in the "Preliminary" section.
3. Fill out __Exercise 2: `ggplot2` Syntax (Your Turn)__ in your worksheet.
_Bus stop_: Did you lose track of where we are? You can still do the exercise!
- Click [here](https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/tutorials/cm005-exercise.Rmd) to obtain the worksheet if you don't have it.
- You're all set! Hint: use the information from this section ("Working with `ggplot2`") to complete the exercise.