-
Notifications
You must be signed in to change notification settings - Fork 2
/
02-first-steps.Rmd
394 lines (251 loc) · 12.1 KB
/
02-first-steps.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
# First steps
```{r, include=FALSE}
library(tidyverse)
```
## Exercises
There are none in this section.
## Exercises
**1.** List five functions that you could use to get more information about the `mpg` dataset.
- `help(mpg)`: Documentation of dataset
- `dim(mpg)`: Dimensions of dataset
- `summary(mpg)`: Summary measures of dataset
- `str(mpg)`: Display of the internal structure of dataset
- `glimpse(mpg)`: `dplyr` version of `str(mpg)`
<br>
**2.** How can you find out what other datasets are included with ggplot2?
- `data(package = "ggplot2")` loads the available data sets in ggplot2. Alternatively,if you have internet access, go to https://ggplot2.tidyverse.org/reference/index.html#section-data
<br>
**3.** Apart from the US, most countries use fuel consumption (fuel consumed over fixed distance) rather than fuel economy (distance travelled with fixed amount of fuel). How could you convert cty and hwy into the European standard of l/100km?
- According to [asknumbers](https://www.asknumbers.com/mpg-to-L100km.aspx), you divide 235.214583 by the mpg values in `cty` and `hwy` to convert them into the European standard of l/100km.
- Function to convert into European standard (Rademaker, 2016):
```{r}
mpgTol100km <- function(milespergallon){
GalloLiter <- 3.785411784
MileKilometer <- 1.609344
l100km <- (100*GalloLiter)/(milespergallon*MileKilometer)
l100km
}
```
<br>
**4.** Which manufacturer has the most models in this dataset? Which model has the most variations? Does your answer change if you remove the redundant specification of drive train (e.g. "pathfinder 4wd", "a4 quattro") from the model name?
```{r}
# Count manufacturers and sort
mpg %>% count(manufacturer, sort = TRUE)
```
- `dodge` has the most models in this dataset.
```{r}
unique(mpg$model)
```
- (Rademaker, 2016) The `a4` and `camry` both have a second model (the `a4 quattro` and the `camry solar`)
```{r}
# Remove redundant information (Rademaker, 2016)
str_trim(str_replace_all(unique(mpg$model), c("quattro" = "", "4wd" = "",
"2wd" = "", "awd" = "")))
```
<br>
## Exercises
**1.** How would you describe the relationship between `cty` and `hwy`? Do you have any concerns about drawing conclusions from that plot?
```{r}
mpg %>%
ggplot(aes(cty, hwy)) +
geom_point()
```
- The plot shows a strongly linear relationship, which tells me that `cty` and `hwy` are highly correlated variables. The only concern I have is that the points seem to be overlapping.
- There is not much insight to be gained except that cars which are fuel efficient on a highway are also fuel efficient in cities. This relationship is probably a function of speed (Rademaker, 2016)
<br>
**2.** What does `ggplot(mpg, aes(model, manufacturer)) + geom_point()` show? Is it useful? How could you modify the data to make it more informative?
```{r}
ggplot(mpg, aes(model, manufacturer)) +
geom_point()
```
- The plot shows the manufacturer of each model. Its not very readable since there are too many models and this clutters up the x-axis with too many ticks! I would just plot 20 or so models so that the graph is more readable. See below:
```{r}
mpg %>%
head(25) %>%
ggplot(aes(model, manufacturer)) +
geom_point()
```
- A possible alternative would be to look total number of observations for each manufacturer-model combination using geom_bar(). (Rademaker, 2016)
```{r}
df <- mpg %>%
transmute("man_mod" = paste(manufacturer, model, sep = " "))
ggplot(df, aes(man_mod)) +
geom_bar() +
coord_flip()
```
<br>
**3.** Describe the data, aesthetic mappings and layers used for each of the following plots. You'll need to guess a little because you haven't seen all the datasets and functions yet, but use your common sense! See if you can predict what the plot will look like before running the code.
1. `ggplot(mpg, aes(cty, hwy)) + geom_point()`
- *Data*: `mpg`
- *Aesthetic*: highway miles per gallon is mapped to y position and city miles per gallon is mapped to x position.
- *Layer*: points
2. `ggplot(diamonds, aes(carat, price)) + geom_point()`
- *Data*: `diamonds`
- *Aesthetic*: price in US dollars is mapped to y position, weight of the diamond is mapped to x position.
- *Layer*: points
3. `ggplot(economics, aes(date, unemploy)) + geom_line()`
- *Data*: `economics`
- *Aesthetic*: median duration of unemployment, in weeks, is mapped to y position and month of data collection is mapped to x position.
- *Layer*: line
(Rademaker, 2016) Alternatively, you can always access plot info using summary(<plot>) as in e.g.
```{r}
# summary(<plot>)
summary(ggplot(economics, aes(date, unemploy)) + geom_line())
```
<br>
## Exercises
**1.** Experiment with the colour, shape and size aesthetics. What happens when you map them to continuous values? What about categorical values? What happens when you use more than one aesthetic in a plot?
```{r}
# Map color to continuous value
mpg %>%
ggplot(aes(cty, hwy, color = displ)) +
geom_point()
# Map color to categorical value
mpg %>%
ggplot(aes(cty, hwy, color = trans)) +
geom_point()
# Use more than one aesthetic in a plot
mpg %>%
ggplot(aes(cty, hwy, color = trans, size = trans)) +
geom_point()
```
<br>
**2.** What happens if you map a continuous variable to shape? Why? What happens if you map trans to shape? Why?
```{r, eval = FALSE}
mpg %>%
ggplot(aes(cty, hwy, shape = displ)) +
geom_point()
```
- I can not map a continuous variable to shape and I get an error message: `Error: A continuous variable can not be mapped to shape`
```{r}
mpg %>%
ggplot(aes(cty, hwy, shape = trans)) +
geom_point()
```
- I get an warning message that tells me the shape palette can only deal with 6 discrete values.
<br>
**3.** How is drive train related to fuel economy? How is drive train related to engine size and class?
```{r}
mpg %>%
group_by(drv) %>%
summarise(mean_cty = mean(cty)) %>%
ggplot(aes(drv, mean_cty)) +
geom_col()
mpg %>%
group_by(drv) %>%
summarise(mean_hwy = mean(hwy)) %>%
ggplot(aes(drv, mean_hwy)) +
geom_col()
```
- Front-wheel drive has the best fuel economy, then 4wd, then rear wheel drive.
```{r}
mpg %>%
ggplot(aes(drv, displ, fill = class)) +
geom_col(position = "dodge")
```
- 4wd has biggest engine size, then front-wheel, then rear wheel drive. Out of all 4wd, suvs have biggest engine size. Out of all front-wheel drive, midsize has biggest engine size. Out of all rear wheel drive, 2 seater has biggest engine size.
<br>
## Exercises
**1.** What happens if you try to facet by a continuous variable like hwy? What about cyl? What's the key difference?
```{r}
mpg %>%
ggplot(aes(drv, displ, fill = class)) +
geom_col(position = "dodge") +
facet_wrap(~hwy)
mpg %>%
ggplot(aes(drv, displ, fill = class)) +
geom_col(position = "dodge") +
facet_wrap(~cyl)
```
- The key difference is `hwy` is a continuous variable that has 27 unique values, so you get 27 different subsets. However, `cly` is a categorical variable and has 4 unique values, so `cyl` only has 4 different subsets. It is less cluttered when you try to facet.
- (Rademaker, 2016) Facetting by a continous variable works but becomes hard to read and interpret when the variable that we facet by has to many levels.
<br>
**2.** Use faceting to explore the 3-way relationship between fuel economy, engine size, and number of cylinders. How does faceting by number of cylinders change your assessement of the relationship between engine size and fuel economy?
```{r}
mpg %>%
ggplot(aes(displ, cty)) +
geom_point()
mpg %>%
ggplot(aes(displ, cty)) +
geom_point() +
facet_wrap(~cyl)
```
- When I initially plot engine size and fuel economy, I see an overall decreasing linear relationship. Upon faceting, I see that the decreasing relationship is mostly seen in the 4 cylinder subset. In the other cylinder subsets, we see a flat relationship - as engine displacement increases, fuel economy remains constant.
<br>
**3.** Read the documentation for `facet_wrap()`. What arguments can you use to control how many rows and columns appear in the output?
- I can use the arguments `nrow, ncol` to control how many rows and columns appear in the output.
<br>
**4.** What does the `scales` argument to `facet_wrap()` do? When might you use it?
- It allows users to decide whether scales should be fixed. I would use it whenever different subsets of the data are on vastly different scales.
- (Rademaker, 2016) If we want to compare across facets, `scales = "fixed"` is more appropriate. If our focus is on individual patterns within each facet, setting `scales = "free"` might be more approriate.
<br>
## Exercises
**1.** What's the problem with the plot created by `ggplot(mpg, aes(cty, hwy)) + geom_point()`? Which of the geoms described above is most effective at remedying the problem?
```{r}
ggplot(mpg, aes(cty, hwy)) +
geom_point()
```
- The problem is overplotting.
Solution 1. Use `geom_jitter` to add random noise to the data and avoid overplotting.
```{r}
ggplot(mpg, aes(cty, hwy)) +
geom_jitter()
```
Solution 2. (Rademaker, 2016) Set opacity with `alpha`
```{r}
ggplot(mpg, aes(cty, hwy)) +
geom_point(alpha = 0.3)
```
<br>
**2.** One challenge with `ggplot(mpg, aes(class, hwy)) + geom_boxplot()` is that the ordering of `class` is alphabetical, which is not terribly useful. How could you change the factor levels to be more informative?
```{r}
mpg %>%
mutate(class = factor(class),
class = fct_reorder(class, hwy)) %>%
ggplot(aes(class, hwy)) +
geom_boxplot()
```
- We could convert `class` to a factor and reorder it by `hwy`
<br>
**3.** Explore the distribution of the carat variable in the `diamonds` dataset. What binwidth reveals the most interesting patterns?
```{r}
diamonds %>%
ggplot(aes(carat)) +
geom_histogram(binwidth = 0.3)
```
- This is a subjective answer, but binwidth of 0.2 or 0.3 reveals that the distribution of `carat` is heavily skewed to the right. This means that most diamonds carats are between 0 and 1.
<br>
**4.** Explore the distribution of the price variable in the `diamonds` data. How does the distribution vary by cut?
```{r}
diamonds %>%
ggplot(aes(price)) +
geom_histogram()
diamonds %>%
mutate(cut = fct_reorder(cut, price)) %>%
ggplot(aes(cut, price)) +
geom_boxplot()
ggplot(diamonds, aes(x = price, y =..density.., color = cut)) +
geom_freqpoly(binwidth = 200)
```
- (Rademaker, 2016) Fair quality diamonds are more expensive then others. Possible reason is they are bigger.
<br>
**5.** You now know (at least) three ways to compare the distributions of subgroups: `geom_violin()`, `geom_freqpoly()` and the colour aesthetic, or `geom_histogram()` and faceting. What are the strengths and weaknesses of each approach? What other approaches could you try?
- According to the book, `geom_violin()` shows a compact representation of the "density" of the distribution, highlighting the areas where more points are found. Its weakness is that violin plos rely on the calculation of a density estimate, which is hard to interpret.
- According to the book, `geom_freqploy()` bins the data, then counts the number of observations in each bin using lines. One possible weakness is that you have to select the width of the bins yourself by experimentation.
- According to the book, `geom_histogram()` and faceting makes it easier to see the distribution of each group, but makes comparisons between groups a little harder.
<br>
**6.** Read the documentation for `geom_bar()`. What does the `weight` aesthetic do?
```{r, eval = FALSE}
?geom_bar()
```
- The `weight` aesthetic converts the number of cases to a weight and makes the height of the bar proportional to the sum of the weights. See below:
```{r}
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
```
<br>
**7.** Using the techniques already discussed in this chapter, come up with three ways to visualize a 2d categorical distribution. Try them out by visualising the distribution of `model` and `manufacturer`, `trans` and `class`, and `cyl` and `trans`.
- Not sure