-
Notifications
You must be signed in to change notification settings - Fork 6
/
exercise_4_solution.Rmd
358 lines (276 loc) · 21.6 KB
/
exercise_4_solution.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
---
title: 'Exercise Solutions'
output:
html_document:
toc: false
code_folding: hide
---
```{r setup, echo=FALSE, purl = FALSE}
knitr::opts_chunk$set(echo=TRUE, message = FALSE, warning = FALSE, eval = FALSE, cache = FALSE)
```
\
## Exercise 4: Visualising data using base R and lattice graphics
\
Read [Chapter 4](https://intro2r.com/graphics-base-r.html) to help you complete the questions in this exercise.
\
1\. As in previous exercises, either create a new R script or continue with your previous R script in your RStudio Project. Again, make sure you include any metadata you feel is appropriate (title, description of task, date of creation etc) and don't forget to comment out your metadata with a `#` at the beginning of the line.
\
2\. Download the data file *‘squid1.xlsx’* from the **[<i class="fa fa-download"></i> Data](data.html)** link and save it to the `data` directory you created during exercise 1. Open this file in Microsoft Excel (or even better use an open source equivalent - [LibreOffice](https://www.libreoffice.org/download/download/) is a good free alternative) and save it as a tab delimited file type. Name the file *‘squid1.txt'* and save it to the `data` directory.
\
3\. These data were originally collected as part of a study published in Aquatic Living Resources^1^ in 2005. The aim of the study was to investigate the seasonal patterns of investment in somatic and reproductive tissues in the long finned squid *Loligo forbesi* caught in Scottish waters. Squid were caught monthly from December 1989 - July 1991 (`month` and `year` variables). After capture, each squid was given a unique `specimen` code, weighed (`weight`) and the sex determined (`sex` - only female squid are included here). The size of individuals was also measured as the dorsal mantle length (`DML`) and the mantle weight measured without internal organs (`eviscerate.weight`). The gonads were weighed (`ovary.weight`) along with the accessory reproductive organ (the nidamental gland, `nid.weight`, `nid.length`). Each individual was also assigned a categorical measure of maturity (`maturity.stage`, ranging from 1 to 5 with 1 = immature, 5 = mature). The digestive gland weight (`dig.weight`) was also recorded to assess nutritional status of the individual. If you're not familiar with squid morphology and are interested in finding out more see [here](https://en.wikipedia.org/wiki/Cephalopod_size).
\
4\. Import the *'squid1.txt'* file into R using the `read.table()` function and assign it to a variable named `squid`. Use the `str()` function to display the structure of the dataset and the `summary()` function to summarise the dataset. How many observations are in this dataset? How many variables? The `year`, `month` and `maturity.stage` variables were coded as integers in the original dataset. Here we would like to code them as factors. Create a new variable for each of these variables in the `squid` dataframe and recode them as factors. Use the `str()` function again to check the coding of these new variables.
```{r Q4, results = 'asis'}
squid <- read.table('workshop/data/squid1.txt', header =TRUE,
stringsAsFactors = TRUE)
str(squid)
# 'data.frame': 519 obs. of 13 variables:
# $ sample.no : int 105128901 105128901 105128901 105128901 ...
# $ specimen : int 1002 1003 1005 1007 1008 1009 1011 1013 ...
# $ year : int 1989 1989 1989 1989 1989 1989 1989 1989 ...
# $ month : int 12 12 12 12 12 12 12 12 12 12 ...
# $ weight : num 152 106 138 141 126 ...
# $ sex : int 2 2 2 2 2 2 2 2 2 2 ...
# $ maturity.stage : int 3 1 2 2 3 1 2 3 3 4 ...
# $ DML : int 174 153 169 175 169 116 135 192 170 205 ...
# $ eviscerate.weight: num 87.5 62.6 79.4 83.1 72.2 ...
# $ dig.weight : num 4.648 3.138 0.307 4.123 3.605 ...
# $ nid.length : num 39.4 24.1 39 41.4 39.8 20 14 55 44 53 ...
# $ nid.weight : num 2.46 0.319 1.169 1.631 2.03 ...
# $ ovary.weight : num 1.68 0.103 0.289 0.252 0.86 ...
summary(squid)
# convert variables to factors
squid$Fmaturity <- factor(squid$maturity.stage)
squid$Fmonth <- factor(squid$month)
squid$Fyear <- factor(squid$year)
str(squid)
# 'data.frame': 519 obs. of 16 variables:
# $ sample.no : int 105128901 105128901 105128901 ...
# $ specimen : int 1002 1003 1005 1007 1008 1009 ...
# $ year : int 1989 1989 1989 1989 1989 1989 ...
# $ month : int 12 12 12 12 12 12 12 12 12 12 ...
# $ weight : num 152 106 138 141 126 ...
# $ sex : int 2 2 2 2 2 2 2 2 2 2 ...
# $ maturity.stage : int 3 1 2 2 3 1 2 3 3 4 ...
# $ DML : int 174 153 169 175 169 116 135 ...
# $ eviscerate.weight: num 87.5 62.6 79.4 83.1 72.2 ...
# $ dig.weight : num 4.648 3.138 0.307 4.123 3.605 ...
# $ nid.length : num 39.4 24.1 39 41.4 39.8 20 14 ...
# $ nid.weight : num 2.46 0.319 1.169 1.631 2.03 ...
# $ ovary.weight : num 1.68 0.103 0.289 0.252 0.86 ...
# $ Fmaturity : Factor w/ 5 levels "1","2","3","4" "5"...
# $ Fmonth : Factor w/ 12 levels "1","2","3","4" ...
# $ Fyear : Factor w/ 3 levels "1989","1990",..1 ...
```
\
5\. How many observations are there per month and year combination (hint: remember the `table()` or `xtabs()` functions?)? Don't forget to use the factor recoded versions of these variables. Do you have data for each month in each year? Which years have the most observations? (optional) Use a combination of the `xtabs()` and `ftable()` functions to create a flattened table of the number of observations for each year, maturity stage and month (aka a contingency table).
```{r Q5}
table(squid$Fmonth, squid$Fyear)
# 1989 1990 1991
# 1 0 3 37
# 2 0 0 30
# 3 0 40 29
# 4 0 10 33
# 5 0 1 30
# 6 0 0 14
# 7 0 42 1
# 8 0 29 0
# 9 0 82 0
# 10 0 19 0
# 11 0 76 0
# 12 12 31 0
ftable(xtabs(~ Fyear + Fmaturity + Fmonth, data = squid))
# Fmonth 1 2 3 4 5 6 7 8 9 10 11 12
# Fyear Fmaturity
# 1989 1 0 0 0 0 0 0 0 0 0 0 0 2
# 2 0 0 0 0 0 0 0 0 0 0 0 3
# 3 0 0 0 0 0 0 0 0 0 0 0 5
# 4 0 0 0 0 0 0 0 0 0 0 0 2
# 5 0 0 0 0 0 0 0 0 0 0 0 0
# 1990 1 0 0 0 0 0 0 8 0 1 1 1 2
# 2 0 0 0 0 0 0 22 21 76 17 31 4
# 3 0 0 0 0 0 0 0 5 5 1 31 6
# 4 2 0 15 7 0 0 4 3 0 0 10 13
# 5 1 0 25 3 1 0 8 0 0 0 3 6
# 1991 1 0 0 0 2 0 4 0 0 0 0 0 0
# 2 1 1 0 1 0 6 0 0 0 0 0 0
# 3 2 0 0 1 1 0 0 0 0 0 0 0
# 4 16 8 6 13 6 1 1 0 0 0 0 0
# 5 18 21 23 16 23 3 0 0 0 0 0 0
```
\
6\. The humble cleveland dotplot is a great way of identifying if you have potential outliers in continuous variables (See [Section 4.2.4](https://intro2r.com/simple-base-r-plots.html#dotcharts)). Create dotplots (using the `dotchart()` function) for the following variables; `DML`, `weight`, `nid.length` and `ovary.weight`. Do these variables contain any unusually large or small observations? Don't forget, if you prefer to create a single figure with all 4 plots you can always split your plotting device into 2 rows and 2 columns (see [Section 4.4](https://intro2r.com/mult_graphs.html#mult_graphs) of the book). Use the `pdf()` function to save a pdf version of your plot(s) in your `output` directory you created in Exercise 1 (see [Section 4.5](https://intro2r.com/export_plots.html#export_plots) of the book to see how the `pdf()` function works). I have also included some alternative code in the [solutions for this exercise](exercise_4_solution.html) using the `dotplot()` function from the `lattice` package.
```{r Q6}
pdf('figures/ex4_dotplots.pdf')
par(mfrow = c(2, 2))
dotchart(squid$DML, main = "DML")
dotchart(squid$weight, main = "weight")
dotchart(squid$nid.length, main = "nid length")
dotchart(squid$ovary.weight, main = "ovary weight")
dev.off()
# alternative code using dotplot function from lattice package
library(lattice)
dotplot(as.matrix(squid[,c("DML", "weight", "nid.length", "ovary.weight")]),
groups=FALSE,
strip = strip.custom(bg = 'white',
par.strip.text = list(cex = 0.8)),
scales = list(x = list(relation = "free"),
y = list(relation = "free"),
draw = FALSE),
col=1, cex =0.5, pch = 16,
xlab = "Value of the variable",
ylab = "Order of the data from text file")
```
\
7\. It looks like the variable `nid.length` contains an unusually large value. Actually, this value is biologically implausible and clearly an error. The researchers were asked to go back and check their field notebooks and sure enough they discover a typo. It looks like a tired researcher accidentally inserted a zero by mistake when transcribing these data (mistakes in data are very common and why we **always** explore, check and validate any data we are working on). We can clearly see this value is over 400 so we can use the `which()` function to identify which observation this is `which(squid$nid.length > 400)`. It looks like this is the 11^th^ observation of the `squid$nid.length` variable. Use your skill with the square brackets `[ ]` to first confirm the this is the correct value (you should get 430.2) and then change this value to 43.2. Now redo the dotchart to visualise your correction. Caution: You can only do this because you have confirmed that this is an transcribing error. You should **not** remove or change values in your data just because you feel like it or they look 'unusual'. This is scientific fraud! Also, the advantage of making this change in your R script rather than in Excel is that you now have a permanent record of the change you made and can state a clear reason for the change.
```{r Q7}
which(squid$nid.length > 400)
# [1] 11
squid$nid.length[11]
# [1] 430.2
squid$nid.length[11] <- 43.2
squid$nid.length[11]
# [1] 43.2
dotchart(squid$nid.length, main = "nid length")
```
\
8\. When exploring your data it is often useful to visualise the distribution of continuous variables. Take a look at [Section 4.2.2](https://intro2r.com/simple-base-r-plots.html#histograms) and then create histograms for the variables; `DML`, `weight`, `eviscerate.weight` and `ovary.weight`. Again, its up to you if you want to plot all 4 plots separately or in the same figure. Export your plot(s) as pdf file(s) to the `output` directory. One potential problem with histograms is that the distribution of data can look quite different depending on the number of 'breaks' used. The `hist()` function does it's best to create appropriate 'breaks' for your plots (it uses the [Sturges algorithm](https://en.wikipedia.org/wiki/Histogram) for those that want to know) but experiment with changing the number of breaks for the `DML` variable to see how the shape of the distribution changes (see [Section 4.2.2](https://intro2r.com/simple-base-r-plots.html#histograms) of the book for further details of how to change the breaks).
```{r Q8}
pdf('workshop/figures/ex4_hist.pdf')
par(mfrow = c(2,2))
hist(squid$DML, main="", xlab = "DML")
hist(squid$weight, main="", xlab = "weight")
hist(squid$eviscerate.weight, main="", xlab = "eviscerate weight")
hist(squid$ovary.weight, main="", xlab = "ovary weight")
dev.off()
# need to get the min and max values for DML
# to work out the limits for the breaks
summary(squid$DML)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 88 187 217 215 240 323
# experimenting with different breaks
par(mfrow = c(2,2))
brk1 <- seq(from = 80, to = 340, by = 20)
hist(squid$DML, xlab = "DML", breaks = brk1, main = "brk: 20")
brk2 <- seq(from = 80, to = 340, by = 10)
hist(squid$DML, xlab = "DML", breaks = brk2, main = "brk: 10")
brk3 <- seq(from = 80, to = 340, by = 5)
hist(squid$DML, xlab = "DML", breaks = brk3, main = "brk: 5")
brk4 <- seq(from = 80, to = 340, by = 2)
hist(squid$DML, xlab = "DML", breaks = brk4, main = "brk: 2")
```
\
9\. Scatterplots are great for visualising relationships between two continuous variables ([Section 4.2.1](https://intro2r.com/simple-base-r-plots.html#scatterplot)). Plot the relationship between `DML` on the x axis and `weight` on the y axis. How would you describe this relationship? Is it linear? One approach to linearising relationships is to apply a transformation on one or both variables. Try transforming the `weight` variable with either a natural log (`log()`) or square root (`sqrt()`) transformation. I suggest you create new variables in the `squid` dataframe for your transformed variables and use these variables when creating your new plots (ask if you're not sure how to do this). Which transformation best linearises this relationship? Again, save your plots as a pdf file (or try saving in your `output` directory as jpeg or png format using the `jpeg()` or `png()` functions - [Section 4.5](https://intro2r.com/export_plots.html#export_plots) if you feel the need for a change!).
```{r Q9}
# clearly not linear
plot(squid$DML, squid$weight)
# natural log and sqrt tranform weight
squid$weight.sqrt <- sqrt(squid$weight)
squid$weight.log <- log(squid$weight)
par(mfrow = c(1,2))
plot(squid$DML, squid$weight.sqrt)
plot(squid$DML, squid$weight.log)
# the square root transformation looks
# most appropriate
jpeg('output/ex4_transf_plot.jpeg')
plot(squid$DML, squid$weight.sqrt)
dev.off()
png('output/ex4_transf_plot.png')
plot(squid$DML, squid$weight.sqrt)
dev.off()
```
\
10\. When visualising differences in a continuous variable between levels of a factor (categorical variable) then a boxplot is your friend (avoid using bar plots - Google 'bar plots are evil' for more info). Create a boxplot to visualise the differences in DML at each maturity stage (don't forget to use the recoded version of this variable you created in Q4) . Include x and y axes labels in your plot. Make sure you understand the anatomy of a boxplot before moving on - please ask if you're not sure (also see [Section 4.2.3](https://intro2r.com/simple-base-r-plots.html#box-and-violin-plots) of the book). An alternative to the boxplot is the violin plot. A violin plot is a combination of a boxplot and a [kernel density plot](https://www.statmethods.net/graphs/density.html) and is great at visualising the distribution of a variable. To create a violin plot you will first need to install the `vioplot` package from CRAN and make it available using `library(vioplot)`. You can now use the `vioplot()` function in pretty much the same way as you created your boxplot (again [Section 4.2.3](https://intro2r.com/simple-base-r-plots.html#box-and-violin-plots) of the book walks you through this).
```{r Q10, tidy = TRUE}
# note: Fmaturity is the recoded maturity.stage variable cerated in Q4
boxplot(DML ~ Fmaturity, data = squid, xlab = "maturity stage", ylab = "DML")
# violin plot
library(vioplot)
vioplot(DML ~ Fmaturity, data = squid, xlab = "maturity stage", ylab = "DML" , col = "lightblue")
```
\
11\. To visualise the relationship between two continuous variables but for different levels of a factor variable you can create a conditional scatterplot. Use the `coplot()` function ([Section 4.2.6](https://intro2r.com/simple-base-r-plots.html#coplots)) to plot the relationship between DML on the x axis and square root transformed weight on the y axis (you created this variable in Q8) for each level of maturity stage. Does the relationship between DML and weight look different for each maturity stage (suggesting an interaction)? If you prefer, you can also create a similar plot using the `xyplot()` function ([Section 4.2.7](https://intro2r.com/simple-base-r-plots.html#lattice-plots)) from the `lattice` package (don't forget to make the function available by using `library(lattice)` first).
```{r Q11}
coplot(weight.sqrt ~ DML | Fmaturity, data = squid)
# using xyplot from the lattice package
library(lattice)
xyplot(weight.sqrt ~ DML | Fmaturity, data = squid)
```
\
12\. To explore the relationships between multiple continuous variables it's hard to beat a pairs plot. Create a pairs plot for the variables; `DML`, `weight`, `eviscerate.weight`, `ovary.weight`, `nid.length`, and `nid.weight` (see [Section 4.2.5](https://intro2r.com/simple-base-r-plots.html#pairs-plots) of the book for more details). If it looks a little cramped in RStudio then click on the 'zoom' button in the plot viewer to see a larger version. One of the great things about the `pairs()` function is that you can customise what goes into each panel. Modify your pairs plot to include a histogram of the variables on the diagonal panel and include a correlation coefficient for each relationship on the upper triangle panels. Also include a smoother (wiggly line) in the lower triangle panels to help visualise these relationships. Take a look at the Introduction to R book to see how to do all this (or `?pairs`).
```{r Q12, tidy = TRUE}
# vanilla pairs plot
pairs(squid[,c(5, 8, 9, 11, 12, 13)])
# customise the plot. You will need to define the
# panel.hist and panel.cor functions first. see the
# course manual or ?pairs help file
pairs(squid[,c(5, 8, 9, 11, 12, 13)], diag.panel = panel.hist, upper.panel = panel.cor,
lower.panel = panel.smooth)
```
\
13\. Almost every aspect of the figures you create in R is customisable. Learning how to get your plot looking just right is not only rewarding but also means that you will never have to include a plot in your paper/thesis that you're not completely happy with. When you start learning how to use R it can sometimes seem to take a lot of work to customise your plots. Don't worry, it gets easier with experience (most of the time anyway) and you will always have your code if you want to create a similar plot in the future. Use the `plot()` function to produce a scatterplot of DML on the x axis and ovary weight on the y axis (you might need to apply a transformation on the variable `ovary.weight`). Use a different colour to highlight points for each level of maturity stage. Produce a legend explaining the different colours and place it in a suitable position on the plot. Format the graph further to make it suitable for inclusion into your paper/thesis (i.e. add axes labels, change the axes scales etc). See [Section 4.3](https://intro2r.com/custom_plot.html#custom_plot) for more details about customising plots. I have given you a few different solutions in the code below. give them a go!
```{r Q13a}
# quick and dirty way
# need to transform ovary.weight first
squid$ovary.weight.sqrt <- sqrt(squid$ovary.weight)
# create the plot
with(squid, plot(DML, ovary.weight.sqrt, xlab = "DML (mm)",
ylab = "square root ovary weight (g)",
col = as.numeric(Fmaturity),
xlim = c(60, 350), ylim = c(0, 8.5)))
# create the legend
labs <- c("stage 1", "stage 2", "stage 3", "stage 4","stage 5")
cols <- as.numeric(levels(squid$Fmaturity))
legend("topleft", labs,col = cols, pch = 1)
```
```{r Q13b}
# longer but more control
# need to scales package to set transparency of points
# will the alpha function
library(scales)
#setup the axes but dont plot the points
with(squid, plot(DML, ovary.weight.sqrt, xlab = "DML (mm)",
ylab = "square root ovary weight (g)",
type = "n", xlim = c(60, 350), ylim = c(0, 8.5)))
# plot the points with custom colours
with(squid, points(DML[Fmaturity == "1"], ovary.weight.sqrt[Fmaturity == "1"],
col = alpha("deepskyblue3", 0.7), pch = 16))
with(squid, points(DML[Fmaturity == "2"], ovary.weight.sqrt[Fmaturity == "2"],
col = alpha("darkolivegreen3", 0.7), pch = 16))
with(squid, points(DML[Fmaturity == "3"], ovary.weight.sqrt[Fmaturity == "3"],
col = alpha("coral3", 0.7), pch = 16))
with(squid, points(DML[Fmaturity == "4"], ovary.weight.sqrt[Fmaturity == "4"],
col = alpha("lemonchiffon3", 0.7), pch = 16))
with(squid, points(DML[Fmaturity == "5"], ovary.weight.sqrt[Fmaturity == "5"],
col = alpha("darkorchid3", 0.7), pch = 16))
# include the legend
labs <- c("stage 1", "stage 2", "stage 3", "stage 4","stage 5")
cols <- c("deepskyblue3", "darkolivegreen3", "coral3",
"lemonchiffon3", "darkorchid3")
legend(55, 8.2, labs,col = alpha(cols, 0.7), pch = 16, bty = "n")
```
```{r Q13c}
# or use the ggplot2 package
# square root transform ovary weight
squid$ovary.weight.sqrt <- sqrt(squid$ovary.weight)
library(ggplot2)
ggplot(data = squid) +
geom_point(aes(x = DML, y = ovary.weight.sqrt, colour = Fmaturity),
alpha = 0.8, size = 2) +
scale_colour_manual(values = c("deepskyblue3", "darkolivegreen3", "coral3",
"lemonchiffon3", "darkorchid3"),
labels = c("stage 1", "stage 2", "stage 3",
"stage 4", "stage 5")) +
theme_classic(base_size = 12) +
labs(colour = "", x = "DML (mm)", y = "square root ovary weight (g)")
# OR
ggplot(data = squid) +
geom_point(aes(x = DML, y = ovary.weight.sqrt, colour = Fmaturity),
alpha = 0.8, size = 2) +
theme_classic() +
labs(colour = "", x = "DML (mm)", y = "square root ovary weight (g)")
```
\
^1^ Smith JM et al (2005) Seasonal patterns of investment in reproductive and somatic tissues in the squid *Loligo forbesi*, Aquatic Living Resources. 18, 341–351.
\
End of Exercise 4