-
Notifications
You must be signed in to change notification settings - Fork 0
/
graphical_data_exploration_exercise.Rmd
255 lines (177 loc) · 15 KB
/
graphical_data_exploration_exercise.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
---
title: 'Exercises'
output:
html_document:
toc: false
code_folding: hide
---
```{r setup, echo=FALSE, purl = FALSE}
knitr::opts_chunk$set(echo=TRUE, message = FALSE, warning = FALSE, eval = FALSE, cache = FALSE)
SOLUTIONS <- FALSE
```
\
## Exercise: Graphical data exploration using R
\
1\. Start RStudio on your computer. If you haven't already done so, create a new RStudio Project (select File --> New Project on the main menu). Create the Project in a new directory by selecting 'New Directory' and then select 'New Project'. Give the Project a suitable name ('pgr_stats' maybe) in the 'Directory name:' box and choose where you would like to create this Project directory by clicking on the 'Browse' button. Finally create the project by clicking on the 'Create Project' button. This will be your main RStudio Project file and directory which you will use throughout this course. See [Section 1.6](https://intro2r.com/rsprojs.html#rsprojs) of the Introduction to R book for more information about RStudio Projects and [here](https://alexd106.github.io/PGR-LM/howto.html#rstudio_proj-vid) for a short video.
Now create a new R script inside this Project by selecting File –> New File –> R Script from the main menu (or use the shortcut button). Before you start writing any code save this script by selecting File –> Save from the main menu. Call this script ‘graphical_data_exploration’ or something similar. Click on the ‘Files’ tab in the bottom right RStudio pane to see whether your file has been saved in the correct location. Ok, at the top of almost every R script (there are very few exceptions to this!) you should include some metadata to help your collaborators (and the future you) know who wrote the script, when it was written and what the script does (amongst other things). Include this information at the top of your R script making sure that you place a # at the beginning of every line to let R know this is a comment. See [Section 1.10](https://intro2r.com/proj-doc.html) for a little more detail.
\
2\. If you haven't already, download the data file *'loyn.xlsx'* from the **[<i class="fa fa-download"></i> Data](data.html)** link and save it to the `data` directory. Open this file in Microsoft Excel (or even better use an open source equivalent - [LibreOffice](https://www.libreoffice.org/download/download/) is a good free alternative) and save it as a tab delimited file type. Name the file *'loyn.txt'* and also save it to the `data` directory.
\
3\. These data are from a study originally conducted by Loyn (1987)^1^ and subsequently re-analysed by Quinn and Keough (2002)^2^ and Zuur et al (2009)^3^. Note, I have had to do some slight 'tweaking' of these data to improve usability for this course. The aim of the study was to relate bird density in 67 forest patches to a number of different environmental variables and management practices. A summary of the variables is: **ABUND**: Density of birds, continuous response variable; **AREA**: Size of forest patch, continuous explanatory variable; **DIST**: Distance to nearest patch, continuous explanatory; **LDIST**: Distance to nearest larger patch, continuous explanatory; **ALT**: Mean altitude of patch, continuous explanatory; **YR.ISOL**: Year of isolation of clearance, continuous explanatory; **GRAZE**: Index of livestock grazing intensity, 5 level categorical explanatory 1= low graze, 5 = high graze. Copy this information to your R script (make sure you comment it out with a `#` - can you remember the keyboard [shortcut](https://intro2r.com/proj_doc.html#proj_doc)?) and clearly highlight which variable is the response variable and which variables are potential explanatory variables.
\
4\. Import your tab delimited file from Q2 (*'loyn.txt'*) into R using the `read.table()` function and assign it to an object called `loyn` (checkout [Section 3.3.2](https://intro2r.com/importing-data.html#import_fnc) if you need a reminder). Use the `str()` function to display the structure of the dataset and the `summary()` function to summarise the dataset. Copy and paste the output of `str()` and `Summary()` to your R code as a record. Don't forget to comment this code with a `#` at the beginning of each line (use the keyboard to comment code blocks [shortcut](https://intro2r.com/proj_doc.html#proj_doc)?).
How many observations are in this dataset? How many variables does the dataframe contain?
Are there any missing values (coded as `NA`) in any variable?
How is the variable `GRAZE` coded? (as a number or a string?). If you think this will cause a problem (hint: it will!), create a new variable called `FGRAZE` **in the dataframe** with `GRAZE` recoded as a factor. See [here](https://intro2r.com/data-types.html#data-types) to see how to convert/coerce a numeric variable into a factor (TL;DR: use the `as.factor()` or `factor()` function).
```{r Q4, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
loyn <- read.table("./data/loyn.txt", header = TRUE,
stringsAsFactors = TRUE)
str(loyn)
# 67 observations and 8 variables (from str())
summary(loyn)
# GRAZE is coded as numeric (i.e. 1,2,3,5)
# create a new factor variable variable FGRAZE which is a factor of GRAZE
loyn$FGRAZE <- factor(loyn$GRAZE)
```
\
5\. Use the function `table()` (or `xtabs()`) to determine how many observations were recorded for each `FGRAZE` category (level). See [section 3.5](https://intro2r.com/summarising-data-frames.html) of the Introduction to R book to remind yourself how to do this.
```{r Q5, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
table(loyn$FGRAZE)
# or use xtabs function
xtabs(~ FGRAZE, data = loyn)
```
\
6\. Using the `tapply()` function what is the mean bird abundance (`ABUND`) for each level of `FGRAZE`?
Can you also determine the variance for each `FGRAZE` level? Again see [section 3.5](https://intro2r.com/summarising-data-frames.html) of the Introduction to R book to remind yourself how to do this.
```{r Q6, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# mean abundance of birds for each level of FGRAZE
tapply(loyn$ABUND, loyn$FGRAZE, mean, na.rm = TRUE)
# variance in the abundance of birds for each level of FGRAZE
tapply(loyn$ABUND, loyn$FGRAZE, var, na.rm = TRUE)
# OR use the summary function
tapply(loyn$ABUND, loyn$FGRAZE, summary, na.rm = TRUE)
```
\
7\. Now onto some plotting action. Plot a Cleveland dotchart ([Section 4.2.4](https://intro2r.com/simple-base-r-plots.html#dotcharts)) of each **numeric** variable separately to assess whether there are any outliers (unusually large or small values) in the response variable (`ABUND`) or any of the continuous explanatory variables (see Q3).
If you feel in the mood, output these plots to an external PDF file in an `output` directory within your RStudio project (don't forget to create the directory first if required). If you would like to include all of the plots on a single 'page' (technically a device) then you can split your page into two rows and three columns using `par(mfrow = c(2,3))` before you run your plot code for each plot. Make a note of which variables contain outliers.
```{r Q7a, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# first split the plotting device into 2 rows
# and 3 columns
par(mfrow = c(2,3))
# now produce the plots
dotchart(loyn$AREA, main = "Area")
dotchart(loyn$DIST, main = "Distance")
dotchart(loyn$LDIST, main = "Distance to larger patch")
dotchart(loyn$YR.ISOL, main = "Year of isolation")
dotchart(loyn$ALT, main = "Altitude")
dotchart(loyn$GRAZE, main = "Grazing levels")
```
\
```{r Q7b, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# A fancier version of a dotplot - just for fun!
Z <- cbind(loyn$ABUND, loyn$AREA, loyn$DIST,
loyn$LDIST,loyn$YR.ISOL,loyn$ALT,
loyn$GRAZE)
colnames(Z) <- c("Abundance", "Area","Distance",
"larger dist","year of isolation",
"Altitude", "Grazing")
library(lattice)
dotplot(as.matrix(Z),
groups=FALSE,
strip = strip.custom(bg = 'white',
par.strip.text = list(cex = 0.8)),
scales = list(x = list(relation = "free"),
y = list(relation = "free"),
draw = FALSE),
col=1, cex =0.5, pch = 16,
xlab = "Value of the variable",
ylab = "Order of the data from text file")
```
8\. If you do spot any unusual observations have a think about what you want to do with them (NOTE: do **not** just remove them without justification!). If you're unsure, please speak to an instructor to discuss your options during the practical session. Perhaps you should apply a data transformation to see if this reduces the magnitude of any outlier. The best thing to do here is to play around with different transformations (i.e. `log10`, `sqrt`) to see which transformation does what you want it to do. When applying a data transformation to a variable, it's best practice to create a new variable **in your dataframe** to contain your transformed variable rather than overwrite your original data.
After you have applied these data transformations make sure you re-plot your dotcharts with any transformed variable to double check whether the transformation is doing something sensible. Hint: a log~10~ transformation might help reduce the magnitude of the outliers for some of the variables.
```{r Q8, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# There appears to be two unusually large forest patches compared to the rest
# Also one potentially large distance in DIST
# One option would be to log10 transform AREA, DIST
# log base 10 transform variables
loyn$LOGAREA <- log10(loyn$AREA)
loyn$LOGDIST <- log10(loyn$DIST)
# check the dataframe
str(loyn)
# first split the plotting device into 2 rows
# and 3 columns
par(mfrow = c(1,2))
# now plot the transformed variables
dotchart(loyn$LOGAREA, main = "LOG Area")
dotchart(loyn$LOGDIST, main = "LOG Distance")
```
\
9\. Next, check if there is any potential collinearity between any of the **explanatory variables**. Remember, collinearity is *strong* relationships between your explanatory variables. Plot these variables using the `pairs()` function ([Section 4.2.5](https://intro2r.com/simple-base-r-plots.html#pairs-plots)).
You will need to extract your explanatory variables from the `loyn` dataframe (using `[]`) either before you use the `pairs()` function or whilst using it (don't forget to plot the transformed versions of any variables from Q8).
Optionally, include the correlation coefficient between variables in the upper panel of the pairs plot (see [section 4.2.5](https://intro2r.com/simple-base-r-plots.html#pairs-plots) of the introduction to R book for details) to help you decide whether collinearity is an issue.
```{r Q9a, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# Vanilla pairs plot
pairs(loyn[,c("LOGAREA", "LOGDIST", "LDIST",
"YR.ISOL", "ALT", "GRAZE")])
# or first create a new dataframe and then use this
# data frame with the pairs function
explan_vars <- loyn[,c("LOGAREA", "LOGDIST", "LDIST",
"YR.ISOL", "ALT", "GRAZE")]
pairs(explan_vars)
```
```{r Q9b, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# And with correlations in the upper panel
# first need to define the panel.cor function
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
# then use the panel.cor function when we use pairs
pairs(loyn[,c("LOGAREA","LOGDIST", "LDIST",
"YR.ISOL","ALT","GRAZE")],
upper.panel = panel.cor)
```
\
10\. Now that we've checked for collinearity let's assess whether there are any clear relationships between the response variable (`ABUND`) and individual explanatory variables. Use appropriate plotting functions (`plot()`, `boxplot()`, `pairs()` etc) to visualise these relationships.
Don't forget, if you have applied a data transformation to any of your variables (Q8) you will need to plot these transformed variables instead of the original variables.
Also, don't forget, you can split your plotting device up to allow you to plot multiple graphs ([Section 4.4](https://intro2r.com/mult_graphs.html#mult_graphs)) or again use a function like `pairs()` to create a multi-panel plot. Output these plots to the `output` directory as PDFs. Add some comments in your R code to summarise your findings.
```{r Q10a, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
pairs(loyn[,c("ABUND","LOGAREA","LOGDIST", "LDIST",
"YR.ISOL","ALT","GRAZE")],
lower.panel = panel.cor)
```
```{r Q10b, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
plot(loyn$LOGAREA, loyn$ABUND, xlab = "log area", ylab = "bird abundance")
```
\
11\. One of the main aims of this study was to determine whether management practices such as grazing intensity (`GRAZE`) and size of the forest (`AREA`) affected the abundance of birds (`ABUND`). One hypothesis was that the size of the forest affected the number of birds, but this was dependent on the intensity of the grazing regime (in other words, there is an interaction between `AREA` and `GRAZE` - don't worry if you haven't heard of an interaction term, we will go through this later on in the course).
Use an appropriate plotting function to explore these data for such an interaction (perhaps a `coplot()` or `xyplot()` in [Section 4.2.6](https://intro2r.com/simple-base-r-plots.html#coplots) might be helpful?).
Again, don't forget, if you have applied a data transformation to your `AREA` variable you need to use the transformed variable in this plot **not** the original `AREA` variable. Likewise, as we've converted the `GRAZE` variable into a factor type variable we should use our new factor `FGRAZE` (or whatever you have called it).
Save this plot as a PDF to your `output` directory and add some comments to your R code to describe any patterns you observe.
```{r Q11a, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# Interaction between LOGAREA and FGRAZE?
# Do the slopes look similar or different?
coplot(ABUND ~ LOGAREA | FGRAZE, data = loyn)
```
\
```{r Q11b, eval=SOLUTIONS, echo=SOLUTIONS, collapse=TRUE}
# Fancier version of the above plot
# with a line of best fit included just for fun
coplot(ABUND ~ LOGAREA | FGRAZE,
data = loyn,
panel = function(x, y, ...) {
tmp <- lm(y ~ x, na.action = na.omit)
abline(tmp)
points(x, y) })
```
\
^1^ Loyn, R. (1987). Effects of patch area and habitat on bird abundances, species numbers and tree health in fragmented Victoria forests. Nature conservation: the role of remnants of native vegetation. 65-77.
^2^ Quinn, G. P., and Michael J. Keough. 2002. Experimental design and data analysis for biologists. Cambridge, UK: Cambridge University Press.
^3^ Zuur, A.F., Ieno, E.N. and Elphick, C.S. (2010), A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution, 1: 3-14. doi:10.1111/j.2041-210X.2009.00001.x
\
End of Graphical data exploration using R Exercise