-
Notifications
You must be signed in to change notification settings - Fork 0
/
more-querying-gbif.Rmd
393 lines (174 loc) · 9.5 KB
/
more-querying-gbif.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
---
title: "More Querying GBIF Data"
output: pdf_document
---
Your name here
Assignment 3
DSCI 245 / RNR 496B
Due: February 7th, 2022
## More Querying GBIF Data
### Learning objectives:
* Understand how to query GBIF data with spocc, using a variety of arguments.
* Practice some data cleaning strategies.
* Learn how to create a derivative data set with fewer columns.
* Successfully create a csv file from a data set.
### Querying data from GBIF with spocc
Picking up from last week, we're going to dive back into querying data from GBIF (Global Biodiversity Information Facility) using the spocc (SPecies OCCurrence) R package. First, let's load the spocc and tidyverse packages using the library() function:
```{r}
library("spocc")
library("tidyverse")
```
Now, let's revisit the "occ()" function. The documentation for this function is available here:
https://docs.ropensci.org/spocc/reference/occ.html
Let's start by revisiting how to run a basic query to get species occurrences from GBIF. The following example will gives us data from GBIF on Southern Black Widows (Latrodectus mactans):
```{r}
#Note the default limit is 500, so run your query once, view the total, and then adjust your limit so it exceeds the max, and run again.
blackWidows<-occ(query='Latrodectus mactans', from="gbif", limit=4000)
blackWidows
```
One thing we didn't look at last week was the "View()" function. If you recall, occ() returns an "object", which is basically a series of hierarchical data containing various pieces of information. We can navigate the object using the "$" operator, but it's only really useful if we know how to navigate the object. So let's run "View()" on blackWidows, and see what we get:
```{r}
View(blackWidows)
```
You'll notice this opens up a new tab in RStudio, and shows us what's inside the object, and how to navigate it. You can see by clicking the various arrows that we can navigate to the data by this path: blackWidows$gbif$data$ Latrodectus_mactans
```{r}
bwData<-blackWidows$gbif$data$Latrodectus_mactans
bwData
```
YOUR TURN:
* Pick a bird from the Internet Bird Collection: https://search.macaulaylibrary.org/catalog?sort=rating_rank_desc&collection=COLL58&cap=all&includeUnconfirmed=T
* Use the species name to query GBIF using the occ() function.
* Use the View() function to see the object hierarchy.
* Create a variable to target the data set from the object, and print it to the screen.
```{r}
# Create a variable set to the initial occ() query
# Use View() to see the object result
# Create another variable to target the data set, and then print to screen
```
### Further refining GBIF queries
The occ() function has many arguments to help refine our query. One particularly useful argument is "gbifopts", which provides a number of additional arguments specific to GBIF queries (remember, occ() can query multiple data sources...each one has their own 'opts' argument).
We can look at all the possibilities for "gbifopts" by running the following:
```{r}
occ_options('gbif')
```
You can contain gbifopts in an R "list". The syntax for using a list is:
list(key1="value1", key2="value2")
Now I'm going to perform a search for the American goldfinch (Spinus tristis), and use gbifopts to limit it by the state of Oregon, and the year 2020.
```{r}
goldfinchOr<-occ(query="Spinus tristis", from="gbif", gbifopts = list(stateProvince="Oregon", year="2020"), limit=34000)
goldfinchOr
```
...and like before, we'll use View(), and then drill down to get the data:
```{r}
View(goldfinchOr)
gfOrData<-goldfinchOr$gbif$data$Spinus_tristis
gfOrData
```
YOUR TURN:
Refine your bird query to just return occurrences in a country of your choosing, for the years 2020 and 2021 (hint: refer to occ_options('gbif') to determine how to query multiple years, and how to query by country). Adjust your limit once you see the total.
```{r}
# Run the occ() query
# use View() to see the object
# Define a new variable, and print to the screen.
```
### Analyzing/cleaning your data
Since humans are involved in recording and submitting the data, you can't necessarily assume it's all accurate. As such, we should probably write some scripts to "clean" the data.
First we want to make sure that all the data represents "real" occurrences. There are two columns of importance here:
occurrenceStatus - basically we're looking for the value "PRESENT". Sometimes the value is "ABSENT", if a researcher is doing periodic occurrence counts of a species. We want to remove these.
individualCount - here we want to remove any rows where the value is 0.
The strategy in both cases is to use the "unique()" function on each column. This will return all unique values for the given column. If it appears there are rows to remove, then we use the "subset()" function to isolate those rows, and then remove them from the data set.
Let's start with occurrenceStatus:
```{r}
unique(gfOrData$occurrenceStatus)
```
If any values in the returned list were other than "PRESENT", here are the steps we would take to remove the rows:
```{r}
# Here we use the subset function to get rows from the original data set based on a condition we specify
absent<-subset(x=gfOrData, occurrenceStatus !="PRESENT")
absent
# Now we use "anti_join" to redefine our data set by subtracting the "absent" rows from it.
gfOrData<-anti_join(gfOrData, absent)
gfOrData
```
YOUR TURN:
Check your bird data for unique values of the occurrenceStatus column. Remove any rows that don't contain the value "PRESENT".
```{r}
```
Now let's do the same thing for the individualCount column. In this case we're looking for rows with 0:
```{r}
unique(gfOrData$individualCount)
zeroFinches<-subset(x=gfOrData, individualCount==0)
gfOrData<-anti_join(gfOrData, zeroFinches)
gfOrData
```
YOUR TURN:
Check your bird data for any rows in which individualCount is 0, and remove them:
```{r}
```
Now we're going to check for any errors in coordinate data, and clean them up. It's not uncommon for coordinates to be off because of an erroneous "-" (either present or absent).
It's not a bad idea to first eyeball the coordinate data by looking at the points plotted on a map. With the goldfinch data, we expect all the coordinates to be in Oregon. Are they?
(First, this code will create a pretend error in the data, so you'll see what to look for)
```{r}
gfOrData<-gfOrData %>%mutate(latitude = replace(latitude, latitude==45.559186, -45.559186))
```
```{r}
#Don't worry too much about this code. We'll learn ggplot later in the term.
# For now, just know that this creates a map, with gfOrData using the longitude and latitude columns.
#plot data to get an overview
wm <- borders("world", colour="gray50", fill="gray50")
ggplot()+ coord_fixed()+ wm +
geom_point(data = gfOrData, aes(x = longitude, y = latitude),
colour = "darkred", size = 1.0)+
theme_bw()
```
So the strategy here would be to:
* fix any rows that simply have errors with "-" signs
* delete any rows with coordinates that are clearly way off
In both cases, it might be a good idea to use the "sort" function on latitude and longitude to look for outliers:
```{r}
# The [1:20] at the end means "just print the first 20 rows"
sort(gfOrData$latitude)[1:20]
# now in reverse order
sort(gfOrData$latitude, decreasing = TRUE)[1:20]
# can do same for longitude
```
To edit the value of a cell, one would use code that looks like this:
```{r}
# here we're finding a cell value of latitude with an erroneous "-", and replacing it with the "-" omitted
gfOrData<-gfOrData %>% mutate(latitude = replace(latitude, latitude==-45.559186, 45.559186))
```
To remove any bad data, we would use a similar strategy as with occurrenceStatus and individualCount:
* use subset() to target rows with bad data
* use anti_join() to remove them from the data set
YOUR TURN:
Plug your bird data into the map code below, and run it. Do you see any outliers that don't fit? If so, describe what you find with comments in the code chunk, and write some code to address the issues:
```{r}
# Replace YOURDATAHERE with your data set variable
wm <- borders("world", colour="gray50", fill="gray50")
ggplot()+ coord_fixed()+ wm +
geom_point(data = YOURDATAHERE, aes(x = longitude, y = latitude),
colour = "darkred", size = 1.0)+
theme_bw()
#How does it look? Add some comments to describe. If necessary, write some code to fix the issues.
```
Other things to check:
* Are they all the correct species?
* Are the dates accurate/within the range you query?
### Reducing and saving your data
Data queried from GBIF has 100 columns, which is probably a little much. In order to more easily work with the data and cut down on processing time, we can eliminate many columns, and then save to a file.
We can use the select() function to identify columns we want to keep:
```{r}
# with select(), we first identify the original data set, then identify a list of columns using the c() function
lessData<-select(gfOrData, c(name, longitude, latitude, scientificName, year, month, day, eventDate, individualCount))
lessData
```
Now we can write this to a csv file, which will appear in the Files tab.
```{r}
write_csv(lessData, "goldfinch.csv")
```
YOUR TURN:
Examine your bird data set, and identify the columns that might be useful. Use select() to make a data set with those columns, and write it to a csv:
```{r}
```
To turn in by midnight on Monday, February 7th:
Knit this document to a PDF, and load it to the Assignment 3 folder in Google Docs (check syllabus for URL).