-
Notifications
You must be signed in to change notification settings - Fork 1
/
04-identifiers.Rmd
96 lines (77 loc) · 3.52 KB
/
04-identifiers.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
```{r echo=FALSE}
knitr::opts_chunk$set(
comment = "#>",
warning = FALSE,
message = FALSE
)
```
# Identifiers {#identifiers}
Taxonomic identifiers are a central concept in `taxize` - and around which
many things depend on. We often either need to get to an identifier for each
name in question, or take an identifier to do something else, for example
get a taxonomic classification.
(_Note_: See [Progress and state](#progress-state) for details on keeping track
of progress maintaining state in `get_` functions, and functions that use
`get_` functions under the hood.)
## get functions
There's a family of functions, each targeted at a specific data source, for
getting taxonomic identifiers from names. They all start with the prefix
`get_`, and end with an abbreviation for the data source. They are:
```{r echo=FALSE, comment=NA, results='asis'}
cat(paste(" -", paste(sprintf("`%s`", grep("get_genes|get_seqs", grep("get_[A-Za-z]+$", getNamespaceExports("taxize"), value = TRUE), value = TRUE, invert=TRUE)), collapse = "\n - ")))
```
Another set of related functions with a trailing underscore (e.g., `get_uid_()`) are
available for all data sources, but they do not do interactive usage, but instead
provide all data. That is, if you use e.g., `get_tsn()` and there is more than
one option returned from ITIS, you will be given a prompt which will require
your input. If you want to avoid potential prompts, use `get_tsn_()` instead,
and then manipulate/filter data yourself.
## get functions return objects
The returned data from `get_*` functions is either an `NA_character_` if no match,
or a character string or numeric (some data providers use numeric taxonomic
identifiers, while others use alphanumeric identifiers) if a match. The length can
be greater than one as all the functions are vectorized.
There are a number of attributes attached to the output, where each attributes
length equals the length of the id vector:
* class: the class of the object
* match: whether a match was found or not
* multiple_matches: whether there were multiple matches or not
* pattern_match: whether there was a match by pattern (i.e., regex)
* uri: the URI/URL for the taxon
An example return from a `get_*` function:
```r
[1] "50157568"
attr(,"class")
[1] "tpsid"
attr(,"match")
[1] "found"
attr(,"multiple_matches")
[1] TRUE
attr(,"pattern_match")
[1] FALSE
attr(,"uri")
[1] "http://tropicos.org/Name/50157568"
```
## identifier notes
* Taxonomic identifiers are unique to the data source. That is, the taxonomic
ID for any individual taxon in data source A is different from that in data
source B, and different from that in data source C, etc. For example, the
taxonomic ID for _Pinus contorta_ in NCBI is `3339`, the ID in ITIS is
`183327`, and the ID in COL is `13c40c3f2be0b0965bddf948d2b3115f`.
* Taxonomic identifiers vary among data sources in their structure. That is,
some are numeric (only numbers), some are alphanumeric (combination of numbers
and letters), and one has an odd structure with letters, numbers and symbols
(Nature Serve, e.g., `ELEMENT_GLOBAL.2.147775`). Most data sources provide
numeric identifiers. Wikipedia is unusual in that there are no
numeric/alphanumeric/etc. taxonomic identifiers for each name - instead, the
"identifier" __IS__ the name.
* ...
## Looking up taxa on the web
If you want to look up the taxon page for any taxa that you've passed through
`get_*` functions you can do so by using the `uri` attributes returned from
the function.
For example:
```{r eval = FALSE}
x <- get_tsn('Pinus contorta')
browseURL(attr(x, 'uri'))
```