-
Notifications
You must be signed in to change notification settings - Fork 20
/
1_05_numeric_vectors.Rmd
230 lines (191 loc) · 16 KB
/
1_05_numeric_vectors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
# Numeric vectors
## Introduction
The term "data structure" is computer science jargon for a particular way of organising data on our computers, so that it can be accessed easily and efficiently. Computer languages use many different kinds of data structures, but fortunately, we only need to learn about a couple of relatively simple ones to use R for data analysis. In fact, in this book only two kinds of data structure really matter: "vectors" and "data frames". We'll learn how to work with data frames in the next section of the book.
The next three chapters will consider vectors. This chapter has two goals. First, we want to learn the basics of how to work with vectors. For example, we'll see how "vectorised operations" may be used to express a repeated calculation. Second, we'll learn how to construct and use __numeric vectors__ to perform various calculations. Keep in mind that although we're going to focus on numeric vectors, many of the principles we learn here can be applied to the other kinds of vectors considered later.
## Atomic vectors
A vector is a 1-dimensional object that contains a set of data values, which are accessible by their position: position 1, position 2, position 3, and so one. When people talk about vectors in R they're often referring to __atomic vectors__^[The other common vector is called a "list". Lists are very useful but we won't cover them in this book]. An atomic vector is the simplest kind of data structure in R. There are a few different kinds of atomic vector, but the defining feature of each one is that it can only contain data of one type. An atomic vector might contain all integers (e.g. 2, 4, 6, ...) or all characters (e.g. "A", "B", "C"), but it can't mix and match integers and characters (e.g. "A", 2, "C", 5).
The word "atomic" in the name refers to the fact that an atomic vector can't be broken down into anything simpler---they are the simplest kind of data structure R knows about. Even when working with a single number we're actually dealing with an atomic vector. It just happens to be of length one. Here's the very first expression we evaluated in the first chapter:
```{r}
1 + 1
```
Look at the output. What is that `[1]` at the beginning? We ignored it before because we weren't in a position to understand its significance. The `[1]` is a clue that the output resulting from `1 + 1` is a atomic vector. We can verify this with the `is.vector` and `is.atomic` functions:
```{r}
x <- 1 + 1
# what value is associated with x?
x
# is it a vector?
is.vector(x)
# is it atomic?
is.atomic(x)
```
This little exercise demonstrates an important point about R. Atomic vectors really are the simplest kind of data structure in R. Unlike many other languages there is simply no way to represent just a number. Instead, a single number must be stored as a vector of length one^[The same is true for things like sets of characters (`"dog"`, `"cat"`, `"fish"`, ...) and logical values (`TRUE` or `FALSE`) discussed in the next two chapters.].
## Numeric vectors {#intro-vectors}
A lot of work in R involves __numeric vectors__. After all, data analysis is all about numbers. Here's a simple way to construct a numeric vector:
```{r}
numeric(length = 50)
```
What happened? We made a numeric vector with 50 __elements__, each of which is the number 0. The word "element" is used to reference any object (a number in this case) that resides at a particular position in a vector.
When we create a vector but don't assign it to a name using ` <- ` R just prints it to the Console for us. Notice what happens when the vector is printed to the screen. Since the length-50 vector can't fit on one line, it was printed over two. At the beginning of each line there is a `[X]`: the `X` is a number that describes the position of the element shown at the beginning of a particular line.
```{block, type='info'}
#### Different kinds of numbers
Roughly speaking, R stores numbers in two different ways depending of whether they are whole numbers ("integers") or numbers containing decimal points ("doubles" -- don't ask). We're not going to worry about this distinction. Most of the time the distinction is fairly invisible to users so it is easier to just think in terms of numeric vectors. We can mix and match integers and doubles in R without having to worry too much about R is storing the numbers.
```
If we need to check that we really have made a numeric vector we can use the `is.numeric` function to do this:
```{r}
# let's create a variable that is a numeric vector
numvec <- numeric(length = 50)
# check it really is a numeric vector
is.numeric(numvec)
```
This returns `TRUE` as expected. A value of `FALSE` would imply that `numvec` is some other kind of object. This may not look like the most useful function in the world, but sometimes we need functions like `is.numeric` to understand what R is doing or root out mistakes in our scripts.
Keep in mind that when we print a numeric vector to Console R only prints the elements to 7 significant figures by default. We can see this by printing the built in constant `pi` to the Console:
```{r}
pi
```
The actual value stored in `pi` is actually much more precise than this. We can see this by printing `pi` again using the `print` function:
```{r}
print(pi, digits = 16)
```
## Constructing numeric vectors
We just saw how to use the `numeric` function to make a numeric vector of zeros. The size of the vector is controlled by the `length` argument---we used `length = 50` above to make a vector with 50 elements. This is arguably not a particularly useful skill, as we usually need to work vectors of particular values (not just 0). A very useful function for creating custom vectors is the `c` function. Take a look at this example:
```{r}
c(1.1, 2.3, 4.0, 5.7)
```
The "c" in the function name stands for "combine". The `c` function takes a variable number of arguments, each of which must be a vector of some kind, and combines these into a single, new vector. We supplied the `c` function with four arguments, each of which was a vector of length 1 (remember: a single number is treated as a length-one vector). The `c` function combines these to generate a vector of length 4. Simple. Now look at this example:
```{r}
vec1 <- c(1.1, 2.3)
vec2 <- c(4.0, 5.7, 3.6)
c(vec1, vec2)
```
This shows that we can use the `c` function to concatenate ("stick together") two or more vectors, even if they are not of length 1. We combined a length-2 vector with a length-3 vector to produce a new length-5 vector.
Notice that we did not have to name the arguments in those two examples---there were no `=` involved. The `c` function is an example of one of those flexible functions that breaks the simple rules of thumb for using arguments that we set out earlier: it can take a variable number of arguments, and these arguments do not have predefined names. This behaviour is necessary for `c` to be of any use: in order to be useful it needs to be flexible enough to take any combination of arguments.
```{block, type="info"}
#### Finding out about a vector R
Sometimes it is useful to be able to find out a little extra information about a vector you are working with, especially if it is very large. Three functions that can extract some useful information about a vector for us are `length`, `head` and `tail`. Using a variety of different vectors, experiment with these to find out what they do. Make sure you use vectors that aren't too short (e.g. with a length of at least 10). Hint: `head` and `tail` can be used with a second argument, `n`.
```
## Named vectors
What happens if we name the arguments to `c` when constructing a vector? Take a look at this:
```{r}
namedv <- c(a = 1, b = 2, c = 3)
namedv
```
What happened here is that the argument names were used to define the names of elements in the vector we made. The resulting vector is still a 1-dimensional data structure. When it is printed to the Console the value of each element is printed, along with the associated name above it. We can extract the names from a named vector using the `names` function:
```{r}
names(namedv)
```
Being able to name the elements of a vector is very useful because it enables us to more easily identify relevant information and extract the bits we need---we'll see how this works in the next chapter.
## Generating sequences of numbers
The main limitation of the `c` function is that we have to manually construct vectors from their elements. This isn't very practical if we need to construct very long vectors of numeric values. There are various functions that can help with this kind of thing though. These are useful when the elements of the target vector need to follow a sequence or repeating pattern. This may not appear all that useful at first, but repeating sequences are used a lot in R.
### Generating sequences of numbers
The `seq` function allows us to generate sequences of numbers. It needs at least two arguments, but there are several different ways to control the sequence produced by `seq`. The method used is determined by our choice of arguments: `from`, `to`, `by`, `length.out` and `along.with`. We don't need to use all of these though---setting 2-3 of these arguments will often work:
1. Using the `by` argument generates a sequence of numbers that increase or decrease by the requested step size:
```{r}
seq(from = 0, to = 12, by = 2)
```
This is fairly self-explanatory. The `seq` function constructed a numeric vector with elements that started at 0 and ended 12, with successive elements increasing in steps of 2. Be careful when using `seq` like this. If the `by` argument does not lead to a sequence that ends exactly on the value of `to` then that value won't appear in the vector. For example:
```{r}
seq(from = 0, to = 11, by = 2)
```
We can generate a descending sequence by using a negative `by` value, like this:
```{r}
seq(from = 12, to = 0, by = -2)
```
2. Using the `length.out` argument generates a sequence of numbers where the resulting vector has the length specified by `length.out`:
```{r}
seq(from = 0, to = 12, length.out = 6)
```
Using the `length.out` argument will always produce a sequence that starts and ends exactly on the values specified by `from` and `to` (if we need a descending sequence we just have to make sure `from` is larger than `to`).
3. Using the `along.with` argument allows us to use another vector to determine the length of the numeric sequence we want to produce:
```{r}
# make any any vector we like
my_vec <- c(1, 3, 7, 2, 4, 2)
# use it to make a sequence of the same length
seq(from = 0, to = 12, along.with = my_vec)
```
It doesn't matter what the elements of `myvec` are. The behaviour of `seq` is controlled by the length of `myvec` when we use `along.with`.
It's conventional to leave out the `from` and `to` argument names when using the `seq` function. For example, we could rewrite the first example above as:
```{r}
seq(0, 12, by = 2)
```
When we leave out the name of the third argument its value is position matched to the `by` argument:
```{r}
seq(0, 12, 2)
```
However, our advice is to explicitly name the `by` argument instead of relying on position matching, because this reminds us what we're doing and will stop us forgetting about the different control arguments.
If we do not specify values of either `by`, `length.out` and `along.with` when using `seq` the default behaviour is to assume we meant `by = 1`:
```{r}
seq(from = 1, to = 12)
```
Generating sequences of integers that go up or down in steps of 1 is something we do a lot in R. Because of this, there is a special operator to generate these simple sequences---the colon, `:`. For example, to produce the last sequence again we use:
```{r}
1:12
```
When we use the `:` operator the convention is to __not__ leave spaces either side of it.
### Generating repeated sequences of numbers
The `rep` function is designed to replicate the values inside a vector, i.e. it's short for "replicate". The first argument (called `x`) is the vector we want to replicate. There are two other arguments that control how this is done. The `times` argument specifies the number of times to replicate the whole vector:
```{r}
# make a simple sequence of integers
seqvec <- 1:4
seqvec
# now replicate *the whole vector* 3 times
rep(seqvec, times = 3)
```
All we did here was take a sequence of integers from 1 to 4 and replicate this end-to-end three times, resulting in a length-12 vector. Alternatively, we can use the `each` argument to replicate each element of a vector while retaining the original order:
```{r}
# make a simple sequence of integers
seqvec <- 1:4
seqvec
# now replicate *each element* vector 3 times
rep(seqvec, each = 3)
```
This example produced a similar vector to the previous one. It contains the same elements and has the same length, but now the order is different. All the 1's appear first, then the 2's, and so on.
Predictably, we can also use both the `times` and `each` arguments together if we want to:
```{r}
seqvec <- 1:4
rep(seqvec, times = 3, each = 2)
```
The way to think about how this works is to imagine that we apply `rep` twice, first with `each = 2`, then with `times = 3` (or vice versa). We can achieve the same thing using nested function calls, though it is much uglier:
```{r}
seqvec <- 1:4
rep(rep(seqvec, each = 2), times = 3)
```
## Vectorised operations
All the simple arithmetic operators (e.g. `+` and `-`) and many mathematical functions are __vectorised__ in R . What this means is that when we use a vectorised function it operates on vectors on an element-by-element basis. Take a look at this example to see what we mean by this:
```{r}
# make a simple sequence
x <- 1:10
x
# make another simple sequence *of the same length*
y <- seq(0.1, 1, length = 10)
y
# add them
x + y
```
We constructed two length-10 numeric vectors, called `x` and `y`, where `x` is a sequence from 1 to 10 and `y` is a sequence from 0.1 to 1. When R evaluates the expression `x + y` it does this by adding the first element of `x` to the first element of `y`, the second element of `x` to the second element of `y`, and so on, working through all 10 elements of `x` and `y`.
Vectorisation is also implemented in the standard mathematical functions. For example, our friend the `round` function will round each element of a numeric vector:
```{r}
# make a simple sequence of non-integer values
my_nums <- seq(0, 1, length = 13)
my_nums
# now round the values
round(my_nums, digits = 2)
```
The same behaviour is seen with other mathematical functions like `sin`, `cos`, `exp`, and `log`. Each of these will apply the relevant function to each element of a numeric vector.
Not all functions are vectorised. For example, the `sum` function takes a vector of numbers and adds them up:
```{r}
sum(my_nums)
```
Although `sum` obviously works on a numeric vector it is not "vectorised" in the sense that it works element-by-element to return an output vector of the same length as its main argument.
Many functions apply the vectorisation principle to more than one argument. Take a look at this example to see what we mean by this:
```{r}
# make a repeated set of non-integer values
my_nums <- rep(2 / 7, times = 6)
my_nums
# round each one to a different number of decimal places
round(my_nums, digits = 1:6)
```
We constructed a length 6 vector containing the number 2/7 (~ 0.285714) and then used the `round` function to round each element to a different number of decimal places. The first element was rounded to 1 decimal place, the second to two decimal places, and so on. We get this behaviour because instead of using a single number for the `digits` argument, we used a vector that is an integer sequence from 1 to 6.
```{block, type="info"}
#### Vectorisation is not the norm
R's vectorised behaviour may seem like the "obvious" thing to do, but most computer languages do not work like this. In other languages we typically have to write a much more complicated expression to do something so simple. This is one of the reasons R is such a data analysis language: vectorisation allows us to express repetitious calculations in a simple, intuitive way. This behaviour can save us a lot of time. However, not every function treats its arguments in a vectorised way, so we always need to check (most easily, by experimenting) whether this behaviour is available before relying on it.
```