---
title: "Chapter 19 - Exercises - R for Data Science"
author: "Francisco Yira Albornoz"
output:
  github_document:
    toc: true
    toc_depth: 4
    df_print: tibble
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# 19.2 When you should write a function
## 19.2.1 Practice
1. Why is `TRUE` not a parameter to `rescale01()`? What would happen if `x` contained a single missing value, and `na.rm` was `FALSE`?
`TRUE` is not a parameter because it is hard-coded inside the function body as the value of `na.rm` in the call to `range()`. Callers cannot change it through the arguments of `rescale01()`.
If `na.rm` was `FALSE` and `x` contained a single missing value, `range()` would return `c(NA, NA)`, so every element of the output would be `NA`.
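To see the effect, here is a minimal sketch. The function `rescale01_narm()` and its `na.rm` argument are hypothetical: they are not part of the book's `rescale01()` and only exist to make the hard-coded value switchable.

```{r}
# Hypothetical variant that exposes na.rm, only to show what FALSE would do
rescale01_narm <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}

with_na <- c(1, 2, NA, 4)
rescale01_narm(with_na)                 # only the missing position is NA
rescale01_narm(with_na, na.rm = FALSE)  # the range is NA, so everything is NA
```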
2. In the second variant of `rescale01()`, infinite values are left unchanged. Rewrite `rescale01()` so that `-Inf` is mapped to `0`, and `Inf` is mapped to `1`.
```{r echo=TRUE}
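x <- c(1:10, Inf, -Inf)

rescale01 <- function(x) {
  # finite = TRUE drops NA and infinite values when computing the range
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  y <- (x - rng[1]) / (rng[2] - rng[1])
  # which() skips NA comparisons, so missing values do not break the recoding
  y[which(x == Inf)] <- 1
  y[which(x == -Inf)] <- 0
  y
}

rescale01(x)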
```
3. Practice turning the following code snippets into functions. Think about what each function does. What would you call it? How many arguments does it need? Can you rewrite it to be more expressive or less duplicative?
```{r message=FALSE, warning=FALSE}
mean(is.na(x))
prop_na <- function(x) mean(is.na(x))
```
```{r}
x / sum(x, na.rm = TRUE)
prop_total <- function(x) x / sum(x, na.rm = TRUE)
prop_total2 <- function(x) {
  total <- sum(x, na.rm = TRUE)
  x / total
}
```
```{r}
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
coef_variation <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
coef_variation2 <- function(x) {
  sd_x <- sd(x, na.rm = TRUE)
  mean_x <- mean(x, na.rm = TRUE)
  sd_x / mean_x
}
```
4. Follow http://nicercode.github.io/intro/writing-functions.html to write your own functions to compute the variance and skew of a numeric vector.
```{r}
my_var <- function(x) {
  mean_x <- mean(x)
  n <- length(x)
  sum((x - mean_x)^2) / (n - 1)
}

my_skewness <- function(x) {
  n <- length(x)
  deviations <- x - mean(x)
  # Moment-based sample skewness: m3 / m2^(3/2), where m_k = mean(deviations^k)
  (sum(deviations^3) / n) / (sum(deviations^2) / n)^(3/2)
}
```
5. Write `both_na()`, a function that takes two vectors of the same length and returns the number of positions that have an `NA` in both vectors.
```{r}
both_na <- function(x, y) {
  is_na_both <- is.na(x) & is.na(y)
  sum(is_na_both)
}
a <- c(NA, 1:6, NA, 8)
b <- c(1:7, NA, 8)
both_na(a, b)
```
6. What do the following functions do? Why are they useful even though they are so short?
```{r}
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0
```
The first function, `is_directory`, checks whether each path in a character vector is an existing directory in the file system.
The second function, `is_readable`, returns `TRUE` when we have read permission for the file specified in the character vector, and `FALSE` otherwise.
They are both useful because their names communicate the intent of the code more clearly than the expressions inside them would on their own.
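A quick way to see both helpers at work is to call them on one path that exists and one that does not. This is only a sketch: the file name `no-such-file-for-this-example` is made up, and the definitions are repeated so the chunk runs on its own.

```{r}
# Definitions repeated from the exercise so this chunk is self-contained
is_directory <- function(x) file.info(x)$isdir
is_readable <- function(x) file.access(x, 4) == 0

existing_dir <- tempdir()  # session temp directory, created by R
missing_path <- file.path(tempdir(), "no-such-file-for-this-example")  # hypothetical

is_directory(c(existing_dir, missing_path))
is_readable(c(existing_dir, missing_path))
```

For the missing path, `file.info()` has nothing to report, so `is_directory()` should give `NA` rather than `FALSE`, while `is_readable()` should give `FALSE`.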
7. Read the complete lyrics to "Little Bunny Foo Foo". There's a lot of duplication in this song. Extend the initial piping example to recreate the complete song, and use functions to reduce the duplication.
The original song:
> Little bunny Foo Foo
Hopping through the forest
Scooping up the field mice
And bopping them on the head
> Down came the Good Fairy, and she said
"Little bunny Foo Foo
I don't want to see you
Scooping up the field mice
And bopping them on the head.
I'll give you three chances,
And if you don't stop, I'll turn you into a GOON!"
```{r eval = FALSE}
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)

good_fairy %>%
  came(direction = down) %>%
  want(!is_scoop(foo_foo, up = field_mice),
       !is_bop(foo_foo, on = head)) %>%
  give(n_chances = 3,
       if_not = turn_into_goon())
```
# 19.3 Functions are for humans and computers
## 19.3.1 Exercises
1. Read the source code for each of the following three functions, puzzle out what they do, and then brainstorm better names.
```{r}
f1 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}
f1("hello", "he")
f1("precedente", "pre")
```
This function tells us whether a string starts with a certain character sequence (a prefix). Better names could be `starts_with` or `has_prefix`.
```{r}
f2 <- function(x) {
  if (length(x) <= 1) return(NULL)
  x[-length(x)]
}
f2(1)
f2(1:10)
```
This function removes the last element of the input vector (and returns `NULL` when the vector has length 0 or 1). Better names could be `remove_last`, `remove_tail` or `minus_tail`.
```{r}
f3 <- function(x, y) {
  rep(y, length.out = length(x))
}
f3(1:10, 2)
f3(1:10, 1:3)
f3(1:10, 1:20)
```
This function repeats (recycles) the input vector `y` until the output matches the length of the input vector `x`. Descriptive names for this function could be `make_eq_length`, `as_long_as`, `rep_until_same_length`.
2. Take a function that you've written recently and spend 5 minutes brainstorming a better name for it and its arguments.
```{r}
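run_sql <- function(conn, path) {
  # Read the SQL file and join its lines into a single query string
  query <- stringr::str_flatten(readLines(path), collapse = "\n")
  # Run the query on an already open DBI connection and return a tibble
  tibble::as_tibble(DBI::dbGetQuery(conn, query))
}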
```
Maybe `run_query` is a better name for this function. Also, `connection` could be a better argument name than `conn` (better to be explicit than to hope that everyone will figure out what `conn` stands for). Finally, the argument `path` could be renamed to `path_to_query` to avoid confusion about which path is required.
3. Compare and contrast `rnorm()` and `MASS::mvrnorm()`. How could you make them more consistent?
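Both functions share three core arguments: the number of observations to generate, the mean, and the spread. However, the defaults differ: in `rnorm()`, `n` is required while `mean` and `sd` default to 0 and 1; in `MASS::mvrnorm()`, `n` defaults to 1 while `mu` and `Sigma` are required. The names differ too (`mean` and `sd` in `rnorm()`, `mu` and `Sigma` in `MASS::mvrnorm()`), and so does the scale of the spread: `sd` is a standard deviation, whereas `Sigma` is a covariance matrix.
One way to make them more consistent would be to share names where the meaning matches, for example renaming `mu` to `mean` in `MASS::mvrnorm()`. `Sigma` is harder: because it is on the variance scale, simply renaming it to `sd` would be misleading.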
4. Make a case for why `norm_r()`, `norm_d()` etc would be better than `rnorm()`, `dnorm()`. Make a case for the opposite.
Using a common prefix for related functions is a good idea because when users type the prefix in RStudio, it triggers the auto-complete pop-up, which helps them find those related functions even if they don't remember the exact name.
However, a common prefix could make it easier to miss the difference between the related functions (when writing or reading code). Maybe a longer, more descriptive suffix would solve this potential issue.
# 19.4 Conditional execution
## 19.4.4 Exercises
1. What's the difference between `if` and `ifelse()`? Carefully read the help and construct three examples that illustrate the key differences
`ifelse()` is a vectorized function that accepts a condition vector (argument `test`) of any length. It evaluates the condition element-wise and picks each element from one of two other vectors (`yes` and `no`) depending on the result. The output has the same length and attributes (including dimensions and class) as the condition vector.
`if` is a control-flow construct that expects a condition of length 1. In exchange, its branches can hold arbitrary expressions (for example, writing a file to disk or installing a package).
Example 1: if we want vectorized evaluation, we need to use `ifelse()`. Passing a condition of length greater than 1 to `if` is an error since R 4.2.0 (earlier versions used only the first element, with a warning).
```{r error=TRUE}
numbers <- 1:10
# The condition has length 10, so `if` cannot use it
if (numbers %% 2 == 0) {
  "even"
} else {
  "odd"
}
ifelse(numbers %% 2 == 0, "even", "odd")
```
Example 2: `if` allows us to execute expressions which generate side effects instead of returning values.
```{r}
if (exists("mtcars")) saveRDS(mtcars, "mtcars.rds")
```
Example 3: `ifelse()` allows `NA` in the condition vector (they simply become `NA` in the output vector).
```{r}
animals <- c(NA, "dog", "cat", "parrot")
ifelse(animals == "dog",
       "love him",
       "not sure")
```
2. Write a greeting function that says "good morning", "good afternoon", or "good evening", depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)
```{r}
greeting <- function(time = lubridate::now()) {
  hr <- lubridate::hour(time)
  # Cutoffs are a choice: morning before 12:00, afternoon before 18:00
  if (hr < 12) {
    "Good morning"
  } else if (hr < 18) {
    "Good afternoon"
  } else {
    "Good evening"
  }
}
greeting()
```
3. Implement a `fizzbuzz` function. It takes a single number as input. If the number is divisible by three, it returns "fizz". If it's divisible by five it returns "buzz". If it's divisible by three and five, it returns "fizzbuzz". Otherwise, it returns the number. Make sure you first write working code before you create the function.
```{r}
fizzbuzz <- function(x) {
  if (x %% 15 == 0) {
    "fizzbuzz"
  } else if (x %% 3 == 0) {
    "fizz"
  } else if (x %% 5 == 0) {
    "buzz"
  } else {
    x
  }
}
```
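The exercise asks for a function that takes a single number. As a side note tying back to the `if` versus `ifelse()` comparison, here is a sketch of a vectorized variant. The name `fizzbuzz_vec` is hypothetical, and unlike the version above it always returns a character vector, so plain numbers come back as strings.

```{r}
# Hypothetical vectorized variant built from nested ifelse() calls
fizzbuzz_vec <- function(x) {
  ifelse(x %% 15 == 0, "fizzbuzz",
    ifelse(x %% 3 == 0, "fizz",
      ifelse(x %% 5 == 0, "buzz", as.character(x))))
}

fizzbuzz_vec(c(3, 5, 15, 7))
```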
4. How could you use `cut()` to simplify this set of nested if-else statements?
```{r, eval=F, echo=T}
if (temp <= 0) {
  "freezing"
} else if (temp <= 10) {
  "cold"
} else if (temp <= 20) {
  "cool"
} else if (temp <= 30) {
  "warm"
} else {
  "hot"
}
```
```{r, eval=F, echo=T}
cut(
  temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot")
)
```
How would you change the call to `cut()` if I'd used < instead of <=?
```{r, eval=F, echo=T}
cut(
  temp,
  breaks = c(-Inf, 0, 10, 20, 30, Inf),
  labels = c("freezing", "cold", "cool", "warm", "hot"),
  # right = FALSE makes the intervals closed on the left: [a, b)
  right = FALSE
)
```
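The difference between the two comparisons only shows up at the break points. A small check with values that sit exactly on the boundaries (the vector `boundary_temps` is made up for illustration):

```{r}
boundary_temps <- c(0, 10, 30)
temp_breaks <- c(-Inf, 0, 10, 20, 30, Inf)
temp_labels <- c("freezing", "cold", "cool", "warm", "hot")

# Default right = TRUE: intervals are (a, b], matching `<=`
closed_right <- as.character(cut(boundary_temps, temp_breaks, temp_labels))
# right = FALSE: intervals are [a, b), matching `<`
closed_left <- as.character(cut(boundary_temps, temp_breaks, temp_labels, right = FALSE))

closed_right
closed_left
```

With `<=`, a temperature of exactly 0 is "freezing"; with `<`, it moves up to "cold".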
What is the other chief advantage of `cut()` for this problem? (Hint: what happens if you have many values in `temp`?)
`cut()` is a vectorized function, so if `temp` has length greater than 1 we still get a vector with every value recoded according to our criteria. The if/else solution cannot do that: a condition longer than 1 is an error since R 4.2.0, and in earlier versions only the first element of `temp` was used (with a warning).
5. What happens if you use `switch()` with numeric values?
It returns the alternative in the corresponding position (i.e. a `3` will return the third alternative, even if there is another one named `3`). If the number is outside the range of alternatives, `switch()` returns `NULL` invisibly.
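A short illustration of positional matching, with made-up alternatives:

```{r}
# Numeric input selects by position
second <- switch(2, "first", "second", "third")
second

# Names are ignored for numeric input: 3 picks the third alternative,
# not the one named "3"
by_position <- switch(3, "3" = "named three", b = "second", c = "third")
by_position

# Out of range: nothing is selected and the result is NULL
out_of_range <- switch(4, "first", "second", "third")
is.null(out_of_range)
```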
6. What does this `switch()` call do? What happens if `x` is "e"?
```{r, eval=F, echo=T}
switch(x,
  a = ,
  b = "ab",
  c = ,
  d = "cd"
)
```
The documentation says that if the matching element is missing (empty), the next non-missing element is evaluated. So `"a"` and `"b"` both return `"ab"`, and `"c"` and `"d"` both return `"cd"`. When `x` is `"e"`, nothing matches and there is no unnamed default, so `NULL` is returned invisibly.
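Wrapping the call in a function makes the fall-through easy to try out. The wrapper name `group_letter` is hypothetical:

```{r}
# Hypothetical wrapper around the switch() call from the exercise
group_letter <- function(x) {
  switch(x,
    a = ,
    b = "ab",
    c = ,
    d = "cd"
  )
}

group_letter("a")           # falls through to b
group_letter("d")
is.null(group_letter("e"))  # no match and no default
```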
# 19.5 Function arguments
## 19.5.5 Exercises
1. What does `commas(letters, collapse = "-")` do? Why?
It throws an error. The `...` in `commas()` forwards every argument, including `collapse = "-"`, to `stringr::str_c()`, which also receives the hard-coded `collapse = ", "`. The `collapse` argument is therefore supplied twice, and R cannot tell which value to use.
```{r, eval=F, echo=T}
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters, collapse = "-")
```
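The same mechanism can be reproduced without stringr. The sketch below uses base `paste0()` as a stand-in for `stringr::str_c()` (an assumption made to keep the chunk dependency-free), and the names `commas_base` and `commas2` are hypothetical. The second function shows one possible fix: exposing `collapse` as a named argument with a default.

```{r}
# Stand-in for the book's commas(), with paste0() instead of stringr::str_c()
commas_base <- function(...) paste0(..., collapse = ", ")

# Capture the error message instead of stopping
clash <- tryCatch(
  commas_base(letters[1:3], collapse = "-"),
  error = function(e) conditionMessage(e)
)
clash

# One possible fix: make `collapse` an argument of the wrapper itself
commas2 <- function(..., collapse = ", ") paste0(..., collapse = collapse)
commas2(letters[1:3])
commas2(letters[1:3], collapse = "-")
```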
2. It'd be nice if you could supply multiple characters to the `pad` argument, e.g. `rule("Title", pad = "-+")`. Why doesn't this currently work? How could you fix it?
It doesn't work because the function computes the number of repetitions assuming that `pad` is a single character. With a multi-character `pad`, the repeated string is that many times too long, so the rule spills over more than one line.
The fix is to take the number of characters in `pad` (with `nchar(pad)`, since `length(pad)` is still 1 for a single string) and divide `width` by it.
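A minimal sketch of that fix, assuming the book's `rule()` builds the line from `getOption("width")` minus the title length and a small margin. The name `rule_multi` is hypothetical, and base `strrep()` is used here in place of `stringr::str_dup()`.

```{r}
# Hypothetical variant of rule() that accepts a multi-character pad
rule_multi <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  # Repeat the whole pad enough times to cover the width, then trim the excess
  reps <- ceiling(width / nchar(pad))
  padding <- substr(strrep(pad, reps), 1, width)
  cat(title, " ", padding, "\n", sep = "")
}

rule_multi("Title", pad = "-+")
```

Trimming with `substr()` keeps the line at the intended width even when the width is not a multiple of `nchar(pad)`.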
3. What does the `trim` argument to `mean()` do? When might you use it?
It computes a trimmed mean: the observations are sorted and a fraction `trim` is removed from each end before averaging. The removed elements are the most extreme values, so we obtain a number that is less sensitive to outliers and sample fluctuation than the arithmetic mean. It is useful when the data may contain outliers.
http://davidmlane.com/hyperstat/A11971.html
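A small example with one obvious outlier (the vector `x_outlier` is made up for illustration):

```{r}
x_outlier <- c(1:9, 100)

mean(x_outlier)              # pulled up by the value 100
mean(x_outlier, trim = 0.1)  # drops the lowest and highest 10% first
```

With ten values and `trim = 0.1`, one observation is dropped from each end, so the trimmed mean is the mean of `2:9`.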
4. The default value for the method argument to `cor()` is `c("pearson", "kendall", "spearman")`. What does that mean? What value is used by default?
The default method is "pearson". The vector `c("pearson", "kendall", "spearman")` documents the allowed values. Inside the function, `method <- match.arg(method)` checks the supplied value against those choices and, when the argument is left at its default, picks the first one.
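The same idiom works in our own functions. A minimal sketch, where `pick_method` is a hypothetical function that only returns the resolved choice:

```{r}
# Hypothetical function using the same default-vector idiom as cor()
pick_method <- function(method = c("pearson", "kendall", "spearman")) {
  # With the default untouched, match.arg() returns the first choice;
  # otherwise it validates (and partially matches) the supplied value
  method <- match.arg(method)
  method
}

pick_method()
pick_method("kendall")
pick_method("spear")  # partial matching resolves to "spearman"
```

Any value that is not among the choices makes `match.arg()` throw an error, which gives argument validation for free.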