-
Notifications
You must be signed in to change notification settings - Fork 0
/
intro-biostat.Rmd
571 lines (366 loc) · 14.3 KB
/
intro-biostat.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
---
title: "Lecture 3: Introduction to biostatistics"
subtitle: "MEDI 504"
author:
- "@TiffanyTimbers (UBC)"
- "`r Sys.Date()`"
output:
xaringan::moon_reader:
lib_dir: libs
css: xaringan-themer.css
nature:
highlightStyle: github
highlightLines: true
countIncrementalSlides: false
slideNumberFormat: '%current%'
titleSlideClass: [top, left]
ratio: '16:9'
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
library(knitr)
opts_chunk$set(echo = TRUE)
```
```{r xaringan-themer, include=FALSE, warning=FALSE}
library(xaringanthemer)
style_duo_accent(
primary_color = "#9F999C",
secondary_color = "#FFE5F3",
inverse_header_color = "#8F8C8E",
link_color = "deeppink",
title_slide_text_color = "#3d3d3d",
title_slide_background_image = "img/title-slide-background-light.png",
title_slide_background_position = "left",
header_font_google = google_font("Josefin Sans"),
text_font_google = google_font("Montserrat", "300", "300i"),
code_font_google = google_font("Fira Mono")
)
```
## Module learning objectives
By the end of this module, students should be able to:
- Identify the different types of data analysis questions and categorize a question into the correct type
- Identify a suitable analysis type to answer an inferential question, given the data set at hand
- Use the R programming language to carry out analysis to answer inferential question
- Interpret and communicate the results of the analysis from an inferential question
---
## What is the question?
```{r what-is-the-q, out.width = "50%", echo = FALSE, fig.cap = "What is the question? by Roger Peng and Jeff Leek"}
include_graphics("img/what_is_the_question.png")
```
---
#### 1. Descriptive
One that seeks to summarize a characteristic of a set of data. No interpretation of the result itself as the result is a fact, an attribute of the data set you are working with.
--
Examples:
- What is the frequency of viral illnesses in a set of data collected from a group of individuals?
--
- How many people live in each US state?
---
#### 2. Exploratory
One in which you analyze the data to see if there are patterns, trends, or relationships between variables looking for patterns that would support proposing a hypothesis to test in a future study.
--
Examples:
- Do diets rich in certain foods have differing frequencies of viral illnesses **in a set of data** collected from a group of individuals?
--
- Does air pollution correlate with life expectancy **in a set of data** collected from groups of individuals from several regions in the United States?
---
#### 3. Inferential
One in which you analyze the data to see if there are patterns, trends, or relationships between variables in a representative sample. We want to quantify how much the patterns, trends, or relationships between variables is applicable to all individuals units in the population.
--
Examples:
- Is eating at least 5 servings a day of fresh fruit and vegetables is associated with fewer viral illnesses per year?
--
- Is the gestational length of first born babies the same as that of non-first borns?
---
#### 4. Predictive
One where you are trying to predict measurements or labels for individuals (people or things). Less interested in what causes the predicted outcome, just what predicts it.
--
Examples:
- How many viral illnesses will someone have next year?
--
- What political party will someone vote for in the next US election?
---
#### 5. Causal
Asks about whether changing one factor will change another factor, on average, in a population. Sometimes the underlying design of the data collection, by default, allows for the question that you ask to be causal (e.g., randomized experiment or trial)
--
Examples:
- Does eating at least 5 servings a day of fresh fruit and vegetables cause fewer viral illnesses per year?
--
- Does smoking lead to cancer?
---
#### 6. Mechanistic
One that tries to explain the underlying mechanism of the observed patterns, trends, or relationship (how does it happen?)
--
Examples:
- How do changes in diet lead to a reduction in the number of viral illnesses?
--
- How does how airplane wing design changes air flow over a wing, leading to decreased drag?
---
## Challenge #1
What kind of statistical question is this?
#### *Is a yet undiagnosed patient's breast cancer tumor malignant or benign?*
---
## Challenge #2
What kind of statistical question is this?
#### *Is inhalation of marijuana associated with lung cancer?*
---
## Challenge #2
What kind of statistical question is this?
#### *Does a truncation of the BRCA2 gene cause cancer?*
---
## Challenge #4
What kind of statistical question is this?
#### *Are there sub-types of ovarian tumors?*
---
### So you know the type of question, now what?
.pull-left[
This helps narrow down the possibilities
of the kind of analysis you might want to do!
--
For example, if you have the question: **"How many viral illnesses will someone have next year?"**
and you identify that it is **predictive.**
You could narrow down that some kind of statistical or machine learning model
might help you answer that.
--
Then you need to go a step deeper and look at the data that you have,
and see which kind of statistical
or machine learning model is most suitable for your data.
]
.pull-right[
<img src="https://scikit-learn.org/stable/_static/ml_map.png" width=700>
Source: [scikit-learn algorithm cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
]
---
## Another example
.pull-left[
For example, if you have the question: "Is the gestational length of first born babies the same as that of non-first borns?"
and you identify that it is inferential.
You could narrow down that some kind of statistical inference approach
might help you answer that.
--
Then again, you need to go a step deeper and look at the data that you have,
and see which kind of statistical inference approach is most suitable for your data.
]
.pull-right[
<img src="https://onishlab.colostate.edu/wp-content/uploads/2019/07/which_test_flowchart.png" width=700>
Source: https://onishlab.colostate.edu/summer-statistics-workshop-2019/
Or for another, see this website: http://www.biostathandbook.com/testchoice.html
]
---
## Practice
---
### Case 1
Question: *Is a yet undiagnosed patient's breast cancer tumor malignant or benign?*
Data:
```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
read_csv("data/wdbc.csv") %>%
select(ID, Radius:Smoothness, Class) %>%
tail() %>%
kable()
```
Data reference: https://archive-beta.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+diagnostic
---
### Case 2
Question: *Is inhalation of marijuana associated with lung cancer?*
Data:
```{r, echo = FALSE, message = FALSE, warning = FALSE}
tibble(ID = c(round(runif(6, 50000, 60000))),
sex = c("male", "male", "male", "female", "female", "male"),
gender = c("fluid", "male", "male", "female", "female", "male"),
age = c(35, 43, 29, 54, 37, 51),
smoker = c(1, 0, 0, 0, 0, 0),
marijuana_use = c("never", "frequent", "sometimes", "frequent", "never", "never"),
bmi = c(22.3, 18, 32.5, 20, 26.1, 29.8),
lung_cancer = c(0, 0, 1, 0, 0, 1)) %>%
kable()
```
*Note: this is simulated data.*
---
### Case 3
Question: *Does a truncation of the BRCA2 gene cause cancer?*
Data:
```{r, echo = FALSE, message = FALSE, warning = FALSE}
tibble(ID = c(round(runif(6, 20000, 30000))),
sex = c("male", "male", "male", "female", "female", "male"),
gender = c("fluid", "male", "male", "female", "female", "male"),
age = c(35, 43, 29, 54, 37, 51),
smoker = c(1, 0, 0, 0, 0, 0),
bmi = c(22.3, 18, 32.5, 20, 26.1, 29.8),
brca2_truncation = c(1, 0, 1, 0, 0, 0),
cancer = c(1, 0, 1, 0, 0, 1)) %>%
kable()
```
*Note: this is simulated data.*
---
### Case 4
Question: *Are there sub-types of ovarian tumors?*
Data:
```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(tidyverse)
read_csv("data/wdbc.csv") %>%
select(ID, Radius:Smoothness) %>%
tail() %>%
kable()
```
*Note: this is simulated data.*
---
## Some key notes:
- identifying whether there even is a response variable is important!
- the kind of response variable/target is critical for narrowing down the method
- the explanatory variables/predictors/features are also important, but I consider these after the response variabe
---
## A question for you!
Write down one statistical question you are trying to answer with your research.
Identify the type of question it is.
---
## The statistical landscape in R
Common packages include:
```{r, echo = FALSE, out.width = "80%"}
include_graphics("img/stat-landscape.png")
```
---
## Example of an inferential analysis in R
Question: *Does sexual activity effect the longevity of male fruit flies?*
--
What kind of question is this?
---
## Data
Fruitflies were divided randomly into groups of 25 each.
The response was the longevity of the fruitfly in days. One group was kept
solitary, while another was given 8 virgin females per day.
```{r, warning = FALSE, message = FALSE}
library(tidyverse)
fruitfly_2_groups <- read_csv("data/fruitfly_2_groups.csv")
fruitfly_2_groups
```
*Note: this is a modification of the original data set where we only considered two of the groups from the original experiment.*
*Original data source: [`faraway` R package](https://cran.r-project.org/web/packages/faraway/faraway.pdf).*
---
## So how should we analyze this data?
- What is our response variable? What kind of data is it?
- What is our explanatory variable? What kind of data is it?
---
## So how should we analyze this data?
- a t-test is suitable here (as would be a permutation test for difference of means, or a Mann Whitney U Test)
- to perform this, we need to parameterize null ($H_0$) and alternative hypotheses ($H_A$):
*$H_0$: There **is no** difference in mean longevity of sexually active and non-sexually active male fruitflies.*
*$H_A$: There **is** difference in mean longevity of sexually active and non-sexually active male fruitflies.*
---
## Always start with a visualization
- The visualization should be related to your question!
- It should complement your statistical method(s)
--
- We are interested in the mean - the population mean however!
--
- **So here, we should visualize our estimates of the population means, as well as our uncertainty about them!**
---
## Visualizing estimates and their uncertainty
1. Calculate estimates & uncertainty
2. Visualize estimates and uncertainty, communicating as much about the underlying sample data as possible!
---
## Calculate estimates & uncertainty
Here we calculate the sample means and 95% confidence interval for a mean
using the t-distribution, assuming independence and the central limit theorem.
```{r}
fruitfly_2_estimates <- fruitfly_2_groups %>%
group_by(sexually_active) %>%
summarise(mean = mean(longevity),
n = n(),
se = sd(longevity) / sqrt(n()),
df = n - 1,
t_star = qt(0.975, df),
lower = mean - t_star * se,
upper = mean + t_star * se)
fruitfly_2_estimates
```
---
## Final visualization
```{r plot fruitfly_2_groups data, warning = FALSE}
# plot raw data points for each group as a transparent grey/black point
# overlay mean as a red diamond
fruitfly_2_estimates_viz <- ggplot(fruitfly_2_groups, aes(x = sexually_active, y = longevity)) +
geom_jitter(width = 0.1, size = 2, alpha = 0.2) +
stat_summary(fun = mean,
geom = "point", shape = 18,
size = 4, color = "blue") +
geom_errorbar(data = fruitfly_2_estimates,
mapping = aes(x = sexually_active,
y = mean,
ymin = lower,
ymax = upper),
width = 0.15, colour = "blue", size = 1) +
ylim(c(0, 100)) +
ylab("Longevity (days)") +
xlab("Sexual activity") +
theme_bw() +
theme(text = element_text(size=20))
```
---
class: center
```{r, warning = FALSE}
fruitfly_2_estimates_viz
```
--
What do we think so far? Are we likely to reject the null hypothesis?
---
## Example of t-test
```{r}
fruitfly_2_ttest <- t.test(longevity ~ sexually_active, data = fruitfly_2_groups)
fruitfly_2_ttest
```
--
How do we parse this output?
---
## Using base R
```{r}
fruitfly_2_ttest$p.value
```
```{r}
fruitfly_2_ttest$statistic
```
---
## Using `broom`
```{r}
library(broom)
fruitfly_2_ttest_tidy <- tidy(fruitfly_2_ttest)
```
```{r}
fruitfly_2_ttest_pval <- fruitfly_2_ttest_tidy %>%
select(p.value) %>%
pull()
fruitfly_2_ttest_pval
```
---
## What are our conclusions?
--
The male fruitflys which were not sexually active were observed to have
an increased lifespan
(they lived `r round(abs(fruitfly_2_estimates$mean[1] - fruitfly_2_estimates$mean[2]), 2)` days longer).
Specifically, the male fruitflys which were not sexually active had a mean lifespan of
`r round(fruitfly_2_estimates$mean[1])` (95% confidence interval was `r round(fruitfly_2_estimates$lower[1])`, `r round(fruitfly_2_estimates$upper[1])`)
days,
while male fruitflys which were sexually active had a mean lifespan of
`r round(fruitfly_2_estimates$mean[2])` (95% confidence interval was `r round(fruitfly_2_estimates$lower[2])`, `r round(fruitfly_2_estimates$upper[2])`)
days.
Carrying out a t-test (assuming independence and the central limit theorem)
with alpha set to 0.05,
indicated that we have enough statistical evidence to reject our null hypothesis, $H_0$,
as our p-value was much smaller than alpha (p = `r fruitfly_2_ttest_pval`).
We can suggest the alternative hypothesis, $H_1$, may be more favourable.
Specifically, that there is a difference in the male fruitfly lifespan
when males are sexually active compared to when they are not.
Due to the randomized experimental design,
we can also suggest that this effect of sexual activity is causal on the change in lifespan.
Specifically, sexual activity in male fruitflys decreases lifespan.
---
## Summary
1. Identify the kind of question
2. Look at the data
3. Identify a **suitable** statistical method for your question and data
4. Create a visualization
5. Apply your statistical method
6. (maybe create another visualization)
7. Interpret and communicate your assumptions and results
---
## Questions?