```{r, echo=FALSE, warning=FALSE}
library(knitr)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
options(scipen = 999, digits = 7)
```
# Hypothesis testing
This chapter is primarily based on Field, A., Miles J., & Field, Z. (2012): Discovering Statistics Using R. Sage Publications, **chapter 5**.
[You can download the corresponding R-Code here](./Code/07-hypothesis_testing (2).R)
## Introduction
We test hypotheses because we are confined to taking samples – we rarely work with the entire population. In the previous chapter, we introduced the standard error (i.e., the standard deviation of the sample means across a large number of hypothetical samples) as an estimate of how well a particular sample represents the population. We also saw how we can construct confidence intervals around the sample mean $\bar x$ by computing $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$ (using $s$ as an estimate of $\sigma$) and calculating the 95% CI as $\bar x \pm 1.96 * SE_{\bar x}$. Although we do not know the true population mean ($\mu$), we might have a hypothesis about it, and this would tell us what the corresponding sampling distribution looks like. Based on the sampling distribution of the hypothesized population mean, we could then determine the probability of obtaining a given sample **assuming that the hypothesis is true**.
Let us again begin by assuming we know the entire population, using the example of music listening times among students from the previous chapter. As a reminder, the following plot shows the distribution of music listening times in the population of WU students.
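As a brief refresher, a minimal sketch of this computation (using a small, made-up vector of listening times purely for illustration) could look as follows:
```{r}
x <- c(12, 25, 8, 31, 17, 22, 9, 14, 28, 19) # hypothetical listening times (hours)
se <- sd(x)/sqrt(length(x))                  # standard error of the mean
mean(x) + c(-1, 1)*1.96*se                   # approximate 95% confidence interval
```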
```{r, message = FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(x = hours), bins = 30, fill='white', color='black') +
geom_vline(xintercept = mean(hours), size = 1) + theme_bw() +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')
```
In this example, the population mean ($\mu$) is equal to `r round(mean(hours),2)`, and the population standard deviation $\sigma$ is equal to `r round(sd(hours),2)`.
### The null hypothesis
Let us assume that we were planning to take a random sample of 50 students from this population and our hypothesis was that the mean listening time is equal to some specific value $\mu_0$, say $10$. This would be our **null hypothesis**. The null hypothesis refers to the statement that is being tested and is usually a statement of the status quo, one of no difference or no effect. In our example, the null hypothesis would state that there is no difference between the true population mean $\mu$ and the hypothesized value $\mu_0$ (in our example $10$), which can be expressed as follows:
$$
H_0: \mu = \mu_0
$$
When conducting research, we are usually interested in providing evidence against the null hypothesis. If we then observe sufficient evidence against it, our estimate is said to be significant. If the null hypothesis is rejected, this is taken as support for the **alternative hypothesis**. The alternative hypothesis assumes that some difference exists, which can be expressed as follows:
$$
H_1: \mu \neq \mu_0
$$
Accepting the alternative hypothesis in turn will often lead to changes in opinions or actions. Note that while the null hypothesis may be rejected, it can never be accepted based on a single test. If we fail to reject the null hypothesis, it means that we simply haven't collected enough evidence against the null hypothesis to disprove it. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true. **Hypothesis testing** provides a means to quantify to what extent the data from our sample is in line with the null hypothesis.
In order to quantify the concept of "sufficient evidence" we look at the theoretical distribution of the sample means given our null hypothesis and the sample standard error. Using the available information we can infer the sampling distribution for our null hypothesis. Recall that the standard deviation of the sampling distribution (i.e., the standard error of the mean) is given by $\sigma_{\bar x}={\sigma \over \sqrt{n}}$, and thus can be computed as follows:
```{r}
mean_pop <- mean(hours)
sigma <- sd(hours) #population standard deviation
n <- 50 #sample size
standard_error <- sigma/sqrt(n) #standard error
standard_error
```
Since we know from the central limit theorem that the sampling distribution is normal for large enough samples, we can now visualize the expected sampling distribution **if our null hypothesis was in fact true** (i.e., if there was no difference between the true population mean and the hypothesized mean of 10).
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 8}
library(latex2exp)
H_0 <- 10
p1 <- 0.025
p2 <- 0.975
min <- 0
max <- 20
norm1 <- round(qnorm(p1), digits = 3)
norm2 <- round(qnorm(p2), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(min, qnorm(p1, mean = H_0, sd = standard_error)), geom = "area") +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(max, qnorm(p2, mean = H_0, sd = standard_error)), geom = "area") +
scale_x_continuous(breaks=c(0,qnorm(p1, mean = H_0, sd = standard_error),10,qnorm(p2, mean = H_0, sd = standard_error),20), labels=c("0",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p1))),"10",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p2))),"20")) +
labs(title = TeX(sprintf("Theoretical density given null hypothesis $\\mu_0=$ 10 ($\\sigma_{\\bar x}$ = %.2f)",standard_error)),x = "Hours", y = "Density") +
theme(legend.position="none") +
theme_bw()
```
We also know that 95% of the probability is within `r round(qnorm(p2),2)` standard deviations from the mean. Values further away from the mean than that are rather unlikely if our hypothesis about the population mean is indeed true. This is shown by the shaded areas, also known as the "rejection region". To test our hypothesis that the population mean is equal to $10$, let us take a random sample from the population.
```{r}
set.seed(12567)
H_0 <- 10
student_sample <- sample(1:25000, size = 50, replace = FALSE)
student_sample <- hours[student_sample]
mean_sample <- mean(student_sample)
ggplot(data.frame(student_sample)) +
geom_histogram(aes(x = student_sample), fill = 'white', color = 'black', bins = 20) +
theme_bw() + geom_vline(xintercept = mean(student_sample), color = 'black', size=1) +
labs(title = TeX(sprintf("Distribution of values in the sample ($n =$ %.0f, $\\bar{x] = $ %.2f, s = %.2f)",n,mean(student_sample),sd(student_sample))),x = "Hours", y = "Frequency")
```
The mean listening time in the sample (black line) $\bar x$ is `r round(mean(student_sample),2)`. We can already see from the graphic above that such a value is rather unlikely under the hypothesis that the population mean is $10$. Intuitively, such a result would therefore provide evidence against our null hypothesis. But how could we quantify specifically how unlikely it is to obtain such a value and decide whether or not to reject the null hypothesis? Significance tests can be used to provide answers to these questions.
### Statistical inference on a sample
#### Test statistic
##### z-scores
Let's go back to the sampling distribution above. We know that 95% of all values will fall within `r round(qnorm(p2),2)` standard deviations from the mean. So if we could express the distance between our sample mean and the null hypothesis in terms of standard deviations, we could make statements about the probability of getting a sample mean of the observed magnitude (or more extreme values). Essentially, we would like to know how many standard deviations ($\sigma_{\bar x}$) our sample mean ($\bar x$) is away from the population mean if the null hypothesis was true ($\mu_0$). This can be formally expressed as follows:
$$
\bar x- \mu_0 = z \sigma_{\bar x}
$$
In this equation, ```z``` will tell us how many standard deviations the sample mean $\bar x$ is away from the null hypothesis $\mu_0$. Solving for ```z``` gives us:
$$
z = {\bar x- \mu_0 \over \sigma_{\bar x}}={\bar x- \mu_0 \over \sigma / \sqrt{n}}
$$
This standardized value (or "z-score") is also referred to as a **test statistic**. Let's compute the test statistic for our example above:
```{r}
z_score <- (mean_sample - H_0)/(sigma/sqrt(n))
z_score
```
To make a decision on whether the difference can be deemed statistically significant, we now need to compare this calculated test statistic to a meaningful threshold. In order to do so, we need to decide on a significance level $\alpha$, which expresses the probability of finding an effect that does not actually exist (i.e., Type I Error). You can find a detailed discussion of this point at the end of this chapter. For now, we will adopt the widely accepted significance level of 5% and set $\alpha$ to 0.05. The critical value for the normal distribution and $\alpha$ = 0.05 can be computed using the ```qnorm()``` function as follows:
```{r}
z_crit <- qnorm(0.975)
z_crit
```
We use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Recall that for the normal distribution, 95% of the total probability falls within `r round(qnorm(0.975),2)` standard deviations of the mean, so that higher (absolute) values provide evidence against the null hypothesis. Generally, we speak of a statistically significant effect if the (absolute) calculated test statistic is larger than the (absolute) critical value. We can easily check if this is the case in our example:
```{r}
abs(z_score) > abs(z_crit)
```
Since the absolute value of the calculated test statistic is larger than the critical value, we would reject $H_0$ and conclude that the true population mean $\mu$ is significantly different from the hypothesized value $\mu_0 = 10$.
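As a quick sanity check of this two-sided logic: with the critical value placed at the 97.5th percentile, the probability mass beyond it in both tails combined is exactly 5%.
```{r}
2*(1 - pnorm(qnorm(0.975))) # total area in both tails beyond the critical value
```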
##### t-statistic
You may have noticed that the formula for the z-score above assumes that we know the true population standard deviation ($\sigma$) when computing the standard deviation of the sampling distribution ($\sigma_{\bar x}$) in the denominator. However, the population standard deviation is usually not known in the real world and therefore represents another unknown population parameter which we have to estimate from the sample. We saw in the previous chapter that we usually use $s$ as an estimate of $\sigma$ and $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$. Intuitively, to reflect this uncertainty about the true population standard deviation, we should be more conservative regarding the critical value that we used above to assess whether we have a significant effect. That is, the threshold for a "significant" effect should be higher to safeguard against falsely claiming a significant effect when there is none. If we replace $\sigma_{\bar x}$ by its estimate $SE_{\bar x}$ in the formula for the z-score, we get a new test statistic (i.e., the **t-statistic**) with its own distribution (the **t-distribution**):
$$
t = {\bar x- \mu_0 \over SE_{\bar x}}={\bar x- \mu_0 \over s / \sqrt{n}}
$$
Here, $\bar x$ denotes the sample mean and $s$ the sample standard deviation. The t-distribution has more probability in its "tails", i.e., farther away from the mean. This reflects the higher uncertainty introduced by replacing the population standard deviation with its sample estimate. Intuitively, this is particularly relevant for small samples, since the uncertainty about the true population parameters decreases with increasing sample size. This is reflected by the fact that the exact shape of the t-distribution depends on the **degrees of freedom**, which is the sample size minus one (i.e., $n-1$). To see this, the following graph shows the t-distribution with different degrees of freedom for a two-tailed test and $\alpha = 0.05$. The grey curve shows the normal distribution.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 8, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
df <- 5
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
stat_function(fun = dnorm, color = "grey") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ", df),x = "x", y = "Density") +
theme(legend.position="none") +
theme_bw()
df <- 10
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 100
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 1000
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot4 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Degrees of freedom and the t-distribution', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
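The convergence toward the normal distribution that is visible in the figure can also be checked numerically by comparing the two-sided critical values of the t-distribution for increasing degrees of freedom with the corresponding critical value of the standard normal distribution:
```{r}
qt(0.975, df = c(5, 10, 100, 1000)) # critical values of the t-distribution
qnorm(0.975)                        # critical value of the standard normal distribution
```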
Notice that as $n$ gets larger, the t-distribution gets closer and closer to the normal distribution, reflecting the fact that the uncertainty introduced by $s$ is reduced. To summarize, we now have an estimate for the standard deviation of the distribution of the sample mean (i.e., $SE_{\bar x}$) and an appropriate distribution that takes into account the necessary uncertainty (i.e., the t-distribution). Let us now compute the t-statistic according to the formula above:
```{r}
SE <- (sd(student_sample)/sqrt(n))
t_score <- (mean_sample - H_0)/SE
t_score
```
Notice that the value of the t-statistic is higher compared to the z-score (`r round(z_score,2)`). This can be attributed to the fact that by using $s$ as an estimate of $\sigma$, we underestimate the true population standard deviation. Hence, the critical value would need to be larger to adjust for this. This is what the t-distribution does. Let us compute the critical value from the t-distribution with ```n - 1``` degrees of freedom.
```{r}
df = n - 1
t_crit <- qt(0.975, df = df)
t_crit
```
Again, we use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Notice that the new critical value based on the t-distribution is larger, reflecting the uncertainty when estimating $\sigma$ from $s$. Now we can see that the calculated test statistic is still larger than the critical value.
```{r}
abs(t_score) > abs(t_crit)
```
The following graphics shows that the calculated test statistic (red line) falls into the rejection region so that in our example, we would reject the null hypothesis that the true population mean is equal to $10$.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
p1 <- 0.025
p2 <- 0.975
min <- -6
max <- 6
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
geom_vline(xintercept = t_score, color = 'red', size=1) +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = "Theoretical density given null hypothesis 10 and sample t-statistic",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
```
**Decision:** Reject $H_0$, given that the calculated test statistic is larger than critical value.
Something to keep in mind here is the fact that the test statistic is a function of the sample size. Thus, as $n$ gets large, the test statistic gets larger as well and we are more likely to find a significant effect. This reflects the decrease in uncertainty about the true population mean as our sample size increases.
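To illustrate this point, the following sketch (purely hypothetical, holding the observed mean difference and the sample standard deviation fixed) shows how the t-statistic would grow with increasing sample size:
```{r}
n_sim <- c(10, 50, 200, 1000) # hypothetical sample sizes
(mean_sample - H_0)/(sd(student_sample)/sqrt(n_sim)) # resulting t-statistics
```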
#### P-values
In the previous section, we computed the test statistic, which tells us how close our sample is to the null hypothesis. The p-value corresponds to the probability that the test statistic would take a value as extreme or more extreme than the one that we actually observed, **assuming that the null hypothesis is true**. It is important to note that this is a **conditional probability**: we compute the probability of observing a sample mean (or a more extreme value) conditional on the assumption that the null hypothesis is true. The ```pt()``` function can be used to compute this probability. It is the cumulative distribution function of the t-distribution. Cumulative probability means that the function returns the probability that the test statistic will take a value **less than or equal to** the calculated test statistic, given the degrees of freedom. However, we are interested in obtaining the probability of observing a test statistic **larger than or equal to** the calculated test statistic under the null hypothesis (i.e., the p-value). Thus, we need to subtract the cumulative probability from 1. In addition, since we are running a two-sided test, we need to multiply the probability by 2 to account for the rejection region at the other side of the distribution.
```{r}
p_value <- 2*(1-pt(abs(t_score), df = df))
p_value
```
This value corresponds to the probability of observing a sample mean at least as extreme as the one we obtained, if the null hypothesis was true. As you can see, this probability is very low. A small p-value signals that it is unlikely to observe the calculated test statistic under the null hypothesis. To decide whether or not to reject the null hypothesis, we would now compare this value to the level of significance ($\alpha$) that we chose for our test. For this example, we adopt the widely accepted significance level of 5%, so any test result with a p-value < 0.05 would be deemed statistically significant. Note that the p-value is directly related to the value of the test statistic. The relationship is such that the higher (lower) the absolute value of the test statistic, the lower (higher) the p-value.
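This inverse relationship between the test statistic and the p-value can be illustrated with a few hypothetical t-values:
```{r}
2*(1 - pt(c(1, 2, 3, 4), df = df)) # p-values for increasing (hypothetical) t-values
```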
**Decision:** Reject $H_0$, given that the p-value is smaller than 0.05.
#### Confidence interval
For a given statistic calculated for a sample of observations (e.g., listening times), a 95% confidence interval can be constructed such that in 95% of samples, the true value of the population mean will fall within its limits. If the parameter value specified in the null hypothesis (here $10$) does not lie within the bounds, we reject $H_0$. Building on what we learned about confidence intervals in the previous chapter, the 95% confidence interval based on the t-distribution can be computed as follows:
$$
CI_{lower} = {\bar x} - t_{1-{\alpha \over 2}} * SE_{\bar x} \\
CI_{upper} = {\bar x} + t_{1-{\alpha \over 2}} * SE_{\bar x}
$$
It is easy to compute this interval manually:
```{r message=FALSE, warning=FALSE}
ci_lower <- (mean_sample)-qt(0.975, df = df)*SE
ci_upper <- (mean_sample)+qt(0.975, df = df)*SE
ci_lower
ci_upper
```
The interpretation of this interval is as follows: if we (hypothetically) took 100 samples and calculated the mean and confidence interval for each of them, then the true population mean would be included in approximately 95% of these intervals. The CI is informative when reporting the result of your test, since it provides an estimate of the uncertainty associated with the test result. From the test statistic or the p-value alone, it is not easy to judge in which range the true population parameter is located. The CI provides an estimate of this range.
**Decision:** Reject $H_0$, given that the parameter value from the null hypothesis ($10$) is not included in the interval.
To summarize, you can see that we arrive at the same conclusion (i.e., reject $H_0$), irrespective of whether we use the test statistic, the p-value, or the confidence interval. However, keep in mind that rejecting the null hypothesis does not prove the alternative hypothesis (we can merely provide support for it). Rather, think of the p-value as the chance of obtaining the data we've collected assuming that the null hypothesis is true. You should report the confidence interval to provide an estimate of the uncertainty associated with your test results.
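For reference, the built-in ```t.test()``` function carries out all of these computations in a single call; the t-statistic, p-value, and 95% confidence interval it reports should match the manual calculations above (up to rounding):
```{r}
t.test(student_sample, mu = H_0) # one-sample t-test against the hypothesized mean of 10
```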
### Choosing the right test
The test statistic, as we have seen, measures how close the sample is to the null hypothesis and often follows a well-known distribution (e.g., normal, t, or chi-square). To select the correct test, various factors need to be taken into consideration. Some examples are:
* On what scale are your variables measured (categorical vs. continuous)?
* Do you want to test for relationships or differences?
* If you test for differences, how many groups would you like to test?
* For parametric tests, are the assumptions fulfilled?
The previous discussion used a **one sample t-test** as an example, which requires that the variable is measured on an interval or ratio scale. If you are confronted with other settings, the following flow chart provides a rough guideline on selecting the correct test:
![Flowchart for selecting an appropriate test (source: McElreath, R. (2016): Statistical Rethinking, p. 2)](https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/testselection.JPG)
For a detailed overview of the different types of tests, please also refer to <a href="https://stats.idre.ucla.edu/other/mult-pkg/whatstat/" target="_blank">this overview</a> by UCLA.
#### Parametric vs. non-parametric tests
A basic distinction can be made between parametric and non-parametric tests. **Parametric tests** require that variables are measured on an interval or ratio scale and that the sampling distribution follows a known distribution. **Non-parametric tests**, on the other hand, do not require the sampling distribution to be normally distributed (they are also known as "assumption-free tests"). These tests may be used when the variable of interest is measured on an ordinal scale or when the parametric assumptions do not hold. They often rely on ranking the data instead of analyzing the actual scores. By ranking the data, information on the magnitude of differences is lost. Thus, parametric tests are more powerful if the sampling distribution is normally distributed. In this chapter, we will first focus on parametric tests and cover non-parametric tests later.
#### One-tailed vs. two-tailed test
For some tests you may choose between a **one-tailed test** and a **two-tailed test**. The choice depends on the hypothesis you specified, i.e., whether you specified a directional or a non-directional hypothesis. In the example above, we used a **non-directional hypothesis**. That is, we stated that the mean is different from the comparison value $\mu_0$, but we did not state the direction of the effect. A **directional hypothesis** states the direction of the effect. For example, we might test whether the population mean is smaller than a comparison value:
$$
H_0: \mu \ge \mu_0 \\
H_1: \mu < \mu_0
$$
Similarly, we could test whether the population mean is larger than a comparison value:
$$
H_0: \mu \le \mu_0 \\
H_1: \mu > \mu_0
$$
Connected to the decision of how to phrase the hypotheses (directional vs. non-directional) is the choice of a **one-tailed test** versus a **two-tailed test**. Let's first think about the meaning of a one-tailed test. Using a significance level of 0.05, a one-tailed test means that 5% of the total area under the probability distribution of our test statistic is located in one tail. Thus, under a one-tailed test, we test for the possibility of the relationship in one direction only, disregarding the possibility of a relationship in the other direction. In our example, a one-tailed test could test either whether the mean listening time is significantly larger or whether it is significantly smaller than the comparison value, but not both. Depending on the direction, the mean listening time is significantly larger (smaller) if the test statistic is located in the top (bottom) 5% of its probability distribution.
The following graph shows the critical values that our test statistic would need to surpass so that the difference between the population mean and the comparison value would be deemed statistically significant.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig2, fig.align="center", fig.height = 3, fig.width = 10}
library(cowplot)
library(gridExtra)
library(grid)
df <- n-1
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("Two-sided test"),
subtitle = "0.025 of total area on each side; df = 49",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.950
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (right)"),
subtitle = "0.05 of total area on the right; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.050
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (left)"),
subtitle = "0.05 of total area on the left; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
p <- plot_grid(plot3,plot1, plot2, ncol = 3)
print(p)
```
It can be seen that under a one-sided test, the rejection region is located at one end of the distribution only, whereas in a two-sided test, the rejection region is split between the two tails. As a consequence, the critical value of the test statistic is smaller for a one-tailed test, meaning that it has more power to detect an effect. Having said that, in most applications we would like to be able to catch effects in both directions, simply because we can often not rule out that an effect might exist in the direction opposite to the one hypothesized. For example, if we conducted a one-tailed test for a mean larger than some specified value but the mean turned out to be substantially smaller, then testing the one-directional hypothesis ($H_0: \mu \le \mu_0$) would not allow us to conclude that there is a significant effect, because there is no rejection region at that end of the distribution.
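The difference in critical values can also be seen directly by comparing the relevant quantiles of the t-distribution for the one-sided and the two-sided case:
```{r}
qt(0.95, df = n - 1)  # one-sided critical value
qt(0.975, df = n - 1) # two-sided critical value
```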
### Summary
As we have seen, the process of hypothesis testing consists of various steps:
1. Formulate null and alternative hypotheses
2. Select an appropriate test
3. Choose the level of significance ($\alpha$)
4. Descriptive statistics and data visualization
5. Conduct significance test
6. Report results and draw a marketing conclusion
# Chi-square test
This chapter is primarily based on Field, A., Miles J., & Field, Z. (2012): Discovering Statistics Using R. Sage Publications, **chapter 18**.
In some instances, you will be confronted with differences between proportions, rather than differences between means. For example, you may conduct an A/B-Test and wish to compare the conversion rates between two advertising campaigns. In this case, your data is binary (0 = no conversion, 1 = conversion) and the sampling distribution for such data is binomial. While binomial probabilities are difficult to calculate, we can use a Normal approximation to the binomial when ```n``` is large (>100) and the true likelihood of a 1 is not too close to 0 or 1.
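As a brief illustration of this approximation (with made-up numbers), the exact binomial probability and its normal approximation are reasonably close for large ```n```:
```{r}
# P(X <= 110) for X ~ Binomial(n = 300, p = 1/3): exact vs. normal approximation
pbinom(110, size = 300, prob = 1/3)
pnorm(110, mean = 300*(1/3), sd = sqrt(300*(1/3)*(2/3)))
```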
Let's use an example: assume a call center where service agents call potential customers to sell a product. We consider two call center agents:
* Service agent 1 talks to 300 customers and gets 200 of them to buy (conversion rate=2/3)
* Service agent 2 talks to 300 customers and gets 100 of them to buy (conversion rate=1/3)
As always, we load the data first:
[You can download the corresponding R-Code here](./Code/10-categorical_data (1).R)
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
call_center <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/call_center.dat",
sep = "\t",
header = TRUE) #read in data
call_center$conversion <- factor(call_center$conversion , levels = c(0:1), labels = c("no", "yes")) #convert to factor
call_center$agent <- factor(call_center$agent , levels = c(0:1), labels = c("agent_1", "agent_2")) #convert to factor
```
Next, we create a table to check the relative frequencies:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
rel_freq_table <- as.data.frame(prop.table(table(call_center), 2)) #conditional relative frequencies
rel_freq_table
```
We could also plot the data to visualize the frequencies using ggplot:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "proportion of conversions per agent (stacked bar chart)"}
ggplot(rel_freq_table, aes(x = agent, y = Freq, fill = conversion)) + #plot data
geom_col(width = .7) + #position
geom_text(aes(label = paste0(round(Freq*100,0),"%")), position = position_stack(vjust = 0.5), size = 4) + #add percentages
ylab("Proportion of conversions") + xlab("Agent") + # specify axis labels
theme_bw()
```
... or using the ```mosaicplot()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "proportion of conversions per agent (mosaic plot)"}
contigency_table <- table(call_center)
mosaicplot(contigency_table, main = "Proportion of conversions by agent")
```
## Confidence intervals for proportions
Recall that we can use confidence intervals to determine the range of values that the true population parameter will take with a certain level of confidence based on the sample. Similar to the confidence interval for means, we can compute a confidence interval for proportions. The (1-$\alpha$)% confidence interval for proportions is approximately
$$
CI = p\pm z_{1-\frac{\alpha}{2}}*\sqrt{\frac{p*(1-p)}{N}}
$$
where $\sqrt{p(1-p)}$ is the equivalent to the standard deviation in the formula for the confidence interval for means. Based on the equation, it is easy to compute the confidence intervals for the conversion rates of the call center agents:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
n1 <- nrow(subset(call_center,agent=="agent_1")) #number of observations for agent 1
n2 <- nrow(subset(call_center,agent=="agent_2")) #number of observations for agent 2
n1_conv <- nrow(subset(call_center,agent=="agent_1" & conversion=="yes")) #number of conversions for agent 1
n2_conv <- nrow(subset(call_center,agent=="agent_2" & conversion=="yes")) #number of conversions for agent 2
p1 <- n1_conv/n1 #proportion of conversions for agent 1
p2 <- n2_conv/n2 #proportion of conversions for agent 2
error1 <- qnorm(0.975)*sqrt((p1*(1-p1))/n1)
ci_lower1 <- p1 - error1
ci_upper1 <- p1 + error1
ci_lower1
ci_upper1
error2 <- qnorm(0.975)*sqrt((p2*(1-p2))/n2)
ci_lower2 <- p2 - error2
ci_upper2 <- p2 + error2
ci_lower2
ci_upper2
```
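Note that R also provides built-in functions that report a confidence interval for a single proportion. These use different methods (the Wilson score interval and the exact Clopper-Pearson interval, respectively), so the bounds will deviate slightly from the Wald intervals computed by hand above:
```{r}
prop.test(x = n1_conv, n = n1)$conf.int  # Wilson score interval (with continuity correction)
binom.test(x = n1_conv, n = n1)$conf.int # exact (Clopper-Pearson) interval
```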
Similar to testing for differences in means, we could also ask: is the conversion rate of agent 1 significantly different from that of agent 2? Or, to state it formally:
$$H_0: \pi_1=\pi_2 \\
H_1: \pi_1\ne \pi_2$$
where $\pi$ denotes the population parameter associated with the proportion in the respective population. One approach to test this is based on confidence intervals to estimate the difference between two populations. We can compute an approximate confidence interval for the difference between the proportion of successes in group 1 and group 2, as:
$$
CI = p_1-p_2\pm z_{1-\frac{\alpha}{2}}*\sqrt{\frac{p_1*(1-p_1)}{n_1}+\frac{p_2*(1-p_2)}{n_2}}
$$
If the confidence interval includes zero, then the data does not suggest a difference between the groups. Let's compute the confidence interval for differences in the proportions by hand first:
```{r}
ci_lower <- p1 - p2 - qnorm(0.975)*sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2) #95% CI lower bound
ci_upper <- p1 - p2 + qnorm(0.975)*sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2) #95% CI upper bound
ci_lower
ci_upper
```
Now we can see that the 95% confidence interval estimate of the difference between the proportion of conversions for agent 1 and the proportion of conversions for agent 2 is between `r round(ci_lower*100,0)`% and `r round(ci_upper*100,0)`%. This interval tells us the range of plausible values for the difference between the two population proportions. According to this interval, zero is not a plausible value for the difference (i.e., interval does not cross zero), so we reject the null hypothesis that the population proportions are the same.
Instead of computing the intervals by hand, we could also use the ```prop.test()``` function:
```{r}
prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)
```
Note that the ```prop.test()``` function uses a slightly different (more accurate) way to compute the confidence interval (Wilson's score method). It provides a better approximation, particularly for smaller sample sizes. That is why the confidence interval in the output deviates slightly from the manual computation above, which uses the Wald interval.
You can also see that the output from the ```prop.test()``` includes the results from a χ<sup>2</sup> test for the equality of proportions (which will be discussed below) and the associated p-value. Since the p-value is less than 0.05, we reject the null hypothesis of equal probability. Thus, the reporting would be:
The test showed that the conversion rate for agent 1 was higher by `r round(((prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$estimate[1])-(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$estimate[2]))*100,0)`%. This difference is significant, χ<sup>2</sup>(1) = 70, p < .05 (95% CI = [`r round(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$conf.int[1],2)`,`r round(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$conf.int[2],2)`]).
## Chi-square test
In the previous section, we saw how we can compute the confidence interval for the difference between proportions to decide on whether or not to reject the null hypothesis. Whenever you would like to investigate the relationship between two categorical variables, the $\chi^2$ test may be used to test whether the variables are independent of each other. It achieves this by comparing the expected number of observations in a group to the actual values. Let's continue with the example from the previous section. Under the null hypothesis, the two variables *agent* and *conversion* in our contingency table are independent (i.e., there is no relationship). This means that the frequency in each field will be roughly proportional to the probability of an observation being in that category, calculated under the assumption that they are independent. The difference between that expected quantity and the actual quantity can be used to construct the test statistic. The test statistic is computed as follows:
$$
\chi^2=\sum_{i=1}^{J}\frac{(f_o-f_e)^2}{f_e}
$$
where $J$ is the number of cells in the contingency table, $f_o$ are the observed cell frequencies and $f_e$ are the expected cell frequencies. The larger the differences, the larger the test statistic and the smaller the p-value.
The observed cell frequencies can easily be seen from the contingency table:
```{r message=FALSE, warning=FALSE}
obs_cell1 <- contigency_table[1,1]
obs_cell2 <- contigency_table[1,2]
obs_cell3 <- contigency_table[2,1]
obs_cell4 <- contigency_table[2,2]
```
The expected cell frequencies can be calculated as follows:
$$
f_e=\frac{(n_r*n_c)}{n}
$$
where $n_r$ are the total observed frequencies per row, $n_c$ are the total observed frequencies per column, and $n$ is the total number of observations. Thus, the expected cell frequencies under the assumption of independence can be calculated as:
```{r message=FALSE, warning=FALSE}
n <- nrow(call_center)
exp_cell1 <- (nrow(call_center[call_center$agent=="agent_1",])*nrow(call_center[call_center$conversion=="no",]))/n
exp_cell2 <- (nrow(call_center[call_center$agent=="agent_1",])*nrow(call_center[call_center$conversion=="yes",]))/n
exp_cell3 <- (nrow(call_center[call_center$agent=="agent_2",])*nrow(call_center[call_center$conversion=="no",]))/n
exp_cell4 <- (nrow(call_center[call_center$agent=="agent_2",])*nrow(call_center[call_center$conversion=="yes",]))/n
```
To sum up, these are the expected cell frequencies
```{r message=FALSE, warning=FALSE, paged.print = FALSE}
data.frame(conversion_no = rbind(exp_cell1,exp_cell3),conversion_yes = rbind(exp_cell2,exp_cell4), row.names = c("agent_1","agent_2"))
```
... and these are the observed cell frequencies
```{r message=FALSE, warning=FALSE, paged.print = FALSE}
data.frame(conversion_no = rbind(obs_cell1,obs_cell2),conversion_yes = rbind(obs_cell3,obs_cell4), row.names = c("agent_1","agent_2"))
```
To obtain the test statistic, we simply plug the values into the formula:
```{r message=FALSE, warning=FALSE}
chisq_cal <- sum(((obs_cell1 - exp_cell1)^2/exp_cell1),
((obs_cell2 - exp_cell2)^2/exp_cell2),
((obs_cell3 - exp_cell3)^2/exp_cell3),
((obs_cell4 - exp_cell4)^2/exp_cell4))
chisq_cal
```
The test statistic is $\chi^2$ distributed. The chi-square distribution is a non-symmetric distribution. Actually, there are many different chi-square distributions, one for each degree of freedom, as shown in the following figure.
```{r echo = F, message=FALSE, warning=FALSE, eval=T, fig.align="center", fig.cap = "The chi-square distribution"}
library(ggplot2)
a <- seq(2,10, 2)
ggplot(data.frame(x=c(0,20)), aes(x))+
stat_function(fun = dchisq, args = list(8), aes(colour = '8'))+
stat_function(fun = dchisq, args = list(1), aes(colour = '1'))+
stat_function(fun = dchisq, args = list(2), aes(colour = '2'))+
stat_function(fun = dchisq, args = list(4), aes(colour = '4'))+
stat_function(fun = dchisq, args = list(6), aes(colour = '6'))+
stat_function(fun = dchisq, args = list(15), aes(colour = '15'))+
ylim(min=0, max=0.6) +
labs(colour = 'Degrees of Freedom', x = 'Value', y = 'Density') + theme_bw()
```
You can see that as the degrees of freedom increase, the chi-square curve approaches a normal distribution. To find the critical value, we need to specify the corresponding degrees of freedom, given by:
$$
df=(r-1)*(c-1)
$$
where $r$ is the number of rows and $c$ is the number of columns in the contingency table. Recall that the degrees of freedom are generally the number of values that can vary freely when calculating a statistic. In a 2 x 2 table, as in our case, we have two variables with two levels each, and for each variable only one value can vary freely once the totals are fixed. Hence, in our example the degrees of freedom can be calculated as:
```{r message=FALSE, warning=FALSE}
df <- (nrow(contigency_table) - 1) * (ncol(contigency_table) -1)
df
```
Now, we can derive the critical value given the degrees of freedom and the level of confidence using the ```qchisq()``` function and test if the calculated test statistic is larger than the critical value:
```{r message=FALSE, warning=FALSE}
chisq_crit <- qchisq(0.95, df)
chisq_crit
chisq_cal > chisq_crit
```
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Visual depiction of the test result"}
df <- 1
p <- 0.95
min <- 0
max <- 15
chsq1 <- round(qchisq(p,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dchisq, args = list(df))+
stat_function(fun = dchisq, args = list(df), xlim = c(qchisq(p,df=df),max), geom = "area") +
scale_x_continuous(breaks = c(0, chsq1, chisq_cal)) +
geom_vline(xintercept = chisq_cal, color = "red") +
labs(title = paste0("Result of chi-square test: reject H0"),
subtitle = paste0("Red line: Calculated test statistic;"," Black area: Rejection region"),
x = "x", y = "Density") +
theme(legend.position="none") +
theme_bw()
plot1
```
We could also compute the p-value using the ```pchisq()``` function, which tells us the probability of observing a test statistic as large as (or larger than) the calculated one if the null hypothesis was true (i.e., if there was no association):
```{r message=FALSE, warning=FALSE}
p_val <- 1-pchisq(chisq_cal,df)
p_val
```
The test statistic can also be calculated in R directly on the contingency table with the function ```chisq.test()```.
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table, correct = FALSE)
```
Since the p-value is smaller than 0.05 (i.e., the calculated test statistic is larger than the critical value), we reject H<sub>0</sub> that the two variables are independent.
Note that the test statistic is sensitive to the sample size. To see this, let's assume that we have a sample of 100 observations instead of 1000 observations:
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table/10, correct = FALSE)
```
You can see that even though the proportions haven't changed, the test is no longer significant. The following equation lets you compute a measure of the effect size, which is insensitive to sample size:
$$
\phi=\sqrt{\frac{\chi^2}{n}}
$$
The following guidelines are used to determine the magnitude of the effect size (Cohen, 1988):
* 0.1 (small effect)
* 0.3 (medium effect)
* 0.5 (large effect)
In our example, we can compute the effect sizes for the large and small samples as follows:
```{r message=FALSE, warning=FALSE}
test_stat <- chisq.test(contigency_table, correct = FALSE)$statistic
phi1 <- sqrt(test_stat/n)
test_stat <- chisq.test(contigency_table/10, correct = FALSE)$statistic
phi2 <- sqrt(test_stat/(n/10))
phi1
phi2
```
You can see that the effect size measure is insensitive to the sample size.
Note that the Φ coefficient is appropriate for two dichotomous variables (resulting from a 2 x 2 table as above). If any of your nominal variables has more than two categories, Cramér's V should be used instead:
$$
V=\sqrt{\frac{\chi^2}{n*df_{min}}}
$$
where $df_{min}$ refers to the degrees of freedom associated with the variable that has fewer categories (e.g., if we have two nominal variables with 3 and 4 categories, $df_{min}$ would be 3 - 1 = 2). The degrees of freedom need to be taken into account when judging the magnitude of the effect sizes (see e.g., <a href="http://www.real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/" target="_blank">here</a>).
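Although not strictly needed for our 2 x 2 table (where Cramér's V reduces to the Φ coefficient, since $df_{min} = 1$), a minimal sketch of the computation based on the formula above could look as follows:
```{r}
chisq_stat <- chisq.test(contigency_table, correct = FALSE)$statistic
df_min <- min(nrow(contigency_table), ncol(contigency_table)) - 1 # df of the variable with fewer categories
sqrt(chisq_stat/(nrow(call_center)*df_min)) # Cramer's V
```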
Note that the ```correct = FALSE``` argument above ensures that the test statistic is computed in the same way as we have done by hand above. By default, ```chisq.test()``` applies a correction to prevent overestimation of statistical significance in small samples (Yates' correction). The correction is implemented by subtracting the value 0.5 from the computed difference between the observed and expected cell counts in the numerator of the test statistic. This means that the calculated test statistic will be smaller (i.e., more conservative). Although the adjustment may go too far in some instances, you should generally rely on the adjusted results, which can be computed as follows:
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table)
```
As you can see, the results don't change much in our example, since the differences between the observed and expected cell frequencies are fairly large relative to the correction.
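To see what the correction does, a minimal sketch of the corrected statistic computed by hand (assuming, as is the case here, that each observed-expected difference exceeds 0.5) is shown below; it should match the statistic reported by ```chisq.test()``` with the default correction:
```{r}
chisq_yates <- sum(((abs(obs_cell1 - exp_cell1) - 0.5)^2/exp_cell1),
                   ((abs(obs_cell2 - exp_cell2) - 0.5)^2/exp_cell2),
                   ((abs(obs_cell3 - exp_cell3) - 0.5)^2/exp_cell3),
                   ((abs(obs_cell4 - exp_cell4) - 0.5)^2/exp_cell4))
chisq_yates
```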
Caution is warranted when the cell counts in the contingency table are small. The usual rule of thumb is that all expected cell counts should be at least 5 (this may be a little too stringent though). When some cell counts are too small, you can use Fisher's exact test via the ```fisher.test()``` function.
```{r message=FALSE, warning=FALSE}
fisher.test(contigency_table)
```
The Fisher test, while more conservative, also shows a significant difference between the proportions (p < 0.05). This is not surprising since the cell counts in our example are fairly large.
## Sample size
To **calculate the required sample size** when comparing proportions, the ```power.prop.test()``` function can be used. For example, we could ask how large our sample needs to be if we would like to compare two groups with conversion rates of 2% and 2.5%, respectively using the conventional settings for $\alpha$ and $\beta$:
```{r}
power.prop.test(p1=0.02,p2=0.025,sig.level=0.05,power=0.8)
```
The output tells us that we need `r round(power.prop.test(p1=0.02,p2=0.025,sig.level=0.05,power=0.8)$n,0)` observations per group to detect a difference of the desired size.
## Summary
* Confidence interval for proportions: the interval that shows the range of plausible values for the difference between the two population proportions. If the interval does not cross zero, then zero is not a plausible value for the difference between two proportions, and we reject the null hypothesis that the population proportions are the same.
* Application of the Chi-square test: when there are 2 categorical variables (ordinal or nominal). In our example: conversion (yes or no) and agent (1 or 2).
* Hypothesis: Is there an association between categorical variable X (e.g. conversion) and categorical variable Y (e.g. agent)?
H0: There is no association between the two variables.
H1: There is an association between the two variables.
* Interpretation: If we find an association between the two variables (p < 0.05), we can conclude that the proportion of conversions differs significantly between agent 1 and agent 2.
* Example of application in marketing: The marketing department of a firm producing toilet paper is interested in studying consumer behavior in the context of purchase decisions. The company is planning to launch a high-quality, super-soft toilet paper, but beforehand decides to conduct research. It wants to know whether the income level of its consumers influences their choice of toilet paper quality. Currently, the company's consumers are assigned to 4 income categories, and there are 4 different levels of toilet paper quality (a 4 x 4 table).
# Correlation
Before we start with regression analysis, we will review the basic concept of correlation first. Correlation helps us to determine the degree to which the variation in one variable, X, is related to the variation in another variable, Y.
## Correlation coefficient
The correlation coefficient summarizes the strength of the linear relationship between two metric (interval or ratio scaled) variables. Let's consider a simple example. Say you conduct a survey to investigate the relationship between the attitude towards a city and the duration of residency. The "Attitude" variable can take values between 1 (very unfavorable) and 12 (very favorable), and the "duration of residency" is measured in years. Let's further assume for this example that the attitude measurement represents an interval scale (although it is usually not realistic to assume that the scale points on an itemized rating scale have the same distance). To keep it simple, let's further assume that you only asked 12 people. We can create a short data set like this:
```{r, include=FALSE}
library(openssl)
passphrase <- charToRaw("MRDAnils")
key <- sha256(passphrase)
url <- "https://raw.githubusercontent.com/IMSMWU/mrda_data_pub/master/secret-music_data.rds"
download.file(url, "./data/secret_music_data.rds", method = "auto", quiet=FALSE)
encrypted_music_data <- readRDS("./data/secret_music_data.rds")
music_data <- unserialize(aes_cbc_decrypt(encrypted_music_data, key = key))
```
```{r, eval=FALSE}
readRDS("music_data.rds")
```
```{r message=FALSE, warning=FALSE}
s.genre <- c("pop","hip hop","rock","rap","indie")
music_data <- subset(music_data, top.genre %in% s.genre)
music_data$genre_cat <- as.factor(music_data$top.genre)
music_data$explicit_cat <- factor(music_data$explicit, levels = c(0:1),
labels = c("not explicit", "explicit"))
head(music_data)
```
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=FALSE,paged.print = FALSE}
#library(psych)
#attitude <- c(6,9,8,3,10,4,5,2,11,9,10,2)
#duration <- c(10,12,12,4,12,6,8,2,18,9,17,2)
#att_data <- data.frame(attitude, duration)
#att_data <- att_data[order(-attitude), ]
#att_data$respodentID <- c(1:12)
#str(att_data)
#psych::describe(att_data[, c("attitude","duration")])
#att_data
```
Let's look at the data. The following graph shows the individual data points for the "duration of residency" variable, where the blue horizontal line represents the mean of the variable (`r round(mean(music_data$energy, na.rm=T),2)`) and the vertical lines show the distance of the individual data points from the mean.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Distribution of the energy variable"}
library(ggplot2)
ggplot(music_data) +
  geom_histogram(aes(energy), bins = 50, fill = 'white', color = 'deepskyblue4') +
  geom_vline(aes(xintercept = mean(energy, na.rm=T)), color='black', size=1) +
  labs(title = "Histogram of energy",
       y = 'Number of observations',
       x = 'Energy') +
  theme_bw()
```
You can see that some songs have an above average energy score and some songs have a below average energy score. Let's do the same for the second variable ("loudness"):
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Distribution of the loudness variable"}
ggplot(music_data) +
  geom_histogram(aes(loudness), bins = 50, fill = 'white', color = '#f9756d') +
  geom_vline(aes(xintercept = mean(loudness, na.rm=T)), color='black', size=1) +
  labs(title = "Histogram of loudness",
       y = 'Number of observations',
       x = 'Loudness') +
  theme_bw()
```
Again, we can see that some songs are above the average loudness and some are below. Let's now plot the two variables against each other to see whether there is some co-movement:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Scatterplot of energy and loudness"}
ggplot(music_data, aes(x = energy, y = loudness)) +
  geom_point() +
  labs(x = "Energy", y = "Loudness") +
  theme_bw()
```
We can see that there is indeed some co-movement here. The variables <b>covary</b> because songs with an above (below) average energy score also tend to have an above (below) average loudness and vice versa. Correlation helps us to quantify this relationship. Before you compute the correlation coefficient, you should always inspect the data first; a scatterplot like the one above is the standard way of visualizing the relationship between two metric variables.
How can we compute the correlation coefficient? Remember that the variance measures the average squared deviation of a variable from its mean:
\begin{equation}
\begin{split}
s_x^2&=\frac{\sum_{i=1}^{N} (X_i-\overline{X})^2}{N-1} \\
&= \frac{\sum_{i=1}^{N} (X_i-\overline{X})*(X_i-\overline{X})}{N-1}
\end{split}
(\#eq:variance)
\end{equation}
When we consider two variables, we multiply the deviation for one variable by the respective deviation for the second variable:
<p style="text-align:center;">
$(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
This is called the cross-product deviation. Then we sum the cross-product deviations:
<p style="text-align:center;">
$\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
... and compute the average of the sum of all cross-product deviations to get the <b>covariance</b>:
\begin{equation}
Cov(x, y) =\frac{\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})}{N-1}
(\#eq:covariance)
\end{equation}
You can easily compute the covariance manually as follows:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
# keep only observations with non-missing values on both variables
obs <- music_data[complete.cases(music_data[, c("energy", "loudness")]), ]
x <- obs$energy
x_bar <- mean(x)
y <- obs$loudness
y_bar <- mean(y)
N <- nrow(obs) # number of complete observations
cov <- sum((x - x_bar) * (y - y_bar)) / (N - 1)
cov
```
Or you simply use the built-in ```cov()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cov(x, y) # apply the cov function
```
A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).
However, the size of the covariance depends on the scale of measurement: larger scale units lead to a larger covariance, which makes covariances hard to compare across variable pairs. To remove this dependence on the measurement scale, we standardize the covariance (similar to how we compute z-scores) by dividing it by the product of the standard deviations of the two variables. The resulting standardized covariance is known as the correlation coefficient r:
\begin{equation}
r=\frac{Cov_{xy}}{s_x*s_y}
(\#eq:corcoeff)
\end{equation}
This is known as the product moment correlation (r) and it is straightforward to compute:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
x_sd <- sd(music_data$energy, na.rm = T)
y_sd <- sd(music_data$loudness, na.rm = T)
r <- cov/(x_sd*y_sd)
r
```
Or you could just use the ```cor()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor(music_data[, c("energy", "loudness")], method = "pearson", use = "complete")
```
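As noted above, the covariance depends on the units in which the variables are measured, while the correlation coefficient does not. A quick illustration using the `x` and `y` vectors defined earlier (multiplying energy by 100 is just an arbitrary rescaling for this sketch):
```{r message=FALSE, warning=FALSE}
cov(x * 100, y) # rescaling x inflates the covariance by the same factor
cor(x * 100, y) # ... but leaves the correlation coefficient unchanged
cor(x, y)
```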
Properties of r (illustrated with a short simulation below):

* ranges from -1 to +1
* +1 indicates a perfect positive linear relationship
* -1 indicates a perfect negative linear relationship
* 0 indicates no linear relationship
* ± .1 represents a small effect
* ± .3 represents a medium effect
* ± .5 represents a large effect
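To build some intuition for these properties, here is a short simulated example (the names `sim_x` and `noise` are just illustrative): a variable is perfectly positively correlated with itself, perfectly negatively correlated with its negative, and, apart from sampling noise, uncorrelated with an independent random variable.
```{r message=FALSE, warning=FALSE}
set.seed(123)       # for reproducibility
sim_x <- rnorm(100) # simulated metric variable
noise <- rnorm(100) # independent random variable
cor(sim_x, sim_x)   # +1: perfect positive linear relationship
cor(sim_x, -sim_x)  # -1: perfect negative linear relationship
cor(sim_x, noise)   # close to 0: no linear relationship
```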
## Significance testing
How can we determine whether our two variables are significantly related? To test this, we denote the population product moment correlation by ρ (rho) and test the null hypothesis of no linear relationship between the variables against the alternative that such a relationship exists:
$$H_0:\rho=0$$
$$H_1:\rho\ne0$$
The test statistic is:
\begin{equation}
t=\frac{r*\sqrt{N-2}}{\sqrt{1-r^2}}
(\#eq:cortest)
\end{equation}
The test statistic follows a t distribution with N - 2 degrees of freedom. We then follow the usual procedure: calculate the test statistic and compare it to the critical value of the underlying probability distribution. If the calculated test statistic is larger than the critical value, the null hypothesis of no relationship between X and Y is rejected.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
t_calc <- r*sqrt(N - 2)/sqrt(1 - r^2) # calculated test statistic
t_calc
df <- (N - 2) # degrees of freedom
t_crit <- qt(0.975, df) # critical value for alpha = 0.05 (two-sided)
t_crit
pt(q = t_calc, df = df, lower.tail = F) * 2 # two-sided p-value