```{r, echo=FALSE, warning=FALSE}
library(knitr)
#This code automatically tidies code so that it does not reach over the page
opts_chunk$set(tidy.opts=list(width.cutoff=50),tidy=TRUE, rownames.print = FALSE, rows.print = 10)
options(scipen = 999, digits = 7)
```
# Hypothesis testing
This chapter is primarily based on Field, A., Miles J., & Field, Z. (2012): Discovering Statistics Using R. Sage Publications, **chapter 5**.
[You can download the corresponding R-Code here](./Code/07-hypothesis_testing (2).R)
## Introduction
We test hypotheses because we are confined to taking samples – we rarely work with the entire population. In the previous chapter, we introduced the standard error (i.e., the standard deviation of the sample means across a large number of hypothetical samples) as an estimate of how well a particular sample represents the population. We also saw how we can construct confidence intervals around the sample mean $\bar x$ by computing $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$ (using $s$ as an estimate of $\sigma$) and calculating the 95% CI as $\bar x \pm 1.96 * SE_{\bar x}$. Although we do not know the true population mean ($\mu$), we might have a hypothesis about it, and this would tell us what the corresponding sampling distribution looks like. Based on the sampling distribution of the hypothesized population mean, we could then determine the probability of obtaining a given sample **assuming that the hypothesis is true**.
Let us again begin by assuming we know the entire population, using the example of music listening times among students from the previous chapter. As a reminder, the following plot shows the distribution of music listening times in the population of WU students.
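As a brief refresher, a minimal sketch of this computation (using a small, made-up vector of listening times purely for illustration) could look as follows:
```{r}
x <- c(12, 25, 8, 31, 17, 22, 9, 14, 28, 19) # hypothetical listening times (hours)
se <- sd(x)/sqrt(length(x))                  # standard error of the mean
mean(x) + c(-1, 1)*1.96*se                   # approximate 95% confidence interval
```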
```{r, message = FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(latex2exp)
set.seed(321)
hours <- rgamma(25000, shape = 2, scale = 10)
ggplot(data.frame(hours)) +
geom_histogram(aes(x = hours), bins = 30, fill='white', color='black') +
geom_vline(xintercept = mean(hours), size = 1) + theme_bw() +
labs(title = "Histogram of listening times",
subtitle = TeX(sprintf("Population mean ($\\mu$) = %.2f; population standard deviation ($\\sigma$) = %.2f",round(mean(hours),2),round(sd(hours),2))),
y = 'Number of students',
x = 'Hours')
```
In this example, the population mean ($\mu$) is equal to `r round(mean(hours),2)`, and the population standard deviation $\sigma$ is equal to `r round(sd(hours),2)`.
### The null hypothesis
Let us assume that we were planning to take a random sample of 50 students from this population and our hypothesis was that the mean listening time is equal to some specific value $\mu_0$, say $10$. This would be our **null hypothesis**. The null hypothesis refers to the statement that is being tested and is usually a statement of the status quo, one of no difference or no effect. In our example, the null hypothesis would state that there is no difference between the true population mean $\mu$ and the hypothesized value $\mu_0$ (in our example $10$), which can be expressed as follows:
$$
H_0: \mu = \mu_0
$$
When conducting research, we are usually interested in providing evidence against the null hypothesis. If we then observe sufficient evidence against it, our estimate is said to be significant. If the null hypothesis is rejected, this is taken as support for the **alternative hypothesis**. The alternative hypothesis assumes that some difference exists, which can be expressed as follows:
$$
H_1: \mu \neq \mu_0
$$
Accepting the alternative hypothesis in turn will often lead to changes in opinions or actions. Note that while the null hypothesis may be rejected, it can never be accepted based on a single test. If we fail to reject the null hypothesis, it means that we simply haven't collected enough evidence against the null hypothesis to disprove it. In classical hypothesis testing, there is no way to determine whether the null hypothesis is true. **Hypothesis testing** provides a means to quantify to what extent the data from our sample is in line with the null hypothesis.
In order to quantify the concept of "sufficient evidence" we look at the theoretical distribution of the sample means given our null hypothesis and the sample standard error. Using the available information we can infer the sampling distribution for our null hypothesis. Recall that the standard deviation of the sampling distribution (i.e., the standard error of the mean) is given by $\sigma_{\bar x}={\sigma \over \sqrt{n}}$, and thus can be computed as follows:
```{r}
mean_pop <- mean(hours)
sigma <- sd(hours) #population standard deviation
n <- 50 #sample size
standard_error <- sigma/sqrt(n) #standard error
standard_error
```
Since we know from the central limit theorem that the sampling distribution is normal for large enough samples, we can now visualize the expected sampling distribution **if our null hypothesis was in fact true** (i.e., if there was no difference between the true population mean and the hypothesized mean of 10).
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 4, fig.width = 8}
library(latex2exp)
H_0 <- 10
p1 <- 0.025
p2 <- 0.975
min <- 0
max <- 20
norm1 <- round(qnorm(p1), digits = 3)
norm2 <- round(qnorm(p2), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error)) +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(min, qnorm(p1, mean = H_0, sd = standard_error)), geom = "area") +
stat_function(fun = dnorm, args = list(mean = H_0, sd = standard_error), xlim = c(max, qnorm(p2, mean = H_0, sd = standard_error)), geom = "area") +
scale_x_continuous(breaks=c(0,qnorm(p1, mean = H_0, sd = standard_error),10,qnorm(p2, mean = H_0, sd = standard_error),20), labels=c("0",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p1))),"10",TeX(sprintf("%.2f $* \\sigma_{\\bar x}$",qnorm(p2))),"20")) +
labs(title = TeX(sprintf("Theoretical density given null hypothesis $\\mu_0=$ 10 ($\\sigma_{\\bar x}$ = %.2f)",standard_error)),x = "Hours", y = "Density") +
theme(legend.position="none") +
theme_bw()
```
We also know that 95% of the probability is within `r round(qnorm(p2),2)` standard deviations from the mean. Values further away from the mean than that are rather unlikely if our hypothesis about the population mean is indeed true. This is shown by the shaded areas, also known as the "rejection region". To test our hypothesis that the population mean is equal to $10$, let us take a random sample from the population.
```{r}
set.seed(12567)
H_0 <- 10
student_sample <- sample(1:25000, size = 50, replace = FALSE)
student_sample <- hours[student_sample]
mean_sample <- mean(student_sample)
ggplot(data.frame(student_sample)) +
geom_histogram(aes(x = student_sample), fill = 'white', color = 'black', bins = 20) +
theme_bw() + geom_vline(xintercept = mean(student_sample), color = 'black', size=1) +
labs(title = TeX(sprintf("Distribution of values in the sample ($n =$ %.0f, $\\bar{x] = $ %.2f, s = %.2f)",n,mean(student_sample),sd(student_sample))),x = "Hours", y = "Frequency")
```
The mean listening time in the sample (black line) $\bar x$ is `r round(mean(student_sample),2)`. We can already see from the graphic above that such a value is rather unlikely under the hypothesis that the population mean is $10$. Intuitively, such a result would therefore provide evidence against our null hypothesis. But how could we quantify specifically how unlikely it is to obtain such a value and decide whether or not to reject the null hypothesis? Significance tests can be used to provide answers to these questions.
### Statistical inference on a sample
#### Test statistic
##### z-scores
Let's go back to the sampling distribution above. We know that 95% of all values will fall within `r round(qnorm(p2),2)` standard deviations from the mean. So if we could express the distance between our sample mean and the null hypothesis in terms of standard deviations, we could make statements about the probability of getting a sample mean of the observed magnitude (or more extreme values). Essentially, we would like to know how many standard deviations ($\sigma_{\bar x}$) our sample mean ($\bar x$) is away from the population mean if the null hypothesis was true ($\mu_0$). This can be formally expressed as follows:
$$
\bar x- \mu_0 = z \sigma_{\bar x}
$$
In this equation, ```z``` will tell us how many standard deviations the sample mean $\bar x$ is away from the null hypothesis $\mu_0$. Solving for ```z``` gives us:
$$
z = {\bar x- \mu_0 \over \sigma_{\bar x}}={\bar x- \mu_0 \over \sigma / \sqrt{n}}
$$
This standardized value (or "z-score") is also referred to as a **test statistic**. Let's compute the test statistic for our example above:
```{r}
z_score <- (mean_sample - H_0)/(sigma/sqrt(n))
z_score
```
To make a decision on whether the difference can be deemed statistically significant, we now need to compare this calculated test statistic to a meaningful threshold. In order to do so, we need to decide on a significance level $\alpha$, which expresses the probability of finding an effect that does not actually exist (i.e., Type I Error). You can find a detailed discussion of this point at the end of this chapter. For now, we will adopt the widely accepted significance level of 5% and set $\alpha$ to 0.05. The critical value for the normal distribution and $\alpha$ = 0.05 can be computed using the ```qnorm()``` function as follows:
```{r}
z_crit <- qnorm(0.975)
z_crit
```
We use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Recall that for the normal distribution, 95% of the total probability falls within `r round(qnorm(0.975),2)` standard deviations of the mean, so that higher (absolute) values provide evidence against the null hypothesis. Generally, we speak of a statistically significant effect if the (absolute) calculated test statistic is larger than the (absolute) critical value. We can easily check if this is the case in our example:
```{r}
abs(z_score) > abs(z_crit)
```
Since the absolute value of the calculated test statistic is larger than the critical value, we would reject $H_0$ and conclude that the true population mean $\mu$ is significantly different from the hypothesized value $\mu_0 = 10$.
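As a quick sanity check of this two-sided logic: with the critical value placed at the 97.5th percentile, the probability mass beyond it in both tails combined is exactly 5%.
```{r}
2*(1 - pnorm(qnorm(0.975))) # total area in both tails beyond the critical value
```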
##### t-statistic
You may have noticed that the formula for the z-score above assumes that we know the true population standard deviation ($\sigma$) when computing the standard deviation of the sampling distribution ($\sigma_{\bar x}$) in the denominator. However, the population standard deviation is usually not known in the real world and therefore represents another unknown population parameter which we have to estimate from the sample. We saw in the previous chapter that we usually use $s$ as an estimate of $\sigma$ and $SE_{\bar x}$ as an estimate of $\sigma_{\bar x}$. Intuitively, to reflect this uncertainty about the true population standard deviation, we should be more conservative regarding the critical value that we used above to assess whether we have a significant effect. That is, the threshold for a "significant" effect should be higher to safeguard against falsely claiming a significant effect when there is none. If we replace $\sigma_{\bar x}$ by its estimate $SE_{\bar x}$ in the formula for the z-score, we get a new test statistic (i.e., the **t-statistic**) with its own distribution (the **t-distribution**):
$$
t = {\bar x- \mu_0 \over SE_{\bar x}}={\bar x- \mu_0 \over s / \sqrt{n}}
$$
Here, $\bar x$ denotes the sample mean and $s$ the sample standard deviation. The t-distribution has more probability in its "tails", i.e., farther away from the mean. This reflects the higher uncertainty introduced by replacing the population standard deviation with its sample estimate. Intuitively, this is particularly relevant for small samples, since the uncertainty about the true population parameters decreases with increasing sample size. This is reflected by the fact that the exact shape of the t-distribution depends on the **degrees of freedom**, which is the sample size minus one (i.e., $n-1$). To see this, the following graph shows the t-distribution with different degrees of freedom for a two-tailed test and $\alpha = 0.05$. The grey curve shows the normal distribution.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 8, fig.width = 8}
library(cowplot)
library(gridExtra)
library(grid)
df <- 5
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
stat_function(fun = dnorm, color = "grey") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ", df),x = "x", y = "Density") +
theme(legend.position="none") +
theme_bw()
df <- 10
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 100
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- 1000
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot4 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dnorm, color = "grey") +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("df= ",df),
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
p <- plot_grid(plot1, plot2, plot3, plot4, ncol = 2,
labels = c("A", "B","C","D"))
title <- ggdraw() + draw_label('Degrees of freedom and the t-distribution', fontface='bold')
p <- plot_grid(title, p, ncol=1, rel_heights=c(0.1, 1)) # rel_heights values control title margins
print(p)
```
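The convergence toward the normal distribution that is visible in the figure can also be checked numerically by comparing the two-sided critical values of the t-distribution for increasing degrees of freedom with the corresponding critical value of the standard normal distribution:
```{r}
qt(0.975, df = c(5, 10, 100, 1000)) # critical values of the t-distribution
qnorm(0.975)                        # critical value of the standard normal distribution
```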
Notice that as $n$ gets larger, the t-distribution gets closer and closer to the normal distribution, reflecting the fact that the uncertainty introduced by $s$ is reduced. To summarize, we now have an estimate for the standard deviation of the distribution of the sample mean (i.e., $SE_{\bar x}$) and an appropriate distribution that takes into account the necessary uncertainty (i.e., the t-distribution). Let us now compute the t-statistic according to the formula above:
```{r}
SE <- (sd(student_sample)/sqrt(n))
t_score <- (mean_sample - H_0)/SE
t_score
```
Notice that the value of the t-statistic is higher compared to the z-score (`r round(z_score,2)`). This can be attributed to the fact that by using $s$ as an estimate of $\sigma$, we underestimate the true population standard deviation. Hence, the critical value would need to be larger to adjust for this. This is what the t-distribution does. Let us compute the critical value from the t-distribution with ```n - 1``` degrees of freedom.
```{r}
df = n - 1
t_crit <- qt(0.975, df = df)
t_crit
```
Again, we use ```0.975``` and not ```0.95``` since we are running a two-sided test and need to account for the rejection region at the other end of the distribution. Notice that the new critical value based on the t-distribution is larger, reflecting the uncertainty when estimating $\sigma$ from $s$. Now we can see that the calculated test statistic is still larger than the critical value.
```{r}
abs(t_score) > abs(t_crit)
```
The following graphics shows that the calculated test statistic (red line) falls into the rejection region so that in our example, we would reject the null hypothesis that the true population mean is equal to $10$.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.height = 6, fig.width = 8}
p1 <- 0.025
p2 <- 0.975
min <- -6
max <- 6
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
geom_vline(xintercept = t_score, color = 'red', size=1) +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = "Theoretical density given null hypothesis 10 and sample t-statistic",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
```
**Decision:** Reject $H_0$, given that the calculated test statistic is larger than critical value.
Something to keep in mind here is the fact that the test statistic is a function of the sample size. Thus, as $n$ gets large, the test statistic gets larger as well and we are more likely to find a significant effect. This reflects the decrease in uncertainty about the true population mean as our sample size increases.
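To illustrate this point, the following sketch (purely hypothetical, holding the observed mean difference and the sample standard deviation fixed) shows how the t-statistic would grow with increasing sample size:
```{r}
n_sim <- c(10, 50, 200, 1000) # hypothetical sample sizes
(mean_sample - H_0)/(sd(student_sample)/sqrt(n_sim)) # resulting t-statistics
```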
#### P-values
In the previous section, we computed the test statistic, which tells us how close our sample is to the null hypothesis. The p-value corresponds to the probability that the test statistic would take a value as extreme or more extreme than the one that we actually observed, **assuming that the null hypothesis is true**. It is important to note that this is a **conditional probability**: we compute the probability of observing a sample mean (or a more extreme value) conditional on the assumption that the null hypothesis is true. The ```pt()``` function can be used to compute this probability. It is the cumulative distribution function of the t-distribution. Cumulative probability means that the function returns the probability that the test statistic will take a value **less than or equal to** the calculated test statistic, given the degrees of freedom. However, we are interested in obtaining the probability of observing a test statistic **larger than or equal to** the calculated test statistic under the null hypothesis (i.e., the p-value). Thus, we need to subtract the cumulative probability from 1. In addition, since we are running a two-sided test, we need to multiply the probability by 2 to account for the rejection region at the other side of the distribution.
```{r}
p_value <- 2*(1-pt(abs(t_score), df = df))
p_value
```
This value corresponds to the probability of observing a sample mean at least as extreme as the one we obtained, if the null hypothesis was true. As you can see, this probability is very low. A small p-value signals that it is unlikely to observe the calculated test statistic under the null hypothesis. To decide whether or not to reject the null hypothesis, we would now compare this value to the level of significance ($\alpha$) that we chose for our test. For this example, we adopt the widely accepted significance level of 5%, so any test result with a p-value < 0.05 would be deemed statistically significant. Note that the p-value is directly related to the value of the test statistic. The relationship is such that the higher (lower) the absolute value of the test statistic, the lower (higher) the p-value.
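This inverse relationship between the test statistic and the p-value can be illustrated with a few hypothetical t-values:
```{r}
2*(1 - pt(c(1, 2, 3, 4), df = df)) # p-values for increasing (hypothetical) t-values
```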
**Decision:** Reject $H_0$, given that the p-value is smaller than 0.05.
#### Confidence interval
For a given statistic calculated for a sample of observations (e.g., listening times), a 95% confidence interval can be constructed such that in 95% of samples, the true value of the population mean will fall within its limits. If the parameter value specified in the null hypothesis (here $10$) does not lie within the bounds, we reject $H_0$. Building on what we learned about confidence intervals in the previous chapter, the 95% confidence interval based on the t-distribution can be computed as follows:
$$
CI_{lower} = {\bar x} - t_{1-{\alpha \over 2}} * SE_{\bar x} \\
CI_{upper} = {\bar x} + t_{1-{\alpha \over 2}} * SE_{\bar x}
$$
It is easy to compute this interval manually:
```{r message=FALSE, warning=FALSE}
ci_lower <- (mean_sample)-qt(0.975, df = df)*SE
ci_upper <- (mean_sample)+qt(0.975, df = df)*SE
ci_lower
ci_upper
```
The interpretation of this interval is as follows: if we (hypothetically) took 100 samples and calculated the mean and confidence interval for each of them, then the true population mean would be included in approximately 95% of these intervals. The CI is informative when reporting the result of your test, since it provides an estimate of the uncertainty associated with the test result. From the test statistic or the p-value alone, it is not easy to judge in which range the true population parameter is located. The CI provides an estimate of this range.
**Decision:** Reject $H_0$, given that the parameter value from the null hypothesis ($10$) is not included in the interval.
To summarize, you can see that we arrive at the same conclusion (i.e., reject $H_0$), irrespective of whether we use the test statistic, the p-value, or the confidence interval. However, keep in mind that rejecting the null hypothesis does not prove the alternative hypothesis (we can merely provide support for it). Rather, think of the p-value as the chance of obtaining the data we've collected assuming that the null hypothesis is true. You should report the confidence interval to provide an estimate of the uncertainty associated with your test results.
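For reference, the built-in ```t.test()``` function carries out all of these computations in a single call; the t-statistic, p-value, and 95% confidence interval it reports should match the manual calculations above (up to rounding):
```{r}
t.test(student_sample, mu = H_0) # one-sample t-test against the hypothesized mean of 10
```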
### Choosing the right test
The test statistic, as we have seen, measures how close the sample is to the null hypothesis and often follows a well-known distribution (e.g., normal, t, or chi-square). To select the correct test, various factors need to be taken into consideration. Some examples are:
* On what scale are your variables measured (categorical vs. continuous)?
* Do you want to test for relationships or differences?
* If you test for differences, how many groups would you like to test?
* For parametric tests, are the assumptions fulfilled?
The previous discussion used a **one sample t-test** as an example, which requires that the variable is measured on an interval or ratio scale. If you are confronted with other settings, the following flow chart provides a rough guideline on selecting the correct test:
![Flowchart for selecting an appropriate test (source: McElreath, R. (2016): Statistical Rethinking, p. 2)](https://github.com/IMSMWU/Teaching/raw/master/MRDA2017/testselection.JPG)
For a detailed overview of the different types of tests, please also refer to <a href="https://stats.idre.ucla.edu/other/mult-pkg/whatstat/" target="_blank">this overview</a> by UCLA.
#### Parametric vs. non-parametric tests
A basic distinction can be made between parametric and non-parametric tests. **Parametric tests** require that variables are measured on an interval or ratio scale and that the sampling distribution follows a known distribution. **Non-parametric tests**, on the other hand, do not require the sampling distribution to be normally distributed (they are also known as "assumption-free tests"). These tests may be used when the variable of interest is measured on an ordinal scale or when the parametric assumptions do not hold. They often rely on ranking the data instead of analyzing the actual scores. By ranking the data, information on the magnitude of differences is lost. Thus, parametric tests are more powerful if the sampling distribution is normally distributed. In this chapter, we will first focus on parametric tests and cover non-parametric tests later.
#### One-tailed vs. two-tailed test
For some tests you may choose between a **one-tailed test** and a **two-tailed test**. The choice depends on the hypothesis you specified, i.e., whether you specified a directional or a non-directional hypothesis. In the example above, we used a **non-directional hypothesis**. That is, we stated that the mean is different from the comparison value $\mu_0$, but we did not state the direction of the effect. A **directional hypothesis** states the direction of the effect. For example, we might test whether the population mean is smaller than a comparison value:
$$
H_0: \mu \ge \mu_0 \\
H_1: \mu < \mu_0
$$
Similarly, we could test whether the population mean is larger than a comparison value:
$$
H_0: \mu \le \mu_0 \\
H_1: \mu > \mu_0
$$
Connected to the decision of how to phrase the hypotheses (directional vs. non-directional) is the choice of a **one-tailed test** versus a **two-tailed test**. Let's first think about the meaning of a one-tailed test. Using a significance level of 0.05, a one-tailed test means that 5% of the total area under the probability distribution of our test statistic is located in one tail. Thus, under a one-tailed test, we test for the possibility of the relationship in one direction only, disregarding the possibility of a relationship in the other direction. In our example, a one-tailed test could test either whether the mean listening time is significantly larger or whether it is significantly smaller than the comparison value, but not both. Depending on the direction, the mean listening time is significantly larger (smaller) if the test statistic is located in the top (bottom) 5% of its probability distribution.
The following graph shows the critical values that our test statistic would need to surpass so that the difference between the population mean and the comparison value would be deemed statistically significant.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig2, fig.align="center", fig.height = 3, fig.width = 10}
library(cowplot)
library(gridExtra)
library(grid)
df <- n-1
p1 <- 0.025
p2 <- 0.975
min <- -5
max <- 5
t1 <- round(qt(p1, df = df), digits = 3)
t2 <- round(qt(p2, df = df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min, qt(p1, df = df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max, qt(p2, df = df)), geom = "area") +
scale_x_continuous(breaks = c(t1, 0, t2)) +
labs(title = paste0("Two-sided test"),
subtitle = "0.025 of total area on each side; df = 49",
x = "x", y = "Density") +
theme(legend.position = "none") +
theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.950
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot2 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (right)"),
subtitle = "0.05 of total area on the right; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
df <- n-1
p1 <- 0.000
p2 <- 0.050
min <- -5
max <- 5
t1 <- round(qt(p1,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot3 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dt, args = list(df = df)) +
stat_function(fun = dt, args = list(df = df), xlim = c(max,qt(p1,df=df)), geom = "area") +
stat_function(fun = dt, args = list(df = df), xlim = c(min,qt(p2,df=df)), geom = "area") +
scale_x_continuous(breaks = c(t1,0,t2)) +
labs(title = paste0("One-sided test (left)"),
subtitle = "0.05 of total area on the left; df = 49",
x = "x", y = "Density") +
theme(legend.position="none") + theme_bw()
p <- plot_grid(plot3,plot1, plot2, ncol = 3)
print(p)
```
It can be seen that under a one-sided test, the rejection region is located at one end of the distribution only, whereas in a two-sided test, the rejection region is split between the two tails. As a consequence, the critical value of the test statistic is smaller for a one-tailed test, meaning that it has more power to detect an effect. Having said that, in most applications we would like to be able to catch effects in both directions, simply because we can often not rule out that an effect might exist in the direction opposite to the one hypothesized. For example, if we conducted a one-tailed test for a mean larger than some specified value but the mean turned out to be substantially smaller, then testing the one-directional hypothesis ($H_0: \mu \le \mu_0$) would not allow us to conclude that there is a significant effect, because there is no rejection region at that end of the distribution.
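The difference in critical values can also be seen directly by comparing the relevant quantiles of the t-distribution for the one-sided and the two-sided case:
```{r}
qt(0.95, df = n - 1)  # one-sided critical value
qt(0.975, df = n - 1) # two-sided critical value
```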
### Summary
As we have seen, the process of hypothesis testing consists of various steps:
1. Formulate null and alternative hypotheses
2. Select an appropriate test
3. Choose the level of significance ($\alpha$)
4. Descriptive statistics and data visualization
5. Conduct significance test
6. Report results and draw a marketing conclusion
# Chi-square test
This chapter is primarily based on Field, A., Miles J., & Field, Z. (2012): Discovering Statistics Using R. Sage Publications, **chapter 18**.
In some instances, you will be confronted with differences between proportions, rather than differences between means. For example, you may conduct an A/B-Test and wish to compare the conversion rates between two advertising campaigns. In this case, your data is binary (0 = no conversion, 1 = conversion) and the sampling distribution for such data is binomial. While binomial probabilities are difficult to calculate, we can use a Normal approximation to the binomial when ```n``` is large (>100) and the true likelihood of a 1 is not too close to 0 or 1.
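As a brief illustration of this approximation (with made-up numbers), the exact binomial probability and its normal approximation are reasonably close for large ```n```:
```{r}
# P(X <= 110) for X ~ Binomial(n = 300, p = 1/3): exact vs. normal approximation
pbinom(110, size = 300, prob = 1/3)
pnorm(110, mean = 300*(1/3), sd = sqrt(300*(1/3)*(2/3)))
```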
Let's use an example: assume a call center where service agents call potential customers to sell a product. We consider two call center agents:
* Service agent 1 talks to 300 customers and gets 200 of them to buy (conversion rate=2/3)
* Service agent 2 talks to 300 customers and gets 100 of them to buy (conversion rate=1/3)
As always, we load the data first:
[You can download the corresponding R-Code here](./Code/10-categorical_data (1).R)
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
call_center <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/call_center.dat",
sep = "\t",
header = TRUE) #read in data
call_center$conversion <- factor(call_center$conversion , levels = c(0:1), labels = c("no", "yes")) #convert to factor
call_center$agent <- factor(call_center$agent , levels = c(0:1), labels = c("agent_1", "agent_2")) #convert to factor
```
Next, we create a table to check the relative frequencies:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
rel_freq_table <- as.data.frame(prop.table(table(call_center), 2)) #conditional relative frequencies
rel_freq_table
```
We could also plot the data to visualize the frequencies using ggplot:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "proportion of conversions per agent (stacked bar chart)"}
ggplot(rel_freq_table, aes(x = agent, y = Freq, fill = conversion)) + #plot data
geom_col(width = .7) + #position
geom_text(aes(label = paste0(round(Freq*100,0),"%")), position = position_stack(vjust = 0.5), size = 4) + #add percentages
ylab("Proportion of conversions") + xlab("Agent") + # specify axis labels
theme_bw()
```
... or using the ```mosaicplot()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE, fig.align="center", fig.cap = "proportion of conversions per agent (mosaic plot)"}
contigency_table <- table(call_center)
mosaicplot(contigency_table, main = "Proportion of conversions by agent")
```
## Confidence intervals for proportions
Recall that we can use confidence intervals to determine the range of values that the true population parameter will take with a certain level of confidence based on the sample. Similar to the confidence interval for means, we can compute a confidence interval for proportions. The (1-$\alpha$)% confidence interval for proportions is approximately
$$
CI = p\pm z_{1-\frac{\alpha}{2}}*\sqrt{\frac{p*(1-p)}{N}}
$$
where $\sqrt{p(1-p)}$ is the equivalent to the standard deviation in the formula for the confidence interval for means. Based on the equation, it is easy to compute the confidence intervals for the conversion rates of the call center agents:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
n1 <- nrow(subset(call_center,agent=="agent_1")) #number of observations for agent 1
n2 <- nrow(subset(call_center,agent=="agent_2")) #number of observations for agent 2
n1_conv <- nrow(subset(call_center,agent=="agent_1" & conversion=="yes")) #number of conversions for agent 1
n2_conv <- nrow(subset(call_center,agent=="agent_2" & conversion=="yes")) #number of conversions for agent 2
p1 <- n1_conv/n1 #proportion of conversions for agent 1
p2 <- n2_conv/n2 #proportion of conversions for agent 2
error1 <- qnorm(0.975)*sqrt((p1*(1-p1))/n1)
ci_lower1 <- p1 - error1
ci_upper1 <- p1 + error1
ci_lower1
ci_upper1
error2 <- qnorm(0.975)*sqrt((p2*(1-p2))/n2)
ci_lower2 <- p2 - error2
ci_upper2 <- p2 + error2
ci_lower2
ci_upper2
```
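Note that R also provides built-in functions that report a confidence interval for a single proportion. These use different methods (the Wilson score interval and the exact Clopper-Pearson interval, respectively), so the bounds will deviate slightly from the Wald intervals computed by hand above:
```{r}
prop.test(x = n1_conv, n = n1)$conf.int  # Wilson score interval (with continuity correction)
binom.test(x = n1_conv, n = n1)$conf.int # exact (Clopper-Pearson) interval
```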
Similar to testing for differences in means, we could also ask: is the conversion rate of agent 1 significantly different from that of agent 2? Or, to state it formally:
$$H_0: \pi_1=\pi_2 \\
H_1: \pi_1\ne \pi_2$$
where $\pi$ denotes the population parameter associated with the proportion in the respective population. One approach to test this is based on confidence intervals to estimate the difference between two populations. We can compute an approximate confidence interval for the difference between the proportion of successes in group 1 and group 2, as:
$$
CI = p_1-p_2\pm z_{1-\frac{\alpha}{2}}*\sqrt{\frac{p_1*(1-p_1)}{n_1}+\frac{p_2*(1-p_2)}{n_2}}
$$
If the confidence interval includes zero, then the data does not suggest a difference between the groups. Let's compute the confidence interval for differences in the proportions by hand first:
```{r}
ci_lower <- p1 - p2 - qnorm(0.975)*sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2) #95% CI lower bound
ci_upper <- p1 - p2 + qnorm(0.975)*sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2) #95% CI upper bound
ci_lower
ci_upper
```
Now we can see that the 95% confidence interval estimate of the difference between the proportion of conversions for agent 1 and the proportion of conversions for agent 2 is between `r round(ci_lower*100,0)`% and `r round(ci_upper*100,0)`%. This interval tells us the range of plausible values for the difference between the two population proportions. According to this interval, zero is not a plausible value for the difference (i.e., interval does not cross zero), so we reject the null hypothesis that the population proportions are the same.
Instead of computing the intervals by hand, we could also use the ```prop.test()``` function:
```{r}
prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)
```
Note that the ```prop.test()``` function uses a slightly different (more accurate) way to compute the confidence interval (Wilson's score method). It provides a better approximation, particularly for smaller sample sizes. That is why the confidence interval in the output deviates slightly from the manual computation above, which uses the Wald interval.
You can also see that the output from the ```prop.test()``` includes the results from a χ<sup>2</sup> test for the equality of proportions (which will be discussed below) and the associated p-value. Since the p-value is less than 0.05, we reject the null hypothesis of equal probability. Thus, the reporting would be:
The test showed that the conversion rate for agent 1 was higher by `r round(((prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$estimate[1])-(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$estimate[2]))*100,0)`%. This difference is significant, χ<sup>2</sup>(1) = 70, p < .05 (95% CI = [`r round(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$conf.int[1],2)`,`r round(prop.test(x = c(n1_conv, n2_conv), n = c(n1, n2), conf.level = 0.95)$conf.int[2],2)`]).
## Chi-square test
In the previous section, we saw how we can compute the confidence interval for the difference between proportions to decide on whether or not to reject the null hypothesis. Whenever you would like to investigate the relationship between two categorical variables, the $\chi^2$ test may be used to test whether the variables are independent of each other. It achieves this by comparing the expected number of observations in a group to the actual values. Let's continue with the example from the previous section. Under the null hypothesis, the two variables *agent* and *conversion* in our contingency table are independent (i.e., there is no relationship). This means that the frequency in each field will be roughly proportional to the probability of an observation being in that category, calculated under the assumption that they are independent. The difference between that expected quantity and the actual quantity can be used to construct the test statistic. The test statistic is computed as follows:
$$
\chi^2=\sum_{i=1}^{J}\frac{(f_o-f_e)^2}{f_e}
$$
where $J$ is the number of cells in the contingency table, $f_o$ are the observed cell frequencies and $f_e$ are the expected cell frequencies. The larger the differences, the larger the test statistic and the smaller the p-value.
The observed cell frequencies can easily be seen from the contingency table:
```{r message=FALSE, warning=FALSE}
obs_cell1 <- contigency_table[1,1]
obs_cell2 <- contigency_table[1,2]
obs_cell3 <- contigency_table[2,1]
obs_cell4 <- contigency_table[2,2]
```
The expected cell frequencies can be calculated as follows:
$$
f_e=\frac{(n_r*n_c)}{n}
$$
where $n_r$ are the total observed frequencies per row, $n_c$ are the total observed frequencies per column, and $n$ is the total number of observations. Thus, the expected cell frequencies under the assumption of independence can be calculated as:
```{r message=FALSE, warning=FALSE}
n <- nrow(call_center)
exp_cell1 <- (nrow(call_center[call_center$agent=="agent_1",])*nrow(call_center[call_center$conversion=="no",]))/n
exp_cell2 <- (nrow(call_center[call_center$agent=="agent_1",])*nrow(call_center[call_center$conversion=="yes",]))/n
exp_cell3 <- (nrow(call_center[call_center$agent=="agent_2",])*nrow(call_center[call_center$conversion=="no",]))/n
exp_cell4 <- (nrow(call_center[call_center$agent=="agent_2",])*nrow(call_center[call_center$conversion=="yes",]))/n
```
To sum up, these are the expected cell frequencies
```{r message=FALSE, warning=FALSE, paged.print = FALSE}
data.frame(conversion_no = rbind(exp_cell1,exp_cell3),conversion_yes = rbind(exp_cell2,exp_cell4), row.names = c("agent_1","agent_2"))
```
... and these are the observed cell frequencies
```{r message=FALSE, warning=FALSE, paged.print = FALSE}
data.frame(conversion_no = rbind(obs_cell1,obs_cell2),conversion_yes = rbind(obs_cell3,obs_cell4), row.names = c("agent_1","agent_2"))
```
To obtain the test statistic, we simply plug the values into the formula:
```{r message=FALSE, warning=FALSE}
chisq_cal <- sum(((obs_cell1 - exp_cell1)^2/exp_cell1),
((obs_cell2 - exp_cell2)^2/exp_cell2),
((obs_cell3 - exp_cell3)^2/exp_cell3),
((obs_cell4 - exp_cell4)^2/exp_cell4))
chisq_cal
```
The test statistic is $\chi^2$ distributed. The chi-square distribution is a non-symmetric distribution. Actually, there are many different chi-square distributions, one for each degree of freedom, as shown in the following figure.
```{r echo = F, message=FALSE, warning=FALSE, eval=T, fig.align="center", fig.cap = "The chi-square distribution"}
library(ggplot2)
a <- seq(2,10, 2)
ggplot(data.frame(x=c(0,20)), aes(x))+
stat_function(fun = dchisq, args = list(8), aes(colour = '8'))+
stat_function(fun = dchisq, args = list(1), aes(colour = '1'))+
stat_function(fun = dchisq, args = list(2), aes(colour = '2'))+
stat_function(fun = dchisq, args = list(4), aes(colour = '4'))+
stat_function(fun = dchisq, args = list(6), aes(colour = '6'))+
stat_function(fun = dchisq, args = list(15), aes(colour = '15'))+
ylim(min=0, max=0.6) +
labs(colour = 'Degrees of Freedom', x = 'Value', y = 'Density') + theme_bw()
```
You can see that as the degrees of freedom increase, the chi-square curve approaches a normal distribution. To find the critical value, we need to specify the corresponding degrees of freedom, given by:
$$
df=(r-1)*(c-1)
$$
where $r$ is the number of rows and $c$ is the number of columns in the contingency table. Recall that the degrees of freedom are generally the number of values that can vary freely when calculating a statistic. In a 2 x 2 table, as in our case, we have two variables with two levels each, and for each variable only one value can vary freely once the totals are fixed. Hence, in our example the degrees of freedom can be calculated as:
```{r message=FALSE, warning=FALSE}
df <- (nrow(contigency_table) - 1) * (ncol(contigency_table) -1)
df
```
Now, we can derive the critical value given the degrees of freedom and the level of confidence using the ```qchisq()``` function and test if the calculated test statistic is larger than the critical value:
```{r message=FALSE, warning=FALSE}
chisq_crit <- qchisq(0.95, df)
chisq_crit
chisq_cal > chisq_crit
```
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Visual depiction of the test result"}
df <- 1
p <- 0.95
min <- 0
max <- 15
chsq1 <- round(qchisq(p,df=df), digits = 3)
t2 <- round(qt(p2,df=df), digits = 3)
plot1 <- ggplot(data.frame(x = c(min, max)), aes(x = x)) +
stat_function(fun = dchisq, args = list(df))+
stat_function(fun = dchisq, args = list(df), xlim = c(qchisq(p,df=df),max), geom = "area") +
scale_x_continuous(breaks = c(0, chsq1, chisq_cal)) +
geom_vline(xintercept = chisq_cal, color = "red") +
labs(title = paste0("Result of chi-square test: reject H0"),
subtitle = paste0("Red line: Calculated test statistic;"," Black area: Rejection region"),
x = "x", y = "Density") +
theme(legend.position="none") +
theme_bw()
plot1
```
We could also compute the p-value using the ```pchisq()``` function, which tells us the probability of observing a test statistic as large as (or larger than) the calculated one if the null hypothesis was true (i.e., if there was no association):
```{r message=FALSE, warning=FALSE}
p_val <- 1-pchisq(chisq_cal,df)
p_val
```
The test statistic can also be calculated in R directly on the contingency table with the function ```chisq.test()```.
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table, correct = FALSE)
```
Since the p-value is smaller than 0.05 (i.e., the calculated test statistic is larger than the critical value), we reject H<sub>0</sub> that the two variables are independent.
Note that the test statistic is sensitive to the sample size. To see this, let's assume that we have a sample of 100 observations instead of 1000 observations:
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table/10, correct = FALSE)
```
You can see that even though the proportions haven't changed, the test is no longer significant. The following equation lets you compute a measure of the effect size, which is insensitive to sample size:
$$
\phi=\sqrt{\frac{\chi^2}{n}}
$$
The following guidelines are used to determine the magnitude of the effect size (Cohen, 1988):
* 0.1 (small effect)
* 0.3 (medium effect)
* 0.5 (large effect)
In our example, we can compute the effect sizes for the large and small samples as follows:
```{r message=FALSE, warning=FALSE}
test_stat <- chisq.test(contigency_table, correct = FALSE)$statistic
phi1 <- sqrt(test_stat/n)
test_stat <- chisq.test(contigency_table/10, correct = FALSE)$statistic
phi2 <- sqrt(test_stat/(n/10))
phi1
phi2
```
You can see that the effect size measure is insensitive to the sample size.
Note that the Φ coefficient is appropriate for two dichotomous variables (resulting from a 2 x 2 table as above). If any of your nominal variables has more than two categories, Cramér's V should be used instead:
$$
V=\sqrt{\frac{\chi^2}{n*df_{min}}}
$$
where $df_{min}$ refers to the degrees of freedom associated with the variable that has fewer categories (e.g., if we have two nominal variables with 3 and 4 categories, $df_{min}$ would be 3 - 1 = 2). The degrees of freedom need to be taken into account when judging the magnitude of the effect sizes (see e.g., <a href="http://www.real-statistics.com/chi-square-and-f-distributions/effect-size-chi-square/" target="_blank">here</a>).
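Although not strictly needed for our 2 x 2 table (where Cramér's V reduces to the Φ coefficient, since $df_{min} = 1$), a minimal sketch of the computation based on the formula above could look as follows:
```{r}
chisq_stat <- chisq.test(contigency_table, correct = FALSE)$statistic
df_min <- min(nrow(contigency_table), ncol(contigency_table)) - 1 # df of the variable with fewer categories
sqrt(chisq_stat/(nrow(call_center)*df_min)) # Cramer's V
```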
Note that the ```correct = FALSE``` argument above ensures that the test statistic is computed in the same way as we have done by hand above. By default, ```chisq.test()``` applies a correction to prevent overestimation of statistical significance in small samples (Yates' correction). The correction is implemented by subtracting the value 0.5 from the computed difference between the observed and expected cell counts in the numerator of the test statistic. This means that the calculated test statistic will be smaller (i.e., more conservative). Although the adjustment may go too far in some instances, you should generally rely on the adjusted results, which can be computed as follows:
```{r message=FALSE, warning=FALSE}
chisq.test(contigency_table)
```
As you can see, the results don't change much in our example, since the differences between the observed and expected cell frequencies are fairly large relative to the correction.
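To see what the correction does, a minimal sketch of the corrected statistic computed by hand (assuming, as is the case here, that each observed-expected difference exceeds 0.5) is shown below; it should match the statistic reported by ```chisq.test()``` with the default correction:
```{r}
chisq_yates <- sum(((abs(obs_cell1 - exp_cell1) - 0.5)^2/exp_cell1),
                   ((abs(obs_cell2 - exp_cell2) - 0.5)^2/exp_cell2),
                   ((abs(obs_cell3 - exp_cell3) - 0.5)^2/exp_cell3),
                   ((abs(obs_cell4 - exp_cell4) - 0.5)^2/exp_cell4))
chisq_yates
```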
Caution is warranted when the cell counts in the contingency table are small. The usual rule of thumb is that all expected cell counts should be at least 5 (this may be a little too stringent though). When some cell counts are too small, you can use Fisher's exact test via the ```fisher.test()``` function.
```{r message=FALSE, warning=FALSE}
fisher.test(contigency_table)
```
The Fisher test, while more conservative, also shows a significant difference between the proportions (p < 0.05). This is not surprising since the cell counts in our example are fairly large.
## Sample size
To **calculate the required sample size** when comparing proportions, the ```power.prop.test()``` function can be used. For example, we could ask how large our sample needs to be if we would like to compare two groups with conversion rates of 2% and 2.5%, respectively using the conventional settings for $\alpha$ and $\beta$:
```{r}
power.prop.test(p1=0.02,p2=0.025,sig.level=0.05,power=0.8)
```
The output tells us that we need `r round(power.prop.test(p1=0.02,p2=0.025,sig.level=0.05,power=0.8)$n,0)` observations per group to detect a difference of the desired size.
## Summary
* Confidence interval for proportions: the interval that shows the range of plausible values for the difference between the two population proportions. If the interval does not cross zero, then zero is not a plausible value for the difference between two proportions, and we reject the null hypothesis that the population proportions are the same.
* Application of the Chi-square test: when there are 2 categorical variables (ordinal or nominal). In our example: conversion (yes or no) and agent (1 or 2).
* Hypothesis: Is there an association between categorical variable X (e.g. conversion) and categorical variable Y (e.g. agent)?
H0: There is no association between the two variables.
H1: There is an association between the two variables.
* Interpretation: If we find an association between the two variables (p < 0.05), we can conclude that the proportion of conversions differs significantly between agent 1 and agent 2.
* Example of application in marketing: The marketing department of a firm producing toilet paper is interested in studying consumer behavior in the context of purchase decisions. The company is planning to launch a high-quality, super-soft toilet paper, but beforehand decides to conduct research. It wants to know whether the income level of its consumers influences their choice of toilet paper quality. Currently, the company's consumers are assigned to 4 income categories, and there are 4 different levels of toilet paper quality (a 4 x 4 table).
# Correlation
Before we start with regression analysis, we will review the basic concept of correlation first. Correlation helps us to determine the degree to which the variation in one variable, X, is related to the variation in another variable, Y.
## Correlation coefficient
The correlation coefficient summarizes the strength of the linear relationship between two metric (interval or ratio scaled) variables. Let's consider a simple example. Say you conduct a survey to investigate the relationship between the attitude towards a city and the duration of residency. The "Attitude" variable can take values between 1 (very unfavorable) and 12 (very favorable), and the "duration of residency" is measured in years. Let's further assume for this example that the attitude measurement represents an interval scale (although it is usually not realistic to assume that the scale points on an itemized rating scale have the same distance). To keep it simple, let's further assume that you only asked 12 people. We can create a short data set like this:
```{r, include=FALSE}
library(openssl)
passphrase <- charToRaw("MRDAnils")
key <- sha256(passphrase)
url <- "https://raw.githubusercontent.com/IMSMWU/mrda_data_pub/master/secret-music_data.rds"
download.file(url, "./data/secret_music_data.rds", method = "auto", quiet=FALSE)
encrypted_music_data <- readRDS("./data/secret_music_data.rds")
music_data <- unserialize(aes_cbc_decrypt(encrypted_music_data, key = key))
```
```{r, eval=FALSE}
readRDS("music_data.rds")
```
```{r message=FALSE, warning=FALSE}
s.genre <- c("pop","hip hop","rock","rap","indie")
music_data <- subset(music_data, top.genre %in% s.genre)
music_data$genre_cat <- as.factor(music_data$top.genre)
music_data$explicit_cat <- factor(music_data$explicit, levels = c(0:1),
labels = c("not explicit", "explicit"))
head(music_data)
```
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=FALSE,paged.print = FALSE}
#library(psych)
#attitude <- c(6,9,8,3,10,4,5,2,11,9,10,2)
#duration <- c(10,12,12,4,12,6,8,2,18,9,17,2)
#att_data <- data.frame(attitude, duration)
#att_data <- att_data[order(-attitude), ]
#att_data$respodentID <- c(1:12)
#str(att_data)
#psych::describe(att_data[, c("attitude","duration")])
#att_data
```
Let's look at the data. The following graph shows the individual data points for the "duration of residency" variable, where the blue horizontal line represents the mean of the variable (`r round(mean(music_data$energy, na.rm=T),2)`) and the vertical lines show the distance of the individual data points from the mean.
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Distribution of the energy variable"}
library(ggplot2)
ggplot(music_data) +
  geom_histogram(aes(energy), bins = 50, fill = 'white', color = 'deepskyblue4') +
  geom_vline(aes(xintercept = mean(energy, na.rm=T)), color='black', size=1) +
  labs(title = "Histogram of energy",
       y = 'Number of observations',
       x = 'Energy') +
  theme_bw()
```
You can see that some songs have an above average energy score and some songs have a below average energy score. Let's do the same for the second variable ("loudness"):
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Distribution of the loudness variable"}
ggplot(music_data) +
  geom_histogram(aes(loudness), bins = 50, fill = 'white', color = '#f9756d') +
  geom_vline(aes(xintercept = mean(loudness, na.rm=T)), color='black', size=1) +
  labs(title = "Histogram of loudness",
       y = 'Number of observations',
       x = 'Loudness') +
  theme_bw()
```
Again, we can see that some songs are above the average loudness and some are below. Let's now plot the two variables against each other to see whether there is some co-movement:
```{r message=FALSE, warning=FALSE, echo=FALSE, eval=TRUE, fig.align="center", fig.cap = "Scatterplot of energy and loudness"}
ggplot(music_data, aes(x = energy, y = loudness)) +
  geom_point() +
  labs(x = "Energy", y = "Loudness") +
  theme_bw()
```
We can see that there is indeed some co-movement here. The variables <b>covary</b> because songs with an above (below) average energy score also tend to have an above (below) average loudness and vice versa. Correlation helps us to quantify this relationship. Before you compute the correlation coefficient, you should always inspect the data first; a scatterplot like the one above is the standard way of visualizing the relationship between two metric variables.
How can we compute the correlation coefficient? Remember that the variance measures the average squared deviation of a variable from its mean:
\begin{equation}
\begin{split}
s_x^2&=\frac{\sum_{i=1}^{N} (X_i-\overline{X})^2}{N-1} \\
&= \frac{\sum_{i=1}^{N} (X_i-\overline{X})*(X_i-\overline{X})}{N-1}
\end{split}
(\#eq:variance)
\end{equation}
When we consider two variables, we multiply the deviation for one variable by the respective deviation for the second variable:
<p style="text-align:center;">
$(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
This is called the cross-product deviation. Then we sum the cross-product deviations:
<p style="text-align:center;">
$\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})$
</p>
... and compute the average of the sum of all cross-product deviations to get the <b>covariance</b>:
\begin{equation}
Cov(x, y) =\frac{\sum_{i=1}^{N}(X_i-\overline{X})*(Y_i-\overline{Y})}{N-1}
(\#eq:covariance)
\end{equation}
You can easily compute the covariance manually as follows:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
# keep only observations with non-missing values on both variables
obs <- music_data[complete.cases(music_data[, c("energy", "loudness")]), ]
x <- obs$energy
x_bar <- mean(x)
y <- obs$loudness
y_bar <- mean(y)
N <- nrow(obs) # number of complete observations
cov <- sum((x - x_bar) * (y - y_bar)) / (N - 1)
cov
```
Or you simply use the built-in ```cov()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cov(x, y) # apply the cov function
```
A positive covariance indicates that as one variable deviates from the mean, the other variable deviates in the same direction. A negative covariance indicates that as one variable deviates from the mean (e.g., increases), the other variable deviates in the opposite direction (e.g., decreases).
However, the size of the covariance depends on the scale of measurement: larger scale units lead to a larger covariance, which makes covariances hard to compare across variable pairs. To remove this dependence on the measurement scale, we standardize the covariance (similar to how we compute z-scores) by dividing it by the product of the standard deviations of the two variables. The resulting standardized covariance is known as the correlation coefficient r:
\begin{equation}
r=\frac{Cov_{xy}}{s_x*s_y}
(\#eq:corcoeff)
\end{equation}
This is known as the product moment correlation (r) and it is straightforward to compute:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
x_sd <- sd(music_data$energy, na.rm = T)
y_sd <- sd(music_data$loudness, na.rm = T)
r <- cov/(x_sd*y_sd)
r
```
Or you could just use the ```cor()``` function:
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
cor(music_data[, c("energy", "loudness")], method = "pearson", use = "complete")
```
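As noted above, the covariance depends on the units in which the variables are measured, while the correlation coefficient does not. A quick illustration using the `x` and `y` vectors defined earlier (multiplying energy by 100 is just an arbitrary rescaling for this sketch):
```{r message=FALSE, warning=FALSE}
cov(x * 100, y) # rescaling x inflates the covariance by the same factor
cor(x * 100, y) # ... but leaves the correlation coefficient unchanged
cor(x, y)
```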
Properties of r (illustrated with a short simulation below):

* ranges from -1 to +1
* +1 indicates a perfect positive linear relationship
* -1 indicates a perfect negative linear relationship
* 0 indicates no linear relationship
* ± .1 represents a small effect
* ± .3 represents a medium effect
* ± .5 represents a large effect
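To build some intuition for these properties, here is a short simulated example (the names `sim_x` and `noise` are just illustrative): a variable is perfectly positively correlated with itself, perfectly negatively correlated with its negative, and, apart from sampling noise, uncorrelated with an independent random variable.
```{r message=FALSE, warning=FALSE}
set.seed(123)       # for reproducibility
sim_x <- rnorm(100) # simulated metric variable
noise <- rnorm(100) # independent random variable
cor(sim_x, sim_x)   # +1: perfect positive linear relationship
cor(sim_x, -sim_x)  # -1: perfect negative linear relationship
cor(sim_x, noise)   # close to 0: no linear relationship
```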
## Significance testing
How can we determine whether our two variables are significantly related? To test this, we denote the population product moment correlation by ρ (rho) and test the null hypothesis of no linear relationship between the variables against the alternative that such a relationship exists:
$$H_0:\rho=0$$
$$H_1:\rho\ne0$$
The test statistic is:
\begin{equation}
t=\frac{r*\sqrt{N-2}}{\sqrt{1-r^2}}
(\#eq:cortest)
\end{equation}
The test statistic follows a t distribution with N - 2 degrees of freedom. We then follow the usual procedure: calculate the test statistic and compare it to the critical value of the underlying probability distribution. If the calculated test statistic is larger than the critical value, the null hypothesis of no relationship between X and Y is rejected.
```{r message=FALSE, warning=FALSE, echo=TRUE, eval=TRUE}
t_calc <- r*sqrt(N - 2)/sqrt(1 - r^2) # calculated test statistic
t_calc
df <- (N - 2) # degrees of freedom
t_crit <- qt(0.975, df) # critical value for alpha = 0.05 (two-sided)
t_crit
pt(q = t_calc, df = df, lower.tail = F) * 2 # two-sided p-value