foundations-errors.qmd

# Decision Errors {#sec-foundations-decision-errors}

```{r}
#| include: false
source("_common.R")
```

::: {.chapterintro data-latex=""}
Using data to make inferential decisions about larger populations is not a perfect process.
As seen in [Chapter -@sec-foundations-randomization], a small p-value typically leads the researcher to a decision to reject the null claim or hypothesis.
Sometimes, however, data can produce a small p-value when the null hypothesis is actually true and the data are just inherently variable.
Here we describe the errors which can arise in hypothesis testing, how to define and quantify the different errors, and suggestions for mitigating errors if possible.
:::

\index{decision errors}

Hypothesis tests are not flawless.
Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free.
Similarly, data can point to the wrong conclusion.
However, what distinguishes statistical hypothesis tests from a court system is that our framework allows us to quantify and control how often the data lead us to the incorrect conclusion.

In a hypothesis test, there are two competing hypotheses: the null and the alternative.
We make a statement about which one might be true, but we might choose incorrectly.
There are four possible scenarios in a hypothesis test, which are summarized in @tbl-fourHTScenarios.

```{r}
#| label: tbl-fourHTScenarios
#| tbl-cap: Four different scenarios for hypothesis tests.
#| tbl-pos: H
ht_scenarios <- tribble(
  ~Truth, ~`Reject null hypothesis`, ~`Fail to reject null hypothesis`,
  "Null hypothesis is true", "Type I error", "Good decision",
  "Alternative hypothesis is true", "Good decision", "Type II error"
)

ht_scenarios |>
  kbl(linesep = "", booktabs = TRUE) |>
  add_header_above(c("", "Test conclusion" = 2)) |>
  kable_styling(bootstrap_options = c("striped", "condensed"), latex_options = c("striped"), full_width = FALSE) |>
  column_spec(1, width = "14em") |>
  column_spec(2:3, width = "12em")
```

A **Type I error**\index{Type I error} is rejecting the null hypothesis when $H_0$ is actually true.
Since we rejected the null hypothesis in the sex discrimination and opportunity cost studies, it is possible that we made a Type I error in one or both of those studies.
A **Type II error**\index{Type II error} is failing to reject the null hypothesis when the alternative is actually true.

```{r}
#| include: false
terms_chp_14 <- c("Type I error", "Type II error")
```

::: {.workedexample data-latex=""}
In a US court, the defendant is either innocent $(H_0)$ or guilty $(H_A).$ What does a Type I error represent in this context?
What does a Type II error represent?
@tbl-fourHTScenarios may be useful.

------------------------------------------------------------------------

If the court makes a Type I error, this means the defendant is innocent $(H_0$ true) but wrongly convicted.
A Type II error means the court failed to reject $H_0$ (i.e., failed to convict the person) when they were in fact guilty $(H_A$ true).
:::

::: {.guidedpractice data-latex=""}
Consider the opportunity cost study where we concluded students were less likely to make a DVD purchase if they were reminded that money not spent now could be spent later.
What would a Type I error represent in this context?[^14-foundations-errors-1]
:::

[^14-foundations-errors-1]: Making a Type I error in this context would mean that reminding students that money not spent now can be spent later does not affect their buying habits, despite the strong evidence (the data suggesting otherwise) found in the experiment.
    Notice that this does *not* necessarily mean something was wrong with the data or that we made a computational mistake.
    Sometimes data simply point us to the wrong conclusion, which is why scientific studies are often repeated to check initial findings.

::: {.workedexample data-latex=""}
How could we reduce the Type I error rate in US courts?
What influence would this have on the Type II error rate?

------------------------------------------------------------------------

To lower the Type I error rate, we might raise our standard for conviction from "beyond a reasonable doubt" to "beyond a conceivable doubt" so fewer people would be wrongly convicted.
However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type II errors.
:::

::: {.guidedpractice data-latex=""}
How could we reduce the Type II error rate in US courts?
What influence would this have on the Type I error rate?[^14-foundations-errors-2]
:::

[^14-foundations-errors-2]: To lower the Type II error rate, we want to convict more guilty people.
    We could lower the standards for conviction from "beyond a reasonable doubt" to "beyond a little doubt".
    Lowering the bar for guilt will also result in more wrongful convictions, raising the Type I error rate.

\index{decision errors}

The example and guided practice above provide an important lesson: if we reduce how often we make one type of error, we generally make more of the other type.

## Discernibility level

The **discernibility level**\index{discernibility level}\index{significance level} provides the cutoff for the p-value which will lead to a decision of "reject the null hypothesis." Choosing a discernibility level for a test is important in many contexts, and the traditional level is 0.05.
However, it is sometimes helpful to adjust the discernibility level based on the application.
We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.

If making a Type I error is dangerous or especially costly, we should choose a small discernibility level (e.g., 0.01 or 0.001).
If we want to be very cautious about rejecting the null hypothesis, we demand very strong evidence favoring the alternative $H_A$ before we would reject $H_0.$

If a Type II error is relatively more dangerous or much more costly than a Type I error, then we should choose a higher discernibility level (e.g., 0.10).
Here we want to be cautious about failing to reject $H_0$ when the null is actually false.

```{r}
#| include: false
terms_chp_14 <- c(terms_chp_14, "discernibility level", "significance level")
```

::: {.tip data-latex=""}
**Discernibility levels should reflect consequences of errors.**

The discernibility level selected for a test should reflect the real-world consequences associated with making a Type I or Type II error.
:::

## Two-sided hypotheses {#sec-two-sided-hypotheses}

\index{hypothesis testing}

In [Chapter -@sec-foundations-randomization] we explored whether women were discriminated against and whether a simple trick could make students a little thriftier.
In these two case studies, we have actually ignored some possibilities:

-   What if *men* are actually discriminated against?
-   What if the money trick actually makes students *spend more*?

These possibilities weren't considered in our original hypotheses or analyses.
The disregard of the extra alternatives may have seemed natural since the data pointed in the directions in which we framed the problems.
However, there are two dangers if we ignore possibilities that disagree with our data or that conflict with our world view:

1.  Framing an alternative hypothesis simply to match the direction that the data point will generally inflate the Type I error rate.
    After all the work we have done (and will continue to do) to rigorously control the error rates in hypothesis tests, careless construction of the alternative hypotheses can disrupt that hard work.

2.  If we only use alternative hypotheses that agree with our worldview, then we are going to be subjecting ourselves to **confirmation bias**\index{confirmation bias}, which means we are looking for data that supports our ideas.
    That's not very scientific, and we can do better!

The original hypotheses we have seen are called **one-sided hypothesis tests**\index{one-sided hypothesis test} because they only explored one direction of possibilities.
Such hypotheses are appropriate when we are exclusively interested in the single direction, but usually we want to consider all possibilities.
To do so, let's learn about **two-sided hypothesis tests**\index{two-sided hypothesis test} in the context of a new study that examines the impact of using blood thinners on patients who have undergone CPR.

```{r}
#| include: false
terms_chp_14 <- c(terms_chp_14, "confirmation bias", "one-sided hypothesis test", "two-sided hypothesis test")
```

Cardiopulmonary resuscitation (CPR) is a procedure used on individuals suffering a heart attack when other emergency resources are unavailable.
This procedure is helpful in providing some blood circulation to keep a person alive, but CPR chest compression can also cause internal injuries.
Internal bleeding and other injuries that can result from CPR complicate additional treatment efforts.
For instance, blood thinners may be used to help release a clot that is causing the heart attack once a patient arrives in the hospital.
However, blood thinners negatively affect internal injuries.

Here we consider an experiment with patients who underwent CPR for a heart attack and were subsequently admitted to a hospital.
Each patient was randomly assigned to either receive a blood thinner (treatment group) or not receive a blood thinner (control group).
The outcome variable of interest was whether the patient survived for at least 24 hours.
[@Bottiger:2001]

::: {.data data-latex=""}
The [`cpr`](http://openintrostat.github.io/openintro/reference/cpr.html) data can be found in the [**openintro**](http://openintrostat.github.io/openintro) R package.
:::

::: {.workedexample data-latex=""}
Form hypotheses for this study in plain and statistical language.
Let $p_C$ represent the true survival rate of people who do not receive a blood thinner (corresponding to the control group) and $p_T$ represent the survival rate for people receiving a blood thinner (corresponding to the treatment group).

------------------------------------------------------------------------

We want to understand whether blood thinners are helpful or harmful.
We'll consider both of these possibilities using a two-sided hypothesis test.

-   $H_0:$ Blood thinners do not have an overall survival effect, i.e., the survival proportions are the same in each group.
    $p_T - p_C = 0.$

-   $H_A:$ Blood thinners have an impact on survival, either positive or negative, but not zero.
    $p_T - p_C \neq 0.$

Note that if we had done a one-sided hypothesis test, the resulting hypotheses would have been:

-   $H_0:$ Blood thinners do not have a positive overall survival effect, i.e., the survival proportions for the blood thinner group is the same or lower than the control group.
    $p_T - p_C \leq 0.$

-   $H_A:$ Blood thinners have a positive impact on survival.
    $p_T - p_C > 0.$
:::

There were 50 patients in the experiment who did not receive a blood thinner and 40 patients who did.
The study results are shown in @tbl-cpr-summary.

```{r}
#| label: tbl-cpr-summary
#| tbl-cap: |
#|   Results for the CPR study. Patients in the treatment group were given a
#|   blood thinner, and patients in the control group were not.
#| tbl-pos: H
cpr |>
  mutate(
    outcome = str_to_title(outcome),
    group   = str_to_title(group)
  ) |>
  count(group, outcome) |>
  pivot_wider(names_from = outcome, values_from = n) |>
  janitor::adorn_totals(where = c("row", "col")) |>
  kbl(
    linesep = "", booktabs = TRUE,
    col.names = c("Group", "Died", "Survived", "Total"),
    align = "lccc"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    latex_options = c("striped"),
    full_width = FALSE
  ) |>
  column_spec(1:4, width = "7em")
```

::: {.guidedpractice data-latex=""}
What is the observed survival rate in the control group?
And in the treatment group?
Also, provide a point estimate $(\hat{p}_T - \hat{p}_C)$ for the true difference in population survival proportions across the two groups: $p_T - p_C.$[^14-foundations-errors-3]
:::

[^14-foundations-errors-3]: Observed control survival rate: $\hat{p}_C = \frac{11}{50} = 0.22.$ Treatment survival rate: $\hat{p}_T = \frac{14}{40} = 0.35.$ Observed difference: $\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13.$

According to the point estimate, for patients who have undergone CPR outside of the hospital, an additional 13% of these patients survive when they are treated with blood thinners.
However, we wonder if this difference could be easily explainable by chance, if the treatment has no effect on survival.

As we did in past studies, we will simulate what type of differences we might see from chance alone under the null hypothesis.
By randomly assigning each of the patient's files to a "simulated treatment" or "simulated control" allocation, we get a new grouping.
If we repeat this simulation 1,000 times, we can build a **null distribution**\index{null distribution} of the differences shown in @fig-CPR-study-right-tail.

```{r}
#| include: false
terms_chp_14 <- c(terms_chp_14, "null distribution")
```

```{r}
#| label: fig-CPR-study-right-tail
#| fig-cap: |
#|   Null distribution of the point estimate for the difference in proportions,
#|   $\hat{p}_T - \hat{p}_C.$ The shaded right tail shows observations that are
#|   at least as large as the observed difference, 0.13.
#| fig-alt: |
#|   Histogram of the null distribution of the point estimate for the difference
#|   in proportions, $\hat{p}_T - \hat{p}_C.$ The shaded right tail shows
#|   observations that are at least as large as the observed difference, 0.13.
#| fig-asp: 0.5
set.seed(47)
cpr_rand_dist <- cpr |>
  specify(response = outcome, explanatory = group, success = "survived") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("treatment", "control")) |>
  # simplify by rounding
  mutate(stat = round(stat, 3))

cpr_rand_dist |>
  ggplot(aes(x = stat)) +
  geom_histogram(binwidth = 0.05) +
  labs(
    y = NULL,
    x = "Difference in randomized survival rates\n(treatment - control)",
    title = "1,000 randomized differences"
  ) +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  gghighlight(stat >= 0.13)

cpr_tail <- cpr_rand_dist |>
  filter(stat >= 0.13) |>
  summarise(n() / 1000) |>
  pull()
```

The right tail area is `r cpr_tail`.
(Note: it is only a coincidence that we also have $\hat{p}_T - \hat{p}_C=0.13.)$ However, contrary to how we calculated the p-value in previous studies, the p-value of this test is not actually the tail area we calculated, i.e., it's not `r cpr_tail`!

The p-value is defined as the probability we observe a result at least as favorable to the alternative hypothesis as the observed difference.
In this case, any differences less than or equal to -0.13 would also provide equally strong evidence favoring the alternative hypothesis as a difference of +0.13 did.
A difference of -0.13 would correspond to 13% higher survival rate in the control group than the treatment group.
In @fig-CPR-study-p-value we have also shaded these differences in the left tail of the distribution.
These two shaded tails provide a visual representation of the p-value for a two-sided test.

```{r}
#| label: fig-CPR-study-p-value
#| fig-cap: |
#|   Null distribution of the point estimate for the difference in proportions,
#|   $\hat{p}_T - \hat{p}_C.$ All values that are at least as extreme as +0.13
#|   but in either direction away from 0 are shaded.
#| fig-alt: |
#|   Histogram of the null distribution of the point estimate for the difference
#|   in proportions, $\hat{p}_T - \hat{p}_C.$ All values that are at least as
#|   extreme as +0.13 but in either direction away from 0 are shaded.
#| fig-asp: 0.5
cpr_rand_dist |>
  ggplot(aes(x = stat)) +
  geom_histogram(binwidth = 0.05) +
  labs(
    y = NULL,
    x = "Difference in randomized survival rates\n(treatment - control)",
    title = "1,000 randomized differences"
  ) +
  theme(
    axis.title.y = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  gghighlight(stat >= 0.13 | stat <= -0.13)
```

For a two-sided test, take the single tail (in this case, 0.131) and double it to get the p-value: 0.262.
Since this p-value is larger than 0.05, we do not reject the null hypothesis.
That is, we do not find convincing evidence that the blood thinner has any influence on survival of patients who undergo CPR prior to arriving at the hospital.

::: {.important data-latex=""}
**Default to a two-sided test.**

We want to be rigorous and keep an open mind when we analyze data and evidence.
Use a one-sided hypothesis test only if you truly have interest in only one direction.
:::

::: {.important data-latex=""}
**Computing a p-value for a two-sided test.**

First compute the p-value for one tail of the distribution, then double that value to get the two-sided p-value.
That's it!
:::

\vspace{-3mm}

::: {.workedexample data-latex=""}
Consider the situation of the medical consultant.
Now that you know about one-sided and two-sided tests, which type of test do you think is more appropriate?

------------------------------------------------------------------------

The setting has been framed in the context of the consultant being helpful (which is what led us to a one-sided test originally), but what if the consultant actually performed *worse* than the average?
Would we care?
More than ever!
Since it turns out that we care about a finding in either direction, we should run a two-sided test.
The p-value for the two-sided test is double that of the one-sided test, here the simulated p-value would be 0.2444.
:::

\vspace{-5mm}

Generally, to find a two-sided p-value we double the single tail area, which remains a reasonable approach even when the distribution is asymmetric.
However, the approach can result in p-values larger than 1 when the point estimate is very near the mean in the null distribution; in such cases, we write that the p-value is 1.
Also, very large p-values computed in this way (e.g., 0.85), may also be slightly inflated.
Typically, we do not worry too much about the precision of very large p-values because they lead to the same analysis conclusion, even if the value is slightly off.

## Controlling the Type I error rate

Now that we understand the difference between one-sided and two-sided tests, we must recognize when to use each type of test.
Because of the result of increased error rates, it is never okay to change two-sided tests to one-sided tests after observing the data.
We explore the consequences of ignoring this advice in the next example.

::: {.workedexample data-latex=""}
Using $\alpha=0.05,$ we show that freely switching from two-sided tests to one-sided tests will lead us to make twice as many Type I errors as intended.

------------------------------------------------------------------------

Suppose we are interested in finding any difference from 0.
We've created a smooth-looking **null distribution** representing differences due to chance below.

```{r}
#| label: type1ErrorDoublingExampleFigure
#| fig-alt: |
#|   Density curve of a normal distribution with mean 0 and standard deviation 1.
#|   The shaded regions represent areas where we would reject $H_0$ under the
#|   bad practices considered in when $\alpha = 0.05.$
#| fig-asp: 0.4
#| out-width: 60%
#| fig-align: center
par(mar = c(2, 0, 0, 0))
normTail(U = 1.65, L = -1.65, col = IMSCOL["blue", "full"])
arrows(-2, 0.17, -1.9, 0.08, length = 0.07, lwd = 1.5)
text(-1.92, 0.21, "5%", cex = 1.2)
arrows(2, 0.17, 1.9, 0.08, length = 0.07, lwd = 1.5)
text(2.08, 0.21, "5%", cex = 1.2)
```

First, suppose the sample difference was larger than 0.
In a one-sided test, we would set $H_A:$ difference $> 0.$ If the observed difference falls in the upper 5% of the distribution, we would reject $H_0$ since the p-value would just be the single tail.
Thus, if $H_0$ is true, we incorrectly reject $H_0$ about 5% of the time when the sample mean is above the null value, as shown above.

Then, suppose the sample difference was smaller than 0.
In a one-sided test, we would set $H_A:$ difference $< 0.$ If the observed difference falls in the lower 5% of the figure, we would reject $H_0.$ That is, if $H_0$ is true, then we would observe this situation about 5% of the time.

By examining these two scenarios, we can determine that we will make a Type I error $5\%+5\%=10\%$ of the time if we are allowed to swap to the "best" one-sided test for the data.
This is twice the error rate we prescribed with our discernibility level: $\alpha=0.05$!
:::

::: {.important data-latex=""}
**Hypothesis tests should be set up *before* seeing the data.**

After observing data, it is tempting to turn a two-sided test into a one-sided test.
Avoid this temptation.
Hypotheses should be set up *before* observing the data.
:::

\index{hypothesis testing}

## Power {#sec-pow}

Although we won't go into extensive detail here, power is an important topic for follow-up consideration after understanding the basics of hypothesis testing.
A good power analysis is a vital preliminary step to any study as it will inform whether the data you collect are sufficient for being able to conclude your research broadly.

Often times in experiment planning, there are two competing considerations:

-   We want to collect enough data that we can detect important effects.
-   Collecting data can be expensive, and, in experiments involving people, there may be some risk to patients.

When planning a study, we want to know how likely we are to detect an effect we care about.
In other words, if there is a real effect, and that effect is large enough that it has practical value, then what is the probability that we detect that effect?
This probability is called the **power**\index{power}, and we can compute it for different sample sizes or different effect sizes.

::: {.important data-latex=""}
**Power.**

The power of the test is the probability of rejecting the null claim when the alternative claim is true.

How easy it is to detect the effect depends on both how big the effect is (e.g., how good the medical treatment is) as well as the sample size.
:::

We think of power as the probability that you will become rich and famous from your science.
In order for your science to make a splash, you need to have good ideas!
That is, you won't become famous if you happen to find a single Type I error which rejects the null hypothesis.
Instead, you'll become famous if your science is very good and important (that is, if the alternative hypothesis is true).
The better your science is (i.e., the better the medical treatment), the larger the *effect size* and the easier it will be for you to convince people of your work.

Not only does your science need to be solid, but you also need to have evidence (i.e., data) that shows the effect.
A few observations (e.g., $n = 2)$ is unlikely to be convincing because of well known ideas of natural variability.
Indeed, the larger the dataset which provides evidence for your scientific claim, the more likely you are to convince the community that your idea is correct.

Although a full discussion of relative power is beyond the scope of this text, you might be interested to know that, often, paired t-tests (discussed in @sec-mathpaired) are more powerful than independent t-tests (discussed in @sec-math2samp) because the pairing reduces the inherent variability across observations.
Additionally, because the median is almost always more variable than the mean, tests based on the mean are more powerful than tests based on the median.
That is to say, reducing variability (done in different ways depending on the experimental design and set-up of the analysis) makes a test more powerful in such that the data are more likely to reject the null hypothesis.

```{r}
#| include: false
terms_chp_14 <- c(terms_chp_14, "power")
```

\clearpage

## Chapter review {#sec-chp15-review}

### Summary

Although hypothesis testing provides a strong framework for making decisions based on data, as the analyst, you need to understand how and when the process can go wrong.
That is, always keep in mind that the conclusion to a hypothesis test may not be right!
Sometimes when the null hypothesis is true, we will accidentally reject it and commit a Type I error; sometimes when the alternative hypothesis is true, we will fail to reject the null hypothesis and commit a Type II error.
The power of the test quantifies how likely it is to obtain data which will reject the null hypothesis when indeed the alternative is true; the power of the test is increased when larger sample sizes are taken.

### Terms

The terms introduced in this chapter are presented in @tbl-terms-chp-14.
If you're not sure what some of these terms mean, we recommend you go back in the text and review their definitions.
You should be able to easily spot them as **bolded text**.

```{r}
#| label: tbl-terms-chp-14
#| tbl-cap: Terms introduced in this chapter.
#| tbl-pos: H
make_terms_table(terms_chp_14)
```

\clearpage

## Exercises {#sec-chp14-exercises}

Answers to odd-numbered exercises can be found in [Appendix -@sec-exercise-solutions-14].

::: {.exercises data-latex=""}
{{< include exercises/_14-ex-foundations-errors.qmd >}}
:::