seminar04.qmd

---
title: "Seminar 04"
subtitle: "MA22004"
date: "2024-10-09"
author: "Dr Eric Hall   •   ehall001@dundee.ac.uk"
format: 
  revealjs:
    chalkboard: true
    html-math-method: katex
    theme: [default, resources/custom.scss]
    css: resources/fonts.css
    logo: resources/logo.png
    footer: ""
    template-partials:
      - resources/title-slide.html
    transition: slide
    background-transition: fade
from: markdown+emoji
lang: en
---

```{r}
#| include: false
knitr::opts_chunk$set(echo = FALSE, comment = "", fig.asp = .5)
library(tidyverse)
library(latex2exp)
library(knitr)
library(kableExtra)
library(janitor)
library(fontawesome)
library(latex2exp)
```

# Announcements {.mySegue .center}
:::{.hidden}
\DeclareMathOperator{\Var}{Var}
\DeclareMathOperator{\E}{\mathbf{E}}
\DeclareMathOperator{\Cov}{Cov}
\DeclareMathOperator{\corr}{corr}
\newcommand{\se}{\mathsf{se}}
\DeclareMathOperator{\sd}{sd}
:::

## Attendance

::: {layout="[[-1], [1], [-1]]"}
![](images/seats.png){fig-align="center" fig-alt="Register your attendance using SEAtS"}
:::

## Reminders 

- Discuss labs, R, and RStudio at Thu workshop. 
- Discuss worksheet 3 at Fri workshop.
- Lab 2 due **Fri 2024-10-11** at **17:00**. 


## Outline of today &nbsp;&nbsp;`r fa("compass")`

1. Inference for means

2. Inference for proportions

3. How much water? &nbsp;&nbsp;`r fa("lightbulb")`


# Inference for means {.mySegue .center}

## Point estimate for mean

Consider data $x_1, \dots, x_m$. 

(Assuming it exists) 

A point estimate for the population mean, $\mu$, is:

$$\hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} x_i \,.$$

## Consider the _population_ {.smaller}

We consider inferences for the population mean assuming that samples are drawn for one of three different populations. 

1. Normal population with known $\sigma$
2. Any population with unknown $\sigma$, when $m$ large
3. Normal population with unknown $\sigma$, $m$ may be small

:::{.callout-important}
## What ... 
... changes? 

... stays the same?
:::


## Confidence intervals for mean

The basic form is awlays:

$$\text{pt est} \pm \underbrace{(\text{critical value}) \cdot (\text{precision of pt est})}_\text{margin of error}$$


:::{.callout-warning}
## What changes? 
Margin of error!
:::

## Hypothesis tests for means

Tests on equality of mean $\mu$, i.e., $H_0 : \mu = \mu_0$ against one of the alternatives:

- $H_a : \mu \neq \mu_0$
- $H_a : \mu > \mu_0$
- $H_a : \mu < \mu_0$

:::{.callout-warning}
## What changes? 
Test statistic!
:::

:::{.notes}
- Does the available data provides sufficient evidence to reject a null hypothesis $H_0$ which is assumed to be true?
- Note the equality is part of the null hypothesis.
- If the observations disagree with $H_0$, then we reject the null hypothesis. 
- If the sample evidence does not strongly contradict $H_0$, then we continue to believe $H_0$. 
- $\mu_0$ is the **null value** that separate the null and alternative
:::


## Forms for $100(1-\alpha)\%$ CIs {.smaller}

1. Normal population with known $\sigma$:
$$\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{m}}$$
2. Any population with unknown $\sigma$, when $m$ large
$$\overline{x} \pm z_{\alpha/2} \frac{s}{\sqrt{m}}$$
3. Normal population with unknown $\sigma$, $m$ may be small
$$\overline{x} \pm t_{\alpha/2, m-1} \frac{s}{\sqrt{m}}$$


## Hypothesis test statistics {.smaller}

1. Normal population with known $\sigma$:
$$Z = \frac{\overline{X} - \mu_0}{\sigma / \sqrt{m}} \sim \mathsf{N}(0,1)$$
2. Any population with unknown $\sigma$, when $m$ large
$$Z = \frac{\overline{X} - \mu_0}{S / \sqrt{m}} \sim \mathsf{N}(0,1)$$
3. Normal population with unknown $\sigma$, $m$ may be small
$$T = \frac{\overline{X} - \mu_0}{S / \sqrt{m}} \sim \mathsf{t}(m-1)$$

:::{.notes}
- Once we have sample data we make point estimate!
:::


# `r fa("dumbbell")`  Freshman 15  {.mySegue .center}

> A common belief among the lay public is that body weight increases after entry into [university], and the phrase "freshman 15" has been coined to describe the 15 pounds that students presumably gain over their freshman [i.e. first] year. (from Devore, exm #8.2)

## Example (2/n) {.smaller}

Let $\mu$ denote the true average weight gain of women over the course of their first year at university. The foregoing quote suggests that we should test the hypotheses $H_0: \mu = 15$ versus $H_a : \mu \neq 15$. 

Suppose that a random sample of $m$ such individuals is selected and the weight gain of each one is determined, resulting in a sample mean weight gain $\overline{x}$ and a sample standard deviation $s$. 

:::{.notes}
- Before data is obtained, the sample mean weight gain is a random variable $X$ and the sample standard deviation is also a random variable $S$!
:::

## Example (3/n)

:::{.notes}
- $$Z = \frac{\overline{X} - \mu}{\sigma/ \sqrt{m}}$$
- Assuming $H_0$ is true: $$Z = \frac{\overline{X} - 15}{\sigma/ \sqrt{m}}$$
- But don't know $\sigma$:
$$Z = \frac{\overline{X} - 15}{S/ \sqrt{m}} \sim \mathsf{N}(0,1)$$
:::

## Example (4/n)

Suppose $\overline{x} = 13.7$ and plugging in values for $s$ and $\sqrt{n}$ yields $z = -2.8$. 

Which values of the test statistic are at least as contradictory to $H_0$ as $-2.80$ itself?

:::{.notes}
- E.g., $m = 40$ and $s = 2.936401$
- Possible values: 13.5, 13.0, any value smaller than 13.7 is more contradictory to $H_0$ than 13.7
- 16.3 is just as contradictory to $H_0$ as is 13.7; both fall the same distance from the null value 15
- Just as values of $\overline{x}$ that are at most 13.7 correspond to $z\leq -2.8$, values of $\overline{x}$ that are at least 16.3 correspond to $z \geq 2$
:::

## Example (5/n)

$P( Z \leq -2.80 \quad \text{or}\quad  Z \geq 2.80 \quad \text{assuming } H_0 \text{ true})$

:::{.notes}
$\approx 2 \cdot$ (area under the $z$ curve ot the right of 2.80)
$= 2[1 - \Phi(2.80)] = 2[1 - 0.9974] = 0.0052

- If the null hypothesis is in fact true, only about one half of one percent of all samples would result in a test statistic value at least as contradictory to the null hypothesis as is our value. Clearly -2.80 is among the possible test statistic values that are most contradictory to $H_0$. 
- The evidence suggest to reject $H_0$.
:::

## Example (6/n) {.smaller}

The article *Freshman 15: Fact or Fiction* (Obesity, 2006: 1438– 1443), for $m = 137$ students, the $\overline{x} = 2.42$ lb with $s =  5.72$ lb. 

This gives $$z = (2.42 - 15)/(5.72 /\sqrt{137}) = -25.7\,.$$

The probability of observing a value at least this extreme in either direction is essentially zero! 

The data very strongly contradicts the null hypothesis, and there is substantial evidence that true average weight gain is much closer to $0$ than to $15$.

:::{.notes}
- Note this implies some students lost weight!
:::

# Inference for proportions {.mySegue .center}

## (Derivation) point estimate for $p$ {.smaller}

Let $X$ be the count of members with a given property based on a sample of size $m$ from a population where a proportion $p$ share the property.

Then $\widehat{p} = X / m$ is an estimator of $p$. 

Assume $m p_0 \geq 10$ and $m (1-p_0) \geq 10$. 

### For example

$p\in (0,1)$ and consider iid Bernoulli trials $X_1, \dots, X_m \sim \mathsf{Bernoulli}(p)$.

$\widehat{p} = \frac{1}{m} \sum_{i=1}^m X_i\,, \qquad  \mu_{\widehat{p}} = \E X_i = p\,, \qquad \sigma_{\widehat{p}} = \sqrt{\frac{\Var{X_i}}{m}} = \sqrt{\frac{p(1-p)}{m}}$


## $100(1-\alpha)\%$ CI for $p$

$$\widehat{p} \pm z_{\alpha/2} \sqrt{\frac{\widehat{p}(1-\widehat{p})}{m}}$$

:::{.callout-important}
## Compare to CI for population mean...

How is the standard error different from the standard error for estimate of mean?
:::

## Hypothesis test for $p$ at level $\alpha$ {.smaller} 

Consider $H_0 : p = p_0$. The test statistic is
$$
 Z = \frac{\widehat{p} - p_0}{\sqrt{p_0 (1-p_0) / m}} \,.
$$

For a hypothesis test at level $\alpha$, we use the following procedure: 
        
- If $H_a : p > p_0$, then $P$-value is the area under $\mathsf{N}(0,1)$ to the right of $z$.
- If $H_a : p < p_0$, then $P$-value is the area under $\mathsf{N}(0,1)$ to the left of $z$.
- If $H_a : p \neq p_0$, then $P$-value is twice the area under $\mathsf{N}(0,1)$ to the right of $|z|$.  

# How much water? &nbsp;&nbsp;`r fa("lightbulb")` {.mySegue .center}

## The question {.mySegue .center}

How might you estimate the proportion of the earth covered by water?

## Sampling

If we were to take a random sample of points on the globe, then the proportion that touched water would be a reasonable estimate of the overall proportion of water covering the earth.

:::{.callout-warning}
## Remember the candy!

Are random samples always representative? How can we take a random  representative sample from globe (sphere)?
:::

## `r fa("lightbulb")` The task

:::: {.columns}
::: {.column width="60%"}
1. The globe will be tossed around.
2. Hit the globe with the tip of your index finger.
3. Shout:
    - "water!" if your finger touches water, or 
    - "land!" if your finger touches land.
4. We will record your observation.
:::

::: {.column width="40%"}
![](images/globe.jpg){height=100% fig.alt="Photo of large inflatable globe (topographic)."}

**Do not flinch ;)**
:::
::::
## The result

Let $p$ be the proportion of earth's surface covered with water.

- Point estimate $\hat{p}$
- Confidence interval for $p$
- Hypothesis test: $p$ is 0.7?

## Challenge:

How could one use random numbers from the computer to pick a random spot on the globe?

Any point is characterized by a latitude (which must be between 90◦ North and 90◦ South) and a longitude (between 180◦ West and 180◦ East). 

:::{.importantblock}
Can you pick a random spot on the globe by picking a latitude between −90 to 90 at random, and a longitude between −180 and 180 at random? 
:::


# `r fa("dumbbell")` Churchill   {.mySegue .center}

## Example: Churchill 1/n

Churchill claims that he will receive half the votes for the House of Commons seat for the constituency of Dundee. 

We are skeptical that he is as popular as he says. Suppose 116 out of 263 Dundonians polled claimed that they intended to vote for Churchill. 

Can it be concluded at significance level $0.10$ that more than half of all eligible Dundonains will vote for Churchill?

## Example: Churhill 2/n

:::{.tipblock}
What is the parameter of interest? What are the null and alternative hypotheses? 
:::

## Example: Churhill 3/n

The parameter of interest is $p$, the proportion of votes for Churchill.

The null hypothesis is $H_0 : p = 0.5$. 

The alternative hypothesis is $H_a : p < 0.5$. 

:::{.warningblock}
Since $263(0.5) = 131.5 > 10$, we satisfy the assumptions to carry out the hypothesis test using the statistic described earlier.
:::

## Example: Churhill 4/n

Based on the sample, $\widehat{p} = 116 / 263 = 0.4411$. 

The test statistic value is 
$$
\begin{aligned}
 z &= \frac{\widehat{p} - p_0}{\sqrt{p_0 (1-p_0) / m}} \\
 &= \frac{0.4411 - 0.5}{\sqrt{0.5 (1-0.5) / 263}}\\
 &= −1.91 \,.
\end{aligned}
$$

The $P$-value for this lower-tailed $z$ test is  $P = \Phi(−1.91) = 0.028$. 

## Example: Churhill 5/n

Since $P < 0.10 = \alpha$, we reject the null hypothesis at the $0.05$ level. 

:::{.warningblock}
What does this mean in words? Always place the result of the hypothesis test in the context of the problem!
:::

## Example: Churhill 6/n (conclusion)

The evidence for concluding that the true proportion is different from $p_0 = 0.5$ at the $0.10$ level is compelling.

# Summary

Today we discussed: inferences for population means (and proportions, which are like population means...)

We engaged in an activity about inferences for proportions that highlighted importance of representative sampling. 

:::{.callout-tip}
## Today's materials 

Slides posted to <https://dundeemath.github.io/MA22004-seminar04>.
:::