edits to classification intro
trevorcampbell committed Oct 14, 2024
1 parent 16b7d95 commit c54b9e4
Showing 6 changed files with 1,366 additions and 1,358 deletions.

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion _freeze/site_libs/revealjs/dist/theme/quarto.css

Large diffs are not rendered by default.

181 changes: 91 additions & 90 deletions schedule/slides/14-classification-intro.qmd
@@ -27,8 +27,12 @@ various mutations are associated with different phenotypes?
These problems are [not]{.secondary} regression
problems. They are [classification]{.secondary} problems.

. . .

Classification involves a **categorical response variable** (no notion of "order"/"distance").

## The Set-up

## Setup

It begins just like regression: suppose we have observations
$$\{(x_1,y_1),\ldots,(x_n,y_n)\}$$
@@ -50,14 +54,14 @@ variance and better predictions.

## How do we measure quality?

Before in regression, we have $y_i \in \mathbb{R}$ and use squared error loss to measure accuracy: $(y - \hat{y})^2$.
Before in regression, we have $y_i \in \mathbb{R}$ and use $(y - \hat{y})^2$ loss to measure accuracy.

Instead, let $y \in \mathcal{K} = \{1,\ldots, K\}$

(This labeling is arbitrary; other values, such as $\{-1,1\}$, are sometimes
used.)

We can always take "factors": $\{\textrm{cat},\textrm{dog}\}$ and convert to integers, which is what we assume.
We will usually convert categories/"factors" (e.g. $\{\textrm{cat},\textrm{dog}\}$) to integers.


We again make predictions $\hat{y}=k$ based on the data
@@ -66,162 +70,176 @@ We again make predictions $\hat{y}=k$ based on the data
* We get zero loss if we predict the right class
* We lose $\ell(k,k')$ for an incorrect prediction ($k\neq k'$)

## How do we measure quality?

Example: You're trying to build a fun widget to classify images of cats and dogs.

| Loss | Predict Dog | Predict Cat |
|:---: | :---: | :---: |
| Actual Dog | 0 | ? |
| Actual Cat | ? | 0 |

. . .

Use the zero-one loss (1 if wrong, 0 if right). *Type of error doesn't matter.*

| Loss | Predict Dog | Predict Cat |
|:---: | :---: | :---: |
| Actual Dog | 0 | 1 |
| Actual Cat | 1 | 0 |

## How do we measure quality?

Suppose you have a fever of 39º C. You get a rapid test on campus.
Example: Suppose you have a fever of 39º C. You get a rapid test on campus.

| Loss | Test + | Test - |
|:---: | :---: | :---: |
| Are + | 0 | Infect others |
| Are - | Isolation | 0 |
| Are + | 0 | ? (Infect others) |
| Are - | ? (Isolation) | 0 |

## How do we measure quality?
. . .

Use a weighted loss; *type of error matters!*

Suppose you have a fever of 39º C. You get a rapid test on campus.

| Loss | Test + | Test - |
|:---: | :---: | :---: |
| Are + | 0 | 1 |
| Are + | 0 | (LARGE) |
| Are - | 1 | 0 |


## How do we measure quality?
Note that one class is "important": we sometimes call that one *positive*. Errors are *false positive* and *false negative*.

> We're going to use $g(x)$ to be our classifier. It takes values in $\mathcal{K}$.
In practice, you have to design your loss (just like before) to reflect what you care about.
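
A minimal sketch of how a weighted loss can change the decision. The false-negative cost of 10 and the posterior probabilities below are illustrative assumptions, not values from the slides.

```r
# Hypothetical loss matrix: rows are the truth, columns are the prediction.
# The cost of 10 for a false negative is an assumed stand-in for "LARGE".
loss <- matrix(
  c(0, 10,
    1,  0),
  nrow = 2, byrow = TRUE,
  dimnames = list(truth = c("pos", "neg"), predict = c("pos", "neg"))
)

# Assumed posterior probabilities Pr(Y = k | X = x) for one person.
posterior <- c(pos = 0.2, neg = 0.8)

# Expected loss of each prediction: sum over truths k of loss[k, prediction] * Pr(Y = k | x).
# The vector recycles down the rows, so row k is weighted by posterior[k].
expected_loss <- colSums(loss * posterior)

# Predict the class with the smallest expected loss.
best <- names(which.min(expected_loss))

# For comparison, zero-one loss would simply pick the most probable class.
zero_one_choice <- names(which.max(posterior))
```

With these assumed numbers, predicting positive has expected loss 0.8 and predicting negative has expected loss 2, so the weighted loss predicts positive even though the negative class is more probable.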


## How do we measure quality?

Again, we appeal to risk
We're going to use $g(x)$ to be our classifier. It takes values in $\mathcal{K}$.

Consider the risk
$$R_n(g) = E [\ell(Y,g(X))]$$ If we use the law of
total probability, this can be written
$$R_n(g) = E_X \sum_{y=1}^K \ell(y,\; g(X)) Pr(Y = y \given X)$$
$$R_n(g) = E\left[\sum_{y=1}^K \ell(y,\; g(X)) Pr(Y = y \given X)\right]$$
We minimize this over a class of options $\mathcal{G}$, to produce
$$g_*(X) = \argmin_{g\in\mathcal{G}} E_X \sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)$$
$$g_*(X) = \argmin_{g\in\mathcal{G}} E\left[\sum_{y=1}^K \ell(y,g(X)) Pr(Y = y \given X)\right]$$

## How do we measure quality?

$g_*$ is named the [Bayes' classifier]{.secondary} for loss $\ell$ in class $\mathcal{G}$.

$R_n(g_*)$ is called the [Bayes' limit]{.secondary} or [Bayes' Risk]{.secondary}.

[It's the best we could hope to do in terms of]{.hand} $\ell$ [if we knew the distribution of the data.]{.hand}

. . .
It's the best we could hope to do *even if we knew the distribution of the data* (recall irreducible error!)

But we don't, so we'll try to do our best to estimate $g_*$.


## Best classifier overall

(for now, we limit to 2 classes)

Once we make a specific choice for $\ell$, we can find $g_*$ exactly (pretending we know the distribution)
Suppose we actually *know* the distribution of everything, and we've picked $\ell$ to be the [zero-one loss]{.secondary}


Because $Y$ takes only a few values, [zero-one]{.secondary}
loss is natural (but not the only option)
$$\ell(y,\ g(x)) = \begin{cases}0 & y=g(x)\\1 & y\neq g(x) \end{cases} \Longrightarrow R_n(g) = \Expect{\ell(Y,\ g(X))} = Pr(g(X) \neq Y),$$

## Best classifier overall
$$\ell(y,\ g(x)) = \begin{cases}0 & y=g(x)\\1 & y\neq g(x) \end{cases}$$

| Loss | Test + | Test - |
|:---: | :---: | :---: |
| Are + | 0 | 1 |
| Are - | 1 | 0 |

Then

$$R_n(g) = \Expect{\ell(Y,\ g(X))} = Pr(g(X) \neq Y)$$

## Best classifier overall

This means we want to
classify a new observation $(x_0,y_0)$ such that
$g(x_0) = y_0$ as often as possible
Want to classify a new observation $(X,Y)$ such that
$g(X) = Y$ with as high probability as possible. Under zero-one loss, we have

$$g_* = \argmin_{g} \Pr(g(X) \neq Y) = \argmin_g \left[1- \Pr(g(X) = Y)\right] = \argmax_g \Pr(g(X) = Y)$$

. . .

Under this loss, we have
$$
\begin{aligned}
g_*(X) &= \argmin_{g} Pr(g(X) \neq Y) \\
&= \argmin_{g} \left[ 1 - Pr(Y = g(x) | X=x)\right] \\
&= \argmax_{g} Pr(Y = g(x) | X=x )
g_* &= \argmax_{g} E[\Pr(g(X) = Y | X)]\\
&= \argmax_{g} E\left[\sum_{k\in\mathcal{K}}1[g(X) = k]\Pr(Y=k | X)\right]
\end{aligned}
$$

. . .

## Estimating $g_*$

For each $x$, only one $k$ can satisfy $g(x) = k$. So for each $x$,

$$
g_*(x) = \argmax_{k\in\mathcal{K}} \Pr(Y = k | X = x).
$$
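
A small sketch of the pointwise rule, pretending we know the posterior. The table `post` is a made-up example with three values of $x$ and $K=3$ classes.

```r
# Assumed posterior table: each row is a value of x, each column a class k.
# Every row sums to 1. The numbers are illustrative only.
post <- rbind(
  x1 = c(0.7, 0.2, 0.1),
  x2 = c(0.3, 0.3, 0.4),
  x3 = c(0.1, 0.6, 0.3)
)
colnames(post) <- c("1", "2", "3")

# Bayes classifier under zero-one loss: the most probable class at each x.
g_star <- apply(post, 1, which.max)

# Conditional probability of an error at each x: 1 - max_k Pr(Y = k | X = x).
cond_risk <- 1 - apply(post, 1, max)
```

Here `g_star` picks classes 1, 3, 2 for the three rows. Even this optimal rule errs with probability `cond_risk` at each $x$, which is the irreducible part.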

### Classifier approach 1 (empirical risk minimization):
## Estimating $g_*$ Approach 1: Empirical risk minimization

1. Choose some class of classifiers $\mathcal{G}$.

2. Find $\argmin_{g\in\mathcal{G}} \sum_{i = 1}^n I(g(x_i) \neq y_i)$
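
The two steps above can be sketched as follows, assuming a deliberately simple class $\mathcal{G}$ of threshold rules on one feature and simulated data (both are illustrative choices, not from the slides). The code uses the mean rather than the sum of errors, which has the same minimizer.

```r
set.seed(1)

# Simulated training data (assumption): one feature, two classes coded 0/1.
n <- 200
y <- rbinom(n, 1, 0.5)
x <- rnorm(n, mean = ifelse(y == 1, 1, -1))

# Step 1: the class G of rules g_t(x) = 1 if x > t, else 0, over a grid of t.
thresholds <- seq(-3, 3, by = 0.05)

# Step 2: empirical risk of each rule, i.e. its training misclassification rate.
emp_risk <- sapply(thresholds, function(t) mean(as.integer(x > t) != y))

# ERM picks the rule with the smallest training error.
t_hat <- thresholds[which.min(emp_risk)]
```

A grid search works here only because the class has a single parameter. The training error is piecewise constant in `t`, so gradient-based optimization does not apply directly.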


## Bayes' Classifier and class densities (2 classes)
## Estimating $g_*$ Approach 2: Class densities

Using **Bayes' theorem**, and recalling that $f_*(X) = E[Y \given X]$
Consider 2 classes $\{0,1\}$: using **Bayes' theorem** (and being loose with notation),

$$\begin{aligned}
f_*(X) & = E[Y \given X] = Pr(Y = 1 \given X) \\
&= \frac{Pr(X\given Y=1) Pr(Y=1)}{Pr(X)}\\
& =\frac{Pr(X\given Y = 1) Pr(Y = 1)}{\sum_{k \in \{0,1\}} Pr(X\given Y = k) Pr(Y = k)} \\ & = \frac{p_1(X) \pi}{ p_1(X)\pi + p_0(X)(1-\pi)}\end{aligned}$$
\Pr(Y=1 \given X=x) &= \frac{\Pr(X=x\given Y=1) \Pr(Y=1)}{\Pr(X=x)}\\
&=\frac{\Pr(X=x\given Y = 1) \Pr(Y = 1)}{\sum_{k \in \{0,1\}} \Pr(X=x\given Y = k) \Pr(Y = k)} \\
&= \frac{p_1(x) \pi}{ p_1(x)\pi + p_0(x)(1-\pi)}\end{aligned}$$

* We call $p_k(X)$ the [class (conditional) densities]{.secondary}
* We call $p_k(x)$ the [class (conditional) densities]{.secondary}

* $\pi$ is the [marginal probability]{.secondary} $P(Y=1)$

## Bayes' Classifier and class densities (2 classes)
* Similar formula for $\Pr(Y=0\given X=x) = p_0(x)(1-\pi)/(\dots)$

## Estimating $g_*$ Approach 2: Class densities

Recall $g_*(x) = \argmax_k \Pr(Y=k|x)$; so we classify 1 if

The Bayes' Classifier (best classifier for 0-1 loss) can be rewritten
$$\frac{p_1(x) \pi}{ p_1(x)\pi + p_0(x)(1-\pi)} > \frac{p_0(x) (1-\pi)}{ p_1(x)\pi + p_0(x)(1-\pi)}$$

i.e., the [Bayes' Classifier]{.secondary} (best classifier for 0-1 loss) can be rewritten

$$g_*(X) = \begin{cases}
1 & \textrm{ if } \frac{p_1(X)}{p_0(X)} > \frac{1-\pi}{\pi} \\
0 & \textrm{ otherwise}
\end{cases}$$


### Approach 2: estimate everything in the expression above.
### Estimate everything in the expression above.

* We need to estimate $p_1$, $p_2$, $\pi$, $1-\pi$
* We need to estimate $p_0$, $p_1$, $\pi$, $1-\pi$
* Easily extended to more than two classes
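
A rough sketch of the plug-in idea, assuming (for illustration only) that each class density is Gaussian and using simulated data. The names `p_hat`, `pi_hat`, and `g_hat` are hypothetical.

```r
set.seed(2)

# Simulated training data (assumption): class 1 centered at 1, class 0 at -1.
n <- 500
y <- rbinom(n, 1, 0.3)
x <- rnorm(n, mean = ifelse(y == 1, 1, -1))

# Estimate the marginal probability pi = Pr(Y = 1).
pi_hat <- mean(y)

# Estimate each class density, ASSUMING a Gaussian within each class.
p_hat <- function(x0, k) {
  dnorm(x0, mean = mean(x[y == k]), sd = sd(x[y == k]))
}

# Plug-in rule: predict 1 when p1(x) / p0(x) > (1 - pi) / pi.
g_hat <- function(x0) {
  as.integer(p_hat(x0, 1) / p_hat(x0, 0) > (1 - pi_hat) / pi_hat)
}

g_hat(c(-2, 0, 2))
```

Any other density estimate could be swapped in for `p_hat`; the Gaussian form is just one convenient choice.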


## An alternative easy classifier


Zero-One loss was natural, but try something else
## Estimating $g_*$ Approach 3: Regression discretization


Let's try using [squared error loss]{.secondary} instead:
$\ell(y,\ f(x)) = (y - f(x))^2$
0-1 loss natural, but discrete. Let's try using [squared error]{.secondary}: $\ell(y,\ f(x)) = (y - f(x))^2$

**What will be the optimal classifier here?** (hint: think about regression)

Then, the Bayes' Classifier (the function that minimizes the Bayes Risk) is
$$g_*(x) = f_*(x) = E[ Y \given X = x] = Pr(Y = 1 \given X)$$
(recall that $f_* \in [0,1]$ is _still_ the regression function)
. . .

In this case, our "class" will actually just be a probability. But this isn't a class, so it's a bit unsatisfying.
The "Bayes' Classifier" (sort of...minimizes risk) is just the regression function!
$$f_*(x) = \Pr(Y = 1 \given X=x) = E[ Y \given X = x] $$

How do we get a class prediction?
In this case, $0\leq f_*(x)\leq 1$ not discrete... How do we get a class prediction?

. . .

Discretize the probability:
**Discretize the output**:

$$g(x) = \begin{cases}0 & f_*(x) < 1/2\\1 & \textrm{else}\end{cases}$$

## Estimating $g_*$

### Approach 3:

1. Estimate $f_*$ using any method we've learned so far.
1. Estimate $\hat f(x) = E[Y|X=x] = \Pr(Y=1|X=x)$ using any method we've learned so far.
2. Predict 0 if $\hat{f}(x)$ is less than 1/2, else predict 1.
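
The two steps above can be sketched with ordinary least squares as the regression method. The simulated data and the choice of `lm` are assumptions for illustration; any regression estimator could take its place.

```r
set.seed(3)

# Simulated training data (assumption), with y coded as 0/1.
n <- 300
train <- data.frame(x = rnorm(n))
train$y <- rbinom(n, 1, plogis(2 * train$x))

# Step 1: estimate f(x) = E[Y | X = x]; least squares is one possible choice.
fit <- lm(y ~ x, data = train)
f_hat <- predict(fit, newdata = train)

# Step 2: discretize at 1/2 (predict 0 below 1/2, else 1).
g_hat <- as.integer(f_hat >= 0.5)

# Training misclassification rate of the resulting classifier.
train_error <- mean(g_hat != train$y)
```

Note that a linear fit can return values of `f_hat` outside $[0,1]$, so thresholding explicitly at 1/2 is safer than rounding the fitted values.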




## Claim: Classification is easier than regression


@@ -279,38 +297,21 @@ ggplot(tib) +

## How to find a classifier

[Why did we go through that math?]{.hand}
**Why did we go through that math?**

Each of these approaches suggests a way to find a classifier
Each of these approaches has strengths/drawbacks:

* [Empirical risk minimization:]{.secondary} Choose a set
of classifiers $\mathcal{G}$ and find $g \in \mathcal{G}$ that minimizes
some estimate of $R_n(g)$
* [Empirical risk minimization:]{.secondary} Minimize $R_n(g)$ in some family $\mathcal{G}$

> (This can be quite challenging as, unlike in regression, the
training error is nonconvex)
> (This can be quite challenging as, unlike in regression, the training error is nonconvex)
* [Density estimation:]{.secondary} Estimate $\pi$ and $p_k$

* [Regression:]{.secondary} Find an
estimate $\hat{f}$ of $f^*$ and compare the predicted value to 1/2




##

Easiest classifier when $y\in \{0,\ 1\}$:

(stupidest version of the third case...)

```{r eval=FALSE}
ghat <- round(predict(lm(y ~ ., data = trainingdata)))
```

Think about why this may not be very good. (At least 2 reasons I can think of.)
> (We have to estimate class densities to classify. Too roundabout?)
* [Regression:]{.secondary} Find an estimate $\hat{f}\approx E[Y|X=x]$ and compare the predicted value to 1/2

# Next time:
> (Unnatural, estimates whole regression function when we'll just discretize anyway)
# Next time...
Estimating the densities
