Stat 406
Geoff Pleiss, Trevor Campbell
Last modified – 19 September 2024
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\]
A model is a set of distributions that explain data \(\{ Z = (X, Y) \}\), i.e. \[\mathcal{P} = \{ P: \quad Y \mid X \sim \mathcal N( f(X), \sigma^2) \quad \text{for some ``smooth'' f} \}\] (Why do we have to specify that \(f\) is smooth? Why can’t it be any function?) Choose the \(P \in \mathcal P\) that makes the “best” predictions on new \(X, Y\) pairs. (Next slide: how do we formalize “best”?)
Last time
What is a model?
Goal of learning
\[\mathcal{P} = \{ P: \quad Y \mid X \sim \mathcal N( f(X), \sigma^2) \quad \text{for some ``smooth'' f} \}\]
Specify how a \(P \in \mathcal P\) makes predictions \(\hat Y\) on new inputs.
(E.g.: \(\hat Y = f(X)\) for \(P = \mathcal N(f(X), \sigma^2)\).)
Introduce a loss function \(\ell(Y, \hat{Y})\) (a datapoint-level function).
(E.g.: \(\ell(Y, \hat Y) = (Y - \hat Y)^2\))
Define the test error of \(P \in \mathcal P\) as the expected loss (a population-level function):
\[\text{Err}(P) = E[\ell(Y, \hat Y)] = E[(Y - f(X))^2]\]
The best model is the one that minimizes the test error:
(\(P^* = \argmin_{P \in \mathcal P} \text{Err}(P)\))
Last time: when \(\ell(Y, \hat Y) = (Y - \hat Y)^2\), we showed that the regression function is the best model:
\[
\text{Regression function } \triangleq E[Y \mid X] = \argmin_{P \in \mathcal P} \text{Err}(P) = \argmin_{P \in \mathcal P} E[\ell(Y, \hat Y)]
\]
(We’ll discuss various methods for producing \(\hat f(X)\) estimators throughout this course.)
Our estimator \(\hat f\) is a random variable (it depends on the training sample).
So let’s consider the risk (the expected test error):
\[
R_n = E_{\hat f} \left[ \text{Err}(\hat f) \right] = E_{\hat f, X, Y} \left[ \ell(Y, \hat f(X)) \right]
\]
Note
Test error is a metric for a fixed \(\hat f\). It averages over all possible test points, but assumes a fixed training set.
Risk averages over everything that is random: (1) the test data point sampled from our population, and (2) the training data that produces \(\hat f\).
When \(\ell(Y, \hat Y) = (Y - \hat Y)^2\), the prediction risk of \(\hat f(X)\) decomposes into two terms:
\[
R_n \quad = \quad \underbrace{E_{\hat f, X, Y} \left[ \: \left( E[Y\mid X] - \hat f(X) \right)^2 \right]}_{(1)} \quad + \quad \underbrace{E_{X, Y} \left[ \: \left( Y - E[Y\mid X] \right)^2 \right]}_{(2)}
\]
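This decomposition can be checked by simulation. In this hedged toy setup (my own choice, not from the lecture), \(E[Y \mid X]\) is a constant \(\mu\) and \(\hat f\) is the training-sample mean, so term (2) is exactly \(\sigma^2\) and term (1) should be \(\mathrm{Var}(\hat f) = \sigma^2 / n\).

```python
import numpy as np

# Toy check of R_n = (1) estimation error + (2) irreducible noise.
# E[Y | X] = mu is constant and hat_f is the training-sample mean.
rng = np.random.default_rng(1)
sigma, n_train, n_reps, n_test = 0.5, 20, 4000, 500
mu = 2.0                                            # E[Y | X]

Y_train = mu + rng.normal(0, sigma, size=(n_reps, n_train))
f_hat = Y_train.mean(axis=1, keepdims=True)         # one random estimator per rep

Y_test = mu + rng.normal(0, sigma, size=(n_reps, n_test))
R_n = np.mean((Y_test - f_hat) ** 2)                # averages over hat_f AND (X, Y)

term1 = np.mean((mu - f_hat) ** 2)                  # (1) estimation error ~ sigma^2/n
term2 = sigma ** 2                                  # (2) irreducible noise, exactly
```

The cross term vanishes in expectation, so `R_n` matches `term1 + term2` up to Monte Carlo error.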
The estimation error term (1) further decomposes into two components:
\[
\begin{aligned}
\underbrace{E_{\hat f, X, Y} \left[ \: \left( E[Y\mid X] -\hat f(X) \right)^2 \right]}_{\text{Estimation error}} \quad &= \quad \underbrace{ E_{X, Y} \left[ \left( E[Y\mid X] - E \left[\hat f(X) \mid X\right] \right)^2 \right]}_{(A)} \\
&+ \quad \underbrace{E_{\hat f, X} \left[ \: \left( E \left[\hat f(X) \mid X\right] -\hat f(X) \right)^2 \right]}_{(B)}
\end{aligned}
\]
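The split into squared bias (A) and variance (B) can also be verified numerically. In this hedged sketch (a toy estimator I chose for illustration), \(\hat f\) is a deliberately shrunken sample mean, so its bias is nonzero:

```python
import numpy as np

# Toy check: estimation error = squared bias (A) + variance (B)
# for the biased estimator f_hat = 0.8 * mean(Y_train).
rng = np.random.default_rng(2)
sigma, n_train, n_reps = 0.5, 20, 200_000
mu = 2.0                                   # E[Y | X] (constant, for simplicity)

Y_train = mu + rng.normal(0, sigma, size=(n_reps, n_train))
f_hat = 0.8 * Y_train.mean(axis=1)         # one fitted value per training set

est_error = np.mean((mu - f_hat) ** 2)     # left-hand side
A = (mu - f_hat.mean()) ** 2               # squared bias: (2 - 1.6)^2 = 0.16
B = np.mean((f_hat - f_hat.mean()) ** 2)   # variance: 0.64 * sigma^2 / n
```

With empirical means, the identity `est_error == A + B` holds exactly (up to floating point), which mirrors the algebra above.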
\(R_n\) is a theoretical construct.
We can never compute the true \(R_n\); we have to estimate it from data.
\(\hat R_n(\hat f)\) is a bad estimator for \(R_n\).
So we should never use it.
These all have the same \(R^2\) and Training Error
Our training error \(\hat R_n(\hat f)\) is an estimator of \(R_n\).
So we can ask “is \(\hat R_n(\hat f)\) a good estimator for \(R_n\)?”
Let’s measure the error of our empirical risk estimator:
\[E[(R_n - \hat R_n(\hat f))^2]\] (What is the expectation with respect to?)
As before, we can decompose the error of our risk estimator into bias and variance:
\[
E[(R_n - \hat R_n(\hat f))^2] = \underbrace{( R_n - E[\hat R_n(\hat f)])^2}_{\text{bias}} + \underbrace{E[( \hat R_n(\hat f) - E[\hat R_n(\hat f)])^2]}_{\text{variance}}
\]
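Why is \(\hat R_n(\hat f)\) a bad estimator? Mostly bias: training error is optimistic. A hedged toy simulation (sample-mean predictor and all parameters are my own choices) makes this visible:

```python
import numpy as np

# Toy demo: training error is a downward-biased estimate of risk.
# For a sample-mean predictor on n points, E[training MSE] = sigma^2 * (n-1)/n,
# while the risk on a fresh point is sigma^2 * (1 + 1/n).
rng = np.random.default_rng(3)
sigma, n, n_reps = 1.0, 10, 100_000

Y = rng.normal(0.0, sigma, size=(n_reps, n))
f_hat = Y.mean(axis=1, keepdims=True)             # fit on the training data
train_err = np.mean((Y - f_hat) ** 2)             # ~ 0.9 (too optimistic)

Y_new = rng.normal(0.0, sigma, size=(n_reps, 1))  # fresh test draws
risk = np.mean((Y_new - f_hat) ** 2)              # ~ 1.1
```

The gap between `train_err` and `risk` is exactly the optimism the decomposition's bias term captures: the same data were used to fit \(\hat f\) and to score it.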
Recall that \(\hat R_n(\hat f)\) is estimated from the training data \(\{ (X_i, Y_i) \}_{i=1}^n\).
Consider an alternative estimator built from \(\{ (X_j, Y_j) \}_{j=1}^m\) that was not part of the training set. \[\tilde R_m(\hat f) = {\textstyle \frac{1}{m} \sum_{j=1}^m} \ell(Y_j, \hat f(X_j))
\] The error of this estimator can also be decomposed into bias and variance \[
E[(R_n - \tilde R_m(\hat f))^2] = \underbrace{( R_n - E_{\hat f,X_j,Y_j}[\tilde R_m(\hat f)])^2}_{\text{bias}} + \underbrace{E_{\hat f,X_j,Y_j}[( \tilde R_m(\hat f) - E_{\hat f,X_j,Y_j}[\tilde R_m(\hat f)])^2]}_{\text{variance}}
\]
\(\tilde R_m(\hat f)\) has zero bias!
\[
\begin{aligned}
E_{\hat f,X_j,Y_j} \left[ \tilde R_m(\hat f) \right]
&= E_{\hat f,X_j,Y_j} \left[ \frac{1}{m} \sum_{j=1}^m \ell(Y_j, \hat f(X_j)) \right] \\
&= \frac{1}{m} \sum_{j=1}^m E_{\hat f,X_j,Y_j} \left[ \ell(Y_j, \hat f(X_j)) \right]
= R_n
\end{aligned}
\]
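The zero-bias claim can be checked by simulation. In this hedged toy setup (my own choice of model and estimator), the exact risk is known, so we can compare it to the Monte Carlo average of \(\tilde R_m(\hat f)\) over many replications:

```python
import numpy as np

# Toy check that E[tilde_R_m(hat_f)] = R_n. The estimator hat_f is a
# sample mean fitted on n_train fresh points per replication; m points are held out.
rng = np.random.default_rng(5)
sigma, n_train, m, n_reps = 0.5, 20, 5, 50_000
mu = 2.0                                           # E[Y | X]

Y_train = mu + rng.normal(0, sigma, size=(n_reps, n_train))
f_hat = Y_train.mean(axis=1, keepdims=True)        # fit on training data only
Y_hold = mu + rng.normal(0, sigma, size=(n_reps, m))

tilde_R = np.mean((Y_hold - f_hat) ** 2, axis=1)   # one tilde_R_m per replication
mean_tilde_R = tilde_R.mean()                      # Monte Carlo E[tilde_R_m]
R_n = sigma ** 2 + sigma ** 2 / n_train            # exact risk for this toy model
```

Because the held-out points are independent of \(\hat f\), the average of `tilde_R` converges to `R_n` with no systematic offset, unlike the training error.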
This option follows the logic on the previous slide.
If we randomly “hold out” \(\{ (X_j, Y_j) \}_{j=1}^m\) from the training set, we can use this data to get a (nearly) unbiased estimator of \(R_n\):
\[
R_n \approx \tilde R_m(\hat f) \triangleq {\textstyle{\frac 1 m \sum_{j=1}^m \ell ( Y_j, \hat f(X_j))}}
\]
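Putting the hold-out recipe into code, a minimal hypothetical sketch (the sine data-generating process and the cubic-polynomial fit are my own choices, not the lecture's method):

```python
import numpy as np

# Hold-out estimate of the risk: fit on a random subset, score on the rest.
rng = np.random.default_rng(4)
N, m = 100, 30

X = rng.uniform(-3, 3, size=N)
Y = np.sin(X) + rng.normal(0, 0.5, size=N)

idx = rng.permutation(N)                 # random split of the indices
train, hold = idx[m:], idx[:m]

# fit a degree-3 polynomial on the training portion only
coefs = np.polyfit(X[train], Y[train], deg=3)
f_hat = np.polyval(coefs, X[hold])       # predictions on the held-out points

R_tilde = np.mean((Y[hold] - f_hat) ** 2)   # nearly unbiased estimate of R_n
```

Only the \(N - m\) training points touch the fit, so the \(m\) held-out losses are honest draws of \(\ell(Y, \hat f(X))\).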