maybe watch this lecture if time
see notation in [[hypothesis complexity and generalisation]].
the standard assumption in machine learning is that the training data and the test data come from the same underlying distribution.
similarly, if the underlying distributions of the datasets for two tasks are the same, the same machine learning algorithm should work for both.
recall that we call the underlying distribution of input data its domain.
domain adaptation: reduce the difference between source and target domains, i.e. match the two underlying distributions
transfer learning: exploit training samples ('extract knowledge') from some source domain to improve performance in a target domain, i.e. train and test on different domains.
suppose we have a source domain with samples
$$ \{ (X_1^S, Y_1^S), \dots, (X_{n_S}^S, Y_{n_S}^S) \} $$
and a target domain with samples
$$ \{ (X_1^T, Y_1^T), \dots, (X_{n_T}^T, Y_{n_T}^T) \} $$
let's say the target domain is such that most (or all) of its samples are unlabelled, so we cannot simply train a new classifier there.
To use a classifier trained on the source domain in the target domain, we relate the target-domain risk to an expectation over the source distribution.
consider the expected risk in the target domain
$$R^T(h) = \mathbb{E}_{(X,Y) \sim p_t(X,Y)} [\ell (X, Y, h)]$$
$$ = \int_{(X,Y)} \ell(X, Y, h)\, p_t(X,Y) \thinspace dX dY $$
now multiply and divide by the source density $p_s(X,Y)$:
$$ = \int_{(X,Y)} \ell(X,Y,h) \frac{p_t(X,Y)} {p_s(X,Y)}\, p_s(X,Y) \thinspace dX dY$$
$$ = \mathbb{E}_{(X,Y) \sim p_s(X,Y)} \Bigl[\ell(X,Y,h)\, \frac{p_t(X,Y)} {p_s(X,Y)} \Bigr]$$
which we will denote as
$$ = \mathbb{E}_{(X,Y) \sim p_s(X,Y)} [ \beta(X,Y)\, \ell(X,Y,h) ]$$
where
- $\beta(X,Y) = p_t(X,Y) / p_s(X,Y)$
so the expected risk in the target domain equals the expected loss in the source domain weighted by $\beta(X,Y)$, a density ratio that measures how the two distributions differ.
so, knowing $\beta$, we can estimate the target-domain risk using only source-domain samples,
i.e. we approximate $R^T(h)$ by the weighted empirical risk
$$ \hat R^T(h) = \frac{1}{n_S} \sum_{i=1}^{n_S} \beta(x_i^S, y_i^S)\, \ell(x_i^S, y_i^S, h)$$
by minimising this risk over $h$, we optimise for the target domain while training only in the source domain.
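as a quick illustration of how such weights are used (a sketch, not from the lecture): scikit-learn estimators accept per-sample weights, so minimising the weighted empirical risk is just a weighted fit; the data and the weight vector here are placeholders, and the weights would come from the estimators derived below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# placeholder source data and importance weights beta(x_i^S, y_i^S)
rng = np.random.default_rng(0)
X_source = rng.normal(size=(200, 5))       # source inputs
y_source = rng.integers(0, 2, size=200)    # source labels
beta = np.ones(200)                        # would be estimated, e.g. by kernel mean matching below

# minimising the weighted empirical source risk = fitting with per-sample weights
clf = LogisticRegression()
clf.fit(X_source, y_source, sample_weight=beta)

# the resulting classifier is then evaluated / deployed on the target domain
X_target = rng.normal(size=(100, 5))
target_predictions = clf.predict(X_target)
```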
product rule of joint probability: $p(X,Y) = p(Y|X)\,p(X) = p(X|Y)\,p(Y)$
in one model, assume the conditional distribution of labels given inputs is shared across domains, $p_t(Y|X) = p_s(Y|X)$ (covariate shift);
we then have
$$\beta(X,Y) = \frac{p_t(Y|X)\,p_t(X)}{p_s(Y|X)\,p_s(X)} = \frac{p_t(X)}{p_s(X)} = \beta(X)$$
in the other model, assume instead that the class-conditional distribution of inputs is shared, $p_t(X|Y) = p_s(X|Y)$ (target shift);
we then have
$$\beta(X,Y) = \frac{p_t(X|Y)\,p_t(Y)}{p_s(X|Y)\,p_s(Y)} = \frac{p_t(Y)}{p_s(Y)} = \beta(Y)$$
we have from the above that, under the target shift assumption, $p_t(X) = \int p_s(X|Y)\,\beta(Y)\,p_s(Y)\,dY$; taking kernel mean embeddings (the map $\mu$ defined further below) gives
$$\mu(p_t(X)) = \mathbb{E}_{Y \sim p_s(Y)} \bigl[ \beta(Y)\, \mu( p_s (X|Y)) \bigr]$$
We minimise with respect to $\beta(Y)$:
$$\min_{\beta} || \space \mu(p_t(X))
- \mathbb{E}_{Y \sim p_s(Y)} \bigl[ \mu( p_s (X|Y)) \beta(Y) \bigr] \space ||^2$$ empirically written as
$$\min_{\beta} \Biggl| \Biggl| \space \frac{1}{n_T} \sum_{i=1}^{n_T} \phi(x_i^T)
- \frac{1}{n_S}\sum_{i=1}^{n_S}
\beta(y_i^S) \hat\mu (p_s(X|y_i^S))
\space \Biggr|\Biggr|^2 $$
subject to
$$\beta(y_i^S) \ge 0 , \space \frac{1}{n_S} \sum_{i=1}^{n_S} \beta(y_i^S) = 1 $$
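a small numerical sketch of this label-shift estimator (my own, not from the lecture), taking $\phi(x) = x$ for simplicity so that $\hat\mu$ is just an ordinary mean, and solving the constrained problem with scipy's SLSQP; all data and names here are made up.

```python
import numpy as np
from scipy.optimize import minimize

# toy data with label shift: class proportions differ between domains,
# but p(X|Y) is the same (class c has mean c in every coordinate)
rng = np.random.default_rng(0)
y_s = rng.integers(0, 3, size=300)                    # source labels, roughly uniform
X_s = rng.normal(size=(300, 4)) + y_s[:, None]        # source inputs
y_t = rng.choice(3, size=200, p=[0.6, 0.3, 0.1])      # target labels (unobserved in practice)
X_t = rng.normal(size=(200, 4)) + y_t[:, None]        # target inputs (the only target data we use)

classes = np.unique(y_s)
pi = np.array([(y_s == c).mean() for c in classes])             # source class proportions
mu_c = np.stack([X_s[y_s == c].mean(axis=0) for c in classes])  # estimates of mu(p_s(X|y)), with phi(x) = x
mu_t = X_t.mean(axis=0)                                         # estimate of mu(p_t(X))

def objective(beta):
    # || mu(p_t(X)) - (1/n_S) sum_i beta(y_i^S) mu(p_s(X|y_i^S)) ||^2
    return np.sum((mu_t - (pi * beta) @ mu_c) ** 2)

constraints = [{"type": "eq", "fun": lambda b: pi @ b - 1.0}]  # (1/n_S) sum_i beta(y_i^S) = 1
bounds = [(0.0, None)] * len(classes)                          # beta(y) >= 0

res = minimize(objective, x0=np.ones(len(classes)), bounds=bounds,
               constraints=constraints, method="SLSQP")
beta_per_class = res.x                # estimated p_t(y) / p_s(y) for each class
sample_weights = beta_per_class[y_s]  # weights beta(y_i^S) for the weighted source risk
```

with $\phi(x) = x$ this only matches first moments; a richer feature map (kernel), as in the sketch at the end of these notes, matches the distributions more fully.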
note that a domain in machine learning can be regarded, more generally, as a joint distribution $p(X, Y)$ over inputs $X$ and outputs $Y$:
- if $Y \in \{1, 2, \dots , C\}$ we have a classification task
- if $Y \in \mathbb{R}$ we have a regression task
We have a source domain with joint distribution $p_s(X,Y)$ and a target domain with joint distribution $p_t(X,Y)$.
Under what conditions on these two distributions can we directly use a classifier trained on the source domain in the target domain?
- in general, we should reduce the difference between these distributions
consider a function
$$ \phi : X \to \mathcal{H} $$
where $\mathcal{H}$ is a Hilbert space of features (typically a reproducing kernel Hilbert space).
let
$$ \mu (p(X)) = \mathbb{E}_{X \sim p(X)} [ \phi (X)] $$
where $\mu(p(X))$ is called the (kernel) mean embedding of the distribution $p(X)$ in $\mathcal{H}$.
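given samples $x_1, \dots, x_n \sim p(X)$, this expectation is approximated by the empirical mean, which is what the empirical objectives in these notes use:
$$ \hat\mu (p(X)) = \frac{1}{n} \sum_{i=1}^{n} \phi (x_i) $$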
Now if we consider
$$ \mu (p_t(X)) = \mathbb{E}_{X \sim p_s(X)} [ \beta(X)\, \phi (X)] $$
where
- $\beta(X) \ge 0$
- $\mathbb{E}_{X \sim p_s(X)} [ \beta(X)] = 1$

then we have
- $\beta(X) = p_t(X) / p_s(X)$, because we can write $\mu(\beta(X)\,p_s(X)) = \mu(p_t(X))$; indeed, substituting this ratio,
$$ \mathbb{E}_{X \sim p_s(X)} \Bigl[ \frac{p_t(X)}{p_s(X)}\, \phi(X) \Bigr] = \int \phi(X)\, p_t(X)\, dX = \mu(p_t(X)) $$

now we would like to learn $\beta$.
$$\min_{\beta} || \space \mu(p_t(X))
- \mathbb{E}_{X \sim p_s(X)}
\bigl[ \beta(X) \phi(X)
\bigr] \space ||^2 $$
subject to
$$\beta(X) \ge 0, \space \mathbb{E}_{X \sim p_s(X)} [\beta(X)] = 1 $$
given samples from each domain, we then use the empirical mean to approximate
$\mu$ and $\mathbb{E}_{X \sim p_s(X)}$ above. $$\min_{\beta} \Biggl| \Biggl| \space \frac{1}{n_T} \sum_{i=1}^{n_T} \phi(x_i^T) - \frac{1}{n_S}\sum_{i=1}^{n_S}
\beta(x_i^S) \phi(x_i^S)
\space \Biggr|\Biggr|^2 $$
subject to
$$\beta(x_i^S) \ge 0 , \space \frac{1}{n_S} \sum_{i=1}^{n_S} \beta(x_i^S) = 1$$
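to make this empirical objective concrete, here is a small kernel mean matching sketch (my own, not from the lecture): the squared norm expands into kernel evaluations $\langle\phi(x), \phi(x')\rangle = k(x, x')$, giving a quadratic problem in $\beta$; the RBF kernel, scipy's SLSQP solver, and the data are all illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

# hypothetical source inputs and (unlabelled) target inputs
rng = np.random.default_rng(0)
X_s = rng.normal(size=(150, 3))            # source samples x_i^S
X_t = rng.normal(loc=0.7, size=(100, 3))   # target samples x_i^T

n_s, n_t = len(X_s), len(X_t)
K_ss = rbf_kernel(X_s, X_s)   # <phi(x_i^S), phi(x_j^S)>
K_st = rbf_kernel(X_s, X_t)   # <phi(x_i^S), phi(x_j^T)>

def objective(beta):
    # || (1/n_S) sum_i beta_i phi(x_i^S) - (1/n_T) sum_j phi(x_j^T) ||^2, up to an additive constant
    return (beta @ K_ss @ beta) / n_s**2 - 2.0 * (beta @ K_st.sum(axis=1)) / (n_s * n_t)

constraints = [{"type": "eq", "fun": lambda b: b.mean() - 1.0}]  # (1/n_S) sum_i beta_i = 1
bounds = [(0.0, None)] * n_s                                     # beta_i >= 0

res = minimize(objective, x0=np.ones(n_s), bounds=bounds,
               constraints=constraints, method="SLSQP")
beta = res.x   # importance weights for the source samples
```

the resulting weights can then be plugged in as `sample_weight` in the weighted fit sketched near the top of these notes.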