---
layout: post
title: Learning in undirected models
---
Let us now look at parameter learning in undirected graphical models. Unfortunately, as in the case of inference, the higher expressivity of undirected models also makes them significantly more difficult to deal with. Fortunately, maximum likelihood learning in these models can be reduced to repeatedly performing inference, which will allow us to apply all the approximate inference techniques that we have seen earlier.
Let us start with a Markov random field (MRF) of the form
{% math %} p(x_1,..,x_n) = \frac{1}{Z(\varphi)} \prod_{c \in C} \phi_c(x_c; \varphi), {% endmath %} where {% math %} Z(\varphi) = \sum_{x_1,..,x_n}\prod_{c \in C} \phi_c(x_c; \varphi) {% endmath %} is the normalizing constant.
We can reparametrize $$p$$ as follows: {% math %} p(x_1,..,x_n) = \frac{1}{Z(\varphi)} \exp\left( \sum_{c \in C} \log \phi_c(x_c; \varphi) \right) = \frac{1}{Z(\varphi)} \exp\left( \sum_{c \in C} \sum_{x_c'} \mathbb{I}[x_c' = x_c] \log \phi_c(x_c'; \varphi) \right) = \frac{\exp(\theta^T f(x))}{Z(\theta)}, {% endmath %} where $$f(x)$$ is a vector of indicator functions and $$\theta$$ is the set of all the model parameters, as defined by the $$\log \phi_c(x_c'; \varphi)$$.
Note that the normalizing constant $$Z(\varphi)$$ is now written as a function $$Z(\theta)$$ of the new parameters; we will also refer to it as the partition function.
More generally, distributions of this form are members of the exponential family of distributions, about which you can learn more in CS229. Many other distributions belong to the exponential family, including the Bernoulli, multinomial, normal, and Poisson distributions.
Exponential families are widely used in machine learning{% sidenote 1 'We refer the reader to the CS229 course notes for a more thorough presentation of this topic.'%}. Suppose that you have an exponential family distribution of the form{% sidenote 2 'Actually, exponential families have a slightly more general form, but this one will be enough for our purposes.'%} {% math %} p(x; \theta) = \frac{\exp(\theta^T f(x))}{Z(\theta)}. {% endmath %} Here are a few facts about these distributions that you should know.
- Exponential families are log-concave in their natural parameters $$\theta$$. The partition function $$Z(\theta)$$ is also log-convex in $$\theta$$.
- The vector $$f(x)$$ is called the vector of sufficient statistics; these fully describe the distribution $$p$$; e.g., if $$p$$ is Gaussian, $$f(x)$$ contains (simple reparametrizations of) the mean and the variance of $$p$$.
- Exponential families make the fewest unnecessary assumptions about the data distribution. More formally, the distribution maximizing the entropy $$H(p)$$ under the constraint $$\mathbb{E}_p[f(x)] = \alpha$$ (i.e., the sufficient statistics equal some value $$\alpha$$) is in the exponential family.
Exponential families are also very convenient to work with computationally. Their sufficient statistics can summarize arbitrarily many i.i.d. samples from the same distribution, and they admit so-called conjugate priors, which makes them easy to use in variational inference. In short, it definitely pays off to learn more about exponential families!
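As a quick example of this form, the Bernoulli distribution with mean $$\mu$$ is an exponential family: taking $$f(x) = x$$ and the natural parameter $$\theta = \log \frac{\mu}{1-\mu}$$, we can write {% math %} p(x; \theta) = \mu^x (1-\mu)^{1-x} = \frac{\exp(\theta x)}{1 + \exp(\theta)}, {% endmath %} so that $$Z(\theta) = 1 + \exp(\theta)$$.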
Suppose now that we are given a dataset $$D$$, and we want to estimate $$\theta$$ via maximum likelihood. Since we are working with an exponential family, the average log-likelihood has the form {% math %} \frac{1}{|D|} \log p(D; \theta) = \frac{1}{|D|} \sum_{x \in D} \theta^T f(x) - \log Z(\theta). {% endmath %}
The first term is linear in $$\theta$$ and is easy to handle. The second term equals {% math %} \log Z(\theta) = \log \sum_x \exp(\theta^T f(x)). {% endmath %} Unlike the first term, this one does not decompose across $$x$$; it is not only hard to optimize, but even hard to evaluate, as we have seen in previous chapters.
Now consider the gradient of the log-likelihood. Obtaining the gradient of the linear part is obviously very easy. However, the gradient of $$\log Z(\theta)$$ takes a more complicated form: {% math %} \nabla_\theta \log Z(\theta) = \mathbb{E}_{x \sim p} [ f(x) ]. {% endmath %}
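To see where this identity comes from, differentiate the definition of $$\log Z(\theta)$$ directly: {% math %} \nabla_\theta \log Z(\theta) = \frac{\sum_x f(x) \exp(\theta^T f(x))}{Z(\theta)} = \sum_x f(x)\, p(x; \theta) = \mathbb{E}_{x \sim p} [ f(x) ]. {% endmath %}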
Computing the expectation on the right-hand side of the equation requires inference with respect to the model distribution $$p$$, which, as we have seen, is in general an intractable problem.
We can similarly derive an expression for the Hessian of $$\log Z(\theta)$$: {% math %} \nabla^2_\theta \log Z(\theta) = \mathbb{E}_{x \sim p} [ f(x) f(x)^T ] - \mathbb{E}_{x \sim p} [ f(x) ]\, \mathbb{E}_{x \sim p} [ f(x) ]^T = \text{cov}[f(x)]. {% endmath %} Covariance matrices are always positive semi-definite, which is why $$\log Z(\theta)$$ is convex (and hence the log-likelihood is concave).
In summary, even though the log-likelihood objective is concave (so that maximizing it is a convex optimization problem), optimizing it is hard; usually it is non-convexity that makes optimization intractable, but in this case it is the computation of the gradient.
Interestingly, however, maximum-likelihood learning reduces to inference in the sense that we may repeatedly use inference to compute the gradient and determine the model weights using gradient descent.
This observation lets us apply to learning many of the approximate inference methods that we have seen in previous chapters, such as:
- Gibbs sampling from the distribution at each step of gradient descent; we then approximate the gradient using Monte-Carlo (see the sketch after this list).
- Persistent contrastive divergence{%sidenote 3 'Persistent contrastive divergence (PCD) is one of the most popular methods for training Restricted Boltzmann Machines (RBMs), a very important deep learning model that is also an undirected graphical model. Have a look at this great practical tutorial on RBMs.'%}, a variant of Gibbs sampling which re-uses the same Markov chain between iterations. After a step of gradient descent, our model has changed very little: hence we can essentially keep taking samples from the same Gibbs sampler, instead of starting a new one from scratch.
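To make the first approach concrete, here is a minimal sketch of Monte-Carlo maximum-likelihood training for a small binary pairwise MRF. The chain structure, edge-agreement features, toy dataset, and hyperparameters below are illustrative assumptions, not details from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                # number of binary variables (assumed)
edges = [(0, 1), (1, 2), (2, 3)]     # assumed chain-structured MRF
theta = np.zeros(len(edges))         # one weight per edge feature

def features(x):
    # f(x): indicator that the endpoints of each edge take equal values
    return np.array([float(x[i] == x[j]) for i, j in edges])

def gibbs_sample(theta, n_steps=200):
    # Approximate a sample x ~ p(x; theta) with a short Gibbs chain.
    x = rng.integers(0, 2, size=n)
    for _ in range(n_steps):
        for i in range(n):
            x[i] = 0; s0 = theta @ features(x)
            x[i] = 1; s1 = theta @ features(x)
            # p(x_i = 1 | x_{-i}) = exp(s1) / (exp(s0) + exp(s1))
            x[i] = int(rng.random() < 1.0 / (1.0 + np.exp(s0 - s1)))
    return x

# Toy dataset in which neighboring variables tend to agree.
data = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 1]])
data_term = np.mean([features(x) for x in data], axis=0)  # E_data[f(x)]

for step in range(100):
    model_term = features(gibbs_sample(theta))            # ~ E_model[f(x)]
    theta += 0.1 * (data_term - model_term)               # noisy gradient ascent
```

Persistent contrastive divergence would instead keep the chain state $$x$$ across gradient steps rather than re-initializing it on every call to `gibbs_sample`.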
Another popular approach to learning $$p$$ is called pseudo-likelihood. The pseudo-likelihood replaces the likelihood {% math %} \frac{1}{|D|} \log p(D; \theta) = \frac{1}{|D|} \sum_{x \in D} \log p(x; \theta) {% endmath %} with the following approximation: {% math %} \ell_\text{PL}(\theta; D) = \frac{1}{|D|} \sum_{x \in D} \sum_{i} \log p(x_i \mid x_{N(i)}; \theta), {% endmath %} where $$x_i$$ is the $$i$$-th variable in $$x$$ and $$N(i)$$ is the set of neighbors of $$i$$ in the graph of the MRF (i.e., the Markov blanket of $$x_i$$).
Note that each term $$\log p(x_i \mid x_{N(i)}; \theta)$$ involves only one variable $$x_i$$, so its partition function is a sum over the values of a single variable, which is easy to compute.
However, it is not equal to the likelihood. Note that the correct way to expand the likelihood would involve the chain rule, i.e., the terms would be $$\log p(x_i \mid x_{1:i-1}; \theta)$$, where the conditioning is on all the preceding variables, not just the neighbors of $$x_i$$.
Intuitively, the pseudo-likelihood objective assumes that $$x_i$$ depends mainly on its neighbors in the graph, and ignores the dependencies on the other, more distant variables.
More formally, we can show that the pseudo-likelihood objective is concave. Assuming the data is drawn from an MRF with true parameters $$\theta^*$$, we can also show that, as the number of data points grows, the pseudo-likelihood estimate converges to $$\theta^*$$; in other words, pseudo-likelihood is a consistent estimator.
Pseudo-likelihood often works very well in practice, although there are exceptions.
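As an illustration, here is a minimal sketch of the pseudo-log-likelihood for the same kind of toy binary pairwise MRF as above; the model and dataset are again illustrative assumptions. Note how each conditional only requires a local normalization over the two values of a single variable.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]     # assumed chain-structured MRF
n = 4

def features(x):
    # f(x): indicator that the endpoints of each edge take equal values
    return np.array([float(x[i] == x[j]) for i, j in edges])

def pseudo_log_likelihood(theta, data):
    # (1/|D|) sum_x sum_i log p(x_i | x_{N(i)}; theta)
    total = 0.0
    for x in data:
        for i in range(n):
            x0, x1 = x.copy(), x.copy()
            x0[i], x1[i] = 0, 1
            s0, s1 = theta @ features(x0), theta @ features(x1)
            log_z_i = np.logaddexp(s0, s1)   # local partition function
            total += (s1 if x[i] == 1 else s0) - log_z_i
    return total / len(data)

data = np.array([[0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 1]])
print(pseudo_log_likelihood(np.zeros(len(edges)), data))  # -n*log(2) at theta = 0
```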
We will end with an interesting observation about the maximum likelihood estimate $$\hat\theta$$ of the MRF parameters.
Recall that the log-likelihood of an MRF is {% math %} \frac{1}{|D|} \log p(D; \theta) = \frac{1}{|D|} \sum_{x \in D} \theta^T f(x) - \log Z(\theta). {% endmath %}
Taking the gradient, and using our expression for the gradient of the log-partition function, we obtain the expression
{% math %} \frac{1}{|D|} \sum_{x \in D} f(x) - \mathbb{E}_{x \sim p} [ f(x) ] {% endmath %}
Note that this is precisely the difference between the expectations of the sufficient statistics under the empirical (i.e., data) distribution and the model distribution. Let's now look at one component of $$f(x)$$. Recall that we defined $$f$$ to be a vector of indicator functions over clique assignments, so that one entry of $$f(x)$$ equals $$\mathbb{I}[x_c = \bar x_c]$$ for some clique $$c$$ and assignment $$\bar x_c$$. The corresponding component of the gradient is then
{% math %} \frac{1}{|D|} \sum_{x \in D} \mathbb{I}[x_c = \bar x_c] - \mathbb{E}_{x \sim p} [ \mathbb{I}[x_c = \bar x_c] ] {% endmath %}
This gives us an insight into how MRFs are trained. The log-likelihood objective forces the model marginals to match the empirical marginals.
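In other words, setting this gradient component to zero at the maximum likelihood estimate gives {% math %} \frac{1}{|D|} \sum_{x \in D} \mathbb{I}[x_c = \bar x_c] = \mathbb{E}_{x \sim p} [ \mathbb{I}[x_c = \bar x_c] ] = p(x_c = \bar x_c), {% endmath %} i.e., the probability the model assigns to each clique configuration equals that configuration's empirical frequency.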
We refer to the above property as moment matching. This property of maximum-likelihood learning is very general: it occurs whenever we fit any exponential-family distribution $$q$$ by maximum likelihood, which is equivalent to minimizing the inclusive KL divergence $$KL(\hat p \,\|\, q)$$ from the empirical distribution $$\hat p$$.
Finally, let us look at how maximum-likelihood learning extends to conditional random fields (CRFs), the other important type of undirected graphical model that we have seen.
Recall that a CRF is a probability distribution of the form
{% math %}
p(y \mid x) = \frac{1}{Z(x, \varphi)} \prod_{c \in C} \phi_c(y_c, x; \varphi),
{% endmath %}
where
{% math %}
Z(x, \varphi) = \sum_{y_1,..,y_n}\prod_{c \in C} \phi_c(y_c, x; \varphi)
{% endmath %}
is the partition function. The feature functions now depend on $$x$$ in addition to $$y$$; the input $$x$$ is always observed, so we only need to model the conditional distribution of $$y$$ given $$x$$.
We can reparametrize $$p$$ as we did for MRFs: {% math %} p(y \mid x) = \frac{\exp(\theta^T f(x, y))}{Z(x, \theta)}, {% endmath %} where $$f(x, y)$$ is again a vector of indicator functions and $$\theta$$ is a reparametrization of the model parameters.
The log-likelihood for this model given a dataset $$D$$ of pairs $$(x, y)$$ is {% math %} \frac{1}{|D|} \log p(D; \theta) = \frac{1}{|D|} \sum_{x,y \in D} \theta^T f(x, y) - \frac{1}{|D|} \sum_{x \in D} \log Z(x, \theta). {% endmath %} Note that this has almost the same form as the MRF log-likelihood, except that there is now a separate partition function $$\log Z(x, \theta)$$ for each data point $$x, y$$.
The gradient is now {% math %} \frac{1}{|D|} \sum_{x, y \in D} f(x, y) - \frac{1}{|D|} \sum_{x \in D} \mathbb{E}_{y \sim p(y \mid x)} [ f(x, y) ]. {% endmath %} Similarly, the Hessian is going to be the covariance matrix {% math %} \text{cov}_{y \sim p(y \mid x)} [ f(x, y) ]. {% endmath %}
The good news is that this means that the conditional log-likelihood is still a concave function. We can optimize it using gradient ascent as before.
The bad news is that computing the gradient now requires one inference per training data point $$x, y$$, since the expectation $$\mathbb{E}_{y \sim p(y \mid x)}[f(x, y)]$$ depends on the input $$x$$; this makes each step of gradient ascent expensive.
This makes learning CRFs more expensive than learning MRFs. In practice, however, CRFs are much more widely used than MRFs: supervised learning is the more common setting, and discriminative models (like CRFs) often learn a better classifier than their generative counterparts (which model $$p(x, y)$$ rather than $$p(y \mid x)$$).
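To illustrate the per-example inference, here is a minimal sketch of the CRF gradient for a toy sequence model whose label space is small enough to enumerate exactly; the feature choice and data are illustrative assumptions, and in a realistic chain CRF the brute-force enumeration would be replaced by forward-backward message passing.

```python
import numpy as np
from itertools import product

n_labels, seq_len = 2, 3

def f(x, y):
    # Illustrative features: per-position agreement of input and label,
    # plus agreement of adjacent labels.
    emit = [float(x[t] == y[t]) for t in range(seq_len)]
    trans = [float(y[t] == y[t + 1]) for t in range(seq_len - 1)]
    return np.array(emit + trans)

def gradient(theta, data):
    # (1/|D|) sum_{(x,y) in D} ( f(x,y) - E_{y' ~ p(y'|x)}[ f(x,y') ] )
    grad = np.zeros_like(theta)
    for x, y in data:
        ys = list(product(range(n_labels), repeat=seq_len))
        scores = np.array([theta @ f(x, y2) for y2 in ys])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                 # p(y' | x): one inference per example
        model_exp = sum(p * f(x, y2) for p, y2 in zip(probs, ys))
        grad += f(x, y) - model_exp
    return grad / len(data)

data = [(np.array([0, 1, 1]), (0, 1, 1)), (np.array([1, 1, 0]), (1, 1, 0))]
theta = np.zeros(2 * seq_len - 1)  # seq_len emission + seq_len-1 transition weights
print(gradient(theta, data))
```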
To deal with the computational difficulties introduced by the partition function, we may use simpler models in which exact inference is tractable. This was the approach taken in the OCR example introduced in our first discussion of CRFs. More generally, one should try to limit the number of variables or make sure that the model's graph is not too densely connected.
Finally, we would like to add that there exists another popular objective for training CRFs called the max-margin loss, a generalization of the objective used for training SVMs. Models trained with this loss are called structured support vector machines or max-margin networks. This loss is more widely used in practice because it often leads to better generalization, and because it requires only MAP inference to compute the gradient, rather than general (e.g., marginal) inference, which is often more expensive to perform.