Expectation-Maximization Algorithm
This lecture note introduces the general form of the expectation-maximization (EM) algorithm.
Let’s start with two mathematical concepts: Jensen’s inequality and Kullback-Leibler divergence.
Jensen’s Inequality
Jensen’s inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function.
The definition of a concave function is shown above. A more intuitive version: for any two points on the curve, if the blue point is higher than the red point, the curve is concave.
We can generalize it to the multiple points version:
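In symbols, for a concave function $f$, points $x_1, \dots, x_n$, and weights $\alpha_i \ge 0$ with $\sum_i \alpha_i = 1$, the multi-point version reads

$$ f\!\left(\sum_{i=1}^{n} \alpha_i x_i\right) \;\ge\; \sum_{i=1}^{n} \alpha_i \, f(x_i), $$

and the inequality is reversed for a convex function. Treating the weights as probabilities gives the expectation form $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$, which is the version used later to derive the lower bound.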
Kullback-Leibler Divergence
The Kullback-Leibler divergence, also known as the KL divergence, is a measure of the difference between two probability distributions. Let’s see an example first.
Given two Gaussian (normal) distributions, we might be tempted to judge how different they are just by comparing their parameters. In the above figure, both pairs of curves differ only in the mean parameter (within each pair the variances are equal, 1 for the first pair and 100 for the second). Yet, as we can see, the two KL divergences are not the same. So bear in mind that, rather than comparing parameters, we need a rigorous measure of the difference between distributions. That is the KL divergence.
Why do I say “difference” instead of “distance”? Because the KL divergence is NOT symmetric: the KL divergence from distribution P to Q generally differs from the KL divergence from Q to P. Let’s see the formula for calculating the KL divergence.
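For reference, the standard definition for discrete distributions is

$$ D_{KL}(P \,\|\, Q) \;=\; \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, $$

with the sum replaced by an integral over densities in the continuous case.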
It’s easy to see that this divergence can be infinite: if there is an x where Q(x) = 0 while P(x) > 0, the divergence blows up. This means that to measure the difference of P from Q, we need to make sure that the support of P is contained in the support of Q. (The support of a distribution is the set of x values where it is positive.)
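A tiny sketch in Python illustrates both properties on discrete distributions (the function name and the example probabilities here are just illustrative):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # support of P not contained in support of Q
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(kl_divergence(p, q))  # not equal to...
print(kl_divergence(q, p))  # ...the reverse direction: KL is asymmetric
print(kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf: Q(x)=0 where P(x)>0
```

Note that terms with P(x) = 0 contribute nothing (by the convention 0·log 0 = 0), which is why the loop skips them.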
We will stop here for the KL divergence. This blog post gives an interesting perspective on why “all supervised learning and reinforcement learning problems are simply minimizing the KL divergence objective”.
Latent Variable Models
Latent variable models assume that a set of unobservable variables underlies the observed data. These variables impose constraints on the intermediate stages of the prediction process, which usually reduces computation and model complexity.
We assume that each data point belongs to one of the latent components. Intuitively, we calculate the probability of a data point belonging to a particular component given the parameters, and then marginalize out the latent variables.
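With a discrete latent variable $z$, the marginalization described above is

$$ p(x \mid \Theta) \;=\; \sum_{z} p(x \mid z, \Theta)\, p(z \mid \Theta). $$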
So, given a latent variable model, we want to maximize the marginal likelihood of our data X given the parameters Theta. In the above figure, we assume the parameter is one-dimensional so that it is easier to draw the log-likelihood curve (in blue).
In the second likelihood figure, you find a red curve that represents a lower bound. We call it a lower bound because, for every Theta, the blue curve is guaranteed to lie above the red curve. So if we find a lower bound that is easy to formulate (like one based on a normal distribution), we can optimize that lower bound to get a better Theta.
But the quality of the result depends on the choice of the lower bound. By introducing a set of variational parameters q, we can optimize over a whole family of lower bounds L(Theta, q) instead of a single L(Theta). This is called the variational lower bound.
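This is where the KL divergence comes back: for any distribution $q$ over the latent variables $Z$, the log-likelihood decomposes as

$$ \log p(X \mid \Theta) \;=\; \mathcal{L}(\Theta, q) + D_{KL}\big(q(Z) \,\|\, p(Z \mid X, \Theta)\big), \qquad \mathcal{L}(\Theta, q) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \Theta)}{q(Z)}. $$

Since the KL term is non-negative, $\mathcal{L}(\Theta, q)$ is indeed a lower bound on the log-likelihood for every choice of $q$.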
Expectation-Maximization Algorithm
Let’s reformulate the problem. Given a dataset, we want to find the optimal parameters by maximizing the variational lower bound, repeatedly closing the GAP between the log-likelihood curve and the lower bound.
The EM algorithm has two steps: the E-step finds the current best q, and the M-step finds the current best Theta. We repeat the two steps iteratively, as shown below.
Intuitively, in the E-step we move the lower bound curve upward until it touches the blue curve (by the definition of a lower bound, it must not exceed the blue curve). In the M-step, we maximize the current lower bound curve to obtain the new Theta.
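To make the two steps concrete, here is a minimal sketch of EM for a one-dimensional mixture of two Gaussians (the synthetic data and all variable names are illustrative, not from the lecture):

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm(data, n_iter=50):
    # Initialize mixing weight, means, and variances.
    pi, mu1, mu2, var1, var2 = 0.5, min(data), max(data), 1.0, 1.0
    for _ in range(n_iter):
        # E-step: q(z) = posterior responsibility of component 1 for each point.
        resp = []
        for x in data:
            a = pi * normal_pdf(x, mu1, var1)
            b = (1 - pi) * normal_pdf(x, mu2, var2)
            resp.append(a / (a + b))
        # M-step: re-estimate the parameters from the responsibilities.
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1 + 1e-6
        var2 = sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2 + 1e-6
        pi = n1 / len(data)
    return pi, mu1, mu2, var1, var2

random.seed(0)
data = [random.gauss(-3, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]
pi, mu1, mu2, var1, var2 = em_gmm(data)
print(pi, mu1, mu2)  # should be close to the true values 0.5, -3, 3
```

The E-step here computes the exact posterior over the latent component, which is the q that makes the lower bound touch the log-likelihood; the M-step then maximizes the bound in closed form.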
The EM algorithm has many extensions with respect to how we choose q and the next Theta, but overall this is the general form of the EM algorithm, which allows you to train any latent variable model.
Here’s a small test: calculate the KL divergence between two Gaussian distributions with the same mean.
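If you want to check your pen-and-paper answer numerically, here is a rough sketch that approximates the divergence with a Riemann sum (the mean 0 and the variances 1 and 4 are just example values):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def kl_gaussians_numeric(mu, var_p, var_q, lo=-50.0, hi=50.0, n=200000):
    """Approximate D_KL(P || Q) for P = N(mu, var_p), Q = N(mu, var_q) by a midpoint Riemann sum."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, mu, var_p)
        q = normal_pdf(x, mu, var_q)
        if p > 0:  # skip negligible tail points that underflow to 0
            total += p * math.log(p / q) * dx
    return total

print(kl_gaussians_numeric(0.0, 1.0, 4.0))
```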
Did you get it?:)