Expectation-Maximization Algorithm

Jesse Jing · Towards NeSy · Dec 21, 2021

This is a lecture note from a course, introducing the general form of the expectation-maximization (EM) algorithm.


Let’s start with two mathematical concepts: Jensen’s inequality and Kullback-Leibler divergence.

Jensen’s Inequality

Jensen’s inequality is named after the Danish mathematician Johan Jensen; it relates the value of a convex function of an integral to the integral of the convex function.

Concave function definition

The definition of a concave function is shown above. A more intuitive version: for any two points on the curve, the point on the curve (blue) lies above the corresponding point on the chord connecting them (red); if the blue point is always higher than the red point, the curve is concave.
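Since the figure is an image, the standard two-point definition can be written out:

```
f(\lambda a + (1 - \lambda) b) \;\ge\; \lambda f(a) + (1 - \lambda) f(b),
\qquad \forall a, b, \; \lambda \in [0, 1]
```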

We can generalize it to the multiple points version:
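With weights that sum to one, the multi-point version reads as follows; interpreting the weights as probabilities gives the expectation form that the lower bound derivation uses later:

```
f\left(\sum_i \alpha_i x_i\right) \;\ge\; \sum_i \alpha_i f(x_i),
\qquad \alpha_i \ge 0, \; \sum_i \alpha_i = 1
```

Equivalently, for a concave function f and a random variable X: f(E[X]) ≥ E[f(X)].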

Kullback-Leibler Divergence

The Kullback-Leibler divergence, also known as KL divergence, is a measure of the difference between two probability distributions. Let’s see an example first.

https://www.coursera.org/learn/bayesian-methods-in-machine-learning/lecture/E8CFE/jensens-inequality-kullback-leibler-divergence

Given two Gaussian distributions (normal distributions), we usually have an informal, empirical sense of how different they are. In the above figure, both pairs of curves differ only in the mean parameter; within each pair the variances are equal, 1 for the first pair and 100 for the second. But as we can see, the two KL divergences are not the same. The takeaway: rather than comparing parameters directly, we need a rigorous measure of the difference between distributions. And that is KL divergence.

Why do I say “difference” instead of “distance”? Because KL divergence is NOT symmetric: the KL divergence from distribution P to Q differs from that from Q to P. Let’s see the formula for calculating KL divergence.
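The formula shown in the course figure is the standard definition; for distributions P and Q with densities p and q:

```
D_{\mathrm{KL}}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx
\qquad \text{(a sum over } x \text{ in the discrete case)}
```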

It’s obvious that this divergence can be infinite: if there exists an x where Q(x) = 0 while P(x) > 0, the divergence blows up. It means that for the divergence from P to Q to be finite, we need the support of P to be contained in the support of Q. (The support of a distribution is the set of x values where it assigns nonzero probability.)
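As a quick sanity check, here is a minimal sketch (my own illustration, not from the course) of KL divergence for discrete distributions; it demonstrates both the asymmetry and the infinite case:

```python
import numpy as np

def kl(p, q):
    # Conventions: 0 * log(0 / q) = 0; if P puts mass where Q has none,
    # the divergence is infinite.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.5, 0.0]
q = [0.8, 0.1, 0.1]
print(kl(p, q))  # KL(P||Q): a finite value
print(kl(q, p))  # KL(Q||P) = inf, since Q puts mass where P does not
```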

We will stop at this point for KL divergence. This blog post is an interesting read for understanding why “all supervised learning and reinforcement learning problems are simply minimizing the KL divergence objective”.

Latent Variable Models

Latent variable models assume that a set of unobservable variables underlies the observed data. These variables impose constraints on the intermediate stages of the prediction process, which usually reduces computation and complexity.

We assume that an arbitrary data point belongs to one of the latent components. Intuitively, we calculate the probability of a data point belonging to a particular component given parameters and then marginalize out the latent variables.
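Concretely, in standard latent-variable notation (t indexes the component, Theta the parameters), the marginal likelihood of one data point is:

```
p(x \mid \theta) = \sum_{c} p(x \mid t = c, \theta) \, p(t = c \mid \theta)
```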

How does Jensen’s Inequality help find a better lower bound?

So, given a latent variable model, we want to maximize the marginal likelihood of our data X given the parameters Theta. In the above figure, we assume the parameter is one-dimensional so that it’s easier to draw the log-likelihood curve (in blue).

In the second likelihood figure, you find a red curve representing a lower bound (in red). We call it a lower bound because, for every Theta, we know for sure that the blue curve is higher than the red curve. So if we find a lower bound that is easy to formulate (like one based on a normal distribution), we can optimize that lower bound to get a better Theta.

But the performance depends on the choice of the lower bound. By introducing a variational distribution q, we can optimize over a whole family of lower bounds L(Theta, q) instead of a single L(Theta). This is called the variational lower bound.
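The connection to the two concepts above can be made explicit (a standard derivation, written in the notation of this post): applying Jensen’s inequality to the log of the marginal likelihood gives the bound, and the gap turns out to be exactly a KL divergence:

```
\log p(X \mid \theta)
\;\ge\; \mathcal{L}(\theta, q)
= \sum_{T} q(T) \log \frac{p(X, T \mid \theta)}{q(T)}

\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= D_{\mathrm{KL}}\big(q(T) \,\|\, p(T \mid X, \theta)\big)
```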

Expectation-Maximization Algorithm

Let’s reformulate the problem. Given a dataset, we want to find the optimal parameters by alternately minimizing the gap between the log-likelihood curve and the variational lower bound (over q) and maximizing the lower bound itself (over Theta).

The EM algorithm has two steps: E-step for finding the current best q and M-step for finding the current best Theta. We do this iteratively as shown below.

E and M steps
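Written out (a standard formulation of the two updates, reconstructing what the figure shows):

```
\text{E-step:}\quad q^{k+1}
= \arg\min_{q} D_{\mathrm{KL}}\big(q(T) \,\|\, p(T \mid X, \theta^{k})\big)
= p(T \mid X, \theta^{k})

\text{M-step:}\quad \theta^{k+1}
= \arg\max_{\theta} \; \mathbb{E}_{q^{k+1}} \log p(X, T \mid \theta)
```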

Intuitively, the E-step moves the lower bound curve upward until it touches the blue curve (it can never exceed it, by the definition of a lower bound). The M-step then maximizes the current lower bound curve to generate the new Theta.

The EM algorithm has many extensions with respect to how q and the next Theta are chosen, but overall this is the general form of the EM algorithm, which lets you train any latent variable model.
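To make the two steps concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture (my own illustration in the notation above; the initialization scheme and component count are assumptions, not from the post):

```python
import numpy as np

def em_gmm(x, n_iter=50):
    # Initialize mixing weights, means, and variances (simple heuristic choices).
    pi = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: responsibilities q(t = c | x_i) = posterior of component c
        # for each point, under the current parameters. Shape: (n_points, 2).
        dens = (1 / np.sqrt(2 * np.pi * var)) * np.exp(
            -(x[:, None] - mu) ** 2 / (2 * var)
        )
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximize the expected complete-data log-likelihood,
        # which gives weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var

# Usage: fit a mixture to data drawn from two Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])
print(em_gmm(x))  # recovered weights, means, variances
```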

Here’s a small test for calculating the KL divergence of two Gaussian distributions with the same mean.

https://www.coursera.org/learn/bayesian-methods-in-machine-learning/home/welcome
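If you want to check your answer: the closed-form KL divergence between two univariate Gaussians is a known result, and with equal means the mean term drops out:

```
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma_1^2) \,\|\, \mathcal{N}(\mu, \sigma_2^2)\big)
= \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2}{2 \sigma_2^2} - \frac{1}{2}
```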

Did you get it? :)

References:

Bayesian Methods for Machine Learning (Coursera): https://www.coursera.org/learn/bayesian-methods-in-machine-learning
