# Machine Learning —Expectation-Maximization Algorithm (EM)

The chicken and egg problem is a major headache for many entrepreneurs. Many machine learning ML problems deal with a similar dilemma. If we know A, we find B. Or, if we know B, we find A. Either way solves the ML problem. Unfortunately, we don’t know A or B in reality. But this kind of chicken and egg problem is much welcoming in ML. It can be solved by one powerful ML algorithm called Expectation-Maximization Algorithm (EM).

Let’s illustrate it with a clustering example, called Gaussian Mixture Model. We want to find an optimal way to group 100 data points (*x₁*,* x₂*, …, *x₁₀₀*) into two Gaussian clusters. We will use a Gaussian distribution (with parameter *μ, σ²*) to model the likelihood that a data point *xᵢ* belonging to a specific cluster.

Why do I call it the chicken and egg problem? If we know the mean and the variance of both clusters, we know which cluster each point belongs to. If we know which cluster each point belongs to, we can calculate the Gaussian parameters for both clusters.

But let’s take a broader view of the problem. Let’s assume we want to discover the latent variables *z* for *x*. If we know how to model and solve *p*(*x|z*), we can apply MLE to solve *z*.

And you don’t need the EM algorithm.

If this is not the case, we can optimize the distribution of *z* by maximizing the log-likelihood below. However, this is not simple. The equation will be in the form of the log of the sum. As discussed before, its gradient is usually intractable.

There is another problem. In many ML problems, we can model *p*(*x|z*) as an exponential family of distribution which has a concave log-likelihoods. But *p*(*x*) is a weighted sum of *p*(*x|z*), it is no longer concave or convex. So it is not easy to optimize.

As mentioned before, many ML problems are much easier to model and to be solved by introducing another latent variable *θ₂*.

Assuming that if we have *x* and *θ₁ *(a.k.a. *z*) fixed, we know how to model the distribution for *θ₂ *easily, i.e. *p*(*θ₂*)*. *Then, we can introduce the expected log-likelihood as:

To find *θ₁*, we can simply find the optimal *θ₁ *in maximizing the expected log-likelihood.

This equation is in the form of the sum of the log. It is concave. It is not hard to optimize. This turns into our easy chicken and egg problem.

Intuitively, many ML problems can be solved in two alternating steps with one latent variable fixed at a time. First, we start with an initial guess. We continue our expectation formularization and maximization iteratively, the solution will converge. That is why we call the method Expectation-Maximization.

# Gaussian Mixture Model GMM

Let’s detail the solution more with the GMM. We are interested in finding the Gaussian parameters *θ₁ *for the two clusters.

To make an inference, we predict the cluster assignment of a new data point. In this prediction, we need *θ₁ *only*. *So how do we find the optimal *θ₁ *from the training data*?*

We start with a random guess for the clustering model *θ₁ *(*μa, σa², μb, σb²*). To solve the problem, we introduce the cluster assignments *P*(*θ₂*)* *for each sample data point*. *For each cluster, we compute the chance that *xᵢ *belongs to that cluster as:

For example, the data point *xᵢ=3* may have a 0.2 chance to be in cluster *A* and 0.8 chance to be in cluster *B*. Then based on *P*(*θ₂*), we optimize the expected log-likelihood of the observation *x* w.r.t *θ₁ *(cluster parameters). However, *θ₂ *depends* *on* θ₁ *and vice versa. We are working with multiple moving parts in a circular loop.

So, we fix *θ₁ *and derive *θ₂. *Then we optimize *θ₁ *with* θ₂ *fixed. We repeat the iterations until *θ₁ *converge*.* In each iteration, our objective will improve. Assuming a limited precision, we are dealing with a finite space for *θ₁ *and* θ₂. *Therefore, there is a finite step for *θ₁ *and *θ₂ *to improve and our iteration should reach a local optimal.

In the EM-algorithm, the E-step fix the Gaussian models *θ₁ *(*μa, σa², μb, σb²*) and compute the probabilities *P*(*θ₂*).

In the EM algorithm, we assume we know how to model *p*(*θ₂ |x, θ₁*) easily. In the E-step, we apply a Gaussian model where *θ₂ *∈ {*a, b*} and we compute *p*(*aᵢ*) and *p*(*bᵢ*) for each data point *i *(for simplicity, we write *p*(*aᵢ*) as *aᵢ* below).

In the M-step, we want to maximize *p*(*x* | *θ₁*) w.r.t. *θ₁. *This is done by marginalized over *θ₂. *It turns out this can be done pretty straight forward in GMM. (more details on GMM can be found here.)

**Mathematical model**

(Credit: the equations in this section are originated from here.)

Our GMM example is a special case for the EM algorithm. Let’s develop the math and generalize the concept a little bit more. Given *xᵢ* is i.i.d., data is drawn from the same distribution *Ɲ* and independent of each other.

Therefore, maximizing the likelihood of the observation is equivalent to maximizing the sum of log-likelihood of each observation.

Expectation-maximization (EM) algorithm is a general class of algorithm that composed of two sets of parameters *θ₁, *and* θ₂. θ₂ *are some un-observed variables, hidden latent factors or missing data. Often, we don’t really care about *θ₂ *during inference. But if we try to solve the problem, we may find it much easier to break it into two steps and introduce *θ₂ *as* *a latent variable. First, we derive *θ₂ *from the observation and *θ₁*. Then we optimize *θ₁ *with* θ₂ *fixed. We continue the iterations until the solution converges.

In the M-step, we maximize the log-likelihood of the observations w.r.t. *θ₁*. The log-likelihood ln *p*(*x*|*θ₁*) can be decomposed as the following with the second term being the KL-divergence (proof).

where *q*(*θ₂*) can be any distribution.

Now, let’s make the R.H.S. as simple as possible so we can optimize it easily*. *First, let’s pick a choice for *q*(*θ₂*). A natural one will be *p*(*θ₂|x, θ₁*) — the probability of *θ₂ *given the observation and our model *θ₁*. By setting *q*(*θ₂*) = *p*(*θ₂|x, θ₁*), we get a nice bonus. The KL-divergence term becomes zero. Now our log-likelihood can be simplified to

Since the second term does not change w.r.t. *θ₁, *we can optimize the R.H.S. term below.

This is the expected log-likelihood over *θ₂.*

Let’s recap. In the E-step (expectation estimation step), we compute *q*(*θ₂*) and re-establish the equation for the log-likelihood ln *p*(*x |θ₁*).

In the M-step (maximization step), we optimize this expected log-likelihood w.r.t. *θ₁.*

Here is the EM-algorithm:

For the solution to work, we assume we know how to model *p*(*θ₂ | x, θ₁*) and *p*(*x, θ₂ | θ₁*). The success of the EM algorithm subjects to how simple are they and how easy to optimize the later one. In GMM, this can be done easily.

**Narrative of the EM algorithm**

In some ML problems, it is intractable to solve the problem with our perceived model parameter *θ₁*. Instead, we solve the problems in alternating steps by adding latent parameters *θ₂*. By breaking down the problem, the equations involve *θ₁ *→ *θ₂ *→ *θ₁. *It can be solved much simpler. Our objective is solving either the L.H.S. or the R.H.S. below.

If we can model *p*(*x | θ₁, θ₂*) with a simple model that can be optimized easily, we don’t need the EM algorithm. Just use MLE. But as shown below, it is not the case for GMM. The log of the sum is intractable but the sum of the log is tractable.

For the GMM problem, the joint distribution *p*(*x, θ₂ | θ₁*) in the EM algorithm is modeled as a Gaussian. Its log is simple. *q*(*θ₂*) is set to *p*(*θ₂ | x, θ₁*) with *θ₁ *fixed. Optimizing this formularization is easy. On the contrary, the equations for *p*(*x | θ₁, θ₂*) is very complex. Therefore, we choose the EM algorithm to solve GMM.

So if the problem can be modeled easier with its subcomponents (the joint distribution) and we know how to model *p*(*θ₂ | x, θ₁*) easily, EM algorithm will have an edge over MLE (otherwise, no).

Another example in demonstrating this idea with the PCA can be found here.

In MLE or EM, *θ₁ *and* θ₂ *may depend on each other. Therefore, we optimize *θ₁ *or *θ₂ *one at a time with the other fixed. Iterating between alternating steps, the solution will converge.

**Minorize-Maximization MM algorithm**

The EM algorithm can be treated as a special case for the MM algorithm. With Jensen’s inequality, we can establish a lower bound for the log-likelihood as below for any distribution *q*.

The Jensen’s inequality above for the log function is demonstrated with *k* = 2 and *q*(*θ₂_k*) as *αᵢ* below*.*

Without too much elaboration, let’s understand the MM algorithm graphically. As mentioned before, the log-likelihood loss function *L *can be in some shape that is hard to optimize. We will use the MM method to optimize it instead.

𝓛 is the lower bound for the log-likelihood loss function *L *that we want to optimize. At iteration *n*, we compute *q*(*θ₂*) = *p*(*θ₂ | x, θ₁ⁿ*). 𝓛 will be the same as *L* at *θ₁ⁿ *(the lower bound is tight)*. *At the same time, 𝓛 with this *q*(*θ₂*) establishes a lower bound for *L*. 𝓛 is in the form of the sum of the log and it is easier to optimize. Therefore, we optimize 𝓛 to find a new optimal for *θ₁ *(the green dot)*. *Then, we can use the new optimal *θ₁ *to compute *q*(*θ₂*) and establish a new lower bound 𝓛.

As shown below, as we keep iterating, we can at least reach a local optimal for *L*. Intuitively, for a random guess on *θ, *we establish a lower bound function 𝓛. We find the optimal point for 𝓛 and use it as the next guess. We continue finding a lower bound function and optimize it as a way to optimize a function *f*. The MM algorithm takes advantage of the idea that *f* may be very hard to optimize but we can find lower bound functions that are much easier.

A detail mathematical proof that each iteration improves the likelihood can be found here.

The EM algorithm becomes

**Missing data**

Previously, we introduce latent variable *θ₂ *to complement *θ₁ *to solve our optimization problem. The EM algorithm can be used to estimate missing data in the training dataset also. For example, the training data may be collected from user surveys. But users may answer a subset of questions only and there will be different missing data in each survey. In this situation, *θ₂ *will be the missing data that we want to model and *θ₁ *is the answers provided in the surveys. Another example is removing noise in an image which we consider the noise as the missing data. Here is how we formulate the log-likelihood using our EM equations.

We will not elaborate on the solution further but the steps to find the solution can be found here.

**Terms**

Here is the recap of some terminology.

# Variational inference (optional)

At one point, we should suspect variational inference and EM-algorithm are closely related as they both work with the log-likelihood and minimize the KL-divergence.

In the variational inference, we approximate the posterior *p*(*x, z *;* θ* )with a tractable distribution *q*. We maximize the evidence lower bound ELBO below to find the optimal *q**:

ELBO fulfills:

i.e. We maximize the log-likelihood if *q* equals *p*(*z|x; θ*).

Therefore, the EM algorithm can be viewed as optimizing the ELBO over *θ₂* at the E-step and over *θ₁* at the M-step. By setting *q* to *p*(*z|x; θ*), we maximize the ELBO.

Previously, we calculate *p*(*θ₂|x; θ₁*) explicitly ourself. In variational inference, we approximate it with tractable distributions. In fact, EM can adapt the same strategy in the E-step to approximate *p*(*θ₂|x; θ₁*) with tractable distributions. Of course, the complexity will be significantly increased.