Understanding the EM Algorithm by Examples (with code and visualization)

Clarice Wang
8 min read · Jul 30, 2022


Before diving into the Expectation Maximization (EM) algorithm it is important to understand the concept of Maximum Likelihood Estimation (MLE). Given some observed data, MLE finds the parameters of an assumed probability distribution that most likely generates the data.

In other words, MLE finds θ, the set of parameters that gives us the maximum probability of seeing data D.

If the assumed distribution is Gaussian, we use MLE to find the parameters θ = [µ, σ], where µ is the mean and σ is the standard deviation, that best fit the data.

As an example, if we measure the heights of 1000 adults, our data may look like this:

Visually, we may find that the data forms a Gaussian distribution with a mean of ~170 cm and a standard deviation of ~10 cm.
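To make this concrete, here is a minimal sketch that simulates such a dataset with NumPy and Matplotlib (the sample size and the rough values of 170 cm and 10 cm come from the text; the seed and plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Simulate 1000 adult heights from a Gaussian with mean 170 cm and standard deviation 10 cm
np.random.seed(0)
heights = np.random.normal(loc=170, scale=10, size=1000)

plt.hist(heights, bins=30, edgecolor="black")
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.show()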

Mathematically, we can take the derivative of the log-likelihood function with respect to µ and set it to zero; the solution turns out to be the sample mean of the data. The standard deviation can be found similarly by differentiating the log-likelihood function with respect to σ.

Solving for µ using partial differentiation.
Solving for σ using partial differentiation.
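In code, those closed-form solutions are simply the sample mean and the (population) standard deviation of the data; checking them on the simulated heights above:

# MLE estimates for a single Gaussian: the sample mean and the population standard deviation
mu_hat = np.mean(heights)
sigma_hat = np.std(heights)      # ddof=0 (divide by N), which is the MLE
print(mu_hat, sigma_hat)         # roughly 170 and 10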

So where does EM come in?

In our previous example, instead of modeling all the heights with a single Gaussian distribution, we may model the heights of men with one Gaussian distribution and the heights of women with another. But we do not know the gender of the 1000 adults. How do we find the two Gaussian distributions? For this problem, the presence of a latent variable, gender, is why we need the EM algorithm.

The EM algorithm is an iterative method of statistical analysis that employs MLE in the presence of latent variables. It can be broken down into two major steps (Fig. 1): the expectation step and the maximization step.

  1. Initial Guesses: We begin the algorithm by randomly guessing the initial means and standard deviations for both distributions.
  2. E-Step: For each height measurement, we find the probabilities that it is generated by the male and the female distribution. Based on the probabilities we assign the measurements to the male and female groups.
  3. M-Step: We run MLE on the male and female groups to re-estimate the means and standard deviations of the two distributions.
  4. Convergence Test: We repeat the E- and M-steps until the parameters stop changing, at which point we have found the two normal distributions that best fit our height data (a minimal code sketch follows Figure 1).
Figure 1. Expectation-Maximization Algorithm Process
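As a minimal sketch of those four steps for the height example, assuming the simulated heights array from earlier and a hard assignment of each measurement to its more likely group (the soft-assignment version appears in the Gaussian Mixtures example later, and edge cases such as an empty group are ignored):

from scipy.stats import norm

# 1. Random initial guesses for the two groups' means; arbitrary initial standard deviations
mu = np.random.uniform(min(heights), max(heights), size=2)
sigma = np.array([10.0, 10.0])

for _ in range(100):
    # 2. E-step: assign each height to the distribution more likely to have generated it
    p = np.vstack([norm(mu[j], sigma[j]).pdf(heights) for j in range(2)])
    labels = np.argmax(p, axis=0)

    # 3. M-step: re-estimate each group's mean and standard deviation with MLE
    new_mu = np.array([heights[labels == j].mean() for j in range(2)])
    new_sigma = np.array([heights[labels == j].std() for j in range(2)])

    # 4. Convergence test: stop once the parameters no longer change
    if np.allclose(new_mu, mu) and np.allclose(new_sigma, sigma):
        break
    mu, sigma = new_mu, new_sigma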

Convergence depends heavily on the EM algorithm’s initial guesses. Consequently, it is common practice to run the algorithm with a variety of initial guesses and keep the best solution found.

In the rest of this article, we cover 3 examples of the EM algorithm with code and visualization: K-Means, Two Coins, and Gaussian Mixtures.

K-Means

The K-Means algorithm is a widely used data clustering algorithm that implements the EM process: it iteratively determines which of k clusters each of n points belongs to. Here are the steps.

  1. Generate data with k clusters
  2. Set a random initial centroid for each cluster
  3. Iterate through the E-step and M-step until the centroids converge

First, we have to create the dataset of points that we’ll run the algorithm on. The sklearn.datasets library has a convenient function, make_blobs, that does this for us. I created k = 3 clusters of 200 points each, for a final dataset of n = 600 points. We can visualize the clusters using a scatter plot (Fig. 2).
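A sketch of that data-generation step (the cluster count and size come from the text; the random seed and plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 600 points in k = 3 clusters of 200 points each
X, y_true = make_blobs(n_samples=600, centers=3, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y_true)
plt.show()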

Figure 2. Three clusters generated by make_blobs

Given the data and the number of clusters desired, the algorithm first chooses 3 random initial points to represent the centers of each cluster, which are also known as centroids.
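One simple way to do this is to sample 3 of the data points at random (a sketch; the article does not specify the exact initialization, so this is an assumption):

# Pick 3 random data points as the initial centroids
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=3, replace=False)]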

E-step

Then, the algorithm assigns each point to the cluster whose centroid is closest to the point.
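In code, this assignment can be written with a distance matrix (a sketch continuing from the X and centroids arrays above):

# E-step: assign each point to the cluster of its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # shape (600, 3)
labels = np.argmin(distances, axis=1)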

M-step

Next, each cluster’s centroid is redetermined to be the mean of the cluster’s newly assigned points.

Equations to compute the centroid coordinate of a cluster
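A sketch of that update, using the labels computed in the E-step:

# M-step: move each centroid to the mean of the points assigned to it
centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])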

The E-step and M-step repeat until the centroids do not change (Fig. 3), and the program exits.
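Putting the two steps in a loop with that convergence check might look like the following sketch (same assumptions as above; the edge case of a cluster losing all of its points is ignored):

for _ in range(100):
    # E-step: nearest-centroid assignment
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)

    # M-step: recompute the centroids
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(3)])

    # Stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids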

Figure 3. Iterations of the K-Means algorithm employing the EM process

Two Coins

Imagine you have two coins in your pocket, coin A and coin B, each with a different but unknown probability of landing on heads. Every time you take out a coin, you flip it 10 times, record the number of times it lands on heads, and put it back in your pocket. You repeat this process 5 times, each time randomly using either coin A or coin B. At the end, you will have flipped a coin a total of 50 times. Having done that, what are the probabilities of coin A and coin B landing on heads?

This is known as a mixture problem. We are working with data generated from two distributions, and our goal is to determine the parameters of those distributions. Sound familiar? Let’s go through the steps.

  1. Generate data of coin flips
  2. Set random initial θ values (guesses)
  3. Iterate through the E-step and M-step until the θ values converge

I use NumPy’s random function to generate an array of coin flips given the coins’ true probabilities and the number of tosses I want per trial. In this example, I’ll use 0.6 and 0.2 as the true probabilities for coins A and B, respectively, and 10 tosses per trial. Heads are represented by 1s and tails are represented by 0s.
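The roll function used below is not shown in the article; a minimal sketch consistent with how it is called, which randomly picks one of the two coins and flips it num_tosses times, might look like this:

import numpy as np

tA, tB = 0.6, 0.2      # true (unknown to the algorithm) probabilities of heads
num_tosses = 10        # tosses per trial

def roll(tA, tB, num_tosses):
    # Randomly pick coin A or coin B, then flip it num_tosses times (1 = heads, 0 = tails)
    theta = tA if np.random.rand() < 0.5 else tB
    return np.random.binomial(1, theta, size=num_tosses)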

Although the problem statement above has you only repeat the process for 5 trials, in order for our algorithm to converge we need a lot more than 50 data points. Let’s increase the number of trials to 500, meaning we must call the roll function 500 times to completely generate our data.

num_trials = 500
data = [roll(tA, tB, num_tosses) for trial in range(num_trials)]

We again use NumPy’s random function for our initial guessed probabilities, which gives us 0.8 and 0.7 for coins A and B, respectively.
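For reference, a one-line sketch of that initialization (the values 0.8 and 0.7 are simply what the article’s random draw happened to produce):

# Random initial guesses for each coin's probability of heads
theta_A, theta_B = np.random.rand(2)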

E-step

Just like in the K-Means problem, we must partition our data according to our initial guesses (this time by θ values instead of centroid positions).

For each trial, we count how many heads came up and calculate the probability of seeing those 10 tosses from each coin.

Now here’s what makes this problem different from K-Means. In K-Means we assign every point to the cluster of the centroid it is closest to, which would be equivalent to assigning this trial entirely to coin A, since the probability under coin A is greater. However, we can take it a step further by assigning a portion of the trial to coin A and the other portion to coin B. To do so, we calculate the weight on each coin.

This tells us that 0.564(8) = 4.512 out of the 8 heads in this trial are attributed to coin A and the 3.488 heads left are attributed to coin B. We do the same calculations for the partitioning of the tails. Ultimately, the numbers of heads and tails attributed to each coin are accumulated over all trials.
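A sketch of this E-step, assuming the data list, num_tosses, theta_A, and theta_B defined above; each trial’s binomial likelihood under both coins is turned into weights, and the fractional heads and tails are accumulated:

from scipy.stats import binom

heads_A = tails_A = heads_B = tails_B = 0.0

for trial in data:
    heads = trial.sum()
    tails = num_tosses - heads

    # Likelihood of this trial's number of heads under each coin
    like_A = binom.pmf(heads, num_tosses, theta_A)
    like_B = binom.pmf(heads, num_tosses, theta_B)

    # Weights: the portion of the trial attributed to each coin
    w_A = like_A / (like_A + like_B)
    w_B = 1 - w_A

    # Accumulate fractional heads and tails over all trials
    heads_A += w_A * heads
    tails_A += w_A * tails
    heads_B += w_B * heads
    tails_B += w_B * tails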

M-step

We can now use MLE to recalculate the θ values for coins A and B. This is quite straightforward; for example, a coin is most likely to produce 2 heads and 3 tails if its probability of heads is θ = 2/5. So, given the heads and tails attributed to each coin, we take each coin’s proportion of heads. These proportions are our new guesses.
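In code, continuing the sketch above:

# M-step: each coin's new guess is its proportion of attributed heads
theta_A = heads_A / (heads_A + tails_A)
theta_B = heads_B / (heads_B + tails_B)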

The algorithm goes through, in this case, 15 iterations of the E-step and M-step (Fig. 4) and does a good job of converging to our true probabilities of 0.6 and 0.2.

Figure 4. Iterations of the Two Coins algorithm employing the EM process

Gaussian Mixtures

The goal of this example is to fit a Gaussian mixture model to data drawn from n normal distributions. Remember the height problem from earlier? That was a Gaussian mixture problem with n = 2 distributions. While this is also a mixture problem like Two Coins, we are now working with Gaussian distributions, not Bernoulli distributions. Let’s begin.

  1. Generate dataset of 1-dimensional points
  2. Set random initial parameters (means and standard deviations)
  3. Iterate through the E-step and M-step until the parameters converge

I have generated three Gaussian distributions with the following parameters [µ, σ]: [-10, 3], [30, 7], [3, 5]. Each Gaussian has 1000 normally distributed points. Let’s take a look at the distributions by creating a histogram of the 3000-point dataset (Fig. 5).
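A sketch of that data generation (the parameters and counts come from the text; concatenating everything into a single points array is an assumption):

import numpy as np
import matplotlib.pyplot as plt

# Three Gaussians with parameters [mean, std]: [-10, 3], [30, 7], [3, 5], 1000 points each
params = [(-10, 3), (30, 7), (3, 5)]
points = np.concatenate([np.random.normal(mu, sd, size=1000) for mu, sd in params])

plt.hist(points, bins=100)
plt.show()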

Figure 5. Histogram of dataset

Now let’s set our initial guesses. For the means, we generate 3 random numbers within the data range. For the standard deviations, we generate 3 random numbers between 1 and 10.

# Three random means within the data range and three random standard deviations in [1, 10)
means = [np.random.randint(int(min(points)), int(max(points))) for _ in range(3)]
stdevs = [np.random.randint(1, 10) for _ in range(3)]

E-step

Now that we have our data and initial parameters, the next step is to find the probabilities that each normal distribution generated each point. We can call these probabilities “membership probabilities” and create an array r with dimensions (3000, 3) that stores them.

First we use norm from SciPy’s stats library to set up the distribution function for each pair of mean and standard deviation. Then, we apply each distribution’s pdf (probability density function) to our data to find the membership probabilities. The pdf has a standard closed form for normal/Gaussian distributions (Fig. 9).

Figure 9. Probability density function (pdf) for a Gaussian distribution

Finally, we normalize r so that each row sums to 1.
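A sketch of this E-step, assuming the points array and the means and stdevs lists from above:

from scipy.stats import norm

# Membership probabilities: r[i, j] is the probability that point i came from Gaussian j
distributions = [norm(m, s) for m, s in zip(means, stdevs)]
r = np.column_stack([d.pdf(points) for d in distributions])   # shape (3000, 3)

# Normalize so that each row sums to 1
r = r / r.sum(axis=1, keepdims=True)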

M-step

We then use the array of probabilities attained through the E-step to adjust our parameters of mean and standard deviation for each Gaussian involved in the mixture. Again, we use MLE (Fig. 10).

Figure 10. Equations to compute the new mean and standard deviation for every Gaussian distribution
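A sketch of this M-step, using each column of r as weights in the usual weighted MLE updates:

# M-step: weighted mean and weighted standard deviation for each Gaussian
for j in range(3):
    w = r[:, j]
    means[j] = np.sum(w * points) / np.sum(w)
    stdevs[j] = np.sqrt(np.sum(w * (points - means[j]) ** 2) / np.sum(w))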

The E-step and M-step repeat for k = 20 iterations (Fig. 6).

Figure 6. Initial random Gaussians (left); Gaussians after 5 iterations (center); Gaussians after 20 iterations (right)

Conclusion

The EM algorithm is an iterative process that employs MLE in the presence of a latent variable. We have seen this algorithm at work in three different examples: K-Means (clustering), Two Coins (binomial mixture), and Gaussian Mixtures.
