The EM Algorithm And Mixture Distributions: A Match Made In Heaven
Introduction
The EM algorithm is quite counterintuitive at first sight. And worse, the very situation in which the EM algorithm becomes useful is mind-bending on its own. So, before describing the algorithm itself, let's talk about when it's beneficial to use it.
Imagine that you have a dataset of points x_1, x_2, x_3, …, x_n (they might be scalars or vectors; it doesn't matter) and you somehow know, through domain knowledge or an oracle, that with every point x_i there is another hidden random variable associated with it, call its realization z_i, which, no matter what you do, you cannot measure. This is not as far-fetched as you might think.
A very simple example of when that might be the case is clustering algorithms. Whenever you use a clustering algorithm and assume that there are, say, 3 clusters, you implicitly create such a hidden random variable Z whose range is {1, 2, 3} (or equivalently, we could one-hot encode it, such that Z becomes a random vector). Now every data point is associated with a cluster which you didn't measure during data collection — if you had measured it, then you wouldn't be doing clustering anymore but supervised learning.
Now, it might be the case that we have chosen a particular model to fit to the data. And we would like to estimate the parameters using maximum likelihood estimation. You do remember maximum likelihood estimation, right 🤔❓
Well, if you don't, then here is a brief explanation. We have unknown parameters, and we estimate them by choosing the values which maximize the probability of the dataset. To be more exact, we choose the parameters such that the following function,

p(x_1) · p(x_2) · … · p(x_n) = ∏_{i=1}^{n} p(x_i),

is maximized, where p is either a probability mass function or a probability density function (and depends on the unknown parameters). Multiplying the individual probabilities of the data points gives us the probability of the whole dataset because we usually assume that the points are independent (unless it's a time series or a similar dependent sequence, but we won't go into that here).
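As a quick aside, here is a minimal sketch of the idea for a single Poisson model, using a made-up sample (the numbers are purely illustrative). The value that maximizes the Poisson likelihood is simply the sample mean, and we can confirm it numerically:
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# a made-up sample of counts, purely for illustration
sample = np.array([3, 7, 4, 6, 5, 8, 4])

# negative log-likelihood of a Poisson(mu) model for this sample
def neg_log_likelihood(mu):
    return -np.sum(poisson.logpmf(sample, mu))

# maximize the likelihood numerically (by minimizing its negative)
result = minimize_scalar(neg_log_likelihood, bounds=(0.01, 50), method='bounded')
print(result.x)        # numerical maximum likelihood estimate
print(sample.mean())   # the closed-form answer: the sample mean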
Now where were we? We have a model and a parameter vector θ, and we would like to estimate θ using maximum likelihood estimation. Four things might happen:
- The likelihood function turns out to be very simple and we can easily estimate θ analytically.
- Something goes horribly wrong, and we are unable to estimate θ; perhaps the model is unidentifiable, as is the case, for example, in factor analysis.
- The likelihood function turns out to be extremely complex, but, by introducing the hidden random variable Z which exists implicitly in our model, we can simplify it and so estimate our parameters by the iterative EM algorithm.
- The likelihood function turns out to be extremely complex, but introducing hidden variables avails us naught, and so we can do nothing better than numerical algorithms such as Newton's algorithm or the like.
Enough talk for now. Let’s see the model whose parameters we shall estimate using the EM algorithm.
Mixture models
To illustrate mixture models, I will use a dataset of earthquake counts covering the period 1916–2015.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import poisson
plt.style.use('ggplot')
quakes = pd.read_table('https://online.stat.psu.edu/onlinecourses/sites/stat501/files/data/earthquakes.txt')['Quakes']
# unique observed counts and the Poisson pmf with mean 12.61 (the sample mean) at each of them
x = np.unique(quakes)
y = poisson.pmf(x, 12.61)
sns.histplot(quakes, stat='probability', label='Quakes counts', bins=100);
sns.lineplot(x=x, y=y, color='b', label='Poisson variable')
plt.legend();
The natural way to model a count random variable is as a Poisson random variable. But, as we can see from the plot, our data isn't as symmetric as a true Poisson variable with this mean would be; it is too heavy in the right tail. And if we look at the summary statistics,
pd.DataFrame(quakes.describe()).T
we see that the mean is 12.61 while the variance is approximately 18. If our data were truly Poisson, the mean and the variance would be approximately equal. So this isn't going to work.
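If you want to check the two quantities directly in code (using the quakes series loaded above):
# for a true Poisson variable the mean and the variance should be roughly equal
print(quakes.mean())                 # about 12.61
print(quakes.var())                  # noticeably larger, about 18
print(quakes.var() / quakes.mean())  # dispersion index; 1 for a true Poisson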
That's when mixture models come in. If one Poisson isn't going to work, then let's use two, or three, or four, or even more. But what am I saying here, exactly? I am saying that instead of assuming, as we usually do, that our data is one homogeneous mass, let's assume that it's actually heterogeneous. That it consists of multiple groups such that each group is distributed as a Poisson with a certain mean; that it's, so to speak, a mixture.
But let’s be rigorous for a second here. What I am saying is that our data follows the following generative story:
- First, sample from the random vector Z (hint, hint) such that Z ~ Multinomial(1, p), where p is the vector of probabilities with one component for each of the k groups, so that Z becomes a one-hot vector. (If you do not remember what the multinomial distribution is, just know that we are basically choosing a group at random using the components of p as weights. Meaning that if p = (0.8, 0.2), then the first group is chosen with probability 0.8 and the second with probability 0.2.) As a simple example, if we have 2 groups, and we could somehow observe the hidden variable corresponding to x_2, namely z_2, then z_2 might be (0, 1).
- Draw from the random variable X such that X ~ f_z. In other words, each group has its own distribution (a short simulation sketch of this two-step story follows right after this list).
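Here is a minimal simulation sketch of that two-step story, with made-up mixing probabilities and Poisson means (the numbers are purely illustrative):
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.8, 0.2])      # hypothetical group probabilities
mus = np.array([10.0, 20.0])  # hypothetical Poisson mean for each group

# step 1: pick a hidden group for each point (equivalently, draw a one-hot z_i)
groups = rng.choice(len(p), size=1000, p=p)
# step 2: draw each x_i from the Poisson distribution of its group
samples = rng.poisson(mus[groups])
Each point is generated by first choosing a group at random and then drawing a Poisson count with that group's mean, which is exactly the two steps above.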
Now, what will be the pdf of the random variable X? We know that p(X = x | Z = z) = f_z(x), and we also have p(Z = z). From this we can easily find that, given there are k groups,

p(x) = λ_1 f_1(x) + λ_2 f_2(x) + … + λ_k f_k(x),

where all the λ_j are positive numbers that sum to one and each f_j is a probability density function (or probability mass function) that integrates (or sums) to 1.
Notice that while there is no requirement for each f_j to be from the same family (we could have f_1 be Normal, f_2 Gamma, f_3 Cauchy, etc.), that's usually the case. Here, I am going to have each f_j be a Poisson distribution, but with a different mean for each group. So how can we fit this model? How can we estimate its parameters? What are its parameters? Well, they are the following:
- The first parameter is the λ vector, which equals (λ_1, λ_2, λ_3, …, λ_k), where k is the number of groups; these are what we shall call the mixing weights. They determine which group a data point is more likely to come from.
- The second parameter is the mean vector μ, which equals (μ_1, μ_2, μ_3, …, μ_k), such that each μ_j is the mean of the Poisson distribution f_j.
Now let's try to estimate these parameters with maximum likelihood estimation. We are trying to maximize the likelihood function:

p(x_1; λ, μ) · p(x_2; λ, μ) · … · p(x_n; λ, μ) = ∏_{i=1}^{n} p(x_i; λ, μ).

But it's far easier to maximize the log of this function, in order to turn the product into a sum. The estimates aren't going to change, because whatever maximizes a function also maximizes the log of that function. So what we seek to maximize now is:

Σ_{i=1}^{n} log p(x_i; λ, μ).
And let's call this function l, and combine both λ and μ into a single vector θ. This may not seem a particularly complex function, but let's write out p(x_i; θ) explicitly, so that, if we have k groups, the function l becomes

l(θ) = Σ_{i=1}^{n} log( λ_1 f_1(x_i) + λ_2 f_2(x_i) + … + λ_k f_k(x_i) ) = Σ_{i=1}^{n} log( Σ_{j=1}^{k} λ_j f_j(x_i) ).
We have decided that each f_j is going to be a Poisson distribution, so let's plug that in:

l(θ) = Σ_{i=1}^{n} log( Σ_{j=1}^{k} λ_j · e^{−μ_j} · μ_j^{x_i} / x_i! ).
Do you now see the difficulty? The log sits on top of a sum, so it cannot be pushed through to the individual terms, and nothing factors or simplifies.
You can, if you are so inclined, try to take the partial derivatives of l with respect to λ and μ and set them to 0. But it isn't going to work (you don't have to trust me on this; make the attempt). And you would actually have to use Lagrange multipliers, in order to add the following constraint:

λ_1 + λ_2 + … + λ_k = 1.
But notice that the hidden random variable Z appears nowhere in this likelihood function. Might we introduce it ourselves? How would that help?
Well, to see how it helps, we will have to first discuss the EM algorithm. But before that, a question.
Are mixture models really useful?
Perhaps you are the kind of person that doesn't care about density estimation. It's all the same to you whether a variable follows the Normal distribution or the Cauchy distribution, etc. Well then, what use are mixture models to you? Be not afraid, mixture models do have one other use: clustering.
Forget about k-means, k-medoids, and other similar algorithms. There is a whole field of clustering called model-based clustering which uses mixture models exclusively. By using mixture models, we add a probabilistic dimension to our clustering, and that has many benefits:
- We do not have to guess the number of clusters randomly or by using magical methods such as the elbow method. We now have a probabilistic model for the data, and as such we can speak of the likelihood, which gives us a clear way of choosing the number of clusters.
- Whether our data is categorical or numeric or both no longer matters; we can give every variable a suitable distribution in the mixture.
There are many other benefits, and a lot more can be said on the use of mixture models for clustering, but I am not going to say any more here. If you wish to learn further, I highly recommend the book Model-Based Clustering and Classification for Data Science, which treats the topic exhaustively.
Now onto the EM algorithm.
The EM algorithm
Say we have a dataset X consisting of data points x_1, x_2, x_3, …, x_n, and, as I have said previously, we would like to fit a probabilistic model with maximum likelihood estimation. The only problem is that the function

∏_{i=1}^{n} p(x_i; θ)

is far too complex.
What the EM algorithm does is introduce a set of unobserved variables Z, and instead of trying to maximize the likelihood of the observed data alone, it tries to maximize the joint likelihood of both the observed and unobserved data. Taking logs, in the case of a mixture of k Poissons, this becomes:

Σ_{i=1}^{n} Σ_{j=1}^{k} z_ij ( log λ_j + x_i log μ_j − μ_j − log(x_i!) ),
where z_ij is the jth component of the one-hot vector z_i, so it's either 0 or 1. Now this function — let's call it Q — is far simpler than the likelihood of the observed data only, isn't it? The only problem is that, by definition, each z_ij is unobserved, so how do we use it for estimation? Per the EM algorithm, there are four steps that we must follow:
- First, initialize λ and μ with random values.
- Second, the expectation or E step. Alright, so we don't know the value of each z_ij; what are we to do? We simply replace each of them by its expectation, E(z_ij), which we can compute in terms of λ and μ and the observed data.
- Third, the maximization or M step. We have computed the expectations of the hidden variables; we now forget about the previously initialized λ and μ, and find new values for them which maximize Q (remember that each z_ij is now replaced by its expectation, so we have no problem).
- Fourth, iterate the E and M steps until the likelihood of the observed data no longer increases or until a certain number of iterations is reached.
And so the million-dollar question now is: does this actually work? Does maximizing Q at every iteration also maximize the likelihood of the observed data? The short answer is yes: it can be proved that in every iteration, the likelihood of the observed data increases by at least as much as Q has increased. I could write out the theoretical justification, but I am afraid the margins are too narrow to contain it — and even more afraid that you would stop reading, that is, in the fantastical case that anybody has even read to this point (most likely, everybody left after the first paragraph 😢).
Anyway, let's now apply the EM algorithm to the Poisson mixture model.
Fitting Poisson mixture models with the EM algorithm
First of all, we need to compute E(z_ij), which we can easily do using Bayes' rule. We have:

E(z_ij) = P(z_ij = 1 | x_i) = λ_j f_j(x_i) / ( λ_1 f_1(x_i) + λ_2 f_2(x_i) + … + λ_k f_k(x_i) ),

where each f_j is the Poisson pmf with mean μ_j.
And remember that at this step we have already initialized (or previously updated) λ and μ, so we can just plug them in to get the expectation.
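As a tiny numerical sketch of this step, with hypothetical current values of λ and μ and a single observed count:
import numpy as np
from scipy.stats import poisson

lambdas = np.array([0.5, 0.5])  # hypothetical current mixing weights
mus = np.array([10.0, 20.0])    # hypothetical current Poisson means
x_i = 15                        # one observed count

# E(z_ij) = lambda_j * f_j(x_i) / sum over m of lambda_m * f_m(x_i)
numerators = lambdas * poisson.pmf(x_i, mus)
e_z = numerators / numerators.sum()
print(e_z)  # non-negative weights that sum to 1, one per group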
The second step is the maximization step. We need to maximize Q:

Q = Σ_{i=1}^{n} Σ_{j=1}^{k} E(z_ij) ( log λ_j + x_i log μ_j − μ_j − log(x_i!) ).
Notice that all the z_ij have been replaced by their expectations. This is a very simple function consisting of just sums, and so we can easily take the derivatives. Let's start with the harder task of taking the derivative with respect to λ. We do not have to consider the constraint that each component of λ be ≥ 0, because the function contains the log of each component, and the log is undefined for non-positive values, so the value of λ which maximizes the function can never have a negative or zero component. What we do have to consider is the following constraint:

λ_1 + λ_2 + … + λ_k = 1.
Using Lagrange multipliers, we can maximize the function while respecting the constraint by solving the following equations (α being the multiplier):

Σ_{i=1}^{n} E(z_il) / λ_l = α, for l = 1, …, k,
alongside the constraint equation. I am not going to write out the computations; suffice it to say that the solution is, for an arbitrary λ_l:

λ_l = (1/n) · Σ_{i=1}^{n} E(z_il).
If you are having trouble solving the equations, remember that, together with the constraint, we have k + 1 equations in k + 1 unknowns (the k components of λ plus the multiplier α), so any solution you find is the solution. Try to solve them in the case k = 2, then generalize that solution to higher k and check whether it works.
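For the curious, here is a sketch of that computation under the setup above. Setting the derivative of Q − α(λ_1 + λ_2 + … + λ_k − 1) with respect to λ_l to zero gives Σ_{i=1}^{n} E(z_il) / λ_l = α, i.e. λ_l = Σ_{i=1}^{n} E(z_il) / α. Summing these equations over l and using both Σ_{l} λ_l = 1 and Σ_{l} E(z_il) = 1 for every i (the expectations of a one-hot vector sum to one) gives α = n, which yields the formula above.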
The situation with μ is far easier. A simple derivative, without Lagrange multipliers, will do the trick. We will then find that the solution, for an arbitrary μ_l, is:

μ_l = Σ_{i=1}^{n} E(z_il) · x_i / Σ_{i=1}^{n} E(z_il).
And we have finished. All that's needed is to keep iterating these same steps until the likelihood no longer increases. But enough theory; let's now code this algorithm and fit the model to the earthquakes dataset.
Python implementation of the EM algorithm for Poisson mixtures
First of all, we need a function to compute the log-likelihood of the observed data, in order to find out whether adding more groups actually improves anything.
def log_likelihood(means, mixture_weights, counts):
    log_probs = []
    for count in counts:
        # mixture pmf at this count: weighted sum of the Poisson pmfs of the groups
        prob = np.dot(mixture_weights, poisson.pmf(count, means))
        log_probs.append(np.log(prob))
    # summing the logs is equivalent to taking the log of the product,
    # but avoids numerical underflow
    return np.sum(log_probs)
Nothing strange here. This is the exact same log-likelihood formula I discussed above, just in code.
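Before moving on, a quick sanity check (using the quakes series loaded earlier): the log-likelihood of a single Poisson with mean 12.61, the baseline that the mixtures fitted below should improve upon.
baseline = log_likelihood(np.array([12.61]), np.array([1.0]), quakes)
print(baseline)
Now for the code of the EM algorithm itself: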
def em_poisson(counts, iterations, k, means=None):
    # initialization step: random means (kept positive, since a Poisson mean must be > 0)
    # and random mixing weights normalized to sum to one
    if means is None:
        means = np.abs(np.random.normal(20, 10, k))
    mixing_weights = np.random.rand(k)
    mixing_weights = mixing_weights / np.sum(mixing_weights)
    expectations = np.zeros((len(counts), k))
    for iteration in range(iterations):
        # E step: replace each z_ij by its expectation given the current parameters
        for i in range(len(counts)):
            denominator = np.dot(mixing_weights, poisson.pmf(counts[i], means)) + 0.0001
            expectations[i] = (mixing_weights * poisson.pmf(counts[i], means)) / denominator
        # M step: re-estimate the mixing weights and the means from the expectations
        for j in range(k):
            mixing_weights[j] = np.sum(expectations[:, j]) / (np.sum(expectations) + 0.0001)
            means[j] = np.sum(expectations[:, j] * counts) / (np.sum(expectations[:, j]) + 0.0001)
    return {'means': means, 'mixing_weights': mixing_weights}
Here we have the EM algorithm, with its four steps. Possibly, we might have some idea about what the means of the various groups are, so I made it possible for the user to enter a starting guess. In the end, the estimates for λ and μ are returned in a dictionary. Also, notice that whenever I divide by a number, I add a very small fraction to it; that’s in order to prevent division by 0 if the denominator is too small.
Let’s try it out for a few different values of k.
means = []
mixing_weights = []
log_likelihoods = []
for k in range(1, 6):
    results = em_poisson(quakes, 100, k)
    means.append(np.round(results['means'], decimals=3))
    mixing_weights.append(np.round(results['mixing_weights'], decimals=3))
    log_likelihoods.append(np.round(log_likelihood(results['means'], results['mixing_weights'], quakes), decimals=3))
data_frame = pd.DataFrame({'means': means, 'mixing_weights': mixing_weights, 'log_likelihood': log_likelihoods})
data_frame
And the output will be:
The likelihood improves when using two groups, but further improvements by increasing k are infinitesimal.
We have seen some improvement, but certainly not as much as we would have liked. And that's because we have ignored an important property of the data: it's a time series. It seems plausible to suppose that a great number of earthquakes today might imply a great number of earthquakes tomorrow, or vice versa.
Is there a way to use mixture models while adding this property of serial dependence?
Yes, there is. We can do that using hidden Markov models, which I shall hopefully discuss in a later article.
Good day and happy reading.