Implementing Expectation-Maximisation Algorithm from Scratch with Python
Demystifying the horrors of the EM Algorithm by building one from scratch
The Expectation-Maximisation (EM) Algorithm is a statistical machine learning method for finding the maximum likelihood estimates of models with unknown latent variables. I am sure that sentence will make no sense to some of you. Not to worry! This article aims to demystify the horror equations and puzzling vocabulary of the EM Algorithm. Unfortunately, without proper mathematics, one cannot truly understand the essence of the EM algorithm. So this is what I am going to do. Every section of this article will include a Full Mathematics section and an All You Need to Know section. The Full Mathematics section includes complete derivations of the algorithm, while the All You Need to Know section summarises the essential details without the derivations. To follow the Full Mathematics section, I will assume that you are familiar with probability theory, statistical theory, and fundamental calculus. There will still be some maths in the All You Need to Know section, but not as complex.
To understand the algorithm, we must first understand the data we are dealing with. The EM algorithm deals with data from multiple clusters, i.e. a mixture model. Each cluster has a probability that a data point comes from it, but we do not know these probabilities. Each cluster also has its own distribution, whose form we know, but whose parameters we do not. See the example below for a Poisson mixture model.
In this example, there are 2 clusters. Zᵢ denotes the cluster of the data Xᵢ. The π denotes the probability of being from a certain cluster.
Hence the sum of all the π values should equal one. For this tutorial, I will assume these parameter values.
If a data point is from the first cluster, it comes from a Poisson distribution with parameter λ = 2, and if it is from the second cluster, it comes from a Poisson distribution with λ = 6. Remember that we do not know these parameter values and are trying to estimate them.
The purpose of the EM algorithm is to find the parameter estimates of this mixture model. In the case above, the EM algorithm would estimate π₁, π₂, λ₁, and λ₂. The π is the probability of the cluster, and the λ is the parameter of the Poisson distributions. Consequently, if the distribution of the clusters is Gaussian (Normal) instead of Poisson, the EM algorithm would estimate μ and σ² instead of λ. Let me leave this Poisson pdf function here. Please adjust it to fit your own desired distribution.
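Here is a minimal sketch of such a function (the names poisson_pmf and lam are my own choices):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    # Illustrative sketch: Poisson probability of observing the count x
    # given rate parameter lam, i.e. exp(-lam) * lam**x / x!
    return exp(-lam) * lam ** x / factorial(int(x))
```

For a Gaussian mixture, you would swap this out for the normal density with parameters μ and σ².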
In practice, there may be as many clusters as we want, and the distribution of each cluster can be anything we want it to be. Once again, I will remind you that we do not know any of the parameter values, nor do we know which cluster each data point comes from. This unknown cluster allocation is why the clusters are called latent variables. The EM algorithm is very useful for dealing with unobservable variables like this.
Since the EM algorithm is iterative, I will use θ to denote the new parameter estimates and θ⁰ to denote the previous iteration’s estimates. Some other notation: N denotes the total number of samples, and K the total number of clusters.
Skip ahead to section 4 (Algorithm) if you want to go right to the demystifying part or if you are looking for the non-maths summary of the algorithm.
1. Expectation of Complete Log-Likelihood
The complete log-likelihood is the log-likelihood of the data together with their cluster labels. We use the term ‘complete’ because, in later sections, there will be another statistic called the ‘incomplete log-likelihood’. I will use ‘complete log-likelihood’ and ‘log-likelihood’ interchangeably for this section only.
Full Mathematics
Skip to the All You Need to Know section if you are not interested in the derivations.
To find the distribution parameters of some data, we usually take the parameters that maximise the likelihood of that data. The EM Algorithm applies the same concept here, trying to maximise the complete likelihood.
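In symbols, with zᵢ denoting the (unobserved) cluster of xᵢ, the complete likelihood of the data and their clusters is

$$L(\theta) = \prod_{i=1}^{N} p(x_i, z_i \mid \theta)$$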
We can expand the formula above as below,
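Splitting the joint probability into the cluster probability and the within-cluster likelihood, and using an indicator function to pick out the correct cluster, this becomes

$$L(\theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \big[\pi_k \, p(x_i \mid z_i = k, \theta)\big]^{\mathbb{I}(z_i = k)}$$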
The function in the exponent is an indicator function, returning 1 if the condition inside is true and 0 otherwise. To simplify later calculations, we prefer that the indicator function not sit in the exponent. Let’s take the log of the likelihood.
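Taking logs turns the products into sums and brings the indicator function down from the exponent:

$$\log L(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}(z_i = k)\,\big[\log \pi_k + \log p(x_i \mid z_i = k, \theta)\big]$$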
This is a more straightforward form to maximise. Unfortunately, we cannot calculate this log-likelihood because we do not know which cluster each data point comes from. Therefore, we do not know the output of the indicator function or which distribution parameters to use. A solution to this problem is to use the expectation of the log-likelihood instead of the actual log-likelihood. Let’s define the expectation of the log-likelihood as below,
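Taking the expectation of the log-likelihood over the unknown clusters gives

$$Q(\theta, \theta^0) = \sum_{i=1}^{N} \sum_{k=1}^{K} P(z_i = k \mid x_i, \theta^0)\,\big[\log \pi_k + \log p(x_i \mid z_i = k, \theta)\big]$$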
The expectation of the log-likelihood replaces the indicator function with the posterior probabilities. We will calculate the values of the posterior probabilities in later sections.
For our Poisson example above, the expectation becomes the following,
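Substituting the Poisson probability mass function, so that log p(xᵢ | zᵢ = k, θ) = xᵢ log λₖ − λₖ − log(xᵢ!),

$$Q(\theta, \theta^0) = \sum_{i=1}^{N} \sum_{k=1}^{2} P(z_i = k \mid x_i, \theta^0)\,\big[\log \pi_k + x_i \log \lambda_k - \lambda_k - \log(x_i!)\big]$$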
All You Need to Know
All you need to know about the complete log-likelihood is that we want to find parameters that maximise it. Unfortunately, the complete log-likelihood is difficult to calculate because of the unknown clusters. To get around this, we calculate the expectation of the log-likelihood and maximise it instead of the actual log-likelihood.
The expectation of the complete log-likelihood is denoted as Q(θ, θ⁰). The formula for Q(θ, θ⁰) is given below.
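Writing P(zᵢ = k | xᵢ, θ⁰) for the posterior probability that data point i belongs to cluster k,

$$Q(\theta, \theta^0) = \sum_{i=1}^{N} \sum_{k=1}^{K} P(z_i = k \mid x_i, \theta^0)\,\big[\log \pi_k + \log p(x_i \mid z_i = k, \theta)\big]$$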
For our Poisson example, this becomes,
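With the Poisson log-probability xᵢ log λₖ − λₖ − log(xᵢ!) substituted in:

$$Q(\theta, \theta^0) = \sum_{i=1}^{N} \sum_{k=1}^{2} P(z_i = k \mid x_i, \theta^0)\,\big[\log \pi_k + x_i \log \lambda_k - \lambda_k - \log(x_i!)\big]$$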
For now, this is all you need to know about the expectation of the complete log-likelihood.
2. Posterior Probability
Full Mathematics
Skip to the All You Need to Know section if you are not interested in the derivations.
The posterior probability is the probability that a data point belongs to a particular cluster, given the value of that data point and the parameters from the current iteration. This is not to be confused with the probability denoted by π. Let’s define it,
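In symbols,

$$P(z_i = k \mid x_i, \theta^0) = \frac{p(x_i, z_i = k \mid \theta^0)}{p(x_i \mid \theta^0)}$$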
Note that I used the definition of conditional probability to obtain this. We can expand this expression to the following,
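Expanding both the numerator and the denominator:

$$P(z_i = k \mid x_i, \theta^0) = \frac{\pi_k^0 \, p(x_i \mid z_i = k, \theta^0)}{\sum_{j=1}^{K} \pi_j^0 \, p(x_i \mid z_i = j, \theta^0)}$$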
I will not go into the probabilistic properties applied here, but it follows from the conditional probability definition above. Notice that the numerator is the joint likelihood of the data point and the cluster. We have seen these probabilities in the previous section. The denominator is just the sum of these likelihoods over all clusters.
For our Poisson example, the posterior probability becomes,
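Substituting the Poisson probability mass function (the xᵢ! terms cancel between the numerator and the denominator):

$$P(z_i = k \mid x_i, \theta^0) = \frac{\pi_k^0 \, e^{-\lambda_k^0} (\lambda_k^0)^{x_i}}{\sum_{j=1}^{2} \pi_j^0 \, e^{-\lambda_j^0} (\lambda_j^0)^{x_i}}$$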
Note that the parameters used to calculate the posterior probabilities are the previous iteration’s parameters. Therefore, the posterior probability is treated as a constant in the expectation of the complete log-likelihood.
All You Need to Know
All you need to know about the posterior probability is that it is the probability of a cluster given the current data and parameters. We need this probability to calculate the expectation of the complete log-likelihood. Thankfully, it is not too difficult to find, so I will just give the formula.
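Writing πⱼ⁰ and θⱼ⁰ for the previous iteration’s estimates, the posterior probability of cluster k for data point i is

$$P(z_i = k \mid x_i, \theta^0) = \frac{\pi_k^0 \, p(x_i \mid z_i = k, \theta^0)}{\sum_{j=1}^{K} \pi_j^0 \, p(x_i \mid z_i = j, \theta^0)}$$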
This formula is simple because we already know everything in it. The numerator is the likelihood of a data point and a cluster, while the denominator is the sum of the likelihoods of the same data point over all possible clusters.
In our previous example, we only have two possible clusters. Hence the posterior probability for the first cluster of that example is just the likelihood of the first cluster divided by the sum of the first cluster’s likelihood and the second cluster’s likelihood.
I hope that you can extrapolate this example to other examples.
Python Code
Using the Poisson mixture example, below is a function to calculate the posterior probability.
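A possible implementation looks like this (it relies on the poisson_pmf function from earlier; the names posterior, lambdas, and probs are my own choices):

```python
def posterior(X, lambdas, probs):
    # Illustrative sketch: posterior probability of every cluster for every data point.
    # Returns a list of lists; inner list k holds P(Z_i = k | X_i) for each data point i.
    K = len(lambdas)
    posteriors = [[] for _ in range(K)]
    for x in X:
        # cluster probability times within-cluster likelihood, for every cluster
        weighted = [probs[k] * poisson_pmf(x, lambdas[k]) for k in range(K)]
        total = sum(weighted)
        for k in range(K):
            posteriors[k].append(weighted[k] / total)
    return posteriors
```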
The function above returns a list of lists, where each inner list denotes a cluster, and the content of the inner list is the posterior probabilities. Try to match this Python code with the Poisson posterior formula above.
3. Maximisation
Full Mathematics
Skip to the All You Need to Know section if you are not interested in the derivations.
Maximisation refers to obtaining the parameters by maximising the expectation of the complete log-likelihood. We do this by taking the derivative of the expectation and setting it to zero. Let’s go back to the formula.
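For the Poisson example, it is

$$Q(\theta, \theta^0) = \sum_{i=1}^{N} \sum_{k=1}^{2} P(z_i = k \mid x_i, \theta^0)\,\big[\log \pi_k + x_i \log \lambda_k - \lambda_k - \log(x_i!)\big]$$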
I will use this notation to denote the posterior probability to simplify the equations.
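I will write it here as γᵢₖ (the symbol itself is just a convenient shorthand):

$$\gamma_{ik} \equiv P(z_i = k \mid x_i, \theta^0)$$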
We will start by finding the optimal π₁. Because we only have 2 clusters, we can substitute π₂ with 1 − π₁. Now let’s remove all the terms in Q(θ, θ⁰) that do not contain π₁.
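Keeping only the terms that involve the cluster probabilities leaves

$$\sum_{i=1}^{N} \sum_{k=1}^{2} \gamma_{ik} \log \pi_k$$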
We are allowed to do this because we are about to differentiate with respect to π₁, meaning that all the terms that do not contain π₁ will become 0 anyway, so removing them makes the equation easier on the eye. The next step is to expand the inner sigma.
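Expanding over k = 1, 2 and substituting π₂ = 1 − π₁:

$$\sum_{i=1}^{N} \big[\gamma_{i1} \log \pi_1 + \gamma_{i2} \log(1 - \pi_1)\big]$$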
At this point, taking the derivatives is relatively simple.
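Differentiating with respect to π₁:

$$\frac{\partial Q}{\partial \pi_1} = \sum_{i=1}^{N} \left[\frac{\gamma_{i1}}{\pi_1} - \frac{\gamma_{i2}}{1 - \pi_1}\right]$$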
Setting the derivative to 0 yields the following result.
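That is,

$$\sum_{i=1}^{N} \frac{\gamma_{i1}}{\pi_1} = \sum_{i=1}^{N} \frac{\gamma_{i2}}{1 - \pi_1}$$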
With some simple algebra, solving for π₁ will get us,
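Using the fact that γᵢ₁ + γᵢ₂ = 1 for every data point,

$$\pi_1 = \frac{\sum_{i=1}^{N} \gamma_{i1}}{\sum_{i=1}^{N} (\gamma_{i1} + \gamma_{i2})} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i1}$$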
The interpretation of this result is that, with the currently obtained posterior probabilities, we can update π₁ to the value of this expression.
I will not repeat the process for λ because it will be too long. If you are interested, see my calculations here. Alternatively, visit the All You Need to Know section for the final estimates only.
All You Need to Know
All you need to know about the maximisation process is that, to obtain the parameter estimates, we need to maximise the expectation of the complete log-likelihood, Q(θ, θ⁰). Recall its formula from section 1.
The way to maximise this is to take the derivative of Q(θ, θ⁰) with respect to each parameter and set it to 0. Remember that the posterior probability is constant when taking these derivatives because it is calculated using the previous iteration’s estimates.
I will show you the result of the derivation for the Poisson example.
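Writing γᵢₖ for the posterior probability of cluster k for data point i, the updates are

$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_{ik}, \qquad \lambda_k = \frac{\sum_{i=1}^{N} \gamma_{ik}\, x_i}{\sum_{i=1}^{N} \gamma_{ik}}$$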
Hence to obtain the new parameter estimates of every iteration, we need to calculate these equations.
Python Code
Using the Poisson mixture example, below are functions to calculate the optimum parameter estimates.
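Here is one way to write them (the names optimal_probs and optimal_lambdas are my own choices; posteriors is the list of lists returned by the posterior function above):

```python
def optimal_probs(posteriors):
    # Illustrative sketch: new cluster probabilities are the average posterior of each cluster
    N = len(posteriors[0])
    return [sum(post) / N for post in posteriors]

def optimal_lambdas(X, posteriors):
    # Illustrative sketch: new Poisson rates are posterior-weighted averages of the data
    return [sum(p * x for p, x in zip(post, X)) / sum(post) for post in posteriors]
```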
The functions above return two lists, for the π and λ estimates respectively. Try to match this Python code with the optimum parameter formulas above.
4. Algorithm
So far, we have not gotten to the demystifying part of the article. I understand that everything might still be a blur. So let’s wrap up everything we know and bring the algorithm to life.
Steps of an EM Algorithm:
- Initialise random parameter values.
- Derive the expectation of complete log-likelihood, Q(θ, θ⁰).
- Calculate the posterior probabilities.
- Given the posterior probabilities, find the optimal parameters by differentiating Q(θ, θ⁰) with respect to each parameter, setting the derivatives to 0, and solving for the parameters.
- Change the current parameters to the ones obtained in step 4.
Notice that after changing the parameter estimates in step 5, recalculating the posterior probabilities would yield different results (see the Poisson posterior formula). But if the posterior probabilities are now different, then surely redoing step 4 would generate different parameter estimates (see the optimum parameter formulas). And that is the essence of the EM algorithm. Let’s repeat that in English. The EM algorithm is,
- Using the current parameters, calculate the posterior probabilities.
- Using the current posterior probabilities, update the parameters.
And we iterate these steps over and over again. That’s not so mystical, is it? Sure, the maths might still be heavy, but the essence of the algorithm is just these two little steps. Let’s try to translate this to code.
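Assuming the posterior, optimal_probs, and optimal_lambdas functions sketched earlier, some observed counts X, and initial guesses for the parameters, the core loop could look like this:

```python
lambdas = [1.0, 10.0]   # initial guesses for the Poisson parameters
probs = [0.5, 0.5]      # initial guesses for the cluster probabilities

for _ in range(20):                               # a fixed number of iterations for now
    posteriors = posterior(X, lambdas, probs)     # step 1: posteriors from the current parameters
    probs = optimal_probs(posteriors)             # step 2: update the parameters
    lambdas = optimal_lambdas(X, posteriors)
```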
Now, without needing to understand the contents of each function, this Python code should highlight the core of the EM algorithm. The heart of it is mathematical and complicated, but the algorithm itself is quite simple.
The only unsolved piece of the puzzle is the stopping criterion. Now that we understand what is done in each iteration, we want to know when to stop iterating. The following section will explain this stopping criterion.
5. Incomplete Log-Likelihood
In this section, I will introduce the concept of the incomplete log-likelihood. The idea is similar to the complete log-likelihood. I will not go through the mathematical details, but essentially, the incomplete log-likelihood does not care about the data’s cluster: it calculates the likelihood under every cluster and sums them up, whereas the complete log-likelihood only takes the likelihood of the cluster from which the data point originates.
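In symbols, the incomplete log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, p(x_i \mid z_i = k, \theta) \right)$$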
I am going to translate this formula into Python.
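A sketch of that translation, again reusing the poisson_pmf function (the function name is my own choice):

```python
import numpy as np

def incomplete_log_likelihood(X, lambdas, probs):
    # Illustrative sketch: sum over the data of the log of the cluster-summed likelihood
    total = 0.0
    for x in X:
        total += np.log(sum(probs[k] * poisson_pmf(x, lambdas[k])
                            for k in range(len(lambdas))))
    return total
```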
The function above returns the value of the incomplete log-likelihood.
Due to some properties I will not explore, the incomplete log-likelihood never decreases from one iteration to the next. The EM algorithm stops iterating when the change in the incomplete log-likelihood is no longer significant.
6. Summary
Let’s put everything in code.
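A sketch of the whole procedure, built from the functions above (the name em_algorithm and the tolerance/iteration defaults are my own choices):

```python
def em_algorithm(X, init_lambdas, init_probs, tol=1e-6, max_iter=1000):
    # Illustrative sketch: iterate E- and M-steps until the incomplete
    # log-likelihood stops changing significantly.
    lambdas, probs = list(init_lambdas), list(init_probs)
    old_ll = incomplete_log_likelihood(X, lambdas, probs)
    for _ in range(max_iter):
        posteriors = posterior(X, lambdas, probs)       # E-step
        probs = optimal_probs(posteriors)               # M-step
        lambdas = optimal_lambdas(X, posteriors)
        new_ll = incomplete_log_likelihood(X, lambdas, probs)
        if abs(new_ll - old_ll) < tol:                  # stopping criterion
            break
        old_ll = new_ll
    return lambdas, probs
```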
The output of this function is the final set of parameter estimates.
Now you might be wondering why I said that the EM algorithm is a machine learning method when it looks more like an estimation method. We have been calculating the posterior probabilities along the way, and we can use these to obtain the cluster of each data point. We do this by calculating the posterior probability of a data point belonging to each cluster and assigning the data point to the cluster with the highest probability. Hence, an unsupervised machine learning method.
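A sketch of this classification step (the name classify is my own choice):

```python
import numpy as np

def classify(X, lambdas, probs):
    # Illustrative sketch: assign each data point to the cluster with the
    # highest posterior probability.
    posteriors = np.array(posterior(X, lambdas, probs))  # shape (K, N)
    return np.argmax(posteriors, axis=0)                 # one cluster index per data point
```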
The parameters lambdas and probs above refer to the final estimates from the EM algorithm. Running this function returns a NumPy array containing the cluster of each data point in X.
7. Congratulations
You have successfully finished this article. If you managed to understand and recreate the algorithm, I applaud you. If not, this is an advanced algorithm, and you’ll get it with more practice. See my complete code on GitHub for how to use these functions in practice. If you enjoyed this article, consider giving a clap and a follow. Please share any feedback and thoughts in the comments. Thank you for reading!