EM algorithm and Gaussian Mixture Model (GMM)

with a sample implementation in Python

Oxanne
CodeX
6 min read · Aug 12, 2021


Preface: This article aims to provide consolidated information on the underlying topic and is not to be considered original work. The information and the code are repurposed from several online articles, research papers, books, and open-source code.

Introduction

The Expectation-Maximization Algorithm, or EM algorithm for short, is an approach for maximum likelihood estimation in the presence of latent variables. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin in the Journal of the Royal Statistical Society.

Firstly, let’s recall types of clustering methods:

  • hard clustering: clusters do not overlap (an element either belongs to a cluster or it does not) — e.g. K-means, K-Medoids.
  • soft clustering: clusters may overlap (there is a strength of association between clusters and instances) — e.g. a mixture model fitted with the Expectation-Maximization algorithm.

GMM clustering is more flexible than K-means, though not necessarily more accurate, because it is a fuzzy or soft clustering method. Soft clustering methods assign each data point a score for each cluster; the value of the score indicates the strength of the point's association with that cluster. In contrast to hard clustering methods, soft clustering methods can assign a data point to more than one cluster. When clustering with GMMs, the score is the posterior probability.

Mixture models:

  • probabilistically-grounded way of doing soft clustering
  • each cluster: a generative model (Gaussian or multinomial)
  • parameters (e.g. mean and covariance) are unknown and must be estimated

Implementation of GMM in Python

The complete code is available as a Jupyter Notebook on GitHub.

Let’s create a sample dataset where points are generated from one of two Gaussian processes. The first distribution will have a mean of 100 and the second a mean of 90; the standard deviations of the distributions will be 5 and 2 respectively.

We will draw 60,000 points from the first process and 50,000 points from the second, and mix them together.

After mixing the processes together, we get the dataset shown on the plot. We can notice two peaks, around 90 and 100, but for many of the points lying between the peaks it is ambiguous which distribution they were drawn from. So how should we approach this problem?
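A minimal sketch of how such a dataset could be generated and plotted (the seed and variable names here are our own choices, not necessarily those of the notebook):

```python
import numpy as np
import matplotlib.pyplot as plt

# Draw points from the two Gaussian processes described above
np.random.seed(42)                                  # assumed seed, for reproducibility
process_1 = np.random.normal(loc=100, scale=5, size=60_000)
process_2 = np.random.normal(loc=90, scale=2, size=50_000)

# Mix the two processes into a single one-dimensional dataset
data = np.concatenate([process_1, process_2])
np.random.shuffle(data)

# Histogram of the mixed data: two peaks, around 100 and 90
plt.hist(data, bins=100, density=True)
plt.xlabel("value")
plt.ylabel("density")
plt.show()
```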

So here, we have to estimate a total of 5 parameters: the mixing probability p, the two means μ₁ and μ₂, and the two standard deviations σ₁ and σ₂,

where p is the probability that the data comes from the first Gaussian distribution and 1-p that it comes from the second Gaussian distribution.
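Written out in the notation above, the mixture density and the parameter vector θ to estimate are:

```latex
% Two-component univariate Gaussian mixture; theta collects the 5 parameters
f(x \mid \theta) = p\,\mathcal{N}(x \mid \mu_1, \sigma_1^2)
                 + (1 - p)\,\mathcal{N}(x \mid \mu_2, \sigma_2^2),
\qquad \theta = (p, \mu_1, \sigma_1, \mu_2, \sigma_2)
```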

We can use a Gaussian Mixture Model that will estimate the parameters of the distributions using the expectation-maximization algorithm (EM).

EM alternates between performing an expectation E-step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization M-step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E-step. The parameters found on the M-step are then used to begin another E-step, and the process is repeated until convergence.
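To make these two steps concrete, here is a minimal from-scratch sketch of EM for the two-component univariate mixture above (an illustration only, with a crude initialization; it is not the notebook's code):

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100):
    """Minimal EM for a two-component univariate Gaussian mixture."""
    # Crude initialization of p, mu_1, sigma_1, mu_2, sigma_2
    p, mu1, s1, mu2, s2 = 0.5, x.min(), x.std(), x.max(), x.std()
    for _ in range(n_iter):
        # E-step: responsibility (posterior probability) that each point
        # was generated by component 1, under the current parameters
        d1 = p * norm.pdf(x, mu1, s1)
        d2 = (1 - p) * norm.pdf(x, mu2, s2)
        r1 = d1 / (d1 + d2)
        r2 = 1 - r1
        # M-step: re-estimate the parameters, weighting each point
        # by its responsibilities
        p = r1.mean()
        mu1, mu2 = np.average(x, weights=r1), np.average(x, weights=r2)
        s1 = np.sqrt(np.average((x - mu1) ** 2, weights=r1))
        s2 = np.sqrt(np.average((x - mu2) ** 2, weights=r2))
    return p, mu1, s1, mu2, s2

# Example usage on the mixed 1-D sample generated earlier:
# p_hat, mu1_hat, s1_hat, mu2_hat, s2_hat = em_two_gaussians(data)
```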

From sklearn, we use the GaussianMixture class which implements the EM algorithm for fitting a mixture of Gaussian models.

The class allows us to specify the suspected number of underlying processes used to generate the data via the n_components argument when defining the model. We will set this to 2 as we have two processes (or distributions).

If the number of processes was unknown, a range of different numbers of components could be tested and the model with the best fit could be chosen, where models could be evaluated using scores such as Akaike or Bayesian Information Criterion (AIC or BIC).
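For instance, a hypothetical sweep over the number of components using BIC could look like this:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Mixed 1-D sample as before; sklearn expects shape (n_samples, n_features)
data = np.concatenate([np.random.normal(100, 5, 60_000),
                       np.random.normal(90, 2, 50_000)])
X = data.reshape(-1, 1)

# Fit models with 1..5 components and keep the one with the lowest BIC
candidates = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 6)]
best = min(candidates, key=lambda m: m.bic(X))
print("Best number of components by BIC:", best.n_components)
```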

There are also many ways we can configure the model to incorporate other information we may know about the data, such as how to estimate initial values for the distributions. In this case, we will randomly guess the initial parameters, by setting the init_params argument to ‘random’.
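A sketch of the fitting and prediction steps along these lines (details such as the random seed are assumptions, not necessarily what the notebook uses):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Recreate the mixed dataset; sklearn expects shape (n_samples, n_features)
np.random.seed(42)                          # assumed seed
data = np.concatenate([np.random.normal(100, 5, 60_000),
                       np.random.normal(90, 2, 50_000)])
X = data.reshape(-1, 1)

# Two underlying processes, randomly initialized parameters
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)

# Predict the latent variable (component index) for each training example
labels = model.predict(X)
print(labels[:5], labels[-5:])

# Posterior probability of each component for a single point, e.g. 105.0
print(model.predict_proba(np.array([[105.0]])))
```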

Running the example fits the Gaussian mixture model on the prepared dataset using the EM algorithm. Once fit, the model is used to predict the latent variable values for the examples in the training dataset.

We can see that, at least for the first few and last few examples in the dataset, the model mostly predicts the correct value for the latent variable. The predict_proba() method returns the posterior probability of each component given the data. In our case, the probabilities that the point 105.0 belongs to each of the two Gaussian processes are 0.501 and 0.499.

In general, K-means and EM may each perform better or worse, depending on the nature of the data we want to cluster and on our criteria for what defines a good clustering result.

For multivariate Gaussians, the situation is a bit more complex and looks as follows: we are given data with multiple attributes coming from several sources, where each source is modeled as a (multivariate) Gaussian distribution.

We then iteratively estimate the following parameters:

  • Prior (Mixing Coefficient): Estimate what percentage of instances came from each source. This represents the probability that a randomly selected instance belongs to each source.
  • Mean: Estimate the expected value of each attribute for each source. This is the average value of the attribute for instances that are most likely to come from each source.
  • Covariance: Estimate how correlated the different attributes are in each source. This measures the degree to which the attributes vary together for instances from each source.

These estimates are based on the current guess of the source for each instance, which is updated iteratively as the algorithm progresses.
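As an illustration (with made-up 2-D data, not taken from the article's notebook), fitting a two-component model with the same sklearn class exposes exactly these three estimates:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up 2-D data drawn from two multivariate Gaussian sources
rng = np.random.default_rng(0)
source_a = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=1_000)
source_b = rng.multivariate_normal([5, 5], [[1.0, -0.3], [-0.3, 2.0]], size=1_500)
X = np.vstack([source_a, source_b])

model = GaussianMixture(n_components=2).fit(X)

print(model.weights_)       # priors / mixing coefficients, one per source
print(model.means_)         # per-source mean of each attribute
print(model.covariances_)   # per-source covariance between the attributes
```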

Applications of EM Algorithm

The latent variable model has several real-life applications in Machine learning:

  • Used to estimate the parameters of Gaussian (mixture) densities.
  • Helpful for filling in missing data in a sample.
  • Used for finding the values of latent variables.
  • Used for image reconstruction in medicine and structural engineering.
  • It finds plenty of use in domains such as Natural Language Processing (NLP), Computer Vision, etc.
  • Used for estimating the parameters of Hidden Markov Models (HMM) and of other mixture models such as Gaussian Mixture Models.

Advantages and Disadvantages of EM algorithm

Advantages

  • The two basic steps of the EM algorithm, i.e. the E-step and the M-step, are often straightforward to implement for many machine learning problems.
  • The solution to the M-step often exists in closed form.
  • The likelihood is guaranteed not to decrease after each iteration.

Disadvantages

  • It has slow convergence.
  • It is sensitive to the starting point and converges only to a local optimum.
  • It cannot discover the number of clusters K on its own (the likelihood keeps growing as the number of clusters increases).
  • It takes both forward and backward probabilities into account, in contrast to numerical optimization, which considers only the forward probabilities.

References:

Expectation Maximization: how it works — https://youtu.be/iQoXFmbXRJA

Computational Statistics in Python: Expectation Maximization (EM) Algorithm — https://people.duke.edu/~ccc14/sta-663/EMAlgorithm.html

Unsupervised learning: Clustering: Gaussian Mixture Models (GMM) — https://www.python-course.eu/expectation_maximization_and_gaussian_mixture_models.php

An Introduction to Hidden Markov Models and Bayesian Networks — http://mlg.eng.cam.ac.uk/zoubin/papers/ijprai.pdf

Thanks for reading. If you have any feedback, please feel free to reach out by commenting on this post or messaging me on LinkedIn.
