Machine Learning

Detailed Mathematical Derivation of the EM Algorithm

Detailed mathematical derivation of the EM algorithm and an implementation from scratch in Python

Yuki Shizuya
Intuition


Introduction

In the machine learning field, we often encounter data without labels. In such a situation, we cannot use supervised learning techniques such as decision trees, deep learning, support vector machines, etc. How do we treat such a situation? Unsupervised learning is one answer. It is used to understand the overview of the data, and there are some common tasks in unsupervised learning: (1) clustering, (2) anomaly detection, and (3) learning latent variable models. The EM algorithm is one of the common algorithms for learning latent variable models. Furthermore, it can also be applied to clustering. It sounds so convenient! However, there’s a pitfall in understanding the EM algorithm: it is pretty complex. But fear not; in this blog, I will guide you through the EM algorithm with detailed mathematical derivations and a Python implementation from scratch.

Table of Contents

  1. The overview of the EM algorithm
  2. Detailed mathematical derivations of the EM algorithm
  3. Python implementation of the EM algorithm

1. The overview of the EM algorithm

The expectation-maximization (EM) algorithm is a powerful tool. It’s designed to find the maximum likelihood (or maximum a posteriori (MAP)) estimates of parameters in a statistical model that depends on unobserved latent variables. Now, that might sound a bit technical, so loosely speaking, the EM algorithm can find the most plausible (≈ maximum likelihood) parameters of the statistical model that we define, even when we don’t have labels for the data (≈ unobserved latent variables)! So, it is suitable for unsupervised learning in the machine learning context and can be applied to soft and hard clustering methods if we define probability models to fit the data in advance. The EM algorithm consists of two steps, as follows:

The E-step and M-step of the EM algorithm (image adapted from [2])
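For reference, written in the notation of [2], the two steps are:

E-step (for each i): Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)

M-step: \theta := \arg\max_{\theta} \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}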

As you can see, the formulas look complicated. In the E-step, we calculate the probability of the latent variables using the pre-defined distribution and the current parameters. Then, in the M-step, we re-estimate the parameters using the evidence lower bound (ELBO). To understand these formulas, we will go through the mathematical derivations of the EM algorithm in detail in the next section.

2. Detailed mathematical derivations of the EM algorithm

Let’s derive the EM algorithm mathematically. Suppose we have data consisting of n independent examples, the parameter 𝜽 of the statistical model that we fit to the data, and a latent variable z. The density of x can be obtained by marginalizing over the latent variable z. We can define these settings as follows.

The symbols and notations we use for the derivation of EM algorithm

p(x ; 𝜽) means the probability of random variable x under the parameter 𝜽, and a latent variable can be thought of as a label for data. Since we want to fit the parameter 𝜽 by maximizing the log-likelihood of the data, we can define the objective function as follows.

The objective functions
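In this notation, the objective functions read:

\ell(\theta) = \sum_{i=1}^{n} \log p(x^{(i)}; \theta) \tag{1}

\ell(\theta) = \sum_{i=1}^{n} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) \tag{2}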

Formula (2) uses the marginal probability formula to transform formula (1). Then, how do we optimize formula (2)? Generally, finding the maximum likelihood estimates of the parameters is challenging in this setting because formula (2) tends to result in a non-convex optimization problem. In other words, it is generally difficult to derive an analytical solution because of the log(Σ p(x, z; 𝜽)) term: the log of a single density, as in formula (1), is easy to differentiate, while the log of a sum, as in formula (2), is not.

In such a setting, the EM algorithm provides an efficient method for maximum likelihood estimation. Instead of maximizing the likelihood analytically, it maximizes a lower bound obtained from Jensen’s inequality (if you are not familiar with Jensen’s inequality, see the appendix). Let Q be a distribution over z. We can apply Jensen’s inequality, since f(x) = log x is a concave function, as follows:

Transforming the equation (2) using Jensen’s inequality
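Concretely, multiplying and dividing by Q inside the sum and then applying Jensen’s inequality to the concave log gives (the numbering matches the references below):

\ell(\theta) = \sum_{i} \log \sum_{z^{(i)}} Q(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q(z^{(i)})} \tag{4}

\ell(\theta) \geq \sum_{i} \sum_{z^{(i)}} Q(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q(z^{(i)})} \tag{5}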

Now, formula (5) gives a lower bound on log p(x; 𝜽) for any distribution Q. So, if we can make this lower bound tight against log p(x; 𝜽), the parameter 𝜽 will better fit the data. But when does formula (5) become tight? In other words, when do formulas (4) and (5) take the same value? The answer is when the random variable inside the expectation is a constant. To derive it, let c be a constant that does not depend on z, and we can define the formula below.

Denote the random variable as a constant
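That is, equality in Jensen’s inequality requires the ratio inside the expectation to be a constant:

\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q(z^{(i)})} = c \quad \text{where } c \text{ does not depend on } z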

If we choose Q(z) to be proportional to p(x, z; 𝜽), the above equation is easily satisfied. Moreover, since Q(z) is a distribution (so ΣQ(z) = 1), we can rewrite Q(z) as follows.

The derivation of Q(z)
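Choosing Q(z) proportional to p(x, z; 𝜽) and normalizing it so that it sums to one gives:

Q(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)}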

Since any constant c works, we set c to be the marginal probability of the numerator over z for convenience. From this derivation, we can simply set Q(z) to be the posterior distribution of z given x and the parameters 𝜽. We can check that this choice is indeed the posterior using Bayes’ theorem:

The derivation that the lower bound is the same as the posterior probability
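Indeed, by Bayes’ theorem:

Q(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)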

Now, we call the lower bound we derived above the evidence lower bound (ELBO). For your information, the ELBO can also be defined in another form using the Kullback-Leibler divergence. If you are interested, see the appendix.

The ELBO formula and the objective function
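With this choice of Q, the quantity we maximize, and its relationship to the log-likelihood, are:

\mathrm{ELBO}(Q, \theta) = \sum_{i} \sum_{z^{(i)}} Q(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q(z^{(i)})}, \qquad \ell(\theta) \geq \mathrm{ELBO}(Q, \theta)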

We finally get the ELBO equation we want to maximize. Again, if we make the ELBO tight against the objective function (the log-likelihood) and then maximize it, the objective function also becomes bigger. The illustration is as follows:

The illustration of the relationship between the objective function and ELBO

So, the objective function and the ELBO increase in the same direction; if we can maximize the ELBO, we also move the parameters towards maximizing the objective function. Since we don’t know the actual parameters, we start from randomly initialized parameters and then update them in the direction that maximizes the ELBO while fixing the distribution Q(z). It sounds good.

Next problem: how do we calculate the ELBO? From formula (15), if we choose the distribution Q appropriately, we can calculate the ELBO using the data and the current parameters. This is the E-step. In the M-step, we re-estimate the parameters 𝜽 based on the result of the E-step. We repeat the following process until convergence.

EM algorithm

Since the EM algorithm always chooses parameters that never decrease the log-likelihood, we can show that the algorithm converges as follows:

The derivation of the EM algorithm’s convergence
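The core of the argument is the following chain, where Q^{(t)} is chosen in the E-step so that the bound is tight at 𝜽^{(t)}:

\ell(\theta^{(t+1)}) \;\geq\; \sum_{i} \sum_{z^{(i)}} Q^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q^{(t)}(z^{(i)})} \;\geq\; \sum_{i} \sum_{z^{(i)}} Q^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q^{(t)}(z^{(i)})} \;=\; \ell(\theta^{(t)})

So the log-likelihood never decreases from one iteration to the next.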

In summary, we first initialize the parameters randomly and then optimize them gradually.

Now that we understand the procedure of the EM algorithm, let’s derive the parameter updates using a concrete distribution. If you have any insight into your data, you can choose the distribution you think will work best. If not, the Gaussian Mixture Model (GMM) is the most common choice. A GMM is a probability model that assumes all the data points are generated from a finite number of Gaussian distributions with unknown parameters. We will use the GMM as our example distribution.

GMM illustration adapted from scikit-learn GMM page

GMM has the following parameters.

The parameters used in GMM
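Concretely, the GMM assumes each point is generated by first picking a cluster according to 𝜙 and then sampling from that cluster’s Gaussian:

z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j), \qquad \sum_{j=1}^{k} \phi_j = 1

so the marginal density is p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{j=1}^{k} \phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j).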

For the E-step, we can simply calculate as follows:

E-step using GMM
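In the GMM case, Q(z^{(i)} = j) is simply the posterior cluster probability, which we denote w_j^{(i)}:

w_j^{(i)} := Q(z^{(i)} = j) = \frac{\phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi_l \, \mathcal{N}(x^{(i)}; \mu_l, \Sigma_l)}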

Using the GMM, we can calculate the probability that each data point is assigned to each latent variable (cluster). Next, in the M-step, we need to maximize the ELBO with respect to our parameters by taking derivatives. The formula below is the ELBO when using the GMM.

ELBO derivation when using GMM
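Substituting the GMM density into the ELBO gives:

\sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{w_j^{(i)}}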

The equation below is convenient when we take derivatives of the above equation. We can decompose the log term as follows:

The log term decomposition
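Writing the multivariate Gaussian density out explicitly, the log term splits as follows (d is the data dimension):

\log \frac{\phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{w_j^{(i)}} = \log \phi_j - \frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_j| - \frac{1}{2} (x^{(i)} - \mu_j)^{\top} \Sigma_j^{-1} (x^{(i)} - \mu_j) - \log w_j^{(i)}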

Let’s maximize this formula with respect to each parameter. Fortunately, we can derive the derivative with respect to each parameter analytically.

For the parameter 𝜇:

The parameter mean derivation
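Setting the derivative with respect to 𝜇_j to zero yields the weighted sample mean:

\mu_j = \frac{\sum_{i=1}^{n} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}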

For the parameter 𝜙:

The parameter phi derivation

The 𝜙 values sum to 1 since they represent probabilities (weights), so they have a constraint. To deal with the constraint, we construct the Lagrangian, where β is the Lagrange multiplier.

The parameter phi derivation
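The Lagrangian and the resulting update (the multiplier works out to β = −n) are:

\mathcal{L}(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta \Big( \sum_{j=1}^{k} \phi_j - 1 \Big), \qquad \phi_j = \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}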

Finally, for the parameter Σ:

The parameter Σ derivation
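Setting the derivative with respect to Σ_j to zero (standard matrix derivative identities, e.g. from [4], are handy here) gives the weighted sample covariance:

\Sigma_j = \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^{\top}}{\sum_{i=1}^{n} w_j^{(i)}}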

You can now update the parameters using the above derivations. So, let’s implement the EM algorithm from scratch!

3. Python implementation of the EM algorithm

Let’s recap the steps of the EM algorithm. We will use GMM for the implementation.

  1. Initialize the values of the parameters.
  2. Calculate the probability of each data point being assigned to each cluster given the current parameters.
  3. Re-estimate the parameters based on the probabilities calculated in step 2.
  4. Iterate steps 2 and 3 until convergence.

Let’s implement these steps one by one. We will implement the EM algorithm as a class.

Firstly, let’s generate the data we use. We use data generated from four Gaussian distributions.

import numpy as np
import matplotlib.pyplot as plt


def generate_sample_data(n_samples, means, covariance, weights):
    """
    Generate n data points from a mixture of Gaussians.
    """
    n_clusters, n_dim = means.shape

    data = np.zeros((n_samples, n_dim))
    labels = np.zeros((n_samples,))

    for i in range(n_samples):
        # pick up a cluster_id and generate data from that cluster
        k = np.random.choice(n_clusters, size=1, p=weights)[0]
        x = np.random.multivariate_normal(means[k], covariance[k])
        data[i] = x
        labels[i] = k

    return data, labels

# generate sample data
sample_means = np.array([
    [5, 0],
    [1, 1],
    [0, 5],
    [5, 5]
])

sample_covariance = np.array([
    [[0.7, 0.0], [0.0, 0.7]],
    [[0.5, 0.2], [0.2, 0.5]],
    [[0.6, 0.3], [0.3, 0.6]],
    [[0.5, 0.0], [0.0, 0.5]]
])

sample_weights = [1/5, 2/5, 1/5, 1/5]

# generate samples
n_clusters = 4
X, labels = generate_sample_data(300, sample_means, sample_covariance, sample_weights)

colors = ['red', 'blue', 'yellow', 'green']

for i in range(n_clusters):
    tmp_X = X[labels == i]
    plt.scatter(tmp_X[:, 0], tmp_X[:, 1], label=f'cluster_{str(i)}', color=colors[i])

plt.title('Sample data')
plt.show()
The generated data we use for an implementation

For the first step, we need to initialize the three parameters of the GMM. We assign uniform probabilities to phi, randomly chosen data points to the means, and the whole data’s covariance to each cluster’s covariance. You can implement it as follows.

from scipy.stats import multivariate_normal  # used in the E-step below


class MyGaussianMixtureModel:
    """
    Gaussian mixture model
    estimate parameters using Expectation Maximization

    Parameters
    ----------
    n_clusters: int
        The number of clusters (k)
        The data will be assigned to the most probable cluster

    n_steps: int
        The number of iterations to run the EM algorithm

    threshold: float
        If the difference between the new likelihood and the previous likelihood
        is smaller than this value, stop the calculation

    seed: int
        random state to initialize the parameters
    """
    def __init__(self,
                 n_clusters: int,
                 n_steps: int,
                 threshold: float = 1e-5,
                 seed: int = 42):
        self.n_clusters = n_clusters
        self.n_steps = n_steps
        self.threshold = threshold
        self.seed = seed

        self.log_likelihood_list = []

        self.phi_list = []
        self.mean_list = []
        self.cov_list = []

        np.random.seed(self.seed)

    def initialize(self, X):
        # The data dimension
        n_sample, n_dim = X.shape

        # w: the responsibility of each cluster for each sample
        self.w = np.zeros((n_sample, self.n_clusters))

        ## initialize parameters
        # phi
        # assign each cluster probability uniformly
        self.phi = np.full(self.n_clusters, 1 / self.n_clusters)

        # mu
        # choose random data points as the initial mean of each cluster
        initial_index = np.random.choice(n_sample, self.n_clusters)
        self.means = X[initial_index]

        # covariance
        # assign the whole data's covariance as the initial value for each cluster
        self.cov = np.full((self.n_clusters, n_dim, n_dim), np.cov(X, rowvar=False))

        # for visualization
        self.phi_list.append(self.phi)
        self.mean_list.append(self.means)
        self.cov_list.append(self.cov)

For the second step, we implement the E-step. Since there is some overlap in formula (15), we transform it for computational simplicity as follows:

The derivation for E-step
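In other words, the E-step computes the unnormalized responsibilities 𝜙_j N(x; 𝜇_j, Σ_j), normalizes them row-wise to obtain w, and reports the log-likelihood as a by-product, which is what the code below does:

w_j^{(i)} = \frac{\phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi_l \, \mathcal{N}(x^{(i)}; \mu_l, \Sigma_l)}, \qquad \log\text{-likelihood} = \sum_{i=1}^{n} \log \sum_{j=1}^{k} \phi_j \, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)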

Now, we can calculate it in Python:

def e_step(self, X):
    """
    Expectation step (E-step)

    Parameters
    ----------
    X: np.array
        Given data

    Returns
    -------
    log_likelihood: float
        the log-likelihood calculated with the current parameters
    """
    for j in range(self.n_clusters):
        phi_j = self.phi[j]
        likelihood = multivariate_normal(self.means[j], self.cov[j]).pdf(X)

        # unnormalized responsibility: phi_j * N(x; mu_j, cov_j)
        self.w[:, j] = phi_j * likelihood

    # log-likelihood of the data under the current parameters
    log_likelihood = np.sum(np.log(np.sum(self.w, axis=1)))

    # normalize the responsibilities so that each row sums to 1
    self.w = self.w / self.w.sum(axis=1, keepdims=True)

    return log_likelihood

Then, for the third step, we re-estimate the parameters based on the w calculated in step 2. We can use the following formulas to update the parameters.

The parameter update formulas
def m_step(self, X):
    """
    Maximization step (M-step)

    Parameters
    ----------
    X: np.array
        Given data
    """
    # the effective number of points assigned to each cluster: (n_clusters,)
    phi = self.w.sum(axis=0)

    # update phi (mixing weights)
    self.phi = phi / X.shape[0]

    # update means
    self.means = np.dot(self.w.T, X) / phi.reshape(-1, 1)

    # update covariances
    for j in range(self.n_clusters):
        diff = (X - self.means[j]).T
        self.cov[j] = np.dot(self.w[:, j] * diff, diff.T) / phi[j]

    # for visualization
    self.phi_list.append(self.phi)
    self.mean_list.append(self.means)
    self.cov_list.append(self.cov)

I will show you the full version of the code later. Here is the result of the GMM fitted with the EM algorithm.

The animation of the EM algorithm made by the author

As you can see, the EM algorithm updates the parameters towards the actual distribution. You can reproduce the result using the code below.
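For completeness, here is a minimal sketch of the training loop that ties the methods above together. It assumes the MyGaussianMixtureModel class with the initialize, e_step, and m_step methods shown earlier; the exact convergence check is my choice and may differ from the original full script.

def fit(self, X):
    """
    Run the EM algorithm: alternate the E-step and M-step until convergence.
    (Add this method to MyGaussianMixtureModel.)
    """
    self.initialize(X)

    for step in range(self.n_steps):
        # E-step: compute responsibilities and the current log-likelihood
        log_likelihood = self.e_step(X)

        # M-step: re-estimate phi, means, and covariances
        self.m_step(X)

        self.log_likelihood_list.append(log_likelihood)

        # stop when the log-likelihood no longer improves meaningfully
        if step > 0 and abs(log_likelihood - self.log_likelihood_list[-2]) < self.threshold:
            break

# usage:
# model = MyGaussianMixtureModel(n_clusters=4, n_steps=100)
# model.fit(X)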

Appendix 1: Jensen’s inequality

Jensen’s inequality is the theorem that any convex function f satisfies the following formula:

Jensen’s inequality
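In symbols, for a convex function f and a random variable X:

E[f(X)] \geq f(E[X])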

When f is strictly convex, E[f(X)] = f(E[X]) holds if and only if X = E[X] with probability 1, which means the random variable X is a constant. You can remember this formula by considering the figure below.

The image from [2]

f is a convex function shown by the solid line. X is a random variable with a 50% chance of taking the value a and a 50% chance of taking the value b. Thus, the expected value of X is the midpoint between a and b. As you can see, E[f(X)], the midpoint between f(a) and f(b), lies above f(E[X]), which is exactly Jensen’s inequality. This diagram makes it easy to recall Jensen’s inequality.

For your information, the formula becomes as follows if the function is concave.

The concave version of Jensen’s inequality
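That is, for a concave function f (such as the logarithm used above):

f(E[X]) \geq E[f(X)]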

Loosely speaking, you can see this from the fact that a concave function is a convex function flipped upside down, so the direction of the inequality flips as well.

Appendix 2: ELBO derivation using Kullback-Leibler divergence

The ELBO can be derived in another form using the Kullback-Leibler (KL) divergence. We can derive the formula as follows:

The derivation of ELBO using KL divergence

This form is not typically needed when we use a Gaussian mixture model for the distribution Q(z). On the other hand, this derivation is what allows the EM algorithm to be extended to the variational auto-encoder. You can check [2] for further details.
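For reference, the key identity behind this form is:

\log p(x; \theta) = \underbrace{\sum_{z} Q(z) \log \frac{p(x, z; \theta)}{Q(z)}}_{\mathrm{ELBO}} + \mathrm{KL}\big(Q(z) \,\|\, p(z \mid x; \theta)\big)

Since the KL term is non-negative, the ELBO is a lower bound on log p(x; 𝜽), and the bound is tight exactly when Q(z) equals the posterior p(z | x; 𝜽).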

This is the end of this blog. We have gone through the mathematical derivation of the EM algorithm and an implementation from scratch. Thank you for reading!

References

[1] Sontag, D., Mixture Models & EM algorithm Lecture 21, New York University lecture

[2] Ma, T., Ng, A., Part IX The EM algorithm, Stanford University lecture

[3] Ng, A., Mixtures of Gaussians and the EM algorithm, Stanford University lecture

[4] Peterson, K., Pedersen, M., The Matrix Cookbook
