Expectation Maximisation (EM) explained, part 1
This article gives the derivation of the EM algorithm and describes the two types of EM. In the next part, I will go through a few solved examples.
This post assumes that you know about Maximum Likelihood Estimation (MLE) and Jensen’s inequality.
The EM algorithm is an iterative procedure to compute the MLE of the parameters in the presence of hidden variables. EM stands for Expectation Maximisation and involves two steps:
- E step: calculate the expected value of the log-likelihood of the joint distribution of the observed data and the hidden variables. The expectation over the hidden variables is computed using the previous estimate of the parameters.
- M step: maximise the expected value found above with respect to the parameters to get a new estimate of the parameters.
We alternate between these two steps until the parameters converge (i.e. they change by an amount smaller than some threshold). At the end, we have an estimate of the parameters and of the values of the hidden variables.
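Concretely, the skeleton of the procedure can be sketched as follows. The names run_em, e_step and m_step are placeholders for illustration only; what the two steps actually compute depends on the model and is made precise below.

```python
import numpy as np

def run_em(X, theta0, e_step, m_step, tol=1e-6, max_iter=1000):
    """Generic EM loop (a sketch): alternate E and M steps until the
    parameter estimates stop changing by more than `tol`.

    e_step(X, theta)  should return whatever the M step needs
                      (e.g. posteriors of the hidden variables),
                      computed with the current estimate theta_n.
    m_step(X, expectations)  should return the new parameter estimate.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(X, theta)                               # E step: uses the previous estimate
        theta_new = np.asarray(m_step(X, expectations), dtype=float)  # M step: maximise the expected log-likelihood
        if np.max(np.abs(theta_new - theta)) < tol:                   # parameters have converged
            return theta_new
        theta = theta_new
    return theta
```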
Don’t worry if the above explanation sounds too abstract; we will define everything formally below.
Motivation
In the case of MLE, our goal is to maximise the likelihood of the parameters given the observed data, i.e. maximise L(θ|X) = P(X|θ) (or, equivalently, its logarithm). Here θ denotes the parameters of the true distribution of X, which we are trying to estimate.
But the problem arises when there are hidden variables Z which influence our observed outcomes X. Here X = {x1, x2, …, xn} and Z = {z1, z2, …, zn}; each xi and zi can be a scalar or a vector, and n is the total number of samples. Assume that all the xi's are independent and that zi influences only xi. We want to find the value of zi that was responsible for producing xi. Since zi is unknown to us, it is called a hidden variable. Some examples involving hidden variables:
- In clustering, we want to find the cluster to which a sample belongs. Here, xi is the data sample and zi is its cluster label.
- In speech recognition, the goal is to tell which phoneme was spoken in a small window (usually 25 ms) of the speech signal. Here, xi is the speech signal and zi is the phoneme that was actually spoken. This problem is famously tackled with HMMs, which are trained using EM!
One way to solve the problem is to maximise P(X|θ) with respect to θ, as in MLE. Since X depends on Z, we write P(X|θ) as the sum of P(X, Z=z|θ) over all the values that z can take. Here θ includes the parameters of X|Z (e.g. the mean and variance of the data points given a cluster, or of the speech signal given a phoneme) and also the parameters of Z (namely the prior probability of each possible value of zi), because P(X,Z|θ) = P(X|Z,θ) P(Z|θ). If we now differentiate the log-likelihood and set the derivative to zero, as we do in MLE, we cannot get a closed-form solution: the sum sits inside the logarithm, so the equation for the optimal value of a parameter still involves the parameters themselves (through the posterior over Z). Problems of this kind are usually solved by first guessing the parameters and then improving the guess iteratively, and that is exactly what EM does!
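In symbols, the quantity we would be maximising is the marginal log-likelihood:

\begin{equation*}
\ln P(X\mid\theta) \;=\; \ln \sum_{z} P(X, Z=z\mid\theta)
\;=\; \ln \sum_{z} P(X\mid Z=z,\theta)\, P(Z=z\mid\theta)
\end{equation*}

The summation over z sits inside the logarithm, which is what prevents the derivative from giving a closed-form solution.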
Derivation of EM
This section has been taken from [1].
As stated above, we want to maximise the log-likelihood L(θ|X) = ln P(X|θ) iteratively. Assume that at the nth iteration the estimate of θ is θn. Since we want to maximise L(θ|X), we want the new estimate to satisfy L(θ|X) > L(θn|X); equivalently, we want to maximise the difference L(θ|X) - L(θn|X) = ln P(X|θ) - ln P(X|θn). Introducing the hidden variable Z = z:
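Writing this out step by step (the equations in this section are spelled out here following the presentation in [1]; L(θ) is shorthand for L(θ|X)):

\begin{align*}
L(\theta) - L(\theta_n)
  &= \ln P(X\mid\theta) - \ln P(X\mid\theta_n) \\
  &= \ln\Big(\sum_{z} P(X\mid z,\theta)\,P(z\mid\theta)\Big) - \ln P(X\mid\theta_n) \\
  &= \ln\Big(\sum_{z} P(z\mid X,\theta_n)\,
       \frac{P(X\mid z,\theta)\,P(z\mid\theta)}{P(z\mid X,\theta_n)}\Big) - \ln P(X\mid\theta_n) \\
  &\ge \sum_{z} P(z\mid X,\theta_n)\,
       \ln\frac{P(X\mid z,\theta)\,P(z\mid\theta)}{P(z\mid X,\theta_n)} - \ln P(X\mid\theta_n)
       \qquad (*) \\
  &= \sum_{z} P(z\mid X,\theta_n)\,
       \ln\frac{P(X\mid z,\theta)\,P(z\mid\theta)}{P(z\mid X,\theta_n)\,P(X\mid\theta_n)}
  \;\triangleq\; \Delta(\theta\mid\theta_n)
\end{align*}

In the last step, ln P(X|θn) was pulled inside the sum using the fact that the weights P(z|X,θn) sum to 1.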
At the step marked with (*), Jensen's inequality was used with P(z|X,θn) as the weights. These are valid weights: each is greater than or equal to 0, and they sum to 1 over all values of z.
The triangle above the = sign in the last line means "equal by definition": we are simply defining Δ(θ|θn) to be that quantity, and it should not be read as meaning anything more.
Equivalently, moving L(θn) to the right-hand side:
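\begin{equation*}
L(\theta) \;\ge\; L(\theta_n) + \Delta(\theta\mid\theta_n) \;\triangleq\; l(\theta\mid\theta_n) \tag{2}
\end{equation*}

where the triangle again means "defined as": this defines l(θ|θn) as the right-hand side.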
From eq. (2), we can see that l is bounded above by L. We can also find that l(θn|θn) = L(θn); try to prove this yourself using the definition of Δ. In other words, at θ = θn the two functions touch: l = L.
From these two findings (l is always bounded above by L, and l = L at θ = θn), we can conclude that to maximise L we can instead maximise l: any increase in l above l(θn|θn) is guaranteed to increase L by at least as much. Hence, to update θn, we should maximise l with respect to θ.
Formally, we have:
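\begin{align*}
\theta_{n+1} &= \arg\max_{\theta}\; l(\theta\mid\theta_n) \\
&= \arg\max_{\theta}\;\Big\{ L(\theta_n) + \sum_{z} P(z\mid X,\theta_n)\,
      \ln\frac{P(X\mid z,\theta)\,P(z\mid\theta)}{P(z\mid X,\theta_n)\,P(X\mid\theta_n)} \Big\} \\
&= \arg\max_{\theta}\;\sum_{z} P(z\mid X,\theta_n)\,\ln P(X, z\mid\theta) \\
&= \arg\max_{\theta}\; \mathbb{E}_{Z\mid X,\theta_n}\!\big[\ln P(X, Z\mid\theta)\big] \tag{3}
\end{align*}

In the third line the terms that do not depend on θ have been dropped and P(X|z,θ) P(z|θ) has been rewritten as P(X,z|θ); the last line simply recognises the weighted sum as an expectation over Z given X and θn.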
Equation (3) completely describes the EM algorithm: computing the expectation is the E step, and maximising it with respect to θ is the M step.
At this stage, we might take a step back and ask ourselves what we have gained by maximising l instead of L. The gain is that l takes z into account explicitly, through the posterior P(z|X,θn), which we can compute from the current parameter estimate; this makes the maximisation tractable (often in closed form), and as a by-product the algorithm gives us the values of the hidden variables.
Convergence of EM
We want the likelihood function L(θ) to increase at every iteration. Hence, if we can show that L(θ_{n+1}) ≥ L(θ_n) for every n, then the likelihood values form a non-decreasing sequence, which is the sense in which the EM algorithm converges.
We know from eq. (2) that L(θ) ≥ l(θ|θn) for every θ. Since θ_{n+1} is chosen to maximise l(θ|θn), we have l(θ_{n+1}|θn) ≥ l(θn|θn), and we also showed that l(θn|θn) = L(θn). Putting these together:
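\begin{equation*}
L(\theta_{n+1}) \;\ge\; l(\theta_{n+1}\mid\theta_n) \;\ge\; l(\theta_n\mid\theta_n) \;=\; L(\theta_n)
\end{equation*}

so the likelihood can never decrease from one iteration to the next.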
Two types of EM
The two types of EM are Soft EM and Hard EM. They differ in whether a sample is given some positive probability of coming from every value of z, or is assigned to exactly one value of z. For example, in clustering, z represents the cluster and x the data point: Soft EM assigns some probability that x belongs to each of the clusters, while Hard EM assigns x to a single cluster. (A short code sketch contrasting the two follows the lists below.)
The whole derivation above was for Soft EM. For Hard EM, only equation (3) changes: instead of averaging ln P(X,z|θ) over all values of z with weights P(z|X,θn), we keep only the single value of z that maximises P(z|X,θn). Formally, the update can be written as:
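\begin{align*}
z_n &= \arg\max_{z}\; P(z\mid X,\theta_n) \\
\theta_{n+1} &= \arg\max_{\theta}\; \ln P(X, z_n\mid\theta)
\end{align*}

That is, the E step now just picks the single most probable z under the current estimate, and the M step maximises the log-likelihood of the data with that z plugged in.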
Soft EM:
- P(z|x) ∈ (0, 1)
- Slower to converge, as P(X,z|θ) in eq. (3) has to be calculated for every value of z.
- E.g. GMMs (Gaussian mixture models)
Hard EM:
- P(z|x) ∈ {0, 1}
- Faster to converge, as P(X,z|θ) is not calculated for every value of z, and also because after a few iterations the chosen z for each x changes less and less frequently.
- E.g. K-means
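To make the contrast concrete, here is a minimal sketch of EM for a two-component 1-D Gaussian mixture. The function names and the toy data are made up purely for illustration; a single `hard` flag switches between the soft E step (responsibilities, as in a GMM) and the hard E step (a single argmax assignment, in the spirit of K-means).

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated 1-D Gaussian clusters (made up for illustration)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_1d_mixture(X, K=2, hard=False, n_iter=50):
    # Initial guess for theta = (cluster priors, means, variances)
    pi = np.full(K, 1.0 / K)
    mu = np.linspace(X.min(), X.max(), K)
    var = np.full(K, X.var())
    for _ in range(n_iter):
        # E step: posterior P(z=k | x_i, theta_n) for every sample i and cluster k
        resp = pi * gaussian_pdf(X[:, None], mu, var)          # shape (n, K)
        resp /= resp.sum(axis=1, keepdims=True)
        if hard:
            # Hard EM: keep only the most probable cluster (0/1 assignments)
            resp = (resp == resp.max(axis=1, keepdims=True)).astype(float)
        # M step: re-estimate the parameters from the (soft or hard) assignments
        Nk = resp.sum(axis=0) + 1e-12
        pi = Nk / len(X)
        mu = (resp * X[:, None]).sum(axis=0) / Nk
        var = (resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

print("Soft EM:", em_1d_mixture(X, hard=False))   # GMM-style fit
print("Hard EM:", em_1d_mixture(X, hard=True))    # K-means-style fit
```

With hard assignments the M step reduces to simple per-cluster averages, which is why Hard EM behaves much like K-means.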
In the next part, we will see some solved examples of Soft and Hard EM.