Breakfast with AI

Srivatsan Ramesh
10 min read · Oct 9, 2018

Chapter-3

Collaborative Filtering using EM (Expectation-Maximization) Algorithm

Hey Guys, welcome to the third chapter of the series. I hope you enjoyed my previous articles. If you haven’t read them, please refer to this link. Happy Reading :) !!!

In this article, we will discuss in detail a very important algorithm that is commonly used in recommendation systems: the Expectation-Maximization algorithm. We will also see how it helps companies recommend the right products to consumers for a seamless service experience.

We will be going through the following topics in this article:

1. Why Expectation-Maximization?

2. What is the EM Algorithm? (Mathematical Formulation)

3. Collaborative Filtering using “item based” EM Algorithm

Why Expectation-Maximization?

Let’s take an example to understand why we have to use the Expectation-Maximization algorithm.

Assume that you and your friend each have a coin, Coin A and Coin B. In each round, one of the coins is picked and tossed 10 times, and you keep count of the number of heads and tails. To estimate the bias of a coin, you total the heads and tails across its tosses and compute the fraction of heads. In the example above, Coin A was tossed 30 times (24 heads and 6 tails) and Coin B was tossed 20 times (9 heads and 11 tails), so their estimated biases are 0.80 and 0.45 respectively.

This is an example of maximum likelihood estimation. You are given the order in which the coins are tossed and how many times each coin was tossed and the resulting outcomes too. This is a very simple estimation.
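In symbols, the estimate is just the fraction of heads observed for each coin:

```latex
\hat{\theta}_A = \frac{24}{24 + 6} = 0.80, \qquad
\hat{\theta}_B = \frac{9}{9 + 11} = 0.45
```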

Imagine a scenario where the outcomes of the tosses are known but you are not aware of which coin produced each set of tosses. The outcomes displayed in the picture above could come from Coin A, from Coin B, or from a mix of both. How do we estimate the bias of the coins in this scenario? It’s hard to apply maximum likelihood estimation directly, as the outcomes belonging to Coin A and Coin B are not explicit.

Welcome to the real world! You can never expect perfect data like the coin example explained above. Some parameters will be missing, and we need to make intelligent estimations.

Every Netflix user has come across the recommendation panel where Netflix suggests shows to watch based on what you have watched before. In this scenario, how does Netflix predict what users will like? Some users have watched many shows, while others have hardly watched a couple. How does Netflix ensure that it provides smart recommendations to all types of users?

The problems discussed above are typical use cases of the Expectation-Maximization algorithm. It is an algorithm for maximizing a likelihood function when some of the variables in your model are unobserved. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin [link].

What is the Expectation-Maximization Algorithm?

We will discuss the mathematical formulation of the EM algorithm, followed by a solution to the problems we have analysed above.

Let’s assume that we have data D which consists of two parts:

  1. The observed data X
  2. The latent data Z (data that is not observed directly but is inferred from other variables)

The likelihood of this data D is the joint likelihood of X and Z given the model parameters theta. The marginal likelihood of the observed data X is obtained by summing this joint likelihood over all possible values of Z:
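In standard notation (a sketch, with theta denoting the model parameters):

```latex
p(D \mid \theta) = p(X, Z \mid \theta), \qquad
p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)
```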

If Z is continuous, the summation becomes an integral. The maximum likelihood estimate (MLE) of theta given X is the value of theta that maximizes this marginal likelihood.

Theta is not a single value but a set of parameters (such as the mean and variance of a distribution) whose values maximize the likelihood. We work with the log-likelihood because the log is a monotonically increasing function, so the theta that maximizes a function also maximizes the log of that function.
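In symbols, the quantity being maximized is (a sketch):

```latex
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} \; \log p(X \mid \theta)
```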

The EM Algorithm is as follows:

  1. Initialize theta_0 to some random value, or use a problem-specific heuristic for the initial guess
  2. Expectation Step — Compute the posterior distribution of Z given X and theta(t-1) as

This is also defined as the posterior distribution of Z at the t-th iteration.

3. Maximization Step — Solve for the value of theta that maximizes the expected log-likelihood with respect to the posterior distribution computed above

4. The iterations stop when convergence is achieved.
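Putting the two steps into symbols (a sketch in standard EM notation):

```latex
\text{E step:}\quad q_t(Z) = p\big(Z \mid X, \theta^{(t-1)}\big)
\qquad\qquad
\text{M step:}\quad \theta^{(t)} = \arg\max_{\theta} \; \mathbb{E}_{q_t(Z)}\big[\log p(X, Z \mid \theta)\big]
```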

One of the most common applications of the EM algorithm is in estimating finite mixture models. A finite mixture model is composed of K component models with parameters theta_1, ..., theta_K, together with a prior distribution over these components, denoted by a. The density function of the finite mixture model is the a-weighted sum of the component densities, where the weights are non-negative and sum to one.
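Written out in standard mixture-model notation (a sketch):

```latex
\theta = \{\theta_1, \dots, \theta_K\}, \qquad
p(x \mid a, \theta) = \sum_{k=1}^{K} a_k \, p_k(x \mid \theta_k),
\qquad a_k \ge 0, \;\; \sum_{k=1}^{K} a_k = 1
```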

If each component of the model is Gaussian, it’s called a Gaussian mixture model. The procedure to generate a sample x from a finite mixture model is as follows:

1. Draw an indicator variable z from the prior distribution a. Here z can take any value from 1 to K (the number of component models).

2. Given the value of z, draw x from the z-th component model.

Given a set of samples x, we treat z as a latent indicator variable, associating each sample with the component that generated it.
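Here is a minimal Python sketch of this two-step sampling procedure; the weights, means and standard deviations below are made-up values purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-component Gaussian mixture.
a = np.array([0.5, 0.3, 0.2])        # prior over components (sums to 1)
means = np.array([-2.0, 0.0, 3.0])   # component means
stds = np.array([0.5, 1.0, 0.8])     # component standard deviations

def sample_mixture(n):
    # Step 1: draw the indicator z from the prior a.
    z = rng.choice(len(a), size=n, p=a)
    # Step 2: given z, draw x from the z-th Gaussian component.
    x = rng.normal(means[z], stds[z])
    return x, z

x, z = sample_mixture(1000)
```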

In order to estimate the model parameters a and theta, we apply the EM algorithm iteratively until we reach parameter values that maximize the likelihood. We assume independence between the samples drawn from the distribution.

The steps are as follows:

  1. Compute the log-likelihood:

The log-likelihood here is the joint log-likelihood of the observed samples and their latent indicators: each sample x_i contributes the log of its component prior a_{z_i} multiplied by the density of the z_i-th component.

In order to make this formulation easier to maximize, we need to account for the indicator variables z. The indicator function for z can be defined as

The log-likelihood can be simplified as
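A sketch of the joint log-likelihood, the indicator function, and the simplified form (standard notation, i.i.d. samples x_1, ..., x_N):

```latex
\log p(X, Z \mid a, \theta) = \sum_{i=1}^{N} \log\!\big[a_{z_i}\, p_{z_i}(x_i \mid \theta_{z_i})\big],
\qquad
I_k(z_i) =
\begin{cases}
1 & \text{if } z_i = k \\
0 & \text{otherwise}
\end{cases}

\log p(X, Z \mid a, \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} I_k(z_i)\,\big[\log a_k + \log p_k(x_i \mid \theta_k)\big]
```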

2. Expectation Step:

Given the model parameters M and x, the indicator variable for each sample is independent of the indicators of the other samples; therefore we have

From Bayes rule, we get

This is the posterior distribution over the component models given the data and the parameters. Since the model parameters are updated at each step, let’s define

We have
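Putting the Bayes-rule posterior and the definition of q together (a sketch, with q_t denoting this posterior at iteration t):

```latex
q_t(z_i = k) \;\equiv\; p\big(z_i = k \mid x_i, a^{(t)}, \theta^{(t)}\big)
= \frac{a_k^{(t)}\, p_k\big(x_i \mid \theta_k^{(t)}\big)}
       {\sum_{j=1}^{K} a_j^{(t)}\, p_j\big(x_i \mid \theta_j^{(t)}\big)}
```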

3. Maximization Step:

The expectation of the log-likelihood with respect to the posterior distribution of the latent variables, q(Z), is

Since the value of Ik(zi) is either 0 or 1,

The probability that the indicator of a component equals 1 is exactly the posterior probability of that component. Therefore, the expected log-likelihood can be written as
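A sketch of the resulting objective, which is maximized over a and theta in the M step:

```latex
\mathbb{E}_{q_t(Z)}\big[\log p(X, Z \mid a, \theta)\big]
= \sum_{i=1}^{N} \sum_{k=1}^{K} q_t(z_i = k)\,\big[\log a_k + \log p_k(x_i \mid \theta_k)\big]
```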

This approach is very similar to maximum likelihood estimation, except that the parameters are modified iteratively until they converge.

Let’s get back to the coin example. Now how do we apply EM Algorithm here to get a good estimate of biases?

Here we start with an initial bias of 0.6 for coin A and 0.5 for coin B. Since each toss follows a Bernoulli distribution, the likelihood of a sequence with h heads and t tails under a bias of theta is proportional to theta^h * (1 − theta)^t.

We need to normalize these likelihoods to get the relative responsibility of the two coins for a given sequence of tosses. Let’s take the first scenario: it has 5 heads and 5 tails. Therefore,
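Plugging in the first row’s counts, with P_A and P_B denoting the unnormalized likelihoods under each coin (a worked sketch; the binomial coefficient cancels when we normalize):

```latex
P_A \propto 0.6^{5}\,0.4^{5} \approx 7.96 \times 10^{-4}, \qquad
P_B \propto 0.5^{10} \approx 9.77 \times 10^{-4}
```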

Normalizing the probabilities, we get 0.45 and 0.55 in the first row for the ratio of occurrences of coin A and coin B respectively.

So,

0.45 * (5 Heads, 5 Tails) = (2.2 Heads, 2.2 Tails)

0.55 * (5 Heads, 5 Tails) = (2.8 Heads, 2.8 Tails)

This gives us a sense of the contribution of coins A and B to the first outcome. We can fill up the whole table in the same fashion for all the rows. The step discussed above is similar to the Expectation step, where we compute the expected number of heads and tails attributable to each coin.

For each row, we calculate the expected number of heads and tails attributed to each coin. Summing them up gives (21.3 heads, 8.6 tails) for coin A and (11.7 heads, 8.4 tails) for coin B. In the maximization step we re-estimate the bias of each coin from these expected counts, which gives biases of 0.71 and 0.58 for coins A and B respectively. This estimate is fed as input to the next iteration, and the process continues until convergence. After 10 iterations, the values converge to roughly 0.80 for coin A and 0.52 for coin B.
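Below is a minimal Python sketch of this two-coin EM. The per-round head counts (5, 9, 8, 4, 7 heads out of 10 tosses each) are an assumption taken from the classic figure this walkthrough follows; they are consistent with the totals quoted earlier (24 + 9 = 33 heads in 50 tosses, with 5 heads and 5 tails in the first round):

```python
import numpy as np
from scipy.stats import binom

# Assumed per-round head counts (consistent with the totals above:
# 33 heads out of 50 tosses, first round = 5 heads / 5 tails).
heads = np.array([5, 9, 8, 4, 7])
n = 10  # tosses per round

theta_a, theta_b = 0.6, 0.5  # initial bias guesses

for _ in range(10):
    # E step: responsibility of each coin for each round of tosses.
    like_a = binom.pmf(heads, n, theta_a)
    like_b = binom.pmf(heads, n, theta_b)
    w_a = like_a / (like_a + like_b)
    w_b = 1.0 - w_a

    # Expected head/tail counts attributed to each coin.
    heads_a, tails_a = (w_a * heads).sum(), (w_a * (n - heads)).sum()
    heads_b, tails_b = (w_b * heads).sum(), (w_b * (n - heads)).sum()

    # M step: re-estimate each coin's bias from its expected counts.
    theta_a = heads_a / (heads_a + tails_a)
    theta_b = heads_b / (heads_b + tails_b)

print(round(theta_a, 2), round(theta_b, 2))  # roughly 0.8 and 0.52
```

Running this should reproduce the numbers in the walkthrough: responsibilities of about 0.45/0.55 for the first row, biases of roughly 0.71 and 0.58 after the first iteration, and convergence to about 0.80 and 0.52.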

This is how an unstructured problem with very few leads can be solved using the Expectation-Maximization algorithm.

Collaborative Filtering using the Expectation-Maximization Algorithm

I hope you now have a sense of the Expectation-Maximization algorithm and its use cases. In this section, we will discuss how Expectation-Maximization helps generate good recommendations for users (on Amazon, Netflix and so on).

This image is a typical example of the data we have. We need to fill in the missing values using the EM algorithm.

For this algorithm, we classify the movies into G groups, and each movie m belongs to group g with probability Qm(g). We assume that the ratings from different users are Gaussian and independent, since the rating of User 1 for a particular movie does not depend on the rating of User 2 for that movie. Therefore, the distribution is as follows,

Where R(u,m) is the rating of user u for movie m.
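A sketch of one concrete form of this distribution, assuming a per-user, per-group Gaussian with mean mu_{u,g} and variance sigma^2_{u,g} (my notation for illustration, not necessarily the author’s exact parameterization):

```latex
P\big(R(u,m) \mid G_m = g\big) = \mathcal{N}\big(R(u,m);\, \mu_{u,g},\, \sigma_{u,g}^{2}\big)
```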

This conditional independence assumption is also called the Naïve Bayes approach. The Gaussian assumption simplifies the maximization step. Let’s define a latent variable Gm that denotes the group of movie m, and let U(m) be the set of users who have rated movie m. Thus, the Expectation step is as follows:
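A sketch of this E step, assuming a prior p_g over the groups and the Gaussian parameterization sketched above (again, my notation):

```latex
Q_m^{(t+1)}(g) \;=\; P\big(G_m = g \mid \{R(u,m)\}_{u \in U(m)}\big)
\;\propto\; p_g^{(t)} \prod_{u \in U(m)} \mathcal{N}\big(R(u,m);\, \mu_{u,g}^{(t)},\, \sigma_{u,g}^{2\,(t)}\big)
```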

This is the expectation step: starting from randomly initialized parameters, we compute for each movie the probability of being assigned to group g at the (t+1)-th iteration. The maximization step is as follows:
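And a sketch of the corresponding M-step updates under the same assumed parameterization, where M is the total number of movies (the variance update follows the same weighted-average pattern):

```latex
p_g^{(t+1)} = \frac{1}{M} \sum_{m} Q_m^{(t+1)}(g), \qquad
\mu_{u,g}^{(t+1)} = \frac{\sum_{m:\, u \in U(m)} Q_m^{(t+1)}(g)\, R(u,m)}{\sum_{m:\, u \in U(m)} Q_m^{(t+1)}(g)}
```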

Here we maximize the likelihood and get a parameter update for all the parameters.

We continue the process until convergence and we make predictions using
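Under the same assumptions, the prediction is the Q_m-weighted average of the user’s per-group means (a sketch):

```latex
\hat{R}(u,m) = \sum_{g=1}^{G} Q_m(g)\, \mu_{u,g}
```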

We predict the ratings of users for movies they haven’t watched and rank the movies in decreasing order of predicted rating to show the right recommendations.

I hope you enjoyed this article on Expectation-Maximization in building efficient recommendation systems. Please subscribe for more interesting articles in the future.
