Decoding the Matrix: A Journey into the Algorithms of Recommender Systems — Part 2: Bayesian Ranking Algorithms

Abelardo Riojas
6 min read · Jun 22, 2023


Prior belief: Neo can’t stop bullets. Evidence: He can. Posterior: Maybe he can, maybe he can’t!

The Explore-Exploit Dilemma

Imagine you walk into a casino wanting to maximize the amount of money you win. The slot machines all look identical, but each one has a different probability of paying out a win (1) versus a loss (0). How do we judge each slot machine effectively while also playing the good ones as much as possible?

If we play a machine too few times, our estimate of its win probability is no good (a fat confidence interval). But if we play a bad machine too many times, we lose money. So how do we balance exploring the machines against exploiting the ones that look best?

The Bayesian Approach

Much like our slot machines, imagine we have a series of Bernoulli random variables which are independent and identically distributed.

Bernoulli distribution for x with parameter pi: $p(x \mid \pi) = \pi^{x} (1 - \pi)^{1 - x}, \quad x \in \{0, 1\}$

Under the frequentist view, our estimate for pi would be the maximum likelihood estimate, i.e. the sample mean of the data: $\hat{\pi}_{\text{MLE}} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
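
In code, the MLE is nothing more than the mean of the observed 0/1 outcomes (a trivial illustration with made-up data):

import numpy as np

x = np.array([1, 0, 0, 1, 1])  # five made-up Bernoulli outcomes
pi_mle = x.mean()              # maximum likelihood estimate of pi = 0.6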

In Bayesian statistics, we treat pi as a random variable as well, instead of treating it as a fixed but unknown constant. A natural choice for pi's probability distribution is the Beta distribution, since a Beta random variable takes on values between 0 and 1, exactly like a probability.

Beta distribution for pi with alpha and beta parameters: $p(\pi) = \frac{\pi^{\alpha - 1} (1 - \pi)^{\beta - 1}}{B(\alpha, \beta)}$

The way we derive this expression is from Bayes' rule: $p(\pi \mid X) = \frac{p(X \mid \pi)\, p(\pi)}{p(X)}$

What we want is the distribution of pi given our collection of samples X. This distribution is called the posterior. We already know one piece of this equation from MLE: $p(X \mid \pi)$, which is just the product of the individual Bernoulli PMFs. This piece is called the likelihood. The next piece is $p(\pi)$, which represents our prior beliefs. We choose our prior to be Beta(1, 1), which is the uniform distribution on [0, 1], i.e. completely uninformative: all we assume going in is that pi is somewhere between 0 and 1.

In Bayesian statistics, it's common to ignore the normalizing integral at the bottom of Bayes' rule and change the equality to a proportionality. It's also common to drop constants that do not depend on the random variable in the posterior, i.e. the $1 / B(\alpha, \beta)$ factor in our prior.

Doing that, and doing some rearranging, yields the following:

$p(\pi \mid X) \propto \pi^{\sum_i x_i + \alpha - 1} (1 - \pi)^{N - \sum_i x_i + \beta - 1}$

In other words, pi given X is Beta distributed with parameters alpha prime and beta prime:

$\pi \mid X \sim \text{Beta}(\alpha', \beta'), \quad \alpha' = \alpha + \sum_{i=1}^{N} x_i, \quad \beta' = \beta + N - \sum_{i=1}^{N} x_i$

This is our Bayesian update rule. In practice, the posterior from one step becomes the prior for the next step, so we can update one observation at a time. This approach is called "online learning".

We can break this down into steps as follows:

Step 1: Pi starts out as Beta(1,1)

Step 2: Collect our first sample, x_1

Step 3: Pi is now distributed as Beta(1 + x_1, 1 + 1 - x_1)

Step 4: Collect our second sample, x_2

Step 5: Pi is distributed as Beta(1 + x_1 + x_2, 1 + 2 - x_1 - x_2)

And so on…
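
Those steps translate directly into a few lines of Python (a minimal sketch where the Beta parameters are just two integers, updated with two made-up observations):

#start from the uninformative Beta(1, 1) prior
a, b = 1, 1

#observe a win (x = 1) and then a loss (x = 0)
for x in [1, 0]:
    a += x        # alpha' = alpha + x
    b += 1 - x    # beta'  = beta + (1 - x)

print(a, b)  # Beta(2, 2): one win and one loss observed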

As a minor remark, priors that pair with a likelihood such that the posterior has the same type of distribution as the prior are called conjugate priors, and they're very useful. You can find a list of them on Wikipedia!

How does this help us rank items?

The key here is that after a sufficient number of trials, our posterior carries the same kind of information as a confidence interval around our estimate $\hat{\pi}$ (in Bayesian terms, a credible interval). Except we don't actually have to calculate anything separately: it's all encoded in the shape of the posterior distribution.
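
For example (a small illustration, assuming we've seen 20 wins and 80 losses on top of a Beta(1, 1) prior), scipy can read an interval straight off the posterior:

from scipy.stats import beta

#posterior after 20 wins and 80 losses on a Beta(1, 1) prior
a, b = 1 + 20, 1 + 80

low, high = beta.interval(0.95, a, b)  # central 95% interval of the posterior
print(f"95% interval for pi: ({low:.3f}, {high:.3f})")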

This solves a lot of the issues with the explore-exploit dilemma: Bayesian updating allows us to sample from the posteriors, collect more data about the parameters, and then update the distributions based on that data.

Each item in our set, called a bandit, is going to have a posterior distribution attached to it. Drawing one sample from each distribution lets you rank the items randomly according to their posteriors (this is Thompson sampling). Then, once a user clicks or doesn't click an item, we have collected more information to update that item's posterior, causing the distribution to become skinnier and thus more "confident" in its prediction.
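
As a sketch of that ranking step (assuming each catalogue item carries its own alpha and beta counts; the item objects here are hypothetical stand-ins for whatever your items actually look like):

import numpy as np

def rank_items(items):
    #one posterior draw per item, then sort by the draws (largest first)
    return sorted(items, key=lambda item: np.random.beta(item.alpha, item.beta), reverse=True)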

It’s Gaussian Time!

Let's get normal with it. Hopefully we've all seen this pdf a thousand times by now, but just in case:

$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$

The Gaussian has two parameters, mu and sigma, and we have the choice of putting a prior on just the mean, just the variance, or both.

In this example, we'll look at the case where the variance is known, and introduce the precision, $\tau = 1 / \sigma^2$. Our prior for mu is itself Gaussian, with mean m and variance 1/lambda (so lambda is the prior precision): $\mu \sim \mathcal{N}(m, 1/\lambda)$.

Skipping over the derivation of the conjugate prior (you can find it on Wikipedia), we get the following update rule for mu:

$\lambda' = \lambda + N \tau, \qquad m' = \frac{\lambda m + \tau \sum_{i=1}^{N} x_i}{\lambda + N \tau}$

In this case, the posterior is just another normal distribution, with new values m' and 1/lambda' for the mean and variance.

As N increases, the updated lambda grows without bound, which means the variance of the posterior shrinks toward zero (a skinnier distribution).

If N = 0 (no samples), the posterior is just the prior.

And as N increases the posterior’s mean approaches the sample mean.
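
Here's what that could look like as a bandit class, mirroring the Bernoulli version below (a minimal sketch, assuming the data precision tau is known; GaussianBandit and its attribute names are placeholders of mine, not code from the experiment that follows):

import numpy as np

class GaussianBandit:
    #Normal rewards with known precision tau; Normal(m, 1/lambda) prior on mu
    def __init__(self, true_mu, tau=1.0):
        self.true_mu = true_mu
        self.tau = tau     # known data precision, 1 / sigma^2
        self.m = 0.0       # prior mean
        self.lam = 1.0     # prior precision
    #simulate a reward from the true (unknown to us) distribution
    def pull(self):
        return np.random.randn() / np.sqrt(self.tau) + self.true_mu
    #one draw of mu from the current posterior
    def sample(self):
        return np.random.randn() / np.sqrt(self.lam) + self.m
    #conjugate update for a single observation (N = 1)
    def update(self, x):
        self.m = (self.lam * self.m + self.tau * x) / (self.lam + self.tau)
        self.lam += self.tau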

Bernoulli Bayesian Bandit Code

Necessary imports, defining globals, and making a class for each bandit with parameters pi, alpha and beta.

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

TRUE_PROBS = [.4,.5,.6]
NUM_TRIALS = 2000

class Bandit:
    #init the state of each bandit
    def __init__(self, pi):
        self.pi = pi
        self.alpha = 1
        self.beta = 1
    #actually pulling the slot machine
    def pull(self):
        return np.random.random() < self.pi
    #getting a sample from the posterior
    def sample(self):
        return np.random.beta(self.alpha, self.beta)
    #online update
    def update(self, x):
        self.alpha += x
        self.beta += 1 - x

A simple plotting function

#simple plotting function
def plot(bandits, trial):
    x = np.linspace(0, 1, 200)
    for bandit in bandits:
        y = beta.pdf(x, bandit.alpha, bandit.beta)
        plt.plot(x, y, label=f"Real pi = {bandit.pi}")

    plt.title(f"Bandit distribution after {trial} trials")
    plt.legend()
    plt.show()

And the experiment itself

bandits = [Bandit(p) for p in TRUE_PROBS]

plotting_points = [5, 10, 100, 200, 1000, 1500, 2000]

for i in range(NUM_TRIALS):
    best_bandit = None
    max_sample = -1

    #take one draw from each posterior and keep the bandit with the largest draw
    for bandit in bandits:
        sample = bandit.sample()
        if sample > max_sample:
            max_sample = sample
            best_bandit = bandit

    if i in plotting_points:
        plot(bandits, i)

    #pull the arm of the best bandit and update its posterior
    x = best_bandit.pull()
    best_bandit.update(x)

Let’s look at some plots and then finish this up!

As you can see, we get better and better estimates of the true value of pi for each slot machine by exploring and exploiting the posterior distributions of each bandit.
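
If you want a quick numerical summary instead of plots, a small follow-up like this (not part of the original script) reads the posterior mean and the number of pulls straight off each bandit's alpha and beta:

for bandit in bandits:
    pulls = bandit.alpha + bandit.beta - 2          # each pull adds exactly 1 to alpha or beta
    post_mean = bandit.alpha / (bandit.alpha + bandit.beta)
    print(f"true pi = {bandit.pi}: pulled {pulls} times, posterior mean = {post_mean:.3f}")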

Catch me in part 3!


Abelardo Riojas

Musical-data addict and Master of Data Science candidate at The University of Texas at Austin!