Analytical Bayesian Inference with Conjugate Priors

James Vanneman
Published in Paper Club
May 6, 2018

I have been working a lot with Pyro lately, trying to wrap my head around variational methods and statistical machine learning. Variational Bayes (VB) methods are an excellent way to solve otherwise intractable problems and have revolutionized the field of statistical machine learning. Before VB methods, using Bayesian inference required careful choices of distributions so that inference remained analytically tractable. To build some intuition for VB, I’ll walk through some of the analytically tractable approaches to Bayesian inference. Full code for this post can be seen here.

Outline For Creating a Bayesian Model

There are four main steps to creating a Bayesian model.

  1. Define a model that makes the most sense for how the data might have been created.
  2. Define a prior, i.e. express the parameters of the model as distributions.
  3. Use observations to construct a likelihood function.
  4. Combine the likelihood and the prior to create a posterior distribution.

Step 1: Define a model

John just opened a bakery and needs our help making predictions about how many pies he’ll sell in a day. A Poisson distribution is a good starting point because it is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed time interval. The Poisson distribution takes a single parameter, λ. The probability of a specific data point x can be computed by fixing λ in the equation:

P(x | λ) = λ^x · e^(−λ) / x!

Now, what’s a good value for λ? A really nice property of the Poisson distribution is that λ is its mean. John has a general idea that his mean will be around 4 pies per day. Plugging this value in for λ gives the following pmf.

Histogram for Poisson λ = 4
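Here’s a quick sketch of how this pmf can be plotted. I’m assuming scipy and matplotlib here, which may differ from the full code linked above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

lam = 4                       # John's guess for the average number of pies sold per day
x = np.arange(0, 15)          # possible daily pie counts
pmf = poisson.pmf(x, mu=lam)  # P(x | λ = 4) for each count

plt.bar(x, pmf)
plt.xlabel("pies sold in a day")
plt.ylabel("probability")
plt.title("Poisson pmf, λ = 4")
plt.show()
```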

Step 2: Define a prior

The next step is to define the parameter of our model, λ, as a distribution.

John’s best guess is a mean of 4 (μ = 4), but he isn’t very confident in his estimate, so we’ll reflect that uncertainty in our prior. We’ll choose a gamma distribution to represent the mean; why that’s a good choice will become apparent shortly! We want to reflect our prior belief that the actual mean is around 4, so we’ll choose parameters for the gamma distribution, shape (𝑘) and scale (𝜃), such that the mean of our gamma distribution is 4.

f(λ; 𝑘, 𝜃) = λ^(𝑘−1) · e^(−λ/𝜃) / (𝚪(𝑘) · 𝜃^𝑘)

Definition of the Gamma distribution (𝚪 is the gamma function)

For now we choose 𝑘 = 4 and 𝜃 = 1, because μ = 𝑘𝜃 for the gamma distribution, so this gives a mean of 4 · 1 = 4.

f(λ; 4, 1) = λ^3 · e^(−λ) / 𝚪(4) = λ^3 · e^(−λ) / 6

Gamma distribution with 𝑘 = 4 and 𝜃 = 1

Plotting this gamma distribution

plot of gamma distribution for 𝑘 = 4 and 𝜃 = 1
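A minimal sketch of this prior plot, again assuming scipy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

k, theta = 4, 1                           # shape and scale of the prior on λ
lam = np.linspace(0, 15, 300)
prior = gamma.pdf(lam, a=k, scale=theta)  # Gamma density at each λ

plt.plot(lam, prior)
plt.xlabel("λ (mean pies per day)")
plt.ylabel("density")
plt.title("Gamma prior, k = 4, θ = 1")
plt.show()
```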

Now, the choice of 𝑘 = 4 and 𝜃 = 1 isn’t entirely arbitrary. We want to choose values that encode our uncertainty in our guess: if we are very uncertain, we’ll choose values that make this distribution wider.
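To make that concrete, here’s a small illustration (the alternative parameter pairs are my own made-up choices, not from the post) of priors that share the mean 𝑘𝜃 = 4 but differ in spread:

```python
from scipy.stats import gamma

# Same prior mean (k·θ = 4) but different spreads (variance k·θ²).
# The (16, 0.25) and (1, 4) pairs are made-up examples for comparison.
for k, theta in [(16, 0.25), (4, 1), (1, 4)]:
    dist = gamma(a=k, scale=theta)
    print(f"k={k}, θ={theta}: mean={dist.mean():.1f}, std={dist.std():.2f}")

# All three have mean 4.0, with std 1.0, 2.0 and 4.0 respectively:
# the larger the std, the wider (less confident) the prior.
```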

Step 3: Construct a likelihood function

The likelihood function for a specific observation x is 𝑃(x|λ). For multiple independent observations, you multiply their individual likelihoods together, so the likelihood becomes:

P(X | λ) = ∏ᵢ₌₁ⁿ λ^(xᵢ) · e^(−λ) / xᵢ!

Likelihood for n observations
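In code, the likelihood is just a product of Poisson pmf values over the observed daily counts. A minimal sketch, assuming scipy:

```python
import numpy as np
from scipy.stats import poisson

def likelihood(lam, observations):
    """Product of Poisson pmfs over the observed daily pie counts."""
    return np.prod(poisson.pmf(observations, mu=lam))

observations = [6]                      # the single observed day used below
print(likelihood(4.0, observations))    # ≈ 0.104, i.e. P(x = 6 | λ = 4)
```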

Step 4: Calculate the posterior

Now let’s use Bayes’ theorem to calculate our posterior.

P(λ | X) = P(X | λ) · P(λ) / P(X)

Bayes’ theorem

P(X) is a constant because the values of X are fixed to the observations. The left-hand side reads “the probability of λ given our observations X.” We can ditch the P(X) denominator term temporarily, so long as we rescale the distribution at the end, and say that the left-hand side is proportional to the right-hand side:

P(λ | X) ∝ P(X | λ) · P(λ)

Plugging in our equations, we get a posterior distribution for λ:

P(λ | X) ∝ [∏ᵢ₌₁ⁿ λ^(xᵢ) · e^(−λ) / xᵢ!] · λ^(𝑘−1) · e^(−λ/𝜃) / (𝚪(𝑘) · 𝜃^𝑘)

For the sake of simplicity, let’s say we observe 1 day in which 6 pies are purchased. With n = 1, x₁ = 6, 𝑘 = 4, and 𝜃 = 1, this equation simplifies to:

P(λ | x = 6) ∝ (λ^6 · e^(−λ) / 6!) · (λ^3 · e^(−λ) / 6)

We ensure the distribution integrates to 1 by re-scaling it with some constant C:

P(λ | x = 6) = C · λ^6 · e^(−λ) · λ^3 · e^(−λ)

and combine terms to get the posterior distribution:

P(λ | x = 6) = C · λ^9 · e^(−2λ)
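C can be found numerically. Here’s a small sketch (using scipy.integrate, which the post’s own code may not use) that normalizes the unnormalized posterior:

```python
import numpy as np
from scipy.integrate import quad

# Unnormalized posterior from the derivation above: λ^9 · e^(−2λ)
unnormalized = lambda lam: lam**9 * np.exp(-2 * lam)

area, _ = quad(unnormalized, 0, np.inf)   # integral over all λ
C = 1 / area
print(C)   # ≈ 0.00282, so that C · λ^9 · e^(−2λ) integrates to 1
```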

Conjugate priors

We’ve computed our posterior for λ, and remarkably, we’ve landed back on a Gamma distribution! Let’s see exactly how this is the case. Remember, the gamma distribution is defined as:

f(λ; 𝑘, 𝜃) = λ^(𝑘−1) · e^(−λ/𝜃) / (𝚪(𝑘) · 𝜃^𝑘)

At the beginning of this post, we somewhat arbitrarily chose values for 𝑘 and 𝜃. Now, notice that a choice of 𝑘 = 10 and 𝜃 = 0.5 gives us exactly our posterior:

f(λ; 10, 0.5) = λ^9 · e^(−λ/0.5) / (𝚪(10) · 0.5^10) = λ^9 · e^(−2λ) / (𝚪(10) · 0.5^10)

This just means that in our calculation of the posterior, we chose C to be:

C = 1 / (𝚪(10) · 0.5^10) ≈ 0.00282

Now, this wasn’t an accident; in fact, it’s precisely the reason I chose a gamma distribution as the prior. The Poisson distribution’s conjugate prior is the Gamma distribution, and whenever you choose a conjugate prior, the posterior will always have the same form as the prior.
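We can sanity-check this numerically. A short sketch, assuming scipy, comparing the re-scaled posterior against the Gamma(𝑘 = 10, 𝜃 = 0.5) density:

```python
import numpy as np
from scipy.special import gamma as gamma_fn   # the Γ function
from scipy.stats import gamma                 # the Gamma distribution

C = 1 / (gamma_fn(10) * 0.5**10)              # the re-scaling constant derived above
posterior = lambda lam: C * lam**9 * np.exp(-2 * lam)

for lam in [2.0, 4.5, 7.0]:                   # arbitrary test points
    print(posterior(lam), gamma.pdf(lam, a=10, scale=0.5))   # the two values match
```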

If we step back for a moment, we can see that by using a conjugate prior, we generate our posterior simply by updating the parameters of our prior, reflecting a new mean and confidence level that depend on the amount of observed data. For Poisson observations x₁, …, xₙ, the update is 𝑘 → 𝑘 + Σxᵢ and 𝜃 → 𝜃 / (n𝜃 + 1). As we observe more data points, 𝑘 and 𝜃 are updated in such a way as to shrink the width of our posterior, indicating an increased level of confidence in our distribution.
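Here’s a minimal sketch of that update rule; the week of observations is made-up data purely to show the posterior tightening:

```python
from scipy.stats import gamma

def posterior_params(k, theta, observations):
    """Conjugate Gamma update for Poisson data: returns the posterior shape and scale."""
    n = len(observations)
    return k + sum(observations), theta / (n * theta + 1)

# One observed day of 6 pies reproduces the posterior above:
print(posterior_params(4, 1, [6]))              # (10, 0.5)

# A (made-up) week of observations tightens the posterior considerably:
week = [6, 3, 5, 4, 7, 4, 5]
k_post, theta_post = posterior_params(4, 1, week)
print(gamma(a=4, scale=1).std())                # prior std = 2.0
print(gamma(a=k_post, scale=theta_post).std())  # posterior std ≈ 0.77
```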

Plotting our new posterior distribution:
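A sketch of the prior/posterior comparison plot, assuming scipy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

lam = np.linspace(0, 15, 300)
plt.plot(lam, gamma.pdf(lam, a=4, scale=1), label="prior: Gamma(k=4, θ=1)")
plt.plot(lam, gamma.pdf(lam, a=10, scale=0.5), label="posterior: Gamma(k=10, θ=0.5)")
plt.xlabel("λ (mean pies per day)")
plt.ylabel("density")
plt.legend()
plt.show()
```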
