MUST READ: Clueless Individual Attempts to Explain Basic PyMC3 Model. You Won’t Believe What Happens Next.

Tiger Shen
Published in Paper Club · 5 min read · Mar 11, 2018

For background on why all of a sudden there’s Bayesian content on this blog, see here.

N.B.: this isn’t meant as an introduction to Bayesian methods or PyMC3. It was originally composed as a memo for myself at ~ Ch 4 of the book Bayesian Methods for Hackers, and thus makes assumptions about concepts and knowledge up to that point. I would highly recommend the book if any of this seems interesting :)

There have been many many (many many many) confusing parts of PyMC3 and Bayesian modeling in my short experience with these tools and concepts. Among the most challenging has been conceptualizing how observed data is added to the model, and relating this to where the MCMC (Markov Chain Monte Carlo; here’s the best introduction I know of) algorithm hooks in to sample from the resulting posterior distribution.

Initially, I got hung up on comparisons to mini-batch SGD (Stochastic Gradient Descent; if you’re unfamiliar with this, maybe skim here for the basic idea) in deep neural networks. Since the majority of my exposure to machine learning has been through neural nets, this is the first example of a training process that my mind reached for. In mini-batch SGD, small batches of samples are fed to the model one at a time, the loss is calculated, and the error is backpropagated through the network as part of one sequential "training step".

However, this mental model does not fit in a Bayesian context, where the construction of the item of interest and the algorithm to “discover” that item are distinct. Specifically, after the prior distribution is warped by the likelihood to form the posterior, without any further action that posterior is completely opaque to us: with multiple dimensions, it becomes computationally intractable to derive a closed-form expression for the posterior distribution. Contrast this to a DNN, where the training process can be stopped or continued at any point and the network remains as "whole" as at any other point. No further steps need to be taken in order to receive output from the model.

I brought this up as part of a larger discussion in our Slack group:

One of James Vanneman's comments set off a lightbulb for me:

Just want to clarify here, the sampling in MCMC doesn’t create the posterior, your dataset does. The sampling only returns you points from your posterior.

So: the posterior distribution exists after the observed data has been incorporated into the model. However, due to the curse of dimensionality this posterior is inaccessible analytically. In order to approximate the posterior and do anything useful with your model, you must perform MCMC.
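To make James’ point concrete, here’s a toy random-walk Metropolis sampler in plain numpy (my own sketch, not the book’s code or PyMC3’s implementation), for a hypothetical post with 99 upvotes and 1 downvote. The unnormalized posterior density exists as a formula the moment the data is fixed; the sampler never creates it, it only returns points distributed according to it:

```python
import numpy as np

# A 99-upvote / 1-downvote post, with a flat prior on the true upvote ratio p.
# The (unnormalized) posterior density exists as soon as the data is fixed:
def unnorm_posterior(p):
    return p**99 * (1 - p) if 0 < p < 1 else 0.0  # flat prior (1) x likelihood

# Random-walk Metropolis: wander around, keeping moves in proportion to how
# much posterior density they land on.
rng = np.random.default_rng(42)
current, samples = 0.5, []
for _ in range(20000):
    proposal = current + rng.normal(0, 0.05)
    # Accept with probability min(1, density ratio); otherwise stay put
    if rng.random() < unnorm_posterior(proposal) / unnorm_posterior(current):
        current = proposal
    samples.append(current)

samples = np.array(samples[5000:])  # discard the noisy early samples (burn-in)
```

The retained samples concentrate around 0.98, which matches the Beta(100, 2) posterior you can work out analytically for this case.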

With this in mind, I wanted to understand the PyMC3 API for representing these actions. Let's take a look at a function from Ch 4 of Bayesian Methods for Hackers. The context for this code is the task of generating a distribution for the “true upvote ratio” of a Reddit submission, based on the observed sample of up- and down-votes for that submission. From the text:

One can think of the true upvote ratio as “what is the underlying probability someone gives this submission an upvote, versus a downvote?”. So the 999 upvote/1 downvote submission probably has a true upvote ratio close to 1…but on the other hand we are much less certain about the true upvote ratio of the submission with only a single upvote. Sounds like a Bayesian problem to me.

I chose this example because it is more or less the most compact practical distillation of a PyMC3 analysis you can get.

While there are still lots of fuzzy areas, I think I have enough context to take a stab at an explanation here. There are four clear steps in this function, noted with comments in the code.

  1. Create your prior. In this case we represent our parameter of interest, the true upvote ratio of a post, and we use a Uniform distribution because we have no reliable information available by which to form a different prior. “Prior distribution” is the default designation for an initialized random variable in PyMC3.
  2. This is the step that took the longest to click, and it wasn’t until reading James’ comment that things started to fall into place. Here we are creating a binomial variable called obs in our model, named by the first argument. The next argument is the number of trials in the sequence — in this case, the total number of votes. With the third argument we pass in our prior, upvote_ratio, as the probability of success for each trial. The fourth argument is where the magic happens. The upvotes variable contains the observed number of upvotes. This fixes the value of the obs variable in our model to the upvotes data. While the Uniform upvote_ratio represents the prior distribution of the upvote ratio, the presence of observed data turns the obs variable into a likelihood distribution.
    Take for example a post with 99 upvotes and 1 downvote (N = 100). You could read this line of code as “create a binomial variable representing how well each possible upvote ratio (from 0 to 1) “explains” the observation of 99 upvotes in 100 trials.” In other words, this distribution should model the most likely values of the bias term to produce 99/100 upvotes. It should end up concentrated close to 1, with a peak at 0.99. This likelihood distribution combined with the prior will produce a posterior distribution over the “true” upvote ratio of the post. Note that with a Uniform prior, the posterior has the same shape as the likelihood; think of it as combining the two by multiplying the likelihood by 1 (h/t James).
  3. With a prior and a likelihood in place, our model now has a posterior distribution over the upvote ratio which we can sample from. The Metropolis algorithm is a nuanced (read: I have no clue what distinguishes it) implementation of MCMC. Unless the default argument is overridden, we are going to take 20000 samples from the posterior landscape of our model here.
  4. As explained in the previous chapter, it’s best practice to “burn” the first N% of your MCMC trace. These early samples can be very noisy, taken while the algorithm is still moving towards the posterior distribution without necessarily being close to or inside it yet.
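The “multiplying the likelihood by 1” claim in step 2 is easy to check numerically without PyMC3 at all. A small scipy sketch (my own, not from the book): with a flat prior, renormalizing the Binomial likelihood for a 99-up/1-down post reproduces the analytic Beta(99 + 1, 1 + 1) posterior exactly, peak at 0.99 and all:

```python
import numpy as np
from scipy import stats

# 99 upvotes, 1 downvote: with a Uniform(0, 1) prior, the posterior over the
# true upvote ratio p is just the Binomial likelihood renormalized, which
# works out analytically to a Beta(99 + 1, 1 + 1) distribution.
upvotes, downvotes = 99, 1
p = np.linspace(0, 1, 1001)
dp = p[1] - p[0]

likelihood = stats.binom.pmf(upvotes, upvotes + downvotes, p)
posterior = likelihood / (likelihood.sum() * dp)  # renormalize numerically
analytic = stats.beta.pdf(p, upvotes + 1, downvotes + 1)
```

On this grid, `posterior` and `analytic` are the same curve up to discretization error, and both peak at p = 0.99.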

And voila, you now have an approximation of your posterior! Easy¹.

¹Disclaimer: actually not easy at all.
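As a coda, here’s what “doing anything useful” with that approximation looks like: posterior summaries are just array arithmetic on the burned trace. (Stand-in draws from the analytic Beta(100, 2) posterior for a 99-up/1-down post are used below so the snippet runs without PyMC3; with the real thing you’d pass in the trace instead.)

```python
import numpy as np

# Stand-in for a burned MCMC trace: draws from the analytic Beta(100, 2)
# posterior of a 99-upvote / 1-downvote post.
rng = np.random.default_rng(0)
burned_trace = rng.beta(100, 2, size=15000)

post_mean = burned_trace.mean()  # point estimate, ~0.98 for this post
ci_low, ci_high = np.percentile(burned_trace, [2.5, 97.5])  # 95% credible interval
```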
