Understanding Pyro’s Model and Guide: A Love Story

Tiger Shen · Published in Paper Club · Jun 20, 2018

For background on why all of a sudden there’s Bayesian content on this blog, see here.

Rambling Introduction 🤦

After wrapping up Bayesian Methods for Hackers (9/10 would recommend), I’ve been working with Uber’s Pyro library for the past few weeks. Pyro is a deep universal probabilistic programming language built on top of the PyTorch deep learning framework. Sounds fancy, but don’t let that scare you! My experience has been two parts challenging, one part fun, and I wouldn’t have it any other way.

One of the biggest obstacles has been the relative youth of the library. Pyro was open-sourced in December 2017 and is built on PyTorch which was itself released in October 2016. On top of that, probabilistic programming and Bayesian methods have always been relatively niche and even in the past few years have taken a backseat to more “mainstream” deep learning like CNNs and RNNs. Thus, the existing literature and tutorials are mainly written by and for academic, math-y types — efficient for the important folks, harder to parse for newbies like me.

As I sit with some of these concepts, I’m going to try to re-explain them in my own words, to solidify my own understanding and hopefully help fill a gap in the existing documentation.

This post only assumes basic knowledge of how probability distributions work (this might be a decent starting point if you want a refresher or reference point). I will walk you through all of the Pyro parts, with the disclaimer that I am not (yet) a complete expert.

Ridiculous Analogy 🙄

Let’s start with the most immediate and important new concept Pyro introduced to me: the guide.

To reason about the guide, you must start with a model. You can think of the model as an arbitrary Python function.

But to make our initial neural connection, we’ll be using another definition: the smoking hot swimsuit model, say, Kate Upton.

To understand the guide, picture this scenario:

The guide is madly in love with the model. He’s made it his life’s mission to pursue her, and he’s willing to go to the ends of the latent space to breathe the same air as her. So he follows her everywhere she goes (kinda creepy yeah but bear with me).

The model gets into the most exclusive clubs and parties and constantly has a security detail. In this scenario, we are paparazzi who are trying to follow the model around. The problem is, she can hop on her private jet and could be anywhere on Earth at any given time. Luckily, following her isn’t work but more like eating, drinking, and breathing for the guide. His obsession means that he is never far away from the model, chasing her wherever she goes. Even at these exclusive parties, he can be found right outside the building. And so, by simply tracking the guide we are able to get a very accurate location for the model at any given time.

In practice, it might look something like this, where P(Z|X) is the model’s posterior and Q(Z) is the guide:

Guide approximating model (source)

Q is a much “simpler” Gaussian distribution than P, but it is the closest possible Gaussian to P. Obviously, in real life with real distributions we will aim for a closer approximation.

Actual Content 😱

Let’s use this concept to build a probabilistic model.

Every time James, Tiger, and Jason go to The French Laundry, they play credit card roulette to determine who pays. This began as a form of good-natured entertainment since they’re all rational human beings who know that it should even out over time. After 20 meals, James has picked up 4 bills, Tiger has paid for 9, and Jason has been responsible for 7.

Now, Tiger can’t help but get a little bit suspicious — he’s been stuck with almost half of the tabs! And how has James only paid for 4?? He’s clearly cheating! Taking a deep breath, Tiger gathers himself and sets out to build a model to determine whether there’s foul play at hand.

Setup

If you’re following along at home, we’ve installed the latest versions of Pyro (0.2.1) and PyTorch (0.4.0) (see here for instructions on how to do this from scratch on a cloud GPU) and imported these packages:
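A minimal set of imports that covers everything used below (a sketch; the original snippet may have differed slightly):

import torch
from torch.distributions import constraints

import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam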

Performed some good Pyro hygiene:
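Something like the usual Pyro 0.2 boilerplate (my sketch):

pyro.enable_validation(True)   # catch shape and support errors early
pyro.clear_param_store()       # start from a clean parameter store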

And set up a dataset representing the number of times each person has paid:
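One plausible encoding is a tensor with one entry per meal recording who paid. The index assignment (0 = Tiger, 1 = Jason, 2 = James) is my assumption, chosen to match the order the results are printed later:

# 20 meals: Tiger paid 9 times, Jason 7 times, James 4 times
data = torch.tensor([0.] * 9 + [1.] * 7 + [2.] * 4)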

What output do we want from our model? The most straightforward answer would be three probability distributions representing the chances that each person pays for a meal, given the observed data. This will help us determine whether Tiger is really being cheated out of his hard-earned funds.

A good choice for modeling probabilities is the Beta distribution. In this case, we want to start off with distributions peaking around 0.33, or 1/3 chance, since this is a reasonable naive assumption for the “true” fairness of our game of credit card roulette. Parameter values of 6.0 for alpha and 10.0 for beta seem to do the job:

Beta distribution with hyperparameters 6 and 10
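If you want to reproduce a plot like the one above yourself, a quick sketch (my own snippet, not from the original post) using matplotlib:

import matplotlib.pyplot as plt

# visualize the Beta(6, 10) prior we place on each person's pay probability
x = torch.linspace(0.01, 0.99, 200)
pdf = torch.distributions.Beta(6.0, 10.0).log_prob(x).exp()
plt.plot(x.numpy(), pdf.numpy())
plt.axvline(1 / 3, linestyle="--")  # a perfectly fair game
plt.title("Beta(6, 10) prior")
plt.show()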

Model

We can now build our model:
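Here is a sketch of the model, using the Pyro 0.2-era API this post describes (pyro.iarange and .independent were later renamed pyro.plate and .to_event):

def model(data):
    # draw a latent "pay probability" for each person from the Beta(6, 10) prior
    pay_probs = pyro.sample("pay_probs", dist.Beta(6.0, 10.0).expand(3).independent(1))
    # the three draws don't sum to 1 on their own, so normalize them
    normalized_pay_probs = pay_probs / torch.sum(pay_probs)

    with pyro.iarange("data_loop", len(data)):
        # score each observed payer against a Categorical over the three people
        pyro.sample("obs", dist.Categorical(probs=normalized_pay_probs), obs=data)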

Models in Pyro are Python functions which take data as input and make use of Pyro primitives to analyze the data. Things to note about our model:

  • pyro.sample is a Pyro primitive that creates a latent random variable. In our model, we are going to sample three “pay probabilities” from the Beta distribution representing each person’s probability of paying and give it a name of pay_probs. This will give us something like tensor([0.2091, 0.3522, 0.3726]) (without adding .expand(3).independent(1), we would only get one sample). As you’ll see later, pyro.sample statements without the obs keyword must appear in both the model and the guide.
  • It’s unlikely that our sampled probabilities add up to 1 like we want, so we normalize pay_probs to represent the fact that they are interdependent.
  • pyro.iarange is a Pyro construct that acts like Python’s range but lets Pyro vectorize the enclosed computation for better performance, which we use for the most important piece of our model…
  • pyro.sample statements with the obs keyword are how we incorporate observations into our model. Here, we pass our normalized pay probabilities to a Categorical distribution, which will return samples from the set {0, 1, 2} in proportion to the respective probabilities. For example, if our pay probabilities were tensor([0.25, 0.25, 0.5]), the Categorical distribution would return 0 25% of the time, 1 25% of the time, and 2 50% of the time (see the quick check after this list). We then compare these samples to the observed data to train our model.
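As a quick sanity check (my own snippet, not part of the original), you can see the Categorical behavior directly:

probs = torch.tensor([0.25, 0.25, 0.5])
draws = dist.Categorical(probs=probs).sample(torch.Size([10000]))
# fraction of draws equal to 0, 1, 2 -- roughly [0.25, 0.25, 0.5]
print([(draws == i).float().mean().item() for i in range(3)])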

“But wait!” you might be saying. “What exactly is being trained? You just hardcoded those hyperparameter values of 6 and 10!” Well, glad you asked. Enter:

Guide

Let’s declare our guide.
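A sketch of the guide to match, again using the 0.2-era API; the constraint keyword keeps the Beta parameters positive:

def guide(data):
    # learnable parameters, one alpha/beta pair per person, initialized at the prior's values
    alphas = pyro.param("alphas", torch.tensor([6.0, 6.0, 6.0]), constraint=constraints.positive)
    betas = pyro.param("betas", torch.tensor([10.0, 10.0, 10.0]), constraint=constraints.positive)

    # same site name as in the model, so SVI can match the two up
    pyro.sample("pay_probs", dist.Beta(alphas, betas).independent(1))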

Keep in mind the analogy of the guide following the model around. The guide can be arbitrary Python code just like the model, but with a few requirements:

  • Every pyro.sample statement without the obs keyword that appears in the model must have a corresponding pyro.sample statement with the same name in the guide. We sample with the name pay_probs here to meet this requirement. Note: the sample statements in the model and guide are not required to use the same distribution, but they often do.
  • There are no pyro.sample statements with the obs keyword in the guide. These are exclusive to the model.
  • There are pyro.param statements, which are exclusive to the guide. These register learnable parameters and are what make the inputs to the guide’s pay_probs sample different from the model’s. These special variables are the pieces that are actually going to be trained. We declare them as vectors alphas and betas, each with three elements matching the three probability distributions we hope to see as the model’s output.

Training

IMPORTANT PARAGRAPH:

We will be using a process called SVI (stochastic variational inference) with ELBO loss to train our model. It’s not important to understand all of the nuts and bolts at this point. At a high level, each training step runs the guide to propose values for the latent variables, then runs the model against the observed data, reusing the guide’s proposed values at each matching pyro.sample site. It compares the resulting probabilities from the model and guide and adjusts the pyro.param values in order to get the guide closer to the model. This way, the guide will in theory follow the model around and get close enough to confidently approximate it.

Note: I couldn’t give you a 100% confident explanation as to why the model posterior distribution is intractable and we have to use the guide as an approximation, but I’m inclined to say it has something to do with the curse of dimensionality.

We’ll start with a quick function to print our training progress:
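Something along these lines; exactly how the original computes the printed mean and spread is my guess (here I sample from the current guide and normalize, so the numbers are comparable across the three people):

def print_progress(step, num_samples=100):
    alphas = pyro.param("alphas")
    betas = pyro.param("betas")
    # draw from the guide's current Beta distributions and normalize each draw
    samples = dist.Beta(alphas, betas).sample(torch.Size([num_samples]))
    normalized = samples / samples.sum(dim=-1, keepdim=True)
    means, stds = normalized.mean(dim=0), normalized.std(dim=0)
    names = ["Tiger", "Jason", "James"]
    stats = ["probability %s pays: %.3f +/- %.2f" % (n, m.item(), s.item())
             for n, m, s in zip(names, means, stds)]
    print("[ %d | %s ]" % (step, " | ".join(stats)))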

Nothing that special here. The most notable thing is the use of alphas = pyro.param("alphas") to query the state of a parameter in the currently-training Pyro model.

And our training loop:
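A sketch of the loop, wiring the pieces together:

optimizer = Adam({"lr": 0.0005})
svi = SVI(model, guide, optimizer, loss=Trace_ELBO())

for step in range(2501):
    svi.step(data)          # one gradient step on the ELBO
    if step % 100 == 0:
        print_progress(step)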

We’re using the Adam optimizer here with a super scientific magical learning rate of 0.0005. If you’re unfamiliar with Adam, don’t worry. I’m no expert in optimization functions, but Just Use Adam ™ hasn’t failed me yet.

Pay attention to the interface for the SVI process: we pass it our model function, our guide function, and our optimizer, with a loss function of Trace_ELBO (no need to completely understand that for now). These are all of the components we need for successful training.

Then we just perform a given number of steps and print progress as we go. Whee!

Validation

This is an abbreviated version of the output we expect to see:

[ 0 | probability Tiger pays: 0.333 +/- 0.10 | probability Jason pays: 0.333 +/- 0.10 | probability James pays: 0.333 +/- 0.10 ] 
...
[ 700 | probability Tiger pays: 0.368 +/- 0.11 | probability Jason pays: 0.343 +/- 0.10 | probability James pays: 0.288 +/- 0.10 ]
...
[ 2500 | probability Tiger pays: 0.377 +/- 0.10 | probability Jason pays: 0.340 +/- 0.10 | probability James pays: 0.283 +/- 0.10 ]

As you can see, our normalized initial probabilities start where we expect them, giving each person a 1/3 chance of paying for the meal. As the model trains, we see the probabilities moving in the correct direction. Eventually they stabilize around 0.38, 0.34, and 0.28, respectively.

From 10000 samples, in graph form, Tiger’s distribution:

Jason’s distribution:

James’s distribution:

And overlaid on each other:

Conclusion

So, is Tiger’s skepticism justified? Well, given the mean and standard deviation Pyro gives us for his final chances of paying (0.377 +/- 0.10), he should breathe a bit easier…for now. Over a third of this distribution lies at or below 0.333 (completely even chances of paying), which is well within the range of outcomes.

What if the group ate out 20 more times and got the same proportion of tabs paid? Do the model’s answers change? Does the model’s certainty in its answers change? What if they had half as many trials? Try playing around with different formulations of this problem and see what you find!

Special thanks to @jpchen on the Pyro forum for helping me work through some of the nuances of using Pyro for this example: https://forum.pyro.ai/t/svi-part1-readjusted-with-multinomial-instead-of-bernoulli-doesnt-work/192
