Bayesian Inference with Probabilistic Programming Using PyMC3

Ali Akbar Septiandri
Published in Airy ♥ Science
8 min read · Jun 28, 2019

At Airy, we are always striving for excellence, including in how we answer business questions scientifically. Yet, as the adage goes,

All models are wrong, but some are useful.

Therefore, what we are continuously looking for is a less wrong answer. An “educated guess”, you may say. In this post, I would like to share how to use probabilistic programming to do Bayesian inference and answer the kinds of questions requiring statistical models that you might often encounter at your company. As the name suggests, we are indeed going to tackle them programmatically. Thomas Wiecki said,

Probabilistic programming thus democratizes statistical modeling by considerably lowering the mathematical understanding and time required to successfully build novel models and gain unique insights into your data.

I am using PyMC3, an awesome library for probabilistic programming in Python developed by Salvatier, Wiecki, and Fonnesbeck, to answer these questions. Do check the documentation for some fascinating tutorials and examples. For those of you who mainly use R, check out RStan.

You might notice that I mentioned probability earlier. And like most intros to probability, we will start with a coin flip example.

Let’s get started, shall we?

Mandatory Coin Flip Example

Suppose that you are playing a game with your friend Bob, in which you bet on the outcome of a coin flip. The coin has been provided by Bob. You think that there is a 50% chance that Bob would have provided an unfair coin. If the coin is unfair, you have no knowledge of the probability that the coin will turn up heads, so if asked, you would model the distribution over this probability as uniform. You flip the coin once, and it comes up heads. You flip the coin a second time, and it comes up heads again. What is the probability that the coin is fair? How can you justify your answer?

Let’s think about it for a while. Shouldn’t we suspect that the coin is biased given what we saw? But can we be that certain when we only flipped the coin twice?

If you have learned something about Bayesian probability, you might already know how to answer this… or you could read this blog post about Bayesian inference first. For those who haven’t, let me walk you through it.

Analytical Solution

First, what do we know from the story? We know that we have a fair coin and a biased one. Since we are choosing one of them at random, we can model the choice with a random variable that follows a Bernoulli distribution. Basically, it is like saying that we have another fair coin to decide which coin we should flip. However, we don’t really know which is which, and that’s why it is a random variable! Anyhow, let coin be a random variable that follows a Bernoulli distribution with 𝜃=0.5, i.e. it is equally probable to pick either of the two coins. We can denote this as

$$\text{coin} \sim \text{Bernoulli}(0.5)$$

where coin = 1 means we picked the fair coin.

Now, since we are observing the outcomes of the flips, we can also model each outcome as a Bernoulli random variable, but we don’t know the parameter since we don’t know which coin we took in the first place. Thus,

$$\text{flip} \sim \text{Bernoulli}(\theta), \qquad \theta = \begin{cases} 0.5 & \text{if coin} = 1 \\ \theta_b & \text{if coin} = 0 \end{cases}, \qquad \theta_b \sim \text{Uniform}(0, 1)$$

What we would like to know first is

$$P(\text{coin} = 1 \mid HH)$$

i.e. the probability that the coin is fair given that we saw heads twice.

Using Bayes’ rule, we can formulate it as

$$P(\text{coin} = 1 \mid HH) = \frac{P(HH \mid \text{coin} = 1)\,P(\text{coin} = 1)}{P(HH)}$$

The first and second flips are independent, so we know that

$$P(HH \mid \text{coin} = 1) = P(H \mid \text{coin} = 1) \times P(H \mid \text{coin} = 1)$$

Thus,

$$P(HH \mid \text{coin} = 1) = 0.5 \times 0.5 = 0.25$$

while, by the law of total probability,

$$P(HH) = P(HH \mid \text{coin} = 1)\,P(\text{coin} = 1) + P(HH \mid \text{coin} = 0)\,P(\text{coin} = 0)$$

Since we don’t know the exact bias of the unfair coin, we should evaluate the expected probability of getting two heads instead:

$$P(HH \mid \text{coin} = 0) = \mathbb{E}[\theta_b^2] = \int_0^1 \theta_b^2 \, d\theta_b = \frac{1}{3}$$

Finally,

$$P(\text{coin} = 1 \mid HH) = \frac{0.25 \times 0.5}{0.25 \times 0.5 + \frac{1}{3} \times 0.5} = \frac{3}{7} \approx 0.4286$$

Since 3/7 < 1/2, we conclude that it is more likely that the coin we picked is biased.

If you have been following so far, congratulations!

Numerical Solution — PyMC3 in Action

Now, I am going to show you how we can do the same calculation and get approximately the same answer. To do this Bayesian inference, we can use some of the aforementioned definitions.
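Here is a minimal sketch of the model in PyMC3 (the numbered notes below refer to the lines of this snippet; names other than which_coin are illustrative):

```python
import pymc3 as pm

with pm.Model() as model:
    which_coin = pm.Bernoulli('which_coin', p=0.5)          # 1 = fair, 0 = biased
    biased_theta = pm.Uniform('biased_theta', 0, 1)         # unknown bias of the unfair coin
    theta = pm.math.switch(which_coin, 0.5, biased_theta)   # "if-else" on which coin we hold
    obs = pm.Binomial('obs', n=2, p=theta, observed=2)      # two flips, two heads (HH)

    trace = pm.sample(10000)

print(trace['which_coin'].mean())  # posterior probability that the coin is fair
```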

Several things to clarify:

  1. The sixth line functions like an “if-else” statement. If which_coin is 1, then we say that the parameter 𝜃=0.5; otherwise, 𝜃 is uniformly distributed.
  2. We use a Binomial distribution in the seventh line to encode HH, since we flipped the coin twice (n=2) and saw two heads (observed=2).
  3. The ninth line performs the sampling step using Markov chain Monte Carlo. More on this is explained in Cameron Davidson-Pilon’s excellent book, Bayesian Methods for Hackers.

Quite easy to use, isn’t it?

And the answer we get from this is 0.42885, very close to the analytical solution of 3/7 ≈ 0.4286!

Do we always get an approximately correct answer?

To put it simply, given a large enough sample size, the numerical solution will converge to the analytical one. This happens because of the law of large numbers. You can read some more details here.
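As a quick illustration (a hypothetical numpy sketch, separate from the model above), you can watch the Monte Carlo estimate of the expected value 𝔼[𝜃²] = 1/3 from the analytical solution settle toward the true value as the number of draws grows:

```python
import numpy as np

np.random.seed(0)
for n in [100, 10_000, 1_000_000]:
    draws = np.random.uniform(0, 1, size=n)
    # Monte Carlo estimate of E[theta^2]; the analytical value is 1/3
    print(n, (draws ** 2).mean())
```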

Effective Engineering

Let’s move on to another example.

One day, I was curious to find out how effective our data engineering and data analytics teams are. Since we are using Scrum as our project management framework, we can evaluate the effectiveness of a team using the velocity of its sprints. For each sprint, we have the total number of story points assigned to each team. The velocity of a team can then be calculated as the number of completed story points divided by the total number of story points. The question is: which team is more effective?

To answer this kind of question, we usually resort to a statistical test, most often a two-sample t-test. However, the results of such tests are easy to misinterpret, let alone explain to business users (or team leads in this case). What we really want is to find out how different the two groups are, not merely whether the difference is significant. The Bayesian way to answer this question is to use BEST (Bayesian Estimation Supersedes the t-test).

The Setup

Figure 1. Total story points distribution

First, you can see from the histogram that the total story points are different each sprint. In Figure 1, DA denotes the data analytics points, while SW is the data engineering points. Velocity or completion rate seems to be a good metric to compare the two, but since SW points are fewer in general, wouldn’t it be easier to reach 100% completion?

In that case, Bayesian probability will incorporate your uncertainty appropriately. This is because we assume that the observed data are fixed, while the parameter of the distribution might vary. The model will encode your belief about the true value of the parameter.

If you recall your Intro to Probability class, you might notice that the number of successes k in n trials, each with probability of success p, follows a Binomial distribution. In stats lingo,

$$k \sim \text{Binomial}(n, p)$$

We will model both teams with that assumption. What we don’t know and would like to find out is p, the completion rate. Hence, we will put a prior on this completion rate as follows:

$$p \sim \text{Beta}(\alpha = 1, \beta = 1)$$

Why beta distribution? Because the completion rate should be between 0 and 1, which is the support of a beta distribution. For now, we assume that we don’t know anything about the completion rate previously. Thus, the parameters of this beta distribution were chosen to make it a uniform distribution. After we see the observed data, the distribution will change accordingly.

At this stage, you can evaluate each team individually. You will see in Figure 2 that the mean completion rate for SW is higher than DA. However, is SW always better than DA?

Figure 2. Completion rate distributions

The Differences

To answer that, you can sample from each distribution and compare them directly, or you can subtract them to get the distribution of the difference and then calculate the proportion of samples where the difference is larger than 0. In Figure 3, this translates to counting the number of samples to the right of the green line and dividing by the total number of samples.

Figure 3. Distribution of differences (SW-DA)

The probability turns out to be 0.82, which means there is an 82% chance that SW’s completion rate is higher than DA’s! You can also say that the difference between the SW and DA completion rates most likely lies between 0.024 and 0.069. To reproduce the analysis, you can use the following snippet.
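Here is a sketch of that analysis; the story-point arrays are made-up placeholders, so substitute your teams’ actual numbers. The model follows the Binomial likelihood and Beta(1, 1) prior described above.

```python
import numpy as np
import pymc3 as pm

# Placeholder sprint data: total and completed story points per sprint
da_total = np.array([30, 28, 35, 32, 30])
da_done = np.array([24, 25, 28, 27, 26])
sw_total = np.array([20, 18, 22, 19, 21])
sw_done = np.array([18, 16, 20, 17, 19])

with pm.Model() as model:
    # Uniform Beta(1, 1) priors on each team's completion rate
    p_da = pm.Beta('p_da', alpha=1, beta=1)
    p_sw = pm.Beta('p_sw', alpha=1, beta=1)

    # Completed points out of total points follow a Binomial distribution
    pm.Binomial('da_obs', n=da_total, p=p_da, observed=da_done)
    pm.Binomial('sw_obs', n=sw_total, p=p_sw, observed=sw_done)

    # Posterior distribution of the difference in completion rates (Figure 3)
    diff = pm.Deterministic('diff', p_sw - p_da)

    trace = pm.sample(10000)

# Proportion of samples to the right of zero, i.e. P(SW rate > DA rate)
print((trace['diff'] > 0).mean())
```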

What to Do Next

Unfortunately, there’s something slightly wrong with our current assumption. By modeling the completion rate as a single beta-distributed value, we imply that our belief will converge to one value; in other words, we are saying that each team works at a constant rate sprint after sprint. Yet, we know that every team has its ups and downs. Try plotting a histogram of the observed rates to see this.
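For instance (a small matplotlib sketch, reusing the placeholder numbers from the snippet above):

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-sprint completion rates from the placeholder data above; a single
# beta-distributed rate cannot capture this sprint-to-sprint variation
da_rates = np.array([24, 25, 28, 27, 26]) / np.array([30, 28, 35, 32, 30])
sw_rates = np.array([18, 16, 20, 17, 19]) / np.array([20, 18, 22, 19, 21])

plt.hist(da_rates, alpha=0.5, label='DA')
plt.hist(sw_rates, alpha=0.5, label='SW')
plt.xlabel('Completion rate')
plt.ylabel('Number of sprints')
plt.legend()
plt.show()
```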

If so, what we should do next is change our assumption a bit by using the beta-binomial distribution, as suggested in this post. I’ll leave that as an exercise for you. Nevertheless, I hope you have learned something from this post about how to do Bayesian inference with PyMC3!

If you like this kind of work, want to delve into the hospitality business, or even want to propose a better solution to this kind of problem, we’re hiring!
