Paper club has been studying ML for the last year and a half, with a recent focus on Bayesian machine learning. This post is a collaborative result of several pairing sessions with Tiger, learning Pyro and variational methods. It is based on the excellent Bayesian Methods for Hackers example, using variational inference instead of Markov chain Monte Carlo. The full notebook can be found on Google's Colab environment or on GitHub.
Let’s say we’re teaching a course and want to estimate how many students cheat on an exam. Obviously, if you simply ask people whether they cheated, they will say no. A better algorithm, quoted from Bayesian Methods for Hackers, is defined as:
> In the interview process for each student, the student flips a coin, hidden from the interviewer. The student agrees to answer honestly if the coin comes up heads. Otherwise, if the coin comes up tails, the student (secretly) flips the coin again, and answers “Yes, I did cheat” if the coin flip lands heads, and “No, I did not cheat”, if the coin flip lands tails. This way, the interviewer does not know if a “Yes” was the result of a guilty plea, or a Heads on a second coin toss. Thus privacy is preserved and the researchers receive honest answers.
So in a pretty clever way, we’re encouraging honesty amongst our students at the expense of noise in our data: we can expect roughly 50% of our answers to be noise. What we need now is a way to model the percentage of students who cheated while accounting for that noise.
For a more in-depth post on variational inference (VI), I highly recommend Eric Jang’s post on the subject. The basic idea is that we have some unknown distribution P(Z|X), the true distribution of our model given our data. In most cases we can’t calculate this distribution analytically, so we resort to approximation methods, and VI is one method that accomplishes this. We create a second, known distribution Q(z) and move the parameters of this distribution closer to P(Z|X), where “closer” is defined by minimizing the KL divergence between the two distributions. There are some important caveats and tradeoffs to VI and you should really go read Eric’s post, but the basic idea is to move Q(z) so that it approximates P(Z|X).
By calculating the loss (the KL divergence) between P(Z|X) and Q(z), we learn how to update Q(z)’s parameters so that the loss decreases at the next step and Q becomes a better approximation.
A key point to understand here is that while we can’t calculate the exact formula for P(Z|X), we can describe its density function in code, which VI will use to compute the loss.
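To build intuition for what “closer” means, here is a toy illustration (not the cheating model itself) of KL divergence between two known distributions, using `torch.distributions`: the divergence is large when the distributions disagree and zero when they are identical.

```python
import torch
import torch.distributions as dist

# KL(Q || P) for normals: large when they disagree, zero when identical
p = dist.Normal(0.0, 1.0)
far = dist.kl_divergence(dist.Normal(3.0, 1.0), p)   # very different mean
near = dist.kl_divergence(dist.Normal(0.1, 1.0), p)  # almost the same
same = dist.kl_divergence(dist.Normal(0.0, 1.0), p)  # identical

print(float(far), float(near), float(same))
```

VI drives Q(z)’s parameters in the direction that shrinks this quantity, step by step.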
The first thing we want to do to model our data is choose a distribution from which our observations are drawn. A Bernoulli distribution is good at representing binary outcomes, so that’s a good place to start. Assigning 1 to “cheated” and 0 to “didn’t cheat”, the distribution would look like this if 40% of students said they cheated:
In code that would look something like:
This samples from a Bernoulli distribution with p = probability_of_true_answer and incorporates our data with the obs=data[i] keyword. This is how we’re going to calculate our loss. If our data is [1, 1, 1, 1] (four “yes” answers) and our probability_of_true_answer is very low, say 1%, our model will realize that 1% is unlikely to be the true value and adjust accordingly.
So how do we model probability_of_true_answer? A student will say “yes” they cheated under two different conditions.
Condition 1: They flipped heads on the first flip, and they actually did cheat on the exam (a true cheater)
Condition 2: They flipped tails on the first flip and heads on the second (noise)
If we have prior beliefs about how many students cheated, now would be the time to reflect that in our model. In our case, let’s say we really have no idea. We’ll model our ignorance by using a uniform distribution from 0 to 1. We can combine conditions 1 and 2, and our prior ignorance of the true cheating frequency, in this code:
This model succinctly describes our cheating algorithm.
Let’s turn our attention to Q(z). This is where the learning actually happens, and there are a few key points you have to understand about Pyro in order to make the guide work properly.
- The guide trains learnable parameters. In the guide, we register with Pyro the parameters we want it to train. When the model is done training, we’ll grab the updated values and use them to plot our posterior.
- Pyro uses a name-matching system, so any “pyro.sample” that appears in the model (e.g. “cheating_frequency”) has to match the name of a “pyro.sample” in the guide. The one exception is a sample that incorporates observed data (via the obs= keyword), which doesn’t have a matching “sample” in the guide.
- The distributions don’t have to be the same. In the case of a uniform distribution in the model, it doesn’t really make sense to have a uniform distribution in the guide. The point of the guide is to learn parameters for its distributions so that it fits the data better, and a Uniform(0, 1) distribution has no free parameters to learn.
- A model’s distribution must have non-zero density for every value the guide’s matching distribution can produce. This one threw me off for a while and is pretty critical to creating a working model. Pyro uses the model to calculate the log probability of a sampled value and uses this result to calculate the loss of our guide. Let’s say we make a bad choice for our guide’s “cheating_frequency” distribution and choose Normal(0, 1). During training, we’ll sample from this distribution an x value of, say, 1.5. The density of this value under a Uniform(0, 1) distribution is 0, and log(0) is negative infinity. Our model will start spewing infinite losses and won’t be able to learn. A good choice of matching distribution is a Beta distribution, since its support lies between 0 and 1 and it has non-zero density on that range.
A beta distribution is described by two parameters, alpha and beta. Here are some plots for different values of those parameters.
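To get a feel for those shapes numerically, we can evaluate the density at a single point for a few parameter settings with `torch.distributions` (Beta(1, 1) is flat like a uniform; larger equal parameters concentrate mass around 0.5; unequal parameters skew the mass toward one end):

```python
import torch
from torch.distributions import Beta

x = torch.tensor(0.5)
flat = Beta(1.0, 1.0).log_prob(x).exp()    # uniform: density 1 everywhere
peaked = Beta(5.0, 5.0).log_prob(x).exp()  # concentrated around 0.5
skewed = Beta(2.0, 8.0).log_prob(x).exp()  # mass pushed toward 0

print(float(flat), float(peaked), float(skewed))
```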
We want our guide to learn good values of alpha and beta so Q(z) comes close to representing our intractable distribution. With these pieces in mind, we get our guide function:
It’s important to remember that there are mathematical constraints on these distributions: alpha and beta must both be positive for a Beta distribution to make sense. Passing these constraints as an argument to pyro.param is an important step in ensuring that our model adheres to mathematical reality.
Putting it all together
Our sample dataset will include 35 “yes” answers and 65 “no” answers. We’ll run our model for 5000 steps to learn alpha and beta. As we learn alpha and beta, we can compute the mean cheating frequency as well as the standard deviation with the following formula:
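For a Beta(alpha, beta) distribution, the mean is alpha / (alpha + beta) and the variance is alpha·beta / ((alpha + beta)² (alpha + beta + 1)). Plugging in illustrative learned values (alpha ≈ 4.9, beta ≈ 18.0 here; an actual training run will differ slightly):

```python
import math

# illustrative learned values; a real run will give slightly different numbers
alpha, beta = 4.9, 18.0

mean = alpha / (alpha + beta)
variance = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))
std = math.sqrt(variance)

print(f"mean={mean:.3f}, std={std:.3f}")
```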
So we get a mean frequency of 21.4% with a standard deviation of 0.084. We’ve learned a beta distribution, Q(z), to describe cheating frequency. Plotting this distribution gives us:
Let’s think about this intuitively and see if it makes sense. Out of 100 responses, we expect around 50 to be decided by the second coin flip, either yes or no. Of those 50 forced responses, about 25 will say yes. This means that of the 50 responses that were supposed to be honest, we got about 35 - 25 = 10 “yes, I cheated” answers. 10 / 50 = 20%, which is on par with our mean.
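That back-of-the-envelope check, written out in code (assuming 100 students and the 35/65 split above):

```python
total = 100
yes_answers = 35

forced = total // 2       # ~50 answers decided by the second coin flip
forced_yes = forced // 2  # ~25 of those come up "yes"

honest = total - forced                # ~50 honest answers
honest_yes = yes_answers - forced_yes  # ~10 honest "yes, I cheated"

estimated_cheating_frequency = honest_yes / honest
print(estimated_cheating_frequency)  # → 0.2
```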
At this point you might be asking yourself: what is the point of using VI to solve this? It was pretty straightforward to calculate 20% analytically. Using Bayesian techniques here allows us to see the full distribution of probabilities, not just a point estimate.
How confident are we that the percentage of cheaters is 20%? How likely is it to be more or less? How likely is it that we had 0 cheaters? The distribution of values allows us to answer these types of questions. The wider a distribution, the less confident we are in its mean. In our case the distribution is relatively wide, so flatly stating that 20% of the students cheated would be an overconfident judgement. We can be pretty sure that the proportion of cheaters is above zero, but there’s a large range of possible values it could take on, so any further conclusions would be ill-advised.
We can also choose to use a binomial distribution. Instead of saying that there is a 50% chance of an honest answer, let’s also simulate the coin flipping as described by the algorithm. Since a binomial distribution is just the sum of N = len(data) Bernoulli draws, we can easily translate the model above. The few changes we make are using “expand” to sample multiple coin flips, one per student, and summing the simulated answers into a proportion to pass as our observation to the binomial model:
Simulating the coin flips and saying there is a 50% chance of an honest answer are semantically the same, but we’re adding a small amount of variance to our model. Notice that the standard deviation at the end is 0.091, somewhat larger than the 0.084 of the previous model. Plotting this distribution, we can see that our uncertainty is a little larger (a slightly wider distribution).
VI is a powerful technique. It allows you to approximate otherwise intractable distributions and gives you insight into the confidence of your estimates. There can be multiple correct ways to model the same process, and one isn’t necessarily better or more correct than another. Variational methods turned out to be some of the most challenging work we’ve encountered. Pyro is a great library, but it assumes a lot of knowledge of the intricacies of VI, which can make it difficult to approach. If you’re new to VI and want to work with Pyro, I recommend reading papers and spending time understanding the background before digging into the code.