Getting Started with Probabilistic Programming Using PyMC

Arif Qodari
Published in tiket.com
Jan 13, 2023 · 8 min read

As a leading company in the hospitality industry, tiket.com manages a vast number of lodging options such as hotels and villas. Some of these properties have a strong monthly booking rate and minimal customer complaints, while others have a lower performance with a higher ratio of complaints to bookings.

The issue is that, although it may appear simple, the complaint ratio carries an implicit assumption: that the ratio derived from the data is a constant value representing the actual rate of complaints. This assumption is not always accurate because small sample sizes introduce bias. For example, flipping a fair coin twice is not guaranteed to produce one head and one tail; in fact, that happens only half of the time.

In this article, we will go through the basics of probabilistic programming and how to implement it with PyMC, one of the most popular probabilistic programming libraries in Python, to solve the problem above.

What is Probabilistic Programming?

So, what is probabilistic programming? One informal definition is that probabilistic programming is a way to do Bayesian statistics from an engineering perspective [1]. This is because it lets us define a (complex) Bayesian model and perform parameter inference automatically via random sampling. We do not need to deal with complex distribution functions and their derivations to find the optimum parameter values. In addition, a model involving many variables and various probability distributions often has no analytical solution. Probabilistic programming solves this problem by running a Markov chain Monte Carlo (MCMC) sampling algorithm.

Some probabilistic programming frameworks: PyMC, Stan, Pyro/NumPyro, and TensorFlow Probability.

Applications of Probabilistic Programming

Probabilistic programming can be utilized to perform Bayesian modeling in various use cases, for instance:

  1. Bayesian A/B testing
  2. Bayesian data analysis, e.g. change point detection, clustering, survival analysis
  3. Causal inference
  4. Prediction, e.g. Bayesian multilevel regression model
  5. Decision making under uncertainty, e.g. Bayesian decision theory

Bayesian Modeling

When we discuss Bayesian modeling, we always think about how the data is generated, i.e. a generative model approach. This can be broken down into two general phases: the first is model definition, and the second is parameter inference.

  1. Following the generative modeling approach, we start from the observed data in hand and think about what underlying distribution could have generated it. This distribution is driven by some unknown parameters.
  2. The objective of parameter inference is then to infer these unknown parameters, so that we find the model that best explains the observed data.

The two steps above can be expressed very straightforwardly using a probabilistic programming language.

Bayesian Inference

At its core, Bayesian inference uses Bayes' theorem to update the probability of an event as new data is observed. Recall the Bayes formula: the posterior probability of A given an observed B depends on the prior probability of A and a likelihood function, i.e. how likely B is under A.
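In symbols:

P(A | B) = P(B | A) × P(A) / P(B)

where P(A) is the prior, P(B | A) is the likelihood, P(B) is the evidence, and P(A | B) is the posterior.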

Solving the Bayes equation analytically is often impractical, especially when complex distributions are involved. Hence, an automatic sampling procedure is used to infer the unknown parameters. In our case, the unknown parameter we are trying to infer is p, the probability of a complaint.

Markov Chain Monte Carlo (MCMC) Sampling

The Monte Carlo method is a method that relies heavily on a pseudo-random number sampler [2]. The term Markov chain means that each step of the sampling procedure depends only on the previous sample in the sequence, not on the full history; this is often called the memorylessness property.

The Metropolis-Hastings algorithm is one of the earliest MCMC algorithms, yet it still performs well in general. It works by means of a random-walk procedure, illustrated by the animation in [3].

In general, to estimate the unknown distribution of an unknown parameter, it repeats the following steps:

  1. Sample a proposed value θnew.
  2. Substitute the proposed θnew into the Bayes formula to obtain its posterior value.
  3. Calculate the ratio r of the proposed posterior(θnew) to the previous posterior(θt-1).
  4. Calculate the acceptance probability a = min(r, 1).
  5. If the acceptance probability a = 1, accept the proposal.
  6. Otherwise, draw a random number u and accept the proposal if u < a; reject it otherwise.

We do not need to implement this sampling procedure ourselves, as it is already implemented in every probabilistic programming framework. Still, a toy sketch (below) helps build intuition.
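For intuition only, here is a minimal sketch of a Metropolis sampler (the symmetric-proposal special case of Metropolis-Hastings) for a toy Beta-Binomial model. The data counts, variable names, and proposal width are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed data: 1 complaint out of 10 bookings
n_bookings, n_complaints = 10, 1

def log_posterior(p):
    """Log posterior of p up to a constant, with a flat Beta(1, 1) prior."""
    if p <= 0 or p >= 1:
        return -np.inf  # outside the support of p
    return n_complaints * np.log(p) + (n_bookings - n_complaints) * np.log(1 - p)

samples = []
p_current = 0.5
for _ in range(5000):
    # 1. Propose a new value via a random walk around the current value
    p_new = p_current + rng.normal(0, 0.1)
    # 2-4. Posterior ratio and acceptance probability
    r = np.exp(log_posterior(p_new) - log_posterior(p_current))
    a = min(r, 1.0)
    # 5-6. Accept the proposal with probability a
    if rng.uniform() < a:
        p_current = p_new
    samples.append(p_current)

print(np.mean(samples))  # close to the analytical posterior mean (1 + 1) / (10 + 2) ≈ 0.17
```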

Case Study — Property Complaints

Let's revisit the case study mentioned at the beginning of this post and follow the probabilistic programming framework to solve it. Suppose we want to identify which properties need our immediate attention because of their lower performance, e.g. having many complaints.

The Uncertainty Factor

One straightforward approach is to calculate a complaint ratio (number of bookings with complaints / number of bookings) and then sort the properties in descending order by this ratio. However, as previously mentioned, this approach can be problematic due to small-sample bias.

How certain are we that the ratio numbers are close to the true value?

Consider these following examples:

  1. Property A: 1 complaint from 10 bookings (complaint ratio = 0.1)
    Property B: 10 complaints from 100 bookings (complaint ratio = 0.1)
    Which one is better, A or B?
  2. Property A: 0 complaints from 1 booking (complaint ratio = 0)
    Property B: 1 complaint from 100 bookings (complaint ratio = 0.01)
    Which one is better, A or B?

It is generally known that there is greater variation when the sample size is small [4]. In the first example above, both property A and property B have the same ratio, but property B has more bookings. Since the law of large numbers tells us that estimates from larger samples tend to converge to the true value, we are relatively more certain about property B. The second example carries a similar takeaway: using only a point estimate of the complaint ratio would suggest that property A is the better one (complaint ratio = 0), but this can be misleading because we have only a single data point for the estimate.

Model Definition

When we apply the above framework to our case study of property complaints, we think about what distribution might generate such data and what unknown parameters drive that distribution. In this case study, we can assume that the observed data follow a Binomial distribution whose unknown parameter p defines the chance of getting a complaint from a random booking.

The following shows how we can define such a model using PyMC.
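A minimal sketch of what this model definition might look like; the flat Beta(1, 1) prior on p and the count variables n_bookings and n_complaints are assumptions made for illustration.

```python
import pymc as pm

# Hypothetical observed counts for a single property
n_bookings = 10
n_complaints = 1

with pm.Model() as complaint_model:
    # Unknown parameter: the chance of getting a complaint from a random booking
    p = pm.Beta("p", alpha=1, beta=1)  # flat prior over [0, 1]

    # Likelihood: the number of complaints out of n_bookings follows a Binomial
    complaints = pm.Binomial("complaints", n=n_bookings, p=p, observed=n_complaints)
```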

Parameter Inference via MCMC

Once we have defined the model, we can run the script so that MCMC starts to perform parameter inference. PyMC performs 4 chains of sampling with 5,000 samples in each chain. This process takes a few seconds to complete.
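Continuing the sketch above, drawing posterior samples might look like this (the number of draws and chains mirror the run described here):

```python
with complaint_model:
    # 4 chains x 5,000 draws each; PyMC picks a suitable sampler automatically
    idata = pm.sample(draws=5000, chains=4)
```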

Model Checking

There are many ways to do model checking (see [5] for reference). In this case study we only touch the simplest one: checking whether our model converges. This can easily be done by observing the trace plot, which is the plot of parameter values as a function of sample index.

The trace plot (the plot on the right-hand side) shows that the generated samples converge to a certain range (0–0.4); there is no indication that the sampled values deviate significantly from that range. The left-hand side plot shows the sampled posterior distributions, which also converge towards the same shape.
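With ArviZ, the plotting library PyMC integrates with, such a trace plot can be produced from the sampling result above, for example:

```python
import arviz as az

# Left panel: posterior density per chain; right panel: sampled values per draw
az.plot_trace(idata, var_names=["p"])
```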

Posterior Probability Distribution

The result of the inference process is a probability distribution for each unknown parameter. In this case, we get the probability distribution of the parameter p, which defines the probability of getting a complaint for each particular property.

The wider the distribution, the more uncertain our estimate is. In the example above, the 94% HDI (highest density interval) ranges from 0.008 to 0.35, which, subjectively in this particular domain, is quite wide.
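For reference, a posterior plot and numeric summary like the one described above can be obtained with ArviZ (the 94% HDI is ArviZ's default interval):

```python
# Posterior density of p with the 94% HDI annotated
az.plot_posterior(idata, var_names=["p"])

# Numeric summary including the HDI bounds
print(az.summary(idata, var_names=["p"], hdi_prob=0.94))
```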

Ranking with Uncertainty

Back to our problem: once we have the posterior probability distribution for each property, how do we know which properties should be prioritized first, second, third, and so on?

To answer this question, we can perform property ranking. Then the next question is, what values should the ranking be based on? We cannot rank probability distributions. We need scalar values.

We cannot use the mean because it discards the risk of complaint rates in the right tail of the distribution. Another possible approach is to use the 95th percentile of the distribution: 95% of the time, the chance of getting a complaint will be lower than or equal to that value (or, put the other way around, the risk of the complaint probability being higher than that value is only 5%).
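A minimal sketch of how such a ranking could be computed from posterior samples; the property names and samples below are placeholders, e.g. obtained by fitting the model above once per property.

```python
import numpy as np

# Placeholder posterior samples of p for each property
posterior_samples = {
    "property_A": np.random.beta(2, 11, size=20000),
    "property_B": np.random.beta(11, 91, size=20000),
}

# 95th percentile of each property's complaint-probability posterior
p95 = {name: np.percentile(s, 95) for name, s in posterior_samples.items()}

# Rank properties from highest to lowest risk
ranking = sorted(p95.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```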

After calculating the 95th percentile of each property and ranking by it, we end up with a prioritized list of properties.

As this result suggests, we are not only able to estimate the unknown quantities (i.e. the chance of getting a complaint for each property), but we have also incorporated uncertainty into our decision-making process.

A similar approach can also be beneficial in other business problems where we face uncertain risks and want to choose an optimal decision that takes those risks into account.

Conclusions

  1. Probabilistic programming combines Bayesian statistics with an engineering approach.
  2. Bayesian statistics helps us estimate the uncertainty of our model, and this uncertainty estimate can be incorporated into our decision-making process.

References

  1. Thomas Wiecki — Probabilistic Programming Data Science with PyMC3 (YouTube video)
  2. Markov Chain Monte Carlo — Wikipedia Article
  3. Introduction to Bayesian Statistics Part 2 — MCMC and the Metropolis Hastings Algorithm
  4. Law of Large Numbers — Wikipedia Article
  5. Probabilistic Programming and Bayesian Methods for Hackers
