What are Probabilistic Programming Languages and why they might be useful for you

David Hoyle
dunnhumby Science blog
6 min read · Apr 11, 2022

David Hoyle, Price & Promotion Science, dunnhumby

Why use a Probabilistic Programming Language?

Let’s imagine you want to create a predictive model, but your problem is somewhat unusual. You can write the problem down on paper in a succinct mathematical form, but you find there is no existing package or module with algorithms for estimating this kind of model. You can’t simply write:

from sklearn import MyComplexModel

my_complex_model_instance = MyComplexModel()
my_complex_model_instance.fit(training_data_df)

You’re stuck! What do you do? The answer is obvious: you build your own algorithm. You could do this in a general-purpose high-level language such as Python, but it may become tedious or downright painful in places, because the built-in data structures and language constructs are not optimized for probabilistic models.
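To see why rolling your own gets painful, here is a minimal sketch of the kind of machinery you end up hand-writing in plain Python: a random-walk Metropolis sampler for the mean of a simple Normal model. Everything here (the model, step size, and burn-in) is illustrative, not from any particular package; the point is that a PPL generates this bookkeeping for you.

```python
import math
import random

def log_posterior(mu, data, prior_sd=5.0):
    """Unnormalized log posterior: Normal(0, prior_sd) prior on mu,
    Normal(mu, 1) likelihood for each observation."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = sum(-0.5 * (x - mu) ** 2 for x in data)
    return log_prior + log_lik

def metropolis(data, n_steps=2000, step_size=0.5, seed=0):
    """Hand-rolled random-walk Metropolis sampler for mu."""
    rng = random.Random(seed)
    mu = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        log_accept = log_posterior(proposal, data) - log_posterior(mu, data)
        if math.log(rng.random()) < log_accept:
            mu = proposal  # accept the proposed move
        samples.append(mu)
    return samples

data = [2.1, 1.9, 2.3, 2.0, 1.8]  # made-up observations
samples = metropolis(data)
# discard the first 500 samples as burn-in
posterior_mean = sum(samples[500:]) / len(samples[500:])
```

And this is the easy case: for anything beyond a single parameter you would also need tuning, convergence diagnostics, and vectorization, all of which a PPL handles for you.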

Wouldn’t it be great if there was a high-level language that just understood probabilistic models and the things you want to do with them? Enter Probabilistic Programming Languages (PPLs). In a PPL, probabilistic models are first-class citizens.

What is a Probabilistic Programming Language?

At the simplest level, a PPL is a language that makes it easy to express any probabilistic model you want. PPLs are great for rapid prototyping of predictive probabilistic models.

A few of the more well-known PPLs (but not an exhaustive list) are:

  • Stan: A PPL from Andrew Gelman, Bob Carpenter, and others, primarily at Columbia University. It has its own model specification language, which is translated to C++ code that is then compiled to machine code. Good R and Python interfaces are available. Details from https://mc-stan.org/.
  • pyMC3: A widely used open-source Python PPL with a Theano back-end, from John Salvatier, Thomas Wiecki, Chris Fonnesbeck and other contributors. Details from https://docs.pymc.io/en/v3/.
  • Edward and Edward2: TensorFlow-based PPLs from Dustin Tran at Google, David Blei at Columbia University, and other collaborators, each introduced in its own paper. More details and code from https://github.com/google/edward2.
  • Pyro: A PPL created by Uber AI Labs with a PyTorch back-end. Details from https://pyro.ai/

How to use PPLs

In practical terms, the expression of a model in PPL code is very similar to how you would express the model in maths. To give a simple example, the left-hand side of Figure 1 below shows how I would write, on paper or a whiteboard, that a variable is Normally distributed when discussing a modelling problem with a colleague; the right-hand side shows how I express the same fact in Stan code, where y ~ normal(mu, sigma); declares that y is Normally distributed with mean mu and standard deviation sigma.

Because of this narrower domain focus of PPLs compared to general purpose high-level programming languages, PPLs have an inbuilt understanding of the inference tasks that you would want to do with any probabilistic model. In Stan, tasks such as fitting a model by optimizing the posterior of the model parameters, or sampling from the posterior, are as simple as writing,

model.optimizing(data)

Okay, I hear you say, but fitting a model using sklearn is just as succinct. I just write:

model.fit(training_data)

The key difference is that with a PPL we can succinctly express inference tasks for any model. We are not restricted to just the model types that are available in sklearn or whatever package we happen to be using. This is the power that PPLs give us:

  1. Succinct and expressive declaration of probabilistic models.
  2. Succinct expression of inference tasks.

Imagine a more complex, but realistic, scenario where we want to model the products chosen by shoppers, and how those choices change as we change things such as price, promotions, and marketing. Mathematically, we have a set of observations:

y_i,  i = 1, …, N,

which are drawn from a discrete distribution with K possible options. If observation i has the value k, it means that product k was chosen in observation i. The probability of seeing product k chosen in observation i is conditional on a D-dimensional vector of features:

x_ik ∈ ℝ^D,  the feature vector for observation i and product k.

The precise relation between the features and the probability of choosing the product is through the linear predictor:

η_ik = x_ik · β,

the inner product of the feature vector x_ik with the vector β of model parameters.
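To make the mapping from linear predictors to choice probabilities concrete, here is a small pure-Python sketch (the variable names and feature values are my own, for illustration). It applies the softmax transformation that the categorical-logit distribution uses:

```python
import math

def choice_probabilities(X_i, beta):
    """Map the K linear predictors eta_ik = x_ik . beta to choice
    probabilities with a numerically stable softmax."""
    etas = [sum(x * b for x, b in zip(x_ik, beta)) for x_ik in X_i]
    m = max(etas)  # subtract the max before exponentiating, for stability
    exps = [math.exp(e - m) for e in etas]
    total = sum(exps)
    return [e / total for e in exps]

# K = 3 products, D = 2 features per product (values made up for illustration)
X_i = [[1.0, 0.0], [1.2, 1.0], [0.8, 0.0]]
beta = [-1.0, 0.5]
probs = choice_probabilities(X_i, beta)  # one probability per product, summing to 1
```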

The parameters in the linear predictor are drawn from a Normal prior with a broad variance. Discussing the business problem with a colleague at a whiteboard, I write this mathematically as:

y_i ~ CategoricalLogit(X_i β)
β_d ~ Normal(0, 5),  d = 1, …, D

where X_i is the K × D matrix whose k-th row is x_ik, so that X_i β is the vector of K linear predictors, and each element of β is assigned a Normal prior with zero mean and standard deviation 5.
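Read as a generative recipe, the model says: draw β once from its prior, then for each observation compute the K linear predictors and draw the chosen product. A pure-Python simulation of that recipe (the dimensions and feature values are made up for illustration):

```python
import math
import random

rng = random.Random(42)
N, K, D = 100, 3, 2  # illustrative sizes

# beta_d ~ Normal(0, 5), drawn once for the whole dataset
beta = [rng.gauss(0.0, 5.0) for _ in range(D)]

def softmax(etas):
    """Numerically stable softmax."""
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    s = sum(exps)
    return [e / s for e in exps]

Y = []
for i in range(N):
    # features x_ik for each of the K products (made-up values in [0, 1))
    X_i = [[rng.random() for _ in range(D)] for _ in range(K)]
    etas = [sum(x * b for x, b in zip(x_ik, beta)) for x_ik in X_i]
    probs = softmax(etas)
    # y_i ~ Categorical(probs): draw the chosen product, coded 1..K
    u, cum, y = rng.random(), 0.0, K
    for k, p in enumerate(probs, start=1):
        cum += p
        if u < cum:
            y = k
            break
    Y.append(y)
```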

In Stan code we write this almost identically to the mathematical expression. The relevant lines of Stan code would be:

matrix[K, D] X[N];
int Y[N];
vector[D] beta;

beta ~ normal(0, 5);
for (n in 1:N)
    Y[n] ~ categorical_logit(X[n] * beta);

With a couple more lines of Stan code to declare some variables and organize everything into the appropriate blocks, we are good to go: we can fit the model and start doing inference with it. I can go from whiteboard to inference this quickly for almost any mathematical model form I want.
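To see what an optimizing fit amounts to, here is a pure-Python sanity check (not Stan's actual implementation, which uses automatic differentiation and L-BFGS): a maximum-a-posteriori fit of the same categorical-logit model by plain gradient ascent on the log posterior, on a tiny synthetic dataset.

```python
import math
import random

def softmax(etas):
    """Numerically stable softmax."""
    m = max(etas)
    exps = [math.exp(e - m) for e in etas]
    s = sum(exps)
    return [e / s for e in exps]

def log_posterior(beta, X, Y, prior_sd=5.0):
    """Log posterior of the categorical-logit model, up to a constant."""
    lp = sum(-0.5 * (b / prior_sd) ** 2 for b in beta)
    for X_i, y in zip(X, Y):
        etas = [sum(x * b for x, b in zip(x_ik, beta)) for x_ik in X_i]
        m = max(etas)
        lse = m + math.log(sum(math.exp(e - m) for e in etas))
        lp += etas[y - 1] - lse  # log prob of the observed choice y
    return lp

def gradient(beta, X, Y, prior_sd=5.0):
    """Gradient of the log posterior with respect to beta."""
    g = [-b / prior_sd ** 2 for b in beta]  # prior contribution
    for X_i, y in zip(X, Y):
        etas = [sum(x * b for x, b in zip(x_ik, beta)) for x_ik in X_i]
        probs = softmax(etas)
        for d in range(len(beta)):
            # observed features minus their model-expected value
            g[d] += X_i[y - 1][d] - sum(p * x_ik[d] for p, x_ik in zip(probs, X_i))
    return g

# Tiny synthetic dataset: N = 50 observations, K = 3 products, D = 2 features.
rng = random.Random(1)
true_beta = [1.5, -1.0]  # made-up "true" parameters
X, Y = [], []
for _ in range(50):
    X_i = [[rng.random(), rng.random()] for _ in range(3)]
    probs = softmax([sum(x * b for x, b in zip(x_ik, true_beta)) for x_ik in X_i])
    u, cum, y = rng.random(), 0.0, 3
    for k, p in enumerate(probs, start=1):
        cum += p
        if u < cum:
            y = k
            break
    X.append(X_i)
    Y.append(y)

# Plain gradient ascent towards the MAP estimate.
beta = [0.0, 0.0]
lp_start = log_posterior(beta, X, Y)
for _ in range(200):
    g = gradient(beta, X, Y)
    beta = [b + 0.02 * g_d for b, g_d in zip(beta, g)]
lp_end = log_posterior(beta, X, Y)
```

A PPL spares you from deriving and coding that gradient by hand; change the model form, and the inference machinery follows automatically.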

The Pros and Cons of PPLs

How and why do we get this apparent ‘free lunch’? By narrowing the domain to that of probabilistic modelling: PPLs are not general-purpose programming languages.

What we do not gain is complete ease-of-use. PPLs are not a No-Code, Low-Code or even an AutoML solution. The target user-base is likely to be Data Scientists with a thorough understanding of the underlying statistical principles and methodologies.

However, a PPL can be used as the basis of an AutoML solution in a specific domain or for a subset of specific tasks. This is what happened with Facebook’s open-sourced forecasting tool Prophet, which uses Stan under the hood.

It is also how we have used PPLs within the Price & Promotion Science team at dunnhumby, where we built a tool for experimenting with demand forecasting models. A demand forecasting model predicts how much of a product a retailer will sell given inputs such as price, how the product is being marketed, the time of year, and so on. Constructing the most accurate predictions means continually improving our models, an R&D process in which we experiment with different mathematical forms for our demand models. The expressiveness and flexibility of PPLs mean we can do that experimentation rapidly.

Clearly, no language has it all; there is always a trade-off between orchestration overhead and expressiveness. The more specific the task, the more we know upfront (at design time) about what the orchestration tasks will be, so they can be internalized within the tool, and the tool becomes an interface for specifying the configuration of a task. This is essentially a No-Code or Low-Code solution, requiring minimal or no extra orchestration code from the user. At the other end we have general-purpose languages and PPLs, where extra orchestration code is required. For PPLs, however, the extra orchestration code needed when iterating on the model form is minimal, and this is what supports rapid experimentation.

Summary

  • Probabilistic Programming Languages (PPLs) are languages where probabilistic models are first-class citizens. They make expressing probabilistic models easy.
  • PPLs are great for rapidly prototyping and experimenting with different forms of models.
  • PPLs are not general-purpose programming languages and are not Low-Code or No-Code solutions. You will still have to write additional orchestration code and data-wrangling code in a host language.

David Hoyle

A Data Scientist with 10 years’ commercial experience and 20 years in academia. My background (PhD and BSc) is originally in theoretical physics.