# Probabilistic Programming

I’ve just started a 2 month placement at the Informatics Lab. Usually I spend my time working on my PhD in Gaussian process emulation for global climate models. But for the next few weeks, I’ll be in the Informatics Lab exploring probabilistic programming for inference in physics-based models.

We are interested in modelling physical systems, such as the atmosphere and the oceans. To build these models we first need to know parameters and initial conditions of the model that control the outcome. Sometimes these parameters or initial conditions are learnt directly from data, which can be surprisingly difficult and is known as **inference**.

Probabilistic programming languages (PPLs) are a tool designed specifically for doing inference. In this post, we will look at what they are and how they can used in the most simple case.

## Bayesian Inference

In Bayesian statistics, probability is viewed as a degree of belief of an event occurring (in contrast to frequentist statistics where probability is the frequency of an event occurring). The Bayesian view of probability is valuable in climate and weather forecasting. It allows us to quantify how likely we expect events to be, such as the probability of rain on a certain day.

In Bayesian statistics, we can incorporate a prior probability: our degree of belief of event occurring before observing any evidence. On an average day, we might expect rain to occur with some probability. We write the *prior probability *of rain as *P*(rain). Then we observe some evidence and update our degree of belief about the event occurring. Let’s say we look out the window and observe some data: we see there are many dark clouds in the sky. So we can update our degree of belief of rain occurring this afternoon. This is called **conditioning on observations**.

We write this updated probability as *P*(rain|cloud). This is a conditional probability, or a probability of one event (rain) assuming that another event has occurred (cloud). In particular, we call this the *posterior probability* of rain because we have observed some data and updated our prior probability. We can use **Bayes’ theorem** to calculate it:

There are two additional terms here. *P*(cloud) is the *evidence* or the overall probability of seeing the cloud, regardless of whether it rains or not. *P*(cloud|rain) is the *likelihood*, the probability of observing the data if you know the outcome i.e. the probability of seeing the cloud if it ends up raining later.

We are usually concerned with learning the posterior probability *P*(rain|cloud), which is why Bayes’ theorem is useful. But it can also be used more generally when we want to learn anything given some data, e.g. we might want to learn parameters of a model. This is called **Bayesian inference**.

This happens to be really useful and has many applications in climate and weather science. It becomes even more powerful when we think about probability distributions, which describe the probabilities associated with all the possible outcomes. One important use of Bayesian statistics is in data assimilation: incorporating observed data from satellites about the current state of the atmosphere into the weather model for prediction of the future state of the atmosphere.

# What is a probabilistic programming language (PPL)?

The traditional approach to constructing statistical models can be time-consuming and requires expertise. Typically it involves statisticians writing down the problem on paper and carrying out a mathematical procedure to design a bespoke statistical model. This would then be translated into code to construct a model ready for use. Since the model is specifically designed for the task at hand, this method must be repeated for each new problem.

Probabilistic programming languages are designed to make this process simpler, quicker and less technical. The idea is to remove the need for the expert statistician to hand-design the statistical model, by embedding this step within the programming language itself. Statistical tasks such as encoding observations (conditioning) and learning latent variables (inference) are automated and some of the complicated mathematical and statistical steps are hidden from the user. This makes statistical modelling more accessible to users and tasks that traditionally require a lot of thought and hand-engineering can be automated.

# What does Probabilistic Programming involve?

PPLs are just ordinary programming languages, equipped with tools that make statistical modelling easier for the user, reducing the need for hand-designing programs. The main components of a PPL are:

- Sampling: Our model is probabilistic- it requires drawing values at random from probability distributions.
- Conditioning: We have some observed data that can be used to update probabilities
- Inference: We can learn something from the known data and model. This could be the parameters of the system or a “latent variable”, some underlying factor that you can’t directly observe but might influence the data. This is usually the hard part.

Probabilistic programming is a suitable choice when we have a probabilistic model, that relies on sampling from distributions in order to make predictions. PPL makes it easier to do conditioning on observed data and to learn something about the model. Note that probabilistic programming languages are not new. They have been around since Simula in 1966. However, with the rise of deep learning, they are growing in popularity, with many new PPLs designed with inference for machine learning in mind (e.g. Pyro, Edward, Infer.Net, webppl).

Let’s look at a simple example to show how probabilistic programming works. We’ll be using Pyro, a probabilistic programming language built on top of PyTorch in Python. Most of this example comes from the Introduction to Inference in Pyro tutorial.

# A simple model for a moving cloud

Let’s say we have a cloud moving through the atmosphere with some speed in 1 dimension. We observe it after 10 seconds but with some measurement error. We want to use a probabilistic approach to find the distribution of the wind speed.

## 1) Statistical Model

The wind speed is something we can’t directly observe — it is a hidden variable. But we have some prior knowledge for what kind of wind speed to expect. We use this to sample from a prior distribution for the cloud speed with an initial guess for wind speed mean (5 m/s) and standard deviation (2 m/s). The cloud moves a distance (speed **×** time) and its final position is observed with some measurement error.

The `pyro.sample`

statements are the first component of PPL, allowing us to sample from distributions. We provide Pyro with unique names for each variable, so they can be tracked. We can run this model many times to find the distribution of cloud speeds and observed positions.

## 2) Conditioning

What if we observe the cloud at a certain point, *x=30 *m? We want to use this information in our model to learn more about the latent variable, speed. This is when we condition on our observations.

In Pyro, this is done with the `pyro.condition`

statement. This statement returns a new function almost identical to the original, but overwrites the `pyro.sample("position",...)`

by fixing it at the observed value.

The rest of the function remains the same. We can check this by sampling speed and position many times:

Our new function, `conditioned_propagate`

, always returns the same value for position, fixed at the observed value, but the distribution of the windspeed is unchanged. Speed is still sampled from the normal distribution we provided in our original function, `propagate_cloud`

.

## 3) Inference

The next step is inference — the difficult part. The aim is to find the posterior distribution of the speed, given our model and observation above.

Being able to calculate this analytically is rare, so we will estimate the distribution using the stochastic variational inference algorithm. This requires a distribution called a **guide** (or variational distribution) that is an approximation to the posterior distribution we want to know. The guide relies on additional parameters that will be tuned to “guide” it towards the posterior distribution.

For this example, we will sample the wind speed from a normal distribution with mean and variance given by the new parameters, *a* and *b*. These are first defined with the statement `pyro.param(...)`

and are stored in the “Pyro Parameter Store”. The values we provide here for *a* and *b* aren’t too important, as we will optimise these in the next step.

We can check the distribution of the guide before optimising *a* and *b.*

I mentioned that we can’t usually find an analytical posterior distribution, but in our simple example we can! Using properties of normal distributions we can calculate that posterior distribution of the cloud speed is another normal distribution with a mean of 3.4 m/s and a standard deviation of 0.89 m/s, shown in red above. Our initial guess is far from the ground truth.

Next we want to optimise the guide function to get a the best approximation to the posterior distribution, according to some loss function- we’ll use the Evidence Lower Bound Operator or “ELBO”. We use stochastic variational inference , `pyro.infer.SVI`

:

It looks like our guide has converged, so now we can compare our new optimised guide to the exact distribution:

The result is that after optimisation, the distribution of the guide looks almost identical to the exact posterior. So we have a good approximation to the posterior distribution which we can easily draw samples from.

This example used stochastic variational inference, which gives us a way to approximate a posterior distribution, but we could have used Markov Chain Monte Carlo to learn a posterior distribution exactly at a higher computational cost.

Remember that this is all something that we could do without a probabilistic programming language — but using one makes it easier and requires less code, demonstrated here with a simple inference example. This is powerful as we could use it to do inference on more complex physical models, which may have useful applications in climate and weather science.

# Further Reading

See the notebook version of this here: https://github.com/informatics-lab/probabilistic-programming.git

## Read more about Bayesian statistics

- https://www.probabilisticworld.com/what-is-bayes-theorem/
- https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348

## Read more about probabilistic programming languages

- https://pyro.ai/examples/
- http://adriansampson.net/doc/ppl.html
- https://tenfifty.io/probabilistic-programming-proves-poland-was-robbed-in-chess/
- https://www.cs.ubc.ca/~poole/papers/IndependentChoices.pdf
- https://methods.co.uk/blog/the-future-of-artificial-intelligence-part-1-probabilistic-programming-languages/