How we have used Probabilistic Programming Languages at dunnhumby

dunnhumby Science blog
Apr 13, 2022

David Hoyle, Price & Promotion Science, dunnhumby

(This is the second article in our series about Probabilistic Programming Languages. The first article introducing PPLs is here)

In the Price & Promotion science team at dunnhumby we experiment with a wide variety of demand forecasting models — models that predict how much of a product a retailer will sell given inputs such as price, how the product is being marketed, and the time of year.

To support that experimentation process we wanted to create a tool that allowed us to rapidly prototype different mathematical forms of demand model. Probabilistic Programming Languages (PPLs) are ideal for this sort of task, as we have described in a previous post.
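To make that concrete, the sketch below shows the kind of model a PPL lets you prototype in a handful of lines. It uses NumPyro purely for illustration; the constant-elasticity model form, priors and variable names are hypothetical and deliberately simplistic, not our production demand model.

```python
# Illustrative only: a toy constant-elasticity demand model written in NumPyro.
# The model form, priors and variable names are hypothetical, not dunnhumby's
# production demand model; they simply show how little code a PPL needs.
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS


def demand_model(log_price, on_promo, log_units=None):
    # log(expected units) = baseline + elasticity * log(price) + promo uplift
    baseline = numpyro.sample("baseline", dist.Normal(0.0, 5.0))
    elasticity = numpyro.sample("elasticity", dist.Normal(-1.0, 1.0))
    promo_uplift = numpyro.sample("promo_uplift", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mu = baseline + elasticity * log_price + promo_uplift * on_promo
    numpyro.sample("log_units", dist.Normal(mu, sigma), obs=log_units)


# Changing the model form (e.g. adding seasonality terms) is an edit to
# demand_model only; the inference call below stays exactly the same.
mcmc = MCMC(NUTS(demand_model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0),
         log_price=jnp.log(jnp.array([1.0, 1.2, 0.9, 1.1])),
         on_promo=jnp.array([0.0, 0.0, 1.0, 0.0]),
         log_units=jnp.log(jnp.array([120.0, 95.0, 210.0, 105.0])))
mcmc.print_summary()
```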

In fact, our forecasting process is typically a pipeline of several models, each within its own stage, and so we didn’t want to build a narrow tool. Instead, we wanted a tool that supported:

1. Rapid experimentation with the composition of the pipeline, in terms of what stages we include.

2. Rapid experimentation with the model within each stage of the pipeline.

In supporting requirement #2 we didn’t want to place any limits on the mathematical form of the model used within a stage. PPLs were obviously ideal for this requirement. We knew we would have to write orchestration code around each PPL model. Schematically, we can think of a modelling stage as PPL model code top-and-tailed by orchestration or data-wrangling code, as shown in Figure 1 below:

Figure 1: Schematic representation of a modelling stage containing a single model and its associated supporting code.
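In code, the stage shown in Figure 1 might look something like the sketch below. The interface (a DataFrame plus a config dict in, a DataFrame out) and the name elasticity_stage are hypothetical, and it reuses the toy demand_model from the earlier sketch; the point is simply that the PPL model sits in the middle of ordinary data-wrangling code.

```python
# A sketch of Figure 1 in code: a single modelling stage whose PPL model is
# top-and-tailed by orchestration code. The stage interface and names are
# hypothetical; assumes the demand_model function sketched above is in scope.
import numpy as np
import pandas as pd
from jax import random
from numpyro.infer import MCMC, NUTS


def elasticity_stage(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    # "Top" orchestration: wrangle the raw DataFrame into model-ready arrays.
    log_price = np.log(df["price"].to_numpy())
    on_promo = df["on_promo"].to_numpy().astype(float)
    log_units = np.log(df["units"].to_numpy())

    # PPL model code: fit the toy demand_model defined earlier in this post.
    mcmc = MCMC(NUTS(demand_model),
                num_warmup=config["num_warmup"],
                num_samples=config["num_samples"])
    mcmc.run(random.PRNGKey(config.get("seed", 0)),
             log_price=log_price, on_promo=on_promo, log_units=log_units)

    # "Tail" orchestration: turn posterior samples back into a tidy DataFrame
    # that the next stage of the pipeline can consume.
    return pd.DataFrame(mcmc.get_samples())
```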

A modelling pipeline then looks like Figure 2:

Figure 2: Schematic representation of a pipeline of PPL modelling stages.

To support requirement #1 we needed to create a tool that allowed users to easily compose and run pipelines. That pipeline creation tool had its own requirements:

  1. We didn’t want to impose any a priori specification of what the interface to a model within a stage should be, other than that the model uses a PPL. A Data Scientist writing a new modelling stage simply writes the PPL model code, writes the additional orchestration code required, and chooses what config to surface. The new modelling stage is then available for other Data Scientists to use in their pipelines.
  2. We wanted pipeline specification to be highly flexible, requiring configuration only — no additional orchestration code (an illustrative config sketch follows this list).
  3. We wanted a means of easily running pipelines just from the config specified in (2). Again, there should be no requirement for the user to write new code, and running a pipeline should be reproducible and auditable.
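To make requirements 2 and 3 concrete, the sketch below shows what a pipeline specification might look like. The structure, stage names and options are hypothetical rather than our actual config schema; the point is that composing a pipeline is configuration, not code.

```python
# Illustrative only: a hypothetical pipeline specification. Stage names and
# options are invented for this sketch; they are not dunnhumby's real schema.
pipeline_config = {
    "name": "promo_elasticity_experiment",
    "stages": [
        {"stage": "prepare_sales_data",
         "config": {"min_weeks_of_history": 52}},
        {"stage": "elasticity_stage",   # the stage sketched above
         "config": {"num_warmup": 500, "num_samples": 1000, "seed": 0}},
        {"stage": "forecast_evaluation",
         "config": {"holdout_weeks": 8}},
    ],
}
```

Because the whole experiment is captured in one config object like this, it can be stored and logged alongside the outputs, which is what makes runs reproducible and auditable.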

I’m not going to go into the details of the engineering and software principles we followed when implementing our pipeline designer tool — that is a whole blog post on its own — other than to highlight two points:

Firstly, the simple linear nature of our pipelines meant it was easy for us to build our own pipeline tool. We could have used Airflow or Luigi. However, writing our own lightweight tool gave us more flexibility in the configuration that can be passed to that tool, allowing us to focus on supporting the user in specifying rich and expressive configuration. In effect we adopted a paradigm of ‘pipeline config as pipeline code’: the pipeline config becomes the expression of the whole modelling solution, even when we are running at semi-productionized scale. The outcome is that we have taken advantage of the expressiveness of PPLs to build a ‘low-code’ pipeline experimentation tool. For example, we have used our pipeline tool to perform experiments in modifying our usual demand model form, in a setting where we run on real production-scale data and build 200,000 product-level models in a single run.
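As a rough illustration of why a home-grown runner is a light lift for purely linear pipelines, the sketch below registers stages by name and then folds the configured stages over a DataFrame. The registry pattern and function names are hypothetical; the real tool also handles logging, persistence and audit trails.

```python
# A minimal, hypothetical sketch of a linear pipeline runner. Not dunnhumby's
# actual tool: it only shows why "config in, results out" is cheap to build.
from typing import Callable
import pandas as pd

StageFn = Callable[[pd.DataFrame, dict], pd.DataFrame]
STAGE_REGISTRY: dict[str, StageFn] = {}


def register_stage(name: str) -> Callable[[StageFn], StageFn]:
    """Make a stage available to every pipeline by name."""
    def decorator(fn: StageFn) -> StageFn:
        STAGE_REGISTRY[name] = fn
        return fn
    return decorator


def run_pipeline(pipeline_config: dict, df: pd.DataFrame) -> pd.DataFrame:
    # Because the pipeline is linear, running it is just a fold over the
    # configured stages: each stage gets the previous output and its config.
    for stage_spec in pipeline_config["stages"]:
        stage_fn = STAGE_REGISTRY[stage_spec["stage"]]
        df = stage_fn(df, stage_spec["config"])
    return df
```

A Data Scientist adding a new stage would decorate their stage function with register_stage("my_stage_name"), and it immediately becomes usable from any pipeline config.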

Secondly, we can use different PPLs in different stages of the pipeline. For example, if for one stage we needed a Bayesian neural network and we felt that Edward2 was the best PPL for constructing a neural network model, then we simply code that stage using Edward2 as its PPL engine. In fact, with most PPL tasks typically taking only a dataframe and a dictionary of config options from the host language as input, it required minimal changes to let the user choose between several PPL engines for a given stage. This reflects the over-arching principle that we want the tool to support the user in focusing on the ‘what’, not the ‘how’, of the task. So, if we want a stage to draw 10,000 parameter samples from the posterior of a neural network, we don’t care which PPL engine is used to do that sampling.
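As a final sketch, engine choice can itself be just another piece of config. Everything below is hypothetical scaffolding rather than our actual code: the helper functions stand in for real NumPyro and Edward2 model code, and the only fixed thing is the stage’s output contract.

```python
# Hypothetical sketch of "engine choice as configuration". The helpers are
# placeholders for real PPL code; the stage's contract is what matters.
import pandas as pd


def _sample_with_numpyro(df: pd.DataFrame, config: dict) -> dict:
    # Placeholder: would run NUTS on a NumPyro model such as the demand_model
    # sketched earlier, returning a {parameter_name: draws} mapping.
    raise NotImplementedError


def _sample_with_edward2(df: pd.DataFrame, config: dict) -> dict:
    # Placeholder: would sample a Bayesian neural network written in Edward2,
    # returning the same {parameter_name: draws} structure.
    raise NotImplementedError


def sample_posterior_stage(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    # The config says *what* is wanted (e.g. 10,000 posterior draws) and which
    # engine to use; the stage does not care *how* that engine works.
    engines = {"numpyro": _sample_with_numpyro, "edward2": _sample_with_edward2}
    samples = engines[config.get("engine", "numpyro")](df, config)
    return pd.DataFrame(samples)  # one row per draw, one column per parameter
```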

Summary:

  • PPLs are great for rapidly prototyping and experimenting with different forms of models.
  • PPLs allow your Data Scientists to focus more on the ‘what’ and not the ‘how’ when building modelling pipelines.
  • At dunnhumby we have incorporated PPLs into modelling pipelines. We have used our pipeline experimentation tool to run a new model form on real production-scale data, building and assessing 200,000 product-level models.

David Hoyle
dunnhumby Science blog

A Data Scientist with 10 years of commercial experience and 20 years in academia. My background (PhD and BSc) is originally in theoretical physics.