Why and how to integrate Causality in predictive modelling

Causal Wizard app
8 min read · Aug 24, 2023


Maybe you’re a scientist, data scientist, machine-learning practitioner, or really any professional with some statistical experience. You’ve heard about Causality and the difference between associative models (correlations, etc.) and causal models — but you’re not sure what that means in practice.

We hope that at the end of this article you will make a small change to the way you approach predictive and statistical projects, especially when you’re working with historical data or an observational study design. And most especially when you’re thinking about managing or implementing a change to the system.

There are two parts: first, why you should incorporate Causality; second, how to do it.

Why you should incorporate Causality

What’s wrong with predictive models anyway?

Of course, the aim with all predictive models (including, for example, deep learning models) is to train them to generalize within the statistics of the training data. They should do well under these conditions, regardless of whether they are causal models, as long as the statistics of the training data cover all the causal scenarios sufficiently.

Problems arise when we ask these models to extrapolate beyond the training data statistics — also known as out-of-domain data. This doesn’t simply mean changes to individual variable or feature values; it can also be a change in the joint distribution of some variables. Or perhaps there’s just a change in the frequency of certain combinations of variables, which were under-represented in the original data but are now common. In these conditions, ordinary predictive models can be very wrong — but causal models may still be able to predict accurately.

So if you’re planning to use non-causal (associative) models to:

  • Predict the effects of a change or intervention to a system, or
  • Predict under changed conditions / statistics due to a cause you can’t control, or
  • Understand counterfactual scenarios which aren’t well represented in your historical data

… you’re risking a significant (and un-measurable) drop in model performance unless you try to understand the causal effects of these changes. We will now explore just how and why models and researchers get it wrong.

False perceptions of Causal conclusions

There’s a temptation to ignore causality, try to answer inherently causal questions with associative methods, and then cover your ass with statements like “since our study is associative, no causal conclusion can be made”. This is like selling acne cures with small print stating that the product does not cure acne. Either you’re trying to prove a causal relationship, or you’re not.

Schrödinger’s Causal Inference — Peter Tennant on Twitter

What’s worse, not only are you deceiving yourself but you’re also misleading readers and stakeholders who will read causal conclusions from your crypto-associative studies:

People draw causal conclusions from statements about association. Paper — causal implicatures from correlational statements (PDF).

False correlations

You know that reading causation from correlation is a no-no, but how bad can it be, if you’ve got a really strong correlation? Well, bad. The strength of the correlation is actually irrelevant; these correlations may reflect other, latent causal relationships which will behave completely differently under the changed conditions in which you plan to use your model.

If you haven’t attempted to model the cause of a correlation, that correlation says nothing at all about the relationship between the variables under conditions different from those in your existing data.

“Spurious” correlations are everywhere. If you’ve got many features (variables), you’ll find tons of correlations which don’t mean what you think they mean.

Spurious correlations from Tyler Vigen’s site. Visit for more hilarity.

Simpson’s Paradox

Simpson’s paradox occurs in a dataset where two variables X and Y are correlated in one direction overall, but simultaneously, every sub-group in the dataset displays the opposite correlation between X and Y. The relationship within the subgroups is the reverse of the relationship in the overall population!

The figure below explains how this can occur. You can see that the correlation between X and Y would be negative for all dots, while being positive for each coloured subgroup:

Simpson’s paradox: Over the whole population, a negative correlation between X and Y is observed. But after controlling for another variable which divides the population into 4 subgroups (black, red, blue and green), all subgroups display a positive correlation between X and Y! Controlling or stratifying by the additional variable completely reverses the association.

Why is Simpson’s paradox important? It shows that if you fail to consider the effects of other variables, correlation can easily lead you to find a strong result which is the inverse of the correct conclusion.
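If you want to see this for yourself, here is a minimal Python sketch. The group offsets, slopes and noise levels are arbitrary choices for illustration, not taken from any real dataset; the point is that the pooled correlation has the opposite sign to every within-group correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Four subgroups: within each, X and Y are positively related, but each
# successive group is shifted right (higher X) and down (lower Y), so the
# pooled correlation comes out negative.
xs, ys = [], []
for group in range(4):
    x = rng.normal(loc=2.0 * group, scale=0.5, size=200)
    y = 0.8 * (x - 2.0 * group) - 1.5 * group + rng.normal(0.0, 0.3, size=200)
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)
print("pooled correlation:", np.corrcoef(x_all, y_all)[0, 1])  # negative
for group in range(4):
    # each subgroup shows a positive correlation
    print(f"group {group} correlation:", np.corrcoef(xs[group], ys[group])[0, 1])
```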

The perils of over-controlling

You can’t simply control for as many variables as possible, either. This can actually make things worse. For example, if you control for all the mediators of a cause, you can eliminate the real effect you’re trying to measure!

Over-controlling for the diseases caused by high BMI eliminates the effect on mortality, leading to confused authors and readers. Everything could have been OK if they’d sat down and sketched out the causal mechanisms they were investigating. Original tweet.

The only way to identify the correct set of variables to control for, condition on, or use as covariates in your predictive model is to perform Identification on a Causal Diagram.

For example, in the simple example below, we see different arrangements of 3 variables; the direction of the red arrows determines the role of the third variable in each case, and what to do with it. In the case of colliders and mediators, you want to avoid controlling. This article explains more.

Analysis of a Causal Diagram allows identification of the roles of confounding, collider and mediator variables. These roles then define appropriate handling for each variable, given the effect of interest.
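To make these roles concrete, here is a small simulated sketch. The structural equations and coefficients are invented for illustration (they are not from the BMI study above): X affects Y only through the mediator M, and both X and Y feed into the collider C. Controlling for the mediator hides the real effect, and controlling for the collider manufactures a biased one:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)
m = 2.0 * x + rng.normal(size=n)   # mediator:  X -> M -> Y
y = 1.5 * m + rng.normal(size=n)   # true total effect of X on Y is 3.0
c = x + y + rng.normal(size=n)     # collider:  X -> C <- Y

def coef_of_x(outcome, *controls):
    """OLS coefficient on X, with the listed controls and an intercept."""
    design = np.column_stack([x, *controls, np.ones(n)])
    beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return beta[0]

print("no controls:           ", coef_of_x(y))     # ~3.0, the true total effect
print("controlling mediator M:", coef_of_x(y, m))  # ~0.0, the effect is wrongly removed
print("controlling collider C:", coef_of_x(y, c))  # biased away from 3.0
```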

Avoiding nonsensical conclusions from predictive models

I was playing with a Bayesian Network editor called the Causal Attribution Tool (CAT) and specifically with an example which explores the relationship between hair colour and the chances of getting tenure, the holy grail of all academics. The system is as follows:

Relationship between hair colour and tenure. Overall chance of tenure is 43%.

This diagram says that Age/Time in field causes Academic Record strength, chances of Tenure, and Hair colour. Academic Record strength also “causes” (i.e. has a causal effect on) the chances of getting Tenure.

The nice thing about CAT is we can ask associative and causal questions. If we ignore the causal arrows and just use this as an ordinary associative/prediction model, then what are our chances of getting Tenure if we set hair colour to White?

Set hair colour to white, and an associative model predicts 58% chance of tenure.

58% chance of tenure, up from 43%! But before you go buy a bottle of hair dye, what does the Causal model say? Since there is no directed path from Hair colour to Tenure, the causal model (correctly) says our chances are unchanged at 43%.

Just for fun, I drew the same Causal Diagram in the Causal Wizard app, and it said:

The same Causal Diagram in the Causal Wizard app. You can draw any Causal Diagram in the app, even without data.
Here’s what Causal Wizard says about the effect of Hair Colour on Tenure, given the graph above. It has no effect!

So by using our domain knowledge to draw a Causal Diagram, we are easily able to avoid reaching a false conclusion.
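You can reproduce the same qualitative behaviour with a few lines of simulation. The structural equations and coefficients below are invented, so the exact percentages will not match the 43% and 58% above, but the graph is the one in the diagram: Age drives Hair colour, Academic record and Tenure, and Record drives Tenure. Conditioning on white hair raises the predicted tenure rate, while intervening on hair colour (the do-operator) leaves it essentially unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

def simulate(do_white=None):
    """Toy structural model: Age -> {Record, HairColour, Tenure}, Record -> Tenure."""
    age = rng.uniform(25, 70, size=n)
    record = 0.05 * age + rng.normal(0.0, 1.0, size=n)
    if do_white is None:
        white = rng.uniform(size=n) < (age - 25) / 45  # older academics: whiter hair
    else:
        white = np.full(n, do_white)                   # do(): set everyone's hair colour
    p_tenure = 1.0 / (1.0 + np.exp(-(0.04 * age + 0.8 * record - 5.0)))
    tenure = rng.uniform(size=n) < p_tenure
    return white, tenure

white, tenure = simulate()
print("P(tenure)                   ", tenure.mean())
print("P(tenure | hair is white)   ", tenure[white].mean())  # higher: white hair predicts tenure
_, tenure_do = simulate(do_white=True)
print("P(tenure | do(hair = white))", tenure_do.mean())      # ~unchanged: dyeing does nothing
```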

How to integrate Causality and causal methods into predictive ML

So assuming you’re convinced this causality thing is worth looking into, what should you do? Here are 3 steps you can take immediately, in many projects where you’re already using associative models.

Step 1: Model the problem with your SMEs

Work with people who understand the system being studied and try to capture their knowledge in a Causal Diagram. A Causal Diagram is simply a network of nodes (representing variables, also known as features, or independent and dependent variables) and directed edges between the nodes. If an edge is present, it means the source node has a direct causal effect on the target node. You can draw a Causal Diagram in your browser using Causal Wizard. It’s free. Or you can draw it on a piece of paper. It’s very easy. The important thing is the structure you capture.
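If you prefer to keep the diagram alongside your code, a plain directed graph is all you need. Here is a minimal sketch using networkx; the node names are borrowed from the tenure example above, so substitute the variable names from your own data:

```python
import networkx as nx

# Each edge means "source has a direct causal effect on target".
diagram = nx.DiGraph([
    ("Age", "AcademicRecord"),
    ("Age", "HairColour"),
    ("Age", "Tenure"),
    ("AcademicRecord", "Tenure"),
])

# Causal Diagrams must be acyclic: nothing can (directly or indirectly) cause itself.
assert nx.is_directed_acyclic_graph(diagram)
print(sorted(diagram.edges))
```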

Many people say — what if my diagram is wrong? Won’t that make my results worse?

The reality is, things can’t get any worse than burying your head in the sand and not considering the causal relationships in the system at all. Any attempt to capture your assumptions explicitly is better than leaving it to the reader to guess what you thought at the time.

The process of sucking domain knowledge out of the heads of your experts is called Elicitation. Discuss and review the diagram. You may discover new insights just from this discussion! You don’t all have to agree on everything. There may be more than one candidate diagram — model them all and see how they affect results! There will often be some compromises to limit the system to observable variables (ones you have data for) and to simplify things to make analysis practical. It doesn’t have to be perfect. At least people can see and understand your assumptions.

The one exception to this process is if your system is extremely complex and mostly unknown (e.g. genetic networks), in which case you might want to explore Causal Discovery or continue to use only associative methods. However, even Causal Discovery greatly benefits from any prior knowledge you can provide.

Step 2: Analyse the causal diagram

There are a number of software tools which, given a Causal Diagram, will automatically figure out how to calculate the effect of one variable on another, including which variables to control, condition on, or provide as covariates (features for the ML people). This process is known as Identification.

Pearl’s do-calculus is complete: it can Identify any causal quantity which is identifiable at all, which means it will tell you how to compute the correct answer from your data whenever an answer exists.

Check out this video by Brady Neal on the topic:

You can also draw your Causal Diagram in Causal Wizard and it will perform Identification for you:

Identification of variable roles in a Causal Diagram using Causal Wizard

Under the hood, Causal Wizard uses the DoWhy library for Identification. If you’re a Python programmer, you can easily use DoWhy yourself.
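Here is a minimal sketch of that workflow with DoWhy (assuming `pip install dowhy`). The graph is the hair-colour/tenure example again, passed as a GML string; note that the accepted graph formats can vary between DoWhy versions. The DataFrame below is random placeholder data, so the estimated number itself is meaningless; the point is the identify-then-estimate flow:

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Random placeholder data with the same column names as the graph below.
rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "age": rng.uniform(25, 70, size=n),
    "record": rng.normal(size=n),
    "hair_white": rng.integers(0, 2, size=n),
    "tenure": rng.integers(0, 2, size=n),
})

# The Causal Diagram written as a GML string.
gml = """graph [directed 1
  node [id "age" label "age"] node [id "record" label "record"]
  node [id "hair_white" label "hair_white"] node [id "tenure" label "tenure"]
  edge [source "age" target "record"] edge [source "age" target "hair_white"]
  edge [source "age" target "tenure"] edge [source "record" target "tenure"]
]"""

model = CausalModel(data=df, treatment="hair_white", outcome="tenure", graph=gml)

# Identification: DoWhy works out which adjustment (if any) gives the causal effect.
estimand = model.identify_effect()
print(estimand)

# Estimation: plug the identified estimand into any suitable estimator,
# e.g. simple backdoor adjustment with linear regression.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")
print(estimate.value)
```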

Identification might provide your answers — either telling you the effect you’re interested in is zero, or telling you how to estimate it from data using any appropriate statistical model. You can go back to your favourite statistical or ML techniques here — perhaps you’ll simply use a regression model as you originally planned, but with the added insight of appropriately-chosen input features / covariates and a principled and documented basis on which these features were selected.

Which brings us to the final step:

Step 3: Include your Causal Diagram in your paper or report

In steps 1 and 2 you explored and captured important domain knowledge, which was used to appropriately define a set of covariates for your model. It’s important now that you document this knowledge. The Causal Diagram is now part of your assumptions.

This allows readers (other researchers, or simply other stakeholders / future project teams) to understand, build on or challenge your assumptions. They can always modify the Causal Diagram and re-analyze the system, perhaps with new data.

Be like this guy:

Original tweet

Where to learn more

There are many great resources about Causality.


Causal Wizard app

https://causalwizard.app. Blogging about Causal Inference case studies, and attacking the ARC challenge in 2023. Two things somewhat related through Machine Learning.