How to measure a causal relationship (part 2/2)

Causal Wizard app
11 min read · Jan 7, 2024


For part 1 (study design, setup and modelling), click here.

This two-part series aims to provide an introductory guide to causal relationship analysis, also known as causal effect estimation: the problem of measuring the effect of one variable on another. This article is suitable for data scientists, machine learning practitioners, and other scientists or professionals with some basic statistics knowledge.

If you’re not sure whether you need a causal analysis, you may want to read this article first, or watch this excellent video by Richard McElreath on the problem of “Causal Salad”.

Causal Salad — from Richard McElreath’s video. Researchers create a Causal Salad by throwing in a bunch of standard statistics tricks without ever doing any principled thinking about causes.

To complete an analysis of your own data, you have two options. Option 1 is to write your own analysis code using one of the popular open-source packages and follow the steps in this article; we would recommend DoWhy (PyWhy). Option 2 is to use free causal inference software such as Causal Wizard, which is really the only option if you are not comfortable writing code, but is convenient and quick even for coders.
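If you choose Option 1, a minimal DoWhy sketch looks something like the following. The file name, column names and causal graph are hypothetical placeholders for your own study; see part 1 for how to construct the causal diagram.

```python
import pandas as pd
from dowhy import CausalModel

# Hypothetical file, column names and causal graph; substitute your own study's.
df = pd.read_csv("study_data.csv")

model = CausalModel(
    data=df,
    treatment="treatment",
    outcome="outcome",
    graph="digraph { age -> treatment; age -> outcome; treatment -> outcome; }",
)
estimand = model.identify_effect()

# Fit both a regression estimator and a propensity-score estimator, as
# recommended in part 1, so their estimates can be cross-checked.
regression_estimate = model.estimate_effect(
    estimand, method_name="backdoor.linear_regression")
propensity_estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_weighting")
print(regression_estimate.value, propensity_estimate.value)
```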

Key questions

These articles explain how to use causal inference and causal ML to answer questions like:

  • Is there a causal relationship between these variables, or just correlation?
  • What is the direction of the causal relationship between these variables?
  • How strong and consistent is the causal effect?
  • Is the effect statistically significant?
  • Does the effect vary between sub-groups?

Part 2: Analysis and validation of a Causal Effect model

We continue where the previous article ended — assuming you have created at least a regression and a propensity-score based model of the desired causal effect. If this all sounds like confusing jargon to you, have a read of part 1!

Careful analysis and validation of causal effect modelling is essential because the conclusions may have enormous real-world impacts. Causal effect analysis is often the best way to determine policy or make decisions when there is no possibility of a controlled, prospective study. For obvious reasons this is particularly common in econometrics and epidemiology, which explains the relative popularity of Causal methods in these disciplines. However, Causal techniques are valid in any discipline.

Explore and verify model behaviour

Visualize outcome distribution by treatment group

Why: Check that results appear plausible given the data and expert knowledge of the system.

Blindly throwing data at statistical models is a bad idea. Even if you’ve followed the data exploration steps from part 1, you’ll want an overview of model behaviour to check that it isn’t crazy. Without examining the results graphically, it’s too easy to uncritically accept good performance metrics that actually represent nonsense results. These can occur accidentally due to, for example, mis-specification of the problem, data or model, or a misunderstanding of the results.

There are many ways you can visualise the model and its outputs, but our favourite for this study design is to present the distribution of outcomes for each group in the study (i.e. control and treated).

We can then check whether there is a difference in outcome between these groups, and how this difference compares to the estimated causal effect. Note that due to the influence of other variables, we don’t expect the treatment variable to explain all the differences between these groups; a valid, nonzero effect can even coexist with no discernible difference in outcome between groups. But putting the magnitude of the effect in the context of the differences between these populations and their distributions helps us to reject implausible, likely erroneous results.

Plot of outcome distribution by treatment group, for a continuous numerical outcome variable. This is a good way to visualise the difference in outcome between groups. The red lines attempt to depict the estimated effect of treatment by adding the estimated causal effect (in this case, ATE) to the mean of the control samples. As we can see, this does approximately equal the mean of the treated samples.
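A minimal sketch of this plot, assuming a pandas DataFrame `df` with a binary `treatment` column and a numeric `outcome` column, plus an ATE estimate already in hand (all names are hypothetical):

```python
import matplotlib.pyplot as plt

# Assumed: df with binary 'treatment' and numeric 'outcome' columns.
control = df.loc[df["treatment"] == 0, "outcome"]
treated = df.loc[df["treatment"] == 1, "outcome"]
ate = 2.5  # placeholder: substitute your estimated ATE

plt.hist(control, bins=30, alpha=0.5, density=True, label="control")
plt.hist(treated, bins=30, alpha=0.5, density=True, label="treated")

# If the estimate is plausible, control mean + ATE should land near the treated mean.
plt.axvline(control.mean(), color="k", linestyle=":", label="control mean")
plt.axvline(control.mean() + ate, color="r", linestyle="--", label="control mean + ATE")
plt.axvline(treated.mean(), color="g", linestyle=":", label="treated mean")
plt.xlabel("outcome")
plt.ylabel("density")
plt.legend()
plt.show()
```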

Feature importance

Why: Understand and validate the contribution of individual variables.

Your subject-matter experts (SMEs) will often request an analysis of the modelled contribution of each variable provided as an input feature to the model. This can easily be retrieved from regression models, where these values are simply the model coefficients. This is a major motivation for including regression among your selected models.

The effect of variables can also be captured from more complex ML models using feature-explainer tools, but additional assumptions and analysis are required for these, which raises questions about the validity of the insights. So there’s a tradeoff between model performance, complexity and interpretability.

In most cases, Causal studies use tabular data with relatively few variables; simpler models are usually indicative of the performance of more complex models with the added benefit of increased interpretability.

Using e.g. a linear regression model, we can inspect both the sign and magnitude of the coefficients. The sign tells us the direction of effect and the magnitude reflects the degree to which the feature influences the outcome. Note that input features must be standardised to enable comparison of relative magnitudes.

Check with your SMEs that the magnitudes and directions of effects are plausible for all variables. If not, dig deeper to understand why not. In some cases, the answer is that the effect of a variable is nonlinear or conditional on other variables and cannot be interpreted easily. However, variables can often be interpreted successfully as linear or monotonic effects.

Coefficients of variables used as input features for a linear regression model. Identification on the Causal Diagram is used to select features for the model. After fitting/training, we can see to what extent each variable influences the outcome and the direction of effect. One of the variables will be the treatment. A coefficient is generated for each distinct value of categorical input features.
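As a hedged sketch of this step with scikit-learn (the feature names are hypothetical; in DoWhy, the same coefficients can be read from the fitted linear regression estimator):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical features: the treatment plus covariates selected by
# identification on the causal diagram.
X = df[["treatment", "age", "income"]]
y = df["outcome"]

# Standardise features so coefficient magnitudes are comparable.
scaler = StandardScaler().fit(X)
reg = LinearRegression().fit(scaler.transform(X), y)

coefficients = pd.Series(reg.coef_, index=X.columns).sort_values()
print(coefficients)  # sign = direction of effect; magnitude = relative influence
```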

Counterfactual analysis

Why: Answers “what would have happened, if” questions. Allows validation of model behaviour given changed treatment status.

Causal models differ from purely predictive ones because they should be able to generate alternative outcomes given changed inputs — for example, changing the value of the treatment variable. This allows you to generate various counterfactual scenarios such as:

  • Outcomes if all controls were treated
  • Outcomes if all treated samples were controls
  • Outcomes if all samples were treated
  • Etc.

Note that counterfactual analysis can be applied to different subsets of the samples. These results often directly inform policy and evidence-based decision-making because they can indicate the population-level benefit or cost of alternative strategies.

In addition to that utility, counterfactual scenarios also enable validation of model behaviour given varying treatment values: we should see outcomes and outcome changes that reflect the estimated effect, and overall consistency of this effect. If we don’t observe this, ask yourself why not. There must be something different about the values of covariates (other features) between samples from the treated and control groups (a concern explored further below, in the sections on positivity and covariate balance).

Note that not all causal models are able to predict outputs for individual samples (and hence individual treatment effects). Regression models can; propensity methods generally cannot.
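A minimal counterfactual sketch using the regression model from the feature-importance example (all names are hypothetical):

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical setup, as in the feature-importance sketch above.
X = df[["treatment", "age", "income"]]
y = df["outcome"]
scaler = StandardScaler().fit(X)
reg = LinearRegression().fit(scaler.transform(X), y)

# Counterfactual scenarios: force every sample to be treated, then untreated.
X_treated, X_control = X.copy(), X.copy()
X_treated["treatment"] = 1
X_control["treatment"] = 0
y_if_treated = reg.predict(scaler.transform(X_treated))
y_if_control = reg.predict(scaler.transform(X_control))

# The mean difference should be consistent with the estimated ATE.
print("mean effect of treating everyone:", (y_if_treated - y_if_control).mean())
```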

Measure statistical significance and robustness

Why: Understand the probability that the estimated effect is due to chance characteristics of the sampled data, rather than a real effect. Estimate the stability of the estimated effect, given different samples of data.

Refutation is a principled, statistical method for accepting and rejecting experimental hypotheses, i.e. for reaching conclusions about the significance of your results.

For those coming from a non-causal ML background, the qualities we are seeking to validate in a causal effect analysis are a little different from what you may be used to. For example, you are probably expecting to perform generalization validation rather than refutation tests; generalization is in fact covered in the last section of this article.

We still care about model generalization to the target population and other key performance qualities, but whereas the focus of most ML models is predictive, Causal ML studies often aim to be descriptive: They seek to quantify an effect or the behaviour of a system at a population level, rather than to produce accurate predictions for individual samples. This means we can use a wider range of techniques for both modelling and validation.

At the start of part 1 we recommended two software options for causal inference: DoWhy (Python library) and Causal Wizard (free web app). Both provide several refutation methods which help to validate a causal effect estimate, including:

  • Bootstrap outcome permutation test: Randomly shuffles sample outcomes and retrains models many times to observe how often an effect as large as the original estimate occurs given a random relationship between treatment and outcome. This yields a p-value for the hypothesis that the original estimated effect differs significantly from the distribution of null effects obtained from the permuted outcomes.
  • Placebo treatment permutation test: Randomly shuffles the treatment values, destroying any measurable treatment effect; we then verify that the estimated effect given this data is zero and cannot leak through via any other variables in the data.

Note that these tests are all performed on the entire original dataset. Follow this link for additional refutation tests.
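In DoWhy, refutations of this kind are a one-liner against the fitted model. Note the refuters below don’t map one-to-one onto the two tests above (the placebo treatment refuter implements the second; the bootstrap refuter probes stability under resampling rather than permuting outcomes), and `model`, `estimand` and `regression_estimate` are assumed from the DoWhy sketch near the start of this article:

```python
# Placebo treatment test: permute the treatment column and re-estimate;
# the placebo effect should be near zero, and a p-value is reported.
placebo = model.refute_estimate(
    estimand, regression_estimate,
    method_name="placebo_treatment_refuter",
    placebo_type="permute",
    num_simulations=100,
)
print(placebo)

# Bootstrap test: re-estimate the effect on resampled data to gauge stability.
bootstrap = model.refute_estimate(
    estimand, regression_estimate,
    method_name="bootstrap_refuter",
    num_simulations=100,
)
print(bootstrap)
```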

Validate key assumptions

Confirm positivity (overlap)

Why: Checks that data and models do not violate key assumptions and hence are unlikely to yield false insights.

The Positivity assumption is one of the most important assumptions which must hold for your causal analysis to be valid. Its essence is that both treated and control groups must have at least some representative samples for all possible combinations of input features, including the treatment variable. If this is not true, the model will be forced to extrapolate unpredictably to the unseen combinations of these variables.

One way to test for violation of the Positivity assumption is to analyse the distribution of propensity scores obtained from your propensity-based models. The intuition behind this is relatively easy to grasp but too long to repeat here, so follow the link above if you want to know more.

We hope to see that:

a) Few or no samples have extreme propensity;

b) Both treated and control groups are represented within all the range of scores observed.

In the figure below (reproduced from Causal Wizard results) we can see a reasonably good propensity score distribution. Note that there are some samples from both treated and control groups for almost all propensity scores which are observed.

There are a few samples which have very low propensity scores; one optional remedy would be to remove these from the dataset. The chosen propensity score thresholds for removal are somewhat arbitrary, but should be close to 0 or 1.

A propensity score distribution plot, used to check for Positivity violations.
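A sketch of this check, reusing the hypothetical DataFrame from earlier (a simple logistic regression stands in for whatever propensity model you fitted):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# Propensity model: P(treated | covariates). Covariate names are hypothetical.
covariates = df[["age", "income"]]
treated = (df["treatment"] == 1).to_numpy()
ps = LogisticRegression(max_iter=1000).fit(covariates, treated).predict_proba(covariates)[:, 1]

# Both groups should cover a similar range of scores, with few samples near 0 or 1.
plt.hist(ps[~treated], bins=30, alpha=0.5, density=True, label="control")
plt.hist(ps[treated], bins=30, alpha=0.5, density=True, label="treated")
plt.xlabel("propensity score")
plt.ylabel("density")
plt.legend()
plt.show()
```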

Check covariate balance

Why: Checks that data and models do not violate key assumptions and hence are unlikely to yield false insights.

Propensity methods for estimating causal effects are appealing because they make few assumptions, but they rely on the treated and control groups having similar distributions and combinations (joint distributions) of all the other input features (covariates). If this “balance” between treated and control groups is not achieved, the resulting insights may be untrustworthy. We can check this property with a Covariate Balance plot, also known as a Love plot.

Although the figure below can only be generated using a propensity-based model, the results have implications for your data and hence extend to any modelling technique.

The Covariate Balance plot shown below has a pair of dots for each input feature. The dots represent the standardized mean difference (SMD) between treated and control groups, before and after “balancing” using propensity score weighting. The hope is that weighting produces SMD values close to zero. We can define a somewhat arbitrary threshold beyond which we consider a feature “imbalanced” between treated and control groups (indicated by the dashed red line in the plot):

Covariate Balance plot. We calculate the standardized mean difference between treated and control groups before and after weighting by propensity scores; we aim to have differences less than a (somewhat arbitrary) threshold, indicated by the red dashed line. If so, we consider the covariates “balanced” between treated and control groups.
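The SMD values behind such a plot can be computed directly. A minimal sketch, reusing the hypothetical propensity scores `ps` from the previous example (the 0.1 threshold is a common convention, not a rule):

```python
import numpy as np

def smd(x1, x0, w1=None, w0=None):
    """Standardized mean difference between treated (x1) and control (x0) values,
    optionally weighted, e.g. by inverse propensity scores."""
    m1, m0 = np.average(x1, weights=w1), np.average(x0, weights=w0)
    pooled_sd = np.sqrt((np.var(x1) + np.var(x0)) / 2)
    return (m1 - m0) / pooled_sd

treated = (df["treatment"] == 1).to_numpy()
weights = np.where(treated, 1.0 / ps, 1.0 / (1.0 - ps))  # inverse propensity weights

for col in ["age", "income"]:  # hypothetical covariates
    x = df[col].to_numpy()
    before = smd(x[treated], x[~treated])
    after = smd(x[treated], x[~treated], weights[treated], weights[~treated])
    status = "balanced" if abs(after) < 0.1 else "imbalanced"
    print(f"{col}: SMD before={before:+.3f}, after={after:+.3f} ({status})")
```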

Measure model generalization performance

Why: We want to estimate causal model performance on the true population, for which data is not available, rather than our limited sample on which models are fitted or trained.

For readers who are familiar with conventional machine learning, this is the section you’ve been waiting for. Causal models which can generate individual treatment effects can be used to predict outcomes on unseen data, beyond the original training set used to create the model. This includes regression models, but usually not propensity score methods.

The principle is always the same: We want to estimate model performance on the larger real-world population from which our data was sampled. The details depend on specifics of the study design including whether the outcome variable is categorical or numerical. Model performance is important because it gives us an objective way to measure how good the model is — how well it represents the behaviour of the system it models!

The following results are all generated by using a causal model to predict outcomes on a set of data NOT provided during model training or fitting.

Categorical Outcome. Given a categorical outcome, it is important to provide a summary table of all predicted vs actual outcomes, known as a confusion matrix. The figure below depicts a confusion matrix for a binary (true/false) categorical outcome:

Binary categorical outcome — confusion matrix.

The table shown above also has marginal sums. These are useful to check that data hasn’t been disproportionately lost or modified during processing, which happens more often than you’d expect. You need to be sure your results reflect the data you intended to analyse, not just a subset of it!

Moving on from the confusion matrix, we also need to measure some summary performance metrics, which objectively reduce model performance to a single number, such as accuracy. Note that it’s especially important to use additional metrics such as F-Score, Precision and Recall when outcomes are imbalanced:

Typical categorical summary performance metrics, including F-Score, Accuracy, Precision and Recall.
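With scikit-learn, both the table and the metrics take a few lines (here `y_test` and `y_pred` are assumed to be the actual and predicted outcomes on a held-out test set):

```python
import pandas as pd
from sklearn.metrics import classification_report

# Confusion matrix with marginal sums, via crosstab's margins option.
print(pd.crosstab(pd.Series(y_test, name="actual"),
                  pd.Series(y_pred, name="predicted"),
                  margins=True))

# Precision, recall, F-score and accuracy in one report.
print(classification_report(y_test, y_pred))
```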

Continuous Outcome. When the outcome is continuous, different performance metrics are used. Typically, they include R-squared, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). Note that the latter two are expressed in the original units of the outcome, making interpretation easier.

It is also useful to generate a scatter or density plot of the actual vs predicted values, as shown below. A trend line can be fitted to these points; a good model should have a strong positive correlation. Scatter or density plots will show if model predictive abilities break down in certain output ranges, typically due to insufficient sampling in these ranges or models with high bias.

Scatter plot of predicted vs actual continuous outcomes on a held-out test set.
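A corresponding sketch for the continuous case (again assuming held-out `y_test` and `y_pred`):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("R^2 :", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))  # original outcome units
print("MAE :", mean_absolute_error(y_test, y_pred))          # original outcome units

# Actual vs predicted: a good model hugs the diagonal across the whole range.
plt.scatter(y_test, y_pred, alpha=0.3)
lims = [min(np.min(y_test), np.min(y_pred)), max(np.max(y_test), np.max(y_pred))]
plt.plot(lims, lims, "r--", label="ideal: predicted = actual")
plt.xlabel("actual outcome")
plt.ylabel("predicted outcome")
plt.legend()
plt.show()
```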

Summary

Studies are only as good as the data and the assumptions made while conducting and analysing them. We believe it’s essential to recognise and verify your assumptions. One of the things Causal Inference advocates like to shout about is that all studies make assumptions about causality; it’s just that if you don’t make them explicit, you have no way of knowing that you’re making them! But that’s a topic for another time…

Hopefully this pair of articles has given you an idea of how, and most importantly why, you should use these and other techniques to critically examine any causal relationship or causal effect estimation study.

We aimed to set expectations for what you would like to see in a legitimate result, and to help you reject erroneous or misleading results. But remember that these techniques are simply tools to help you reach your own conclusions in conjunction with expert knowledge of the domain or system you are studying.
