How to measure a causal relationship (part 1/2)

8 min read · Jan 7, 2024

This two-part series is an introductory guide to causal relationship analysis, also known as causal effect estimation: measuring the effect of one variable on another. It is suitable for data scientists, machine learning practitioners, and other scientists or professionals with some basic statistics knowledge.

If you’re not sure whether you need a causal analysis, you may want to read this article first, or watch this excellent video by Richard McElreath on the problem of “Causal Salad”.

Causal Salad — from Richard McElreath’s video. Researchers create a Causal Salad by throwing in a bunch of standard statistics tricks without ever doing any principled thinking about causes.

To analyse your own data, you have two options. Option 1 is to write your own analysis code using one of the popular open-source packages and follow the steps in this article; we recommend DoWhy (PyWhy). Option 2 is to use free causal inference software such as Causal Wizard. Option 2 is really the only choice if you are not comfortable writing code, but it is convenient and quick even for coders.

Key questions

These articles explain how to use causal inference and causal ML to answer questions like:

Is there a causal relationship between these variables, or just correlation?
What is the direction of the causal relationship between these variables?
How strong and consistent is the causal effect?
Is the effect statistically significant?
Does the effect vary between sub-groups?

Part 1: Study design and preparation

This article will lead you through the preparation and setup of a Causal analysis based on existing, observational data. Part 2 will explain how to analyse, interpret and validate your results. Feel free to skip ahead.

Background knowledge

This article assumes some basic knowledge of Causality terminology. In particular:

The concept of a statistical variable. In simple terms this might be a column or row in your data. It’s a property of one of your samples. Your samples are the separate units or things you are studying.
The difference between a causal relationship and association or correlation.

Data

Since causal effect analysis is statistical, you’ll need some data from your system. This can be historical, observational data — you don’t need to do a controlled, prospective experiment.

Exploratory data analysis

Objectives: Understand data qualities, and spot obvious problems. Visually identify possible relationships between variables.

The first step in any machine learning or causal ML analysis should be looking at the data. In fact, we have a guide to this process in our ML Project Builder tool here (click the “Data” tab). Looking at data allows you to spot data quality issues (such as null values) which could have a big impact on any analysis.
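As a minimal sketch of this first check, assuming the data has been loaded as a list of row dictionaries (the column names and values below are invented for illustration), the following counts missing values per column:

```python
# Hypothetical toy data: each row is one sample, each key is one variable.
rows = [
    {"age": 34, "dose": 1.5, "recovered": 1},
    {"age": None, "dose": 0.0, "recovered": 0},
    {"age": 51, "dose": None, "recovered": 1},
    {"age": 29, "dose": 2.0, "recovered": None},
    {"age": None, "dose": 0.5, "recovered": 0},
]

def count_missing(rows):
    """Return {column: number of rows where the value is None or absent}."""
    columns = sorted({key for row in rows for key in row})
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}

missing = count_missing(rows)
for column, n_missing in missing.items():
    print(f"{column}: {n_missing} of {len(rows)} values missing")
```

If your data is in a pandas DataFrame, `df.isna().sum()` gives the same per-column count.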

To begin to understand each variable, you should also look at the univariate distribution of each variable (i.e. make a histogram of each one) and the bivariate distribution of variables you suspect may be associated, correlated or have a causal relationship. At this stage, looking at data is a bit like being a policeman on watch — looking for anything unusual or unexpected, without a specific aim in mind. That’s why they call it exploratory. You don’t know what you’ll find.

You’ll look at the data again later, often using the same techniques, but with specific questions in mind.

Use a histogram plot to examine the distribution of each variable. In this case, two variables are displayed at the same time. Example plot from Causal Wizard app.

A histogram plot is useful to see whether the data contains the expected range of values; it will be obvious if many values are bad, missing or don’t have the expected distribution. Histograms are also useful for categorical or text data, where each bar shows the count for one category.
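If no plotting tool is to hand, the same check can be approximated in a few lines. This is a sketch only: the `histogram` helper, the bin width and the example values are assumptions made for illustration.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = Counter(int(v // bin_width) for v in values if v is not None)
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical ages, including one implausible value (-1) and one missing value.
ages = [23, 27, 31, 35, 36, 38, 41, 44, 52, -1, None]

age_hist = histogram(ages, bin_width=10)
for bin_start, count in age_hist.items():
    print(f"{bin_start:>4} to {bin_start + 10:<4} {'#' * count}")

# Categorical variables can be summarised the same way, one bar per category.
smoker_counts = Counter(["yes", "no", "no", "unknown", "no", "yes"])
for category, count in smoker_counts.most_common():
    print(f"{category:<8} {'#' * count}")
```

The implausible age of -1 shows up as a bar of its own below zero, which is exactly the kind of problem this step is meant to reveal.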

Use a scatter plot to check for any association or correlation between variables. Correlation can also be measured numerically, but it’s always good to eyeball the joint distribution. Example plot from Causal Wizard app.

A scatter plot is usually suitable for examining a bivariate distribution, although a density plot may be needed where many points overlap. Use the guide below to interpret scatter plots of two variables and judge whether the variables are associated, correlated or independent (not associated). If you suspect a causal relationship between two variables, you would also expect to see some association or correlation between them, although it may be visually subtle and yet statistically significant:

How to interpret a scatter plot of two variables to decide whether they are associated, correlated or independent.
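To complement the visual check, correlation can be measured numerically. The sketch below computes the Pearson correlation coefficient by hand; the variable names and values are invented for illustration. Pearson's coefficient only captures linear association, so it does not replace looking at the plot.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements for six people.
hours_exercise = [0, 1, 2, 3, 4, 5]
resting_heart_rate = [80, 78, 75, 74, 70, 69]   # falls as exercise rises
shoe_size = [42, 38, 44, 38, 44, 40]            # unrelated to exercise

r_related = pearson_r(hours_exercise, resting_heart_rate)
r_unrelated = pearson_r(hours_exercise, shoe_size)
print(f"exercise vs heart rate: r = {r_related:.2f}")
print(f"exercise vs shoe size:  r = {r_unrelated:.2f}")
```

A coefficient near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates little or no linear relationship.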

Modelling the system

Objectives: This section will describe how to define your causal experiment and capture existing knowledge about the system. It will also produce a methodology for Causal ML effect modelling using your data.

Identify Cause and Effect variables

To pose a Causal effect analysis, you’ll need to define which variable is the proposed Cause (aka Treatment) and which is the Effect (aka Outcome). You will be exploring the effect of the Treatment on the Outcome. We prefer the terms Treatment and Outcome and will use them for the rest of this article.

Define groups for comparison

A common (and recommended) experimental design is a comparative, binary treatment study in which participants or samples are divided into two groups, depending on the value of the Treatment variable:

Treated: Was exposed to something / something was changed
Control: Was not exposed / not changed / default condition

In a binary treatment comparative study design, divide your study population or samples into two groups, depending on the value of the Treatment variable.

The benefit of this study design is that results are easy to interpret and analyse. The experiment will involve comparing these groups to measure the effect of the change between the groups. If your treatment variable is continuous, the samples can still be divided into two groups by thresholding at a specific value; samples with value < threshold become controls, and the others are considered treated. You will need to use your knowledge (or a histogram of the treatment variable) to identify a suitable threshold. Note that a continuous treatment can be analysed directly, but that approach is not fully described in this article.
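The thresholding rule can be sketched as follows. The `assign_groups` helper, the example values and the cut-off of 2.5 are assumptions chosen for illustration, not recommendations.

```python
def assign_groups(treatment_values, threshold):
    """Label each sample: value < threshold is 'control', otherwise 'treated'."""
    return ["control" if value < threshold else "treated" for value in treatment_values]

# Hypothetical continuous treatment: weekly exercise hours per participant.
exercise_hours = [0.0, 0.5, 1.0, 2.5, 3.0, 4.5, 6.0, 0.0]
THRESHOLD = 2.5  # assumed cut-off, chosen for illustration only

groups = assign_groups(exercise_hours, THRESHOLD)
n_treated = groups.count("treated")
n_control = groups.count("control")
print(f"treated: {n_treated}, control: {n_control}")
```

It is worth checking the group sizes after thresholding, because a cut-off that leaves very few samples in one group makes the comparison unreliable.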

If you have more than one treatment option to analyse, you can still study them one at a time with this study design. You might want to filter different subsets of samples.

Creating a Causal Diagram

Why? A Causal Diagram encodes prior domain knowledge of the system and the structure of the diagram is a necessary assumption for all further analysis.

A Causal Diagram should capture all direct relationships between variables, including the Treatment and Outcome variables. It allows the role of other variables to be captured and accounted for in the analysis. These other variables may affect the treatment, or the outcome, or both; they might also mediate or modify the effect of treatment on outcome.

Causal Diagrams are one way to capture causal relationships between variables.
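As a rough sketch of how a diagram can be written down and queried, the example below stores a made-up diagram as a dictionary of arrows and sorts the other variables into common causes of treatment and outcome, and mediators. The variable names and the `classify` helper are hypothetical. This is a simplified classification for illustration, not an Identification procedure.

```python
# Hypothetical causal diagram: each key causes every variable in its list.
diagram = {
    "age": ["exercise", "heart_health"],   # affects treatment and outcome
    "exercise": ["fitness"],               # the Treatment
    "fitness": ["heart_health"],           # lies between treatment and outcome
    "diet": ["heart_health"],              # affects the outcome only
    "heart_health": [],                    # the Outcome
}

def descendants(graph, node):
    """All variables reachable from `node` by following arrows."""
    found, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(graph.get(current, []))
    return found

def classify(graph, treatment, outcome):
    """Split the other variables into common causes and mediators."""
    others = [v for v in graph if v not in (treatment, outcome)]
    # Remove the treatment's outgoing arrows, so that a path from a variable
    # to the outcome in this graph cannot pass through the treatment.
    without_treatment = {k: ([] if k == treatment else v) for k, v in graph.items()}
    common_causes = [
        v for v in others
        if treatment in descendants(graph, v)
        and outcome in descendants(without_treatment, v)
    ]
    mediators = [
        v for v in others
        if v in descendants(graph, treatment) and outcome in descendants(graph, v)
    ]
    return common_causes, mediators

common_causes, mediators = classify(diagram, "exercise", "heart_health")
print("common causes:", common_causes)
print("mediators:", mediators)
```

In this made-up diagram, `age` is a common cause of treatment and outcome, `fitness` mediates the effect, and `diet` affects only the outcome.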

If you already have a causal model of the relationships between variables, you can proceed straight to causal inference by drawing and using the model in a suitable software tool. In general, you should try to simplify the model as much as possible without excluding key variables.

However, if you do not have a causal model, you can create one using expert domain knowledge (a process known as elicitation) or by using machine learning and statistics to generate an appropriate model (a process known as Causal Discovery).

Elicitation of expert knowledge is usually the best approach when the system is relatively well understood, even if the relationships between variables do not have a precise mathematical or statistical definition. It is preferred because the resulting models can be understood in the context of existing knowledge.

Depending on the data and number of variables, discovery may be unreliable or uncertain, and often generates overly complex models. However, it may be the only approach possible. If you can restrict the causal model using expert knowledge, this will help to generate better models.

How to choose whether you need Causal Discovery, Causal Inference or Elicitation. If you already have a causal model of the system, proceed directly to Causal Inference. Otherwise, decide whether to create a model by Causal Discovery (learning the model from the data) and / or Elicitation (asking experts to define the model for you). Our experience is that Discovery is very difficult, and benefits greatly from Elicitation where possible. Some software packages allow you to constrain the model via elicitation but otherwise learn it from data.

Analyse the system (Identification)

Identification allows the role of confounding variables to be accounted for and produces an Estimand describing how to estimate the effect being investigated.

Identification is a process which uses your study design and causal diagram to produce an Estimand. An Estimand describes how the desired effect can be calculated, including which variables to use as inputs to the model, and how to use them. The Estimand will include an appropriate set of confounding variables.

The desired effect is usually the “Average Treatment Effect” (ATE) but several other effects are widely used. The right effect to use depends on the precise question you want to answer. The ATE is the difference in mean (average) outcome between treating all samples and treating none. Once confounding has been accounted for, it can be estimated by comparing the treated and control groups, so it is a common-sense definition of the effect of the treatment on the outcome.
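To illustrate why the groups cannot always be compared directly, here is a minimal sketch with a single binary confounder `z`. The data is invented and noise-free, and the adjustment shown (averaging the within-stratum differences, weighted by how common each stratum is) is the standard backdoor adjustment formula. It is only one possible form an Estimand can take.

```python
from statistics import mean

# Hypothetical, noise-free toy data: (confounder z, treatment t, outcome y),
# generated as y = 2*t + 3*z, so the true treatment effect is 2.
# Samples with z = 1 are more likely to be treated, which confounds the comparison.
samples = (
    [(0, 1, 2)] * 2 + [(0, 0, 0)] * 6 +
    [(1, 1, 5)] * 6 + [(1, 0, 3)] * 2
)

def naive_difference(samples):
    """Difference in mean outcome between treated and control, ignoring z."""
    treated = [y for _, t, y in samples if t == 1]
    control = [y for _, t, y in samples if t == 0]
    return mean(treated) - mean(control)

def adjusted_ate(samples):
    """Backdoor adjustment: average the within-stratum differences,
    weighted by how common each value of z is."""
    total = 0.0
    for z in sorted({z for z, _, _ in samples}):
        stratum = [(t, y) for s_z, t, y in samples if s_z == z]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        total += (len(stratum) / len(samples)) * (mean(treated) - mean(control))
    return total

naive = naive_difference(samples)
adjusted = adjusted_ate(samples)
print(f"naive difference: {naive}, adjusted ATE: {adjusted}")
```

On this toy data the naive difference is 3.5, whereas the adjusted estimate recovers the true effect of 2, because the confounder inflates the outcomes of the treated group.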

Identification is not always successful, in which case the desired effect cannot be estimated in the system as described. In some cases, the Estimand may indicate that the desired causal effect is zero. This is usually obvious from the Causal Diagram — there will be no directed path from Treatment to Outcome variable.
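The directed-path check can be sketched in a few lines. The diagram and variable names below are invented for illustration, and `has_directed_path` is a hypothetical helper rather than part of any particular library.

```python
def has_directed_path(graph, start, goal):
    """True if `goal` can be reached from `start` by following arrows."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

# Hypothetical diagram in which season drives both variables,
# but ice cream sales have no arrow leading towards sunburn.
diagram = {
    "season": ["ice_cream_sales", "sunburn"],
    "ice_cream_sales": [],
    "sunburn": [],
}

effect_possible = has_directed_path(diagram, "ice_cream_sales", "sunburn")
print("directed path from treatment to outcome:", effect_possible)
```

Here the two variables will be correlated in the data because they share a cause, yet the diagram implies that the causal effect of one on the other is zero.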

Whereas all the other steps can easily be done with a pen and paper, Identification is easier with software. The do-Calculus is the most popular technique, though some variants exist.

The Identification process uses your description of the system and experiment design to produce an Estimand, which describes how to estimate the desired effect using your data.

Choose and create a model

At this stage we have:

Examined our data and identified any relevant association or correlation between variables
Obtained expert domain knowledge to create a causal diagram of the system
Created a specific study design, including treatment and outcome variables, and separated the data into two groups — treated and controls. We have selected the causal effect we wish to measure.
Used the above to Identify the desired effect, obtaining an Estimand.

The next step is to use the Estimand to create one or more statistical models, fitting or training them with your data. As mentioned at the start, you can do this with your own code, using Causal ML libraries, or using a tool such as Causal Wizard.

We recommend making multiple models, using different approaches to gain more comprehensive insights into the system and data. In particular, we recommend combining regression and propensity score methods because these approach the problem with quite different, but minimal, assumptions. If multiple, different modelling approaches all agree, we can have more confidence in the result. In the next part of this article, we will see how different models enable different types of analysis to characterise and validate a causal relationship between variables.
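As a toy illustration of comparing the two families of methods, the sketch below estimates the same effect with a regression adjustment and with inverse propensity weighting (one of several propensity score methods). The data is invented and noise-free, with one binary confounder `z`. Real analyses involve noise and more covariates, and propensities are usually fitted with a model such as logistic regression, so treat this as a sketch under simplifying assumptions.

```python
from statistics import mean

# Hypothetical, noise-free toy data: (confounder z, treatment t, outcome y)
# with y = 2*t + 3*z, so both estimates should be close to 2.
samples = (
    [(0, 1, 2)] * 2 + [(0, 0, 0)] * 6 +
    [(1, 1, 5)] * 6 + [(1, 0, 3)] * 2
)
strata = sorted({z for z, _, _ in samples})

def regression_estimate(samples):
    """Coefficient on t in the least-squares fit y ~ t + z (z categorical).

    Computed by centring t and y within each stratum of z, which gives the
    same coefficient as the full regression (Frisch-Waugh-Lovell theorem).
    """
    numerator = denominator = 0.0
    for z in strata:
        rows = [(t, y) for s_z, t, y in samples if s_z == z]
        t_bar, y_bar = mean(t for t, _ in rows), mean(y for _, y in rows)
        numerator += sum((t - t_bar) * (y - y_bar) for t, y in rows)
        denominator += sum((t - t_bar) ** 2 for t, _ in rows)
    return numerator / denominator

def ipw_estimate(samples):
    """Inverse propensity weighting with propensity = P(t = 1 | z)."""
    propensity = {
        z: mean(t for s_z, t, _ in samples if s_z == z) for z in strata
    }
    treated_term = mean(t * y / propensity[z] for z, t, y in samples)
    control_term = mean((1 - t) * y / (1 - propensity[z]) for z, t, y in samples)
    return treated_term - control_term

print(f"regression estimate: {regression_estimate(samples):.3f}")
print(f"IPW estimate:        {ipw_estimate(samples):.3f}")
```

Agreement between the two estimates is reassuring here because the methods rely on different modelling assumptions: the regression models the outcome, while the weighting models the treatment assignment.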

Part 2: Analysis and validation of a Causal Effect model

For part 2, click here.