Causal Inference with CausalPy

Laurin Brechter
5 min read · Jul 20, 2023


This post provides a short introduction to causal inference with a practical example taken from the book Causal Inference for The Brave and True. Note that I have simply taken the example from the book and implemented it in CausalPy to give it a bit of a Bayesian touch. The original paper that investigated the issue can be found here.

Causal inference is the process of estimating a causal effect from observational data. In this context we usually have a treatment (e.g. a medical intervention) that is applied to some group or individual. We can then observe a metric from that individual before and after the treatment. The value post treatment is also referred to as the outcome.

The fundamental problem is that for any given individual we can only ever observe one outcome; the other remains hidden from us. It is the so-called counterfactual (i.e. counter-to-the-fact). As an example, we can either treat a patient or not, but we only observe the outcome under one of the two settings. The two possible outcomes are known as potential outcomes, and the one we don’t observe is the counterfactual. We can, however, estimate the causal effect if we have a control group that is not treated but very similar to the treated group. We must ensure that there is no systematic difference between the groups pre-treatment.
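To make this concrete, here is a minimal sketch (with made-up numbers, not real data) of how a comparable control group lets us estimate an average treatment effect as a simple difference in means:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcomes for a control group and a comparable treated group;
# by construction, the treatment lowers the outcome by about 1.5
control = rng.normal(loc=10.0, scale=1.0, size=500)
treated = rng.normal(loc=8.5, scale=1.0, size=500)

# Because the groups are comparable, the difference in means estimates
# the average treatment effect (ATE)
ate_estimate = treated.mean() - control.mean()
print(round(ate_estimate, 2))
```

This only works because the two groups are alike pre-treatment; the rest of the post is about what to do when no such control group exists.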

Synthetic Control

In many situations, there is no control group to which we can compare the treated subjects. Imagine for example that we show ads to a certain percentage of our users. Before and after the ad exposure (i.e. treatment) we record the traffic to our website. Following the definition of a causal effect, we need to know what would have happened had the users not been exposed to that ad. This is often possible in the case of ads, where we can expose a percentage of the users and keep the rest as our control group. In the following example, this is not possible.

In this example, we want to know the effect of a policy that restricted smoking on cigarette sales in California. Notice that in this case, there is no natural control group (we don’t have a second California in the world). This poses a problem, as it can be hard to verify whether the policy actually had an effect on sales or whether they would have decreased anyway.

This is exactly where Synthetic Control enters the stage. The idea is the following: Since we do not have a natural control group, we will try to construct one that is as similar as possible to our treatment group. In our case, we can use other states of the US that are similar to the pre-treatment California. We can thereby construct a ‘synthetic California’ with a mixture of the other American states.
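The core idea can be sketched in a few lines of NumPy with simulated data (not the actual smoking dataset): the synthetic control is a weighted sum of donor states, with weights chosen so that it matches the treated unit in the pre-treatment period. For illustration I use plain least squares here; CausalPy’s Bayesian model is more sophisticated, but the intuition is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-treatment outcomes: 20 periods for 5 donor states
donors = rng.normal(100, 10, size=(20, 5))

# The treated unit is, by construction, a mixture of the donors plus noise
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = donors @ true_w + rng.normal(0, 0.5, size=20)

# Least-squares weights: the synthetic control is a weighted sum of donors
w, *_ = np.linalg.lstsq(donors, treated, rcond=None)
synthetic = donors @ w
print(np.round(w, 2))
```

After fitting the weights on pre-treatment data, the synthetic unit is extended into the post-treatment period, and the gap between it and the actual treated unit is the estimated causal effect.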

import causalpy as cp
import pandas as pd

cigar = pd.read_csv("data/smoking.csv").drop(
    columns=["lnincome", "beer", "age15to24", "california", "after_treatment"]
)
Loaded Data

Above, we import the CausalPy Python package, load the data and drop some columns that we don’t need. We get 31 years’ worth of data from 39 different states. The intervention (start of the policy) took place in 1989. California is state no. 3. Before we can pass the data to CausalPy, we have to do some reshaping/preprocessing. Most importantly, the data needs to be in a wide instead of a long format.
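To illustrate what the long-to-wide reshape does, here is a toy example with made-up numbers:

```python
import pandas as pd

# Toy long-format data: one row per (state, year) combination
long = pd.DataFrame({
    "state": [1, 1, 3, 3],
    "year": [1988, 1989, 1988, 1989],
    "cigsale": [110.0, 105.0, 90.0, 80.0],
})

# Wide format: one row per year, one column per state
wide = long.pivot(index="year", columns="state", values="cigsale")
print(wide.shape)  # (2, 2)
```

Each state becomes its own column, which is exactly the shape CausalPy expects for the donor pool.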

piv = cigar.pivot(index="year", columns="state", values="cigsale")
treatment_time = 1989
unit = "s3"

piv.columns = ["s" + str(i) for i in list(piv.columns)]

piv = piv.rename(columns={unit: "actual"})
Data after preprocessing

First, we pivot the data such that we have one column for each state and one row per year. We also rename the columns; the reason for this will become clear shortly.

formula = "actual ~ 0 + " + " + ".join(piv.columns.drop("actual"))

We construct a formula that says we want to explain the ‘actual’ variable (i.e. the cigarette sales in California) with the cigarette sales in the other states. Note that the predictor terms are joined with ‘ + ’, and that we had to rename the columns because we cannot use plain integers in a formula. The 0 simply means that we do not want to include an intercept in the model.
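On a toy wide-format frame with the same naming scheme, the construction looks like this:

```python
import pandas as pd

# Toy wide-format frame using the "s<state id>" naming scheme, with the
# treated unit already renamed to "actual"
piv = pd.DataFrame(columns=["actual", "s1", "s2", "s4"])

# Terms must be joined with " + " so patsy parses them as separate predictors
formula = "actual ~ 0 + " + " + ".join(piv.columns.drop("actual"))
print(formula)  # actual ~ 0 + s1 + s2 + s4
```

The resulting string is a standard patsy-style formula, the same mini-language used by statsmodels and Bambi.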

result = cp.pymc_experiments.SyntheticControl(
    piv,
    treatment_time,
    formula=formula,
    model=cp.pymc_models.WeightedSumFitter(
        sample_kwargs={"target_accept": 0.95}
    ),
)

The code above creates the model and fits it. This is straightforward if we have clean data: we simply pass the data to CausalPy along with the time of the treatment and our formula. The formula describes how we want to construct the synthetic control group (i.e. from which variables). Besides using SyntheticControl as our experiment type, we tell CausalPy that we want to use a WeightedSumFitter as our model. Once we run this code, CausalPy will initiate a Markov chain Monte Carlo (MCMC) algorithm that performs inference by drawing samples from the posterior distribution. We will not go into the details of Bayesian inference here, but there are nice introductory articles that explain the concept intuitively.

This is the primary figure we get after fitting the model. First we should make sure that the model can construct a good synthetic control group. That is the case here, as we achieve an R² of ~82%. In the first subplot, CausalPy shows us in orange where California would have been without the intervention; the black dots show the actual observations. The two other subplots show the (cumulative) difference between the synthetic control and the treatment group. Note that we also get credible intervals for the causal effect.

Model Coefficients

We can also look at the coefficients of the WeightedSumFitter. This again shows that the synthetic California is a combination of the other states. In this case, s8 and s4 make up a large portion of the synthetic California.

Conclusion

Causal inference is an often overlooked area of statistics. It is, however, becoming more popular, as it allows us to go beyond mere association and correlation and answer questions of the ‘what-if’ type. Answering these questions is essential for actually making data-informed decisions.
