Hands-on Causal Effect Estimation with Python

A Gentle Guide to Causal Inference with Machine Learning Pt. 10

Kenneth Styppa
Causality in Data Science
15 min read · Jun 17, 2024


Simply knowing that things are causally related is usually not enough. Think of examples such as increasingly extreme weather phenomena: a forecast that only indicates "some rain" on a Sunday is of limited use, especially if you live near a river and fear flooding; what you really need to know is how much rain to expect. There are numerous other examples of the same type. Each shows that we often want to know the effect of a cause and not only the causal relationship itself.

In this tutorial, you will be introduced to a computational method that will enable you to utilize the theoretical knowledge you have gained throughout the last blog posts to estimate causal effects in your projects.

Before you keep on reading, please make sure you are familiar with the following concepts:

  1. The causal Markov, faithfulness and sufficiency assumptions
  2. The fundamental idea behind causal inference
  3. Causal graphs & d-separation
  4. The PC algorithm
  5. The reason why machine learning can profit from causality

Part 1: The Theory

First, we introduce a bit of theory on causal effect estimation. If you are familiar with that already you can jump directly to part two where we demonstrate causal effect estimation with the tigramite package.

The Big Picture: From Causal Discovery to Causal Effect Estimation

We have already established in the introduction the fundamental need to answer causal questions not only qualitatively but also quantitatively. In other words, we do not simply want a binary statement about whether a causal relationship exists or not; we want to obtain the actual size of a possible effect.
It should be obvious that this is different from causal discovery where the task is to estimate a causal graph from data. Methods for causal discovery won’t give you causal effect estimations directly. And yet, they are a fundamental part of the path towards them.
Sounds a bit confusing? No worries. It is not that hard. Imagine causal effect estimation as a two-phase process. You first identify a causal graph using… yes you are right… causal discovery methods (or available expert knowledge), and then use the resulting graph to determine the magnitudes of the detected effects. That’s it! It is, of course, even better if you already qualitatively know the graph without having to learn it.

Great, now that we understand the bigger picture, let’s proceed to some small mathematical formality that will help us to reliably achieve our eventual goal of estimating causal effects.

Pearl’s causal effect framework:

Suppose we are given a system and a causal graph describing it. The standard problem in Pearl's framework is to estimate the causal effect of a variable X on a variable Y given the known causal graphical model. Here, the causal effect on Y of setting X = x is a function of the interventional distribution p(Y|do(X=x)).

If you are unfamiliar with this notation, please consider this introduction. At this point, a simple recap of the do-notation should suffice. If you spot “do()” within a distribution definition, this means that we actively intervene in a system by setting one of its variables to a certain value. Please note that such an intervention is fundamentally different from conditioning on the same value. To make this clear, think about both concepts in a practical manner. Conditioning means you have sampled some data from a system and now are only selecting those for which X = x. Consequently, you ignore all those values in your sample for which X is not x.

Performing an intervention is fundamentally different. Here you intervene/manipulate the system and set the value of all instances of X to x. As a result, the distribution of your dependent variable in your system will be largely different when you intervene vs. when you condition.
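
To make the distinction tangible, here is a minimal NumPy sketch on a hypothetical confounded toy system (all coefficients are made up for illustration and are not part of the tutorial's tigramite example): conditioning selects a subset of the observed samples, while intervening regenerates the data with X forced to a value.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy system with a confounder Z: Z -> X, Z -> Y, and X -> Y (true effect of X on Y is 0.5)
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = 0.5 * X + 0.8 * Z + rng.normal(size=n)

# Conditioning: keep only those observed samples where X happens to be close to 1
mask = np.abs(X - 1.0) < 0.05
print("E[Y | X = 1] (conditioning):", round(Y[mask].mean(), 2))   # biased upwards by the confounder Z

# Intervening: regenerate the system with X set to 1 for every sample (the do-operation)
X_do = np.ones(n)
Y_do = 0.5 * X_do + 0.8 * Z + rng.normal(size=n)
print("E[Y | do(X = 1)] (intervening):", round(Y_do.mean(), 2))   # close to 0.5, the true causal effect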

SCMs

Given this background, the basis of Pearl's framework is the assumption of an underlying (but unknown) structural causal model (SCM) among random variables, for example (a simple illustrative system with a confounder Z):

Z := N_Z
X := f_X(Z, N_X)
Y := f_Y(X, Z, N_Y)

with f() being assignment functions, through which the value of a random variable (left-hand side of the equation) gets determined by its parents and a noise variable N in the graph. The graph corresponding to the SCM above has the edges Z → X, Z → Y and X → Y.

So how can we interpret p(Y|do(X=x)) in such a case? Intuitively, it is the (interventional) probability distribution of Y in the intervened SCM where the assignment equation of X is replaced by the constant x, such that the SCM becomes:

Z := N_Z
X := x
Y := f_Y(X, Z, N_Y)

The above is a “perfect” intervention as explained in “The Elements of Causal Inference”. Note that there are more advanced and highly interesting types of interventions. For now, the simple case above should suffice.

Average Treatment Effects

So what would a standard notion of the causal effect look like if we follow this notion of interventions? The commonly used average treatment effect (ATE) is the difference between the expected value of Y under an intervention that sets X to one value and the expected value of Y under an intervention that sets X to another value. Or simply:

ATE = E[Y | do(X = x)] − E[Y | do(X = x')]

To be precise, this is only one way to formalize causal effects based on interventional distributions. There are more types of effects to discover, such as the Conditional Average Treatment Effect (typically encountered as CATE), the Heterogeneous Treatment Effect and the Individual Treatment Effect. The ATE is the classic and also the simplest type though, so we will stick to this one for the remainder of this article. It is a good way to get started!

Adjustment sets

Experiments are not always feasible. We already stressed this fact in one of our first posts, as it is one of the fundamental problems causal inference addresses. Pearl's theory comes in quite handy here because it allows us to use purely graphical criteria to characterize whether a causal effect of X on Y among observed variables V is identifiable, meaning whether the interventional target query can be written as

p(Y | do(X = x)) = q( p(V) ),

with q being some function of the observational distribution p(V), the estimand. Thus, provided that the causal effect is identifiable, we can use Pearl's theory and his well-developed methods, such as the do-calculus, to obtain the interventional distribution from observational data without any intervention.

One way to achieve this is via valid adjustment sets that fulfill, e.g., the backdoor criterion or, more generally, the generalized adjustment criterion. We will keep it simple for the start and cover backdoor adjustment, where a suitable (possibly empty) set of adjustment variables Z allows us to express the interventional distribution for setting do(X=x) in terms of the observational distribution. For non-empty Z, the formula for backdoor adjustment reads:

p(Y | do(X = x)) = Σ_z p(Y | X = x, Z = z) p(Z = z)

In other words, given we adjust for the appropriate variables, which we then call a valid adjustment set, we can compute what the interventional distribution would be like, although we do not perform a "real intervention" ourselves. While this sounds a little like a magic trick, it is very logical and thoroughly proven. Remember, however, that this is based on the assumption of knowing the causal graph, from which one can then determine the adjustment sets.

According to the generalized backdoor criterion (Perkovic et al., 2018), an adjustment set Z is valid if and only if both of the following conditions hold:

  1. Z contains no forbidden nodes, i.e. Z ∩ forb(X, Y) = ∅, and
  2. Z blocks (d-separates) all non-causal paths from X to Y,

where "forb" refers to forbidden nodes that are descendants of either Y or certain mediator node(s) M, which may lie between X and Y. The second point assures no "leakage" of spurious association and is based on d-separation.

Using adjustment sets to estimate causal effects

Now let's get to the heart of it. Equipped with a valid adjustment set, and the resulting factorization of our interventional distribution, the expected value of Y under the intervention X = x becomes

E[Y | do(X = x)] = Σ_z E[Y | X = x, Z = z] p(Z = z),

i.e. the outer expectation is taken over the observational distribution of Z. The expectations on the right are all expected values without any interventional statements. So we can estimate them using our observational data (X, Y, Z). To estimate the inner expected value, we use a statistical learning model

f_hat(x, z) ≈ E[Y | X = x, Z = z],

i.e. a model which yields an unbiased estimator of Y given values for X and Z. In the linear case, this would be your basic linear regression model from sklearn, although more sophisticated extensions are of course possible and oftentimes necessary.

We are almost there. The last step is to get an estimator for the outer expected value by evaluating the trained model f_hat over all observed values z_k of Z (note that z_k is a vector since it contains the observed values of all variables in the adjustment set Z) and then taking the average:

E_hat[Y | do(X = x)] = (1/n) Σ_k f_hat(x, z_k)

Having operationalised the interventional expected value, you can now compute the average treatment effect by choosing different values x and x' for intervention and subtracting the resulting expected values.
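
To make this concrete, here is a minimal sketch of the two-step plug-in estimator on a hypothetical confounded toy example with a single adjustment variable Z (the data-generating coefficients and the helper interventional_mean are ours, purely for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 50_000

# Toy observational data where {Z} is a valid adjustment set (true effect of X on Y is 0.5)
Z = rng.normal(size=n)
X = 0.8 * Z + rng.normal(size=n)
Y = 0.5 * X + 0.8 * Z + rng.normal(size=n)

# Step 1: fit f_hat(x, z) as a statistical model for E[Y | X = x, Z = z]
f_hat = LinearRegression().fit(np.column_stack([X, Z]), Y)

def interventional_mean(x):
    """Estimate E[Y | do(X = x)] by averaging f_hat(x, z_k) over all observed z_k."""
    X_fixed = np.full_like(Z, x)
    return f_hat.predict(np.column_stack([X_fixed, Z])).mean()

# Step 2: the ATE is the difference of two interventional expectations
ate = interventional_mean(1.0) - interventional_mean(0.0)
print("Estimated ATE = %.2f (true value 0.5)" % ate)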

In the linear case, you can do something even more interesting. Here the average treatment effect for an intervention on X equals the regression coefficient of X in a multivariate linear regression of Y on X and Z, since with E[Y | X, Z] = β_X X + β_Z^T Z the adjustment formula gives

E[Y | do(X = x)] − E[Y | do(X = x')] = β_X (x − x').

This gets especially handy when we are in the realm of multivariate interventions in X (meaning: X is not one random variable but represents many). Then the vector of average treatment effects is simply given by the coefficient vector β_X, i.e.

E[Y | do(X = x)] − E[Y | do(X = x')] = β_X^T (x − x'),

from the regression model

Y = β_X^T X + β_Z^T Z + ε.
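
In code, this shortcut amounts to reading off a regression coefficient. A tiny self-contained check on the same kind of toy data as above (again with made-up coefficients) might look like this:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
Z = rng.normal(size=50_000)                      # confounder; {Z} is a valid adjustment set
X = 0.8 * Z + rng.normal(size=50_000)
Y = 0.5 * X + 0.8 * Z + rng.normal(size=50_000)  # true ATE of a unit increase in X is 0.5

# The coefficient of X in a multivariate regression of Y on (X, Z) recovers the ATE
reg = LinearRegression().fit(np.column_stack([X, Z]), Y)
print("ATE estimate (coefficient of X) = %.2f" % reg.coef_[0])   # roughly 0.50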

Part 2: Estimating Causal Effects with Python

Getting started:

It should now be a bit clearer what happens under the hood when you perform causal effect estimation. Let's get hands-on with a coding example!

To perform causal effect estimation we are going to use the Tigramite package. As the go-to package for full-stack causal inference with time series data, Tigramite provides continuously maintained, state-of-the-art methods with few package dependencies (details: https://github.com/jakobrunge/tigramite).

You can install it via the following command in the terminal (ideally inside the environment of your project):

pip install tigramite

Let's continue and import the packages needed for this tutorial:

# Imports
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from scipy.stats import gaussian_kde
import tigramite
from tigramite import data_processing as pp
from tigramite.toymodels import structural_causal_processes as toys
from tigramite import plotting as tp
from tigramite.models import Models
from tigramite.causal_effects import CausalEffects
import sklearn
from sklearn.linear_model import LinearRegression

For a basic causal effect estimation in Tigramite, you only need a couple of other packages in addition to tigramite itself. Those are numpy, matplotlib and sklearn.

So far so good. To start causal effect estimation, let’s first generate some ground truth data against which we can compare the outcome of our causal inference efforts.

To do so, we generate some data from a slightly more challenging toy SCM with mixed links using toys.structural_causal_process with Gaussian unit variance noise (noises = None).

def lin_f(x): return x

coeff = .5
links_coeffs = {
    0: [],
    1: [((0, 0), coeff, lin_f), ((5, 0), coeff, lin_f)],
    2: [((1, 0), coeff, lin_f), ((5, 0), coeff, lin_f)],
    3: [((1, 0), coeff, lin_f), ((2, 0), coeff, lin_f), ((6, 0), coeff, lin_f), ((7, 0), coeff, lin_f)],
    4: [((5, 0), coeff, lin_f), ((7, 0), coeff, lin_f)],
    5: [],
    6: [],
    7: [],
}

T = 10000
data, nonstat = toys.structural_causal_process(links_coeffs, T=T, noises=None, seed=7)

# Time series no. 7 is an unobserved confounder, so we drop it from the observed data
data = data[:, [0, 1, 2, 3, 4, 5, 6]]
dataframe = pp.DataFrame(data)

As we need to assess interventions to estimate causal effects, we also generate ground truth interventional data for two interventions, x1 = x2 = 0,

# Define the intervened variables X = [X_1, X_2] (nodes 0 and 1) and the target Y (node 3);
# these index tuples are used both for the interventions here and for CausalEffects below
X = [(0, 0), (1, 0)]
Y = [(3, 0)]

intervention2 = 0.*np.ones(T)
intervention_data2, nonstat = toys.structural_causal_process(
    links_coeffs, T=T, noises=None, seed=7,
    intervention={X[0][0]: intervention2, X[1][0]: intervention2},
    intervention_type='hard',)

# Time series no. 7 is an unobserved confounder, so we drop it again
intervention_data2 = intervention_data2[:, [0, 1, 2, 3, 4, 5, 6]]
tp.plot_timeseries(pp.DataFrame(intervention_data2)); plt.show()

and x’1=x’2=1

intervention1 = np.ones(T)
intervention_data1, nonstat = toys.structural_causal_process(
    links_coeffs, T=T,
    noises=None,
    seed=7,
    intervention={X[0][0]: intervention1, X[1][0]: intervention1},
    intervention_type='hard',)

# Time series no. 7 is an unobserved confounder, so we drop it again
intervention_data1 = intervention_data1[:, [0, 1, 2, 3, 4, 5, 6]]
tp.plot_timeseries(pp.DataFrame(intervention_data1)); plt.show()

Alright, now we have the data we need. Let’s proceed to estimate the effects.

The true average treatment effect is then simply the average difference between the Y values of the two interventional datasets. We can calculate it with two lines of code, which yields a true effect of 0.75.

true_effect = (intervention_data1[:,Y[0][0]] - intervention_data2[:,Y[0][0]]).mean()
print("True effect = %.2f" %true_effect)

Arguments for “CausalEffects”

To perform causal effect estimation using the CausalEffects class in Tigramite, we need to specify the following arguments:

graph, graph_type, hidden_variables.

  • graph is a string array of dtype='<U3' whose shape depends on graph_type:
  • graph_type='dag' denotes a non-time series DAG where the graph is of shape (N, N), with N denoting the number of nodes. Edges are encoded as strings, e.g. '-->' for an outgoing edge, '<--' for an incoming edge, and '' for no edge.
  • graph_type='admg' denotes a non-time series ADMG (acyclic directed mixed graph), which we do not discuss in this post all too much; bidirected edges '<->' encode hidden confounding. For more details: see GitHub.
  • graph_type='stationary_dag' denotes a stationary time series DAG where the graph is of shape (N, N, tau_max + 1), where tau_max is the maximal time lag and the +1 accounts for contemporaneous (lag-0) links. (If the graph vocabulary is unfamiliar to you, have a look at our past blog posts on (time-series) DAGs; a minimal example follows after this list.)
  • hidden_variables is a list of tuples [(k, -tau), …] containing hidden variables. In the case of time series graphs, it thereby contains selected hidden time points of a variable k such that (k, -tau) denotes variable k at time t − tau.
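
To make the time series case a bit more tangible, here is a minimal sketch of how such a stationary_dag graph array could look for two variables with tau_max = 1 (the variables, lags and links are made up for illustration; please consult the tigramite documentation for the exact indexing conventions):

import numpy as np
from tigramite.causal_effects import CausalEffects

N, tau_max = 2, 1
# Empty graph of shape (N, N, tau_max + 1); entry [i, j, tau] describes the link
# between variable i at time t - tau and variable j at time t
graph_ts = np.full((N, N, tau_max + 1), '', dtype='<U3')
graph_ts[0, 0, 1] = '-->'   # autodependency X0_{t-1} --> X0_t
graph_ts[0, 1, 1] = '-->'   # lagged causal link X0_{t-1} --> X1_t

# Effect of intervening on X0 at time t-1 on X1 at time t
causal_effects_ts = CausalEffects(graph_ts, graph_type='stationary_dag',
                                  X=[(0, -1)], Y=[(1, 0)],
                                  hidden_variables=None, verbosity=0)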

To construct a graph, we provide all its edges in a numpy array. Consequently, to use CausalEffects we need to have already conducted causal discovery using, e.g., PCMCI as described in our last blog post, or simply know the graph from expert knowledge. Given we have identified the causal links, we can then proceed and specify an array that represents the graph underlying the SCM specified above.

#graph in the toy model
graph = np.array([['', '-->', '', '', '', '', ''],
['<--', '', '-->', '-->', '', '<--', ''],
['', '<--', '', '-->', '', '<--', ''],
['', '<--', '<--', '', '<->', '', '<--'],
['', '', '', '<->', '', '<--', ''],
['', '-->', '-->', '', '-->', '', ''],
['', '', '', '-->', '', '', '']], dtype='<U3')

This results in the following graph:

# Variable names used for plotting: interventions X_1, X_2, mediator M, target Y, covariates Z_1, Z_2, Z_3
var_names = ['$X_1$', '$X_2$', '$M$', '$Y$', '$Z_1$', '$Z_2$', '$Z_3$']

tp.plot_graph(graph=graph,
              var_names=var_names,
              save_name='Example.pdf',
              figsize=(8, 6),
              ); plt.show()

Once graph, graph_type, and hidden_variables are defined, we can instantiate the CausalEffects class native to Tigramite (note: we do not need the generated data yet).

# X and Y denote positions in the graph array above: X = [X_1, X_2], Y = [Y]
# (the same index tuples we already defined when generating the interventional data)
X = [(0, 0), (1, 0)]
Y = [(3, 0)]

# Run CausalEffects
causal_effects = CausalEffects(graph, graph_type='admg', X=X, Y=Y, S=None,
                               hidden_variables=None, verbosity=1)

We can then use the resulting CausalEffects object to obtain the adjustment set using the get_optimal_set() method. (Note: we use not just any adjustment set, but the optimal adjustment set, i.e. the valid adjustment set, blocking all backdoor paths from the intervention to the outcome, that yields the lowest asymptotic estimator variance.)

opt = causal_effects.get_optimal_set()
print("Oset = ", [(var_names[v[0]], v[1]) for v in opt])

Resulting in the output: Oset = [('$Z_1$', 0), ('$Z_2$', 0), ('$Z_3$', 0)]

So in this case, the optimal adjustment set consists of Z3, Z2 and Z1.

We now estimate the average treatment effect utilizing the graphical generalized backdoor adjustment criterion with the optimal adjustment set. To do so, we use the fit_total_effect and predict_total_effect methods. Essentially, we take our causal_effects object with the specified optimal adjustment set, pick an estimator (e.g. an sklearn estimator) and fit it to our dataframe, following the logic of valid adjustment as defined in the theoretical explanations above. You can choose any model you like. Of course, depending on the kind of functional dependency you expect, some estimators might prove more useful than others. You can increase the complexity of your mapping function arbitrarily, e.g. by using deeply nested neural networks. At this point, balancing bias and variance is essential to choosing the right estimator. But luckily, the world of statistical/machine learning provides an extensive body of literature and methods to help us make this decision. It is not the worst idea to start simple with a linear regression:

causal_effects.fit_total_effect(
    dataframe=dataframe,
    estimator=LinearRegression(),
    adjustment_set='optimal',
    conditional_estimator=None,
    data_transform=None,
    mask_type=None,
)
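
If you expect nonlinear dependencies, you can pass any other sklearn-style estimator in exactly the same way, for example a random forest (the hyperparameters below are arbitrary illustration values). Note that running this call would overwrite the linear fit, so re-run the LinearRegression fit above before continuing with the tutorial:

from sklearn.ensemble import RandomForestRegressor

causal_effects.fit_total_effect(
    dataframe=dataframe,
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    adjustment_set='optimal',
    conditional_estimator=None,
    data_transform=None,
    mask_type=None,
)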

After fitting our estimator, we have learned a causal effect model from our data. We can now use it to predict the effect of the two interventions denoted earlier. The function predict_total_effect takes as arguments intervention_data, conditions_data, and pred_params.

  • pred_params are optional parameters passed on to the sklearn prediction function.
  • intervention_data is an array of shape (1, len(X)) to predict the effect of a single intervention value X = x, or of shape (T_x, len(X)) to predict the effects of a range of T_x intervention values.
  • conditions_data is for estimating conditional causal effects and not further discussed here.

In other words, we now create intervention_data by setting X = 1 and X = 0, resulting in an array of ones and an array of zeros. We then provide them as inputs to our causal effect model and obtain the expected value of Y under each intervention.

intervention_data = 1.*np.ones((1, 2))
y1 = causal_effects.predict_total_effect(
    intervention_data=intervention_data
)
print("y1 = ", y1)

intervention_data = 0.*np.ones((1, 2))
y2 = causal_effects.predict_total_effect(
    intervention_data=intervention_data
)
print("y2 = ", y2)

To finally compute the expected causal effect of increasing X from 0 to 1, we simply calculate the difference between these two values.

beta = (y1 - y2)
print("Causal effect = %.2f" %(beta))

And here is your estimated causal effect: roughly 0.74 in our run. As you can see, this is quite close to the true effect value of 0.75. We can now advance this analysis even further by computing the effect over a range of intervention values from -2 to 2 using np.linspace and np.tile:

intervention_data = np.tile(np.linspace(-2, 2, 20).reshape(20, 1), (1, 2))
estimated_causal_effects = causal_effects.predict_total_effect(
    intervention_data=intervention_data,
    # conditions_data=conditions_data,
)

plt.scatter(intervention_data[:, 0], estimated_causal_effects[:, 0], label="Estimated")
# plt.plot(intervention_values, true_causal_effects, 'k-', label="True")
# plt.title(r"NRMSE = %.2f" % (np.abs(estimated_causal_effects - true_causal_effects).mean()/true_causal_effects.std()))
plt.xlabel('Intervention / X-value range')
plt.ylabel('Causal effect')
plt.legend()
plt.show()

As you can see, this outputs a straight line with slope 0.74 (almost the true causal effect), as we used a linear estimator. Depending on the kind of estimator we choose, this plot might look vastly different, revealing the actual complexity of some real phenomena; see here for some nonlinear effects.

Wrapping it up

Oh, happy day! You have just successfully estimated a causal effect and, at the same time, created your causal-inference-informed machine learning model. Especially if you followed all other articles up to this point, this is a true achievement, as you have gained the fundamental knowledge needed to go on and explore this field of research developing at the intersection of statistical learning and causal inference. What you just did adds a valuable asset to your skill set. Instead of simply learning patterns from data using plain-vanilla algorithms, you have chosen a much more thoughtful approach that eventually leads to more robust and insightful models.

We appreciate you accompanying us on this journey and hope that you will stick around and have a look at some of the state-of-the-art methods that we plan to publish on this platform in the future.

Thanks for reading!

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center's Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist in companies such as BMW and QuantumBlack and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jakob Runge is Professor of Data Science at TU Dresden. His Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com
