Structural Causal Models — A Quick Introduction
A Gentle Guide to Causal Inference with Machine Learning Pt. 7
Having read the previous blog posts, you are familiar with the basic idea of causal inference and with how we use certain assumptions and methods, such as conditional independence tests, to identify a causal graph that describes the causal relations among our observed variables. While this is already quite an achievement, the world of causal inference has even more to offer. Suppose there were a way to formalize the whole system into a set of neat equations that lets you assess all kinds of intriguing things, such as quantifying the effect of interventions or even reasoning about counterfactuals… Wouldn’t that be amazing? Well, this is exactly what Structural Causal Models (SCMs) offer.
You should be familiar with the following concepts from the previous articles before you keep on reading:
The idea behind SCMs
Before getting to the formalities, let’s understand the very basic idea behind SCMs.
So far, the only type of causal model that we have encountered is the Causal Graph, a directed acyclic graph that depicts causal relations in a binary way. Either there is a cause-effect relation between two variables, in which case there is a directed edge from cause to effect in the graph, or there is none, and therefore there is also no edge.
Ideally, we would like to expand this binary description of causal relations into a more fine-grained one that includes more detail. For example, being able to quantify the effect of certain policy measures on climate change would be more helpful than “just” knowing that they are causally related, especially when there are several policies to choose from. But how could we get this done?
In a way, we just need to add some additional information to a Causal Graph, describing the characteristics of the discovered causal relationships.
In other words, we formulate a set of equations that describe all causal relations in our system. Let’s say we have discovered the following graph using the methods described in our earlier blog posts:
In an SCM, each of the four variables has its own equation, describing the functional mechanism by which a variable’s parents influence the variable itself. But an SCM includes more than the influence of the parents, because we want the causal relations to be described in a probabilistic manner. That is, instead of saying that every time X changes by 1 unit, Y changes by exactly 5 units, we acknowledge that there will be some random noise (N) that could alter the actual effect one way or another. This means that our SCM for the example graph above looks as follows:
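The graph and its equations are not reproduced in this text. One SCM that is consistent with what this post says about the example (X_1 depends on X_3, and X_2 and X_4 are the descendants of X_1) is sketched below. The exact parent sets of X_2, X_3 and X_4 are our assumption for illustration.

```latex
\begin{aligned}
X_1 &:= f_1(X_3, N_1) \\
X_2 &:= f_2(X_1, N_2) \\
X_3 &:= f_3(N_3) \\
X_4 &:= f_4(X_2, X_3, N_4)
\end{aligned}
```

Each line has the same shape: a variable on the left, and on the right a function of its parents plus its own noise term.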
Definition of Structural Causal Models
We give the definition as provided in Elements of Causal Inference by Peters et al. (2017):
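The definition itself is not reproduced in this text. In our own paraphrase (not a verbatim quote), an SCM is a pair consisting of d structural assignments and a joint noise distribution that factorizes, which can be written roughly as:

```latex
\begin{aligned}
\mathfrak{C} &:= (\mathbf{S}, P_{\mathbf{N}}) \\
\mathbf{S}:\quad X_j &:= f_j(\mathbf{PA}_j, N_j), \qquad j = 1, \dots, d \\
\mathbf{PA}_j &\subseteq \{X_1, \dots, X_d\} \setminus \{X_j\} \\
P_{\mathbf{N}} &= P_{N_1} \otimes \dots \otimes P_{N_d}
\end{aligned}
```

Here PA_j denotes the parents of X_j, and the last line expresses that the noise variables N_1, …, N_d are jointly independent. The graph obtained by drawing an edge from each parent to its child is assumed to be acyclic.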
Facets of Structural Causal Models
The world is uncertain — Random Noise in SCMs
As mentioned before, SCMs are probabilistic due to the random noise terms that appear in the equations. Additionally, for the rest of this post, we make an important assumption: the noise terms must be jointly independent. In a way, we assume that our model is able to explain all systematic dependencies in our system and that there is no unobserved confounding left. Be careful though: if you do think that some of your variables are confounded by something that you haven’t measured, you will have to work with SCMs of a more general kind, or include another, unobserved node in your graph.
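To make the role of independent noise concrete, here is a minimal sampling sketch. The linear functions, their coefficients and the parent sets are hypothetical choices for illustration, not taken from the original example. Each noise term is drawn separately, and the assignments are evaluated in causal order.

```python
import random


def sample_scm(rng):
    """Draw one joint sample (x1, x2, x3, x4) from a hypothetical linear SCM."""
    # Jointly independent noise terms: four separate standard-normal draws.
    n1, n2, n3, n4 = (rng.gauss(0.0, 1.0) for _ in range(4))
    # Structural assignments, evaluated parents-first (illustrative coefficients).
    x3 = n3                      # X_3 := N_3
    x1 = 2.0 * x3 + n1           # X_1 := f_1(X_3, N_1)
    x2 = -1.0 * x1 + n2          # X_2 := f_2(X_1, N_2)
    x4 = 0.5 * x2 + x3 + n4      # X_4 := f_4(X_2, X_3, N_4)
    return x1, x2, x3, x4


rng = random.Random(0)
samples = [sample_scm(rng) for _ in range(5000)]
mean_x1 = sum(s[0] for s in samples) / len(samples)
print(f"mean of X_1 over {len(samples)} samples: {mean_x1:.3f}")
```

Although the noise terms are independent, the variables themselves are not: all dependence between them is produced by the structural assignments.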
Keep it simple — Structural Minimality
The art of explaining is doing so in the simplest possible way. This principle has found its way into science under the name of Occam’s Razor, and states that among equally good explanations, the simplest one is to be preferred.
An incarnation of Occam’s Razor in Causal Inference goes by the name of Structural Minimality. It requires that all functions f_j describe an actually non-zero dependency between X_j and the function’s input arguments. This implies that the following SCM
has to be rewritten into:
in order to satisfy Structural Minimality. If we derive a Causal Graph from an SCM by drawing arrows from a function’s input to its output variables, Structural Minimality ensures that each edge we draw actually corresponds to non-zero causal effects.
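The two SCMs referred to above are not reproduced in this text. A hypothetical two-variable example of the same idea: in the first system X_1 appears as an argument of the second assignment, but with zero influence, so Structural Minimality requires dropping it.

```latex
\begin{aligned}
X_1 &:= N_1 \\
X_2 &:= 0 \cdot X_1 + N_2
\end{aligned}
\qquad \longrightarrow \qquad
\begin{aligned}
X_1 &:= N_1 \\
X_2 &:= N_2
\end{aligned}
```

Both systems entail the same distribution, but only the second yields a graph without the spurious edge from X_1 to X_2.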
The independence of cause and mechanism in SCMs
In an earlier blog post we introduced the mechanism as the facilitator that brings about the effect from its cause. In this sense, the function f_j is the mechanism that connects X_j to all its parents. In other words, f_j is the mechanism by which X_j is caused.
Also, we have already explained that we assume these mechanisms to be invariant to shifts in the cause or in other mechanisms. That is, no matter what we do to the value of the parent variable or to the other mechanisms, the mechanism f_j, by which X_j is brought about, stays the same. This also has important implications for the next question we ask ourselves.
How to learn Structural Causal Models
Say that we have already found the Causal Graph describing our system, for instance through a previous causal discovery process, and we would like to learn the more fine-grained SCM. We will illustrate how to do this using the following graph, corresponding to our example SCM above.
Having already identified the existence of causal relations in step one, we now try to quantify them through functions that reproduce the observed variability of the data as closely as possible. So we know that X_1 depends on X_3, but now we want to identify the structural assignment X_1 := f_1(X_3, N_1).
The wonderful part about causal inference is that if we have done our homework in the previous steps, this last identification / quantification step can be done with all the tools we know from classical statistics and machine learning. For example, we could learn f_1 using OLS, decision trees, ANNs and so forth, as long as we regress each variable on its causal parents.
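As a minimal sketch of the simplest case, the snippet below fits a linear f_1 by OLS on toy data. The data-generating equation X_1 := 2 * X_3 + N_1 is a hypothetical ground truth that we made up so the estimate can be checked, and the closed-form slope formula stands in for whatever regression library you normally use.

```python
import random

rng = random.Random(1)
# Hypothetical ground truth, used only to generate toy data: X_1 := 2 * X_3 + N_1.
x3 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x1 = [2.0 * v + rng.gauss(0.0, 0.5) for v in x3]

# Closed-form OLS for a single regressor: slope = cov(X_3, X_1) / var(X_3).
mean_x3 = sum(x3) / len(x3)
mean_x1 = sum(x1) / len(x1)
cov = sum((a - mean_x3) * (b - mean_x1) for a, b in zip(x3, x1))
var = sum((a - mean_x3) ** 2 for a in x3)
slope = cov / var
intercept = mean_x1 - slope * mean_x3

# The residuals serve as an estimate of the noise term N_1.
residuals = [b - (intercept + slope * a) for a, b in zip(x3, x1)]
print(f"estimated f_1: X_1 = {intercept:.2f} + {slope:.2f} * X_3")
```

The only causal ingredient here is the choice of regressor: X_1 is regressed on its parent X_3, not on arbitrary correlated variables.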
Of course, we cannot arbitrarily choose any kind of model and any variables we happen to find interesting. Again, we need to check what kind of assumptions we can make (e.g., linearity for OLS), back them up with some theory and run adequate models to learn and quantify the previously detected dependencies. For instance, let’s say you have detected a dependency of X_1 on X_3 in a previous causal discovery process. In addition, you have reasons to assume the link to be highly non-linear. You can then train a Neural Net to learn the non-linear function as you have done many times before:
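The training code is not reproduced in this text. As a stand-in, here is a deliberately tiny one-hidden-layer network written from scratch in plain Python, so that it runs without any framework. The sine-shaped ground truth, the layer width, the learning rate and the number of epochs are all hypothetical choices. In practice you would use your usual deep-learning library.

```python
import math
import random

rng = random.Random(0)
# Hypothetical non-linear ground truth for toy data: X_1 := sin(2 * X_3) + N_1.
x3 = [rng.uniform(-2.0, 2.0) for _ in range(200)]
x1 = [math.sin(2.0 * v) + rng.gauss(0.0, 0.1) for v in x3]

H = 8                                          # number of hidden tanh units
w1 = [rng.gauss(0.0, 1.0) for _ in range(H)]   # input -> hidden weights
b1 = [0.0] * H                                 # hidden biases
w2 = [rng.gauss(0.0, 0.1) for _ in range(H)]   # hidden -> output weights
b2 = 0.0                                       # output bias


def predict(x):
    """Return the network output and the hidden activations for input x."""
    hidden = [math.tanh(w1[k] * x + b1[k]) for k in range(H)]
    return sum(w2[k] * hidden[k] for k in range(H)) + b2, hidden


def mse():
    """Mean squared error of the current network on the toy data."""
    return sum((predict(x)[0] - y) ** 2 for x, y in zip(x3, x1)) / len(x3)


initial_mse = mse()
lr = 0.02
for epoch in range(500):
    # Accumulate full-batch gradients of the mean squared error.
    g_w1, g_b1, g_w2, g_b2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, y in zip(x3, x1):
        y_hat, hidden = predict(x)
        err = 2.0 * (y_hat - y) / len(x3)
        g_b2 += err
        for k in range(H):
            g_w2[k] += err * hidden[k]
            back = err * w2[k] * (1.0 - hidden[k] ** 2)  # tanh derivative
            g_w1[k] += back * x
            g_b1[k] += back
    # Plain gradient-descent update.
    for k in range(H):
        w1[k] -= lr * g_w1[k]
        b1[k] -= lr * g_b1[k]
        w2[k] -= lr * g_w2[k]
    b2 -= lr * g_b2

final_mse = mse()
print(f"MSE before training: {initial_mse:.3f}, after: {final_mse:.3f}")
```

The fitted network plays the role of f_1, and the residuals X_1 minus the prediction play the role of the noise N_1.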
SCMs and Interventions
Interventional questions are all around us, for example “what would happen if climate policy X came into effect?”. Questions of the type “what would happen to Y if X were set to a certain value” are typical for causal inference. The actions they describe are so-called interventions: we intervene on a system by setting one or more of its variables to a specific value, as one does in a scientific experiment.
This is clearly different from conditioning, where we select those observations out of a sample that take on a certain value while ignoring the others. Intervening is not selecting; it is forcing all instances to take on this value.
Consequently, the term P(Y|do(X = 2)), which expresses the probability distribution of Y if X is set to 2, does not generally equal the conditional probability P(Y|X = 2). For a deeper explanation of this statement, have a look at our past blog posts.
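A small simulation can make the difference tangible. The three-variable system below is hypothetical and not part of the original example: a common cause Z drives both X and Y, and X also affects Y. Conditioning selects the samples where X happens to be near 2, whereas intervening replaces the assignment of X for every sample.

```python
import random

rng = random.Random(42)


def draw(do_x=None):
    """One sample (x, y) from a hypothetical SCM with common cause Z."""
    z = rng.gauss(0.0, 1.0)                  # Z := N_Z
    if do_x is None:
        x = z + rng.gauss(0.0, 1.0)          # X := Z + N_X
    else:
        x = do_x                             # do(X = do_x) replaces the assignment
    y = x + 2.0 * z + rng.gauss(0.0, 1.0)    # Y := X + 2 Z + N_Y
    return x, y


# Conditioning: keep only the observational samples with X close to 2.
observational = [draw() for _ in range(100_000)]
near_two = [y for x, y in observational if abs(x - 2.0) < 0.1]
cond_mean = sum(near_two) / len(near_two)

# Intervening: force X = 2 for every sample.
interventional = [draw(do_x=2.0)[1] for _ in range(100_000)]
do_mean = sum(interventional) / len(interventional)

print(f"E[Y | X close to 2] = {cond_mean:.2f}")
print(f"E[Y | do(X = 2)]    = {do_mean:.2f}")
```

For this toy model the two quantities can be worked out by hand: E[Y | X = 2] = 4, because observing X = 2 also tells us that Z is probably positive, while E[Y | do(X = 2)] = 2, because forcing X leaves Z untouched.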
Precisely because conditioning and intervening are not the same, questions such as the one about climate policy can’t be answered simply by comparing two means.
SCMs come in very handy here. Once we have conducted proper causal discovery (or simply know the qualitative graph from expert knowledge) and estimated the SCM, we can model intervention distributions quite easily. Let’s again understand this using the four-variable example from above. The SCM still looks like this:
The effect of a hard intervention do(X_1=2) can be modeled by the following alteration:
We simply set the variable on which we perform the intervention to our value of choice. The effect of changing X_1 will then “ripple” through the whole system and will change the outcome of all its descendants, in this case X_2 and X_4.
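The altered equations are not reproduced in this text. Assuming, for illustration, that X_2 depends on X_1, that X_3 has no parents, and that X_4 depends on X_2 and X_3, the intervened SCM would read:

```latex
\begin{aligned}
X_1 &:= 2 \\
X_2 &:= f_2(X_1, N_2) \\
X_3 &:= f_3(N_3) \\
X_4 &:= f_4(X_2, X_3, N_4)
\end{aligned}
```

Only the assignment of X_1 changes. All other mechanisms and all noise distributions stay as they were, which is the independence of mechanisms at work.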
This new SCM entails a new distribution, the interventional distribution.
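Sampling from the interventional distribution then amounts to sampling from the altered SCM. The sketch below uses a hypothetical linear four-variable SCM (the coefficients and parent sets are our assumptions) and compares variable means with and without do(X_1 = 2).

```python
import random


def sample(rng, do_x1=None):
    """One sample from a hypothetical linear SCM, optionally under do(X_1 = do_x1)."""
    n1, n2, n3, n4 = (rng.gauss(0.0, 1.0) for _ in range(4))
    x3 = n3                          # X_3 := N_3
    if do_x1 is None:
        x1 = 2.0 * x3 + n1           # X_1 := f_1(X_3, N_1)
    else:
        x1 = do_x1                   # hard intervention: the assignment is replaced
    x2 = -1.0 * x1 + n2              # X_2 := f_2(X_1, N_2)
    x4 = 0.5 * x2 + x3 + n4          # X_4 := f_4(X_2, X_3, N_4)
    return x1, x2, x3, x4


def column_means(rows):
    """Mean of each of the four variables over a list of samples."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(4)]


rng = random.Random(7)
obs_means = column_means([sample(rng) for _ in range(5000)])
do_means = column_means([sample(rng, do_x1=2.0) for _ in range(5000)])

for i in range(4):
    print(f"X_{i + 1}: observational mean {obs_means[i]:+.2f}, "
          f"under do(X_1 = 2) {do_means[i]:+.2f}")
```

In this toy system the means of the descendants X_2 and X_4 shift (to about -2 and -1, as the linear equations imply), while the non-descendant X_3 keeps its mean of about 0.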
If you recall one of our first articles, this is one of the great benefits of causal “infused” machine learning. We can make these adaptations and still have a solid model even, or especially, when facing severe distribution shifts.
More details on learning SCMs and on the challenges that might arise, such as not knowing the causal variables in the first place, can be found in: https://arxiv.org/pdf/2210.13583.pdf. Also have a look at our recent review and how-to guide published in Nature Reviews Earth and Environment.
With this blog post we have now covered the fundamentals and will keep on exploring more advanced topics. Keep on reading to reap the fruits of all the fundamentals we have built up! Up next: multivariate causal discovery in time series settings.
Source and Reading Recommendation:
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms (p. 288). The MIT Press.
Runge, J., Gerhardus, A., Varando, G., Eyring, V., & Camps-Valls, G. (2023). Causal Inference for Time Series. Nature Reviews Earth and Environment 4, 487–505.
About the authors:
Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist at BMW and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/
Jonas Wahl is a postdoctoral researcher within the research group Climate Informatics at TU Berlin. He obtained his PhD in mathematics at KU Leuven (Belgium) and worked at the Hausdorff Centre for Mathematics in Bonn before joining Jakob’s group at TU Berlin. His research focuses on causal inference for high-dimensional spatiotemporal data. You can read more about Jonas on his personal website https://jonaswahl.com.
Jakob Runge heads the Causal Inference group at the German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com