Assumptions for Causal Discovery

A Gentle Guide to Causal Inference with Machine Learning Pt. 5

Kenneth Styppa
Causality in Data Science
10 min read · Apr 27, 2023


All statements made using the toolkit of causal inference and causal machine learning rest on underlying assumptions about the process that generated the data. Hence, the extent to which you can draw causal conclusions rests on the justification of these assumptions.

Critically assessing all assumptions is therefore inevitable for anybody who wants to do causal machine learning. With the variety of more sophisticated algorithms, models and application areas (e.g. time series), there will always be additional assumptions to discuss. Yet there is also good news: equipped with an understanding of a rather small arsenal of assumptions, you will have what it takes to cover most of the essential parts of causal inference. In the following, we focus on the classical assumptions of causal discovery.

Before you start reading this article you should know the following concepts (see previous posts):

  1. D-separation
  2. Causal graphs
  3. The fundamental idea behind causal inference

Why assumptions are needed

“We are not about trying to magically pull causal rabbits out of a statistical hat.”

Richard Scheines (Carnegie Mellon University)

Let’s stay close to Judea Pearl and ask the question of Why. Why exactly do we need these assumptions which are so typical for causal inference and why should we care?

The reason is pretty simple. Our aim is to identify causal relationships and causal effects.

Now if you are familiar with the fundamental idea behind causal inference, you know that these two things cannot simply be derived from data per se. That is, in most cases we simply observe a system without the possibility of performing interventions on it.

In other words, most of the time causal quantities are not observational quantities, so we cannot simply compute statistical estimators. In that setting, assumptions play an important role, as they justify certain mathematical modifications that turn causal quantities into statistical quantities, which can then be estimated from the observational data we have at hand. Hence the importance of critically discussing the applicability of said assumptions.

Using them in the wrong way or in a setting where they are not justified will lead to highly biased descriptions of causal quantities, possibly resulting in fatal misunderstandings and wrong decisions (see our article on Simpson’s Paradox for an example). With that in mind, let’s start to understand them one by one.

Independence of Cause and Mechanism

Humans and most animals are quite good at generating intuitive knowledge about the cause and the effect of certain scenarios. If your little nephew has put his hand on a hot stove top, he most likely understands that hot things will hurt him if he gets too close to them. He will also know that it does not matter whether the hot stove top is in your house, the house of your mother or the house of his kindergarten friend. Being too close to something hot always results in pain. In other words, he knows that no matter where he puts his hand on a hot stove plate, the body reaction will always be the same. Put differently, the source of the heat (cause) and the body reaction (mechanism) which leads to the pain (effect) are independent of each other. With that he understands a fundamental assumption of causal inference, the independence of cause and mechanism:

“A system’s cause C and the mechanism M by which the cause brings about the effect E are independent of each other”.

This assumption makes it possible to perform localized interventions, meaning that we can change C without affecting M.

Let me repeat that because it is super important: as the mechanism and the cause are independent, we can perform any kind of intervention on the cause, and assume that the mechanism that connects the cause and the effect stays the same.

Given a dataset consisting of information about the cause C and the effect E, factorizing the joint distribution p(c, e) gives us two autonomous, modular components: p(c), the distribution of the cause, and p(e|c), the mechanism:

p(c, e) = p(e|c) · p(c)
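As a small illustration (a toy simulation, not from the original article), we can sample a cause, fix a mechanism, and check that intervening on the cause's distribution leaves the mechanism untouched. All coefficients here are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def mechanism(c, rng):
    # p(e|c): the mechanism linking cause to effect (here: linear + noise)
    return 2.0 * c + rng.normal(0.0, 0.5, size=c.shape)

# Observational regime: cause drawn from one distribution
c_obs = rng.normal(0.0, 1.0, 10_000)
e_obs = mechanism(c_obs, rng)

# Localized intervention on the cause: a completely different p(c),
# while the mechanism p(e|c) is left untouched
c_int = rng.uniform(3.0, 5.0, 10_000)
e_int = mechanism(c_int, rng)

# The conditional relationship (regression slope, standing in for the
# mechanism) is stable even though the marginal of C shifted drastically
slope_obs = np.polyfit(c_obs, e_obs, 1)[0]
slope_int = np.polyfit(c_int, e_int, 1)[0]
print(slope_obs, slope_int)  # both close to 2.0
```

The point of the sketch: changing how C is generated changes p(c) but not the fitted relationship p(e|c), which is exactly what the assumption licenses.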

The principle of independent mechanisms

Given more variables, the assumption generalizes to the very handy principle of independent mechanisms, stating that all mechanisms of a system are unaffected by each other. In other words, although the physical mechanism that connects altitude and temperature will lead to a change in general temperature when your nephew is on top of a mountain, this won’t have an impact on the fact that his hand will hurt the same when he touches a hot stove plate.

This results in a Bayesian Network Factorization, stating that the joint distribution of all observed variables factorizes into the product of all causal mechanisms.
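In symbols, the joint distribution factorizes as p(x1, …, xn) = ∏ᵢ p(xᵢ | pa(xᵢ)), where pa(xᵢ) denotes the parents of xᵢ in the graph. A minimal numerical sketch for a hypothetical binary chain A → B → C (the tables are invented for illustration):

```python
import numpy as np

# Hypothetical mechanisms for the chain A -> B -> C, binary variables
p_a = np.array([0.6, 0.4])                  # p(a)
p_b_a = np.array([[0.7, 0.3], [0.2, 0.8]])  # p(b|a), rows indexed by a
p_c_b = np.array([[0.9, 0.1], [0.4, 0.6]])  # p(c|b), rows indexed by b

# Bayesian network factorization: the joint is the product of mechanisms
joint = np.einsum('a,ab,bc->abc', p_a, p_b_a, p_c_b)

print(joint.sum())             # 1.0 -- a valid joint distribution
print(joint.sum(axis=(1, 2)))  # marginal of A recovers p(a)
```

Each conditional table can be modified independently of the others, which is the principle of independent mechanisms in computational form.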

The Causal Markov Assumption

With d-separation we already have a concept that formalizes the independence of two sets of nodes in the graph given a (possibly) empty third set Z in a causal graph.

Using the Causal Markov Condition, we will translate the principle of independent mechanisms into the semantics of causal graphs and therefore establish an elegant connection between probability distributions and causal structure. The assumption states:

“A node X is independent of all its non-descendants given the set of all its parents”

While this sounds very straightforward, it is important to recognize the resulting implications. Whenever we draw a causal graph to describe the underlying system that produced our observed dataset, the conditional independencies in the dataset must respect the d-separation structure of the graph at hand. If this is the case, we say the distribution P is Markovian with respect to graph G (Scheines, 1997).

The following example will make this clearer. Let’s say we observe a system of four variables x1, x2, x3, x4. These four variables have a joint distribution P(x1, x2, x3, x4). On our search for the true causal graph, we now ask ourselves under what conditions we could call our observed system Markovian with respect to the graph G depicted below. Before reading on, try to figure it out yourself (hint: we use the Causal Markov Condition).

Applying the Causal Markov Condition gives us the answer. P is Markovian with respect to G if:

x2 ⊥ x3 | x1 and x1 ⊥ x4 | {x2, x3}

This reads as: “x2 has to be independent of x3 conditioned on x1” and “x1 is independent of x4 conditioned on x2 and x3”. This means our joint distribution factorizes into:

p(x1, x2, x3, x4) = p(x1) · p(x2|x1) · p(x3|x1) · p(x4|x2, x3)
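Assuming a graph consistent with these two independencies (x1 → x2, x1 → x3, x2 → x4, x3 → x4; the original figure is not reproduced here, and the coefficients below are invented), a quick linear-Gaussian simulation can verify them via partial correlations:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Assumed graph: x1 -> x2, x1 -> x3, x2 -> x4, x3 -> x4
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = -0.5 * x1 + rng.normal(size=n)
x4 = 0.7 * x2 + 0.7 * x3 + rng.normal(size=n)

def partial_corr(a, b, given):
    # Correlate the residuals of regressing a and b on the conditioning set
    Z = np.column_stack(given + [np.ones(len(a))])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

print(partial_corr(x2, x3, [x1]))      # ~0: x2 independent of x3 given x1
print(partial_corr(x1, x4, [x2, x3]))  # ~0: x1 independent of x4 given x2, x3
print(partial_corr(x2, x3, []))        # clearly nonzero without conditioning
```

In the linear-Gaussian setting, vanishing partial correlation coincides with conditional independence, which is why this simple check works.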

Markov Equivalence Classes

It is important to notice that the same distribution can be Markovian with respect to different causal graphs, while on the other hand several data distributions can satisfy the Causal Markov Condition with respect to the same graph G. For example, the following two graphs imply the same observational independencies via d-separation (namely, Y is independent of C given S):

[Graph 1 (Scheines, 1997)] and [Graph 2 (Scheines, 1997)]

Speaking more precisely, both graphs contain the following independencies:

  1. For Y: Y is independent of C conditional on S
  2. For S: none of the other variables is independent of S
  3. For C: C is independent of Y conditional on S

Whenever it is the case that several graphs contain the same (conditional) independencies, they are said to be Markov equivalent or in the same Markov equivalence class.

This also means that, although it can be helpful in the process, the Causal Markov Assumption alone won’t give us the full picture yet, as the resulting output will be several Markov-equivalent graphs that all fit the data we observe. Consequently, we need to move one step further and pair the Principle of Independent Mechanisms and the Causal Markov Condition with more assumptions that enable more powerful statements.
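A toy simulation makes the equivalence tangible: the chain X → Y → Z and its reversal Z → Y → X generate exactly the same independence pattern (X independent of Z given Y, but dependent marginally), so no conditional independence test can tell them apart. Coefficients are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def partial_corr(a, b, given):
    # Partial correlation via residuals of linear regressions
    Z = np.column_stack(given + [np.ones(len(a))])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

# Graph 1: X -> Y -> Z
x1 = rng.normal(size=n)
y1 = 0.8 * x1 + rng.normal(size=n)
z1 = 0.8 * y1 + rng.normal(size=n)

# Graph 2: Z -> Y -> X (the reversed chain)
z2 = rng.normal(size=n)
y2 = 0.8 * z2 + rng.normal(size=n)
x2 = 0.8 * y2 + rng.normal(size=n)

# Both graphs produce the same (conditional) independence pattern
for x, y, z in [(x1, y1, z1), (x2, y2, z2)]:
    print(round(partial_corr(x, z, [y]), 3),   # ~0 given Y
          round(partial_corr(x, z, []), 3))    # nonzero marginally
```

Both rows look alike, which is precisely why the two chains sit in the same Markov equivalence class.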

At this point it is important to recognize that while the Causal Markov Assumption does not suffice to derive a precise causal structure, it is the foundation upon which all causal inference is built.

Faithfulness

When assuming a causal graph to be causally Markov, we assume that all the independencies implied by d-separation are reflected in the data distribution. This, however, does not mean that the data cannot contain additional independencies.

The following example illustrates this point:

(Scheines, 1997)

We want to model the effect of smoking on health. It could be the case that smoking makes people do more sports, which could fully cancel out the negative effect of smoking on health if, by coincidence, the two effects are equally large. Thus, although Health and Smoking cannot be d-separated, they are independent in the data distribution. In such a case we say that the data is unfaithful to the causal graph that generated it. More precisely, a dataset is unfaithful if the causal graph that generated the distribution does not cover all independencies of the data.

Conversely, when assuming data to be faithful, we assume that the causal graph reflects all probabilistic independencies through its d-separations. That is, we assume that any independence in the data is caused by the underlying structure of the graph that generated it, rather than by some numerical coincidence, such as the positive effect of exercise happening to exactly cancel the negative effect of smoking.

While this assumption may seem modest at first, its impact is rather large, as it dramatically reduces the set of graphs that could explain the underlying structure of the system.
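The smoking example can be sketched numerically. With hypothetical coefficients chosen so that the direct and indirect paths cancel exactly, smoking and health come out statistically independent even though they are causally connected:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Made-up coefficients chosen so the two causal paths cancel exactly:
#   direct effect   smoking -> health:            -0.6
#   indirect path   smoking -> exercise -> health: 1.0 * 0.6 = +0.6
smoking = rng.normal(size=n)
exercise = 1.0 * smoking + rng.normal(size=n)
health = 0.6 * exercise - 0.6 * smoking + rng.normal(size=n)

# Smoking and health are causally connected (not d-separated),
# yet statistically independent: an unfaithful distribution
print(np.corrcoef(smoking, health)[0, 1])  # ~0
```

A conditional-independence-based discovery algorithm fed this data would wrongly conclude that smoking and health are unrelated, which is exactly the failure mode faithfulness assumes away.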

Causal Sufficiency

Causal Sufficiency states that all confounders of the observed variables have been measured and are included in the data. You should probably read this sentence again: “all confounders are assumed to be observed”. We implicitly made this assumption in the example above by not raising the question of whether there might be an additional unobserved variable acting as a confounder between x1 and x3, which would also lead to a statistical dependency between the two.

Sadly, assuming causal sufficiency is not realistic in most cases, as it is very likely that plenty of unobserved confounders exist. Thus, whether this assumption can be made has to be discussed for every application task. If the conclusion is that it cannot, some inferential power will be lost, but fortunately typically not all of it. Although we must acknowledge that e.g. the detected link between x1 and x3 could result from an unobserved confounder, we can still make valid statements about the non-existence of causal links.
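A short simulation (with made-up coefficients) shows how a hidden confounder violates causal sufficiency: two variables that do not cause each other nevertheless appear dependent in the observed data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

# Hidden confounder u drives both x1 and x3; u is NOT in the dataset
u = rng.normal(size=n)
x1 = 0.8 * u + rng.normal(size=n)
x3 = 0.8 * u + rng.normal(size=n)

# With only x1 and x3 observed, they look clearly dependent although
# neither causes the other; a discovery algorithm assuming causal
# sufficiency would draw a spurious edge between them
print(np.corrcoef(x1, x3)[0, 1])
```

Conditioning on u would remove the dependence, but that option is unavailable precisely because u was never measured.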

Summary

So far we have discussed three main assumptions needed for learning the underlying causal structure of an observed system.

  1. Causal Markov Assumption: A node X is independent of all its non-descendants given the set of all its parents. Consequence: The independencies implied by d-separation of the respective graph hold in the probability distribution of the data.
  2. Faithfulness Assumption: The causal graph represents exactly the distributional independence relations implied by d-separation. Consequence: Any independence relations in the data are caused by the underlying structure of the graph that generated them rather than by some random coincidence, which narrows down the scope of possible causal graphs.
  3. Causal Sufficiency: All confounders of the relevant variables are observed in the given dataset. Consequence: edges in the DAG imply causal relationships.

As this article shows, the assumptions you make largely dictate what kind of inferential statements you can make.

The Causal Markov Condition, the Faithfulness assumption and the Sufficiency assumption enable us to learn causal structures from data using conditional independence tests. To see these assumptions in action, have a look at our article about causal discovery with the PC algorithm (in a later blog post).

Thanks!

Also: For more literature on the topic, have a look at Scheines, R. (1997), “An Introduction to Causal Inference”, from which we took some of the graphics.

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist at BMW and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jonas Wahl is a postdoctoral researcher within the research group Climate Informatics at TU Berlin. He obtained his PhD in mathematics at KU Leuven (Belgium) and has worked at the Hausdorff Centre for Mathematics in Bonn before joining Jakob’s group in TU Berlin. His research focuses on causal inference for high-dimensional spatiotemporal data. You can read more about Jonas on his personal website https://jonaswahl.com.

Jakob Runge heads the Causal Inference group at German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com

Another great source to get a first and easy introduction to Causal Inference:

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (draft).
