Introducing Conditional Independence and Causal Discovery

A Gentle Guide to Causal Inference with Machine Learning Pt. 6

Jakob Runge
Causality in Data Science
8 min read · Jun 13, 2023


“Why?” has been a central question of humanity and the birthplace of science. The human will to make sense of our surroundings was its fuel; experiments and theories formulated in mathematics have been its motor since Galileo Galilei developed the scientific method at the end of the 16th century. But what happens if experiments are simply impossible? What happens if all we have is observational data? Are we doomed to ground our models and worldview in correlations instead of the underlying mechanisms at play?

Luckily, we are not! The goal of causal inference is exactly that: to learn causal relationships from observational data alone.

You should be familiar with the following concepts from the previous articles before you keep on reading:

  1. The Causal Markov, Faithfulness, and Causal Sufficiency assumptions
  2. Fundamental idea behind causal inference
  3. Causal Graphs & D-separation

To answer the causal questions that humans ask all the time, scientists have come up with two approaches:

  1. Performing experiments or experimental studies: Conducting so-called interventions to assess causal relationships. E.g. giving Group A a medication, Group B a placebo, and Group C no medication to assess the effect of medication on some health outcome. Or designing an experiment in physics.
  2. Performing observational causal inference studies: Using assumptions within a theoretical framework to derive causal statements from a sample of an observational distribution.

Although experimental studies such as Randomized Controlled Trials are the gold standard for identifying causal relationships in a number of fields, such as medicine, running them is often not feasible. Consider the example of climate science. We only have one planet, and performing climate change experiments on it is either impossible or highly unethical. Confronted with these limitations, causal inference gives us a method to identify causal relationships nevertheless. Sounds like magic? It’s not; it’s just math, logic, and reasoning about the assumptions of the underlying system. Let’s understand how this works.

The two tasks of causal inference

In essence, there are two tasks in causal inference, and the difference between them lies in their starting points.

Task 1 uses already existing qualitative knowledge about causal relationships to build a directed acyclic graph which is then used to quantitatively estimate causal effects. Here we refer to this task as causal effect estimation.

Task 2, on the other hand, starts from a blank slate. You have no clue about how your variables could be related and therefore go on an adventurous discovery journey to find out. This task is also called causal discovery.

Causal Discovery and Conditional Independence

By Reichenbach’s common cause principle, we can assume that if X and Y are statistically dependent (e.g., correlated) and the data is unbiased, then

  1. X is either the cause of Y, or
  2. Y is the cause of X, or
  3. there is a common cause Z that causes both X and Y.

Sounds reasonable, right? More generally, this is captured by the Causal Markov Condition, which states that statistical dependencies emerge from causal relations, or equivalently, that d-separation in the graph implies independence in the distribution. The converse of the Markov assumption is the so-called Faithfulness assumption, which states that statistical independencies in the distribution imply d-separation in the graph.
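To make this concrete, here is a minimal simulation sketch of Reichenbach’s principle. The linear model, coefficients, and noise terms are our own illustrative assumptions, not part of the article: two variables X and Y that share a common cause Z are correlated even though neither causes the other, and the dependence vanishes once Z is accounted for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

Z = rng.normal(size=n)              # common cause
X = 0.8 * Z + rng.normal(size=n)    # Z -> X
Y = 0.6 * Z + rng.normal(size=n)    # Z -> Y  (no direct link between X and Y)

# X and Y are correlated even though neither causes the other ...
print(np.corrcoef(X, Y)[0, 1])      # clearly non-zero

# ... but the dependence disappears once the common cause Z is accounted for,
# here checked via the correlation of the residuals after regressing out Z.
rx = X - np.polyval(np.polyfit(Z, X, 1), Z)
ry = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
print(np.corrcoef(rx, ry)[0, 1])    # close to zero
```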

We will now see that if both of these conditions are fulfilled, we get a highly useful direct correspondence between causal connections in the graph and statistical dependencies in the data, which we can use to discover a representation of the underlying causal mechanism.

Now let’s assume we have observational data for three variables and no idea about their true causal relationships. Fortunately, we can do something about that. We use statistical tests for (conditional) independence to check whether there is any kind of relationship between a pair of variables, say X and Y. If no dependence is found, that is, the two are statistically independent, then the Faithfulness condition tells us that there is no causal relationship and we can remove the edge between them from our graph. If they are not independent, our journey of causal discovery has just begun.
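How do we test for (conditional) independence in practice? For continuous, roughly linear-Gaussian data, a common choice is a partial-correlation test with the Fisher z-transform. Below is a minimal sketch of such a test; the function name and implementation details are our own illustrative choices, and mature toolboxes such as tigramite or causal-learn provide more careful implementations and further test types.

```python
import numpy as np
from scipy import stats

def fisher_z_test(x, y, cond=None):
    """p-value for H0: x is independent of y given the variables in `cond`
    (a list of 1D arrays). Assumes roughly linear-Gaussian relationships."""
    n = len(x)
    if not cond:
        r = np.corrcoef(x, y)[0, 1]                  # marginal correlation
        k = 0
    else:
        Z = np.column_stack(cond)                    # shape (n, |cond|)
        D = np.column_stack([np.ones(n), Z])         # regressors incl. intercept
        rx = x - D @ np.linalg.lstsq(D, x, rcond=None)[0]
        ry = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
        r = np.corrcoef(rx, ry)[0, 1]                # partial correlation
        k = Z.shape[1]
    z = 0.5 * np.log((1 + r) / (1 - r))              # Fisher z-transform
    stat = abs(z) * np.sqrt(n - k - 3)
    return 2 * (1 - stats.norm.cdf(stat))            # two-sided p-value
```

With such a test in hand, the basic edge-deletion decision is simply: remove the edge X-Y if the p-value exceeds a chosen significance level alpha.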

Independence-based Causal Discovery with the PC algorithm

As a result of the rigorous development of a mathematical framework around causal inference, several causal discovery algorithms have been developed. The PC algorithm, introduced by Peter Spirtes and Clark Glymour (and named after their first names), was one of the first causal discovery algorithms to be published. It is not only one of the most popular and still widely used algorithms, but also provides an excellent, intuitive entry point into causal discovery. Thus, we choose the PC algorithm as an example to illustrate the process of causal graph identification. The PC algorithm makes one further assumption next to the Causal Markov Condition and Faithfulness: it assumes that there are no unobserved confounders, which is called the Causal Sufficiency assumption. There exist other algorithms that can deal with hidden confounding, but let’s keep it simple for now.

This is how the PC algorithm works:

  1. Make a node for each observed variable.
  2. Start with all nodes being connected to each other.
  3. Eliminate as many edges as possible using conditional independence tests. More specifically, remove the edge X-Y if X is independent of Y given a conditioning set S. This step is a repetitive procedure, starting with the empty set S={} and increasing the size (cardinality) of S by 1 in every iteration. (A simplified code sketch of steps 1-3 follows after this list.)
  4. Establish (causal) directions for each remaining edge using colliders, the assumption that there are no cycles, and any other assumptions you can make use of, such as time order.
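Here is the simplified sketch of the skeleton phase (steps 1-3), reusing the fisher_z_test from the previous snippet. It is meant as an illustration under the stated assumptions, not as a faithful reimplementation of the full PC algorithm, which involves further bookkeeping and optimizations.

```python
from itertools import combinations

def pc_skeleton(data, alpha=0.05):
    """data: dict mapping variable name -> 1D numpy array of samples."""
    names = list(data)
    # Steps 1 & 2: one node per variable, start with a fully connected
    # undirected graph represented as an adjacency dict.
    adj = {v: set(names) - {v} for v in names}
    sepset = {}      # separating sets, needed later for collider orientation
    level = 0        # current size of the conditioning set S
    while any(len(adj[v]) - 1 >= level for v in names):
        for x in names:
            for y in list(adj[x]):
                # Step 3: condition only on subsets of the current neighbours
                # of x (this is the "smart" choice of conditioning sets).
                for S in combinations(sorted(adj[x] - {y}), level):
                    cols = [data[s] for s in S]
                    if fisher_z_test(data[x], data[y], cols) > alpha:
                        adj[x].discard(y)            # remove the edge x - y
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(S)
                        break
        level += 1
    return adj, sepset
```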

Let’s take a look at each step in an example.

Skeleton Identification

  1. We are given a data set with five variables A, B, Z, E and D (see figure below). As described before, we start with the fully connected graph. As a first step towards identifying the potential causal relations, we test for independence conditioning on the empty set. As can be seen in the true causal graph (marked in green) using the rules of d-separation, A and B are independent. Consequently, the edge between A and B is removed.
  2. Next, we test for conditional independence using a conditioning set containing one node, for example, S={Z}. This leads to removing the edges A-E, A-D, E-D, B-D, and B-E. Performing more conditional independence tests conditioning on A or B will not lead to removing any more edges. Thus, we have converged to what we assume to be the actual skeleton of the true causal graph. The PC algorithm chooses the conditioning sets S in a smart way by only selecting subsets of adjacent nodes of the two nodes to be tested.
  3. Having converged to this graphical representation (the skeleton), we now uncover the causal directions. One obvious way to achieve this is to exploit time order: if A happens before Z, the arrow can only go from A to Z. If this is not possible, collider structures can help. A small simulation of this five-variable example, reusing the code sketches from above, is shown below.
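To connect the example to the sketches above, the following snippet simulates data from the graph described here (assumed, based on the description, to be A → Z ← B with Z → E and Z → D; the linear coefficients and noise terms are our own illustrative choices) and runs the pc_skeleton sketch on it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
A = rng.normal(size=n)
B = rng.normal(size=n)
Z = 0.7 * A + 0.7 * B + rng.normal(size=n)   # collider: A -> Z <- B
E = 0.8 * Z + rng.normal(size=n)             # Z -> E
D = 0.8 * Z + rng.normal(size=n)             # Z -> D

data = {"A": A, "B": B, "Z": Z, "E": E, "D": D}
adj, sepset = pc_skeleton(data, alpha=0.01)
print(adj)     # expected skeleton: A-Z, B-Z, Z-E, Z-D
print(sepset)  # e.g. A,B separated by the empty set; A,E separated by {Z}
```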

Collider Orientation and further orientation rules

The algorithm now proceeds by considering all unshielded triples in the skeleton. A triple such as A-Z-B is unshielded if

(i) the edge A-B has been removed in the previous PC skeleton phase, and

(ii) the links A-Z and Z-B are still present.

Now the collider rule says: if Z was not in the conditioning set (also called the separating set) of the statistical test by which the link between A and B was removed in the skeleton discovery phase, then both arrows must point towards Z, i.e., we orient A → Z ← B.

Why? Because if Z were not a collider, the path A-Z-B would not be blocked by the separating set (which does not contain Z), and by d-separation A and B would then have to be dependent given that set, but we did not measure any such dependence. Indeed, the collider A → Z ← B is exactly what we see in the true (green) causal graph.

Now this oriented collider structure helps us to determine the other orientations. For example, no such collider can be found for A-Z-E or any other remaining triple. What does this tell us? If there were a directed edge from E to Z, then Z would be a collider on the unshielded triple A-Z-E, so A and E would have to be independent given the empty set S={} and dependent given Z. But in the skeleton phase we found the opposite: A and E only became independent once we conditioned on Z. So the arrow must go from Z to E, because this orientation explains the observed (in)dependencies. The same holds for the edge from Z to D. This way, all remaining edges can be directed, and we have discovered the true causal graph on the right. Mission accomplished!
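As a final sketch, the collider rule itself can be written in a few lines of code on top of the adj and sepset objects returned by the pc_skeleton sketch above; the further orientation rules discussed in the previous paragraph are omitted for brevity.

```python
def orient_colliders(adj, sepset):
    """Return the set of directed edges (tail, head) found by the collider rule."""
    directed = set()
    for z in adj:
        neighbours = sorted(adj[z])
        # All unshielded triples a - z - b: a and b linked to z but not to each other.
        for i, a in enumerate(neighbours):
            for b in neighbours[i + 1:]:
                if b in adj[a]:
                    continue                  # shielded: a and b are adjacent
                # Collider rule: if z was NOT in the separating set of (a, b),
                # both arrows must point towards z.
                if z not in sepset.get(frozenset((a, b)), set()):
                    directed.add((a, z))
                    directed.add((b, z))
    return directed

print(orient_colliders(adj, sepset))   # expected: {('A', 'Z'), ('B', 'Z')}
```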

The example is from: https://www.bradyneal.com/Introduction_to_Causal_Inference-Dec17_2020-Neal.pdf

When does this work?

It is important to note that only certain mathematical assumptions render these causal dependencies identifiable. Here we used the Causal Markov, Faithfulness and Causal Sufficiency assumptions to identify the causal structure. If they are not met, some of the inferential power is lost. But with new problems, new solutions and methods are developed, such as the FCI algorithm, a version of the PC algorithm that does not assume Causal Sufficiency.

Consequently, causal inference is an ongoing process of scrutinizing the plausibility of the assumptions and selecting or developing the appropriate method for the given task.

For more details on the most fundamental assumptions of causal discovery, have a look at the previous blog posts linked at the beginning.

So far so good for an introduction. More recent algorithms follow in the next articles. Thanks for reading!

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist at BMW and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jakob Runge heads the Causal Inference group at German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com
