Confounding, Colliding, D-Separation, and sleeping with shoes on

A Gentle Guide to Causal Inference with Machine Learning Pt. 4

Kenneth Styppa
Causality in Data Science
9 min read · Apr 6, 2023


Colliding, confounding, d-separation, controlling for latent variables, Simpson’s Paradox, blocking paths, and so on. You have probably encountered all of these terms already. This article brings some order into the terminological chaos by explaining the basic concepts needed to take the next step towards understanding the methods of causal inference.

Simpson’s Paradox and Confounders

Given that you’ve read our last blog post and understand the basic idea behind causal graphs and causal inference, we are ready for a well-known example.

Remember the last time you slept with your shoes on? If you do, you probably had a headache afterwards. But did the shoes cause the headache? Probably not. More likely it was drinking with your friends the night before, which also caused you to fall asleep with your shoes on in the first place. How could we depict this system of causal relations in a graph? From a previous blog post we know that, with three random variables in our dataset, we can draw a graph with three nodes, each representing one variable. This means the nodes in our graph look like this:

Which nodes are connected? The drinking variable causes both other variables, so we draw a connection from its node to each of the other two. The result is the following skeleton:

In a final step we orient the edges of the graph along the causal direction, leading to a directed acyclic graph or in other words: a causal graph.

You might now wonder why we need these causal graphs. This is a legitimate question to which there are many valid answers. Besides being an easy and intuitive visualization of the underlying system, the most important reason is that these graphs are able to reflect all kinds of interventions that we might want to analyze while also forcing us to think about all the variables and directions in the system, making our modeling procedures more careful. Let’s understand this reason with an example — Simpson’s Paradox!

You and all your friends from college figured out that sleeping without shoes on won’t solve your headache problem, so you started a new experiment. You want to compare the effect of two different headache treatments A and B. Without thinking about the whole system and sketching out causal graphs, you would most likely ask your friends which treatment they took in the past and note down how many of them still have the headache. The table below shows the result you get. As you can see, Treatment B seems to be better than Treatment A.

The result looks very different, though, if you use your causal inference knowledge and make yourself aware of other variables that might be part of the system, such as the severity of the headache people are suffering from in the morning. It might be that the severity influences which treatment your friends take, inducing a bias. Thus, you ask your friends how strong their headache was when they first got up, which treatment they took, and whether their headache was cured afterwards. The resulting table looks as follows (note that the counts of cured people are still 273/350 for A and 289/350 for B).

What you see is that although Treatment B has a higher overall cure rate, Treatment A yields better results both for people with strong AND for people with mild symptoms! How is that possible?!

The answer can be found in the causal graph. The treatment you take has a direct causal effect on the headache. But that’s not all. The severity causes both the degree to which the headache can be cured and the treatment your friends select.

As you can see in the table, people with strong headaches are much more likely to pick Treatment A (this could, for example, be a result of the fancy name of the treatment, suggesting it helps in extreme cases). At the same time, they are harder to cure. As a consequence, Treatment A wrongly appears to be the overall worse alternative, although it is the superior one in both severity groups. More generally speaking, although there is a causal connection between T(reatment) and O(utcome), the total association you measure if you do not account for this third latent variable will be an entanglement of truly causal and purely associational relations.
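Since the article’s table only states the overall totals (273/350 for A, 289/350 for B), we can sketch the reversal numerically. The per-severity counts below are hypothetical, chosen only so that they sum to the stated totals and reproduce the paradox:

```python
# Hypothetical per-severity counts (cured, total), consistent with the
# overall totals 273/350 for A and 289/350 for B given in the text.
data = {
    "A": {"strong": (192, 263), "mild": (81, 87)},
    "B": {"strong": (55, 80),   "mild": (234, 270)},
}

def rate(cured, total):
    return cured / total

for treatment, groups in data.items():
    cured = sum(c for c, _ in groups.values())
    total = sum(t for _, t in groups.values())
    # B wins overall, yet A wins within every severity group.
    print(treatment, "overall:", round(rate(cured, total), 3),
          {g: round(rate(c, t), 3) for g, (c, t) in groups.items()})
```

Because far more strong-headache (hard-to-cure) cases end up in group A, aggregating over severity flips the comparison.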

Causal graphs will be your savior to overcome this pitfall.

The role of the severity variable actually has a name. We call it a confounder, as it causes both the treatment and the outcome. Confounding is one of the main reasons why we need to draw causal graphs and be so precise when identifying causal relations and causal effects. A node with two outgoing arrows on a path, like C in the figure above, is also called a fork node.
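The shoes-and-headache story from above can be simulated in a few lines to see a fork node at work. All probabilities here are made up purely for illustration; the point is that the spurious shoes-headache association vanishes once we condition on the confounder (drinking):

```python
import random

random.seed(0)

# Hypothetical probabilities: drinking causes both shoes-on and headache.
n = 100_000
rows = []
for _ in range(n):
    drank = random.random() < 0.3
    shoes = random.random() < (0.7 if drank else 0.05)    # drinking -> shoes on
    headache = random.random() < (0.8 if drank else 0.1)  # drinking -> headache
    rows.append((drank, shoes, headache))

def p_headache(rows, shoes, drank=None):
    sel = [h for d, s, h in rows
           if s == shoes and (drank is None or d == drank)]
    return sum(sel) / len(sel)

# Unconditionally, shoes and headache look strongly associated:
print("P(headache | shoes on)  =", round(p_headache(rows, True), 3))
print("P(headache | shoes off) =", round(p_headache(rows, False), 3))
# Conditioning on the fork node removes the association:
print("P(headache | shoes on, drank)  =", round(p_headache(rows, True, True), 3))
print("P(headache | shoes off, drank) =", round(p_headache(rows, False, True), 3))
```

Within each drinking stratum the two rates are (up to sampling noise) equal, because shoes and headache share no cause other than drinking in this toy model.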

Colliders, Mediators & blocking paths

Besides confounding, there are two more very important roles that nodes can play in a causal graph. Both can best be described by thinking of any kind of association as a “flow”. Purely associational relations flow in both directions at the same time, while causal association flows in only one direction. The way nodes influence the flow of association along the edges of a causal graph determines the role they play in the system, as well as the names we give them. We can make this clearer with the following causal graph, for which we outline all association flows step by step.

Thinking of causal association as a directed flow, we can see that node M passes on (or mediates) the flow of association between T and O. This makes it a mediator. Thus, when we want to block the association between T and O, all we have to do is condition on M. In other words, we simply keep M constant and thereby separate the variation in T from the variation in O, similar to building a dam in a river. This gives us the following association visualizations:
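We can imitate “keeping M constant” in a simulation by looking only at samples where M falls in a narrow band. The linear effect sizes below are arbitrary choices for illustration:

```python
import random
import math

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation, stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Chain T -> M -> O with hypothetical linear effects and Gaussian noise.
n = 50_000
T = [random.gauss(0, 1) for _ in range(n)]
M = [2 * t + random.gauss(0, 1) for t in T]
O = [1.5 * m + random.gauss(0, 1) for m in M]

print("corr(T, O) =", round(pearson(T, O), 3))

# "Condition on M": keep only samples where M is (almost) constant.
band = [(t, o) for t, m, o in zip(T, M, O) if abs(m) < 0.1]
tb, ob = zip(*band)
print("corr(T, O | M near 0) =", round(pearson(list(tb), list(ob)), 3))
```

Unconditionally T and O are strongly correlated; within the band the correlation collapses towards zero, which is the dam in the river.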

At node C, two directed edges collide, as both point towards it. Nodes like C are therefore called colliders. Contrary to mediators, colliders block the flow of association by default. What happens when we condition on a collider? Is the path still blocked?

Let’s use a famous example from Judea Pearl to answer that. Imagine that when you’re out there dating men, it seems that most nice men aren’t really handsome, while the good-looking ones are mostly jerks. In other words, you would measure a very high negative correlation between being handsome and being nice. But why?

The reason is that you only date single men. Without advising you to start trying to date men in relationships, this means that you ignored a third variable in this correlation analysis, namely availability. Scientifically speaking, you fell prey to so-called “selection bias”. In our toy example, the availability of men (whether they are in a relationship or not) is determined by them being handsome and nice. The available men, meaning the “remaining” ones, are either not very handsome, not very kind, or neither. In other words, instead of considering the whole distribution of men (which would include the ones in a relationship), you only consider the distribution where the variable “availability” has been conditioned on “single”.

Taking a step back from this example, we see that conditioning on a collider node frees up the flow of association between the two variables connected through the collider (like the strong correlation between being handsome and nice). But be cautious! This association is purely associational; it is NOT causal, as we have just seen in the handsomeness-kindness example.
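The dating example can be reproduced as a small simulation. In this sketch (all probabilities are invented for illustration), “handsome” and “nice” are independent coin flips, and being in a relationship is the collider that depends on both:

```python
import random

random.seed(2)

# Hypothetical model: handsome and nice are independent; being taken
# (the collider) is far more likely when a man is both handsome and nice.
n = 100_000
everyone, singles = [], []
for _ in range(n):
    handsome = random.random() < 0.5
    nice = random.random() < 0.5
    p_taken = 0.9 if (handsome and nice) else 0.2
    if random.random() >= p_taken:
        singles.append((handsome, nice))
    everyone.append((handsome, nice))

def p_nice_given_handsome(pairs, handsome):
    sel = [nc for h, nc in pairs if h == handsome]
    return sum(sel) / len(sel)

# In the full population, handsomeness tells you nothing about niceness:
print(round(p_nice_given_handsome(everyone, True), 3),
      round(p_nice_given_handsome(everyone, False), 3))
# Among singles (conditioning on the collider), a strong negative
# association appears out of nowhere:
print(round(p_nice_given_handsome(singles, True), 3),
      round(p_nice_given_handsome(singles, False), 3))
```

The handsome-and-nice men are mostly filtered out of the singles pool, so within that pool the two traits look strongly anti-correlated even though neither causes the other.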

Taking these insights together, the behavior of a collider can be visualized with the following figures:

Left: collider without being conditioned on; Right: conditioned collider

D-Separation

Given all these different types of nodes and how they channel or block association, you might wonder why we care about that at all. The answer is rather simple. We need to draw our causal graphs in order to understand which association is causal and which is purely associational. In other words, we need causal graphs to disentangle causality from correlation. In doing so, we make use of the characteristics of the observed nodes to perform this disentanglement via conditioning (e.g. on confounders).

Strongly related to the topic of blocking paths is the concept of d-separation. By definition, two nodes are d-separated if all paths along which association could be channeled are blocked by a (possibly empty) conditioning set Z. Or more formally:

Two (sets of) nodes X and Y are d-separated by a set of nodes Z if all of the paths between (any node in) X and (any node in) Y are blocked by Z.

This is the case if, given Z, at least one of the following holds for every path:

  1. There is a fork node or a mediator along the path that is conditioned on (i.e., contained in Z).
  2. There is a collider along the path such that neither it nor any of its descendants is conditioned on.
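These two rules are mechanical enough to turn into code. Here is a minimal sketch of a path-blocking check for a single path, where the path is given as its nodes plus the direction of each edge along it (this is a toy illustration, not a full d-separation algorithm over a graph):

```python
def is_blocked(nodes, dirs, conditioned, descendants=None):
    """Return True if the path is blocked by the conditioning set.

    nodes: node names along the path, e.g. ['T', 'M', 'O'].
    dirs:  edge directions along the path: '->' forward, '<-' backward.
    conditioned: the conditioning set Z.
    descendants: optional map from a collider to its descendant set.
    """
    descendants = descendants or {}
    for i in range(1, len(nodes) - 1):
        left, right, node = dirs[i - 1], dirs[i], nodes[i]
        if left == '->' and right == '<-':          # collider
            in_z = node in conditioned or any(
                d in conditioned for d in descendants.get(node, ()))
            if not in_z:
                return True                          # rule 2
        else:                                        # chain (mediator) or fork
            if node in conditioned:
                return True                          # rule 1
    return False

# Mediator path T -> M -> O: open by default, blocked by conditioning on M.
print(is_blocked(['T', 'M', 'O'], ['->', '->'], set()))   # False
print(is_blocked(['T', 'M', 'O'], ['->', '->'], {'M'}))   # True

# Collider path T -> C <- O: blocked by default, opened by conditioning on C.
print(is_blocked(['T', 'C', 'O'], ['->', '<-'], set()))   # True
print(is_blocked(['T', 'C', 'O'], ['->', '<-'], {'C'}))   # False

# Fork path T <- S -> O (S a confounder): blocked by conditioning on S.
print(is_blocked(['T', 'S', 'O'], ['<-', '->'], {'S'}))   # True
```

Note how the two roles behave in opposite ways: conditioning closes chains and forks but opens colliders, including via a collider’s descendants.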

We need this formal understanding of d-separation, as it allows us to formulate one of the most important assumptions in causal inference, the Causal Markov Assumption. This assumption makes it possible to infer independence in the distribution from separation in the graph, and consequently delivers an anchor point connecting the world of causality and graphs to the world of statistics and distributions. More details on this and other assumptions will follow in the next post.

Thanks for reading!

About the authors:

Kenneth Styppa is part of the Causal Inference group at the German Aerospace Center’s Institute of Data Science. He has a background in Information Systems and Entrepreneurship from UC Berkeley and Zeppelin University, where he has engaged in both startup and research projects related to Machine Learning. Besides working together with Jakob, Kenneth worked as a data scientist at BMW and currently pursues his graduate degree in Applied Mathematics and Computer Science at Heidelberg University. More on: https://www.linkedin.com/in/kenneth-styppa-546779159/

Jakob Runge heads the Causal Inference group at German Aerospace Center’s Institute of Data Science in Jena and is chair of computer science at TU Berlin. The Causal Inference group develops causal inference theory, methods, and accessible tools for applications in Earth system sciences and many other domains. Jakob holds a physics PhD from Humboldt University Berlin and started his journey in causal inference at the Potsdam Institute for Climate Impact Research. The group’s methods are distributed open-source on https://github.com/jakobrunge/tigramite.git. More about the group on www.climateinformaticslab.com

A further great source to get a first overview on Causal Inference:

Neal, B. (2020). Introduction to causal inference from a machine learning perspective. Course Lecture Notes (draft).
