The Ladder of Whys

On Causal Inference

Nicholas Teague
From the Diaries of John Henry

--

Ok so as a quick confession, I’ve been meaning to write an essay on causal inference for quite some time now. I’ve read a few books (like Peters, Janzing, and Schölkopf’s Elements of Causal Inference), read a few papers (like Judea Pearl’s The Seven Tools of Causal Inference), read some blog posts (like Bruno Gonçalves’s DFS blog), and heck, even did a stint in an entertaining altdeep.ai workshop on related material taught by Robert Ness. I think what held me back was a real struggle to find a proper framing. This subject is somewhat complex, so a comprehensive overview is probably a bit out of reach for this kind of venue. That leaves the question of how high we can climb up the ladder of abstractions while still keeping grounded in the foundations of mathematics: the proverbial feet on the ground and head in the clouds, so to speak. I finally settled on a form I’ve grown comfortable with in these pages, sort of a multi-modal collage probably bearing some inspiration from the likes of Hofstadter’s Gödel, Escher, Bach, which is admittedly a tad unorthodox for the subject matter, so I hope the reader may grant a small amount of creative license; after all, what would life be without poetry? Especially when offered in earnest. Yeah, so without further ado.

Tier 1 — Seeing

Politik — Coldplay

Causal inference is a branch of statistical analysis that extends the capacity of probabilistic reasoning by incorporating elements of hypothesis testing to infer directional variable relationships.

In the language of algebra, equations are symmetric, e.g. y = mx directly implies that x = y/m. The language of causality has a fundamental distinction in that the symmetry requirement is lifted, where x → y may not imply anything about y → x, or maybe only partially. Another way of saying this in the formal vocabulary is that causal relationships between variables may be non-abelian, that is, non-commutative.

To accommodate the complexity of lifted symmetry, visual representations of variable interactions are supported by diagrams depicting directional assumptions, known as directed acyclic graphs, which we’ll sometimes shorten to causal graphs in this essay. As a simple example, let’s say we have two variables A and B that are each individually directed at a variable C, which would give us a causal graph of something like A → C ← B.
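The two-cause graph just described can be sketched in code. Here’s a minimal plain-Python rendering (no graph library assumed), with the adjacency stored as a parent → children mapping:

```python
# The A -> C <- B graph from the text, as a parent -> children mapping.
edges = {"A": ["C"], "B": ["C"], "C": []}

def parents(graph, node):
    """Return the nodes with an edge directed into `node`."""
    return sorted(p for p, children in graph.items() if node in children)

print(parents(edges, "C"))  # ['A', 'B'] -- both A and B point at C
```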

In some cases, interacting variables may share a common upstream input, resulting in a circumstance known as “confounding”. Here’s an example: variable A is a confounder to B and C, since it influences both C directly as well as B, which in turn influences C (A → B → C with A → C). In practice the presence of confounders makes the untangling of relationships a bit more challenging, but we’ll see further below that there are algorithmic methods available to help isolate the effects of interest.
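The confounded graph can also be probed programmatically. Here’s a hypothetical sketch that flags A as a candidate confounder by finding the common ancestors of B and C:

```python
# The confounded graph from the text: A -> B -> C, plus A -> C.
edges = {"A": ["B", "C"], "B": ["C"], "C": []}

def ancestors(graph, node):
    """Collect every node with a directed path into `node`."""
    found = set()
    frontier = [p for p, ch in graph.items() if node in ch]
    while frontier:
        p = frontier.pop()
        if p not in found:
            found.add(p)
            frontier.extend(q for q, ch in graph.items() if p in ch)
    return found

# a common ancestor of both variables is a candidate confounder
confounders = ancestors(edges, "B") & ancestors(edges, "C")
print(confounders)  # {'A'}
```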

In general I believe some of the algorithms that may come into play are better suited to, and thus handle more easily, variables of the categoric type as opposed to continuous numeric sets (I found the literature somewhat lacking on this point). Here categoric refers to variables with a fixed range of distinct potential values, such as a variable A whose values constitute the set {cat, dog, child}. For a given data set we could then derive, from frequency counts, probabilities associated with each entry as P(A=Ai). For example, if our data set had 100 recorded samples and of those samples we had 20 instances of A=cat, we could thus estimate that P(A=cat) = 20/100 = 20%.

These types of frequency inferred probability estimates can even be extended to multi-variable considerations. For example, if this same data set with 100 recorded samples had 60 cases where B = Chinese takeout, and of those 60 cases there were 40 instances where the corresponding value of variable C was Netflix, we could thus estimate that P(C = Netflix | B = Chinese takeout) = 40/60 ≈ 67%, which is shorthand for the probability of us watching some Netflix given that we ordered Chinese takeout.
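These frequency-count estimates are straightforward to compute. Here’s a sketch against a toy dataset of 100 (A, B, C) records arranged to reproduce the counts from the text (the extra values are made up just to fill out the rows):

```python
# Toy dataset: 100 records of (A, B, C), matching the counts above.
records = (
    [("cat", "Chinese takeout", "Netflix")] * 20
    + [("dog", "Chinese takeout", "Netflix")] * 20
    + [("dog", "Chinese takeout", "book")] * 20
    + [("child", "pizza", "Netflix")] * 40
)

# marginal estimate from frequency counts: P(A = cat)
p_cat = sum(1 for a, b, c in records if a == "cat") / len(records)

# conditional estimate: P(C = Netflix | B = Chinese takeout)
takeout = [r for r in records if r[1] == "Chinese takeout"]
p_netflix_given_takeout = (
    sum(1 for a, b, c in takeout if c == "Netflix") / len(takeout)
)

print(p_cat)                              # 0.2
print(round(p_netflix_given_takeout, 2))  # 0.67
```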

Oh and these two types of probability estimates derived from frequencies are sufficient to apply Bayes theorem, in which we could use known information about P(C), P(A), and P(C|A) to infer P(A|C) for instance.
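As a quick worked sketch of that inversion, with hypothetical numbers standing in for the known quantities:

```python
# Bayes theorem: P(A|C) = P(C|A) * P(A) / P(C)
# The three inputs below are hypothetical, chosen for illustration.
p_a = 0.2           # P(A = cat)
p_c = 0.5           # P(C = Netflix)
p_c_given_a = 0.75  # P(C = Netflix | A = cat)

p_a_given_c = p_c_given_a * p_a / p_c
print(round(p_a_given_c, 2))  # 0.3
```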

The theory of causal inference owes a lot to the work of the researcher Judea Pearl, a Turing Award winner and pioneer of the field. His published papers and books are pretty much considered canon, and as a recommendation, if you are interested in a deep dive the book Causal Inference in Statistics: A Primer is a great resource, or for a treatment intended for a more general audience The Book of Why is the way to go. In fact it was one of the frameworks presented in The Book of Why that really served as the inspiration for this essay, what Pearl refers to as the Ladder of Causation.

The Ladder of Causation is a way to think about the tiers of reasoning that can be achieved using the tools of causal inference. The ladder has three rungs, each tier building on the one preceding it to reach increasing heights of abstraction, which can be abbreviated as

  1. Association
  2. Intervention
  3. Counterfactuals

Association, the first tier, can be thought of as the form taken by modern paradigms of machine learning. In fact the first tier doesn’t really need directed acyclic graphs to represent variable interactions; in this convention we are merely observing statistical relationships between variables, such as could be evaluated probabilistically by frequency counts, or, when tied to a specific label feature, by trained models (like neural networks). Because variable relationships are assumed symmetric, Bayes theorem holds and conditional probabilities can be directly translated to an inversion. It is only after climbing to the second rung of the ladder that we may start to explore the implications of a one-sided relationship.

Tier 2 — Doing

Clocks — Coldplay

Once we take that climb to the second rung we’ll find that in order to carry out our intervention experiments we’ll need some scaffolding, which is where the directed acyclic graphs come into play. To be clear, there is nothing automatic about our causal graph connections; they are intended to capture assumptions about the data. For example, it may be obvious that for a variable depicting whether a driver got a speeding ticket, a likely causal parent node could be a variable for whether the driver was going a little too fast, so we may know to build that connection into our graph. In other cases the presence of inter-variable connections, their directions, and their order may be a bit more subtle, necessitating an iterative approach of modeling connection assumptions and evaluating the resulting fit.

Determining these graphical connections in some cases may be guided by running experiments on the data. For example, let’s say we had a variable for what song was playing on the radio and another variable for what kind of mood you were in. We may find that intervening on what song is playing may improve your mood, while intervening on your mood may not impact what song is playing. We can thus infer that song → mood is the direction of the causal structure.
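The song → mood asymmetry can be illustrated with a hypothetical simulation, where mood is generated from the song: forcing the song shifts the mood distribution, while forcing the mood leaves the song distribution untouched. The probabilities below are invented for illustration.

```python
import random

random.seed(0)  # deterministic for reproducibility

def sample_song():
    # the song mechanism never reads the mood
    return random.choice(["upbeat", "sad"])

def sample_mood(song):
    # mood depends causally on the song
    p_happy = 0.9 if song == "upbeat" else 0.2
    return "happy" if random.random() < p_happy else "gloomy"

# intervene on the song: do(song = "upbeat") shifts the mood distribution
moods = [sample_mood("upbeat") for _ in range(10000)]
print(moods.count("happy") / len(moods))  # ~0.9

# intervene on the mood: do(mood = "happy") -- since the song mechanism
# never reads the mood, the song distribution is unchanged (~0.5 upbeat)
songs = [sample_song() for _ in range(10000)]
print(songs.count("upbeat") / len(songs))  # ~0.5
```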

In fact one way to think about what is taking place in causal inference is that we are using the recorded samples in a data set to run observational experiments, and I’ll try to elaborate on what that means.

In traditional scientific settings, experiments can be conducted formally as randomized controlled trials, or in an online setting perhaps a little less rigorously as A/B tests, where you have a hypothesis and test it by making some intervention on a system while abstaining from that intervention on a control group for comparison. This is kind of like when you’re interviewing a candidate for a position and asking them lots of questions to acquire new information.

In a causal inference analysis, on the other hand, our observational experiments are conducted very differently: instead of trying to acquire new data, we reduce our focus to subsets of the existing data conditional on some desired property. For example, if we want to evaluate the impact of a variable A describing how a candidate might get along with your parents, you could set aside all of the data points where he’s trying to flirt or something and just base the analysis on cases where he’s acting seriously, which under the do-calculus formulation could be described as P(A|do(serious)). Thus interventions in causal inference are conducted by way of subtractions as opposed to acquiring new data. Once we have subtracted any observation sets based on our conditional framing, we can then recalculate the associated probabilities based on the resulting adjusted frequency counts.
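Here’s a minimal sketch of that intervention-by-subtraction idea, with hypothetical variable names and counts: restrict the dataset to the conditioning of interest, then re-derive frequencies from what remains.

```python
# Hypothetical samples: a mode of behavior plus an outcome of interest.
samples = (
    [{"mode": "serious", "gets_along": True}] * 30
    + [{"mode": "serious", "gets_along": False}] * 10
    + [{"mode": "flirting", "gets_along": True}] * 60
)

# "subtract" everything outside the conditioning: keep mode == "serious"
serious = [s for s in samples if s["mode"] == "serious"]

# recalculate the probability from the adjusted frequency counts
p = sum(s["gets_along"] for s in serious) / len(serious)
print(p)  # 0.75
```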

The benefit of these do-calculus experiments is that we can then perform a kind of reconfiguration of our directed acyclic graphs, eliminating those edges that may interfere with the two variables of focus. For example, when you condition a variable to a certain configuration, you’ve eliminated the potential for influence of variables further upstream, which is how we can accommodate confounding variables as noted above. When faced with increasingly sophisticated graphs, such as with multiple confounding variables, there are guidelines available to select which of the paths to sever with our do-calculus conditioning. In the interest of brevity I won’t go into a lot of detail other than to offer keywords for reference: the backdoor criterion, or for cases where the backdoor criterion is insufficient, an alternative guideline known as the front-door criterion.
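For the confounded graph from earlier (A → B → C with A → C), the backdoor adjustment amounts to averaging over the confounder’s marginal distribution: P(C|do(B=b)) = Σ_a P(C|b, a) · P(a). A sketch with hypothetical probability tables:

```python
# Hypothetical tables for the confounded graph A -> B -> C, A -> C.
p_a = {"young": 0.4, "old": 0.6}  # P(A), the confounder's marginal
p_c_given_ba = {                  # P(C = 1 | B, A)
    ("treated", "young"): 0.8, ("treated", "old"): 0.5,
    ("untreated", "young"): 0.6, ("untreated", "old"): 0.3,
}

def p_c_do_b(b):
    # backdoor adjustment: weight by the confounder's marginal P(A),
    # not by its conditional distribution given B
    return sum(p_c_given_ba[(b, a)] * p_a[a] for a in p_a)

print(round(p_c_do_b("treated"), 2))    # 0.62
print(round(p_c_do_b("untreated"), 2))  # 0.42
```

The difference, 0.62 − 0.42, is the causal effect estimate with the confounder’s influence averaged out.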

The whole point of these manipulations on graph configurations by conditional framings is that it gives us the ability to run intervention experiments on existing data without the need to acquire new samples. In other words, we can run experiments without running experiments. Do you want to know what will happen if you make the first move? The answer is already there in the data.

Tier 3 — Imagining

Saint Cecilia — Simon Vouet
Green Eyes — Coldplay

The third rung of our ladder is where we reach the highest form of reasoning, known as counterfactuals, which in some ways goes back to the teachings of philosophers like David Hume and John Stuart Mill. To be honest, the details of how we can implement this I am still a little fuzzy on, so I may get a little abstract in this section. A counterfactual is like using your imagination to consider an alternate course of events. To give an example, if we have two variables connected to each other in a causal graph, and we know the current state of the two based on where they are now, can you imagine what our lives would be like now if we had met five years ago?

This gets to the heart of what is being attempted in causal inference. As we develop these world models upon which we can run counterfactual inquiries and ask questions like “What if?”, it opens the door to all kinds of approaches to addressing the fundamental limitations of modern paradigms of machine learning. Our models gain the ability to adapt to a changing world. We begin to understand the rationales for decisions; the models are no longer just black boxes generating predictions. They begin to be able to articulate cause and effect relationships, which may yet open the door to higher forms of cognition that may be needed for artificial general intelligence. All of this and more is possible on the top rung of our ladder.

Of course some tempering is needed for our expectations of what can be achieved on the third rung. There are no guarantees that a causal query will succeed in achieving a desired answer. It could be that our causal graph assumptions are wishful thinking, or it could be that the data just does not hold the answer we are looking for. In such a case of failure it is our duty as data scientists to pick ourselves up and revisit our assumptions.

I’ll offer in closing a helpful illustration that may demonstrate the need for causal inference, as counterintuitive subtleties in the data may not be visible without intentional inspection. Simpson’s Paradox is a well-known thought experiment in which a value comparison between two alternatives may give unexpected results after appropriately weighting the data. Consider the playful scenario where a beautiful, smart, amazing woman has found herself with two competing invitations to go on a first date from two suitors. Being the good-natured soul that she is, she has decided that she will only accept one of the invitations, but to decide she will need to evaluate which of them she considers more appealing based on what she knows about their respective communication styles and talent at dancing.

The paradox can be found in how, even though Suitor A is both a better communicator and a better dancer, she is surprised to discover that she finds Suitor B much more appealing. After all, she only has a handful of sound bites to go on with Suitor A, while Suitor B has been working on a collection of essays for several years, and yes of course she hasn’t read them all (that could take months), but the few that she tried she kind of liked (at least the non-boring ones anyways). And yes, while Suitor A is a slightly better dancer owing to all of his weekends spent getting drunk and club hopping, Suitor B can at least carry a semblance of rhythm when not distracted by her beauty; after all, most of his dancing experience is from attending weddings. The whole point of the illustration is that after properly weighting the appeal metrics she is pleased to find that Suitor B is clearly the right match for her.
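Here’s a hypothetical numeric rendering of the flip: Suitor A has the higher appeal rate inside both categories, yet once the observations are pooled Suitor B comes out ahead. All counts are invented for illustration.

```python
# (suitor, category): (appealing interactions, total interactions)
data = {
    ("A", "communication"): (19, 20),  # 95% -- a handful of sound bites
    ("A", "dancing"):       (40, 80),  # 50% -- lots of club hopping
    ("B", "communication"): (72, 80),  # 90% -- years of essays
    ("B", "dancing"):       (8, 20),   # 40% -- the odd wedding
}

def rate(suitor, category):
    hits, total = data[(suitor, category)]
    return hits / total

def pooled_rate(suitor):
    hits = sum(h for (who, _), (h, t) in data.items() if who == suitor)
    total = sum(t for (who, _), (h, t) in data.items() if who == suitor)
    return hits / total

# Suitor A wins inside every category...
print(rate("A", "communication"), rate("B", "communication"))  # 0.95 0.9
print(rate("A", "dancing"), rate("B", "dancing"))              # 0.5 0.4
# ...but Suitor B wins once the categories are pooled
print(pooled_rate("A"), pooled_rate("B"))                      # 0.59 0.8
```

The flip happens because each suitor’s interactions are concentrated in different categories, which is exactly the kind of weighting subtlety that causal reasoning forces us to surface.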

Books that were referenced here or otherwise inspired this essay:

The Book of Why — Judea Pearl


Albums that were referenced here or otherwise inspired this essay:

A Rush of Blood to the Head — Coldplay


As an Amazon Associate I earn from qualifying purchases.

For further readings please check out the Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

--



Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.