Causal Inference — Part I
How do causes lead to effects? Can we identify the cause behind an observed effect? Big Data opens the door for us to answer questions such as these, but before we can do so, we must dive into the field of Causal Inference, a field championed by Judea Pearl.
In this series of blog posts we will learn about the main ideas of Causality by working our way through “Causal Inference in Statistics: A Primer”, a nice book co-authored by Pearl himself.

The book is divided into four chapters. The first chapter covers background material in probability and statistics. The other three chapters are (roughly) organized to match the three rungs of the “Ladder of Causation” as defined by Pearl:
1 — Association
2 — Intervention
3 — Counterfactuals
We will cover most of the content of the book, with special emphasis on the parts I believe are most interesting or most relevant to practical applications. In addition to summarizing and explaining the content, we will also explore some of the ideas using simple (or as simple as possible) Python code in the companion GitHub Repository:
While I will do my best to introduce the content in a clear and accessible way, I highly recommend that you get the book yourself and follow along. So, without further ado, let’s get started!
1.2 — Simpson’s Paradox
We start our study of causal analysis by looking at Simpson’s Paradox. The paradox is named after Edward Simpson, a British World War II code-breaker who served at Bletchley Park alongside Alan Turing and other greats.
Simpson’s Paradox can be summarized as “aggregated data can appear to reverse important trends in the numbers being combined” (WSJ, 2009). In other words, a relationship that seems clear and unequivocal across the entire dataset can be reversed, just as clearly and unequivocally, when you perform the same analysis on subsets of the data.
As you can easily guess, this type of paradox will only become more relevant in the current age of Big Data and Big Compute, where one can easily perform large-scale analyses of the massive datasets that are just now becoming available. Indeed, Simpson’s Paradox has increasingly become the focus of academic research on how to detect it.
Pearl illustrates this paradox with data on the performance of a blood pressure medication on a set of 700 patients, 350 of whom received the treatment and 350 of whom did not.

A simple analysis of this dataset results in:
“In male patients, drug takers had a better recovery rate than those who went without the drug (93% vs 87%). In female patients, again, those who took the drug had a better recovery rate than non-takers (73% vs 69%). However, in the combined population, those who did not take the drug had a better recovery rate than those who did (83% vs 78%)”
which might lead one to the absurd conclusion that the drug helps men and helps women, but harms patients whose gender is unknown! This clearly nonsensical result is easy to spot here thanks to the simplicity of the example and our familiarity with concepts such as gender and how medication works. In other examples it can be much harder to see.
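To make the numbers concrete, here is a minimal sketch that reproduces the paradox in plain Python. The per-group counts below are not listed in the text above; they are taken from Table 1.1 of the Primer and are consistent with the totals and percentages quoted:

```python
# Recovery counts (recovered, total) per gender/treatment group,
# as given in Table 1.1 of the Primer
data = {
    ("men", "drug"): (81, 87),
    ("men", "no drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no drug"): (55, 80),
}

def rate(recovered, total):
    """Recovery rate as a percentage."""
    return 100 * recovered / total

# Within each gender, taking the drug looks beneficial
for gender in ("men", "women"):
    for treatment in ("drug", "no drug"):
        r, n = data[(gender, treatment)]
        print(f"{gender:>8} / {treatment:>7}: {rate(r, n):.0f}% ({r}/{n})")

# Aggregating over gender reverses the trend
for treatment in ("drug", "no drug"):
    r = sum(data[(g, treatment)][0] for g in ("men", "women"))
    n = sum(data[(g, treatment)][1] for g in ("men", "women"))
    print(f"combined / {treatment:>7}: {rate(r, n):.0f}% ({r}/{n})")
```

Running this reproduces the percentages in the quote: 93% vs 87% for men, 73% vs 69% for women, and a reversed 78% vs 83% in the combined population.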
The underlying reasons for this paradox are twofold:
— Significantly more women than men are given the medication (263 vs 87)
— Estrogen has a negative effect on the efficacy of the medication, so women who take the drug are less likely to recover.
As a result, when you take the average across all patients, men are weighted more heavily in the “No Drug” column and women more heavily in the “Drug” column.
In general, Simpson’s Paradox is likely to appear whenever you have confounding factors that are unevenly distributed across the groups being compared.
Iris dataset
To show how common this paradox is, I would like to illustrate it with a dataset you’re more likely to have seen before, the Iris dataset:

The dataset contains 150 observations, evenly split across 3 species (Iris Setosa, Iris Versicolor, and Iris Virginica), each described by 4 features: Petal and Sepal Width and Length, as illustrated in the figure above.
Let’s say that we are interested in finding a relationship between Petal Width and Sepal Width. If we perform a simple linear regression on the entire dataset, we see the negative-slope fit illustrated on the left-hand side of the figure below.

On the other hand, performing the same fit on a species-by-species basis, we find three different positive-slope fits, as seen on the right-hand side.
Here it is also easy to see what the confounding factor is: the species. In general, Petal and Sepal Width are positively correlated, but Iris Setosa flowers have a generally larger Sepal Width and a smaller Petal Width, and that is enough to skew the population-level fit.
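Here is a minimal sketch of both fits, assuming the copy of the dataset bundled with scikit-learn and plain least-squares fits via NumPy (the full, plotted version lives in the companion repository):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
# Feature columns are: sepal length, sepal width, petal length, petal width
sepal_width = iris.data[:, 1]
petal_width = iris.data[:, 3]

# Population-level fit: sepal width as a linear function of petal width
slope, _ = np.polyfit(petal_width, sepal_width, 1)
print(f"all species: slope = {slope:+.2f}")  # comes out negative

# Species-by-species fits: each slope comes out positive
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    slope, _ = np.polyfit(petal_width[mask], sepal_width[mask], 1)
    print(f"{name:>11}: slope = {slope:+.2f}")
```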
On the other hand, it’s also easy to see that if the population distribution were different (many more or many fewer Iris Setosa flowers), the effect might be stronger or even completely absent:

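As a quick check of this idea, we can continue from the snippet above and simply refit the population-level regression without any Iris Setosa observations; with the outlying species removed, the aggregate slope should no longer come out negative:

```python
# Drop all Iris Setosa observations (label 0) and refit the
# population-level regression on the remaining two species
keep = iris.target != 0
slope, _ = np.polyfit(petal_width[keep], sepal_width[keep], 1)
print(f"without setosa: slope = {slope:+.2f}")
```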
Now that you are more familiar with Simpson’s Paradox, you’ll likely start noticing it everywhere and know what to look for in order to properly deal with it.
Pearl concludes the medicine example by saying:
“In order to decide whether the drug will harm or help a patient, we first have to understand the story behind the data, the causal mechanisms that led to, or generated, the data”
Finally, I hope you enjoyed the first installment of this series and the quick dive into Simpson’s Paradox.
Just a quick reminder that you can find the code for the Iris example above in our GitHub repository:
And if you would like to be notified when the next post comes out, you can subscribe to The Sunday Briefing newsletter: https://data4sci.com/newsletter