“Simpson’s Paradox” — When You Derive A Wrong Insight From Your Analysis

From selection bias to causal diagram, with tips on how to avoid biases.

Published in Analytics Vidhya, Jan 10, 2020

An analysis can be biased for various reasons. In this post, we will start by looking at an example of selection bias to understand the impact of bias.

We will also compare two examples that have the same data but lead to different conclusions, and learn what Simpson’s paradox is.

Finally, we will discuss the backdoor criterion to learn how to avoid bias.

1. Introduction

1.1. Selection Bias

1.2. Question #1

1.3. Question #2

1.4. What was different between Question #1 and Question #2?

1.5. Simpson’s Paradox

2. Causal Structure and Causal Diagram

3. Use Causal Structure Properly in Data Analysis

4. Backdoor Criterion in Causal Diagram

5. Conclusion

Reference

1. Introduction

1.1 — Selection Bias

Take a look at the following scatter plot. It is comparing math scores and English scores for some population.

Math Score vs. English Score for a population

We see a strong negative (linear) correlation between math score and English score. Then, can we say “the students having higher math skills tend to have lower English literacy and vice versa”?

Scatter plot of math score vs. English score from true universe

Actually, the data was made up through the following steps:

Generate two random numbers independently from a normal distribution with mean=50 and SD=20, clipped to the range 0 to 100. One is the math score and the other is the English score.
Select the sub-population with math score + English score > 120, as if these were the students admitted on the total of their math and English scores.

Therefore, the true insight is “the students’ math ability and English ability are independently distributed”. The negative trend appeared only because we looked at the sub-population above the gray dashed line, and that selection boundary is diagonal.
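The two generation steps above can be reproduced in a few lines. This is a minimal sketch, not the code behind the original plots: the sample size of 10,000, the seed, and the helper names `pearson` and `clipped_score` are all assumptions made for illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def clipped_score(rng):
    """One score drawn from a normal(mean=50, SD=20), clipped to [0, 100]."""
    return min(100.0, max(0.0, rng.gauss(50, 20)))

rng = random.Random(0)  # arbitrary fixed seed (assumption), for repeatability

# Step 1: independent math and English scores for the whole population.
population = [(clipped_score(rng), clipped_score(rng)) for _ in range(10_000)]

# Step 2: keep only the "admitted" students, math + English > 120.
admitted = [(m, e) for m, e in population if m + e > 120]

corr_all = pearson(*zip(*population))
corr_admitted = pearson(*zip(*admitted))

print(f"whole population: r = {corr_all:+.2f} (n = {len(population)})")
print(f"admitted only:    r = {corr_admitted:+.2f} (n = {len(admitted)})")
```

With these settings the correlation in the whole population should come out close to zero, while the correlation among the admitted students should be clearly negative, even though nothing links the two scores.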

This kind of bias is called selection bias.

This time the bias was easy to detect, since we had the data for the entire population and knew the formula used to select the subset. But what if we only have the selected data, do not realize it was selected, and do not know the selection criteria?

For further discussion, let’s take a look at the next two biased analysis examples.

1.2 — Question #1

Background:

We have observational medical data on male and female patients, recording whether each patient was treated and whether each was finally cured.

A bit of analysis:

Cured ratio within female patients:

if not treated 40% (=2/5) < if treated 44% (=12/27)

Cured ratio within male patients:

if not treated 57% (=4/7) < if treated 62% (=8/13)

Cured ratio within female or male patients:

if not treated 50% (=6/12) = if treated 50% (=20/40)

Question:

Can we say the treatment works or not?

(Let’s forget about ‘statistical significance’ for now and look only at the proportions in the data!)

Answer:

The treatment works differently for each gender, but it is worth trying. We should ignore the ratios in the ‘Total’ row.
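The ratios quoted above can be recomputed directly from the counts. The sketch below only redoes that arithmetic; the dictionary layout and the helper name `cured_ratio` are illustrative choices, not part of the original analysis.

```python
from fractions import Fraction

# (cured, total) counts taken from the Question #1 figures.
counts = {
    ("female", "not treated"): (2, 5),
    ("female", "treated"): (12, 27),
    ("male", "not treated"): (4, 7),
    ("male", "treated"): (8, 13),
}
genders = ("female", "male")
arms = ("not treated", "treated")

def cured_ratio(rows):
    """Cured ratio after pooling a list of (cured, total) pairs."""
    cured = sum(c for c, _ in rows)
    total = sum(t for _, t in rows)
    return Fraction(cured, total)

# Ratios within each gender.
for gender in genders:
    for arm in arms:
        ratio = cured_ratio([counts[gender, arm]])
        print(f"{gender:6} {arm:11} {float(ratio):.0%}")

# Ratios after pooling both genders (the 'Total' row).
pooled = {arm: cured_ratio([counts[g, arm] for g in genders]) for arm in arms}
for arm, ratio in pooled.items():
    print(f"total  {arm:11} {float(ratio):.0%}")
```

The per-gender lines print 40%, 44%, 57% and 62%, and both pooled lines print 50%: the treated group does better inside each gender, yet the advantage vanishes once the genders are pooled.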

1.3 — Question #2

Background:

After a baby played with a card deck, we noted how many of the cards the baby made dirty, by type and color of card.

A bit of analysis:

Red ratio within clean cards:

if court cards 40% (=2/5) < if plain cards 44% (=12/27)

Red ratio within dirty cards:

if court cards 57% (=4/7) < if plain cards 62% (=8/13)

Red ratio within clean or dirty cards:

if court cards 50% (=6/12) = if plain cards 50% (=20/40)

Question:

Is the proportion of court cards associated with color?

(Again, let’s forget about the ‘statistical significance’!)

Answer:

The baby may have preferred red cards to black cards, but the proportion of court cards is the same for both colors. We should ignore the ratios in the separate ‘Clean’ and ‘Dirty’ rows.

1.4 — What was different between Question #1 and Question #2?

Did you notice the numbers in the data table of Question #1 and Question #2 are all the same?

So stratifying by row caused selection bias in #2 but not in #1. We chose the right way to analyze each case, and derived the right insight, from what we know about the subjects. That knowledge is something like:

“it is natural to assume a medical treatment works differently for each gender”, and
“the color and the type of a card are completely independent (while a baby may love to play with a specific color and type).”

But what if we know nothing about the subject?

1.5 — Simpson’s Paradox

All of this is about what is known as “Simpson’s Paradox”. The paradox takes its name from “The interpretation of interaction in contingency tables.” by Simpson, Edward H. in 1951.

Based on “The Simpson’s paradox unraveled” by Hernán, Miguel A., David Clayton, and Niels Keiding in 2011, the lessons of Simpson’s Paradox are:

- Statistics is insufficient for analysis because identical data arising from different causal structures should be analyzed differently.
- For causal inference, a data analysis that ignores subject-matter (causal) knowledge is hopeless.

(From Miguel Hernán’s tweet.)

Let me paraphrase:

When doing data analysis, you have to know the causal structure of the subject and use it properly. Otherwise, you may end up deriving a wrong insight.

Then, what does this mean? What is the causal structure and how can we ‘use it properly’? Let’s see in the next section.

“The Simpson’s Paradox”, referenced from twitter account of RJ Andrews

2. Causal Structure and Causal Diagram

When you hear the words “causal relation”, you may have some concept in mind, but it is actually not easy to define formally.

The “Causal Inference Book” by Miguel Hernán says that ‘A has a causal effect on Y’ when the outcome Y differs between when A is taken and when A is withdrawn. Formally, A and Y are both random variables here. Let’s skip any further discussion of the definition and just accept that it captures what we usually mean by a causal relation. Refer to the “Causal Inference Book” for more details.

A causal diagram is a graph that visually represents causal relations. When A causes Y, the causal diagram has an arrow pointing from A to Y.

3. Use Causal Structure Properly in Data Analysis

Then, how do causal relations and causal diagrams relate to Simpson’s paradox, and what does it mean to “use the causal structure properly”?

Above, I said:

When doing data analysis, you have to know the causal structure of the subject and use it properly. Otherwise, you may end up deriving a wrong insight.

In the two examples above, we wanted to know the causal relation between ‘treatment’ (A: cause) and ‘cure’ (Y: effect), and ‘court/plain card’ (A) and ‘red/black card’ (Y).

To put the conclusion first: when we want to detect whether there is a causal relationship, we have to keep the following two rules in mind:

Rule #1: we MUST stratify by a variable that causes both (A) and (Y) (a common ancestor of (A) and (Y)). Otherwise, the results will be biased.

Rule #2: we MUST NOT stratify by a variable caused by (A) (a descendant of (A)), because doing so introduces selection bias.

(These two rules will be generalized by the ‘backdoor criterion’ later. Do not be confused, since they say the same thing!)

Look at the causal diagrams of the two cases, plus the very first math score and English score example, which we know for sure produces selection bias.

Causal diagram of Question #1, Question #2, and math score & English score example.

The graph of the treatment example (Question #1) falls under Rule #1, therefore we must stratify the data by gender.
The graphs of the playing card example (Question #2) and the math/English score example fall under Rule #2, therefore we must not stratify the data by clean/dirty or by total score.
But what if you did not stratify the Question #1 case by gender simply because the data had no gender information? Your results would still be biased, just like the ‘total’ row in the exhibit. This shows how important background knowledge about the subject is for getting the right insights.
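The two rules can be applied mechanically once a diagram is written down as data. The sketch below is a rough illustration under assumptions: the edge lists are my reading of the three examples as described in the text (the original figure is not reproduced here), and the function names `descendants` and `advice` are hypothetical.

```python
def descendants(dag, node, skip=None):
    """Nodes reachable from `node` along arrows, never passing through `skip`."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child != skip and child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def advice(dag, a, y, candidate):
    """Check Rule #2 first, then Rule #1, for one candidate stratifying variable."""
    if candidate in descendants(dag, a):
        return "Rule #2: do not stratify"
    causes_a = a in descendants(dag, candidate)
    # The candidate must reach Y by some route that does not go through A.
    causes_y = y in descendants(dag, candidate, skip=a)
    if causes_a and causes_y:
        return "Rule #1: stratify"
    return "neither rule applies"

# Each diagram maps a cause to the list of its direct effects (assumed edges).
question_1 = {"gender": ["treatment", "cured"], "treatment": ["cured"]}
question_2 = {"card type": ["cleanness"], "color": ["cleanness"]}
scores = {"math": ["total score"], "English": ["total score"]}

print(advice(question_1, "treatment", "cured", "gender"))
print(advice(question_2, "card type", "color", "cleanness"))
print(advice(scores, "math", "English", "total score"))
```

The first call reports Rule #1 and the other two report Rule #2. The numbers in the tables never enter the function: only the arrows, which come from subject knowledge, decide the answer.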

Now, remember that Question #1 and Question #2 had exactly the same values in the exhibits. This means we cannot identify which one of the two rules we have to follow only from the data, and the only way is to draw the causal structure from our prior knowledge about the subject.

As I have already written twice, to summarize plainly:

When doing data analysis, you have to know the causal structure of the subject and use it properly. Otherwise, you may end up deriving a wrong insight (=the results will be biased.)

4. Backdoor Criterion in Causal Diagram

Rule #1 and Rule #2 are generalized by the concept of ‘backdoor criterion’ proposed by Pearl as follows:

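Excerpt from “Causality: Models, Reasoning, and Inference” by Judea Pearl

In paraphrase, a set of variables Z satisfies the backdoor criterion relative to an ordered pair of variables (X, Y) in a causal diagram if:
(i) no node in Z is a descendant of X; and
(ii) Z blocks every path between X and Y that contains an arrow into X.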

X in this definition is what we called A in the previous section. The point of the backdoor criterion is that once we identify Z through it, the variables in Z are exactly the ones we need to stratify the data by.

In the case of Question #1, with X={treatment} and Y={cured}, suppose we did not include gender in Z, so Z={}. Then the path from treatment to cured through gender, which has an arrow into treatment, is left unblocked, and this violates (ii) in the backdoor criterion. Now consider the other choice, Z={gender}: we can confirm that it satisfies both conditions of the criterion. The data must be stratified by gender.
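Once Z={gender} is chosen, one common way to turn the per-gender ratios back into a single comparison is to average them with the same gender weights for both arms (often called standardization, or the backdoor adjustment formula). The article itself does not carry out this step; the sketch below is an illustration under that assumption, using the Question #1 counts and a hypothetical helper `standardized_cure_rate`.

```python
from fractions import Fraction

# (cured, total) counts from the Question #1 figures.
counts = {
    ("female", "not treated"): (2, 5),
    ("female", "treated"): (12, 27),
    ("male", "not treated"): (4, 7),
    ("male", "treated"): (8, 13),
}
genders = ("female", "male")
arms = ("not treated", "treated")

n_all = sum(total for _, total in counts.values())  # 52 patients overall

def standardized_cure_rate(arm):
    """Sum over genders of P(cured | arm, gender) * P(gender)."""
    rate = Fraction(0)
    for g in genders:
        cured, total = counts[g, arm]
        n_gender = sum(counts[g, a][1] for a in arms)  # patients of this gender
        rate += Fraction(cured, total) * Fraction(n_gender, n_all)
    return rate

for arm in arms:
    print(f"{arm:11} {float(standardized_cure_rate(arm)):.1%}")
```

This prints about 46.6% for the untreated arm and 51.0% for the treated arm. Unlike the pooled 50% versus 50%, the adjusted comparison agrees with what each gender shows separately.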

In the case of Question #2, suppose we included cleanness in Z, so Z={cleanness}. Cleanness is a descendant of X={court card}, so this violates (i) in the backdoor criterion. Now consider the other choice, Z={}: we can confirm that it satisfies both conditions of the criterion. The data must not be stratified by cleanness.
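For small diagrams, the two checks walked through above can be automated by brute force: list every path between X and Y and test conditions (i) and (ii) directly. The following is a minimal sketch under assumptions: the edge lists are my reading of the two examples, all function names are hypothetical, and enumerating every path is only practical for tiny graphs.

```python
def descendants(dag, node):
    """Nodes reachable from `node` by following arrows."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def undirected_paths(dag, start, goal):
    """Yield every simple path from start to goal, ignoring arrow direction."""
    neighbours = {}
    for parent, children in dag.items():
        for child in children:
            neighbours.setdefault(parent, set()).add(child)
            neighbours.setdefault(child, set()).add(parent)

    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])

def is_blocked(dag, path, z):
    """A path is blocked by z at a non-collider in z, or at a collider
    that is outside z and has no descendant in z."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = mid in dag.get(prev, ()) and mid in dag.get(nxt, ())
        if collider:
            if mid not in z and not (descendants(dag, mid) & z):
                return True
        elif mid in z:
            return True
    return False

def satisfies_backdoor(dag, x, y, z):
    z = set(z)
    if z & descendants(dag, x):
        return False  # violates (i): z contains a descendant of x
    for path in undirected_paths(dag, x, y):
        arrow_into_x = x in dag.get(path[1], ())
        if arrow_into_x and not is_blocked(dag, path, z):
            return False  # violates (ii): an open backdoor path remains
    return True

# Each diagram maps a cause to the list of its direct effects (assumed edges).
question_1 = {"gender": ["treatment", "cured"], "treatment": ["cured"]}
question_2 = {"card type": ["cleanness"], "color": ["cleanness"]}

print(satisfies_backdoor(question_1, "treatment", "cured", set()))
print(satisfies_backdoor(question_1, "treatment", "cured", {"gender"}))
print(satisfies_backdoor(question_2, "card type", "color", {"cleanness"}))
print(satisfies_backdoor(question_2, "card type", "color", set()))
```

The four calls print False, True, False and True, in the same order as the four choices of Z discussed above.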

5. Conclusion

Selection bias and Simpson’s paradox can both reverse the insight drawn from a data analysis, and every data analysis has to guard against them.

The first step should always be to “try to understand the subject and its causal structure in the background”. That is bad news for data analysis automation :(

Once we know the causal relations, the following two rules give a quick way to check which variables should be used for stratification when the causal structure is simple enough.

Rule #1: we MUST stratify by a variable that causes both (A) and (Y) (a common ancestor of (A) and (Y)). Otherwise, the results will be biased.

Rule #2: we MUST NOT stratify by a variable caused by (A) (a descendant of (A)), because doing so introduces selection bias.

In more complex cases, we should use the backdoor criterion to determine the stratifying variables.