Understanding Simpson’s Paradox

4 min read · Dec 20, 2021

Figure: simplified visualization of Simpson’s Paradox (credit: Wikipedia)

As someone working with data — whether you’re a Data Scientist, a Data Analyst, or a Statistician — it’s essential to make sure that the conclusions you draw are well-founded. This is where Simpson’s Paradox comes in.

Before diving into the definition of Simpson’s Paradox, let’s consider the following famous example.

In 1973, a report revealed that UC Berkeley’s graduate school had accepted 44% of its male applicants and only 35% of its female applicants. Looking at these numbers alone, gender discrimination seems like a clear-cut conclusion, right?

Not so fast. The school, fearing legal action, commissioned a statistician to investigate. The statistician decided to subdivide applicants based on whether they applied to the natural sciences departments or the social sciences departments. They found that in the natural sciences, 80% of women and 46% of men were admitted, while in the social sciences, 20% of women and 4% of men were admitted.

Table: admissions data (credit: The Graduate Division, University of California, Berkeley)

The statistician determined, based on this dataset, that men were more likely to apply to the natural sciences and women were more likely to apply to the social sciences. Since the social sciences had a much lower acceptance rate than the natural sciences, ignoring the variable department made the combined data point to a misleading conclusion. Once department was accounted for, the statistician found the opposite of what the university feared: within each group of departments, women were admitted at a higher rate than men.

The statistician concluded that both gender and admissions were related to a third variable that had previously been unaccounted for: department. This reversal is an example of Simpson’s Paradox.
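The arithmetic behind this reversal can be sketched in a few lines of Python. The applicant counts below are hypothetical: they are not the real Berkeley figures and were chosen only so that the rates match the percentages quoted above. The function names are likewise just illustrative.

```python
# Hypothetical applicant counts, chosen only so that the resulting rates
# match the percentages quoted in the article (not the real Berkeley data).
# Key: (gender, department); value: (applied, admitted)
applications = {
    ("men", "natural sciences"): (2000, 920),
    ("men", "social sciences"): (100, 4),
    ("women", "natural sciences"): (500, 400),
    ("women", "social sciences"): (1500, 300),
}


def admission_rate(rows):
    """Return admitted / applied over an iterable of (applied, admitted) pairs."""
    rows = list(rows)
    applied = sum(a for a, _ in rows)
    admitted = sum(m for _, m in rows)
    return admitted / applied


def rate_by_department(gender):
    """Admission rate for one gender, computed separately per department."""
    return {
        dept: admission_rate([counts])
        for (g, dept), counts in applications.items()
        if g == gender
    }


def overall_rate(gender):
    """Admission rate for one gender with the departments pooled together."""
    return admission_rate(
        counts for (g, _), counts in applications.items() if g == gender
    )


for gender in ("men", "women"):
    print(f"{gender} overall: {overall_rate(gender):.0%}")
    for dept, rate in rate_by_department(gender).items():
        print(f"  {dept}: {rate:.0%}")
```

With these made-up counts, women have the higher rate in each department group, yet the lower rate once the groups are pooled, because most of the women applied where admission was hardest.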

Defining Simpson’s Paradox

Before defining Simpson’s Paradox, let’s define the term paradox. According to Webster’s,

a paradox is a seemingly absurd or self-contradictory statement or proposition that when investigated or explained may prove to be well founded or true.

Simpson’s Paradox, sometimes called the Yule-Simpson effect, is an association paradox. Association paradoxes can arise between categorical variables, continuous variables, or a mix of the two, for example gender (categorical) and acceptance rate (numerical).

Furthermore, Simpson’s Paradox is a special sub-type of association paradox called a reversal paradox. A reversal paradox occurs when the marginal association between two variables (measured over all the data at once) and the partial associations (measured within each level of a third variable) have different signs. For example, according to the initial false conclusion, female applicants had a lower acceptance rate than male applicants. We can consider this a negative association between being female and being admitted.

On the other hand, according to the correct conclusion, female applicants had a higher acceptance rate than male applicants within each department group. We can consider this a positive association. Thus, the marginal and partial associations between the two variables have different signs.

Now that we understand what an association paradox and a reversal paradox are, we can define Simpson’s Paradox:

Simpson’s Paradox is an effect that occurs when the marginal association between two categorical variables is qualitatively different from the partial associations between the same two variables after controlling for one or more other variables.
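For the Berkeley example, this definition can be sketched in symbols. The notation is introduced here purely for illustration: A is the event that an applicant is admitted, G is the applicant’s gender (f or m), and D is the department group.

```latex
\begin{align*}
\text{marginal:} \quad & P(A \mid G = f) < P(A \mid G = m) \\
\text{partial:} \quad & P(A \mid G = f,\, D = d) > P(A \mid G = m,\, D = d)
  \quad \text{for every department group } d
\end{align*}
```

The two inequalities point in opposite directions, which is the "qualitatively different" part of the definition.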

More simply put: Simpson’s Paradox describes a situation in which a trend or relationship that is observed within multiple groups disappears or reverses when the groups are combined.

This disappearance or reversal is due to the existence of a confounding variable. A confounding variable is a variable that is correlated with both the dependent variable and the independent variable. Our confounding variable in the UC Berkeley example was department.
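One way to see how department produces the reversal is the law of total probability: each gender’s overall admission rate is a weighted average of its department-level rates, weighted by the share of that gender’s applicants in each department. The sketch below uses illustrative notation (A for admitted, G for gender, D for department group, and w for the share applying to the natural sciences) and takes the rounded percentages quoted earlier at face value.

```latex
\begin{align*}
P(A \mid G = g) &= \sum_{d} P(A \mid G = g,\, D = d)\, P(D = d \mid G = g) \\
\text{men:} \quad 0.46\, w_m + 0.04\, (1 - w_m) &= 0.44
  \;\Rightarrow\; w_m = \tfrac{0.40}{0.42} \approx 0.95 \\
\text{women:} \quad 0.80\, w_f + 0.20\, (1 - w_f) &= 0.35
  \;\Rightarrow\; w_f = \tfrac{0.15}{0.60} = 0.25
\end{align*}
```

Under these figures, roughly 95% of the male applicants but only about 25% of the female applicants would have applied to the natural sciences, where admission rates were far higher. That imbalance in the weights is what drags the women’s overall rate below the men’s.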

Let’s consider another example to better understand what a confounding variable is. When investigating the causal relationship between smoking and death rate, age is a confounding variable. This is because as age increases, the death rate also increases, and the smoking rate decreases. If the statisticians investigating this causal relationship fail to control for age, they can run into Simpson’s Paradox.
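Here is a minimal sketch of what controlling for age could look like. The cohort counts are invented to illustrate the mechanism and do not come from any real study. The adjustment shown is direct standardization, which is only one of several ways to control for a confounder, and the function names are hypothetical.

```python
# Hypothetical cohort: the counts are invented for illustration only.
# Key: (age_group, is_smoker); value: (people, deaths)
cohort = {
    ("young", True): (800, 16),
    ("young", False): (200, 2),
    ("old", True): (200, 40),
    ("old", False): (800, 120),
}


def crude_rate(smoker):
    """Death rate for smokers or non-smokers, ignoring age entirely."""
    people = sum(n for (_, s), (n, _) in cohort.items() if s == smoker)
    deaths = sum(d for (_, s), (_, d) in cohort.items() if s == smoker)
    return deaths / people


def age_specific_rate(age, smoker):
    """Death rate within a single age group."""
    people, deaths = cohort[(age, smoker)]
    return deaths / people


def age_standardized_rate(smoker):
    """Direct standardization: weight each age-specific rate by that age
    group's share of the whole cohort, so both groups are compared as if
    they had the same age distribution."""
    total = sum(n for n, _ in cohort.values())
    rate = 0.0
    for age in ("young", "old"):
        age_share = sum(n for (a, _), (n, _) in cohort.items() if a == age) / total
        rate += age_share * age_specific_rate(age, smoker)
    return rate


for smoker in (True, False):
    label = "smokers" if smoker else "non-smokers"
    print(f"{label}: crude {crude_rate(smoker):.1%}, "
          f"age-standardized {age_standardized_rate(smoker):.1%}")
```

In this made-up cohort the crude rates make smoking look protective, simply because the smokers are mostly young. Once both groups are compared on the same age distribution, smokers have the higher death rate, matching what each age group shows on its own.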

And that’s a wrap! Hopefully this article helped explain why hidden variables could be lurking in your data. If you enjoyed this piece on Simpson’s Paradox and are interested in reading more, check out the following articles.

The Three Bs: Bootstrapping, Bagging, & Boosting!

Many important ensemble learning algorithms are based off of the Three Bs. Understanding these fundamental Machine…

Simple & Quick Web Apps in R

R Shiny is an R package that allows you to build interactive web apps through R. It doesn’t require any previous…

Simplified Logistic Regression: Classification With Categorical Variables in Python

Logistic Regression is an algorithm that performs binary classification by modeling a dependent variable (Y) in terms…

Written by Rowan Curry