Unveiling Hidden Truths: Simpson’s Paradox (part I)

Paolo Molignini, PhD
7 min read · Jul 7, 2024


Last week, we delved into a famous concept in data science and probability theory — Berkson’s Paradox. We discovered that this paradox is a form of selection bias, emerging when we fail to consider the entire dataset while conditioning on certain events. This oversight can lead us to observe spurious correlations between events that are in fact independent.

Paradoxes are an excellent way to explore the wonders of statistics because they challenge our intuition and alter our sense of what “feels right”. Over the next few posts, I will continue discussing fascinating paradoxes that arise when dealing with conditional probabilities and statistics. Today, let’s meet another rascal: Simpson’s Paradox.

The birth of a paradox

Simpson’s Paradox occurs when a trend apparent in several different groups of data disappears or even reverses when these groups are collated. Much like Berkson’s paradox, it illustrates how our statistical interpretations can change dramatically based on how we aggregate data.

In 1951, British statistician Edward H. Simpson published a paper titled “The Interpretation of Interaction in Contingency Tables”, in which he analyzed categorical data (i.e. items classified into distinct groups or categories) and the interactions within contingency tables — tables that display the frequency distribution of variables in order to analyze their associations.

Simpson’s key insight was identifying that an observed association between two variables could be reversed when a third variable, often called a confounder, is considered. In other words, the direction of the association could change depending on how the data is grouped.

A visual example

If this description sounds too technical, don’t worry. There is another, more immediate way to understand the paradox visually. Consider the graph below, which plots the joint occurrence of two variables. To make things more concrete, we will refer to a scenario similar to one in Simpson’s original paper about a treatment for a medical condition, inspired by real-life medical studies like this one about kidney stones. The x-axis shows the severity of the condition (measured on a 0 to 1 scale) and the y-axis shows the success rate of its treatment (measured with some standard reference like duration of hospital stay, recurrence of the disease, number of rounds of treatment needed, etc.).

At first glance, the full dataset (in black) appears positively correlated (red solid line). This would imply that a more severe disease is correlated with a more successful treatment! This finding is counterintuitive, so let’s examine the data more closely. When we consider an additional variable — namely the type of treatment the patients received — a more consistent picture appears. The data clusters into two subgroups (blue and orange dots) representing data for two different treatments A and B. When considered separately, the clusters exhibit the expected behavior: a more severe disease leads to a poorer outcome, regardless of the treatment, indicating a negative correlation between disease severity and treatment success rate (solid yellow and blue lines).

How is this possible? Well, the type of treatment could be a confounding variable that influences both disease severity and treatment success rate. For example, treatment A could be a cheaper procedure that is given predominantly for less severe diseases. In fact, we see that it is chosen only for diseases with severity up to 0.7. This alternative might be cheaper, but inherently has a lower success rate. Conversely, when the disease is severe, treatment B is often preferred. This could be a more expensive but superior treatment with a higher success rate than treatment A. When the data is pooled, the distinct effects of the two treatments disappear and a confusing trend emerges, because we are mixing cases of two different natures. Simpson’s Paradox thus teaches us a crucial lesson: always dig deeper into the data before jumping to conclusions! Aggregating data without considering underlying group differences can lead to misleading interpretations.
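To see how such a reversal can arise, here is a minimal Python sketch with made-up numbers mimicking the figure (the severity ranges, baseline rates, and slopes are all assumptions for illustration): two treatment groups whose success each decreases with severity, yet whose pooled least-squares fit slopes upward.

```python
import numpy as np

rng = np.random.default_rng(0)

# Treatment A: used for mild cases (severity up to ~0.7), cheaper,
# lower baseline success. Treatment B: used for severe cases,
# higher baseline success. Within each group, success DECREASES
# with severity.
sev_a = rng.uniform(0.0, 0.7, 200)
suc_a = 0.55 - 0.3 * sev_a + rng.normal(0.0, 0.03, 200)
sev_b = rng.uniform(0.4, 1.0, 200)
suc_b = 0.95 - 0.3 * sev_b + rng.normal(0.0, 0.03, 200)

# Least-squares slopes within each group and for the pooled data.
slope_a = np.polyfit(sev_a, suc_a, 1)[0]
slope_b = np.polyfit(sev_b, suc_b, 1)[0]
slope_all = np.polyfit(np.concatenate([sev_a, sev_b]),
                       np.concatenate([suc_a, suc_b]), 1)[0]

# Both within-group trends are negative, yet the pooled trend is positive.
print(f"A: {slope_a:.2f}, B: {slope_b:.2f}, pooled: {slope_all:.2f}")
```

Because treatment B handles the more severe cases *and* succeeds more often, the between-group shift dominates the within-group decline once the data is pooled.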

Mathematical formulation

Having set the stage for what Simpson’s Paradox is, let’s see how to formulate it rigorously with conditional probabilities. For events A, B, and C, we say that we have a Simpson’s Paradox when the following holds:

  1. P(A|B) > P(A|B^c): the event A is more likely conditioned on B (i.e. if B happens) than conditioned on the complement B^c (i.e. if B does not happen). This is the case of aggregated data in the example above.
  2. P(A|B,C) < P(A|B^c,C) and P(A|B,C^c) < P(A|B^c,C^c): if we condition on an additional confounder, the inequality flips direction and A conditioned on B is less likely than A conditioned on B^c. This holds whether we condition on C or on its complement C^c! This case represents the disaggregated data in the example above.

This paradox is resolved if we expand the conditional probabilities with the law of total probability (LOTP), conditioning additionally on C (check this post on the LOTP if you want to refresh this notion). We can write:

  • P(A|B) = P(A|B,C)P(C|B) + P(A|B,C^c)P(C^c|B).
  • P(A|B^c) = P(A|B^c,C)P(C|B^c) + P(A|B^c,C^c)P(C^c|B^c).

Although we do have P(A|B,C)<P(A|B^c,C) and P(A|B,C^c)<P(A|B^c,C^c) because of condition 2, the other terms in the expression can still flip the overall balance, for example if P(C^c|B)>P(C^c|B^c) and the second summand has a bigger weight than the first.
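The following sketch plugs hypothetical numbers (chosen purely for illustration) into the two LOTP expansions above, showing how lopsided weights can produce condition 1 even though condition 2 holds:

```python
from fractions import Fraction as F

# Hypothetical conditional probabilities (made up for illustration)
# satisfying condition 2: A is LESS likely under B than under B^c,
# both when C holds and when it does not.
p_abc  = F(86, 100)   # P(A|B, C)
p_abcc = F(68, 100)   # P(A|B, C^c)
p_aBc  = F(93, 100)   # P(A|B^c, C)
p_aBcc = F(73, 100)   # P(A|B^c, C^c)

# Lopsided mixing weights: B co-occurs mostly with C, B^c mostly with C^c.
p_c_b  = F(77, 100)   # P(C|B)
p_c_Bc = F(25, 100)   # P(C|B^c)

# LOTP with extra conditioning on C:
p_ab = p_abc * p_c_b + p_abcc * (1 - p_c_b)    # P(A|B)
p_aB = p_aBc * p_c_Bc + p_aBcc * (1 - p_c_Bc)  # P(A|B^c)

# Condition 1 emerges despite condition 2.
print(p_ab, ">", p_aB)  # 4093/5000 > 39/50, i.e. 0.8186 > 0.78
```

Exact `Fraction` arithmetic is used here so that the comparison is not muddied by floating-point rounding.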

For this reason, some people argue that this situation shouldn't even be called a paradox. After all, it's fully in line with the laws of probability, and the apparent paradox arises only from how we interpret the data (whether it's aggregated or disaggregated). In other words, what seems like a paradox is actually a misunderstanding: it's a failure to properly account for confounders or to consider the causal relationships between variables.

The kidney stone example — revisited

Armed with the exact mathematical definition of Simpson’s Paradox, let’s now revisit the medical study above. To make things easier, instead of working with a continuum of data points, we will use categorical data like in Simpson’s original work. We will consider a contingency table where diseases are categorized into two tiers based on the size of the stones (small or large). We maintain two treatment types (renamed 1 and 2) with different preferential applicability and success rates: treatment 1 is mainly used for small stones and has a lower success rate than treatment 2, which is chosen predominantly for large stones.

A contingency table representing the distribution of different cases in the kidney stone medical study. The first number is the number of successful treatments and the second is the number of total cases, e.g. for 81/87 there were 81 successful treatments out of 87 cases:

| | Treatment 1 | Treatment 2 |
| --- | --- | --- |
| Small stones | 234/270 (86.7%) | 81/87 (93.1%) |
| Large stones | 55/80 (68.8%) | 192/263 (73.0%) |
| Both | 289/350 (82.6%) | 273/350 (78.0%) |

In each of the first two rows the largest share of successful treatments (marked in red in the original table) belongs to treatment 2, yet in the last row it belongs to treatment 1.

Upon examining the table, we can immediately see Simpson’s Paradox at play. Treatment 2 is more successful than treatment 1 for both types of kidney stones. However, once aggregated, the overall success rate favors treatment 1, creating the illusion that it is the better treatment overall.

Let’s now link this example to the conditional probabilities defined before. The events are defined as follows:

  • A: treatment is successful.
  • B: the chosen treatment is treatment 1.
  • C: the kidney stones are small.

With these definitions it is easy to verify the following:

  • P(A|B) > P(A|B^c): treatment 1 is more successful over the aggregated data.
  • P(A|B,C) < P(A|B^c,C): treatment 1 is less successful for small stones.
  • P(A|B,C^c) < P(A|B^c, C^c): treatment 1 is also less successful for large stones.

By examining the data we can also understand why the paradox is possible. Procedures are not allocated uniformly across the different types of kidney stones: treatment 1 is disproportionately used for small stones and treatment 2 for large ones. In fact, we can state this mathematically as P(C|B) = 270/350 ≈ 77.14% > P(C|B^c) = 87/350 ≈ 24.86%. This huge discrepancy, combined with the fact that the other probabilities are not too far off from each other, makes the aggregated probabilities favor treatment 1.
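All three inequalities can be verified directly from the counts. Note that only 81/87 and the totals 270/350 and 87/350 appear explicitly in the text; the remaining counts below are assumed from the classic kidney stone study this example is modeled on:

```python
from fractions import Fraction as F

# Successes/totals per treatment and stone size (counts assumed from
# the classic kidney stone dataset).
t1 = {"small": (234, 270), "large": (55, 80)}    # B: treatment 1
t2 = {"small": (81, 87),   "large": (192, 263)}  # B^c: treatment 2

def rate(s, n):
    return F(s, n)

# Treatment 2 wins within each stone size...
assert rate(*t1["small"]) < rate(*t2["small"])   # P(A|B,C)   < P(A|B^c,C)
assert rate(*t1["large"]) < rate(*t2["large"])   # P(A|B,C^c) < P(A|B^c,C^c)

# ...yet treatment 1 wins on the aggregated data.
agg1 = F(234 + 55, 270 + 80)    # P(A|B)   = 289/350
agg2 = F(81 + 192, 87 + 263)    # P(A|B^c) = 273/350
assert agg1 > agg2

# The lopsided allocation of stone sizes drives the reversal:
print(F(270, 350), "vs", F(87, 350))  # P(C|B) vs P(C|B^c)
```

Running this script raises no assertion errors, confirming that the disaggregated and aggregated comparisons really do point in opposite directions.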

In this revisited example, the confounder is the size of the stones, determining both the severity of the disease and the type of treatment chosen.

A famous example: UC Berkeley gender bias study

Let’s conclude with another famous example that can be explained with Simpson’s Paradox. It comes from a 1973 study on gender bias in graduate admissions at the University of California, Berkeley. The admission data showed that men were more likely to be accepted than women, and the difference was too large to be attributable to chance. However, the disaggregated data made the culprit clear: there was a confounder at play — namely, the department choice! Men tended to apply to less competitive departments with higher acceptance rates, while women tended to apply to more competitive departments with lower acceptance rates. Remarkably, when the data was adjusted for department and recombined, it revealed a slight gender bias in the other direction (favoring women instead of men)!

To see how we can describe this situation with Simpson’s Paradox, let’s first formulate the events needed:

  • A: a given person is admitted.
  • M: the given person is a man (M^c: person is a woman).
  • C: the department the person is applying to is competitive (i.e. smaller ratio between available places and applications).

Then we can obtain the statement of Simpson’s paradox with the following (simplified) conditional probabilities:

  • P(A|M) > P(A|M^c): men are admitted more often than women overall.
  • P(A|C,M) < P(A|C,M^c) and P(A|C^c,M) < P(A|C^c,M^c): women are admitted more often than men at every department.
  • P(C^c|M) > P(C^c|M^c): men apply more often to less competitive departments (equivalently, women apply more often to the competitive ones).
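A toy calculation with made-up numbers (not the actual Berkeley figures) shows how all three conditions can hold at once:

```python
from fractions import Fraction as F

# Hypothetical toy numbers (NOT the real Berkeley data): admitted/applied
# per gender and department type.
men   = {"easy": (60, 100), "competitive": (5, 20)}
women = {"easy": (13, 20),  "competitive": (30, 100)}

# Within each department type, women have the higher admission rate...
assert F(*men["easy"]) < F(*women["easy"])                # 60% < 65%
assert F(*men["competitive"]) < F(*women["competitive"])  # 25% < 30%

# ...but men mostly apply to the easy department and women to the
# competitive one, so the aggregated rates reverse.
men_total   = F(60 + 5, 100 + 20)   # 65/120, roughly 54%
women_total = F(13 + 30, 20 + 100)  # 43/120, roughly 36%
assert men_total > women_total
```

The structure is identical to the kidney stone example: the department plays the role of the confounder C.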

For a deeper dive into the data science behind this, take a look at the full study by Bickel, Hammel, and O’Connell (Science, 1975).

Did you enjoy this explanation of the tricks that statistics can play on our intuition? Share your thoughts and any cool examples of Simpson’s Paradox you’ve encountered in the comments below!


Paolo Molignini, PhD

Researcher in theoretical quantum physics at Stockholm University with a passion for programming, data science, and probability & statistics.