Resolving Simpson’s Paradox
The world does not contain contradictions — our notation should not help to create them.
In his classic work on human perception, Principles of Gestalt Psychology, the Gestalt psychologist Kurt Koffka stated that:
…the whole is other than the sum of its parts…
— Kurt Koffka
Note, this is very different from the common expression “The whole is greater than the sum of its parts.”
What Koffka meant was that the parts-of-a-thing — when perceived individually — could have properties different from when they were perceived all together as a whole-thing.
You can see this in the following image: looking at the three circles individually, we see three circles, each with a missing wedge; looking at all three at once, you will see a triangle, one that doesn’t exist at the individual level of the parts.

The whole can be “other than the parts” in statistics as well
While Koffka refers to the subjective realm of human perception, it turns out that — even in the (comparatively) more objective world of statistics — the whole can be other than the parts.
Stated more formally, it is possible that:
… a trend appears in several different groups of data but disappears or reverses when these groups are combined.
— Wikipedia
The technical term for this phenomenon is “Simpson’s Paradox”, and by “paradox” it is meant that, starting from seemingly valid assumptions, we are led to a conclusion that contradicts itself.
And what are these seemingly valid assumptions?
Namely:
- if A > C
- and
- if B > D
_____________________
- then (A + B) > (C + D)
Imagine the following example:
- if Alan runs 10 miles, and Charles runs 8 miles.
- and
- if Barry runs 6 miles, and Dave runs 5 miles.
_____________________
- then Alan and Barry together (16 miles) ran further than Charles and Dave together (13 miles) did.
Since Alan beat Charles by 2 miles, and Barry beat Dave by 1 mile, we should expect Alan and Barry together to beat Charles and Dave (together) by 3 miles.
That is, we see a trend in the subgroups (the individual people), which continues when we combine the subgroups into larger groups (the pairs of people).
All this seems so obvious from everyday experience that it is hard to imagine circumstances where it would not be true — but that is exactly what happens with Simpson’s Paradox.
How can the parts (the sub-groups) be other than the whole (the combined group), when it is the parts that make up the whole?
It certainly seems counter-intuitive, and in order to make sense of it, we’ll need to first describe the paradox in more detail.
And what is “Simpson’s Paradox”, exactly?
As stated, it is a phenomenon in statistics where a given trend appears in individual sub-groups, but then disappears or reverses when these groups are combined.
Unfortunately, showing you an example of Simpson’s Paradox is more difficult (and much less cool) than the illustration of the Gestalt phenomenon above; the easiest way is to lead with an example, taken from Wang et al.
School B and School G take a test
There are two schools in a school district — School B and School G — and they each take the same national mathematics examination. Each school has 100 students: School B has 80 boys and 20 girls, while School G has 20 boys and 80 girls.
And here are the results for the 80 boys of School B, showing that they scored on average 84% (assume that every boy scored the same, as unrealistic as that is):

And here are the results for the 20 boys of School G, showing that they scored on average 85%:

So, it seems that the boys of School B were beaten by the boys of School G.
Now, let’s look at the 20 girls of School B and see that they scored on average 80%:

Now, looking at the scores for the 80 girls of School G, we can see that they scored on average 81%:

After seeing the results for the following sub-groups:
- the boys of School B vs. the boys of School G (84% vs. 85%)
- the girls of School B vs. the girls of School G (80% vs. 81%)
We can see that both the boys and girls of School B lost to their respective counterparts in School G.
Now, if we were to ask the (apparently) simple question of “Which school did better on the national mathematics test”, then the answer should be obvious, no?
It’s School G, isn’t it?
I mean:
- School G’s boys beat the boys of School B, and
- School G’s girls beat the girls of School B
The conclusion is obvious, surely?
Deciding which school did “better” isn’t straightforward
It actually isn’t straightforward to decide which school did “better” on the test, since the answer depends on how you look at the data.
Yes, if we look at the “average scores” of the boys and the girls of School G individually, we can see that they were individually higher than the boys and girls of School B, respectively.

But if we look at the average score for each school, boys and girls combined, something interesting happens:

After combining the results, it looks like School B is actually better overall than School G — in complete contradiction to our previous conclusion!
(And yes, my calculations are correct — before you ask!)
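The combined figures can be checked with a short calculation. The sketch below is only an illustration in Python, using the head counts and average scores given above; the names `schools` and `combined_average` are made up for this example.

```python
# Head counts and average scores (in percent) from the example above.
schools = {
    "B": {"boys": (80, 84), "girls": (20, 80)},
    "G": {"boys": (20, 85), "girls": (80, 81)},
}


def combined_average(groups):
    """Average score over all students, weighting each group by its size."""
    total_students = sum(n for n, _ in groups.values())
    total_points = sum(n * score for n, score in groups.values())
    return total_points / total_students


for name, groups in schools.items():
    print(f"School {name}: {combined_average(groups):.1f}%")
```

With these numbers School B comes out at 83.2% and School G at 81.8%, because each school's combined average is pulled toward the score of its larger group.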
So, School B is now better overall than School G, despite the boys and girls of School G scoring higher on average than the boys and girls of School B?
Yes!
Welcome to the heart of the paradox: a trend appearing when looking at sub-groups (the boys and girls of a school), but then disappearing or reversing when you combine the groups (a school in total).
The Root of the Paradox in Poor Notation
Why Simpson’s Paradox arises isn’t easy to explain — you can check some more technical explanations (here for example), to confirm that this paradox is challenging to both explain and resolve.
Personally — despite being familiar with the paradox for several years — it was only after writing a previous post (on explicitly distinguishing counts and fractions) that I found an “intuitive” explanation — and by “intuitive”, I mean “obvious to me”, of course.
The difference between counts and fractions
While writing my previous post, I realized that not explicitly distinguishing between counts and fractions is a key factor in Simpson’s Paradox. Specifically, the use of the fraction slash ( ⁄ ) to represent both of them — despite them being fundamentally different things — helps to create Simpson’s Paradox.
Briefly, a count is something concrete that you can directly perceive e.g. you see two actual shoes in front of you.

On the other hand, a fraction is something abstract that you derive in your head, by comparison to a unit, e.g. you see the two matching shoes in front of you, and you know that you see 100% of one pair of shoes, since you have the concept of “a pair of shoes”.

Treating counts and fractions as equivalent concepts leads to confusion, since they are not directly equivalent; this becomes very obvious when we start to compare them across different groups of things.
The difference between counts and fractions: a simple example
Imagine that the national mathematics test had ten questions on it.

If one boy (we’ll call him “Bobby”) answers 8 out of 10 questions correctly, then 80% of the questions Bobby answered were correct, and he has answered eight questions, out of a possible ten, correctly. That is, he has correctly answered 80% of the total questions on the test.

If another boy — we’ll call him “Gilbert” — answers only 7 out of the 10 questions on the test, and all of his answers are correct, then 100% of the questions Gilbert answered were correct; but he has only answered seven questions — out of a possible ten — correctly. That is, he has answered 70% of the total questions on the test.

Yes, from a “fraction-of-questions-answered” perspective, 100% of Gilbert’s answers were correct — but he only answered seven questions.
Could Gilbert claim he did better than Bobby on the test, since 100% (7/7) of his answers were correct, and only 80% (8/10) of Bobby’s answers were?
No, of course not.
Gilbert only answered seven questions correctly, while Bobby answered eight questions correctly — and on examinations, it is the count of questions answered correctly that matters, not the fraction of your answers that were correct (unless negative marking is in effect, of course, etc.).
Now, while Bobby did better on this test than Gilbert — providing more correct answers (8 versus 7) — Bobby did get 2 questions wrong, while Gilbert answered all his questions correctly.
Perhaps Gilbert was just slower, and more careful, and merely ran out of time on the exam?
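The two ways of reading the same result can be made explicit in a few lines of Python. This is a minimal sketch: the class `ExamResult` and its method names are hypothetical, chosen only to separate "fraction of my own answers that were right" from "fraction of the whole test I got right".

```python
from dataclasses import dataclass


@dataclass
class ExamResult:
    name: str
    attempted: int   # questions the student answered
    correct: int     # answers that were right
    total: int = 10  # questions on the test

    def accuracy(self):
        """Fraction of the student's own answers that were correct."""
        return self.correct / self.attempted

    def score(self):
        """Fraction of all questions on the test answered correctly."""
        return self.correct / self.total


bobby = ExamResult("Bobby", attempted=10, correct=8)
gilbert = ExamResult("Gilbert", attempted=7, correct=7)

for r in (bobby, gilbert):
    print(f"{r.name}: {r.correct} correct, "
          f"accuracy {r.accuracy():.0%}, score {r.score():.0%}")
```

Gilbert wins on accuracy (100% vs. 80%), while Bobby wins on both the count of correct answers (8 vs. 7) and the test score (80% vs. 70%).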
Is a greater count better than a greater fraction? It depends!
Imagine instead Bobby and Gilbert were heart surgeons, and you were going under one of their knives for a transplant.
Would you prefer Bobby, who has performed ten surgeries, but only 80% of them were successful?
Or would you prefer Gilbert, who has performed only seven surgeries, but 100% of them were successful?

This contrast between test scores and heart surgeries helps to illustrate the importance of differentiating between counts and fractions, since they are very different concepts: sometimes the count is more important (e.g. test scores), while in other situations the fraction matters more (e.g. heart surgeries).
Using the new notation to help distinguish counts and fractions

To help distinguish between these two concepts, I will use the notation for counts and fractions that I introduced previously.
That is, “eight-out-of-these-ten” actual answers, is depicted using the vertical broken bar, as 8¦10. On the other hand, “eight-out-of-every-ten” answers will be represented using the standard fraction slash, as “8 ⁄ 10” (80%).

Similarly, “seven-out-of-these-seven” actual answers is depicted as 7¦7, while “seven-out-of-every-seven” answers is represented as “7 ⁄ 7” (100%).
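One way to see why the two notations should not be treated as interchangeable is to model them as different types. The sketch below is an assumption-laden illustration, not part of the notation itself: the `Count` class is hypothetical, and the 4¦5 value is invented purely for comparison.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Count:
    """'part-out-of-these-whole' actual things, written part¦whole."""
    part: int
    whole: int

    def as_fraction(self):
        """Derive the abstract fraction (part ⁄ whole) from the count."""
        return Fraction(self.part, self.whole)

    def __str__(self):
        return f"{self.part}¦{self.whole}"


bobby = Count(8, 10)
smaller = Count(4, 5)  # hypothetical: four correct answers out of five

# As counts, 8¦10 and 4¦5 differ: they describe different numbers of answers.
print(bobby == smaller)                              # False
# As fractions, 8 ⁄ 10 and 4 ⁄ 5 are the same value.
print(bobby.as_fraction() == smaller.as_fraction())  # True
```

Going from a count to a fraction discards the size of the group, which is why a fraction can always be derived from a count but not the other way round.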

Armed with this new notation, let’s return to the example of Schools B and G and see if we can cut through the confusion.
Resolving the Paradox Using the New Notation
How can it be that School G is “better” than School B, when comparing the boys and girls as individual groups, but School G is “worse” than School B when you look at the average score for the entire school?
First difference: the sub-groups differ
First, notice that the sex ratio differs sharply between the two schools: there are 80 boys and 20 girls in School B, but 20 boys and 80 girls in School G.

This difference between the two schools has important implications for the results of the mathematics test — and directly contributes to the paradox — since the sex ratio of each school affects the overall count of questions answered correctly:
- School B had 80 boys who scored 84% (the second highest average score): that’s a lot of boys with a very high score.
- School B only had 20 girls who scored 80% (the lowest average score): that’s only a few girls with the lowest average score.
- School G only had 20 boys who scored 85% (the highest average score): that’s only a few boys with the highest average score.
- School G had 80 girls who scored 81% (the second lowest average score): that’s a lot of girls with a (relatively) lower score.
Given the percentage notation used in the example above, the key difference between the count of questions answered correctly (where School B did better), and the fraction of test questions answered correctly (where School G did better) is not made clear.
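To see how much the sex ratio matters, the combined averages can be recomputed with a different mix of students. The following is a rough sketch: the 50/50 split is a hypothetical scenario that does not appear in the example, and `average_for_mix` is a made-up helper name.

```python
# Average scores (in percent) from the example; only head counts are varied.
scores = {"B": {"boys": 84, "girls": 80}, "G": {"boys": 85, "girls": 81}}


def average_for_mix(group_scores, boys, girls):
    """Combined average for a school with the given numbers of boys and girls."""
    points = boys * group_scores["boys"] + girls * group_scores["girls"]
    return points / (boys + girls)


# Actual mix: School B is mostly boys, School G is mostly girls.
actual_b = average_for_mix(scores["B"], boys=80, girls=20)
actual_g = average_for_mix(scores["G"], boys=20, girls=80)

# Hypothetical mix (an assumption, not in the example): 50 boys, 50 girls each.
equal_b = average_for_mix(scores["B"], boys=50, girls=50)
equal_g = average_for_mix(scores["G"], boys=50, girls=50)

print(f"Actual mix: B {actual_b:.1f}% vs G {actual_g:.1f}%")
print(f"Equal mix:  B {equal_b:.1f}% vs G {equal_g:.1f}%")
```

With the actual mix School B leads (83.2% vs. 81.8%); with an equal mix School G leads (83.0% vs. 82.0%). The sub-group scores are identical in both cases, so the reversal comes entirely from the group sizes.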
Second difference: counts ≠ fractions
Second, imagine that there were 10 questions on the test — which means that 1,000 questions were asked of the students of each school*.
*10 questions on the test × 100 students per school = 1,000 questions asked per school

Therefore, School B answered 832 questions correctly, while School G answered only 818 questions correctly — meaning that School B’s students answered more actual questions correctly than School G’s students did.
This is despite School G’s boys and girls each correctly answering a greater fraction of the questions given to them, compared to the boys and girls of School B.
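These totals can be reproduced from the head counts and average scores, under the stated assumption of 10 questions per test. A minimal sketch, with illustrative names (`groups`, `correct_answers`):

```python
QUESTIONS_PER_TEST = 10  # assumed in the text above

# (number of students, average score in percent) for each sub-group.
groups = {
    ("B", "boys"): (80, 84),
    ("B", "girls"): (20, 80),
    ("G", "boys"): (20, 85),
    ("G", "girls"): (80, 81),
}


def correct_answers(students, percent):
    """Count of questions answered correctly by a whole sub-group."""
    asked = students * QUESTIONS_PER_TEST
    return asked * percent // 100  # divides exactly for these numbers


counts = {key: correct_answers(*value) for key, value in groups.items()}
totals = {
    school: sum(c for (s, _), c in counts.items() if s == school)
    for school in ("B", "G")
}

for (school, group), c in counts.items():
    print(f"School {school} {group}: {c}")
print(totals)  # {'B': 832, 'G': 818}
```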
Critical to the resolution of Simpson’s Paradox, is that while:
- School G’s boys scored higher than School B’s boys from a fractional perspective (85% vs. 84%), but they lost badly from a count perspective (170 vs. 672), a deficit of 502 correct answers.
- School G’s girls beat School B’s girls from both a fractional (81% vs. 80%) and count (648 vs. 160) perspective, but their surplus of 488 correct answers didn’t make up the deficit created by School G’s boys against the boys of School B, leaving School G 14 correct answers behind overall.
In total, while School G’s students correctly answered a higher fraction of the questions presented to them within each sub-group, overall they answered a lower count of questions correctly.
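The sub-group gaps can be tallied directly from the counts of correct answers. This short sketch takes those counts as given (they assume 10 questions per student) and measures each gap from School G's point of view; the variable names are illustrative.

```python
# Counts of correct answers per sub-group, as given in the text
# (assuming 10 questions per student).
correct = {
    "B": {"boys": 672, "girls": 160},
    "G": {"boys": 170, "girls": 648},
}

# Gap for each sub-group, from School G's point of view (G minus B).
gaps = {g: correct["G"][g] - correct["B"][g] for g in ("boys", "girls")}
net = sum(gaps.values())

print(gaps)  # {'boys': -502, 'girls': 488}
print(net)   # -14
```

The girls' surplus of 488 correct answers is smaller than the boys' deficit of 502, which is where the overall gap of 14 comes from.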
It is critical to remember that, since fractions are derived from an actual count of things, you cannot freely change the count you are calculating from without affecting the fraction: the concrete count of things comes before the abstract fraction of things.
Conclusion — Which School Is Actually Better?
So, which school would you rather go to?
- If you want to be where, as an individual boy or girl, you would score highest on average (the greatest fraction of correct answers), then you should attend School G.
- If you want to be part of the highest-scoring school overall (the greatest count of correct answers), then you should attend School B.
Which criterion matters more to you is an individual choice: there is no algorithm for deciding whether the count or the fraction of things is more important, and so the decision depends on your judgement.
And judgement depends on your perspective.
