Resolving Simpson’s Paradox

The world does not contain contradictions — our notation should not help to create them.

John Jordan
Nov 5 · 11 min read

In his classic work on human perception, Principles of Gestalt Psychology, the Gestalt psychologist Kurt Koffka stated that:

…the whole is other than the sum of its parts…

— Kurt Koffka

Note, this is very different from the common expression “The whole is greater than the sum of its parts.”

What Koffka meant was that the parts-of-a-thing — when perceived individually — could have properties different from when they were perceived all together as a whole-thing.

You can see this in the following image: looking at the three circles individually, we see three circles, each with a missing wedge; looking at all three at once, you will see a triangle, one that doesn’t exist at the individual level of the parts.

Individually, there are three circles with a wedge missing; taken together, a triangle is created. “The whole is other than the parts”.

The whole can be “other than the parts” in statistics as well

While Koffka refers to the subjective realm of human perception, it turns out that — even in the (comparatively) more objective world of statistics — the whole can be other than the parts.

Stated more formally, it is possible that:

… a trend appears in several different groups of data but disappears or reverses when these groups are combined.

— Wikipedia

The technical term for this phenomenon is “Simpson’s Paradox”, and by “paradox” it is meant that, starting from seemingly valid assumptions, we are led to a conclusion that contradicts itself.

And what are these seemingly valid assumptions?

Namely:

  • if A > C
  • and
  • if B > D
    _____________________
  • then (A + B) > (C + D)

Imagine the following example:

  • if Alan runs 10 miles, and Charles runs 8 miles.
  • and
  • if Barry runs 6 miles, and Dave runs 5 miles.
    ________________________________________________________________
  • then Alan and Barry together (16 miles), ran further than Charles and Dave together (13 miles) did.

Since Alan beat Charles by 2 miles, and Barry beat Dave by 1 mile, we should expect Alan and Barry together to beat Charles and Dave together by 3 miles.

That is, we see a trend in the subgroups (the individual people), which continues when we combine the subgroups into larger groups (the pairs of people).
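The additive intuition is easy to check. Here is a minimal sketch in Python using the mileage figures from the example; the variable names are my own, chosen only for illustration.

```python
# Miles run by each person, taken from the example above.
alan, barry = 10, 6
charles, dave = 8, 5

# Each individual comparison favours the first pair...
assert alan > charles and barry > dave

# ...and with plain sums, the combined comparison has to favour them too.
lead = (alan + barry) - (charles + dave)
print(alan + barry, charles + dave, lead)  # 16 13 3
```

With plain sums the order can never flip: the combined lead is just the two individual leads added together.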

All this seems so obvious from everyday experience that it is hard to imagine circumstances where it would not be true — but that is exactly what happens with Simpson’s Paradox.

How can the parts (the sub-groups) be other than the whole (the combined group), when it is the parts that make up the whole?

It certainly seems counter-intuitive, and in order to make sense of it, we’ll need to first describe the paradox in more detail.

And what is “Simpson’s Paradox”, exactly?

As stated, it is a phenomenon in statistics where a given trend appears in individual sub-groups, but then disappears or reverses when these groups are combined.

Unfortunately, showing you an example of Simpson’s Paradox is more difficult (and much less cool) than the illustration of the Gestalt phenomenon above; the easiest way is to lead with an example, taken from Wang et al.

School B and School G take a test

There are two schools in a school district — School B and School G — and they each take the same national mathematics examination. Each school has 100 students: School B has 80 boys and 20 girls, while School G has 20 boys and 80 girls.

And here are the results for the 80 boys of School B, showing that they scored on average 84% (assume that every boy scored the same, as unrealistic as that is):

The 80 boys of School B scored an average of 84% on the test.

And here are the results for the 20 boys of School G, showing that they scored on average 85%:

The 20 boys of School G scored an average of 85% on the test, beating the boys of School B by 1%.

So, it seems that the boys of School B were beaten by the boys of School G.

Now, let’s look at the 20 girls of School B and see that they scored on average 80%:

The 20 girls of School B scored an average of 80% on the test.

Now, looking at the scores for the 80 girls of School G, we can see that they scored on average 81%:

The 80 girls of School G scored an average of 81%, beating the girls of School B by 1%.

After seeing the results for the following sub-groups:

  • the boys of School B vs. the boys of School G (84% vs. 85%)
  • the girls of School B vs. the girls of School G (80% vs. 81%)

We can see that both the boys and girls of School B lost to their respective counterparts in School G.

Now, if we were to ask the (apparently) simple question of “Which school did better on the national mathematics test”, then the answer should be obvious, no?

It’s School G, isn’t it?

I mean:

  • School G’s boys beat the boys of School B, and
  • School G’s girls beat the girls of School B

The conclusion is obvious, surely?

Deciding which school did “better” isn’t straight-forward

It actually isn’t straight-forward to decide which school did “better” on the test, since the answer depends on how you look at the data.

Yes, if we look at the “average scores” of the boys and the girls of School G individually, we can see that they were individually higher than the boys and girls of School B, respectively.

The boys and girls of School G scored higher than their School B counterparts — but does this mean that School G is “better” overall than School B?

But if we look at the average score (boys and girls combined) for each school, something interesting happens:

When we combine the scores of both boys and girls for each school, it turns out that School B is better than School G, despite the boys and girls of School G having higher average scores. Because the sex ratios are imbalanced, the overall average is a weighted mean for each school, e.g. for School B: ((80 × 84%) + (20 × 80%)) ÷ 100 students = 83.2%, and for School G: ((20 × 85%) + (80 × 81%)) ÷ 100 students = 81.8%.

After combining the results, it looks like School B is actually better overall than School G — in complete contradiction to our previous conclusion!
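If you want to reproduce the combined averages yourself, here is a minimal sketch in Python. The data layout and the function name `overall_average` are my own assumptions for illustration; the numbers are the ones given in the example.

```python
from fractions import Fraction

# (number of students, average score in %) per sub-group, from the example.
schools = {
    "B": {"boys": (80, 84), "girls": (20, 80)},
    "G": {"boys": (20, 85), "girls": (80, 81)},
}

def overall_average(groups):
    """Weighted mean: each sub-group average counts once per student."""
    total_students = sum(n for n, _ in groups.values())
    weighted_sum = sum(n * score for n, score in groups.values())
    return Fraction(weighted_sum, total_students)

for name, groups in schools.items():
    print(name, float(overall_average(groups)))
# B 83.2
# G 81.8
```

The sub-group averages favour School G, but the weights (80 boys versus 20 boys) pull School B's overall average above School G's.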

(And yes, my calculations are correct — before you ask!)

So, School B is now better overall than School G, despite the boys and girls of School G scoring higher on average than the boys and girls of School B?

Yes!

Welcome to the heart of the paradox: a trend appearing when looking at sub-groups (the boys and girls of a school), but then disappearing or reversing when you combine the groups (a school in total).

The Root of the Paradox in Poor Notation

Why Simpson’s Paradox arises isn’t easy to explain — you can check some more technical explanations (here for example), to confirm that this paradox is challenging to both explain and resolve.

Personally — despite being familiar with the paradox for several years — it was only after writing a previous post (on explicitly distinguishing counts and fractions) that I found an “intuitive” explanation — and by “intuitive”, I mean “obvious to me”, of course.

The difference between counts and fractions

While writing my previous post, I realized that not explicitly distinguishing between counts and fractions is a key factor in Simpson’s Paradox. Specifically, the use of the fraction slash ( ⁄ ) to represent both of them — despite them being fundamentally different things — helps to create Simpson’s Paradox.

Briefly, a count is something concrete that you can directly perceive e.g. you see two actual shoes in front of you.

Two shoes. And yes, they are hideous shoes — when you are working with creative commons material, you get what you pay for: https://www.flickr.com/photos/11286073@N00/2587042936

On the other hand, a fraction is something abstract that you derive in your head, by comparison to a unit, e.g. you see the two matching shoes in front of you, and you know that you see 100% of one pair of shoes, since you have the concept of “a pair of shoes”.

Moving from things we can count directly — the two shoes — to things we measure by reference to a unit — a pair of shoes — is when we start to move from counts of things, to fractions of units.

Treating counts and fractions as equivalent concepts leads to confusion, since they are not directly equivalent; this becomes very obvious when we start to compare them across different groups of things.

The difference between counts and fractions: a simple example

Imagine that the national mathematics test had ten questions on it.

Imagine the test had 10 questions; imagine further that I used tabular figures, so the questions lined up nicely.

If one boy (we’ll call him “Bobby”) answers 8 out of 10 questions correctly, then 80% of the questions Bobby answered were correct, and he has answered eight questions, out of a possible ten, correctly. That is, he has correctly answered 80% of the total questions on the test.

Bobby scored 8-out-of-10 on the test; he gave ten answers, of which 8 were correct, so he scored 80%.

If another boy — we’ll call him “Gilbert” — answers only 7 out of the 10 questions on the test, and all of his answers are correct, then 100% of the questions Gilbert answered were correct; but he has only answered seven questions — out of a possible ten — correctly. That is, he has answered 70% of the total questions on the test.

Every question Gilbert gave an answer to was correct, which means 100% of his answers were correct; but he only gave seven answers, when there were ten questions on the test — so he scored 70%.

Yes, from a “fraction-of-questions-answered” perspective, 100% of Gilbert’s answers were correct — but he only answered seven questions.

Could Gilbert claim he did better than Bobby on the test, since 100% (7/7) of his answers were correct, and only 80% (8/10) of Bobby’s answers were?

No, of course not.

Gilbert only answered seven questions correctly, while Bobby answered eight questions correctly — and on examinations, it is the count of questions answered correctly that matters, not the fraction of your answers that were correct (unless negative marking is in effect, of course, etc.).
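The two ways of scoring Bobby and Gilbert can be laid side by side in a small Python sketch. The helper `summarise` is a hypothetical name of my own, and the sketch assumes one mark per correct answer with no negative marking.

```python
from fractions import Fraction

TOTAL_QUESTIONS = 10  # questions on the test

def summarise(attempted, correct):
    """Return (count correct, fraction of own answers correct, fraction of test correct)."""
    return correct, Fraction(correct, attempted), Fraction(correct, TOTAL_QUESTIONS)

bobby = summarise(attempted=10, correct=8)
gilbert = summarise(attempted=7, correct=7)

for name, (count, accuracy, score) in [("Bobby", bobby), ("Gilbert", gilbert)]:
    print(f"{name}: {count} correct, {float(accuracy):.0%} of his answers, {float(score):.0%} of the test")
# Bobby: 8 correct, 80% of his answers, 80% of the test
# Gilbert: 7 correct, 100% of his answers, 70% of the test
```

Gilbert wins on the fraction of his own answers, Bobby wins on the count, and the exam only rewards the count.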

Now, while Bobby did better on this test than Gilbert — providing more correct answers (8 versus 7) — Bobby did get 2 questions wrong, while Gilbert answered all his questions correctly.

Perhaps Gilbert was just slower, and more careful, and merely ran out of time on the exam?

Is a greater count better than a greater fraction? It depends!

Imagine instead Bobby and Gilbert were heart surgeons, and you were going under one of their knives for a transplant.

Would you prefer Bobby, who has performed ten surgeries, but only 80% of them were successful?

Or would you prefer Gilbert, who has performed only seven surgeries, but 100% of them were successful?

Sometimes, doing ten things, and only doing eight of them correctly is worse than only doing seven things, but doing every one of them correctly.

This contrast between test scores and heart surgeries helps to illustrate the importance of differentiating between counts and fractions, since they are very different concepts: sometimes the count is more important (e.g. test scores), while in other situations the fraction matters more (e.g. heart surgeries).

Using the new notation to help distinguish counts and fractions

The vertical bar, to be used to record counts of things, as opposed to representing fractions of things.

To help distinguish between these two concepts, I will use the notation for counts and fractions that I introduced previously.

That is, “eight-out-of-these-ten” actual answers, is depicted using the vertical broken bar, as 8¦10. On the other hand, “eight-out-of-every-ten” answers will be represented using the standard fraction slash, as “8 ⁄ 10” (80%).

Eight-out-of-ten questions is written as 8¦10. That is, Bobby gave answers to all ten questions on the test, but only eight were correct.

Similarly, “seven-out-of-these-seven” actual answers is depicted as 7¦7, while “seven-out-of-every-seven” answers is represented as “7 ⁄ 7” (100%).

Seven-out-of-seven is written as 7¦7. That is, Gilbert only answered seven questions, but all of his answers were correct. (Of course, he only answered 7¦10 of the total questions on the test — that is, 70% of the total questions on the test)
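One possible way to model the distinction in code, purely as an illustration: the `Count` class below is a hypothetical construct of my own, not part of any library. It keeps the two numbers of a count exactly as observed, while Python's built-in `Fraction` reduces them to an abstract ratio.

```python
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Count:
    """'part out of these whole' actual things, e.g. 8¦10. Never reduced."""
    part: int
    whole: int

    def as_fraction(self) -> Fraction:
        # Moving from the concrete count to the abstract fraction.
        return Fraction(self.part, self.whole)

    def __str__(self) -> str:
        return f"{self.part}¦{self.whole}"

bobby = Count(8, 10)
gilbert_answered = Count(7, 7)

print(bobby, bobby.as_fraction())                        # 8¦10 4/5
print(gilbert_answered, gilbert_answered.as_fraction())  # 7¦7 1

# Two different counts can collapse into the same fraction:
print(Count(8, 10).as_fraction() == Count(4, 5).as_fraction())  # True
print(Count(8, 10) == Count(4, 5))                               # False
```

The last two lines show what is lost in the conversion: the fraction no longer remembers how many actual things were counted.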

Armed with this new notation, let’s return to the example of Schools B and G and see if we can cut through the confusion.

Resolving the Paradox Using the New Notation

How can it be that School G is “better” than School B, when comparing the boys and girls as individual groups, but School G is “worse” than School B when you look at the average score for the entire School?

First difference: the sub-groups differ

First, notice that the sex ratios of the two schools are very different: there are 80 boys and 20 girls in School B, but 20 boys and 80 girls in School G.

The sex-ratios are reversed in the two Schools: School B has more boys, School G has more girls.

This difference between the two schools has important implications for the results of the mathematics test — and directly contributes to the paradox — since the sex ratio of each school affects the overall count of questions answered correctly:

  • School B had 80 boys who scored 84% (the second highest average score): that’s a lot of boys with a very high score.
  • School B only had 20 girls who scored 80% (the lowest average score): that’s only a few girls with the lowest average score.
  • School G only had 20 boys who scored 85% (the highest average score): that’s only a few boys with the highest average score.
  • School G had 80 girls who scored 81% (the second lowest average score): that’s a lot of girls with a (relatively) lower score.

Given the percentage notation used in the example above, the key difference between the count of questions answered correctly (where School B did better), and the fraction of test questions answered correctly (where School G did better) is not made clear.

Second difference: counts ≠ fractions

Second, imagine that there were 10 questions on the test — which means that 1,000 questions were asked of the students of each school*.

*10 questions on the test × 100 students per school = 1,000 questions asked per school

Given 10 questions on the test, expressing the correct answers from Schools B and G as both counts and fractions helps show that a school can lose the sub-group fractional battle, but win the total fractional war. Basically, the boys of School B answered so many questions correctly (as a count) that it didn’t matter that they (and their female colleagues) answered a lower fraction of questions correctly than School G’s boys and girls.

Therefore, School B answered 832 questions correctly, while School G answered only 818 questions correctly — meaning that School B’s students answered more actual questions correctly than School G’s students did.

This is despite School G’s boys and girls answering a greater fraction of the questions given to them, compared to the boys and girls of School B.
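The counts can be recomputed with a short Python sketch. The function name `correct_answers` and the data layout are assumptions of mine; the sketch treats each sub-group average as an aggregate over all of that sub-group's questions, on a 10-question test.

```python
QUESTIONS_PER_TEST = 10

# (number of students, average score in %) per sub-group, from the example.
schools = {
    "B": {"boys": (80, 84), "girls": (20, 80)},
    "G": {"boys": (20, 85), "girls": (80, 81)},
}

def correct_answers(students, average_percent):
    """Count of questions answered correctly by a whole sub-group."""
    questions_asked = students * QUESTIONS_PER_TEST
    return questions_asked * average_percent // 100  # exact for these figures

totals = {}
for name, groups in schools.items():
    counts = {g: correct_answers(n, pct) for g, (n, pct) in groups.items()}
    totals[name] = sum(counts.values())
    print(name, counts, totals[name])
# B {'boys': 672, 'girls': 160} 832
# G {'boys': 170, 'girls': 648} 818
```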

Critical to the resolution of Simpson’s Paradox, is that while:

  • School G’s boys scored higher than School B’s boys from a fractional perspective (85% vs. 84%), but they lost badly from a count perspective (170 vs. 672).
  • School G’s girls beat School B’s girls from both a fractional (81% vs. 80%) and count (648 vs. 160) perspective, but their surplus of 488 correct answers didn’t make up the deficit of 502 created by School G’s boys against the boys of School B, leaving School G 14 correct answers behind.

In total, while School G’s students answered a higher fraction of the questions presented to them within each sub-group, overall they answered a lower count of questions correctly.
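To see where the 14-answer gap comes from, here is a minimal Python sketch of the sub-group differences, assuming the 10-question test and the correct-answer counts worked out from the example's figures. The variable names are mine.

```python
# Correct-answer counts per sub-group on a 10-question test.
school_b = {"boys": 672, "girls": 160}
school_g = {"boys": 170, "girls": 648}

# School G's lead (+) or deficit (-) in each sub-group, as a count.
difference = {group: school_g[group] - school_b[group] for group in school_b}
net = sum(difference.values())

print(difference)  # {'boys': -502, 'girls': 488}
print(net)         # -14
```

The girls' surplus is large, but the boys' deficit is larger, so the net count still favours School B.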

It is critical to remember that, since fractions are derived from an actual count of things, you cannot freely change the count you are calculating from without affecting the fraction. The concrete count of things comes before the abstract fraction of things.

Conclusion — Which School Is Actually Better?

So, which school would you rather go to?

  • Where, as an individual boy or girl, you would score the highest on average (the greatest fraction of correct answers): then you should attend School G.
  • Where you are part of the highest-scoring school overall (the greatest count of correct answers): then you should attend School B.

Which criterion matters more to you is an individual choice: there is no algorithm for deciding whether the count or the fraction of things is more important, and so the decision depends on your judgement.

And judgement depends on your perspective.

Written by John Jordan

I’m a digital designer & developer with a background in pharmacy and the life sciences. Passionate about making sense of things and sharing what I’ve learnt.
