Homer Simpson is not the author of this paradox, but he suffers from it: he thinks he sees four beers when in reality he is being misled by his simple data analysis.
If you’re starting to dive into Data Science, sooner rather than later you will have to dip your toes into some Statistics. According to Joel Grus, in Data Science from Scratch, “Statistics refers to the mathematics and techniques with which we understand data”. We use Statistics to persuade others to believe what we’re telling them. Most people, organisations and governments base their most important decisions on this organised data. Nevertheless, if you don’t look closely enough at your data, you can easily be misled by it.
Simpson’s Paradox occurs when separate groups of data each show a particular trend, but this trend is reversed when the groups are combined.
Let’s learn by looking at an example!
# Choosing between two hospitals
Imagine you have an elderly relative who needs to go to the hospital for emergency surgery, and you have to choose between two facilities (Hospital A or Hospital B). According to the latest data available, of the last 1000 patients who went through surgery at each hospital, 900 survived at Hospital A whereas only 800 survived at Hospital B.
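The overall comparison is just two divisions. Here is a minimal sketch in Python, assuming (as the figures above imply) that each hospital treated 1000 patients; the variable names are only illustrative:

```python
# Overall figures from the example: (survivors, patients) per hospital.
# Assumption: both hospitals treated 1000 patients each.
overall = {"A": (900, 1000), "B": (800, 1000)}

# Survival rate = survivors / patients, as a fraction between 0 and 1.
overall_rates = {
    hospital: survived / patients
    for hospital, (survived, patients) in overall.items()
}

for hospital, rate in overall_rates.items():
    print(f"Hospital {hospital}: {rate:.1%} overall survival")
```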
So the first assumption would be to choose Hospital A over Hospital B, right? Wrong…
To begin with, we cannot assume that all patients arrived at the hospital with the same level of health. So we can start by dividing our results into patients who arrived in good health and patients who arrived in poor health, and a different picture begins to emerge.
In the case of Hospital A, of the 100 patients who arrived in poor health, 30 survived (a survival rate of 30%). Hospital B, on the other hand, had 400 patients arriving in poor health, of whom 210 survived. Thus, we can conclude that Hospital B, with a survival rate of 52.5%, is the better choice for patients who arrive in poor health.
Ok, that’s if the patient is in poor health… what about if the patient is in good health?
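Curiously, Hospital B continues to be the better choice, with a survival rate of 98%! Given the totals above, Hospital A must have treated 900 patients in good health, of whom 870 survived (about 97%), while 590 of Hospital B’s 600 good-health patients survived. So, now comes the million-dollar question: how can Hospital A have a better overall survival rate if Hospital B has a better survival rate for patients in each of the two groups!?
Congratulations, you’ve now understood Simpson’s Paradox! The same set of data appears to show opposite trends depending on how it is grouped. This occurs when the data hides a conditional variable (often called a lurking or confounding variable), which is a factor that significantly influences the results. Here the conditional variable is the patient’s health when arriving at the hospital: Hospital B treated far more patients in poor health (400 against 100), and that pulls its overall survival rate down.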
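To see the reversal in numbers, here is a short Python sketch. The totals and the poor-health counts come from the example; the good-health counts are obtained by subtracting one from the other, assuming every patient falls into exactly one of the two groups. The helper names `rate` and `better` are hypothetical, chosen just for this illustration.

```python
# (survivors, patients) per hospital.
# Totals and poor-health figures are taken from the example.
totals = {"A": (900, 1000), "B": (800, 1000)}
poor = {"A": (30, 100), "B": (210, 400)}

# Assumption: each patient is in exactly one group, so
# good health = totals minus poor health.
good = {
    h: (totals[h][0] - poor[h][0], totals[h][1] - poor[h][1])
    for h in totals
}


def rate(survived, patients):
    """Survival rate as a fraction between 0 and 1."""
    return survived / patients


def better(group):
    """Return the hospital with the higher survival rate in this group."""
    return max(group, key=lambda h: rate(*group[h]))


for name, group in [("poor health", poor), ("good health", good), ("combined", totals)]:
    rates = ", ".join(f"{h}: {rate(*group[h]):.1%}" for h in sorted(group))
    print(f"{name:12} {rates} -> better: {better(group)}")

# The overall rate is a weighted average of the two group rates, weighted by
# the share of poor-health and good-health patients each hospital treated.
for h in sorted(totals):
    share_poor = poor[h][1] / totals[h][1]
    weighted = share_poor * rate(*poor[h]) + (1 - share_poor) * rate(*good[h])
    print(f"Hospital {h}: {share_poor:.0%} poor-health patients, weighted rate {weighted:.1%}")
```

The last loop is where the paradox comes from: 40% of Hospital B's patients arrived in poor health against 10% of Hospital A's, so the low-survival group weighs much more heavily in Hospital B's overall figure, even though Hospital B does better within each group.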
So, in the case of Homer Simpson, his conditional variable was possibly the fact that he was already under the influence of alcohol, and therefore what he was seeing was not actually true.
This text was based on Mark Liddell’s YouTube video; visit it to see more awesome examples.
Don’t forget: if you liked it, please give it some applause!