How UC Berkeley Almost Got Sued for Sex Discrimination… Was the Data Lying?
This weekend, I was helping a friend with his startup. He was frustrated: after making a few adjustments based on the data he had collected, his profits had gone down instead of up.
Since he had "followed the data", he was convinced he had made the right decision. He asked me to help him figure out what went wrong.
“His data was LYING to him”
Turns out, his data really was "lying" to him. This wasn't a case of garbage in, garbage out, but one of those rare, tricky situations where your data can steer you toward the opposite of the right decision.
Data, data, data… we are becoming a society obsessed with data! In important decision-making meetings, someone will inevitably ask, "Well, what does the data say?"
Being "data-informed" is all well and good. But taking data at face value to drive decisions can be dangerous. In this post, we will discuss one of the ways data can trick you into making the wrong decision: Simpson's Paradox.
(Psst…This article is illustrated with an Animated Video for easier understanding)
Simpson's Paradox
In 1973, UC Berkeley was sued for sex discrimination. Of all the female students who applied, only 35% were admitted, while 44% of all the male students who applied were admitted.
The Witch Hunt Was On!
The data raised a lot of eyebrows, and the witch hunt was on! UC Berkeley set out to find the main culprits of this gender discrimination. To do so, they broke the data open to see which departments were responsible for the bias. Here is what they found:
Now this is where the data gets funny. Broken down by department, the numbers tell a different story: 4 of the 6 departments admitted women at a higher rate than men. There definitely was a gender bias, but it was in favour of women, not against them!
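The reversal can be reproduced in a few lines of code. A minimal sketch using the published figures for the six largest departments (Bickel, Hammel & O'Connell, 1975); the admitted counts are reconstructed from the published percentages, so the aggregate here (roughly 45% vs. 30%) differs slightly from the campus-wide 44% vs. 35% quoted above, which covered every department:

```python
# Admissions at the six largest Berkeley departments, fall 1973.
# Format: dept -> (men_applied, men_admitted, women_applied, women_admitted)
DEPTS = {
    "A": (825, 512, 108, 89),
    "B": (560, 353, 25, 17),
    "C": (325, 120, 593, 202),
    "D": (417, 138, 375, 131),
    "E": (191, 53, 393, 94),
    "F": (373, 22, 341, 24),
}

def rate(admitted, applied):
    return admitted / applied

# Aggregate acceptance rates across all six departments: men come out ahead
men_rate = rate(sum(d[1] for d in DEPTS.values()), sum(d[0] for d in DEPTS.values()))
women_rate = rate(sum(d[3] for d in DEPTS.values()), sum(d[2] for d in DEPTS.values()))
print(f"Aggregate: men {men_rate:.0%}, women {women_rate:.0%}")

# Per-department rates: in most departments, women come out ahead
women_favoured = [d for d, (ma, mad, wa, wad) in DEPTS.items()
                  if rate(wad, wa) > rate(mad, ma)]
print("Departments favouring women:", women_favoured)
```

Same data, two opposite conclusions, depending only on whether you pool the departments.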
Why did the aggregated data tell a different story?
This is a classic case of Simpson's Paradox: the aggregated data tells the opposite story of the disaggregated data. It happens because of a confounding factor hidden from sight WITHIN the data.
So what's this "hidden factor" that's causing all the mischief? Take a look at the first and last rows of the table:
You'll notice that Department A has a pretty high acceptance rate, especially for women at 82%! However, of the 4,000+ women who applied, only 108 applied to Department A. That's only about 2% of all the women who applied across departments.
On the other hand, 825 men applied to Department A. That's about 10% of all the male applicants. You may have already spotted the mischief, but let's go on.
Now take a look at the last row. Again, women have a higher acceptance rate than men. But Department F, in contrast to Department A, has a very LOW acceptance rate. And this is where it all goes wrong.
Compared to the men, a much larger share of the women applied to this low-acceptance department: around 8% of all women, versus roughly 4% of all men.
So in truth, women weren't being discriminated against. It just so happened that a large proportion of women applied to a low-acceptance-rate department, while a large proportion of men applied to a high-acceptance-rate department. That skewed the overall results.
This sort of data mischief, Simpson's Paradox, can happen anywhere, including in businesses that use data to make decisions.
Here's a business-case example. A CEO and his team were deliberating whether to run a Single-Click advertisement campaign or a Double-Click campaign. That's when the marketing manager, who happened to favour the Double-Click campaign, showed him some data:
Single-Click had more users allocated to it, and thus more total revenue, but the RPM (revenue per thousand users) was higher for Double-Click. Looking at this data, the decision seems obvious: Double-Click is generating more money per user, so they should go with Double-Click, correct?
Turns out, picking the Double-Click campaign would have been a costly mistake. Let's break the data open again, into its subgroups of international users and local users:
Suddenly, the data tells a different story. Single-Click is outperforming Double-Click in both subgroups, Local and International. How is this possible?
Simpson's Paradox is at play again: the aggregated data hides a factor that tells the opposite story of the disaggregated data.
In this case, the hidden factor was the allocation of users: only 33% of international users were shown the Double-Click page, while 58% of local users were. And in general, local users had a much higher RPM than international users. Because the Double-Click group contained a much higher proportion of these high-RPM local users, its overall RPM was inflated.
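To make the arithmetic concrete, here is a minimal sketch. The user counts and revenues are hypothetical (the article's actual table is not reproduced here), chosen only to match the pattern described: local users have a higher RPM, and a larger share of them saw the Double-Click page.

```python
# Hypothetical campaign data: (users, revenue in $) per (region, campaign).
# Local users monetise better, and Double-Click skews toward local users.
data = {
    ("local", "single"):         (4_200, 420.0),   # RPM = 100
    ("local", "double"):         (5_800, 522.0),   # RPM = 90
    ("international", "single"): (6_700, 268.0),   # RPM = 40
    ("international", "double"): (3_300,  99.0),   # RPM = 30
}

def rpm(users, revenue):
    """Revenue per thousand users."""
    return revenue / users * 1000

# Per-subgroup RPM: Single-Click beats Double-Click in BOTH segments
for region in ("local", "international"):
    s = rpm(*data[(region, "single")])
    d = rpm(*data[(region, "double")])
    print(f"{region}: single RPM {s:.0f} vs double RPM {d:.0f}")

# Pooled RPM: Double-Click looks better once the segments are merged
def overall(campaign):
    users = sum(u for (_, c), (u, _) in data.items() if c == campaign)
    rev = sum(r for (_, c), (_, r) in data.items() if c == campaign)
    return rpm(users, rev)

print(f"overall: single {overall('single'):.0f} vs double {overall('double'):.0f}")
```

The pooled comparison flips because it silently weights each campaign's RPM by its mix of cheap and expensive users.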
Phew, that was a tricky example. Take a minute to analyse the data. Simpson's Paradox can be sneaky; the key is to look out for any hidden variables that may be influencing your data!
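One practical way to "look out for hidden variables" is to routinely compare the pooled result against the subgroup results. A hypothetical helper sketch (the function name and data shapes are mine, not from the article), demonstrated on the classic kidney-stone treatment figures often used to illustrate the paradox:

```python
def simpson_reversal(groups):
    """Flag a Simpson-style reversal.

    groups: dict mapping subgroup -> ((succ_a, n_a), (succ_b, n_b)),
    i.e. successes and trials for options A and B in each subgroup.
    Returns True when one option wins every subgroup yet loses overall.
    """
    a_wins = [sa / na > sb / nb for (sa, na), (sb, nb) in groups.values()]
    pooled_a = (sum(sa for (sa, _), _ in groups.values())
                / sum(na for (_, na), _ in groups.values()))
    pooled_b = (sum(sb for _, (sb, _) in groups.values())
                / sum(nb for _, (_, nb) in groups.values()))
    if all(a_wins) and pooled_a < pooled_b:
        return True   # A wins every subgroup but loses in the pooled data
    if not any(a_wins) and pooled_a > pooled_b:
        return True   # A loses every subgroup but wins in the pooled data
    return False

# Kidney-stone example: treatment A wins for both small and large stones,
# yet treatment B wins once the stone sizes are pooled.
groups = {
    "small stones": ((81, 87), (234, 270)),
    "large stones": ((192, 263), (55, 80)),
}
print(simpson_reversal(groups))  # True
```

If this check ever fires on your own metrics, resolve which level of analysis answers your actual question before acting on either number.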
So don't trust your data blindly. If something smells fishy, look into it.
If you liked this post, please ❤ and follow! Also, check out the Skip-MBA Reading List for a curation of 30 Books & 20 Free Courses — Categorized for your convenience!
About the Author: Shawn Dexter started out as a janitor and high-school dropout, but self-educated his way to a six-figure salary and an MSc degree.
Shawn is an entrepreneur, product manager, and former software developer. He was ready to pursue an MBA at a top business school, but after extensive research he decided to self-study his MBA instead. You can join Shawn on his journey to a self-directed MBA at http://SkipMBA.com