Data Fallacies to Avoid

Jonathan Kurniawan
Data Analytics @ Hult
4 min read · Apr 16, 2018

As powerful as data is in providing insights and informing decisions, every data leader should be aware of the ways data can be misunderstood, whether through poor statistical training or the cognitive biases we already carry.

Here are some common examples:

Cognitive biases

McNamara Fallacy: Relying solely on metrics in complex situations and losing sight of the bigger picture. A.k.a. the quantitative fallacy.

Danger of summary metrics: Only looking at summary metrics and missing big differences in raw data.

Example of both the McNamara fallacy and the danger of summary metrics: suppose a business, through its promotions, has increased its number of email subscribers. On the surface that looks like a great sign, but not necessarily. What if the number of people who never open the emails also increased? If the emails aren't being read, the higher subscriber count is worthless, and tracking that metric blindly could be misleading.
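
A minimal sketch of how this can play out, with made-up subscriber and open counts (all numbers are hypothetical):

```python
# Hypothetical email data: the summary metric (subscribers) doubles,
# while the raw behaviour (opens) stays flat.
import pandas as pd

emails = pd.DataFrame({
    "month":       ["Jan", "Feb", "Mar", "Apr"],
    "subscribers": [10_000, 12_000, 15_000, 20_000],
    "opens":       [4_000, 4_100, 4_050, 3_900],
})

emails["open_rate"] = emails["opens"] / emails["subscribers"]
print(emails)
# Subscribers are up 2x, yet the open rate falls from 40% to under 20%:
# the "growth" is mostly people who never read the emails.
```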

Publication Bias: Interesting research findings are more likely to be published, distorting our impression of reality.

Example: The news reports only high-profile crimes, so people's perception of safety is skewed toward feeling insecure, even though far more peaceful, uneventful days pass in the city.

Hawthorne Effect: The act of monitoring someone can affect their behavior, leading to spurious findings. A.k.a. the observer effect.

Example: Observing manufacturing workers during a QA visit improves their quality, but only for as long as they know they are being watched.

Cobra Effect: Setting an incentive that accidentally produces the opposite of the intended result. A.k.a. perverse incentive.

Example: The odd-even license plate scheme in Jakarta, Indonesia was intended to reduce traffic on main streets during workdays, but the unintended consequence was Jakartans buying a second car so that they owned both an odd and an even plate, aggravating the city's traffic problem.

Survivorship Bias: Drawing conclusions from an incomplete set of data, because that data has 'survived' some selection criterion.

Example: Let's say you just joined a gym. Every time you're there, you see the same fit and motivated faces. After a few days you feel discouraged because you couldn't stick to the schedule and stay motivated while the others at the gym could. You begin to feel like you're below the average gym-goer, but you fail to realize how many more people stopped coming to the gym altogether; your comparison is biased toward those who 'survive' and keep showing up.
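
A minimal sketch of the same effect, with made-up attendance habits:

```python
# Hypothetical gym: 20% of members are regulars who show up 80% of days,
# the other 80% have lapsed and show up only 5% of days.
import numpy as np

rng = np.random.default_rng(5)
n_members = 10_000

regular = rng.random(n_members) < 0.20
consistency = np.where(regular, 0.80, 0.05)

# Who do you actually see at the gym on a given day? Mostly the regulars.
at_gym_today = rng.random(n_members) < consistency

print(f"Average attendance rate, all members:    {consistency.mean():.0%}")                # ~20%
print(f"Average attendance rate, people you see: {consistency[at_gym_today].mean():.0%}")  # ~65%
```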

Statistical fallacies

Cherry Picking: Selecting results that fit your claim, and excluding those that don’t.

Data Dredging: The process of 'fishing' for patterns in data that can be presented as statistically significant, without first devising a specific hypothesis about the underlying causality. A.k.a. p-hacking.

A really good example of data dredging: https://fivethirtyeight.com/features/science-isnt-broken/#part1
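
A minimal simulation of the idea (assuming NumPy and SciPy are available): run enough tests on pure noise and a few will come out 'significant' by chance alone.

```python
# Run many tests on pure noise: with a 0.05 threshold, about 5 in 100
# comparisons will look "significant" even though no real effect exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_tests, alpha = 100, 0.05
false_positives = 0

for _ in range(n_tests):
    # Two groups drawn from the same distribution: there is nothing to find.
    group_a = rng.normal(size=30)
    group_b = rng.normal(size=30)
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < alpha:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests came out 'significant' at p < {alpha}")
# Report only those few and you have manufactured a "finding" out of noise.
```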

False Causality: Falsely assuming that when two events appear related, one must have caused the other.

Example: “Every day, I eat cereal for breakfast. One time, I had a muffin instead, and there was a major earthquake in my city. I’ve eaten cereal ever since.”

Gerrymandering: Manipulating the geographical boundaries used to group data in order to change the result. Here's YouTuber CGP Grey explaining it further: https://www.youtube.com/watch?v=Mky11UJb9AY
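
A tiny toy illustration of the idea, with made-up votes: the same 15 voters, drawn into three districts two different ways, hand the most districts to different parties.

```python
# Party A has 9 of 15 votes (60%), party B has 6 (40%); the district
# boundaries decide who "wins" the most districts.
from collections import Counter

def district_winner(district):
    # The party with the most votes in a district wins that district.
    return Counter(district).most_common(1)[0][0]

def seats_won(plan):
    return Counter(district_winner(d) for d in plan)

plan_even   = [["A", "A", "A", "B", "B"]] * 3          # B's votes spread thin
plan_skewed = [["A", "A", "B", "B", "B"],              # B's votes concentrated
               ["A", "A", "B", "B", "B"],
               ["A", "A", "A", "A", "A"]]

print("Evenly drawn districts:", dict(seats_won(plan_even)))    # {'A': 3}
print("Redrawn districts:     ", dict(seats_won(plan_skewed)))  # {'B': 2, 'A': 1}
```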

Sampling Bias: Drawing conclusions from a set of data that isn’t representative of the population you’re trying to understand.

Example: Interviewing students at an international school about their preferred messaging platform, and using that insight to claim that 'students in the U.S. use WeChat the most'. This overlooks the fact that the initial sample did not represent the population.
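
A minimal sketch with made-up usage rates, to show how far a biased sample can drift from the population:

```python
# Hypothetical numbers: 5% of students attend an international school, where
# WeChat usage is 80%; among everyone else it is only 5%.
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

international = rng.random(N) < 0.05
uses_wechat = np.where(international, rng.random(N) < 0.80, rng.random(N) < 0.05)

print(f"WeChat share in the whole population:         {uses_wechat.mean():.0%}")                # ~9%
print(f"WeChat share in an international-only sample: {uses_wechat[international].mean():.0%}")  # ~80%
```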

Gambler’s Fallacy: Mistakenly believing that because something has happened more frequently than usual, it’s now less likely to happen in future (and vice versa).

Example: The past eight coin tosses have all been tails, so heads is 'due' now. In reality, the probability that the next toss lands heads is still 50-50, regardless of the previous tosses.
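
A quick simulation (assuming a fair coin): even looking only at flips that come right after a long run of tails, heads still shows up about half the time.

```python
# Simulate fair coin flips and condition on the previous eight being tails.
import numpy as np

rng = np.random.default_rng(1)
flips = rng.integers(0, 2, size=500_000)   # 0 = tails, 1 = heads
streak = 8

next_flips = []
for i in range(streak, len(flips)):
    if not flips[i - streak:i].any():      # the previous 8 flips were all tails
        next_flips.append(flips[i])

print(f"Runs of {streak} tails found: {len(next_flips)}")
print(f"P(heads | {streak} tails in a row) = {np.mean(next_flips):.3f}")   # about 0.5
```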

Regression Fallacy: When something unusually good or bad happens, it tends to revert back towards the average over time; the fallacy is attributing that reversion to a specific cause.

Example: Your favorite sports player, with no new training or strategy, suddenly wins more games than expected. This could simply be 'luck', and after a while their results will drift back to their average capability.
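
A minimal simulation of the same idea, with made-up skill and luck numbers:

```python
# Performance = fixed skill + random luck. Players picked for an unusually
# good season look worse the next season, with no change in skill at all.
import numpy as np

rng = np.random.default_rng(7)
n_players = 1_000
skill = rng.normal(loc=50, scale=10, size=n_players)       # true, unchanging ability
season_1 = skill + rng.normal(scale=10, size=n_players)    # skill + luck
season_2 = skill + rng.normal(scale=10, size=n_players)    # same skill, fresh luck

stars = season_1 > np.percentile(season_1, 90)              # season-1 over-performers
print(f"Season 1 average of the 'stars':      {season_1[stars].mean():.1f}")   # roughly 75
print(f"Season 2 average of the same players: {season_2[stars].mean():.1f}")   # roughly 62
```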

Simpson’s Paradox: When a trend appears in different subsets of data but disappears or reverses when groups are combined. Here’s a great example of Simpson’s Paradox in the U.S. median salary: http://blog.revolutionanalytics.com/2013/07/a-great-example-of-simpsons-paradox.html
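
And a tiny made-up admissions table showing the same reversal:

```python
# One group has the higher acceptance rate within every department,
# yet the lower acceptance rate once the departments are combined.
import pandas as pd

df = pd.DataFrame({
    "department": ["A", "A", "B", "B"],
    "group":      ["men", "women", "men", "women"],
    "applied":    [100, 20, 20, 100],
    "accepted":   [60, 14, 4, 25],
})

df["rate"] = df["accepted"] / df["applied"]
print(df)   # women ahead in department A (70% vs 60%) and B (25% vs 20%)

overall = df.groupby("group")[["applied", "accepted"]].sum()
overall["rate"] = overall["accepted"] / overall["applied"]
print(overall)   # yet men ahead overall (~53% vs ~33%)
```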

Overfitting: Creating a model that is overly tailored to the data you already have, and not representative of the general trend.

Example: This is a common problem in data models and machine learning; be suspicious when a model reports 100% accuracy. It usually means the model has optimized for the training data set rather than for the real world.
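
A minimal sketch of that '100% accurate' warning sign, assuming scikit-learn is available and using a nearest-neighbour regressor purely as an illustration: a model that memorizes its training data scores perfectly on it, yet worse on unseen data than a smoother model does.

```python
# k=1 nearest neighbours memorizes the training set; k=15 averages over
# neighbours and captures the general trend instead.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.5, size=200)   # noisy underlying trend

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:>2}: train R^2 = {model.score(X_train, y_train):.2f}, "
          f"test R^2 = {model.score(X_test, y_test):.2f}")
# k=1 scores a perfect 1.00 on the training data yet generalizes worse than k=15.
```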

It's also important to know that the model's "concepts" (the relationship between inputs and outputs) can change over time and need to be accounted for, a problem known as concept drift: https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/

So the next time you're making a data-driven decision, keep some of these in mind and mitigate the risk of being misled by these fallacies.
