How Statistics Can Lie: A Guide to Avoiding Data Misinterpretation

Brianomukhulu
7 min read · Sep 8, 2023


Photo by Claudio Schwarz on Unsplash

Numbers don't lie, or so they say. In a world where data is king, it is easy to overlook the possibility of misleading data. After all, who can argue with the cold, hard facts presented through meticulously crafted graphs and charts? But there's a catch, a twist as old as numbers themselves: while the numbers may remain steadfast, the narratives woven around them can be as malleable as clay. It's a paradox we can't ignore, an enigma encapsulated in the age-old adage, 'Statistics can lie.'

Causation and Correlation

Causation and correlation are fundamental concepts in data analysis, each playing a distinct role in understanding the relationships between variables.

Correlation is like a statistical flirtation between two variables. It means that variations in one variable tend to be accompanied by variations in the other. In other words, when one changes, the other tends to change as well, either in the same direction (positive correlation) or in the opposite direction (negative correlation).

It's like saying, "These two things seem to go together, but let's not jump to conclusions just yet."

Scatter plots are a simple way to visualize the strength and direction of the correlation between two variables.
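As a quick illustration, here is a minimal sketch using made-up data (it assumes NumPy and Matplotlib are available): a scatter plot paired with the Pearson correlation coefficient, which runs from +1 (perfect positive) to -1 (perfect negative).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical data: hours of exercise per week vs. resting heart rate
hours_exercise = rng.uniform(0, 10, size=100)
resting_hr = 75 - 2 * hours_exercise + rng.normal(0, 5, size=100)  # negative trend plus noise

# Pearson correlation coefficient
r = np.corrcoef(hours_exercise, resting_hr)[0, 1]

plt.scatter(hours_exercise, resting_hr, alpha=0.6)
plt.xlabel("Hours of exercise per week")
plt.ylabel("Resting heart rate (bpm)")
plt.title(f"Pearson r = {r:.2f}")
plt.show()
```

A strongly negative r here only tells us the two measurements move together; it says nothing about which one, if either, is doing the causing.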

Contrarily, causation asserts a cause-and-effect relationship in which a specific action or situation results in a known or expected outcome.

False causality occurs when it is assumed that the trend in one variable is causing the movement in another variable, even though there may be no causal connection between them.

Therefore, it is essential to remember that correlation does not imply causality. This expression serves as a sobering reminder that just because two variables are connected, it does not follow that one is the cause of the other.

It means that even though it can seem as though A and B are connected, there could be several other factors at play:

1. A causes B (Causation): In this scenario, changes in A directly result in changes in B. For example, increased physical activity (A) results in weight loss (B).

2. B causes A (Reverse Causation): Here, the relationship runs the other way; changes in B produce changes in A. For example, weight gain (B) leads to decreased physical activity (A).

3. Both influence each other in a feedback loop (bi-directional causality): For example, stress (A) leads to sleep disturbances (B), which, in turn, exacerbate the stress.

4. C is a common causal variable driving both A and B: In some circumstances, the changes seen in both A and B may be caused by an unobserved third variable, C. For instance, hot weather may drive both an increase in ice cream sales (A) and an increase in drownings (B), a relationship simulated in the short sketch after this list.

5. Coincidental relationship between A and B (Spurious Correlation): In this case, A and B are associated only by coincidence and have no underlying causal link. It's similar to discovering a seemingly significant trend among unconnected data. Check out this website that collects some genuinely bizarre cases of spurious correlation.

An example of a Spurious Correlation
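To see how a lurking third variable can manufacture a convincing correlation, here is a minimal sketch with invented numbers in the spirit of the ice-cream-and-drownings example: temperature drives both series, yet neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# C: daily temperature, the hidden common cause (made-up values)
temperature = rng.uniform(15, 35, size=365)

# A and B both depend on temperature, but not on each other
ice_cream_sales = 20 * temperature + rng.normal(0, 30, size=365)
drownings = 0.3 * temperature + rng.normal(0, 1, size=365)

# A and B end up strongly correlated anyway
r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
print(f"Correlation between ice cream sales and drownings: {r:.2f}")
# Expect a strong positive r (roughly 0.8 with these settings), even though
# banning ice cream would not prevent a single drowning.
```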

Deceptive Graphs

The power of a visual representation is hard to overstate. However, it's crucial to acknowledge that a graphic can mislead its viewers, sometimes deliberately and sometimes unintentionally. This distinction between deception and misinformation is essential.

Professional codes of ethics make it abundantly clear that concealing the truth or presenting it in a distorted way is unacceptable.

Your audience — the ones who will interpret your graph — is essential when creating a data visualization. Creating graphs isn't merely about showcasing data; it's about ensuring your audience can understand it accurately.

There are classic ways in which graphs can mislead, such as:

  1. Manipulating the vertical scale by making it too large or too small,
  2. Skipping numbers or failing to start the axis at zero,
  3. Labelling axes improperly or omitting crucial data.

Yet, beyond these classic pitfalls, real-life misleading graphs can take various forms. Some are designed to deceive, while others aim to shock. In certain instances, even well-intentioned individuals can inadvertently create misleading visuals. These real-world examples underscore the importance of responsible data visualization and the need to foster a critical eye when interpreting visual representations of data.

By Joxemai, CC BY-SA 3.0

Example of a truncated (left) vs. full-scale graph (right) using the same data
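To reproduce the effect for yourself, here is a minimal Matplotlib sketch with invented numbers: the same two values look dramatically different when the y-axis starts near the data instead of at zero.

```python
import matplotlib.pyplot as plt

# Hypothetical figures: two nearly identical quarterly results
labels = ["Q1", "Q2"]
values = [102, 105]

fig, (ax_truncated, ax_full) = plt.subplots(1, 2, figsize=(8, 4))

# Left: a truncated axis exaggerates a ~3% difference
ax_truncated.bar(labels, values)
ax_truncated.set_ylim(100, 106)
ax_truncated.set_title("Truncated axis (misleading)")

# Right: an axis starting at zero shows the difference in proportion
ax_full.bar(labels, values)
ax_full.set_ylim(0, 120)
ax_full.set_title("Full scale (honest)")

plt.tight_layout()
plt.show()
```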

Prosecutor's Fallacy: Conditional Probability in the Courtroom

The Prosecutor's Fallacy is a legal misstep involving the misuse of Bayes' Theorem within the courtroom setting. Instead of properly assessing the probability that the defendant is innocent given all the evidence, the prosecution, judge, and jury often make the error of asking how probable the evidence would be if the defendant were innocent. In essence, they confuse two different conditional probabilities:

1. P(defendant is innocent | all the evidence), which is what should be assessed

2. P(all the evidence | defendant is innocent), which is what is actually asked

This approach overlooks alternative explanations and the inherent probability of guilt or innocence. In the legal context, the prior probability, which is the initial probability of guilt or innocence before the new evidence is taken into account, plays a crucial role. Ignoring this base rate is a common mistake.

Bayes' theorem demonstrates how these conditional probabilities are interconnected when incorporating new information into the prior probability.
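As a rough numerical sketch (all figures below are invented purely for illustration), suppose a piece of evidence would match only 1% of innocent people. That sounds damning, yet if the pool of plausible suspects is large, the probability of innocence given the match can still be overwhelming:

```python
# Hypothetical numbers, purely to illustrate Bayes' theorem
p_evidence_given_guilty = 1.0      # the guilty person would certainly leave this evidence
p_evidence_given_innocent = 0.01   # 1% of innocent people would also match
prior_guilty = 1 / 10_000          # one culprit among 10,000 plausible suspects
prior_innocent = 1 - prior_guilty

# Bayes' theorem: P(guilty | evidence)
posterior_guilty = (p_evidence_given_guilty * prior_guilty) / (
    p_evidence_given_guilty * prior_guilty
    + p_evidence_given_innocent * prior_innocent
)

print(f"P(evidence | innocent) = {p_evidence_given_innocent:.2%}")
print(f"P(guilty | evidence)   = {posterior_guilty:.2%}")  # about 1%, so P(innocent | evidence) is about 99%
```

The fallacy is reading the 1% in the first line as if it were the probability of innocence, when the second line shows that, on this evidence alone, the defendant is still overwhelmingly likely to be innocent.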

Illustrating the impact of the Prosecutor's Fallacy, a significant 2010 case saw a convicted murderer, known as "T," appeal his conviction based on a shoeprint found at the crime scene, which appeared to match a pair of Nike trainers discovered at his residence. To correctly employ Bayes' Theorem, a meticulous statistical analysis should have assessed the likelihood of the crime scene shoeprint originating from the same Nike trainers found at the suspect's home, considering factors like shoe prevalence, size, wear, and damage. However, the expert witness could not pinpoint the exact prevalence: roughly 786,000 pairs of Nike trainers had been distributed in the UK between 1996 and 2006, alongside numerous sole patterns and millions of sports shoes sold every year. The conviction was overturned, and notably, the judge banned similar statistical analyses in future court cases.

Simpson's Paradox

Simpson's Paradox is like a magician's trick played by data, revealing the vital lesson of scepticism in data interpretation and the peril of simplifying complex realities through a narrow data lens.

A legendary case of Simpson's Paradox revolves around UC Berkeley's perceived gender bias in 1973. At first glance, the university seemed to admit 44% of male applicants but only 35% of female applicants, sparking allegations of gender discrimination. Although not sued, UC Berkeley, fearing legal trouble, called in statistician Peter Bickel to scrutinize the numbers. What Bickel unearthed was astonishing: of the six academic departments examined, four exhibited a significant bias in favour of women, while the remaining two showed no such preference.

The twist in the tale emerged when Bickel's team dug deeper. They discovered that women tended to apply to departments with lower overall acceptance rates. This seemingly innocuous hidden variable skewed the aggregate acceptance figures, reversing the apparent gender bias in the overall data. The entire narrative flipped once the university's departmental divisions were taken into account.

By Pace~svwiki, CC BY-SA 4.0

An illustration showing how Simpson's Paradox in real-world-like data can obscure authentic causal relationships, highlighting the risk of misjudgment.
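Here is a minimal sketch with invented admission counts (not the real Berkeley figures, and assuming pandas is available) that reproduces the paradox: each department admits women at a higher rate, yet the aggregate rate favours men.

```python
import pandas as pd

# Hypothetical admissions data (not the actual Berkeley numbers)
df = pd.DataFrame({
    "department": ["Easy", "Easy", "Hard", "Hard"],
    "gender":     ["Men", "Women", "Men", "Women"],
    "applied":    [80, 20, 20, 80],
    "admitted":   [60, 16, 4, 20],
})

# Per-department admission rates: women do better in both departments
by_dept = df.assign(rate=df["admitted"] / df["applied"])
print(by_dept[["department", "gender", "rate"]])

# Aggregate admission rates: men appear to do better overall
overall = df.groupby("gender")[["applied", "admitted"]].sum()
overall["rate"] = overall["admitted"] / overall["applied"]
print(overall)
# Men:   64 / 100 = 64%
# Women: 36 / 100 = 36%
# The reversal happens because women applied mostly to the "Hard" department.
```

The aggregate table tells the opposite story from the per-department table precisely because the department, a lurking variable, is tied to both gender and acceptance rate.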

The lesson here?

Flawed data analysis can seriously hinder a business's progress. Bad decisions, fueled by misunderstood data, are never a good thing for a company's growth. Embracing Simpson's Paradox helps businesses understand the limitations of their data, what's driving it, and the various variables at play, ultimately keeping bias at bay. It's like having a superpower in the world of data analysis!

Conclusion

The journey is as important as the destination in data analysis and interpretation. These concepts urge us to approach data with humility, scepticism, and an unwavering commitment to truth. They caution us against the allure of easy answers and remind us that pursuing knowledge is not a linear path but a labyrinth of complexities.

Data, in its raw form, is neutral.

The human interpretation, presentation, and manipulation of data can introduce bias, errors, or even intentional deception. Embracing these concepts isn't just a matter of academic interest; it has real-world implications. Incorrect data analysis can hinder progress and lead to misguided decisions, stunting a company's growth. Understanding data limitations, identifying driving factors, and acknowledging the intricate web of variables are essential for responsible data analysis.

As we venture forward in this data-rich world, let us not forget the valuable lessons learned here — that numbers don't lie, but the stories we tell with them can be as elusive as shadows in the dark.

Thank you for reading!

If you found this article insightful, don't forget to clap and leave a comment to let me know your thoughts.

