When Data Lies: Exploring the Enigmatic World of Data Science Paradoxes

Ahmad Suhail
Tensor Labs
Nov 6, 2023

Over the course of our careers, we have all been stuck on data we could not wrap our heads around. There was always a struggle to answer the question, “Why is it behaving like that?” Sometimes the answer is right in front of our eyes, yet we are so focused on other variables that we miss the little details. Such instances are called paradoxes.

According to the definition of paradox,

“a statement that is seemingly contradictory or opposed to common sense and yet is perhaps true”

Today I will introduce some of the paradoxes in data science, so you won’t be left scratching your head next time you face mind-boggling data.

  1. Simpson’s Paradox
  2. Berkson’s Paradox
  3. Accuracy Paradox

Simpson’s Paradox

Prepare to be amazed by the curious phenomenon known as Simpson’s paradox in the realm of data science! This paradox challenges our understanding of statistics and reveals the intricate relationship between population subsets and overall trends.

Imagine this: when examining data at a macro level, we may observe a certain trend or pattern. However, when we dive deeper into the specific subgroups within that population, an entirely different trend emerges. It’s as if reality itself is playing tricks on us!

Now imagine you are faced with a crucial decision: choosing between Hospital A and Hospital B for your elderly relative’s surgery. Naturally, you would want to make an informed choice based on each hospital’s survival rate.

Now, let’s dive into the numbers. Reviewing the last 1000 patients at each hospital, you find that 900 survived at Hospital A (90%), while 800 out of 1000 survived at Hospital B (80%). At first glance, Hospital A seems to have the higher success rate and looks like the obvious choice.

But hold on! Here comes Simpson’s paradox to turn your understanding upside down. When we dig deeper into the data and consider additional factors, a surprising revelation emerges. It turns out that when we break down the patient groups by age or severity of illness, a different picture emerges. There’s this additional hidden factor that we were overlooking when viewing the data at the macro level.

The hidden factor here is the condition patients arrive in. Hospital A had only 100 patients who arrived in poor health, of whom 30 survived (30%). Hospital B, however, had 500 poor-health patients and saved 320 of them (64%). For an elderly relative in poor health, Hospital B is therefore the clear choice.
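The arithmetic behind this reversal is easy to check with a short script. This is a minimal sketch using the hypothetical figures from the example; the good-health counts are derived by subtracting the poor-health survivors from each hospital’s totals:

```python
# Hypothetical hospital data from the example above, as
# (survived, total) pairs per patient-condition subgroup.
# Overall survival favors Hospital A, but the poor-health
# subgroup favors Hospital B: Simpson's paradox.
patients = {
    "A": {"poor": (30, 100), "good": (870, 900)},
    "B": {"poor": (320, 500), "good": (480, 500)},
}

for hospital, groups in patients.items():
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    print(f"Hospital {hospital}: overall {survived / total:.0%}")
    for condition, (s, t) in groups.items():
        print(f"  {condition} health: {s / t:.0%}")
```

Running this shows Hospital A winning overall (90% vs. 80%) while losing badly on poor-health patients (30% vs. 64%), which is exactly the reversal the text describes.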

This paradox occurs when a trend that appears within subgroups reverses or disappears once the groups are combined.

So what does this mean? It means that relying solely on overall statistics can sometimes lead us astray. We need to consider all relevant factors and analyze the data from various angles before drawing any conclusions.

Simpson’s paradox serves as a powerful reminder that things are not always as they seem at first glance. It highlights the importance of thorough analysis and critical thinking in decision-making processes.

So next time you encounter seemingly contradictory data like this example with hospitals A and B, remember Simpson’s paradox and approach it with awe-inspired curiosity!

Let us take some more examples to understand the paradox better. In university admissions, applicants to the university’s graduate programs are classified based on gender and admissions outcome. These data would seem to be consistent with the existence of a gender bias because men were more likely to be admitted to graduate school than women. However, when the data is separated by department, it is revealed that women were more likely to apply to social science departments, which had a lower admission rate overall.
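The admissions example can be simulated the same way. The numbers below are hypothetical, invented purely to reproduce the shape of the reversal: women have the higher admission rate in every department, yet men have the higher aggregate rate, because women mostly applied to the more selective department:

```python
# Hypothetical admissions data, as (admitted, applied) pairs.
# Women out-admit men within each department, but the aggregate
# flips because applications are unevenly distributed.
admissions = {
    "engineering": {"men": (240, 400), "women": (65, 100)},
    "social science": {"men": (20, 100), "women": (100, 400)},
}

for gender in ("men", "women"):
    admitted = sum(d[gender][0] for d in admissions.values())
    applied = sum(d[gender][1] for d in admissions.values())
    print(f"{gender}: overall {admitted / applied:.0%}")
    for dept, rates in admissions.items():
        a, n = rates[gender]
        print(f"  {dept}: {a / n:.0%}")
```

Here men are admitted at 52% overall versus 33% for women, even though women lead 65% to 60% in engineering and 25% to 20% in social science.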

In politics, during Gerald Ford’s presidency from 1974 to 1978, the tax rate fell for every individual income group, yet the overall nationwide tax rate rose. Separated by income group, the data suggests Ford lowered taxes for everyone; aggregated, it suggests he raised them. What flips the aggregate is that taxpayers shifted into higher-rate income brackets over those years.

Simpson’s paradox showcases how seemingly contradictory conclusions can arise from statistical analysis. It reminds us that blindly relying on overall statistics without considering underlying factors can lead to misleading interpretations.

Berkson’s Paradox

Berkson’s paradox is a statistical phenomenon that challenges our general understanding of cause-and-effect relationships. It is named after the American statistician Joseph Berkson, who first described it in 1946.

At its core, Berkson’s paradox relates to collider bias, which occurs when conditioning on a common effect of two variables creates an artificial association between them. In other words, when we focus on a specific outcome or condition, it can lead to a distorted perception of the relationship between two independent factors.

Events that seem to be related may not be related at all!

To better understand this paradox, let’s consider an example: Suppose we are studying the relationship between smoking and obesity using only hospital patients. Both conditions independently raise the chance of being in hospital, so among hospital patients the two can appear connected even if they are unrelated in the general population. The observation is misleading because people with neither condition are largely absent from the sample.

Similarly, if we focus only on individuals who are admitted to hospitals due to accidents or injuries (a common effect), we might find an unexpected association between being accident-prone and having certain health conditions. This correlation arises due to collider bias rather than any true causal relationship.

Let’s look at another example: we have all heard that lazy people are “smarter.” But is that truly the case?

Imagine dividing a population of students into four groups by diligence and ability. If you attend a selective class, admission requires being hardworking or smart (or both), so the “lazy + poor result” group never makes it into the room. Looking only at your classmates, laziness appears to go hand in hand with being smart, a false impression created purely by selection.
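This selection effect is easy to reproduce in simulation. The sketch below uses entirely made-up traits and an arbitrary admission threshold: “smarts” and “diligence” are generated independently, yet conditioning on admission (the collider) induces a negative correlation between them:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
# Two independent traits: no true relationship between them.
smarts = [random.gauss(0, 1) for _ in range(10_000)]
diligence = [random.gauss(0, 1) for _ in range(10_000)]

# Admission is the collider: you get in if the traits sum high enough.
admitted = [(s, d) for s, d in zip(smarts, diligence) if s + d > 1.5]

print(f"whole population: r = {pearson(smarts, diligence):+.2f}")
print(f"admitted only:    r = {pearson(*zip(*admitted)):+.2f}")
```

In the full population the correlation is essentially zero; among the admitted it turns clearly negative, which is the “lazy classmates seem smart” illusion in numbers.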

Berkson’s paradox highlights the importance of considering all relevant factors and avoiding narrow perspectives when analyzing data. By understanding this statistical phenomenon, researchers can avoid drawing incorrect conclusions based on collider bias and ensure more accurate interpretations of cause-and-effect relationships.

Accuracy Paradox

The accuracy paradox is a phenomenon that occurs when a model’s accuracy is high but its ability to predict what matters is low. It arises when the data is imbalanced and the model is biased toward the majority class, and it can easily lead to incorrect conclusions. Let’s walk through a couple of scenarios.

  1. Medical diagnosis: Suppose a medical diagnosis model is trained to detect a rare disease that occurs in 1% of the population. The model is trained on a dataset that has 99% of healthy patients and 1% of patients with the disease. The model achieves an accuracy of 99%, which is high. However, the model is not useful because it predicts every patient as healthy, including the 1% of patients with the disease. In this scenario, the model’s accuracy is high, but its ability to predict is low.
  2. Credit card fraud detection: Suppose a credit card fraud detection model is trained on a dataset that has 99% of legitimate transactions and 1% of fraudulent transactions. The model achieves an accuracy of 99%, which is high. However, the model is not useful because it predicts every transaction as legitimate, including the 1% of fraudulent transactions. In this scenario, the model’s accuracy is high, but its ability to predict is low.
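Both scenarios can be demonstrated with a few lines of code. This is a sketch using the hypothetical 1% fraud rate from the example, with a degenerate “model” that labels every transaction as legitimate:

```python
# Hypothetical labels: 1 = fraud (1% of cases), 0 = legitimate.
y_true = [1] * 10 + [0] * 990
# A useless "model" that always predicts the majority class.
y_pred = [0] * 1000

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(f"accuracy:  {accuracy:.0%}")   # 99% -- looks impressive
print(f"precision: {precision:.0%}")  # 0%  -- never flags fraud
print(f"recall:    {recall:.0%}")     # 0%  -- misses every fraud
```

The 99% accuracy is entirely an artifact of class imbalance; precision and recall on the fraud class immediately expose the model as useless.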

The accuracy paradox arises from the interplay between different aspects of model performance and the inherent trade-offs within classification algorithms. While achieving a high overall accuracy rate may seem desirable at first glance, it can sometimes mask deeper issues within the model’s predictions.

One aspect that contributes to this paradox is the presence of false positives and false negatives. False positives occur when a model incorrectly predicts a positive outcome when it should have been negative. Conversely, false negatives occur when a model incorrectly predicts a negative outcome when it should have been positive. These errors can lead to significant consequences in certain applications such as healthcare or fraud detection.

The accuracy paradox becomes more apparent when we encounter imbalanced datasets or situations where certain classes are rare compared to others. In such cases, even if a model achieves high overall accuracy by correctly predicting the majority class, its performance on minority classes may be considerably poorer.

To address this paradox and improve overall model performance, data scientists need to delve deeper into evaluating metrics beyond simple accuracy measures. Techniques such as precision and recall become crucial in understanding how well a model performs on different classes and balancing trade-offs between false positives and false negatives.

By acknowledging the existence of the accuracy paradox and adopting more comprehensive evaluation methods in data science projects, we can gain better insights into our models’ capabilities and limitations. This knowledge empowers us to make informed decisions about deploying these models responsibly in real-world applications while striving for improved performance across all classes of interest.

In conclusion, the accuracy paradox is a phenomenon that can occur when the data is imbalanced and the model is biased towards the majority class. The paradox can be misleading and can lead to incorrect conclusions. To avoid it, use appropriate metrics to evaluate the model’s performance, such as precision, recall, and F1 score. It is also important to balance the classes, using techniques such as oversampling and undersampling.
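As a final illustration, here is a minimal sketch of random oversampling on made-up data (the class names and counts are hypothetical): minority-class rows are duplicated at random until the two classes are the same size:

```python
import random

random.seed(42)
# Hypothetical imbalanced dataset: 10 fraud rows vs. 990 legitimate rows.
data = [("fraud", i) for i in range(10)] + [("legit", i) for i in range(990)]

minority = [row for row in data if row[0] == "fraud"]
majority = [row for row in data if row[0] == "legit"]

# Keep every original row, then add random duplicates of the
# minority class until it matches the majority class in size.
needed = len(majority) - len(minority)
oversampled = majority + minority + random.choices(minority, k=needed)

print(len(oversampled))  # 1980 rows, now a 50/50 class split
```

Random duplication is the simplest balancing scheme; in practice, libraries offer smarter variants (such as synthetic minority oversampling), but the principle of equalizing class frequencies before training is the same.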

Conclusion

Hopefully, this article has helped you with your data problem. Hidden variables in data are always hard to find, and you can only catch them if you read about them a lot, or have been bitten by them like me.

Do follow our page TensorLabs to gain more insights into the world of data science, machine learning, and web technologies.
