Correlation vs. Causation: Your Data Might Be Lying To You

Brian Muchai

We have always been natural pattern-seekers. It’s built into our brains. Long ago, before math was even discovered, our ancestors were already using pattern recognition in their daily lives. Picture this: a group of hunter-gatherers foraging in the wild. Among the many plants they encounter, mushrooms catch their eye. They develop a taste for them, and mushrooms become one of their favorite foods. As days go by, they notice that more and more individuals are falling sick and dying soon after.

They begin to put two and two together: those who were dying had reportedly eaten mushrooms with distinct black spots. From this, they begin to associate these black-spotted mushrooms with death. Through this grim discovery, they learn to avoid these mushrooms, a crucial insight that saves countless lives.

Fast forward to the modern era, and pattern recognition is no longer just about survival. It’s about thriving and being among the top one percent.

Before we go on, let us define some terms:
Correlation: a measure of the strength and direction of the relationship between two variables. When two variables are correlated, they tend to vary together in a consistent pattern. In the example above, eating black-spotted mushrooms and illness were positively correlated: as consumption of those mushrooms increased, cases of illness and eventual death increased as well.

Causation: a change in one variable directly produces a change in another. In the same example, eating the black-spotted mushrooms was the cause of illness and eventual death.
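To make the idea of "strength and direction" concrete, here is a minimal sketch that computes the Pearson correlation coefficient, one common way of measuring correlation. The weekly counts are invented purely for illustration, and the names `pearson`, `mushrooms_eaten` and `illness_cases` are my own assumptions, not data from any real study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly counts, made up for this example.
mushrooms_eaten = [2, 4, 5, 7, 9, 12]
illness_cases = [0, 1, 1, 2, 3, 4]

r = pearson(mushrooms_eaten, illness_cases)
print(f"r = {r:.2f}")  # a value close to +1 means a strong positive correlation
```

The coefficient ranges from -1 to +1. A value near +1 says the two variables rise together, a value near -1 says one falls as the other rises, and a value near 0 says there is no clear linear pattern. Note that the number only describes how the variables move together. It says nothing about which one, if either, is the cause.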

With these terms defined, let us explore other examples where correlation and causation are at play.

A certain company notices a troubling trend in its quarterly sales reports: sales have been declining steadily over the past three months. Digging into this, they discover a significant drop in customer numbers. Many customers have been unsubscribing from the service, which in turn has reduced sales revenue.

Recognizing this pattern is their first step toward addressing the problem. From it, they hypothesize that the declining customer base is driving the sales drop. To combat this, they devise a strategy: launching a loyalty program to reward repeat customers and incentivize referrals. By identifying and acting on this correlation, they aim to turn the tide and boost their revenue once again.

From foraging for mushrooms to analyzing sales data, the ability to spot patterns has always been a powerful tool. It is particularly useful when we are trying to solve a problem. We often already have the data at hand; we just have to discover what it is actually telling us. However, as we do so, we must be careful. The line between correlation and causation is thin, and mistaking one for the other can lead to misleading conclusions and costly decisions.

For example, the decline in sales in the case above wasn’t fully explained by customers leaving. Customers were leaving because their incomes had dropped. It was the middle of the COVID-19 pandemic. Thousands of people were getting laid off, while others had their salaries significantly reduced. I mean, when your earnings are cut by 50%, will you keep paying for your online movie subscription, or focus on securing a steady food supply?

Because of this oversight, the company implemented a strategy that would not win customers back. Instead, they were just throwing money down the drain, leading to even more losses.

Here is another scenario: you are tasked with investigating why students at a particular school are getting low marks. After doing your research, you discover that most of these students smoke. Smoking is believed to impair cognitive ability, so you conclude that these students are getting low marks because they smoke.

However, somebody else could argue that these students smoke because they get low marks. They may be under a lot of pressure from their teachers and parents over their poor exam results, and therefore resort to smoking for some relief.

So which is it? Are the students getting low marks because they smoke, or do they smoke because they get low marks? To stay within the scope of the task at hand, you conclude that smoking is the reason for the low marks. It is a conclusion that few would object to, because you have the data to back it up.

However, having data to defend your case does not always mean that you are right. You might have missed something, in which case the data is not giving you credible insights; it is lying to you.

Let us look at this case from a different perspective. We have students who smoke, and they happen to be getting low marks. Rather than one of these characteristics causing the other, what if some external factor is causing both? This seems possible, right? Let’s explore it further.

Negative experiences such as the loss of a loved one, stress, and peer pressure can lead somebody to smoke and also to score low marks in examinations. When a significant number of these students were interviewed, they confirmed as much.
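A small simulation can show how an external factor produces a correlation on its own. The sketch below is a toy model under loudly stated assumptions: every probability and score in it is invented, stress is reduced to a simple yes/no flag, and smoking is deliberately given no effect on marks at all. It is not based on any real student data.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


random.seed(42)  # fixed seed so the run is repeatable

# Each simulated student is a tuple: (stressed, smokes, marks).
students = []
for _ in range(5000):
    stressed = random.random() < 0.5
    # Assumption: stress raises the chance of smoking...
    smokes = random.random() < (0.7 if stressed else 0.2)
    # ...and lowers marks. Smoking itself has NO effect on marks here.
    marks = 70 - (20 if stressed else 0) + random.gauss(0, 5)
    students.append((stressed, smokes, marks))


def smoking_vs_marks(group):
    """Correlation between smoking (0 or 1) and marks within a group."""
    return pearson([float(s[1]) for s in group], [s[2] for s in group])


overall = smoking_vs_marks(students)
among_stressed = smoking_vs_marks([s for s in students if s[0]])
among_relaxed = smoking_vs_marks([s for s in students if not s[0]])

print(f"all students:      {overall:+.2f}")
print(f"stressed students: {among_stressed:+.2f}")
print(f"relaxed students:  {among_relaxed:+.2f}")
```

Across all students, smoking and marks come out clearly negatively correlated, even though the model never lets smoking touch the marks. Once we compare stressed students only with other stressed students, and relaxed students only with other relaxed students, the correlation all but disappears. That is the signature of a shared underlying cause, and it is exactly the kind of check worth running before blaming one variable for another.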

What would have happened if we had not dug deeper into the root cause of the low marks? We might have recommended that the school educate students about the dangers of smoking. This, however, would not have fully addressed the problem at hand. The students might have quit smoking, but their marks would not have improved.

The best solution was to look into why the students were actually failing, rather than to tackle the smoking that they all happened to share.

From the two examples, we have learnt that it is of utmost importance to dig deeper into our data. We should fully understand what our data is telling us, rather than spotting a correlation and assuming that it implies causation. If we do this, we will understand the reasons behind the patterns in our data and propose solutions that fully address the problem we wish to solve.
