Pinarismiguzel, published in Academy Team, Nov 1, 2023

I FOUND A CORRELATION BETWEEN TEMPERATURE AND COFFEE SOLD, SO WHERE IS THE CAUSALITY IN THIS RELATIONSHIP?

Artificial Intelligence is widely discussed today, and a growing number of businesses integrate Machine Learning or Deep Learning models into their processes. These models are trained on data to learn relationships between variables and to make predictions.

However, one problem needs to be addressed: correlation does not imply causation.

Imagine you’re the owner of a coffee shop, and you’re intrigued by the connection between the maximum daily temperature and the number of cups of coffee sold per day.

As shown in the graph, we observe a negative correlation between maximum daily temperature and daily coffee cup sales. A conventional model in this scenario might successfully detect this correlation. However, it’s essential to check whether this is a spurious correlation, because the model cannot tell us why the relationship exists.
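To make the idea of "detecting a correlation" concrete, here is a minimal sketch in Python. The numbers are simulated, not real shop records: the temperature range, the baseline of 200 cups, and the noise level are all assumptions chosen only so that the example produces a negative relationship like the one described.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Simulated data (an assumption for illustration, not measurements).
rng = random.Random(42)
max_temp_c = [rng.uniform(0, 35) for _ in range(365)]
# Hypothetical rule: fewer cups on hotter days, plus random noise.
cups_sold = [200 - 3 * t + rng.gauss(0, 20) for t in max_temp_c]

r = pearson_r(max_temp_c, cups_sold)
print(f"Pearson r between temperature and cups sold: {r:.2f}")
```

The coefficient comes out strongly negative, but it only describes how the two columns move together. Nothing in the calculation says why they do.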

CORRELATION VS CAUSATION — DON’T LET THEM FOOL YOU

To understand the concept, let’s return to our example about the relationship between daily temperature and daily coffee cup sales.

Since the scatter plot shows a negative correlation between cups of coffee sold and maximum daily temperature, we might think that a drop in temperature causes coffee sales to rise.

CORRELATION CAUSED BY THE THIRD FACTOR

If we look for a reason behind this relationship, one candidate is that most coffee drinkers reach for the beverage in cold weather to boost their body temperature and stay warm; drinking hot coffee may seem an ideal way to cope with freezing temperatures.

We can conclude that, in cases like this, a third factor causes the correlation between two events. Without that third factor, there would be no correlation between the other two.
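One way to see this is with a small simulation. The sketch below is an illustration under stated assumptions: `third_factor`, `event_a`, and `event_b` are hypothetical variables, and the data is generated so that the third factor drives both events while the events never influence each other. It compares the raw correlation with the partial correlation, which is a standard way to ask how two variables relate once a third one is held fixed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs (first-order partial correlation)."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Simulated data (hypothetical): the third factor drives both events,
# and the two events have no direct link to each other.
rng = random.Random(0)
third_factor = [rng.gauss(0, 1) for _ in range(2000)]
event_a = [2 * z + rng.gauss(0, 1) for z in third_factor]
event_b = [-3 * z + rng.gauss(0, 1) for z in third_factor]

raw = pearson_r(event_a, event_b)
controlled = partial_r(event_a, event_b, third_factor)
print(f"Raw correlation:            {raw:.2f}")
print(f"Controlling for the factor: {controlled:.2f}")
```

In this setup the raw correlation is strong, while the partial correlation is close to zero. The catch in real analyses is that you can only control for a third factor you have thought of and measured.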

Although a correlation alone does not mean that one event causes the other, that is not to say it never does. Sometimes (!) one indeed causes the other. A common mistake in that case is confusing the direction of causality, that is, mistaking the effect for the cause.

THE WACKY WORLD OF THE COINCIDENTAL CORRELATIONS

In the world of data analysis, data explorers know that sometimes you look at two totally unrelated things and find a correlation that seems like it fell straight out of a comedy skit. Welcome to the quirky universe of coincidental correlations.

The famous “Age of Miss America correlated with murder by steam, hot vapors, and hot objects” is a prime example of such a coincidental correlation.

This correlation suggests that as the age of Miss America changes over the years, so does the number of murders by steam, hot vapors, and hot objects. But here’s the kicker: there’s no real connection here. No beauty queen is causing steam-related murders; it’s just a wild coincidence.

The age of Miss America and steam-related murders have no inherent connection. So, what’s happening here? These types of correlations are often driven by what statisticians call “confounding variables” or by pure coincidence. In this case, there is no causal relationship between the two variables; they simply change over time independently of each other.

The Age of Miss America example, along with many others like it, serves as a humbling reminder of the dangers of drawing causal conclusions from correlated data. It’s easy to jump to conclusions and believe that one variable is causing changes in another when, in reality, there is no meaningful cause-and-effect relationship.
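A rough sketch of why such coincidences turn up so easily: compare enough unrelated series and some pair will line up by chance. The example below does not use the Miss America data. It generates 200 independent random walks of 12 points each (both numbers are arbitrary assumptions, loosely mimicking short yearly series) and then searches for the most strongly correlated pair.

```python
import random
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, steps):
    """A series that drifts up or down by a random amount at each step."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


# 200 series generated independently: by construction, none affects another.
rng = random.Random(7)
series = [random_walk(rng, 12) for _ in range(200)]

# Search every pair for the strongest correlation (positive or negative).
best_pair = max(
    combinations(range(len(series)), 2),
    key=lambda pair: abs(pearson_r(series[pair[0]], series[pair[1]])),
)
best_r = pearson_r(series[best_pair[0]], series[best_pair[1]])
print(f"Series {best_pair[0]} and {best_pair[1]}: r = {best_r:.2f}")
```

The winning pair shows a very strong correlation even though the series were generated independently. Searching through thousands of pairs and reporting only the best match is exactly how a wild coincidence can end up looking like a pattern.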

AVOIDING THE PITFALLS

As data scientists, it’s essential to keep our wits about us in the world of data analysis. Correlations are like tantalizing hints that say, “Hey, there might be something here!” But it’s essential to dig deeper, ask questions, and consider whether confounding variables might be at play. Rather than jumping to conclusions, put on your detective hat, look for confounding factors, and remember that correlation doesn’t mean causality.

In a nutshell, the data analysis realm is full of surprises, and those tricky spurious correlations are a constant reminder of the complexity of the data we work with. By understanding the difference between correlation and causation and by being diligent in our analysis, we can separate the coincidental from the meaningful and make smarter decisions based on our data.

Happy data hunting, fellows!
