Do storks bring babies? Here is why every data person should be aware of spurious correlation.

Robert Barcik
4 min read · Jan 20, 2022

A strange phenomenon occurred some 50 years ago in smaller cities in Germany: many storks were moving into these cities at the same time as many babies were being born. People started to wonder: do storks bring babies? You may have heard the story yourself, as the phenomenon occurred in many regions.

Storks and babies being born correlated so much that it bothered researchers. Is it possible that nature gave us a sort of function, or a process, where storks somehow contribute to babies being born? Or maybe, is this correlation somehow beneficial for us?

You have probably known since you were about eight years old that storks do not bring babies. So what was going on?

In reality, this was a spurious correlation, and the researchers uncovered the underlying mechanism. At its heart was a socio-demographic trend of young couples moving into smaller cities. These young couples were settling down and building houses. The houses made ideal nesting places for storks, which happily moved into the cities. At the same time, the young couples decided to have babies. Thus, we could summarise the phenomenon as in the picture below.
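The mechanism can also be reproduced with a small simulation. The sketch below is only an illustration: every number in it (how many storks or babies each young couple "brings", the noise levels, the 2,000 towns) is an invented assumption, not a figure from the research. It generates towns where young couples drive both storks and babies, and then checks what happens to the stork-baby correlation once the effect of the couples is removed.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical towns: the number of young couples is the hidden driver.
couples = [rng.uniform(0, 100) for _ in range(2000)]
# Storks and babies each depend on couples, never on each other.
storks = [0.5 * c + rng.gauss(0, 5) for c in couples]
babies = [0.8 * c + rng.gauss(0, 8) for c in couples]

r_raw = pearson(storks, babies)
# Correlation after the influence of couples is removed from both series.
r_adjusted = pearson(residuals(storks, couples), residuals(babies, couples))

print(f"storks vs babies:                  {r_raw:.2f}")
print(f"storks vs babies, couples removed: {r_adjusted:.2f}")
```

In this toy setup the raw correlation is strong, while the adjusted one sits close to zero, because storks and babies are linked only through the couples. With real data the hidden factor is rarely known in advance, which is what makes such correlations easy to fall for.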

Spurious correlations occur everywhere in the world. If you work with data, there is a good chance that you will occasionally stumble upon some. It is then necessary to stay rational and question whether the correlation you see is spurious or has some logical justification. Even though this example was only illustrative, your mind can fall into the trap. As Daniel Kahneman wrote in his masterpiece “Thinking, Fast and Slow”, our minds are quick to draw conclusions.

Why shouldn’t we use a spurious correlation?

To answer this question, I need to bring in another story. It concerns the Washington football team.

Image by Erik Drost on Flickr

The “Redskins Rule” is probably the longest-lasting case of spurious correlation that humanity has observed. If the Washington Redskins win their last home game before the election, the incumbent party wins the presidential election. This spurious correlation held for over 70 years! It started in 1936 with Franklin D. Roosevelt and concluded in 2008 with Barack Obama.

Since the 2008 election, it has reversed. If the Washington Redskins win their last home game, the incumbent party loses the presidential election. This is a showcase of why we should not rely on spurious correlations. As spurious correlations have no rational justification for existing, they can collapse at any moment.

You would not want your data science product to collapse at a random moment, would you?

Imagine you sit down with your company data and spot a new correlation. For example, coffee consumption in the office correlates closely with revenues. Should you now hire a barista and drown your office in coffee? Certainly not. Keep a rational mind and don’t jump to conclusions.

Let me summarize with a few valuable thoughts that every data person should keep in mind.

The world runs on causation. One phenomenon causes another to occur. For example, the sun rises and heats the air. The world is full of causal relationships, but proving one can require tedious and meticulous work. For example, establishing causation between tobacco products and health issues took studies that ran for decades.

As causation can be hard and expensive to prove, data science often works with correlations instead. These can be just as useful and are much easier to discover. A correlation in the broadest sense means that two phenomena have some form of relationship that we can describe. For example, we might see a correlation between the temperature outside and ice cream sales. This relationship is helpful for us, as we can adjust our sales strategy based on it.
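To make "a relationship that we can describe" concrete, here is a minimal sketch that measures the temperature and ice cream relationship with the Pearson correlation coefficient, one common way to quantify a linear relationship. The seven daily readings are made up for illustration and the variable names are my own.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalised because n cancels out.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical daily readings: temperature in °C and ice creams sold.
temperature = [14, 16, 19, 22, 25, 28, 31]
ice_cream_sales = [120, 135, 160, 190, 230, 260, 300]

r = pearson(temperature, ice_cream_sales)
print(f"correlation between temperature and sales: {r:.3f}")
```

The coefficient ranges from -1 to 1. A value close to 1, as these invented numbers produce, says that sales tend to rise together with temperature. It says nothing about why, which is exactly the question to ask before relying on it.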

However, we need to be careful, as we might stumble upon spurious correlations. Our task is to keep a rational mind and discard them from the analysis. A spurious correlation can be caused by some unseen factor, as in the case of storks and babies. Or it can be caused by pure coincidence, as in the case of the Washington Redskins and the US presidential elections. In either case, we should not rely on a spurious correlation, as it can stop holding at any moment, and our model could break.
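The coincidence case is easy to recreate. The sketch below is an illustration under invented assumptions (1,000 candidate series, 10 past observations, 200 future ones, all pure Gaussian noise). It searches many unrelated random series for the one that best matches a random target on a short history, and then checks how that "discovery" fares on data that arrives later.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable

N_HISTORY = 10       # observations we have seen so far
N_FUTURE = 200       # observations that arrive later
N_CANDIDATES = 1000  # unrelated series we search through


def noise(n):
    """A series of n independent random values: no signal at all."""
    return [rng.gauss(0, 1) for _ in range(n)]


# The thing we want to "explain", split into past and future.
target_past, target_future = noise(N_HISTORY), noise(N_FUTURE)
# Candidate "predictors", each also split into past and future.
candidates = [(noise(N_HISTORY), noise(N_FUTURE)) for _ in range(N_CANDIDATES)]

# Pick the candidate that matches the target best on the short history.
best_past, best_future = max(
    candidates, key=lambda c: abs(pearson(c[0], target_past))
)

r_past = pearson(best_past, target_past)
r_future = pearson(best_future, target_future)

print(f"best match on the history: {r_past:+.2f}")
print(f"same series on new data:   {r_future:+.2f}")
```

Searching enough unrelated series over a short history tends to turn up at least one impressive match, and that match has no reason to survive on new data. This is the same pattern as a rule that holds for decades and then reverses.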

This learning story comes from our latest online course “Be Aware of Data Science” which can be found here (the author of this article is also the author of the linked course).
