How to not be impressed by spurious correlations

Input Coffee
3 min read · Sep 8, 2016

Actually, they’re spurious relationships

tl;dr

Spurious relationships are those where we find a high correlation between two seemingly unrelated variables, and we either suspect there might be a real relationship or are impressed that there isn’t one.

For example, here is one from Tyler Vigen’s site.

There are systematic, well-known reasons for these. They are: lurking variables, absolute numbers, small numbers and large data sets.

Spurious Correlations

Tyler Vigen’s site, Spurious Correlations, has remained a perennial favorite among statisticians and the mathematically minded. It shows a series of correlations that seem to have no causal relationship of any sort. They are invalid correlations. If you have not seen the site before, take a second to enjoy it.

What do we mean by spurious

Spurious could mean various things, but we mean to imply that the correlation is invalid in some sense. Of course we are not saying that the correlation doesn’t exist. We mean that the implication of the correlation is not correct. There are two things that a correlation might imply:

  1. A causal relationship between the two variables
  2. A causal relationship between each of the two variables and a third variable

I think that the vast majority of the spurious relationships are spurious in the first sense, but not in the second.

The lurking variable

A lurking variable is the third variable that explains both. In most of the spurious correlation cases, the lurking variable is population and wealth growth. As population grows and wealth grows there are more people and they can afford more things. So we expect ice cream sales to have gone up over time, and the number of cars, and the number of houses and their windows, and so on.

This explains why you might find a correlation such as this one:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation

The US budget basically goes up over time (inflation alone would account for this). And the US population goes up over time. If you looked at all the subsets of the US budget and all the items that correlate with the population, you would see similar correlations. Tyler Vigen may have picked the best example, but it was a target-rich environment.

A similar thing can be said about the revenue of arcades and the number of computer science doctorates awarded. Again we notice that the dollars are not adjusted for inflation and the doctorates are not adjusted for population growth.
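To make the lurking-variable effect concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the growth rates, the noise levels and the names (population, price_level, real_spend_per_person, events_per_person) are assumptions, not data from Tyler Vigen’s site. Two per-person quantities are generated independently of each other, then turned into the kind of raw totals a headline chart would plot.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
years = range(30)

# Assumed lurking variables: 1% yearly population growth, 2% yearly inflation.
population = [300e6 * 1.01 ** t for t in years]
price_level = [1.02 ** t for t in years]

# Two unrelated per-person quantities: flat over time, independent 2% noise.
real_spend_per_person = [100 * (1 + rng.gauss(0, 0.02)) for _ in years]
events_per_person = [2e-5 * (1 + rng.gauss(0, 0.02)) for _ in years]

# What a headline chart would plot: nominal dollars and absolute counts.
nominal_spending = [
    s * p * pop
    for s, p, pop in zip(real_spend_per_person, price_level, population)
]
event_count = [e * pop for e, pop in zip(events_per_person, population)]

raw_r = pearson(nominal_spending, event_count)
adjusted_r = pearson(real_spend_per_person, events_per_person)
print(f"raw totals:                      r = {raw_r:.2f}")
print(f"per person, inflation-adjusted:  r = {adjusted_r:.2f}")
```

The raw totals share the population and price trends, so they move together even though the underlying per-person series were drawn independently. Dividing the trends back out is the adjustment the two examples above are missing.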

Truly spurious correlations: small numbers

There are some correlations that cannot be explained by lurking variables. The most popular example seems to be “Number of people who drowned by falling into a pool correlates with Films Nicolas Cage appeared in.”

This is truly spurious in the sense that there isn’t another variable to explain the two. If you run enough variables against each other, you’ll find many such correlations. However, you’ll notice one thing in common with many of these: they involve very small numbers.

Nicolas Cage’s films per year, for instance, vary between 1 and 4. This is most likely the case for many, many actors. Also note that the average seems to be about 2. The period of time they look at is 10 years. For most of those years, Cage appears in 2 movies, which is normalized to the average of about 100 drownings. Then it is a question of finding peaks and valleys near those of the pool drownings. Any actor who appeared in 2 extra movies in 2007 and one fewer movie in 2003 would show a close correlation.
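The small-numbers argument can be checked with a quick simulation. This is a rough sketch under invented assumptions: the target series is random noise around 100 rather than real drowning data, each hypothetical actor gets a uniformly random 1 to 4 films a year, and the 0.6 cutoff is an arbitrary choice for "looks impressive".

```python
import random

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
YEARS = 10
ACTORS = 10_000
THRESHOLD = 0.6  # arbitrary cutoff, chosen for illustration

# Hypothetical yearly counts hovering around 100 (not real data).
target = [rng.gauss(100, 10) for _ in range(YEARS)]

hits = 0
best = -1.0
for _ in range(ACTORS):
    # A made-up actor appearing in 1 to 4 films each year, purely at random.
    films = [rng.randint(1, 4) for _ in range(YEARS)]
    r = pearson(films, target)
    best = max(best, r)
    if r >= THRESHOLD:
        hits += 1

print(f"{hits} of {ACTORS} random actors reach r >= {THRESHOLD}")
print(f"best correlation found: {best:.2f}")
```

With only ten points and four possible values per point, a noticeable share of purely random filmographies clears the cutoff, so finding one real actor who matches is not much of a surprise.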

Massive Data Sets

It should also be added here that you could find genuine examples of spurious correlations in the strong sense if you used a large enough data set and ran through enough examples.
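A small sketch of why the size of the search matters: with k series there are k(k-1)/2 pairs to compare, so the best match keeps improving as k grows. The series here are uniform random fractions with no trend and no small-integer effect, and the pool size and series length are arbitrary choices for illustration.

```python
import random

def standardize(xs):
    """Rescale a series to mean 0 and (population) standard deviation 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

def max_abs_correlation(series):
    """Largest |Pearson r| over all pairs of already-standardized series."""
    best = 0.0
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            # For standardized series, r is the mean of the pointwise products.
            r = sum(a * b for a, b in zip(series[i], series[j])) / len(series[i])
            best = max(best, abs(r))
    return best

rng = random.Random(2)  # fixed seed so the run is repeatable
POINTS = 10             # observations per series, e.g. ten years

# A pool of unrelated series made of random fractions in [0, 1).
pool = [standardize([rng.random() for _ in range(POINTS)]) for _ in range(300)]

# Search ever larger subsets of the same pool.
results = {k: max_abs_correlation(pool[:k]) for k in (10, 50, 300)}
for k, r in results.items():
    pairs = k * (k - 1) // 2
    print(f"{k:>3} series, {pairs:>5} pairs: strongest |r| = {r:.2f}")
```

Because each larger subset contains the smaller one, the strongest correlation can only stay the same or rise as more series are added.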

If you see some data sets that are highly correlated, with no third explanatory variable, that use percentages and fractions without using scaled numbers, you should probably be genuinely impressed by the combination of luck and processing power that went into discovering the correlation.
