Photo by Jason Coudriet on Unsplash

Correlation Myths

Harish Daryani
Published in Analytics Vidhya
3 min read · Aug 2, 2021


Understanding relationships between data attributes has been at the heart of data analysis. One of the most common measures of such interdependence is the Pearson correlation coefficient, defined below.

Formula for Pearson correlation coefficient:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of x and y.

-1 ≤ r ≤ 1, with the sign indicating the direction of the relationship and the magnitude indicating its strength.
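To make the definition concrete, here is a minimal sketch that computes r straight from the standard textbook formula (deviations from the mean in the numerator, square root of the summed squared deviations in the denominator). The function name `pearson_r` and the `hours`/`score` numbers are made up purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of summed squared deviations.
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence makes the denominator 0 (r is undefined).
    return sum_xy / math.sqrt(sum_xx * sum_yy)

hours = [1, 2, 3, 4, 5]        # hypothetical data
score = [52, 55, 61, 64, 70]   # hypothetical data
r = pearson_r(hours, score)
print(f"r = {r:.3f}")
```

In practice a library routine such as `statistics.correlation` (Python 3.10+) or `numpy.corrcoef` would normally be used instead of a hand-rolled function.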

Pearson correlation (hereafter referred to as correlation) is often misunderstood to mean things it doesn’t.

Let’s understand these myths one by one.

1. Measure of a relationship

While this is true, Pearson’s coefficient specifically measures a linear relationship, not just any kind of relationship.

As an example, below is a plot of an inverse relationship, y = 1/x. Although y is completely determined by x, the Pearson coefficient does not capture the full extent of this relationship, as is evident from its value of -0.6.

y = 1/x

2. A coefficient of 0 implies no relationship

A coefficient of 0 only implies the absence of a linear relationship; there can still be a non-linear one. For example, consider y = x² with x values symmetric around 0: we know there is a relationship, yet Pearson’s coefficient is 0 and does not capture it.

y = x²

3. A steeper slope means a higher correlation

As long as the relationship is linear, the correlation coefficient will capture it; the steepness of the slope doesn’t matter. Consider the examples below of a perfect correlation, i.e. r = 1.0, for relationships with different slopes.

Correlation is the same for linear relationships with different slopes

To correlation, all that matters is the strength, i.e. how closely the data points fall along a straight line, and the direction (positive/negative), not the slope.

4. Correlation implies causation

This is probably the most talked-about myth. Correlation simply tells us that two data attributes share a linear relationship; it does not attribute a cause to that relationship.

The data below has a near-perfect correlation of 0.997. Does that mean that US spending on science, space and technology is causing suicides by hanging, strangulation and suffocation (or vice versa)? Probably not. Such correlations might exist purely by chance.

Source: https://www.tylervigen.com/spurious-correlations

The same source lists a bunch more of them.

A causal effect is usually determined via experiments, not just by observing correlation. Having said that, there might be situations where we are not interested in causality and only interested in the relationship to make accurate predictions (a topic for a different time).

Conclusion

The Pearson correlation coefficient condenses a linear relationship into a value between -1 and 1. It is important for us as analytics professionals to understand what that value implies and what it doesn’t.

What other types of correlation measures have you encountered?
