Correlation Myths
Understanding relationships between data attributes has long been at the heart of data analysis. One of the most common measures of such interdependence is the Pearson correlation coefficient, defined below.
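For reference, the sample form of the coefficient is usually written as follows, where (x_i, y_i) are the n paired observations and x̄ and ȳ are their means (the symbol names are chosen here for illustration):

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$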
The coefficient r satisfies -1 ≤ r ≤ 1: the sign indicates the direction of the relationship and the magnitude indicates its strength.
Pearson correlation (hereafter referred to as correlation) is often misunderstood to mean things it doesn’t.
Let’s understand these myths one by one.
1. Measure of a relationship
While this is true, Pearson’s coefficient specifically measures a linear relationship, not just any kind of relationship.
As an example, below is a plot of an inverse relationship, i.e. y = 1/x. Although we know the exact relationship, the Pearson coefficient does not capture its full extent, as is evident from its value of -0.6.
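A minimal sketch of this effect, assuming NumPy is available. The x range (1 to 10) and the number of points are arbitrary choices for illustration, not the data behind the original plot, so the coefficient will differ from the -0.6 quoted above:

```python
import numpy as np

# Hypothetical sample: 50 evenly spaced x values (the range is an assumption)
x = np.linspace(1, 10, 50)
y = 1 / x  # exact inverse relationship, no noise

# Pearson correlation between x and y
r_inverse = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = 1/x: {r_inverse:.3f}")
```

Even though y is fully determined by x, the coefficient is negative but stays short of -1, because the relationship is not a straight line.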
2. A coefficient of 0 implies no relationship
A coefficient of 0 only implies the absence of a linear relationship; there can still be a non-linear one. For example, consider y = x² with x values spread symmetrically around zero: we know there is a relationship, yet Pearson’s coefficient does not capture it.
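A quick way to check this numerically, again assuming NumPy. The symmetric range of x is an assumption that matters here: on a one-sided range such as 0 to 5, the coefficient would be far from zero.

```python
import numpy as np

# x values symmetric around zero (assumed range for illustration)
x = np.linspace(-5, 5, 101)
y = x ** 2  # deterministic, but non-linear, relationship

r_square = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {r_square:.6f}")
```

The positive and negative halves of the parabola cancel out in the covariance, so the coefficient is zero up to floating-point error.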
3. The higher the linear slope, the higher the correlation
As long as the relationship is linear with a non-zero slope, the correlation coefficient will capture it; the steepness of the slope doesn’t matter. Consider the examples below, each a perfect correlation of 1.0 for relationships with different slopes.
To correlation, all that matters is the strength (how tightly the data points cluster around a straight line) and the direction (positive or negative), not the slope.
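The sketch below illustrates both points, assuming NumPy. The slopes, intercept, noise levels and random seed are all arbitrary illustrative choices:

```python
import numpy as np

x = np.linspace(0, 10, 200)

# Perfectly linear data: changing the slope leaves |r| at 1,
# and only the sign of the slope changes the sign of r.
r_by_slope = {
    slope: float(np.corrcoef(x, slope * x + 3.0)[0, 1])
    for slope in (0.5, 2.0, 10.0, -2.0)
}
print(r_by_slope)

# Same slope, different scatter: more noise around the line lowers r.
rng = np.random.default_rng(0)
noise = rng.normal(size=x.size)
r_tight = np.corrcoef(x, 2.0 * x + 1.0 * noise)[0, 1]
r_loose = np.corrcoef(x, 2.0 * x + 10.0 * noise)[0, 1]
print(f"low noise: {r_tight:.3f}, high noise: {r_loose:.3f}")
```

The first part varies the slope while keeping the points exactly on a line; the second keeps the slope fixed and varies only how far the points stray from it.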
4. Correlation implies a cause
This is probably the most talked-about myth. Correlation simply tells us that data attributes share a linear relationship, without attributing a cause to that relationship.
The data below has a near-perfect correlation of 0.997. Does that mean that US spending on science, space and technology is causing suicides by hanging, strangulation and suffocation (or vice versa)? Probably not. Correlations like this can arise purely by chance.
Here are a bunch of them.
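To see how easily chance alone can produce a strong correlation, here is a small simulation, assuming NumPy. It uses purely random numbers rather than the real spending or mortality data, and the series length (10), the number of candidate series (2000) and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# One short "target" series, e.g. 10 yearly values (random by construction)
target = rng.normal(size=10)

# 2000 unrelated random series of the same length
candidates = rng.normal(size=(2000, 10))

# Correlate every unrelated series with the target
rs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])

best_abs_r = float(np.max(np.abs(rs)))
print(f"Strongest correlation found by chance: {best_abs_r:.3f}")
```

None of the candidate series has anything to do with the target, yet searching through enough of them turns up at least one that correlates strongly. Short series and many comparisons make this kind of coincidence likely.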
A causal effect is usually established through experiments, not just by observing correlation. That said, there are situations where we are not interested in causality and only need the relationship to make accurate predictions (a topic for a different time).
Conclusion
The Pearson correlation coefficient condenses a linear relationship into a value between -1 and 1. It is important for us as analytics professionals to understand what that value implies and what it doesn’t.
What other types of correlation measures have you encountered?