TL;DR: Correlation does not necessarily mean causation! See for yourself in this infographic.
“Correlation does not prove causation”: I came across this statement during the Udacity-Bertelsmann Technology Scholarship Data Track course in 2019, and it struck me. I had been doing EDA and, based on correlation alone, had summed up my results as if causation were established. [Yes, I was wrong!]
That very line from the Bertelsmann Data Track course made me realize that my analysis was heading in the wrong direction, so I started to dig deeper to understand the thin line between correlation and causation.
Understanding the phrase “Correlation does not prove causation” and applying the concept in your next data science project will make you doubly confident in your conclusions. This post covers:
- Understanding correlation.
- Calculating correlation.
- Understanding causation.
- Establishing causation.
- The key differences between correlation and causation.
Before jumping into the process of becoming doubly confident, let’s understand the underlying meaning of each concept.
What is Correlation?
Correlation is a statistical measure (expressed as a number) that describes the size and direction of a relationship between two or more variables. Put simply, correlation is a relationship between two things. A common objective of analysis is to identify the extent to which one variable relates to another, i.e., to see how the target variable depends on an independent variable.
A correlation between variables, however, does not automatically mean that the change in one variable is the cause of the difference in the values of the other variable.
How is the correlation measured?
Pearson r correlation: Pearson's r is the most widely used correlation statistic for measuring the degree of the relationship between linearly related variables. Its value ranges from -1 to +1. There are three possible results of a correlational study:
- Positive correlation: As one variable increases, the other variable also increases.
- Negative correlation: As one variable increases, the other variable decreases.
- No correlation: There is no apparent relationship between the two variables.
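To make the three outcomes concrete, here is a minimal sketch that computes Pearson's r from its textbook definition using only the standard library. The helper name `pearson_r` and all of the data are made up for illustration; they are not from any real study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5, 6]
score = [52, 55, 61, 64, 70, 73]   # rises as hours rise
errors = [30, 27, 22, 20, 15, 11]  # falls as hours rise

r_positive = pearson_r(hours, score)
r_negative = pearson_r(hours, errors)

# Two independent series of random noise: r should land close to 0.
random.seed(0)
noise_a = [random.gauss(0, 1) for _ in range(5000)]
noise_b = [random.gauss(0, 1) for _ in range(5000)]
r_none = pearson_r(noise_a, noise_b)

print(f"positive: {r_positive:.2f}")
print(f"negative: {r_negative:.2f}")
print(f"none:     {r_none:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates no apparent linear relationship.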
If you are familiar with pandas, DataFrame.corr() computes the pairwise correlation of all columns in a DataFrame (the Pearson correlation coefficient by default). To make the result easier to read and interpret, you can import the Seaborn library and plot the correlation matrix as a heatmap. To know more about it, read my previous post.
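As a rough sketch of that workflow, the snippet below builds a small DataFrame, computes the correlation matrix, and wraps the heatmap in a helper. The column names, the data, and the `plot_heatmap` helper are hypothetical and only there for illustration; the plotting libraries are imported inside the helper so the correlation part runs even without them.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 73],
    "errors_made": [30, 27, 22, 20, 15, 11],
})

# Pairwise Pearson correlation between all columns.
corr = df.corr(method="pearson")
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw a correlation matrix as an annotated heatmap (hypothetical helper)."""
    # Imported lazily so the rest of the example does not need plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Fix the color scale to the full range of r so colors are comparable.
    sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    plt.show()


# plot_heatmap(corr)  # uncomment to draw the heatmap
```

The matrix is symmetric with 1.0 on the diagonal, since every column is perfectly correlated with itself.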
The correlation coefficient should not be used to say anything about the cause and effect relationship. By examining the value of ‘r’, we may conclude that two variables are related, but that ‘r’ value does not tell us if one variable was the cause of the change in the other.
This is where the need to understand causation comes in.
What is Causation?
Causation, also known as causality or cause and effect, indicates that one event is the result of the occurrence of another event, i.e., there is a causal relationship between the two events.
It tries to answer the question: does one variable impact the other?
How can causation be established?
When data shows a correlation, there may be an underlying causal relationship, but the correlation alone does not let us say confidently that there is a cause-and-effect relation. To establish causation, we can follow the correlation analysis with two further steps.
- Controlled study
A controlled study is the most effective way of establishing causality between variables. In a controlled study, the sample is split into two groups, a treatment group and a control group, that are comparable in almost every way. The two groups then receive different treatments (the treatment is the independent variable), and the outcome of interest (the dependent variable) is assessed for each group.
How do you perform a controlled study? Find out more in the article below.
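The idea can be sketched with a small simulation. This is not a real study: the sample size, the baseline distribution, and `TRUE_EFFECT` are all assumptions chosen for the example, and random assignment is used here to keep the two groups comparable.

```python
import random
import statistics

random.seed(42)

TRUE_EFFECT = 5.0  # assumed treatment effect, chosen for this simulation
N = 4000           # assumed number of participants

# 1. Randomly assign half of the participants to the treatment group.
participants = list(range(N))
random.shuffle(participants)
treatment_ids = set(participants[: N // 2])

# 2. Simulate outcomes: a baseline that varies from person to person,
#    plus the treatment effect for treated participants only.
treated, control = [], []
for pid in range(N):
    baseline = random.gauss(50, 10)
    if pid in treatment_ids:
        treated.append(baseline + TRUE_EFFECT)
    else:
        control.append(baseline)

# 3. Compare the average outcome of the two groups.
estimated_effect = statistics.mean(treated) - statistics.mean(control)
print(f"estimated effect: {estimated_effect:.2f} (true effect: {TRUE_EFFECT})")
```

Because assignment is random, the groups differ only by chance apart from the treatment, so the difference in average outcomes is a reasonable estimate of the treatment's effect.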
Designing a research project: randomised controlled trials and their principles
The sixth paper in this series discusses the design and principles of randomised controlled trials.
- Non-spuriousness
A spurious, or false, relationship exists when what appears to be an association between two variables is actually caused by a third, extraneous variable, i.e., A and B are correlated, but both are caused by C.
Non-spuriousness therefore requires that alternative explanations for the observed relationship between two variables be ruled out, i.e., the analyst must take on the harder task of ruling out spurious relationships before treating the relationship as causal.
Find out more about spuriousness and causation in the article below.
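Here is a minimal simulated sketch of such a relationship. A and B are both generated from C and never from each other, yet they come out strongly correlated. One common way to check for this, used here as an illustration rather than as the only method, is a partial correlation: remove the part of A and B that C explains and correlate what is left. The helper names and all coefficients are assumptions made up for the example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def residuals(y, x):
    """What is left of y after subtracting a least-squares line fitted on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return [b - (mean_y + slope * (a - mean_x)) for a, b in zip(x, y)]


random.seed(1)
n = 5000

# C is the extraneous variable. A and B each depend on C, not on each other.
c = [random.gauss(0, 1) for _ in range(n)]
a = [2 * ci + random.gauss(0, 1) for ci in c]
b = [3 * ci + random.gauss(0, 1) for ci in c]

r_ab = pearson_r(a, b)                                      # looks like a strong link
r_ab_given_c = pearson_r(residuals(a, c), residuals(b, c))  # link after removing C

print(f"corr(A, B):               {r_ab:.2f}")
print(f"corr(A, B) controlling C: {r_ab_given_c:.2f}")
```

Once C is accounted for, the apparent association between A and B largely disappears, which is the signature of a spurious relationship.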
Spurious: Why You Need to Know What It Means
Spurious is a term used to describe a statistical relationship between two variables that would, at first glance…
Now that we understand the key points about correlation and causation, we can look at how they differ.
So, what’s the difference between correlation and causation?
Correlation and causation are often confused because the human mind likes to find patterns even when they do not exist. Even if there is a stable association between two variables, we cannot assume that one causes the other. And even if the correlation is strong, we cannot jump directly to causation without doing at least a randomized controlled experiment.
E.g., smoking is correlated with alcoholism, but it does not cause alcoholism.
This example shows that there is a correlation, but it is not causation.
In practice, however, it remains difficult to establish causation, compared with establishing correlation.
Understanding causation is a difficult problem. Looking at a correlation and jumping to bold claims without checking causation is the wrong approach. Unless and until causation can be clearly identified, we should assume that we are only seeing a correlation. The more confident you become at identifying true correlations and causation within your dataset, the better you will be in the data science domain.