Power of correlation & causation in the age of data science

Prashant Singh

Published in Analytics Vidhya

5 min read · Apr 22, 2020


“I couldn’t claim that I was smarter than sixty-five other guys — but the average of sixty-five other guys, certainly!” — Richard P. Feynman

What is Correlation?

Correlation is a relationship in a bivariate system in which the two variables share both a direction and a degree of association.

In statistics, it is defined as the degree to which two variables are linearly related.

Correlation describes the relationship between two datasets and can be visualized using a scatter plot on which both variables are plotted.

In order to understand correlation on a deeper mathematical level, we need to understand covariance first.

The sign of the covariance gives the data a sense of direction, i.e. whether the data points on a scatter plot have an orientation or are simply spread randomly over the space.

Cov(x, y) = Σ(x − x_bar)(y − y_bar) / N

Here, 1) x, y are the datasets

2) x_bar, y_bar are the means of their respective datasets

3) N is the sample size

Covariance can carry either a (+)ve or a (-)ve sign: a positive sign corresponds to a positive slope for the dataset on a graph, while a negative sign corresponds to a negative slope.

Covariance is positive when [(x − x_bar) > 0 and (y − y_bar) > 0] or [(x − x_bar) < 0 and (y − y_bar) < 0], i.e. when both variables deviate from their means in the same direction.

Covariance is negative when [(x − x_bar) > 0 and (y − y_bar) < 0] or [(x − x_bar) < 0 and (y − y_bar) > 0], i.e. when they deviate in opposite directions.
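These sign rules are easy to check numerically. Below is a minimal sketch using NumPy; the two datasets are made up purely for illustration:

```python
import numpy as np

# Made-up datasets for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Covariance as the mean of the products of deviations from the means
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov uses the sample (N-1) denominator by default;
# bias=True switches to the population (N) denominator used above
cov_numpy = np.cov(x, y, bias=True)[0, 1]

# Each point's deviations share the same sign here,
# so the covariance comes out positive (5.0)
print(cov_manual, cov_numpy)
```

Flipping the order of `y` would make every pair of deviations disagree in sign and the covariance would turn negative.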

The problem with covariance is that its magnitude, which shows the strength of association, can lie anywhere in { −∞ to +∞ }, which makes the degree of association really tough to interpret, and this is where correlation comes into play.

Correlation is a normalized version of covariance: the covariance is divided by the product of the two standard deviations, so its range becomes {−1 to +1}, and this makes correlation a really powerful tool for the statistical analysis of data.
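That normalization is just a division by the product of the two standard deviations, which can be verified against NumPy's built-in `np.corrcoef` (the data below is again made up):

```python
import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Population covariance, then normalize by both standard deviations
cov = np.mean((x - x.mean()) * (y - y.mean()))
r = cov / (x.std() * y.std())

# Agrees with NumPy's built-in correlation coefficient,
# and is guaranteed to lie in [-1, +1]
print(r, np.corrcoef(x, y)[0, 1])
```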

How to use it?

Correlation is a tool used to quantify the association between datasets. It is a mathematical formula that can be implemented in different programming languages such as Python and C++.

Pearson Correlation Coefficient Formula:

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²][nΣy² − (Σy)²])

Where, n = Quantity of Information

Σx = Sum of the First Variable's Values

Σy = Sum of the Second Variable's Values

Σxy = Sum of the Products of the First & Second Values

Σx² = Sum of the Squares of the First Values

Σy² = Sum of the Squares of the Second Values

The value of the coefficient lies between −1 and +1. A coefficient of 0 means there is no linear relationship between the two variables (though they are not necessarily independent); any other value indicates some degree of linear association.
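The sum-based formula above translates almost line by line into plain Python. The toy numbers below are illustrative:

```python
import math

# Toy data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# The five sums the Pearson formula needs
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sx2 = sum(a * a for a in x)
sy2 = sum(b * b for b in y)

# r = [nΣxy − ΣxΣy] / sqrt([nΣx² − (Σx)²][nΣy² − (Σy)²])
r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 4))  # 0.7746
```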


Correlation Vs Causation

In a lot of cases, correlation and causation seem similar, but in reality they are different.

Correlation doesn't imply causation. A correlation between a pair of variables is easy to find compared with causation: it is easy to observe that one dataset fluctuates when the other varies, but it is arduous to say whether a causal relationship exists between the two.

Causation applies only to cases where variable H causes the outcome variable T; in the case of correlation, variable H is merely related to variable T.

How to interpret it? Common mistakes

People often confuse correlation with causation, but they are different terms with different meanings.

For example, consider three variables h, t, x:

Here, h - rate of smoking

t - increased risk of developing lung cancer

x - alcoholism

Variable h is a cause of the outcome variable t, so there is a causal relation between h & t. On the other hand, there is only a correlation between h & x: it is observed that people who smoke are often also fond of alcohol, but smoking is not a cause of alcoholism.

A small example is to find the correlations between the different parameters of a lung cancer dataset, where 0 represents ‘No lung cancer’ & 1 represents ‘Lung cancer present’.

correlation estimation between parameters using Python
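In place of the original notebook screenshots, here is a sketch of that estimation with pandas; the dataset is synthetic and the column names are hypothetical stand-ins for the lung cancer parameters:

```python
import pandas as pd

# Synthetic stand-in for a lung cancer dataset
# (0 = 'No lung cancer', 1 = 'Lung cancer present'; values are made up)
df = pd.DataFrame({
    "smoking":     [1, 1, 0, 1, 0, 0, 1, 0],
    "alcoholism":  [1, 1, 0, 1, 1, 0, 1, 0],
    "lung_cancer": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Pairwise Pearson correlation between every pair of columns
corr_matrix = df.corr()
print(corr_matrix)
```

Each cell of the resulting matrix is the Pearson coefficient for one pair of parameters; the diagonal is always 1, since every column is perfectly correlated with itself.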

Types of Correlation

Correlation is expressed using the correlation coefficient, which gives both the direction and the strength of the association between two datasets.

Correlation is broadly classified into 3 main types:

1) Positive Correlation

2) Negative Correlation

3) Zero(NO) Correlation

When one dataset increases with the other, i.e. both move in the same direction, the correlation between the two datasets is known as positive correlation. It is expressed by a positive correlation coefficient value.

When the correlation coefficient value is exactly ‘+1’, the correlation is said to be a perfect positive correlation.

When one dataset's value decreases while the other dataset's value increases, the correlation between the two datasets is known as negative correlation. It is expressed by a negative correlation coefficient value.

When the correlation coefficient value is exactly ‘−1’, the correlation is said to be a perfect negative correlation.

When the two datasets are completely independent of each other, there is no correlation between them, and the correlation coefficient comes out to be zero.
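The three types can be demonstrated with synthetic data; the generating relationships below are chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)

pos = 2 * x + rng.normal(0, 1, x.size)    # rises with x   -> r close to +1
neg = -2 * x + rng.normal(0, 1, x.size)   # falls with x   -> r close to -1
ind = rng.normal(0, 1, x.size)            # unrelated to x -> r close to 0

for name, y in [("positive", pos), ("negative", neg), ("zero", ind)]:
    print(name, round(np.corrcoef(x, y)[0, 1], 2))
```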

What to use?

The correlation coefficient gives great insight into the orientation of a dataset and the degree of association between two datasets. An exact functional dependency for a bivariate system can rarely be derived mathematically, and that is why correlation is so important for finding association between data.

Correlation also varies with other properties of the data, such as the spread of the points and the randomness (entropy) of the dataset. The more the data points scatter about the line that best fits them, the smaller the magnitude of the correlation. Similarly, correlation decreases toward zero as the entropy of the dataset increases: as the data becomes more and more random, the correlation coefficient approaches zero.

The correlation coefficient is directly related to the slope of the line that best fits the data points: the sign of the slope and the sign of the correlation coefficient are always the same.
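This sign agreement follows from the least-squares slope being equal to r·(s_y / s_x), where s_x and s_y are the two standard deviations. A quick check with NumPy, using made-up, negatively related data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 100)
y = -3 * x + rng.normal(0, 1, 100)  # negatively related data

slope = np.polyfit(x, y, 1)[0]      # slope of the least-squares line
r = np.corrcoef(x, y)[0, 1]

# slope = r * (std_y / std_x), so slope and r always share a sign
print(np.sign(slope) == np.sign(r), slope < 0, r < 0)
```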

Causal relationships between datasets occur far more often than exact functional dependencies.

About the writer:

Prashant Singh: a data science enthusiast who enjoys analyzing data, creating discernible graphs, and logging insights for business and research purposes.

References

1) https://en.wikipedia.org/wiki/Correlation_and_dependence

2) https://www.statisticssolutions.com/correlation-pearson-kendall-spearman/

3) https://byjus.com/maths/correlation/

4) https://amplitude.com/blog/2017/01/19/causation-correlation

