Correlation in Python: Finding the Statistical Relationship Between Variables

Roi Polanitzer
5 min read · Oct 16, 2021
Photo Credit: Corporate Finance Institute

Correlation

The variance and standard deviation measure the dispersion, or volatility, of only one variable. In many finance situations, however, we are interested in how two random variables move in relation to each other. For investment applications, one of the most frequently analyzed pairs of random variables is the returns of two assets. Investors and managers frequently ask questions such as, “what is the relationship between the return of Stock A and Stock B?” or “what is the relationship between the performance of the S&P 500 and that of the wireless communication industry?” As you will soon see, the correlation is a measure that provides useful information about how two random variables, such as asset returns, are related.

The correlation coefficient, or correlation, is the covariance of two random variables divided by the product of the random variables’ standard deviations. The relationship between covariances, standard deviations, and correlations can be seen in the following expression for the correlation of the returns for assets i and j:

Corr(Ri,Rj) = Cov(Ri,Rj) / (σ(Ri) × σ(Rj))

The correlation between two random return variables may also be expressed as ρ(Ri,Rj), or ρi,j.
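
As a quick numerical illustration of this formula (a minimal sketch with made-up return numbers, not data from this article; the arrays r_i and r_j are hypothetical):

import numpy as np

# Hypothetical daily returns for two assets (illustrative values only)
r_i = np.array([0.01, -0.02, 0.015, 0.03, -0.005])
r_j = np.array([0.012, -0.018, 0.010, 0.025, -0.002])

# Corr(Ri,Rj) = Cov(Ri,Rj) / (sigma_i * sigma_j)
cov_ij = np.cov(r_i, r_j)[0, 1]                        # sample covariance
corr_ij = cov_ij / (r_i.std(ddof=1) * r_j.std(ddof=1)) # sample standard deviations

print(corr_ij)  # matches np.corrcoef(r_i, r_j)[0, 1]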

Properties of correlation include:

  1. Correlation measures the strength of the linear relationship between two random variables.
  2. Correlation has no units.
  3. The correlation ranges from -1 to +1. That is, -1 ≤ Corr(Ri,Rj) ≤ +1.
  • If Corr(Ri,Rj) = 1.0, the random variables have perfect positive correlation. This means that movement in one random variable results in a proportional positive movement in the other relative to its mean.
  • If Corr(Ri,Rj) = -1.0, the random variables have perfect negative correlation. This means that a movement in one random variable results in an exact opposite proportional movement in the other relative to its mean.
  • If Corr(Ri,Rj) = 0, there is no linear relationship between the variables. That doesn’t mean there is no relationship at all between the variables; a non-linear relationship (e.g., a quadratic relationship, a square-root relationship, a quasi-linear relationship, etc.) may still exist, as the sketch after this list shows.

  4. Causality cannot be assumed even if the correlation between variables is high. In other words, if X and Y are positively correlated, it is not necessarily the case that X causes Y.
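
To make the Corr(Ri,Rj) = 0 case concrete, here is a minimal sketch (my own illustration, not from the original article) of two variables with an exact non-linear relationship whose linear correlation is essentially zero:

import numpy as np

# y is fully determined by x through a quadratic relationship,
# yet the linear correlation is numerically zero because the
# relationship is symmetric around x = 0
x = np.linspace(-1, 1, 1001)
y = x ** 2

print(np.corrcoef(x, y)[0, 1])  # ~0.0 despite the perfect non-linear link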

The interpretation of the possible correlation values is summarized in the following table:

It is possible to use these general interpretations of the correlation coefficient to describe the degree of the correlation that is apparent in a scatter plot of two variables. Below are several scatter plots and the corresponding interpretation of correlation.
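
The original scatter-plot figures are not reproduced here, but a comparable set can be generated with the sketch below (my own illustration; it assumes matplotlib is available and draws bivariate normal samples at a few target correlation levels):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# One scatter plot per target correlation level
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, rho in zip(axes, [0.9, 0.5, 0.0, -0.9]):
    cov = [[1.0, rho], [rho, 1.0]]  # unit variances, correlation rho
    x, y = rng.multivariate_normal([0, 0], cov, size=300).T
    ax.scatter(x, y, s=10)
    ax.set_title(f"rho = {rho}")
plt.show()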

Data

The dataset was cooked (i.e., synthetically constructed) so that we can see some interesting correlations.

Let’s load the relevant libraries

import pandas as pd
import seaborn as sb

Then, let’s load the correlation file

df = pd.read_csv("correlation.csv")
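
If you don’t have the correlation.csv file, a dataset with the same flavor can be cooked directly. The sketch below is my own assumption about the column structure, reverse-engineered from the correlations discussed later (B and C strongly related, E driven by B and C, F driven by D, A independent):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.normal(size=n)                # independent of everything else
B = rng.normal(size=n)
C = B + 0.2 * rng.normal(size=n)      # strongly correlated with B
D = rng.normal(size=n)
E = B + C + 0.5 * rng.normal(size=n)  # driven by B and C
F = D + 0.2 * rng.normal(size=n)      # strongly correlated with D
df = pd.DataFrame({"A": A, "B": B, "C": C, "D": D, "E": E, "F": F})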

Data exploration

df.head()

Let’s use Pandas’ corr function to see all the pairwise correlations between the features:

df.corr()

With only 2 or 3 features it is easy to spot the high correlations by eye, but with many features it becomes hard. Here we can see that the correlation between C and B is very high (0.980161). The diagonal is, of course, all 1s, since every variable is perfectly correlated with itself.
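
When the matrix is large, we can also rank the pairwise correlations programmatically instead of scanning by eye (a sketch of my own, assuming numpy is available; drop_duplicates removes the mirrored copies):

import numpy as np

# Rank the off-diagonal correlations from strongest to weakest
corr = df.corr().abs()
off_diag = corr.where(~np.eye(len(corr), dtype=bool))  # NaN out the diagonal
pairs = off_diag.unstack().dropna()
print(pairs.sort_values(ascending=False).drop_duplicates().head())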

Visualizations

In order to find the high correlations at a glance, we can use Seaborn’s heatmap function:

sb.heatmap(df.corr(), cmap="YlGnBu")

In this particular heatmap, with this color map, a dark color means that the correlation is very high. The diagonal is, of course, very dark, and we can see that the cell between C and B is also dark blue, which means they are highly correlated. E also has very high correlations with B and C (93% and 91%, respectively), and the correlation between F and D is also very high (98%).

Informative note: Notice the symmetry of the correlation matrix; the correlation between B and E is a mirror image of the correlation between E and B. So, apart from the diagonal (which is dark blue), every value in the heatmap appears twice.
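
Because of this symmetry, a common trick (not shown in the original article; a sketch assuming numpy is available) is to mask the upper triangle so each correlation is drawn only once:

import numpy as np

# True on and above the diagonal -> those cells are hidden
mask = np.triu(np.ones_like(df.corr(), dtype=bool))
sb.heatmap(df.corr(), mask=mask, cmap="YlGnBu", annot=True)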

We can also create a heatmap where the high correlations are painted in red, and add annotations, that is, the correlation numbers themselves:

sb.heatmap(df.corr(),cmap="RdBu_r", annot=True)

Now, we will run a pairplot, which takes every pair of variables and shows a scatter plot of one against the other, with histograms on the diagonal:

sb.pairplot(df)

We can change the resolution of the diagonal histograms by changing the number of bins:

sb.pairplot(df, diag_kws={'bins':30})
sb.pairplot(df, diag_kws={'bins':100})
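
As with the heatmap, the off-diagonal scatter panels come in mirrored pairs. Assuming a reasonably recent Seaborn version, corner=True drops the redundant upper half:

# Show each pair of variables only once (lower triangle of panels)
sb.pairplot(df, corner=True)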

Finally, we will use Pandas’ describe function to show the summary statistics of the numeric variables.

df.describe()

The count, mean, min and max rows are self-explanatory. The std shows the standard deviation, and the 25%, 50% and 75% rows show the corresponding percentiles.

Use case

Say the variable we are going to predict is “E”. Let’s look at how strongly each independent variable correlates with this dependent variable:

corr_matrix = df.corr()
corr_matrix["E"].sort_values(ascending=False)

The variable “E” tends to increase when the variables “B” and “C” go up. There is a small positive correlation between the variables “D” and “F” and the variable “E”. Finally, a coefficient close to zero indicates that there is no linear correlation between the variable “A” and the variable “E”.
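
If the goal is to shortlist predictors for “E”, one simple follow-up (my own sketch, with an arbitrary cutoff) is to filter the independent variables on absolute correlation:

# Keep independent variables whose absolute correlation with "E"
# exceeds a chosen threshold (0.5 is an arbitrary choice here)
threshold = 0.5
candidates = corr_matrix["E"].drop("E").abs()
selected = candidates[candidates > threshold].index.tolist()
print(selected)  # given the correlations above, we expect ["B", "C"]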

Roi Polanitzer

Chief Data Scientist at Prediction Consultants — Advanced Analysis and Model Development. https://polanitz8.wixsite.com/prediction/english