Finding Correlation Between Many Variables (Multidimensional Dataset) with Python

Sebastian Norena
3 min read · Apr 26, 2018


Image from Wikipedia: https://upload.wikimedia.org/wikipedia/en/7/78/Correlation_plots_of_double_knockout_distribution_across_subsystems_in_M.tb_and_E.coli.png

In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. Correlation is any of a broad class of statistical relationships involving dependence.

In common usage it most often refers to how close two variables are to having a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a limited supply product and its price.

Formally, random variables are dependent if they do not satisfy the mathematical property of probabilistic independence. In informal parlance, correlation is synonymous with dependence. However, when used in a technical sense, correlation refers to any of several specific types of relationship between mean values. There are several correlation coefficients, often denoted “ρ” (rho) or “r”, measuring the degree of correlation.
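As a quick illustration of what a correlation coefficient measures, the sketch below builds two synthetic variables with a strong linear relationship and computes Pearson's r with NumPy (the data here is made up for the example; it is not the article's dataset):

```python
import numpy as np

# Two synthetic variables with a strong positive linear relationship
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

Because y is mostly a scaled copy of x plus a little noise, r comes out close to +1.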

As datasets grow to include more variables, finding the correlations between them becomes harder to do by hand. Fortunately, Python makes the process very easy, as in the example below, where I will find the correlations in a dataset with the following 19 columns (features/attributes) and 1000 rows (samples/observations/instances):

[‘arrow’, ‘under’, ‘interior’, ‘theta’, ‘amb’, ‘slice’, ‘delta’, ‘pi’, ‘height’, ‘nu’, ‘night’, ‘dataset’, ‘length’, ‘twi’, ‘x’, ‘wind’, ‘y’, ‘rho’, ‘alpha’]
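Before computing correlations it is worth confirming the shape and column names of the data. A minimal sketch, using a small synthetic DataFrame as a stand-in for the article's dataset (the real file has 19 columns; only four are mocked here):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's dataset: 1000 rows,
# a few numeric columns named after the real features
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 4)),
                  columns=['arrow', 'under', 'interior', 'theta'])

print(df.shape)          # (rows, columns)
print(list(df.columns))  # feature names
```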

The “corr()” method computes the pairwise correlation between all the features; the result can then be graphed with color coding:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset and compute the pairwise correlation matrix
data = pd.read_csv('https://www.dropbox.com/s/4jgheggd1dak5pw/data_visualization.csv?raw=1', index_col=0)
corr = data.corr()

# Plot the matrix as a color-coded grid
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)

# Label both axes with the feature names
ticks = np.arange(0, len(data.columns), 1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
plt.show()
Correlation between variables of the dataset

In this example, when there is no correlation between two variables (when the correlation is 0 or near 0) the color is gray. The darkest red means there is a perfect positive correlation, while the darkest blue means there is a perfect negative correlation.

When evaluating the correlation between all the features, the “corr()” method includes the correlation of each feature with itself, which is always 1; that is why this type of graph always has a red diagonal from the upper left to the lower right. The rest of the squares show the correlations between different features, making it really easy to see that “wind” and “arrow” are highly correlated, “height” and “slice” are highly correlated, and “nu” and “theta” have a correlation of about 0.5.
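Reading highly correlated pairs off the matrix can also be done programmatically. A sketch of one common approach, shown here on synthetic data with two deliberately correlated column pairs standing in for the article's dataset: mask out the all-ones diagonal and the redundant lower triangle, then filter by a threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data with two deliberately correlated column pairs
rng = np.random.default_rng(1)
n = 1000
wind = rng.normal(size=n)
df = pd.DataFrame({
    'wind': wind,
    'arrow': wind + rng.normal(scale=0.1, size=n),  # nearly identical to wind
    'height': rng.normal(size=n),
})
df['slice'] = df['height'] * 0.9 + rng.normal(scale=0.3, size=n)

corr = df.corr()

# Keep only the upper triangle (k=1 drops the all-ones diagonal),
# then list pairs whose absolute correlation exceeds a threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.8].sort_values(ascending=False)
print(strong)
```

The 0.8 threshold is arbitrary; tune it to how strict a notion of "highly correlated" the analysis needs.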

Conclusion: the “corr()” method is very easy to use and very powerful in the early stages of data analysis (data preparation). By graphing its results with Matplotlib or any other Python plotting utility, you get a better picture of the data, so you can make informed decisions for the next steps of data preparation and analysis.
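One closing note: “corr()” defaults to Pearson correlation, which only measures linear association; pandas also accepts method='spearman' (rank correlation) and method='kendall'. A small sketch on synthetic data showing why the choice matters for a monotonic but non-linear relationship:

```python
import numpy as np
import pandas as pd

# y = exp(x) is perfectly monotonic in x but not linear
rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5, size=1000)
df = pd.DataFrame({'x': x, 'y': np.exp(x)})

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']
print(round(pearson, 3), round(spearman, 3))
```

Spearman reports a perfect monotonic association here, while Pearson is noticeably lower because the relationship is curved.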

This work is licensed under the Creative Commons Attribution 3.0 License.

