5 Minute EDA: Correlation Heatmap

Aya Spencer
Published in 5 Minute EDA · 3 min read · Apr 9, 2022

A quick way to check for multicollinearity before running a regression

Background

When using data to build models, one of the most important steps (after cleaning and wrangling, of course) is to select which variables will serve as input features for your model. A correlation heatmap is a good first step to check whether any of the variables are strongly correlated with one another.

What is a correlation heatmap?

A correlation heatmap is a visual graphic that shows how each variable in the dataset is correlated with the others. Coefficients range from -1 to 1: -1 signifies a perfect negative correlation, 0 signifies no correlation, and 1 signifies a perfect positive correlation.
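To make that scale concrete, here is a minimal sketch (with made-up numbers) of the pairwise coefficients a heatmap visualizes, computed with pandas' built-in .corr():

import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # moves in lockstep with "a" -> coefficient of 1
    "c": [5, 4, 3, 2, 1],   # moves exactly opposite to "a" -> coefficient of -1
})
print(toy.corr())  # the matrix of pairwise correlations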

Correlation heatmaps are important because they help identify which variables may potentially result in multicollinearity, which would compromise the integrity of the model.

What is multicollinearity and why is it bad?

Multicollinearity happens when two or more features in a model are highly correlated with one another. For example, if I want to build a regression model to predict the likelihood of catching a particular disease and I include the inputs “history of disease A in the past 12 months” and “history of any disease in the past 12 months,” you can see how a positive outcome in the first input directly implies a positive outcome in the second. When this occurs, the integrity of the model is compromised: the two inputs are not independent of each other, so the model cannot distinguish the significance of either in predicting the target outcome.
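The heatmap below is a visual check, but a numeric complement is the variance inflation factor (VIF). Here is a minimal sketch with made-up data mimicking the disease example (statsmodels is not used elsewhere in this post):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# two nearly identical binary features, plus one unrelated feature
X = pd.DataFrame({
    "disease_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "any_disease": [1, 0, 1, 1, 0, 1, 1, 0],  # almost a copy of disease_a
    "age": [34, 52, 41, 29, 63, 45, 38, 57],
})
vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(dict(zip(X.columns, vifs)))  # values well above ~5-10 flag trouble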

Source & Method

Kaggle has a dataset about cars. It includes information such as the car’s brand, miles per gallon, and manufactured year. I used this as my base to run my heatmap.

Prepare Data

Let’s import the libraries we need and load the base data:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("cars_updated.csv")

I notice that some columns have leading and trailing spaces, so let me get rid of those first:

df = df.rename(columns=lambda x: x.strip())  # strip whitespace from every column name

Perfect.
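A quick way to sanity-check the cleanup (and to spot which columns are mistyped) is to print the names and dtypes:

print(df.columns.tolist())  # no more stray spaces in the names
print(df.dtypes)            # shows which columns are still typed as object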

I now notice that “cubicinches” and “weightlbs” are stored as objects, so I need to convert them to numeric:

df["cubicinches"]=pd.to_numeric(df["cubicinches"], errors='coerce')
df["weightlbs"]=pd.to_numeric(df["weightlbs"], errors='coerce')

Now on to the heatmap:

sns.set(style="white")

# correlation matrix of the numeric features only
# (numeric_only keeps text columns such as the brand out of the calculation; pandas >= 1.5)
corr = df.corr(numeric_only=True)

# mask the upper triangle so each pair is shown only once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0, square=True,
            linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title("Multi-Collinearity of Features")
plt.show()

Looking at this heatmap, I can see that “cylinders” and “cubicinches” are highly correlated. In fact, “cylinders,” “cubicinches,” “horsepower,” and “weightlbs” are all fairly correlated with each other. If my goal was to build a model to predict the brand of a car based on its features, I may have to remove one or more of these variables in order to optimize my model and prevent multicollinearity.

A correlation heatmap like this can be a very simple check to determine which features to keep and which ones to remove when building out a model.
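If you want to turn that eyeball test into code, one common pattern (a sketch, with an arbitrary 0.9 cutoff) is to scan the upper triangle of the correlation matrix and flag any feature whose absolute correlation with an earlier one exceeds the threshold:

import numpy as np

corr = df.corr(numeric_only=True).abs()
# keep only the strict upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # candidate features to remove before modeling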

This is part of my 5-minute EDA series, where I run quick exploratory data analysis on an interesting dataset. Thanks for reading!
