Python Data Visualization — Heatmaps

Andy Luc
4 min read · Apr 30, 2019

Whether you are presenting in front of 500 students or 5 executives of a large corporation, data visualization is an important aspect of any career. The basic concept is to create a visual representation of a dataset taken from a table, an Excel file, or even a simple survey at a local event. Being able to see possible trends and correlations in the data will clearly illustrate the purpose or objective of the presentation.

Python libraries offer a myriad of ways to represent data visually, but I will focus on heatmaps. Heatmaps are a great way to spot collinearity in the data and help decide which columns should or should not be included in your analysis. If the objective is to build several predictive models, a heatmap will help you filter out independent (predictor) variables that are collinear with one another. Choosing a dependent (target) variable will be a factor, as you will see later on.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('cars.csv')
Pandas DataFrame for cars.csv

First, we import the necessary Python libraries and the data set; in this case, ‘cars.csv’. The accompanying table shows what type of data the .csv file contains. There are 5076 rows, each with values for the 8 variable columns. I took the head (first 5 rows) and tail (last 5 rows) of the pandas DataFrame for demonstration purposes. Heatmaps are based on numerical values, and Model Year was non-numerical, so it had to be dropped from the data set. If Model Year turns out to be meaningful, it can be incorporated later in the summary of results.
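The preview and the column drop described above are not shown in the original code, so here is a minimal sketch of how they might look. The small DataFrame and all of its values are invented stand-ins for cars.csv (only the column names come from this article), and showing 2 rows from each end instead of 5 is just to suit the toy data.

```python
import pandas as pd

# Invented toy rows standing in for cars.csv; the values are placeholders.
cars = pd.DataFrame({
    "Model Year": ["2009 Model A", "2010 Model B", "2011 Model C",
                   "2012 Model D", "2012 Model E", "2012 Model F"],
    "City mpg":   [30, 25, 21, 18, 15, 13],
    "Horsepower": [100, 150, 200, 250, 300, 350],
    "Torque":     [110, 160, 205, 260, 310, 355],
    "Length":     [180, 170, 190, 175, 185, 172],
})

# Stack the first and last rows into one preview table.
preview = pd.concat([cars.head(2), cars.tail(2)])

# Drop the text column so that only numeric columns reach the correlation step.
numeric = cars.drop(columns=["Model Year"])

print(preview)
print(numeric.dtypes)
```

If the real file has more than one text column, `cars.select_dtypes(include="number")` is an alternative that keeps every numeric column without naming the ones to drop.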

Now that we have the data set up, let’s move on to the fun stuff. Using Matplotlib, we will set up a figure for the chart with a size of (10,6). Before the next step, I will note that Matplotlib can produce a heatmap on its own. However, Seaborn is built on top of Matplotlib, and using it gives a more visually appealing result from a much simpler single line of Python.

fig, ax = plt.subplots(figsize=(10,6))
sns.heatmap(data.corr(), center=0, cmap='Blues', ax=ax)
ax.set_title('Multi-Collinearity of Car Attributes')
Multi-Collinearity Heatmap for cars.csv

The heatmap above represents the collinearity of the multiple variables in the dataset. data.corr() was used in the code to compute the pairwise correlation between the columns. This is where we want to set our dependent, or target, variable. Let’s set our target variable to ‘City mpg.’ We want to find out how all of the other variables affect a car’s miles per gallon in the city. Looking at the blue heatmap, the focus should be on the darkest and lightest areas. Dark blue represents a positive correlation, while the lightest shades, close to white, represent a negative correlation. It is also expected that the diagonal is the darkest, with a correlation of exactly 1, since every variable is perfectly correlated with itself (Torque with Torque, Length with Length, etc.).
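To read the same information as numbers rather than colors, one option is to pull the target's column out of the correlation matrix. The sketch below assumes a toy DataFrame with invented values (not the real cars.csv), so the correlations it prints will not match the ones in the figures.

```python
import pandas as pd

# Invented toy values; only the column names come from the article.
cars = pd.DataFrame({
    "City mpg":   [30, 25, 21, 18, 15, 13],
    "Horsepower": [100, 150, 200, 250, 300, 350],
    "Torque":     [110, 160, 205, 260, 310, 355],
    "Length":     [180, 170, 190, 175, 185, 172],
})

target = "City mpg"

# Take the target's column of the correlation matrix, remove the trivial
# self-correlation, and sort from most negative to most positive.
corr_with_target = cars.corr()[target].drop(target).sort_values()

print(corr_with_target)
```

Values near -1 or 1 mark variables that move strongly with the target, while values near 0 mark variables with little linear relationship to it.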

Before analyzing further, there are a couple of things we can do to make this heatmap clearer. Changing the color map and adding annotations, which are the actual correlation values, makes it easier to draw conclusions and decide what actions to take. With City mpg still our dependent variable, we can see in the map below that there is little to no correlation between the length of a car and mpg (-0.015), though a high correlation between Horsepower and Torque (-0.97, -0.98). Features that are highly correlated with each other are close to linearly dependent and hence have almost the same effect on the dependent variable. So, when two features are highly correlated, we can drop one of the two. What happens now? Since City mpg is our target variable, we won’t drop that. However, we can drop either Horsepower or Torque and still produce a fairly accurate prediction model.

fig, ax = plt.subplots(figsize=(10,6))
sns.heatmap(data.corr(), center=0, cmap='BrBG', annot=True, ax=ax)
Multi-Collinearity Heatmap adding color and annotation attributes
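The idea of dropping one feature from each highly correlated pair can also be done programmatically. The following is a rough sketch, not the method used for the figures: the toy values are invented, and the 0.9 cutoff is an assumed threshold that you would tune for your own data.

```python
import numpy as np
import pandas as pd

# Invented toy values; only the column names come from the article.
cars = pd.DataFrame({
    "City mpg":   [30, 25, 21, 18, 15, 13],
    "Horsepower": [100, 150, 200, 250, 300, 350],
    "Torque":     [110, 160, 205, 260, 310, 355],
    "Length":     [180, 170, 190, 175, 185, 172],
})

target = "City mpg"
threshold = 0.9  # assumed cutoff for "highly correlated"

# Compare the features with each other only; the target is never a drop candidate.
features = cars.drop(columns=[target])
corr = features.corr().abs()

# Keep the upper triangle (k=1 skips the diagonal) so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = cars.drop(columns=to_drop)

print("Dropping:", to_drop)
print("Remaining columns:", list(reduced.columns))
```

With these toy numbers, Torque is dropped because it comes after Horsepower in the column order; which of the two you keep in practice is a judgment call.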

Heatmaps come in many forms, and multi-collinearity is just one use for them. I have included here an example of a density heatmap in purple. The data was extracted from a file named ‘flights.csv’, which included the month, year, and number of passengers. This one is more self-explanatory: the year and month together determine the number of passengers shown.
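The code behind the density heatmap is not included in the article, so here is a hedged sketch of the reshaping step it would likely need: turning long rows of (year, month, passengers) into a month-by-year grid. The lowercase column names and all values below are assumptions for illustration, not the contents of flights.csv.

```python
import pandas as pd

# Invented toy rows in "long" format: one row per (year, month) pair.
flights = pd.DataFrame({
    "year":       [1949, 1949, 1949, 1950, 1950, 1950],
    "month":      ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "passengers": [100, 110, 120, 105, 115, 130],
})

month_order = ["Jan", "Feb", "Mar"]

# Pivot to a grid with months as rows and years as columns, then restore
# calendar order, since pivot sorts the month labels alphabetically.
grid = (
    flights.pivot(index="month", columns="year", values="passengers")
           .reindex(month_order)
)

print(grid)
```

Passing the resulting grid to sns.heatmap(grid, cmap='Purples') should then draw one colored cell per month and year; the 'Purples' color map is a guess based on the description of the figure.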

To summarize, heatmaps provide a great visual when comparing multiple variables and the relationships between them. They can help in many ways, but understanding how to incorporate them into your project is the key!

Andy Luc

Data Scientist | MLE | Business Analyst who has a passion in teaching Taekwondo to toddlers and young adults