Deep Dive in Machine Learning with Python

Part — XII: Data Visualization — III

Rajesh Sharma
Analytics Vidhya
Feb 7, 2020


Welcome to another blog of Deep Dive in Machine Learning with Python. In the last blog, we worked with the Gapminder dataset to understand data distributions, histogram plots, and density plots. In today’s blog, we will focus on heatmaps, which use color coding to give an at-a-glance view of numeric values. We will also explore pair plots, which let us examine the relationships between multiple variables.

In today’s blog, we will continue to use the Gapminder dataset. At the end of this blog, I’ll also share one bonus tip related to data visualization.

Thanks to JW

Import the necessary Python libraries

Required libraries
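A minimal sketch of the imports this post relies on, assuming the conventional aliases (pd, np, plt and sns). The exact set of libraries in the original notebook may differ:

```python
import numpy as np                 # numeric helpers
import pandas as pd                # DataFrame handling
import matplotlib.pyplot as plt    # base plotting API
import seaborn as sns              # statistical plots (heatmap, pairplot)
```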

Import the dataset

We will import the dataset from a CSV file (i.e., gapminder.csv) and create a Pandas DataFrame.
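As a rough sketch of this step, the snippet below reads CSV text into a DataFrame. So that it runs without the real file, it uses a three-row, made-up stand-in for gapminder.csv. The country and year columns and all of the values are assumptions for illustration only; the other column names are the ones used later in this post.

```python
import io
import pandas as pd

# Stand-in for gapminder.csv: three invented rows, NOT real Gapminder values.
# With the actual file you would call pd.read_csv("gapminder.csv") instead.
sample_csv = io.StringIO(
    "country,year,region,life_expectancy,age5_surviving,babies_per_woman,gdp_per_day\n"
    "Country A,2000,Europe,75.0,99.0,1.8,40.0\n"
    "Country B,2000,Asia,,92.5,3.1,6.5\n"
    "Country C,2000,America,68.2,95.0,2.6,12.0\n"
)

gapminder = pd.read_csv(sample_csv)

print(gapminder.shape)   # (number of rows, number of columns)
print(gapminder.head())  # first few rows of the DataFrame
```

Note that the empty life_expectancy field in the second row is read as NaN, which is how missing (NULL) values show up in Pandas.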

Data read from CSV file

Problem-1: What are Heatmaps and when or why to use them?

When I first heard the term “heatmap”, my immediate thought was of weather forecasting. If you are thinking along the same lines, you are right: heatmaps are used to display the “hot” and “cold” zones on a map.

By definition, heatmaps are graphical representations of data that draw the user’s attention to the areas that matter most. They support several color-coding schemes, which makes it possible to visualize a large dataset at a glance. So, if you want to find the correlation among the variables of a dataset, a heatmap is a good choice.

Problem-2: How to plot the correlation heatmap using the Gapminder dataset?

Correlation Heatmap

Here, I have used the Seaborn library (imported as sns) and provided the correlation matrix as the input to the heatmap.

Correlation matrix
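To make the idea of a correlation matrix concrete, here is a small sketch using Pandas only. The four rows are invented so that the result is easy to predict: one column rises linearly with life_expectancy and one falls linearly. Selecting the numeric columns first is my own precaution, since text columns such as region cannot be correlated.

```python
import pandas as pd

# Invented numbers, chosen so the expected correlations are obvious.
df = pd.DataFrame({
    "life_expectancy":  [50.0, 60.0, 70.0, 80.0],
    "age5_surviving":   [80.0, 85.0, 90.0, 95.0],       # rises with life_expectancy
    "babies_per_woman": [6.0, 5.0, 4.0, 3.0],           # falls with life_expectancy
    "region": ["Africa", "Asia", "America", "Europe"],  # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr_matrix = df.select_dtypes("number").corr()
print(corr_matrix.round(2))
```

Values close to 1 indicate that two variables move together, and values close to -1 indicate that they move in opposite directions.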

As you can see, the heatmap simply represents the above matrix in the “coolwarm” color-coding scheme. You can use various other schemes, which I’ll talk about later. Besides the colormap (the cmap parameter), I used the following parameters:

annot : If True, write the data value in each cell.
linewidths : Width of the lines that will divide each cell.
linecolor : Color of the lines that will divide each cell. It also accepts Matplotlib’s single-letter color codes such as 'r', 'g', 'b' and 'k'.
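The parameters above can be sketched in one call to sns.heatmap. This is only an illustration under stated assumptions: the data is randomly generated as a stand-in for the numeric Gapminder columns, and the line width, line color and figure size are my own choices rather than the values from the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the numeric Gapminder columns (random, not real data)
rng = np.random.default_rng(0)
life = rng.uniform(40, 85, size=50)
df = pd.DataFrame({
    "life_expectancy": life,
    "age5_surviving": 60 + 0.45 * life + rng.normal(0, 2, size=50),
    "babies_per_woman": 9 - 0.08 * life + rng.normal(0, 0.5, size=50),
})

corr_matrix = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(
    corr_matrix,
    annot=True,       # write the correlation value in each cell
    cmap="coolwarm",  # diverging scheme: blue for low values, red for high
    linewidths=1,     # width of the lines that divide the cells
    linecolor="k",    # black dividing lines
    ax=ax,
)
fig.tight_layout()
```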

Colormaps

You can use several other color schemes, such as ‘inferno’, ‘viridis’, ‘winter’ and ‘summer’ (note that these colormap names are lowercase when passed to cmap).
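Switching the scheme only requires a different cmap value. The sketch below draws the same matrix twice, side by side, once per colormap. The four placeholder columns (a to d) hold random numbers and are assumptions made purely so the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder data: four random columns, used only to have a matrix to draw
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(40, 4)), columns=["a", "b", "c", "d"])
corr_matrix = df.corr()

# One heatmap per colormap, drawn on neighbouring axes
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, cmap_name in zip(axes, ["inferno", "viridis"]):
    sns.heatmap(corr_matrix, annot=True, cmap=cmap_name, ax=ax)
    ax.set_title(cmap_name)
fig.tight_layout()
```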

Inferno
Viridis

Bonus Tip — Heatmaps

Looking at the above plots, you might be wondering why I explicitly set the y-axis limits. The reason is a bug in the version of Seaborn that I’m using, which truncates the top and bottom rows of the heatmap. Refer to the plot below:

Image Truncating Issue

Hence, to get rid of this issue, I explicitly set the top and bottom limits of the y-axis.
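One possible way to apply this workaround is sketched below. It is not necessarily the exact code from the original notebook. A heatmap's y-axis runs from the number of rows at the bottom to 0 at the top, so setting those limits explicitly restores the full height of the first and last rows. This kind of truncation has been widely reported for Seaborn heatmaps drawn with Matplotlib 3.1.1; on unaffected versions the call simply leaves the plot unchanged. The data here is random placeholder data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Placeholder data: three random columns, used only to have a matrix to draw
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])
corr_matrix = df.corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", ax=ax)

# Explicit limits: bottom edge at n_rows, top edge at 0 (the axis is inverted)
n_rows = corr_matrix.shape[0]
ax.set_ylim(n_rows, 0)
```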

Problem-3: How to visualize the NULL values in the variables by using heatmaps?

Visualizing NULL values

Here, the heatmap is used to highlight the cells whose values are NULL.
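A common way to do this, shown here as a sketch and not necessarily the original notebook's code, is to pass the Boolean result of isnull() to sns.heatmap. Each missing cell then appears as a contrasting stripe. The data below is randomly generated, with five missing values injected on purpose. Hiding the color bar and the row labels is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data with a few missing values injected on purpose
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "life_expectancy": rng.uniform(40, 85, size=20),
    "babies_per_woman": rng.uniform(1, 7, size=20),
})
df.loc[[2, 3, 4, 11, 15], "life_expectancy"] = np.nan

null_mask = df.isnull()   # True where a value is missing
print(null_mask.sum())    # number of NULL values per column

fig, ax = plt.subplots(figsize=(4, 5))
sns.heatmap(null_mask, cbar=False, yticklabels=False, cmap="viridis", ax=ax)
```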

NULL values in Variables

So, the highlighted lines in the heatmap represent the 9929 NULL values in the “life_expectancy” column.

Problem-4: What are pair plots and why are they used?

A pair plot represents the pairwise relationships in a dataset. It creates a grid of Axes such that each variable in the data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

Problem-5: How can we create a pair plot of ‘life_expectancy’, ‘age5_surviving’ and ‘babies_per_woman’, with the data colored by ‘region’?

KDE Pair Plot
HIST Pair Plot

Here, we created KDE and histogram pair plots that depict the relationships among the variables.

Bonus Tip — Pair Plot

In many cases, you may want to draw a vertical line across the axes to separate ranges of data values. Matplotlib’s axvline does exactly this: it adds a vertical line at a given x position, which can serve as a level marker in your plot.

Plot_Data Function

In this function, I’m creating a scatter plot of gdp_per_day against life_expectancy. I also added three levels, at gdp_per_day values of 4, 16 and 64.
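A function along these lines might look like the sketch below. The name plot_data, its levels argument, the dotted line style and the logarithmic x-axis are all assumptions on my part, and the data is randomly generated so that the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_data(df, levels=(4, 16, 64)):
    """Scatter gdp_per_day against life_expectancy and mark the given levels."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.scatter(df["gdp_per_day"], df["life_expectancy"], s=12, alpha=0.6)
    for level in levels:
        # One vertical line per level, spanning the full height of the axes
        ax.axvline(x=level, color="k", linestyle=":")
    ax.set_xscale("log")  # assumption: income is easier to read on a log axis
    ax.set_xlabel("gdp_per_day")
    ax.set_ylabel("life_expectancy")
    return ax

# Random stand-in data, not real Gapminder values
rng = np.random.default_rng(5)
sample = pd.DataFrame({
    "gdp_per_day": 2 ** rng.uniform(0, 7, size=100),
    "life_expectancy": rng.uniform(40, 85, size=100),
})

ax = plot_data(sample)
```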

Scatter plots with levels

Congratulations, we have come to the end of this blog. To summarize, we built an understanding of heatmaps and pair plots. In the next blog, we will start with EDA concepts.

If you want to download the Jupyter Notebook for this blog, you can find it in the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Thank you and happy learning!!

Blog-13: Converse with the data
