Deep Dive in Machine Learning with Python

Part — XI: Data Visualization — II

Published in Analytics Vidhya, Feb 6, 2020 (5 min read)

Welcome to another blog of Deep Dive in Machine Learning with Python. In the last blog, we gained a better understanding of the Gapminder dataset by plotting several charts such as bar, scatter, line, and others. In today’s blog, we will visualize data distributions using Pandas and data visualization libraries.

We will continue to use the Gapminder dataset in today’s blog. As in the previous blog, we will also focus on the options that make your plots more attractive.

Import the necessary python libraries
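
A minimal sketch of the imports, assuming the rest of this blog needs only NumPy, Pandas and Matplotlib (the original notebook may import other plotting libraries as well):

```python
# Assumed imports for this blog: NumPy, Pandas and Matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```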

Import the dataset

We will import the dataset from a CSV file (i.e., gapminder.csv) and create a Pandas DataFrame.
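
Loading the file could look roughly like the sketch below. The helper name load_gapminder is hypothetical, and the code assumes gapminder.csv sits in the working directory; adjust the path to wherever you saved the file.

```python
from pathlib import Path

import pandas as pd


def load_gapminder(path="gapminder.csv"):
    """Read the Gapminder CSV into a Pandas DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Only load and preview the data if the file is actually present.
if Path("gapminder.csv").exists():
    df = load_gapminder()
    print(df.shape)   # (number of rows, number of columns)
    print(df.head())  # first five rows
```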

Problem-1: What are histogram plots?

A histogram is a plot that shows the underlying frequency distribution of the data. Histogram plots allow us to inspect the shape of the distribution (e.g. a normal, bell-shaped distribution), outliers, skewness, etc.

It differs from a bar graph in the following way:

A bar graph relates two variables, but a histogram relates only one.
To construct a histogram, the first step is to “bin” the range of values (i.e., divide the entire range of values into a series of intervals), and then count how many values fall into each interval.

If the bins are of equal size, a rectangle is drawn over each bin with a height proportional to the frequency (i.e., the number of cases in that bin).
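
The bin-and-count step can be sketched with NumPy. The ten values below are made up purely for illustration and are not taken from the Gapminder dataset.

```python
import numpy as np

# Ten made-up values, used only to illustrate binning.
values = np.array([1, 2, 2, 3, 5, 7, 8, 8, 9, 10])

# Split the range [1, 10] into 3 equal-width bins and count the values per bin.
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin edges: 1, 4, 7, 10
print(counts)  # values per bin: 4, 1, 5
```

Each count becomes the height of one rectangle in the histogram. Note that the last bin includes its right edge, so the value 10 is counted in it.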

Problem-2: How to visualize the distribution of ‘Life_Expectancy’?

In the above cell, we created a ‘step’ histogram of ‘life_expectancy’. If you look closely at the graph, you will see the major grid lines.

From this graph, we find that the average life expectancy in the Gapminder dataset is 71 years.

You can use different line styles such as “-”, “--”, “-.”, and “:”. I also defined a dictionary (bar_font) containing the customized font details that I want to use in the labels.
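
A step histogram along these lines could be sketched as follows. The function name plot_step_hist, the bin count and the values inside bar_font are assumptions, since the original settings are not shown here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical font settings for the axis labels (original values not shown).
bar_font = {"family": "serif", "color": "darkred", "weight": "bold", "size": 12}


def plot_step_hist(values, label="life_expectancy", bins=20, linestyle="--"):
    """Draw a 'step' histogram with major grid lines and a dotted mean marker."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing values before binning

    fig, ax = plt.subplots(figsize=(8, 5))
    ax.hist(values, bins=bins, histtype="step", linestyle=linestyle, linewidth=2)
    ax.axvline(values.mean(), color="gray", linestyle=":")  # mark the mean
    ax.grid(True, which="major")
    ax.set_xlabel(label, fontdict=bar_font)
    ax.set_ylabel("Frequency", fontdict=bar_font)
    return ax
```

With the DataFrame created earlier, a call might look like plot_step_hist(df["life_expectancy"]); adjust the column name to match your CSV.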

Problem-3: How to visualize the block type distribution of ‘age5_surviving’?

The result is a left-skewed (negatively skewed) histogram of ‘age5_surviving’.
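
A block-type histogram can be sketched with Matplotlib’s histtype="bar", which draws filled rectangles. The function name plot_block_hist and the styling choices are assumptions, not the original cell.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_block_hist(values, label="age5_surviving", bins=20):
    """Draw a filled ('bar' type) histogram and return the axes and bin counts."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop missing values before binning

    fig, ax = plt.subplots(figsize=(8, 5))
    counts, edges, _ = ax.hist(values, bins=bins, histtype="bar", edgecolor="black")
    ax.set_xlabel(label)
    ax.set_ylabel("Frequency")
    return ax, counts
```

A call such as plot_block_hist(df["age5_surviving"]) would draw the chart for the Gapminder column, assuming that is the column name in your CSV.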

Problem-4: What are the different types of data distribution?

Skewed Distributions

A distribution skewed to the right is referred to as positively skewed. This kind of distribution has a large number of occurrences in the lower-value cells (left side) and few in the upper-value cells (right side).

A distribution skewed to the left is referred to as negatively skewed. This kind of distribution has a large number of occurrences in the upper-value cells (right side) and few in the lower-value cells (left side).
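
The sign of the sample skewness gives a quick numeric check of the direction. The sketch below uses synthetic data rather than the Gapminder columns, together with the Pandas Series.skew() method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples: an exponential sample has a long right tail,
# and negating it mirrors that tail to the left.
right_skewed = pd.Series(rng.exponential(scale=1.0, size=1000))
left_skewed = -right_skewed

print(right_skewed.skew())  # positive for a right (positively) skewed sample
print(left_skewed.skew())   # negative for a left (negatively) skewed sample
```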

Double-peaked distributions

A histogram with two peaks is called “double-peaked” or “bimodal”. It contains two values or data ranges that appear most often in the data. This kind of histogram often reflects the presence of two different processes in the data.

Truncated distributions

A “truncated” histogram arises when we are dealing with incompletely reported data or when data falling outside the specification limits has been left out.

Plateau distributions

A “plateau” histogram is a combination of multiple bell-shaped curves and it is an extreme version of a bimodal distribution.

Problem-5: What are density plots?

A density plot is a smoothed, continuous version of a histogram created from the data. A popular method for estimating the density curve is Kernel Density Estimation.

Kernel density estimation (KDE) takes the mixture-of-Gaussians idea to its extreme: it uses a mixture consisting of one Gaussian component per data point, resulting in an essentially non-parametric estimator of density.
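
The “one Gaussian per point” idea can be written out directly in NumPy. This is a minimal sketch for one-dimensional data; the function name gaussian_kde_manual and the bandwidth value are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kde_manual(points, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on a grid."""
    points = np.asarray(points, dtype=float)
    grid = np.asarray(grid, dtype=float)

    # z[i, j] = scaled distance between grid point i and data point j
    z = (grid[:, None] - points[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))

    # The density estimate is the mean of the per-point Gaussians.
    return kernels.mean(axis=1)


grid = np.linspace(-10.0, 12.0, 2001)
density = gaussian_kde_manual([0.0, 1.0, 2.0], grid, bandwidth=0.5)
print(density.sum() * (grid[1] - grid[0]))  # area under the curve, close to 1
```

A smaller bandwidth gives a spikier curve, while a larger one smooths the individual bumps together.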

Several implementations of kernel density estimation are available in Python (for example, in the SciPy and StatsModels packages). I mostly prefer Scikit-Learn’s version (sklearn.neighbors.KernelDensity) because it is more efficient (it uses tree-based algorithms) and handles KDE in multiple dimensions with one of six kernels and a choice of distance metrics.
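
A short sketch of the Scikit-Learn estimator for one-dimensional data is given below. The wrapper name sklearn_kde and the bandwidth are assumptions; KernelDensity, fit and score_samples are the library’s own API, and score_samples returns log-densities, hence the np.exp.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def sklearn_kde(points, grid, bandwidth=1.0, kernel="gaussian"):
    """Fit a KernelDensity model and return the density on the grid."""
    # Scikit-Learn expects 2-D arrays of shape (n_samples, n_features).
    points = np.asarray(points, dtype=float).reshape(-1, 1)
    grid = np.asarray(grid, dtype=float).reshape(-1, 1)

    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(points)
    return np.exp(kde.score_samples(grid))  # convert log-density to density


grid = np.linspace(-10.0, 12.0, 2001)
density = sklearn_kde([0.0, 1.0, 2.0], grid, bandwidth=0.5)
```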

Problem-6: How to create the density plot of ‘Life_expectancy’?

Here, we created the density plot and step-type histogram of ‘life_expectancy’.

In Pandas, kind = ‘density’ is the same as kind = ‘kde’.
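
One way to overlay the two plots is sketched below, assuming SciPy is installed (Pandas uses it for KDE plots). The function name plot_density_with_hist is hypothetical and the styling is not the original cell’s.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_density_with_hist(series, bins=20):
    """Overlay a step-type histogram and a KDE curve on the same axes."""
    series = pd.Series(series, dtype=float).dropna()

    fig, ax = plt.subplots(figsize=(8, 5))
    # density=True rescales the histogram so it shares the y-axis with the KDE.
    ax.hist(series, bins=bins, histtype="step", density=True, label="histogram")
    series.plot(kind="density", ax=ax, label="KDE")  # same as kind="kde"
    ax.set_xlabel(series.name or "value")
    ax.set_ylabel("Density")
    ax.legend()
    return ax
```

For the Gapminder data, the call would be along the lines of plot_density_with_hist(df["life_expectancy"]).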

Problem-7: How to create a KDE curve and step-type histogram plots of multiple columns via user-defined function?
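
A user-defined function for this could look like the sketch below. The name plot_distributions and the one-subplot-per-column layout are assumptions; the KDE curve relies on SciPy being installed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_distributions(df, columns, bins=20):
    """Draw one subplot per column: a step histogram plus a KDE curve."""
    fig, axes = plt.subplots(
        len(columns), 1, figsize=(8, 4 * len(columns)), squeeze=False
    )
    for ax, col in zip(axes[:, 0], columns):
        data = df[col].dropna()  # ignore missing values in this column
        ax.hist(data, bins=bins, histtype="step", density=True, label="histogram")
        data.plot(kind="kde", ax=ax, label="KDE")
        ax.set_title(f"Distribution of {col}")
        ax.set_xlabel(col)
        ax.set_ylabel("Density")
        ax.legend()
    fig.tight_layout()
    return axes[:, 0]
```

With the Gapminder DataFrame, a call might be plot_distributions(df, ["life_expectancy", "age5_surviving"]), assuming those are the column names in your CSV.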

Congratulations, we have come to the end of this blog. To summarize, we covered different data distributions, histograms, and density plots. In the next blog, we will cover pair plots and heatmaps, and then we will start exploring EDA in depth.

If you want to download the Jupyter Notebook for this blog, you can find it in the GitHub repository below:
https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Thank you and happy learning!!!

Blog-12: Data Visualization — III