Deep Dive in Machine Learning with Python

Part — XI: Data Visualization — II

Rajesh Sharma
Analytics Vidhya
Feb 6, 2020

Welcome to another blog in the Deep Dive in Machine Learning with Python series. In the last blog, we gained a better understanding of the Gapminder dataset by plotting several charts such as bar, scatter, and line charts. In today’s blog, we will visualize data distributions using Pandas and data visualization libraries.

We will continue to use the Gapminder dataset in today’s blog. As in the previous blog, we will also look at the options that make your plots more attractive.

Import the necessary Python libraries

Required libraries
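
The original import cell was an image and is not preserved here. A minimal sketch of the imports this kind of walkthrough typically needs, assuming Pandas, NumPy, and Matplotlib are the libraries in use (the exact list in the original notebook may differ):

```python
# Assumed imports; the original notebook cell is not reproduced here.
import numpy as np                 # numerical helpers (binning, sample data)
import pandas as pd                # DataFrame handling and .plot() wrappers
import matplotlib.pyplot as plt    # underlying plotting library

# In a Jupyter notebook you may also want inline figures:
# %matplotlib inline
```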

Import the dataset

We will import the dataset from a CSV file (i.e., gapminder.csv) and create a Pandas DataFrame.

Data read from CSV file
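
Since the original cell is not preserved, here is a minimal sketch of loading the file with pd.read_csv. The file name gapminder.csv comes from the text above. The inline sample rows are made up purely so that the snippet runs on its own, and the column names are assumed from the columns plotted later in this blog.

```python
import io
import pandas as pd

# With the real file you would simply write:
# gapminder_df = pd.read_csv("gapminder.csv")

# Stand-in for the file: a few made-up rows (NOT real Gapminder values),
# with column names assumed from the columns plotted later in this blog.
sample_csv = io.StringIO(
    "country,year,life_expectancy,age5_surviving,babies_per_woman,gdp_per_capita\n"
    "A,2000,71.2,95.1,2.4,5200\n"
    "B,2000,64.8,90.3,3.9,1800\n"
    "C,2000,78.5,99.2,1.7,31000\n"
)
gapminder_df = pd.read_csv(sample_csv)

print(gapminder_df.head())   # first rows (at most 5; only 3 exist here)
print(gapminder_df.shape)    # (number of rows, number of columns)
```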

Problem-1: What are histogram plots?

A histogram is a plot that shows the underlying frequency distribution of a variable. It lets us inspect the shape of the distribution (e.g., a normal, bell-shaped distribution), outliers, skewness, etc.

It differs from a bar graph in the following ways:

  1. A bar graph relates two variables, but a histogram describes only one.
  2. To construct a histogram, we first “bin” the range of values (i.e., divide the entire range into a series of intervals) and then count how many values fall into each interval.

If the bins are of equal width, a rectangle is drawn over each bin with a height proportional to the frequency (i.e., the number of cases in that bin).
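
The binning step described above can be sketched with NumPy’s histogram function. The ten values below are made up for illustration; the range is split into three equal-width bins and the values in each bin are counted.

```python
import numpy as np

# Made-up values, only to illustrate binning.
values = np.array([1, 2, 2, 3, 5, 6, 6, 6, 9, 10])

# Split the range [1, 10] into 3 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=3, range=(1, 10))

print(edges)   # bin boundaries: 1, 4, 7, 10
print(counts)  # values per bin: 4, 4, 2
```

Each count becomes the height of one rectangle in the histogram. Note that every bin includes its left edge but not its right edge, except the last bin, which includes both.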

Problem-2: How to visualize the distribution of ‘life_expectancy’?

Top-5 records of the DataFrame
Solution-2

In the above cell, we created a ‘step’ histogram of ‘life_expectancy’. If you look closely at the graph, you will see the major grid lines.

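In this graph, the histogram peaks at around 71 years, so that is roughly the most common life expectancy in the Gapminder dataset. Keep in mind that the peak of a histogram shows the most frequent range of values, which is not necessarily the same as the average.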

You can use different line styles such as “-”, “--”, “-.”, and “:”. I also defined a dictionary (bar_font) containing the customized font details that I want to use in the labels.
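
The solution cell itself was an image, so the following is only a sketch of how such a step histogram could be drawn with Matplotlib. The data is synthetic (not the real Gapminder column), and the contents of bar_font, the colors, and the number of bins are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real column (NOT actual Gapminder data).
rng = np.random.default_rng(42)
df = pd.DataFrame({"life_expectancy": rng.normal(loc=68, scale=9, size=1000)})

# Hypothetical font settings for the axis labels.
bar_font = {"family": "serif", "color": "darkred", "weight": "bold", "size": 14}

fig, ax = plt.subplots(figsize=(8, 5))
counts, edges, _ = ax.hist(
    df["life_expectancy"],
    bins=20,
    histtype="step",   # outline only, no filled bars
    linestyle="--",    # try "-", "--", "-." or ":"
    linewidth=2,
    color="teal",
)
ax.grid(True, which="major", linestyle=":", alpha=0.7)  # major grid lines
ax.set_xlabel("Life expectancy (years)", fontdict=bar_font)
ax.set_ylabel("Frequency", fontdict=bar_font)
ax.set_title("Step histogram of life_expectancy")
# plt.show()  # uncomment when running interactively
```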

Problem-3: How to visualize the block type distribution of ‘age5_surviving’?

Top-5 records of the DataFrame
Solution-3

The result above is a left-skewed (negatively skewed) histogram of ‘age5_surviving’.
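
As the original cell is not available, here is a sketch of a filled, block-style histogram. It assumes that “block type” corresponds to Matplotlib’s histtype="bar", and it uses synthetic left-skewed data rather than the real column. The Pandas skew() method gives a quick numeric check of the direction of the skew.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic left-skewed sample (NOT real data): most values pile up
# near 100, with a long tail towards lower values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age5_surviving": 100 - rng.exponential(scale=5, size=1000)})

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(
    df["age5_surviving"],
    bins=25,
    histtype="bar",      # filled bars (assumed meaning of "block type")
    color="steelblue",
    edgecolor="black",
)
ax.set_xlabel("age5_surviving")
ax.set_ylabel("Frequency")

skewness = df["age5_surviving"].skew()
print(f"skewness = {skewness:.2f}")  # a negative value indicates left skew
```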

Problem-4: What are the different types of data distribution?

Histogram Distributions

Skewed Distributions

A distribution skewed to the right is referred to as positively skewed. This kind of distribution has a large number of occurrences in the lower-value cells (left side) and few in the upper-value cells (right side).

A distribution skewed to the left is referred to as negatively skewed. This kind of distribution has a large number of occurrences in the upper-value cells (right side) and few in the lower-value cells (left side).
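
A quick way to see the two cases numerically is to compare the mean and the median: the long tail typically pulls the mean towards it. The sketch below uses made-up exponential samples, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
right_skewed = rng.exponential(scale=1.0, size=5000)  # long tail to the right
left_skewed = -right_skewed                           # mirror image: long tail to the left

for name, sample in [("right", right_skewed), ("left", left_skewed)]:
    mean = float(sample.mean())
    median = float(np.median(sample))
    print(f"{name}-skewed: mean = {mean:.2f}, median = {median:.2f}")
```

In the right-skewed sample the mean sits above the median, and in the left-skewed sample it sits below.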

Double-peaked distributions

A histogram with two peaks is called “double-peaked” or “bimodal”. It contains two values or data ranges that appear most often in the data. This kind of histogram often reflects the presence of two different processes in the data.
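
To illustrate the “two processes” idea, the sketch below mixes two made-up normal samples and prints a rough text histogram. The centres, spreads, and sample sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
# Two made-up "processes": one centred at 0, the other at 10.
process_a = rng.normal(loc=0, scale=1, size=500)
process_b = rng.normal(loc=10, scale=1, size=500)
combined = np.concatenate([process_a, process_b])

# 18 bins of width 1 covering the range [-4, 14].
counts, edges = np.histogram(combined, bins=18, range=(-4, 14))

# Rough text histogram: one '#' per 10 observations in the bin.
for left, count in zip(edges[:-1], counts):
    print(f"{left:5.0f} | {'#' * (int(count) // 10)}")
```

The printout shows two separate humps with an almost empty gap between them, which is the signature of a bimodal distribution.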

Truncated distributions

A “truncated” histogram arises when we are dealing with incompletely reported data, or when data outside the specification limits has been left out.

Plateau distributions

A “plateau” histogram results from combining multiple bell-shaped distributions; it can be seen as an extreme version of a bimodal distribution.

Problem-5: What are density plots?

Density plots are a smoothed, continuous version of a histogram estimated from the data. A popular method for estimating the density curve is Kernel Density Estimation.

Kernel density estimation (KDE) takes the mixture-of-Gaussians idea to its extreme: it uses a mixture consisting of one Gaussian component per data point, resulting in an essentially non-parametric estimator of density.
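
The “one Gaussian per point” idea can be written out by hand in a few lines of NumPy. This is only a sketch for intuition: the function name gaussian_kde_manual, the six data points, and the bandwidth of 2.0 are all made up for illustration.

```python
import numpy as np

def gaussian_kde_manual(points, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each data point."""
    # Shape (len(grid), len(points)): scaled distance from every
    # grid value to every data point.
    z = (grid[:, None] - points[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

points = np.array([60.0, 62.0, 70.0, 71.0, 72.0, 80.0])  # made-up values
grid = np.linspace(40, 100, 601)
density = gaussian_kde_manual(points, grid, bandwidth=2.0)

# A density curve should enclose an area of about 1.
dx = grid[1] - grid[0]
print("area under curve:", round(float(density.sum() * dx), 3))
print("highest density near x =", float(grid[density.argmax()]))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths them together.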

Several implementations of kernel density estimation are available in Python (for example, in the SciPy and StatsModels packages). I mostly prefer Scikit-Learn’s version (sklearn.neighbors.KernelDensity) because it is efficient (it uses tree-based algorithms) and handles KDE in multiple dimensions with one of six kernels and a choice of distance metrics.
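
A minimal sketch of using sklearn.neighbors.KernelDensity on a single column follows. The sample is synthetic, and the kernel and bandwidth values are assumptions picked for illustration rather than recommended settings.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
values = rng.normal(loc=70, scale=8, size=300)   # made-up 1-D sample

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
X = values.reshape(-1, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(X)

# Evaluate the fitted density on a regular grid.
grid = np.linspace(40, 100, 301).reshape(-1, 1)
log_density = kde.score_samples(grid)   # returns the LOG of the density
density = np.exp(log_density)

print(density[:5])
```

Note that score_samples returns log-density values, so they need to be exponentiated before plotting.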

Problem-6: How to create the density plot of ‘life_expectancy’?

Here, we created the density plot and step-type histogram of ‘life_expectancy’.

Note: in the Pandas plot function, kind=‘density’ is the same as kind=‘kde’.
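
The cell for this problem was an image, so the following is only a sketch of overlaying a density curve on a step-type histogram with the Pandas plot function. The data is synthetic, and the styling is assumed. Pandas relies on SciPy for kind="density", so SciPy needs to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real column (NOT actual Gapminder data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"life_expectancy": rng.normal(loc=70, scale=8, size=500)})

fig, ax = plt.subplots(figsize=(8, 5))

# density=True rescales the histogram so it shares a y-axis with the KDE curve.
df["life_expectancy"].plot(
    kind="hist", bins=20, density=True, histtype="step",
    linewidth=2, ax=ax, label="histogram",
)
df["life_expectancy"].plot(kind="density", ax=ax, label="KDE")

ax.set_xlabel("life_expectancy")
ax.set_ylabel("Density")
ax.legend()
# plt.show()  # uncomment when running interactively
```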

Problem-7: How to create KDE curves and step-type histograms for multiple columns via a user-defined function?

Function definition and calling
Life_expectancy
Age5_surviving
Babies_per_woman
Gdp_per_capita
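
The original function is not preserved, so here is one possible sketch. The function name plot_distribution, its parameters, and the synthetic DataFrame are hypothetical; only the four column names come from the plots listed above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_distribution(df, column, bins=20):
    """Overlay a step-type histogram and a KDE curve for one numeric column."""
    fig, ax = plt.subplots(figsize=(8, 5))
    data = df[column].dropna()
    data.plot(kind="hist", bins=bins, density=True, histtype="step",
              linewidth=2, ax=ax, label="histogram")
    data.plot(kind="kde", ax=ax, label="KDE")  # requires SciPy
    ax.set_title(column.capitalize())
    ax.set_xlabel(column)
    ax.set_ylabel("Density")
    ax.grid(True, which="major", linestyle=":")
    ax.legend()
    return ax

# Synthetic stand-in data (NOT real Gapminder values).
rng = np.random.default_rng(11)
n = 400
df = pd.DataFrame({
    "life_expectancy": rng.normal(68, 9, n),
    "age5_surviving": 100 - rng.exponential(5, n),
    "babies_per_woman": rng.gamma(3.0, 1.2, n),
    "gdp_per_capita": rng.lognormal(8.5, 1.0, n),
})

columns = ["life_expectancy", "age5_surviving", "babies_per_woman", "gdp_per_capita"]
axes = [plot_distribution(df, col) for col in columns]  # one figure per column
```

Wrapping the plotting code in a function keeps the styling in one place, so every column gets the same look with a single call.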

Congratulations, we have come to the end of this blog. To summarize, we covered histograms, different data distributions, and density plots. In the next blog, we will cover pair plots and heatmaps, and then we will start exploring EDA in depth.

If you want to download the Jupyter Notebook for this blog, please visit the GitHub repository below:

https://github.com/Rajesh-ML-Engg/Deep_Dive_in_ML_Python

Thank you and happy learning!!!

Blog-12: Data Visualization — III
