Exploratory Data Analysis (Python) in 2 Minutes with Pandas-Profiler.

Problem Introduction

Tanveer Khan
AI For Real
5 min read · Oct 3, 2020


Data exploration is the first and most fundamental task that data scientists perform as soon as they receive the data.

Even basic data exploration often takes a lot of time. Although many of the metrics that data scientists want to look at are common across tasks, they usually don’t have a single code base for computing them. Most of the time, if not every time, they need to rewrite the code, fix errors, and so on, which costs a lot of time.

There are various reasons for doing data exploration:

  1. Understanding distribution of various variables.
  2. Finding percentage of missing values in variables so that data imputation strategies can be formed.
  3. Discovering interactions between the variables.
  4. Finding out correlated variables.
  5. Measuring variability/spread of the variables.
  6. Discovering the presence and nature of extreme (outliers) values.

The above are a few important reasons for performing data exploration. These factors directly and prominently affect the quality of the output predictions. If they are not accounted for while building the model, the result will be a substandard model that is of little use.
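As a rough illustration of what a few of these checks look like when written by hand, the sketch below computes the percentage of missing values, the pairwise correlations, and IQR-based outlier flags with plain pandas. The toy data frame and its values are made up for illustration, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 100.0, np.nan],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
})

# Percentage of missing values per column
missing_pct = df.isna().mean() * 100

# Pairwise Pearson correlation between the columns
corr = df.corr()

# Flag outliers in x1 with the 1.5 * IQR rule
q1, q3 = df["x1"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["x1"] < q1 - 1.5 * iqr) | (df["x1"] > q3 + 1.5 * iqr)
outliers = df.loc[is_outlier, "x1"]

print(missing_pct.round(1))
print(corr.round(2))
print(outliers)
```

Each of these checks is short on its own, but they have to be repeated and adapted for every column of every dataset.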

As you can see, there are a lot of tasks that need to be done for each and every machine learning project, and they take time. Further, the time needed increases with the number of variables and the number of data sources at your disposal.

How do we do this?

  1. We take our data, which may be shared as an Excel sheet, a CSV file, or a database table, as the source.
  2. Read the data from the source into pandas in Python.
  3. Then use SciPy, pandas functions, matplotlib, or some other library to start writing code for each of these tasks.
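The first two steps might look roughly like the sketch below. To keep it self-contained, an in-memory string stands in for a CSV file and an in-memory SQLite table (with the hypothetical name demo) stands in for a database; real projects would point at actual files and connections.

```python
import io
import sqlite3

import pandas as pd

# 1. CSV source: an in-memory string stands in for a real file path
csv_text = "x1,x2\n1,2.5\n3,4.5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# 2. Database source: an in-memory SQLite table (hypothetical name "demo")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (x1 INTEGER, x2 REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(1, 2.5), (3, 4.5)])
conn.commit()
from_db = pd.read_sql_query("SELECT x1, x2 FROM demo", conn)
conn.close()

# 3. An Excel source would be read in a similar way with pd.read_excel(path),
#    which needs an Excel engine such as openpyxl to be installed.

print(from_csv)
print(from_db)
```

Every source needs its own loading code before any analysis can start, which is part of why the manual route adds up.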

Gosh!! That is too much work if there are, say, 5 data sources and 200 variables.

Solution

To do data exploration literally in two minutes, we will use a fantastic Python library called pandas-profiling (referred to here as “Pandas Profiler”), which will do all of these tasks for us in just one line of code.

You read it right: just one line of code covers all the variables at the same time. Amazing, isn’t it?

Let’s take a look at the library:

This is a very simple and easy-to-use library. It can do a lot of analysis of structured data and can help with some analysis of unstructured data such as images and text. For structured data it can produce a lot of metrics, and for unstructured data it generates a report about the metadata of the files.

To install the library, we can simply run this command in a notebook cell:

!pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Then read your dataset as a pandas data frame. In this example the data is in an Excel file uploaded to a drive, so I read it from the Excel sheet into pandas.

import pandas as pd

# Open the workbook, then parse the sheet named "Sheet1" into a data frame
input_data = pd.ExcelFile('./gdrive/My Drive/dataset/pandas profiler demo.xlsx')
input_data = input_data.parse("Sheet1")

We can see that my dataset has 5 variables and 117580 observations.

Let’s check the column names and their types:

input_data.dtypes
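For readers without the dataset at hand, here is a small sketch of the same two checks (size and column types) on a made-up data frame. The frame demo and its columns are hypothetical stand-ins, not the article’s data.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset
demo = pd.DataFrame({
    "x1": [1, 2, 3],
    "x2": [0.5, 1.5, 2.5],
    "label": ["a", "b", "c"],
})

# shape is a (number of observations, number of variables) tuple
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# dtypes lists one data type per column
print(demo.dtypes)
```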

Now we are ready to start with the exploratory data analysis. The first thing we will do is select the variables we want to study. Let’s say in this case we want to study x1, x2 and x3, so we will trim our data frame to have only these columns (for simplicity).

input_data_trim = input_data[['x1','x2','x3']].copy()
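The .copy() call makes the trimmed frame explicitly independent of the original, so later changes to it neither affect the source data nor trigger pandas’ SettingWithCopyWarning. A minimal sketch on made-up data (the frame full and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data with one more column than we want to study
full = pd.DataFrame({"x1": [1, 2], "x2": [3, 4], "x3": [5, 6], "x4": [7, 8]})

# Select a subset of columns; .copy() makes the result an independent frame
trimmed = full[["x1", "x2", "x3"]].copy()

# Changing the trimmed frame leaves the original untouched
trimmed.loc[0, "x1"] = 99
print(full.loc[0, "x1"], trimmed.loc[0, "x1"])
```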

Now we will use our ammo to do the data exploration. Let’s import the pandas profiler as:

from pandas_profiling import ProfileReport

It’s time to invoke the profiler.

profile = ProfileReport(input_data_trim, title='Profiler Demo', explorative=True)

Besides the data frame itself, there are two arguments: title is the name of the profile report, and explorative controls how much information is produced in the output.

There are various ways to view the report. We can view it within the notebook using IPython widgets, or we can save it as an HTML page and share it with a larger audience.

Since we are using Colab for the demo, let’s generate the output within the notebook.

profile.to_widgets()
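To share the report outside the notebook, it can be written to an HTML file with the report’s to_file method. The sketch below wraps that in a small helper; the helper name save_profile_html and the default file name are hypothetical, and the import sits inside the function so the snippet can be loaded even where pandas_profiling is not installed.

```python
import pandas as pd


def save_profile_html(df: pd.DataFrame,
                      path: str = "profile_report.html",
                      title: str = "Profiler Demo") -> str:
    """Build a profile report for df and write it to an HTML file."""
    # Imported lazily: requires the pandas_profiling package to be installed
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title=title, explorative=True)
    profile.to_file(path)  # writes a standalone HTML report
    return path
```

With the data frame from this post, the call would be save_profile_html(input_data_trim), after which the HTML file can be opened in any browser.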

The output will look like this:

You can click on any tab to see the results. The screenshot below shows the overall summary of the results.

We can click on each tab and check all the details. Let’s click on the variables tab.

There is an analysis for all the variables. Let’s get analysis of “x1”.

We can see a lot of descriptive statistics for “x1”, including missing values, unique values, mean, median, IQR, min, max, range, skewness, variance, etc.:
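For comparison, collecting the same kind of statistics by hand for a single column might look like the sketch below. The function name describe_variable and the sample values are hypothetical, and the report’s exact definitions may differ slightly (for example, pandas’ var() is the sample variance).

```python
import numpy as np
import pandas as pd


def describe_variable(s: pd.Series) -> dict:
    """Collect a handful of descriptive statistics for one numeric column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "missing": int(s.isna().sum()),
        "unique": int(s.nunique()),
        "mean": s.mean(),
        "median": s.median(),
        "iqr": q3 - q1,
        "min": s.min(),
        "max": s.max(),
        "range": s.max() - s.min(),
        "skewness": s.skew(),
        "variance": s.var(),
    }


# Made-up values with one missing entry
x1 = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, np.nan], name="x1")
stats = describe_variable(x1)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The profiler produces this kind of summary for every column at once, which is where the time saving comes from.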

We can also see the histogram to understand the nature of the distribution:

We can check the common values and outliers too:

We can also get the correlations and interactions between the variables by clicking the interaction and correlation tabs:

Summary

In this blog post we looked at the amazing Pandas-Profiler library. We have also seen that it is very easy to use. The library does a lot of work for us out of the box, with hardly any effort on our part. If we do the exploration manually, we might miss some statistics about the data; using this library greatly reduces that risk.

We hope you find this post informative and useful. Please drop your suggestions in the comment box.

The code shown in this notebook is available in the GitHub repo below.

Happy Learning!!


Tanveer Khan
AI For Real

Sr. Data Scientist with strong hands-on experience in building Real World Artificial Intelligence Based Solutions using NLP, Computer Vision and Edge Devices.