Understand a Dataset in Seconds Using Pandas Profiling

Run exploratory data analysis (EDA) on a whole dataset in one shot. It's super easy. Don't believe it? Try it for yourself!

Shelvi Garg
Nerd For Tech
5 min read · Apr 29, 2021


Output Image by Author

In this post, we will look at the mini-reports and EDA generated by Pandas Profiling, how to analyze data with them, and how to save the report as HTML and other formats so that you can present it right away and build your data analysis on it.

About Pandas Profiling:

Pandas Profiling is an open-source Python package built on top of pandas that lets you do exploratory analysis of your dataset. Much like the pandas df.describe() function (which does basic EDA), pandas_profiling extends the DataFrame with df.profile_report(), which generates a complete report.
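
For comparison, here is a minimal sketch of the df.describe() baseline that the profile report builds on. The tiny DataFrame and its column names are made up for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, not the article's Covid dataset.
df = pd.DataFrame({
    "country": ["India", "Pakistan", "Nepal", "Bhutan"],
    "cases": [1200, 800, 150, 10],
    "deaths": [30, 20, 2, 0],
})

# By default, describe() summarises only the numeric columns.
summary = df.describe()
print(summary)

# include="all" also adds count/unique/top/freq for non-numeric columns.
full_summary = df.describe(include="all")
print(full_summary)
```

This gives counts, means, and quartiles, but no graphs, correlations, or warnings, which is the gap the profile report fills.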

Pandas Profiling is an incredible open-source tool that every data scientist should consider for data exploration.

It is an efficient way to digest and analyze an unfamiliar dataset by providing in-depth descriptive statistics, visual distribution graphs, and a set of correlation tools.

The Pandas-profiling report offers:

  • A complete dataset overview
  • A report on each variable (attribute)
  • Different types of correlations between variables
  • Warnings: inaccuracies and duplicates in the dataset that you might need to work on
  • Variable types: categorical, numerical, etc.
  • Missing values and zeros (with graphs)
  • A detailed report, generated very quickly
  • Distinct values, common values, cardinality, and memory usage
  • Statistical report: descriptive and quantile statistics

and much more.
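
To get a feel for what the dataset overview contains, a few of the numbers in the list above can be approximated with plain pandas. This is only a rough sketch on made-up data. The dictionary keys are my own names, not the labels used in the report.

```python
import pandas as pd

# Hypothetical toy data with one missing value and one duplicate row.
df = pd.DataFrame({
    "country": ["India", "Pakistan", "Nepal", "Nepal", "Bhutan"],
    "cases": [1200.0, 800.0, 150.0, 150.0, None],
    "deaths": [30, 20, 0, 0, 1],
})

overview = {
    "rows": len(df),
    "columns": df.shape[1],
    # Total number of NaN cells across the whole frame.
    "missing_cells": int(df.isna().sum().sum()),
    # Rows that repeat an earlier row exactly.
    "duplicate_rows": int(df.duplicated().sum()),
    "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
    # deep=True also counts the memory used by string values.
    "memory_bytes": int(df.memory_usage(deep=True).sum()),
}

for name, value in overview.items():
    print(f"{name}: {value}")
```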

You can toggle further details for each sub-report, and all of this takes only a few lines of code!

With Pandas profiling we can quickly do an exploratory data analysis with just a few lines of code.

If this is not enough to convince you to use this tool, it also generates interactive reports in a web format that can be presented to anyone, even if they don't know how to program.

In short, what pandas profiling does is save us all the work of visualizing and understanding the distribution of each variable. It generates a report with all the information easily available.

Installation:

Using pip:

pip install pandas-profiling[notebook]

From GitHub:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Using conda:

You can install using the conda package manager by running

conda install -c conda-forge pandas-profiling

Documentation

You can find the documentation of pandas_profiling here.

Using pandas profiling

#pip install pandas_profiling

Importing libraries

import pandas as pd
import pandas_profiling

Hands-on With a Dataset:

You can check out the complete clean code and dataset on my Jupyter Notebook here: https://github.com/shelvi31/Pandas-Profiling

The dataset contains Covid-19 cases reported country-wise.

Code:

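import pandas as pd
import pandas_profiling  # importing this adds df.profile_report() to DataFrames

# Load the small Covid-19 dataset
df2 = pd.read_csv("corona_dataset")
# Build the profile report with a custom title
profile2 = df2.profile_report(title="Corona Small dataset report")
# In a Jupyter notebook, a bare name on the last line of a cell renders the report
profile2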

Here's the output:

1. Output Dataset Overview: The profile report gives a statistical overview of the complete dataset, including the number of categorical and numerical variables, duplicates, and missing values. It's basically a statistical snapshot of the dataset. Below are the statistics for my Covid dataset.

Image by author

2. Output Report on Variables: The profile report offers a detailed individual report on each variable. It is so detailed that we would hardly need to look at anything else.

Image by author
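
To make the per-variable report more concrete, here is a rough sketch of the kind of figures it shows for a single numeric column, computed by hand with pandas. The series values and the dictionary keys are hypothetical. The real report shows more statistics plus a histogram.

```python
import pandas as pd

# Hypothetical daily case counts for one variable, with one missing value.
cases = pd.Series([0, 5, 5, 12, 0, 40, 5, None], name="cases")

variable_report = {
    # nunique() ignores NaN by default.
    "distinct": int(cases.nunique()),
    "missing": int(cases.isna().sum()),
    "zeros": int((cases == 0).sum()),
    "mean": float(cases.mean()),
    # Quartiles, keyed by the quantile level.
    "quantiles": cases.quantile([0.25, 0.5, 0.75]).to_dict(),
    # The three most frequent values and their counts.
    "most_common": cases.value_counts().head(3).to_dict(),
}

for name, value in variable_report.items():
    print(f"{name}: {value}")
```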

3. Output: Interaction Report

The profile report generates interaction reports between pairs of variables. Shown in the image is the interaction report between the India and Pakistan Covid-19 cases. You can find these interactions for any pair of columns in your dataset.

Image by Author

4. Output Report Correlation Matrix:

This shows the relationship of the variables with each other. You can always toggle for more details.

Image by Author

Other correlation graphs produced by the report:

Phik Correlation : Image by Author
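
If you want to check a correlation matrix by hand, pandas can compute the Pearson and Spearman versions directly. The sketch below uses made-up columns and does not cover Phik, which needs a separate library.

```python
import pandas as pd

# Hypothetical case counts for three countries over five days.
df = pd.DataFrame({
    "india_cases": [10, 20, 30, 40, 50],
    "pakistan_cases": [12, 18, 33, 41, 48],
    "nepal_cases": [5, 4, 3, 2, 1],
})

# Pearson measures linear association between each pair of columns.
pearson = df.corr(method="pearson")

# Spearman works on ranks, so it captures any monotonic relationship.
spearman = df.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Here pakistan_cases always rises with india_cases, so its Spearman value is 1 even though the relationship is not perfectly linear.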

5. Output Report Warnings Issued:

The profile report issues warnings and alerts about parts of the dataset that we might need to work on or be cautious about, including high cardinality, high correlation, etc.

Image By Author
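
To show the idea behind such warnings, here is a simplified sketch of my own. It is not how pandas_profiling implements its checks. The function name simple_alerts and both thresholds are hypothetical choices for illustration.

```python
import pandas as pd


def simple_alerts(df, cardinality_threshold=0.9, correlation_threshold=0.9):
    """Return a list of warning strings for a DataFrame (illustrative only)."""
    alerts = []
    n_rows = len(df)

    for col in df.columns:
        series = df[col]
        n_distinct = series.nunique()
        if n_distinct <= 1:
            # A column with a single value carries no information.
            alerts.append(f"{col}: constant value")
        elif (not pd.api.types.is_numeric_dtype(series)
              and n_distinct / n_rows > cardinality_threshold):
            # Non-numeric column where almost every row has its own value.
            alerts.append(f"{col}: high cardinality ({n_distinct} distinct values)")

    # Pairwise absolute Pearson correlation between numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            if corr.loc[first, second] > correlation_threshold:
                alerts.append(f"{first} and {second}: high correlation")

    return alerts


# Hypothetical toy data to trigger each kind of alert.
df = pd.DataFrame({
    "date": ["d1", "d2", "d3", "d4"],
    "india": [1, 2, 3, 4],
    "pakistan": [2, 4, 6, 8],
    "source": ["x", "x", "x", "x"],
})

alerts = simple_alerts(df)
for alert in alerts:
    print(alert)
```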

6. Output Dataset Sample: Sample rows from the dataset, with detailed views of the first rows, last rows, etc.

Image by Author

7. Output Report on Missing Values:

The profile report shows the missing values per column; here, the missing values for each country.

Image by Author
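
The missing-values chart can be cross-checked with a couple of pandas calls. This is a small sketch on made-up data. The column names missing_count and missing_percent are my own.

```python
import pandas as pd

# Hypothetical per-country columns with some gaps.
df = pd.DataFrame({
    "India": [10, None, 30, 40],
    "Pakistan": [5, 6, None, None],
    "Nepal": [1, 2, 3, 4],
})

missing = pd.DataFrame({
    # Number of NaN cells in each column.
    "missing_count": df.isna().sum(),
    # Share of rows that are NaN, as a percentage.
    "missing_percent": (df.isna().mean() * 100).round(1),
})

print(missing.sort_values("missing_count", ascending=False))
```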

Trying a Larger Dataset:

You can find the dataset here: https://github.com/shelvi31/Pandas-Profiling

df = pd.read_csv("worldometer_coronavirus_daily_data.csv")

The pandas_profiling library in Python includes a class named ProfileReport, which builds the report when you pass it a DataFrame.

Generating Profile Report for Large Dataset

pandas_profiling.ProfileReport(df)

Output:

Output’s GIF by Author
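
If the report takes too long on a big file, one common workaround is to profile a random sample of the rows instead of the whole thing. Below is a minimal sketch of the sampling step with pandas. The data, the sample size, and the random_state are arbitrary assumptions, and whether a sample is representative enough depends on your dataset.

```python
import pandas as pd

# Hypothetical stand-in for a large dataset: 100,000 rows.
big_df = pd.DataFrame({
    "day": range(100_000),
    "cases": [i % 500 for i in range(100_000)],
})

# Draw a reproducible random sample of rows (without replacement).
sample_df = big_df.sample(n=10_000, random_state=42)
print(sample_df.shape)

# The sample can then be passed to the profiler in place of the full frame,
# for example: pandas_profiling.ProfileReport(sample_df, minimal=True)
# (minimal=True is a pandas-profiling option that switches off the more
# expensive computations; it is left as a comment so this sketch runs
# without the profiling package installed).
```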

Converting to Jupyter Widget

Image by Author

Ways to display and save the generated report (here, profile is a ProfileReport object like the ones created above):

Jupyter Widgets

profile.to_widgets()

Iframes

profile.to_notebook_iframe()

As a string

json_data = profile.to_json()

As a file:

profile.to_file("report.json")

As an HTML File:

profile.to_file("profile.html")

Converting my report to a Jupyter widget gives a result something like this:

profile2.to_widgets()

You can check out the complete clean code and dataset on my Jupyter Notebook here: https://github.com/shelvi31/Pandas-Profiling

Also check out the output report on your live server: https://raw.githubusercontent.com/shelvi31/Pandas-Profiling/main/output.html

… and if you like this article, feel free to leave a few hearty claps :)
