Understand a Dataset in Seconds Using Pandas Profiling
Do EDA on a whole dataset in one shot. Super easy! Don't believe it? Try it for yourself!
In this blog, we will see the kinds of mini-reports and EDA that Pandas Profiling generates, how to analyze data with them, and how to save the report as HTML and in other formats so that you can present it right away and drive your data analysis from it.
About Pandas Profiling:
Pandas Profiling is a Python package built on top of pandas that lets you do exploratory analysis of your dataset. Much like the pandas df.describe() function (which does basic EDA), pandas_profiling extends the DataFrame with df.profile_report() to produce a complete report.
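For comparison, here is a minimal sketch of the baseline that df.describe() gives on its own. The toy DataFrame and its column names are made up for illustration and are not the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, not the article's dataset.
df = pd.DataFrame({
    "country": ["India", "Pakistan", "Nepal", "India"],
    "cases": [120, 45, 8, 200],
})

# By default describe() summarises numeric columns only:
# count, mean, std, min, quartiles and max.
summary = df.describe()
print(summary)

# include="all" also covers non-numeric columns (unique, top, freq).
full_summary = df.describe(include="all")
print(full_summary)
```

This is a handful of numbers per column; the profile report builds on the same idea and adds plots, correlations, and warnings.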
Pandas Profiling is an incredible open-source tool that every data scientist should consider for data exploration.
It is an efficient way to digest and analyze an unfamiliar dataset by providing in-depth descriptive statistics, visual distribution graphs, and a set of correlation tools.
The Pandas Profiling report offers:
- A complete dataset overview
- A report on each attribute (variable)
- Different types of correlations between attributes
- Warnings: inaccuracies and duplicates in the dataset that you might need to work on
- Variable types: categorical, numerical, etc.
- Reports on missing values and zeros (with graphs)
- A detailed report, generated quickly
- Distinct values, common values, cardinality, memory usage
- Statistical report: descriptive, quantile
and much more…
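To make a few of the items above concrete, here is a rough sketch of how some of the overview numbers could be computed by hand with plain pandas. This is not how the library computes them internally; the toy data and the names overview, per_column and zeros are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with a missing value, zeros and a duplicate row.
df = pd.DataFrame({
    "country": ["India", "Pakistan", "Nepal", "India", "India"],
    "cases": [120, 45, None, 0, 120],
    "deaths": [3, 0, 0, 0, 3],
})

# Dataset-level overview.
overview = {
    "rows": len(df),
    "columns": df.shape[1],
    "missing_cells": int(df.isna().sum().sum()),
    "duplicate_rows": int(df.duplicated().sum()),
    "memory_bytes": int(df.memory_usage(deep=True).sum()),
}
print(overview)

# Per-column distinct and missing counts (nunique ignores missing values).
per_column = pd.DataFrame({
    "distinct": df.nunique(),
    "missing": df.isna().sum(),
})
print(per_column)

# Zeros only make sense for numeric columns.
zeros = (df.select_dtypes("number") == 0).sum()
print(zeros)
```

Writing this for every column and every statistic quickly becomes tedious, which is the work the report saves.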
You can toggle further details on for each sub-report, and all of this comes from a few lines of code!
With Pandas profiling we can quickly do an exploratory data analysis with just a few lines of code.
If that is not enough to convince you to use this tool, it also generates interactive reports in a web format that can be presented to anyone, even if they don't know how to program.
In short, what pandas profiling does is save us all the work of visualizing and understanding the distribution of each variable. It generates a report with all the information easily available.
Installation:
Using Pip:
pip install pandas-profiling[notebook]
From GitHub:
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Using conda:
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
Documentation
You can find the documentation of pandas_profiling here.
Using pandas profiling
#pip install pandas_profiling
Importing libraries
import pandas as pd
import pandas_profiling
Hands-on With a Dataset:
You can check out the complete clean code and dataset on my Jupyter Notebook here: https://github.com/shelvi31/Pandas-Profiling
The dataset has Covid-19 cases reported country-wise.
Code:
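import pandas as pd
import pandas_profiling  # importing it adds profile_report() to DataFrames

df2 = pd.read_csv("corona_dataset")
# Build the report; in a notebook, the last line displays it
profile2 = df2.profile_report(title="Corona Small dataset report")
profile2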
Here’s the Output
1. Output Dataset Overview: The profile report gives a statistical overview of the complete dataset, including the number of categorical and numerical variables, duplicates, and missing values. It's basically a statistical snapshot of the dataset. Here are the statistics for my Covid dataset:
2. Output Report on Variables: The profile report offers a detailed individual report on each variable. It's so detailed that we would hardly need to look at anything else.
3. Output: Interaction Report
The profile report generates interaction reports between pairs of variables. Shown in the image is the interaction report between the India and Pakistan Covid-19 cases. You can find these interactions for any pair of columns in your dataset.
4. Output Report Correlation Matrix:
The relationship of the variables with each other. You can always toggle for more details.
Other correlation graphs generated by the report:
5. Output Report Warnings Issued:
The profile report issues warnings and alerts about things in the dataset that we might need to work on or be cautious about, including high cardinality, high correlation, etc.
6. Output Dataset Sample: Sample values from the dataset for a closer look: first rows, last rows, etc.
7. Output Report on Missing Values:
The profile report shows the missing values per column; here, the missing values for each country.
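Two of the sub-reports above, the correlation matrix and the missing-values report, can be approximated by hand with plain pandas. The sketch below assumes a made-up table with one column of cases per country; it uses Pearson and Spearman correlations only and is not the library's own implementation.

```python
import pandas as pd

# Hypothetical toy data: one column of daily cases per country.
df = pd.DataFrame({
    "India": [1, 2, 3, 4, 5],
    "Pakistan": [1, 4, 9, 16, 25],
    "Nepal": [2, None, 1, None, 3],
})

# Pearson measures linear association; Spearman works on ranks, so it
# reaches 1.0 for any strictly increasing relationship.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")
print(pearson.round(3))
print(spearman.round(3))

# Missing values per column, as a count and as a share of all rows.
missing = pd.DataFrame({
    "count": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(1),
})
print(missing)
```

Here the India and Pakistan columns are perfectly rank-correlated but not perfectly linear, which is the kind of difference that comparing the two matrices makes visible.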
Trying a Larger Dataset:
You can find the dataset here : https://github.com/shelvi31/Pandas-Profiling
df = pd.read_csv("worldometer_coronavirus_daily_data.csv")
The pandas_profiling library in Python includes a class named ProfileReport, which builds the report from a DataFrame.
Generating Profile Report for Large Dataset
profile = pandas_profiling.ProfileReport(df)
profile
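Profiling can get slow as a dataset grows. One common workaround, sketched below, is to profile a reproducible random sample instead of every row. The helper name sample_for_profiling, the row limit, and the stand-in data are all assumptions for illustration.

```python
import pandas as pd

def sample_for_profiling(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

# Hypothetical stand-in for a large file.
big = pd.DataFrame({"day": range(50_000), "cases": range(50_000)})
small = sample_for_profiling(big, max_rows=1_000)
print(len(small))

# With the library installed, the sample can then be profiled as usual.
# Passing minimal=True (if your version supports it) also switches off
# the most expensive computations:
#   profile = pandas_profiling.ProfileReport(small, minimal=True)
```

A sample gives approximate statistics, so rare values and exact duplicate counts may differ from the full dataset.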
Output:
Converting to Jupyter Widget
Ways to display or save the generated report:
Jupyter Widgets
profile.to_widgets()
Iframes
profile.to_notebook_iframe()
As a string
json_data = profile.to_json()
As a file:
profile.to_file("report.json")
As an HTML File:
profile.to_file("profile.html")
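The saving steps above can be wrapped in a small helper. This is only a sketch: the function name save_report and the decision to reject other file extensions up front are my own choices, not part of the library.

```python
from pathlib import Path

import pandas as pd

def save_report(df, path, title="Profile report"):
    """Profile df and write the report to an .html or .json file."""
    path = Path(path)
    # Design choice of this sketch: fail early on unexpected extensions.
    if path.suffix not in {".html", ".json"}:
        raise ValueError(f"expected .html or .json, got {path.suffix!r}")
    # Imported here so the check above works even without the package.
    import pandas_profiling
    profile = pandas_profiling.ProfileReport(df, title=title)
    profile.to_file(path)
    return path

# Example call (needs pandas_profiling installed):
#   save_report(pd.read_csv("worldometer_coronavirus_daily_data.csv"),
#               "profile.html", title="Covid daily data")
```

The HTML file is self-contained, so it can be shared with people who do not have Python installed.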
Converting my report to Jupyter widgets gives a result something like this:
profile2.to_widgets()
You can check out the complete clean code and dataset on my Jupyter Notebook here: https://github.com/shelvi31/Pandas-Profiling
Also check out the output report on your live server: https://raw.githubusercontent.com/shelvi31/Pandas-Profiling/main/output.html
… and if you like this article, feel free to leave a few hearty claps :)