An Introduction to Pandas Profiling

Kuharan Bhowmik
Published in Analytics Vidhya
3 min read · Sep 29, 2019

Photo by Carlos Muza on Unsplash

If you are familiar with data science, I can confidently say you have realized the power of EDA (exploratory data analysis). It is the very first analysis I do on my own, and of course I am influenced by other data scientists I follow on the Internet. Until now I have used pandas and Matplotlib extensively: pandas for data manipulation and Matplotlib, well, for plotting graphs.

Well, this lovely Sunday morning (it is 22 degrees Celsius and partly cloudy here in Kolkata) is my distraction. I’ve been focused for the last few weeks on my new love, the Azure Data Factory and Data Lake project. Lately, however, I came across this magical package when it became necessary to profile large datasets quickly. Enough talking. Let us get started.

Let us look at an example and dig into the New York City Airbnb Open Data: Airbnb listings and metrics in NYC, NY, USA (2019). And I am not kidding, it is currently the hottest trending dataset on Kaggle.

Let us start with the usual pandas bread and butter.

import pandas as pd
df = pd.read_csv("new-york-city-airbnb-open-data/AB_NYC_2019.csv")
df.head()
df.describe()

While the output above contains lots of information, it does not tell you everything you might be interested in.
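To make that gap concrete, here is a minimal sketch of a few checks that df.describe() leaves to you. It uses a tiny made-up DataFrame (the column names only loosely mimic the Airbnb data and every value is invented), so treat it as an illustration and not as output from the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration; not taken from AB_NYC_2019.csv.
toy = pd.DataFrame(
    {
        "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", None],
        "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
        "price": [149, 225, 150, 89],
        "reviews_per_month": [0.21, 0.38, None, 4.64],
    }
)

# describe() summarizes only the numeric columns by default.
numeric_summary = toy.describe()

# Things you would otherwise have to check by hand:
missing_per_column = toy.isna().sum()         # missing values per column
unique_per_column = toy.nunique()             # distinct non-null values per column
duplicate_rows = int(toy.duplicated().sum())  # fully duplicated rows

print(list(numeric_summary.columns))
print(missing_per_column.to_dict())
print(unique_per_column.to_dict())
print(duplicate_rows)
```

The text columns never show up in the default summary, and questions like "how many distinct room types are there?" need separate calls. That is the kind of bookkeeping a profiling report gathers in one place.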

Pandas-Profiling

pip install pandas-profiling   # run this in a terminal, not in Python
import pandas_profiling        # importing adds df.profile_report() to DataFrames

A single line is all you need to profile the whole dataset in one shot. Super easy!

df.profile_report(style={'full_width': True})

The report also gives a breakdown for each variable.

Also, it is super easy to get quick correlations.
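As a rough point of comparison, this is what a hand-rolled correlation matrix looks like in plain pandas. It is only a sketch on invented toy data, and Pearson is used as one common choice; it is not a claim about how pandas-profiling computes its own correlation views.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
        "price": [100, 200, 300, 400],
        "minimum_nights": [1, 2, 3, 4],
        "availability_365": [40, 30, 20, 10],
    }
)

# Keep the numeric columns only, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr(method="pearson")

print(corr.round(2))
```

Dropping the non-numeric columns first keeps the call working regardless of how a given pandas version treats text columns in corr().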

Or you can export this into an HTML report:

profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file(output_file="Pandas Profiling Report - AirBNB.html")

Here is the link to the notebook, which contains the entire code I used for this demo. Until next time, Keep Profiling!

Some Major Updates:

Pandas profiling has kept evolving since I wrote this article. Here are the main updates, starting with the ProfileReport class:

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")

Explore deeper

You can configure the profile report in any way you like. The example below loads the explorative configuration, which enables a deeper analysis:

profile = ProfileReport(df, title='Pandas Profiling Report', explorative=True)

Jupyter Widgets or Iframe

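profile.to_widgets()          # render the report as interactive Jupyter widgets
profile.to_notebook_iframe()  # embed the HTML report in the notebook as an iframe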

Saving the report

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")

Large datasets

This is one of the most interesting updates. We often have to deal with large datasets, and minimal mode, introduced in version 2.4, is the answer. It is a default configuration that disables expensive computations (such as correlations and dynamic binning). Use the following syntax:

profile = ProfileReport(large_dataset, minimal=True)
profile.to_file("output.html")
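Minimal mode aside, a different and complementary way to keep profiling time down is to profile a random sample of the rows. The sketch below is plain pandas and not a pandas-profiling feature; the sample size, the seed, and the generated large_dataset stand-in are all assumptions chosen for illustration.

```python
import pandas as pd

# Stand-in for a large dataset; a real one would come from pd.read_csv(...).
n_rows = 100_000
large_dataset = pd.DataFrame(
    {
        "listing_id": range(n_rows),
        "price": [50 + (i % 300) for i in range(n_rows)],
    }
)

# Take a reproducible random sample of rows to profile instead of the full frame.
sample = large_dataset.sample(n=10_000, random_state=42)

print(len(large_dataset), len(sample))

# The sample is an ordinary DataFrame, so it can be profiled like any other, e.g.
# ProfileReport(sample, minimal=True).to_file("output.html")
```

The trade-off is that rare values and outliers may not make it into the sample, so this works best for a quick first look.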

Advanced settings

profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})
profile.to_file("output.html")
