1 line code for Data Exploration in Python (Pandas Profiling)!

Yash Gupta
Data Science Simplified
5 min read · May 16, 2022

Yes, it exists. I found it hard to believe too: exploratory data analysis (EDA), probably the most tedious step in the data science life cycle after data cleaning, can be done with only ONE line of code.

Because Python is an open-source language, it offers data enthusiasts an enormous range of tools for learning about their data. Its growing ecosystem of 130,000+ libraries lets you pull what you need out of almost any dataset. Of these open-source libraries, the one people use most often when working with tables is Pandas.

A brief note on Pandas to anyone new to Python:
Pandas is an open-source library, first developed in 2008, for working with relational or labeled data. It brings much of what you would do in Excel into Python. It provides data structures and operations for manipulating text, numbers, and time series. Built on top of the NumPy library, it is fast and easy to use.

Welcome To Pandas Profiling!

Pandas Profiling is an open-source Python package that performs a broad data exploration pass in just ONE line of code. If that was not cool enough already, the tool generates an interactive HTML report that you can present to any of your stakeholders, with no technical expertise required to read it.

Essentially, it gives you a brief understanding of each variable without your having to write separate code for each one, and it compiles the results into a single report covering all the variables in your project. Let’s look at it in action.

To Install Pandas Profiling:

pip install pandas-profiling

Load your dataset and import the required libraries:

import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv("Housing Data.csv")

Use the following lines to generate the report and save it in your working directory as an HTML file that you can view in your browser:

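# Build the profiling report (this is the one-liner)
prof = ProfileReport(df)
# Save the report as an HTML file in the working directory
prof.to_file(output_file='output.html')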

Go to your working directory and open the file titled ‘output.html’
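The same step can be wrapped in a small helper. This is only a sketch: the function names `load_profile_report` and `write_report` and the report title are made up for illustration. It assumes the `title` and `minimal` options of `ProfileReport`, and it also tries the `ydata_profiling` import name, since later releases of the package were published as ydata-profiling.

```python
def load_profile_report():
    """Return the ProfileReport class from whichever package is installed."""
    try:
        # Later releases of the package are published as ydata-profiling.
        from ydata_profiling import ProfileReport
    except ImportError:
        # Older releases, as used in this article.
        from pandas_profiling import ProfileReport
    return ProfileReport


def write_report(df, output_file="output.html", title="Housing Data Report", minimal=False):
    """Profile a DataFrame and save the report as an HTML file.

    The title is a hypothetical example. minimal=True asks for a lighter,
    faster report, which can help on large datasets.
    """
    ProfileReport = load_profile_report()
    prof = ProfileReport(df, title=title, minimal=minimal)
    prof.to_file(output_file)
    return output_file


# Example call (not run here):
# import pandas as pd
# write_report(pd.read_csv("Housing Data.csv"), minimal=True)
```

The import happens inside the function so that the helper can be defined even when neither package is installed yet.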

Let us look at the report that was prepared by Pandas Profiling for this example dataset.

The output for a successful report generation should look like this in Pandas Profiling

Snippets of the report created:

1. The overview shows the dataset statistics, along with the alerts and the reproduction details.

2. The alerts here are mostly correlation warnings. They are symmetric (if one variable is flagged as correlated with another, the reverse is flagged too), but they are useful because you learn about correlations in your data before you even look at the data itself.

3. The reproduction details show how long your analysis took and let you download the configuration as a JSON file. The timing can help you decide whether a full report is worth the wait or whether you should use a minimal mode report to save time.

4. Each variable’s individual information and measures of central tendency are shown, along with a plot of its distribution so that you can see the shape of the data and any skewness.

Note: You can toggle the details for any variable to see more statistics and larger, clearer graphics.

(Notice the correlation alerts below the column name. How convenient is this compared to manual EDA?)

5. Beyond the individual information, Pandas Profiling shows a hexbin scatter plot of each variable against another, as well as correlations based on metrics such as Pearson’s correlation, Spearman’s rank correlation, Kendall’s tau, etc.

Missing values are shown for each variable as a bar graph or as a matrix. The sample section shows the first and last rows of your data, which, if you think about it, replicates the head and tail functions of the Pandas library.
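For comparison, here is roughly what several of these report sections correspond to in plain Pandas. It is a minimal sketch on a tiny made-up DataFrame. The column names and values are assumptions, not taken from the housing dataset.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; column names are assumptions.
df = pd.DataFrame(
    {
        "area": [50, 60, 80, 100, 120],
        "price": [100, 120, 160, 200, 240],
        "bedrooms": [1, 2, None, 3, 4],
    }
)

# Per-variable summary: count, mean, standard deviation, quartiles.
summary = df.describe()

# Skewness of each numeric column.
skewness = df.skew()

# Pairwise correlation matrices for two of the metrics the report offers.
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Number of missing values per column.
missing = df.isna().sum()

# First and last rows, like the report's sample section.
sample = pd.concat([df.head(2), df.tail(2)])

print(summary)
print(missing)
```

Each of these is one call per question. The report's convenience is that it runs them for every variable at once and lays the results out in one place.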

Convinced enough now?

While this EDA is not perfect, it can be the perfect thing to do if you are short on time and need to skim through your dataset and understand most of your data. Beyond this, there are things you can still do manually in your EDA, such as using a third variable as the hue in scatter plots to find patterns, or dropping columns like Longitude and Latitude, which a model would treat as ordinary numbers even though they encode geography.
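Those two manual steps might look something like the sketch below. The rows, the column names other than Longitude and Latitude, and the helper `scatter_with_hue` are all hypothetical. The plot uses seaborn's `scatterplot`, which is one option among several and not something the report itself relies on.

```python
import pandas as pd

# Hypothetical rows; every column name except Longitude/Latitude is an assumption.
df = pd.DataFrame(
    {
        "Longitude": [-122.2, -122.3, -118.4],
        "Latitude": [37.9, 37.8, 34.1],
        "area": [50, 80, 120],
        "price": [100, 160, 240],
        "near_ocean": [True, True, False],
    }
)

# Drop the coordinate columns before modelling.
# errors="ignore" skips any name that is not present in the DataFrame.
features = df.drop(columns=["Longitude", "Latitude"], errors="ignore")


def scatter_with_hue(data, x, y, hue):
    """Scatter plot of x against y, coloured by a third column (needs seaborn)."""
    import seaborn as sns  # imported lazily so the rest runs without seaborn

    return sns.scatterplot(data=data, x=x, y=y, hue=hue)


# Example call (not run here):
# scatter_with_hue(features, x="area", y="price", hue="near_ocean")
```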

For more on Pandas Profiling:

Try out Pandas Profiling and compare it with your manual EDA and let me know how it works for you!

For more such articles, stay tuned with us as we chart out paths on understanding data and coding and demystify other concepts related to Data Science. Please leave a review down in the comments.

Check out my other articles at:

Do connect with me on LinkedIn if you want to discuss it further!

Yash Gupta
Data Science Simplified

Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss