Pandas Profiling: a convenient way to conduct Exploratory Data Analysis in Python

Ethan Duong
6 min read · Jan 12, 2023


Using the Pandas Profiling library to create an effective EDA.

Exploratory Data Analysis (EDA) is the process of getting familiar with our data by exploring it from multiple angles, such as statistics, visualizations, and data summaries. This later allows us to identify new patterns in the data, spot anomalies, and test our hypotheses.

In my previous article, Data Science for beginner — Overview of what to learn, I laid out the steps to conduct an EDA and how it helps beginners gain a solid understanding of the data they are working with.

EDA can be done manually, but in real-world situations, efficiency is not only about accuracy but also about timeliness. In other words, data analysts need to be able to quickly visualize and understand the distribution of each variable in the data.

Therefore, I am writing this tutorial to introduce a convenient way to perform automated Exploratory Data Analysis, known as Pandas Profiling.

Content:

1. About Pandas Profiling library

2. Pandas Profiling implementation

3. Understand the report

4. Limitations of Pandas Profiling

1. About Pandas Profiling library

Pandas-profiling project description: pandas-profiling 3.6.2

The primary goal of Pandas Profiling is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution.

Like pandas's handy describe function, pandas-profiling delivers an extended analysis of a DataFrame while allowing the analysis to be exported in different formats such as HTML and JSON.
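For comparison, here is a minimal sketch of what describe gives you on its own. The toy DataFrame and its column names are invented for illustration and are not the dataset used later in this tutorial.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "industry": ["Tech", "Finance", "Tech", "Health"],
    "min_salary": [80, 95, 70, 60],
})

# By default, describe() summarises only the numeric columns.
numeric_summary = df.describe()

# include="all" also covers non-numeric columns (count, unique, top, freq).
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

The output is a plain table of numbers. Pandas Profiling builds on the same idea and adds charts, alerts, and an exportable report.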

From my understanding, Pandas Profiling is a Python library that generates a detailed report on our dataset in a few lines of code.

It is a popular library among data scientists, with nearly a million downloads in the month before this article was written (PyPI Stats).

Some key features of the library help data analysts quickly understand the dataset they are working on:

  • Type inference: automatic detection of each column's data type.
  • Warnings: an overview of current problems in the data (missing data, inaccuracies, skewness, and so on).
  • Univariate analysis: automatic calculation of the mean, mode, and median, and generation of distribution histograms.
  • Text analysis: generates word frequency under different conditions.

2. Pandas Profiling implementation

  1. Install Pandas Profiling
pip install pandas-profiling

However, Pandas Profiling cannot be directly used on Google Colab. The code will result in an error:

"concat() got an unexpected keyword argument 'join_axes'"

The reason is that Google Colab comes with an older version of pandas-profiling pre-installed. That old version still passes the join_axes argument to pandas's concat function, but join_axes was removed from concat in pandas 1.0. So install the latest version of pandas-profiling instead:

pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

If you are using Google Colab, please pay attention to section 4 :)
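If you are not sure which versions your environment has, a quick check like the one below can help. This is only a sketch: installed_version is a helper name I made up, and it relies on the standard library's importlib.metadata (Python 3.8+).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages involved in the Colab error.
for name in ("pandas", "pandas-profiling"):
    print(name, installed_version(name))
```

An old pandas-profiling next to pandas 1.0 or newer is the combination that produces the join_axes error.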

2. Importing library

Here is how to import the library:

import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

3. Loading the dataset

We will use the Data Scientist Employment dataset (CSV) that was already cleaned in my blog post 8 common techniques for data cleaning — Python.

We will use the read_csv function to read our CSV-formatted data, like so:

df = pd.read_csv('Cleanned dataset.csv', na_values=-1)

Our data contains missing values represented by -1, so na_values=-1 tells pandas to treat them as missing (NaN).
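To see what na_values changes, here is a small sketch that reads the same CSV text twice. The CSV content is hypothetical and only stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = (
    "company,rating,founded\n"
    "Acme,3.5,1999\n"
    "Globex,-1,2005\n"
    "Initech,4.1,-1\n"
)

# Without na_values, -1 is read as an ordinary number.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=-1, every -1 is read as a missing value (NaN).
clean = pd.read_csv(io.StringIO(csv_text), na_values=-1)

print(raw.isna().sum())
print(clean.isna().sum())
```

Marking the placeholder as missing matters for the report: otherwise -1 would be counted as a real value and would distort statistics such as the mean and the minimum.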

4. Running Pandas Profiling

First, we create a variable called “report” and assign a ProfileReport() object to it. We also include a title for our report.

report = ProfileReport(df, title='Data Scientist Employment')
report

With only two lines of code, the report is generated automatically and analyzes all your data. This process can take a while if you have a big dataset.

Report generated using the Pandas Profiling library.

Note for Google Colab users:
report.to_widgets() will not work properly, as it is not yet fully supported on Google Colab.
-> Use report.to_notebook_iframe() instead, as below:

report.to_notebook_iframe()

5. Save the output as an HTML file to share your results

# Save the report as an HTML file
report.to_file(output_file='Data_Science_Employment.html')
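The project description quoted in section 1 mentions JSON export as well as HTML. As far as I know, to_file picks the format from the file extension, so a small helper could write both. This is a sketch under that assumption, and save_report is a name I made up for illustration.

```python
def save_report(report, stem, extensions=("html", "json")):
    """Save `report` once per extension and return the file names written."""
    written = []
    for ext in extensions:
        file_name = f"{stem}.{ext}"
        # Assumption: to_file chooses the output format from the extension.
        report.to_file(file_name)
        written.append(file_name)
    return written


# Usage with the `report` object created in step 4:
# save_report(report, "Data_Science_Employment")
```

The HTML file is the one to share with people, while the JSON file is handier if another program needs to read the results.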

Now, let's see what the result looks like!

3. Understand the report

Overview

This section contains 3 tabs: Overview, Alerts, and Reproduction.

The Overview tab gives you summary statistics about your dataset, as below:

The Alerts tab informs you about issues found in the columns of your dataset, such as duplicates, missing values, or high correlation between variables.

Variables

This section gives detailed statistics for each column in our DataFrame, such as distinct percentage, missing value percentage, memory size, and numeric statistics (mean, mode, median, maximum, minimum).
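To get a feeling for where these numbers come from, here is a rough sketch of similar per-column statistics computed with plain pandas. The toy data and the column_summary helper are my own, and the library's exact definitions may differ from this simplified version.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "industry": ["Tech", "Finance", "Tech", None],
    "min_salary": [80.0, 95.0, 70.0, 95.0],
})


def column_summary(series):
    """Compute a few of the statistics shown in the Variables section."""
    n = len(series)
    summary = {
        "distinct_pct": 100 * series.nunique() / n,   # NaN is not counted
        "missing_pct": 100 * series.isna().sum() / n,
        "memory_bytes": series.memory_usage(deep=True),
    }
    # Numeric statistics only make sense for numeric columns.
    if pd.api.types.is_numeric_dtype(series):
        summary.update(
            mean=series.mean(),
            median=series.median(),
            minimum=series.min(),
            maximum=series.max(),
        )
    return summary


summaries = {name: column_summary(df[name]) for name in df.columns}
for name, stats in summaries.items():
    print(name, stats)
```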

Let's take the Industry column and the Estimate Minimum Salary column as examples:

Interactions

This is a very useful section that enables you to plot one variable against another in order to identify how they relate to each other.

Here, I picked the Minimum and Maximum Salary variables. The plot indicates a positive relationship between them.

Correlations

This is also a useful section, as it helps us identify the correlations among all attributes. This helps data analysts pick out critical attributes for further analysis.

I personally find this super useful when creating a regression model.
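The same kind of correlation matrix can be sketched by hand with pandas. The numbers below are made up, and I use the default Pearson correlation here, which is only one of the measures a profiling report may show.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "min_salary": [60, 70, 80, 95],
    "max_salary": [90, 110, 125, 150],
    "rating": [4.1, 3.2, 3.9, 3.5],
    "industry": ["Health", "Tech", "Tech", "Finance"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()

print(corr.round(2))
```

In this toy data, min_salary and max_salary come out strongly positively correlated, which is the kind of relationship described in the Interactions section.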

Missing values

This section shows the number of missing values in each column and automatically generates a bar chart for comparison.
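A bare-bones version of this view can be sketched in pandas with a text bar per column. The data is again invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "company": ["Acme", "Globex", "Initech", "Umbrella"],
    "rating": [3.5, None, 4.1, None],
    "founded": [1999, 2005, None, 1968],
})

# Count missing values per column, largest first.
missing = df.isna().sum().sort_values(ascending=False)

# Print one text bar per column as a stand-in for the bar chart.
for column, count in missing.items():
    print(f"{column:<10} {'#' * int(count)} {count}")
```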

Sample

This section simply gives us a glance at our dataset.

4. Limitations of Pandas Profiling

Every product or piece of software comes with some limitations. This section discusses a few of the most common disadvantages of Pandas Profiling that I think are worth mentioning.

  • Pandas Profiling takes a long time to generate a report for a big dataset. Therefore, we should generate the report from only a part of the data we have.
  • If you have an older version of pandas installed or are using Google Colab, Pandas Profiling may throw an error. So make sure you always use the most recent version.
  • Pandas Profiling is good for an overall look at your dataset, but not for complex exploration. Therefore, you still have to learn pandas itself to gain more insight from the data.
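For the first limitation, one simple approach is to profile a random sample instead of the full dataset. The sketch below assumes a random sample is representative enough for a first look. The helper name sample_for_profiling and the row limit are my own choices.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=42):
    """Return df unchanged if it is small, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


# Hypothetical large dataset, invented for illustration only.
big = pd.DataFrame({"x": range(100_000)})
small = sample_for_profiling(big, max_rows=1_000)
print(len(big), len(small))

# The sample can then be profiled exactly as in section 2, for example:
# report = ProfileReport(small, title='Sampled report')
```

Fixing the seed keeps the sample, and therefore the report, the same between runs. The library also documents a minimal=True argument for ProfileReport that switches off the most expensive computations, which is worth trying before sampling.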

Note from the writer:

Thank you for being here. I hope this helps you in some way.

If you have any concerns, you can contact me through email: ethan.duong1120@gmail.com

Reference:

Pandas-profiling. PyPI. (n.d.). Retrieved January 12, 2023, from https://pypi.org/project/pandas-profiling/

Ismail, A. (2020, September 2). How to use pandas-profiling on Google Colab. Retrieved January 12, 2023, from https://plainenglish.io/blog/how-to-use-pandas-profiling-on-google-colab-e34f34ff1c9f
