Automation of EDA for the Superstore Dataset

Published in Analytics Vidhya · 4 min read · Nov 25, 2020

Exploratory Data Analysis (EDA) is an approach to analyzing and exploring data that applies a variety of techniques, mostly graphical, to the dataset we are working on.

EDA helps to:

uncover the relationships between variables
extract important variables
detect outliers
maximize insight into a data set
test underlying assumptions
develop parsimonious models
determine optimal factor settings

According to a survey in Forbes, data scientists spend 80% of their time on data preparation.

But what if I told you that Python can automate much of the EDA process with the help of a few libraries? Wouldn't that make your work easier? Let's start learning about automated EDA.

To cut down on that time, we will use open-source Python modules that can automate the whole EDA process in just a few lines of code.

If that is not enough to convince you, these tools also generate interactive reports in web format that can be presented to people with no programming knowledge.

Some of the popular Python libraries for automated EDA are:

Pandas Profiling
Sweetviz
Autoviz

The dataset can be found here.

1. Pandas Profiling

We usually reach for the df.describe() function for exploratory data analysis, but for more thorough EDA, pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
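As a point of reference, here is a minimal sketch of what df.describe() reports on its own. The tiny DataFrame and its column names are invented for illustration (they only mimic the style of the Superstore columns). By default describe() summarizes numeric columns only, and include="all" adds the text columns.

```python
import pandas as pd

# Toy data, invented for illustration; not rows from the real Superstore file.
df = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture", "Office Supplies"],
    "Sales": [261.96, 731.94, 14.62, 957.58],
    "Quantity": [2, 3, 2, 5],
})

# Default: count, mean, std, min, quartiles and max for numeric columns only.
numeric_summary = df.describe()

# include="all" also reports count, unique, top and freq for text columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Anything beyond these summary statistics (duplicates, correlations, warnings) has to be computed separately, which is the gap the profiling report fills.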

This dataset contains 9994 rows and 13 columns, so analyzing it manually would take a lot of time and effort. We don't have to worry, though: Pandas Profiling handles a dataset of this size and creates the report in a few seconds.

One of the strong points of Pandas Profiling is that it shows warning messages at the beginning of the report.

Overview of Profiling:

Counts the rows in the dataset
Detects the types of the columns in a dataframe
Finds duplicate rows, missing values, and unique values
Shows the most highly correlated variables using different techniques
Provides a descriptive analysis, which is very helpful
Shows high-cardinality columns and also does text analysis
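Several of these checks are things pandas can already compute one at a time; the report bundles them into one place. A rough sketch with plain pandas, using a small invented DataFrame (the column names are placeholders, and this is not how Pandas Profiling is implemented internally):

```python
import pandas as pd

# Invented sample rows; the second and third rows are exact duplicates.
df = pd.DataFrame({
    "Region": ["West", "East", "East", "South", None],
    "Sales": [100.0, 250.5, 250.5, 80.0, 40.0],
    "Quantity": [1, 3, 3, 2, 1],
})

n_rows, n_cols = df.shape                   # number of rows and columns
column_types = df.dtypes                    # inferred type of each column
n_duplicates = int(df.duplicated().sum())   # rows repeating an earlier row
missing_per_column = df.isna().sum()        # missing values per column
unique_per_column = df.nunique()            # distinct values (cardinality)

print(n_rows, n_cols, n_duplicates)
print(column_types)
print(missing_per_column)
print(unique_per_column)
```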

Install Pandas Profiling

Install using pip or from GitHub

# Installing using pip
pip install pandas-profiling[notebook]

# Installing from GitHub
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

We can generate the report using the following:

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")  # df is the pandas DataFrame to profile

Saving the report

If we want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, we can also obtain the report data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("my_report.json")

Let’s apply this to the Superstore dataset to create the report.
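A minimal sketch of that step, under a few assumptions: the file name SampleSuperstore.csv is the one used in the Autoviz example in this article and is assumed to sit in the working directory, while build_profile_report, the report title and the output name superstore_report.html are hypothetical names chosen here. Newer releases of the library are published as ydata-profiling, so the import tries that name first and falls back to pandas_profiling.

```python
import pandas as pd


def build_profile_report(csv_path, html_path):
    """Read a CSV file and write a profiling report to html_path."""
    # The package was renamed; try the newer name first, then the one
    # used in this article.
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    profile = ProfileReport(df, title="Superstore Profiling Report")
    profile.to_file(html_path)
    return profile


# Example call (assumes the CSV sits in the working directory):
# build_profile_report("SampleSuperstore.csv", "superstore_report.html")
```

Opening the resulting HTML file in a browser shows the report sections described next.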

Understanding the report

Overview: A simple overview of the dataset.
Variable Properties: All variables in the dataset and their properties, such as mean, median, etc.
Interaction of Variables: Interactions between the different categorical and numerical variables.
Correlations of the Variables: The generated report contains different types of correlations, such as Pearson’s, Spearman’s, Kendall’s, Phik, and Cramér’s, for the attributes of the dataset.
Missing Values: The report also shows which attributes have missing values.
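To make the correlation section more concrete, here is a sketch of two of the listed coefficients computed directly with pandas. The values are invented so that Profit rises with Sales but not linearly; Kendall’s, Phik and Cramér’s are left out to keep the example short.

```python
import pandas as pd

# Invented values: Profit is the square of Sales, so the relationship is
# perfectly monotonic but not linear.
df = pd.DataFrame({
    "Sales": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Profit": [1.0, 4.0, 9.0, 16.0, 25.0],
    "Segment": ["A", "B", "A", "B", "A"],
})

# Correlations are defined for numeric columns, so select those first.
numeric = df.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")    # linear association
spearman = numeric.corr(method="spearman")  # rank-based (monotonic) association

print(pearson.round(3))
print(spearman.round(3))
```

Pearson’s coefficient comes out slightly below 1 here while Spearman’s is exactly 1, which is why a report that shows several coefficients side by side can reveal non-linear relationships.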

Now let’s see the output achieved with a few lines of code:

Output for the Superstore dataset

2. Sweetviz

Sweetviz is a Python library that focuses on exploring and analyzing data through high-density, easy-to-understand visualizations. It not only automates the EDA process but can also compare datasets and help draw inferences from them.

Installing using pip

pip install sweetviz

We can generate the HTML report using the following:

# Importing
import sweetviz as sv

# Analyzing the DataFrame and displaying the report
store_report = sv.analyze(df)
store_report.show_html('store.html')

Understanding the Report

In Sweetviz, we can clearly see the different attributes of the dataset and their properties.

Now let’s see the output achieved with a few lines of code:

Sweetviz output

Sweetviz also allows you to compare two different datasets, or two parts of the same dataset, for example by splitting it into training and testing sets. Let’s see how it’s done.

# Compare the second half of the rows against the first half
compare_report = sv.compare(df[4997:], df[:4997])
compare_report.show_html('Compare.html')
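The slices df[4997:] and df[:4997] split the 9994 rows into two halves by position. If the rows happen to be ordered in the file, a shuffled split may give more comparable halves. A sketch of such a split with plain pandas follows; the stand-in DataFrame, the 50/50 ratio and the random_state value are arbitrary choices made for illustration.

```python
import pandas as pd

# Stand-in for the Superstore DataFrame: same row count, invented content.
df = pd.DataFrame({"Sales": range(9994)})

# Positional split, as in the article: first half and second half.
first_half = df[:4997]
second_half = df[4997:]

# Shuffled alternative: sample half of the rows, keep the rest as the other part.
train = df.sample(frac=0.5, random_state=42)
test = df.drop(train.index)

print(len(first_half), len(second_half))
print(len(train), len(test))
```

Either pair of frames can then be passed to sv.compare() in the same way as the positional slices.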

3. Autoviz

Autoviz is an open-source Python library that focuses on visualizing the relationships in the data. Its most impactful feature is that it creates plots and visualizations in just a few lines of code.

Installing using pip

# Installing using pip
pip install autoviz

We can generate the report using the following:

# Importing
from autoviz.AutoViz_Class import AutoViz_Class

# Analyzing the CSV file and displaying the report
AV = AutoViz_Class()
df = AV.AutoViz('SampleSuperstore.csv')

Understanding the Report

The above command will create a report containing the following plots:

Pairwise scatter plots of all continuous variables
Box plots and distplots
Histograms (KDE plots) of all continuous variables
Violin plots of all continuous variables
Heatmap of continuous variables
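As a rough idea of what one of these charts encodes, the numbers behind a box plot can be computed by hand. This sketch uses invented Profit values and the common 1.5 × IQR rule for flagging outliers; it is not Autoviz's own code, and plotting libraries may differ in how they compute quartiles.

```python
import pandas as pd

# Invented profit values with one obvious extreme value.
profit = pd.Series([10, 12, 14, 15, 16, 18, 20, 200], name="Profit")

q1 = profit.quantile(0.25)   # lower edge of the box
q3 = profit.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points beyond 1.5 * IQR from the box are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = profit[(profit < lower_fence) | (profit > upper_fence)]

print(q1, q3, iqr)
print(outliers.tolist())
```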

Now let’s see the output achieved with a few lines of code:

That’s it!

I hope you liked this content, support my work!

Written by Kashish Rastogi