Automation of EDA for the Superstore Dataset

Kashish Rastogi
Published in Analytics Vidhya
4 min read · Nov 25, 2020
Photo by William Iven on Unsplash

Exploratory Data Analysis (EDA) is an approach for data analysis and data exploration that employs a variety of techniques (mostly graphical representation) on the data we are working on.

EDA helps us to:

  • uncover relationships between variables
  • extract important variables
  • detect outliers
  • maximize insight into a data set
  • test underlying assumptions
  • develop parsimonious models
  • determine optimal factor settings

According to a survey in Forbes, data scientists spend 80% of their time on data preparation.

But what if I told you that Python can automate the EDA process with the help of a few libraries? Wouldn’t that make your work easier? Let’s start learning about automated EDA.

To minimize that time, we will use open-source Python modules that can automate the whole EDA process in just a few lines of code.

If that is not enough to convince you, these tools also generate interactive HTML reports that can be shared with people who have no programming knowledge.

Some of the popular Python libraries for automating EDA are:

  • Pandas Profiling
  • Sweetviz
  • Autoviz

The dataset can be found here.

1. Pandas Profiling

We usually reach for the df.describe() function for exploratory data analysis, but for serious EDA it is not enough. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

This dataset contains 9,994 rows and 13 columns, so analyzing it manually would take a lot of time and effort. Pandas Profiling handles a dataset of this size comfortably and creates the report in a few seconds.

One of the strong points of Pandas Profiling is that it lists warnings about the data at the beginning of the report.

Overview of Profiling:

  • Counts the rows and columns in the dataset
  • Detects the type of each column in the DataFrame
  • Finds duplicate rows, missing values, and unique values
  • Shows the most highly correlated variables using several correlation measures
  • Produces descriptive statistics for each variable
  • Flags high-cardinality columns and performs basic text analysis
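
Several of these overview figures can also be computed by hand with plain pandas, which is a handy sanity check on what the report shows. The sketch below is only an illustration: the helper name quick_overview is hypothetical, and the tiny DataFrame is made up rather than taken from the Superstore data.

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> dict:
    """Collect a few of the dataset-level figures a profiling report summarizes."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "unique_per_column": df.nunique().to_dict(),
    }

# Illustrative data only; not taken from the Superstore dataset.
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture", "Furniture"],
    "Sales": [100.0, 250.5, 100.0, None],
})

overview = quick_overview(toy)
print(overview)
```

The profiling report goes much further than this (warnings, correlations, per-variable plots), but the numbers in its overview tab should agree with checks like these.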

Install Pandas Profiling

Install using pip or from GitHub:

# Installing using pip (the quotes keep shells such as zsh from expanding the brackets)
pip install "pandas-profiling[notebook]"
# Installing from GitHub
pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

We can generate the report using the following:

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Pandas Profiling Report")

Saving the report

If we want to generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

profile.to_file("your_report.html")

Alternatively, we can obtain the data as JSON:

# As a string
json_data = profile.to_json()

# As a file
profile.to_file("my_report.json")

Let’s apply this to the Superstore dataset to create the report.
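
Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes the dataset has been downloaded as SampleSuperstore.csv (the file name used in the Autoviz section) into the working directory. The helper names load_superstore and build_report and the output file name are hypothetical. Newer releases of the library are published under the name ydata-profiling, so the sketch tries that import first and falls back to pandas_profiling.

```python
import pandas as pd

def load_superstore(csv_path: str = "SampleSuperstore.csv") -> pd.DataFrame:
    """Read the Superstore CSV into a DataFrame (the default path is an assumption)."""
    return pd.read_csv(csv_path)

def build_report(df: pd.DataFrame, out_path: str = "superstore_report.html"):
    """Profile the DataFrame and write the HTML report to out_path."""
    # Imported lazily so the loader above works without the profiling package.
    try:
        from ydata_profiling import ProfileReport   # newer package name
    except ImportError:
        from pandas_profiling import ProfileReport  # name used in this article
    profile = ProfileReport(df, title="Superstore Profiling Report")
    profile.to_file(out_path)
    return profile

# Usage (not run here, since it needs the CSV file and the profiling package):
# store_df = load_superstore()
# build_report(store_df)
```

Opening the resulting HTML file in a browser shows the sections described next.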

Understanding the report

  • Overview: A simple summary of the dataset.
  • Variable Properties: Every variable in the dataset with its properties, such as mean and median.
  • Interaction of Variables: How the categorical and numerical variables interact with each other.
  • Correlations: The report contains several types of correlation, namely Pearson’s, Spearman’s, Kendall’s, Phik, and Cramér’s, across the attributes of the dataset.
  • Missing Values: The report also shows which attributes have missing values.
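
To see why the report shows more than one correlation measure, here is a small sketch using plain pandas rather than the report itself. The numbers are made up: y increases with x, but not along a straight line. Pearson measures linear association, while Spearman works on ranks and so captures any monotonic relationship.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3

# DataFrame.corr returns a correlation matrix; pick out the x-y entry.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the ranks of x and y match exactly, Spearman reports a perfect relationship while Pearson does not, which is the kind of difference worth checking when the measures in the report disagree.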

Now let’s see the output achieved with just a few lines of code:

Output for the Superstore dataset

2. Sweetviz

Sweetviz is a Python library that focuses on exploring and analyzing data through high-density, easy-to-read visualizations. It not only automates the EDA process but can also compare datasets and help draw inferences from them.

Installing using pip

pip install sweetviz

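We can generate the HTML report using the following:

# Importing the libraries
import pandas as pd
import sweetviz as sv
# Loading the dataset
df = pd.read_csv('SampleSuperstore.csv')
# Analyzing the data and displaying the report
store_report = sv.analyze(df)
store_report.show_html('store.html')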

Understanding the Report

The Sweetviz report clearly shows each attribute of the dataset along with its properties.

Now let’s see the output achieved with just a few lines of code:

Sweetviz output

Sweetviz also allows you to compare two different datasets, or two parts of the same dataset, for example after splitting it into training and testing sets.

# Comparing the second half of the rows against the first half
compare_report = sv.compare(df[4997:], df[:4997])
compare_report.show_html('Compare.html')

Let’s see how it’s done.
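
The positional slices above leave the two halves unlabeled in the report. A slightly more explicit sketch, assuming a random split is acceptable, names each part by passing [dataframe, label] pairs to sv.compare. The helper names split_frame and compare_halves, the 50/50 ratio, and the seed are assumptions made for illustration.

```python
import pandas as pd

def split_frame(df: pd.DataFrame, train_frac: float = 0.5, seed: int = 42):
    """Randomly split df into two non-overlapping parts (assumes a unique index)."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test

def compare_halves(df: pd.DataFrame, out_path: str = "Compare.html"):
    """Build a Sweetviz comparison report for the two parts of df."""
    import sweetviz as sv  # imported lazily; requires `pip install sweetviz`
    train, test = split_frame(df)
    report = sv.compare([train, "Training"], [test, "Testing"])
    report.show_html(out_path)
    return report

# Usage (assumes the CSV is in the working directory):
# compare_halves(pd.read_csv("SampleSuperstore.csv"))
```

A random split avoids comparing two halves that differ only because of how the file happens to be ordered.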

3. Autoviz

Autoviz is an open-source Python library that focuses on visualizing the relationships in the data. It picks out the most impactful features and plots them in just a few lines of code.

Installing using pip

# Installing using pip
pip install autoviz

We can generate the visualizations using the following:

# Importing the library
from autoviz.AutoViz_Class import AutoViz_Class
# Analyzing the data and displaying the plots
AV = AutoViz_Class()
df = AV.AutoViz('SampleSuperstore.csv')

Understanding the Report

The above command produces the following plots:

  • Pairwise scatter plots of all continuous variables
  • Boxplots and distplots
  • Histograms (KDE plots) of all continuous variables
  • Violin plots of all continuous variables
  • Heatmap of continuous variables
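
AutoViz can also work on a DataFrame that is already in memory, optionally with a target column so that the plots are drawn against it. The sketch below is a hedged illustration: the wrapper name run_autoviz is hypothetical, and using "Profit" as the target assumes the CSV has a column with that name.

```python
import pandas as pd

def run_autoviz(df: pd.DataFrame, target: str = ""):
    """Run AutoViz on an in-memory DataFrame, optionally with a target column."""
    if target and target not in df.columns:
        raise ValueError(f"target column {target!r} not found in DataFrame")
    # Imported lazily; requires `pip install autoviz`.
    from autoviz.AutoViz_Class import AutoViz_Class
    AV = AutoViz_Class()
    # filename is left empty because the data is passed through dfte.
    return AV.AutoViz(filename="", dfte=df, depVar=target)

# Usage (assumes the CSV is in the working directory and has a "Profit" column):
# store_df = pd.read_csv("SampleSuperstore.csv")
# run_autoviz(store_df, target="Profit")
```

Checking the target name before calling the library gives a clearer error than a failure deep inside the plotting code.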

Now let’s see the output achieved with just a few lines of code:

That’s it!


I hope you liked this content. Please support my work!
