Automated, Quick, and Powerful EDA with Sweetviz Library!

Sukanya Bag
Published in Analytics Vidhya
6 min read · Aug 13, 2020
Figure: An overview of a Sweetviz analytics report

Exploratory Data Analysis (EDA) is the process of exploring different aspects of the data we are working with, largely by visualizing it. It should be performed before building a model or setting up a hypothesis test, in order to find the patterns and visual insights the dataset holds and to see what the data is telling us.

Analyzing a dataset is a hectic, time-consuming task. According to one study, EDA takes around 40% of the effort of a machine learning project, yet it cannot be skipped.

What is Sweetviz?

Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The output is a fully self-contained HTML application.

The system is built around quickly visualizing target values and comparing datasets. Its goal is to help quick analysis of target characteristics, training vs testing data, and other such data characterization tasks.

Make sure you visit https://pypi.org/project/sweetviz/ to explore more and to go through the documentation of this library.

Why is Sweetviz far better than Pandas Profiling?

Sweetviz packs a powerful punch; in addition to creating insightful and beautiful visualizations with just two lines of code, it provides analysis that would take a lot more time to generate manually, including some that no other library provides so quickly such as:

a) Comparison of 2 datasets (e.g. Train vs Test)

b) Visualization of the target value against all other variables (e.g. “What was the survival rate of male vs female” etc.)

c) Better behavior on difficult data: Pandas Profiling has been seen to throw errors on large datasets and on those containing many categorical features.

DATASET USED TO PERFORM EDA :

I have used the train and test datasets from the “House Prices: Advanced Regression Techniques” competition hosted on Kaggle for performing the EDA.

You can find the datasets on Kaggle at the following link :

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

OK! Cool enough!

Now let’s get started with Sweetviz and explore its awesome features!

1. INSTALLATION :

Type the following command in your command prompt/anaconda prompt/jupyter notebook or your preferred IDE :

pip install sweetviz --user

!! Note : If you are carrying out the installation in Jupyter Notebook, make sure to restart the kernel before starting to code.

2. IMPORT THE NECESSARY LIBRARIES :
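A minimal sketch of this step, assuming the conventional aliases `pd` for pandas and `sv` for Sweetviz. The availability check around the imports is an addition of this sketch; two plain `import` lines are enough once both packages are installed.

```python
import importlib.util

# Third-party packages used in this walkthrough.
REQUIRED = ("pandas", "sweetviz")

# find_spec returns None for a top-level package that is not installed.
missing = [name for name in REQUIRED if importlib.util.find_spec(name) is None]

if missing:
    print("Install these first:", ", ".join(missing))
else:
    import pandas as pd      # dataframes and CSV loading
    import sweetviz as sv    # automated EDA reports
    print("pandas and sweetviz are ready")
```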

3. LOAD TRAIN AND TEST DATASETS USING PANDAS :
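A minimal sketch of this step, assuming the two competition files have been downloaded as `train.csv` and `test.csv` into one folder. The helper name `load_train_test` and the `data_dir` argument are hypothetical and exist only for this illustration.

```python
from pathlib import Path

import pandas as pd


def load_train_test(data_dir="."):
    """Read train.csv and test.csv from data_dir into two DataFrames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train.csv")
    test = pd.read_csv(data_dir / "test.csv")
    return train, test
```

Calling `train, test = load_train_test()` from the folder that holds the two files gives the two dataframes analyzed in the following steps.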

4. PRINT OUT THE FIRST 5 AND LAST 5 ROWS :
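A minimal sketch of this step using pandas' `head()` and `tail()`. The small dataframe below holds made-up values and is there only so the snippet runs on its own; in practice `df` would be the loaded train or test dataframe.

```python
import pandas as pd

# Stand-in frame with made-up values (NOT the Kaggle data).
df = pd.DataFrame({
    "Id": range(1, 21),
    "SalePrice": [100_000 + 5_000 * i for i in range(20)],
})

first_rows = df.head()   # first 5 rows (head defaults to n=5)
last_rows = df.tail()    # last 5 rows (tail defaults to n=5)

print(first_rows)
print(last_rows)
```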

5. CASE 1: ANALYZING A SINGLE DATAFRAME (Train) :
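A minimal sketch of this step, built on Sweetviz's `analyze()` and `show_html()`. The wrapper function, its name, and the output file name are hypothetical; the two commented lines inside it are the "two lines of code" the article keeps referring to.

```python
def build_train_report(train_df, target="SalePrice", out_path="train_report.html"):
    """Analyze one dataframe with Sweetviz and write the HTML report to out_path."""
    # Imported here so the snippet still loads where Sweetviz is not installed yet.
    import sweetviz as sv

    # Line 1: run the analysis, telling Sweetviz which column is the target.
    report = sv.analyze(train_df, target_feat=target)
    # Line 2: write the self-contained HTML report (opened in the browser by default).
    report.show_html(out_path)
    return report
```

With the train dataframe loaded, `build_train_report(train)` would produce the report file.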

A snapshot of the cool web-based interactive HTML report generated is given below!

Figure: Report of automated EDA of train dataset

Pretty cool, isn’t it ?! This amazing interactive report can let you crawl in and find outstanding insights with just 2 lines of code!

6. CASE 2: COMPARING TWO DATAFRAMES FOR ADVANCED ANALYSIS (e.g. test vs. train sets)
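A minimal sketch of this step, built on Sweetviz's `compare()`, which takes each dataframe together with a display name. The wrapper function, the labels "Train" and "Test", and the output file name are assumptions made for illustration.

```python
def build_comparison_report(train_df, test_df, target="SalePrice",
                            out_path="compare_report.html"):
    """Compare two dataframes with Sweetviz and write the HTML report to out_path."""
    # Imported here so the snippet still loads where Sweetviz is not installed yet.
    import sweetviz as sv

    # Each source is passed as [dataframe, display name].
    report = sv.compare([train_df, "Train"], [test_df, "Test"], target_feat=target)
    report.show_html(out_path)
    return report
```

With both dataframes loaded, `build_comparison_report(train, test)` would produce the comparison report.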

And the report generated looks like…

Figure: Report of automated EDA of train vs test dataset

The blue color represents the train and the orange color represents the test dataset.

Now let’s dig deeper and see what insights we get from sweetviz automated EDA!

1. SUMMARY DISPLAY :

Figure: Summary of training and test datasets

The summary shows us the characteristics of both dataframes, train and test, side by side. We can see that the datasets have 0 duplicates. The legend at the bottom shows us that the training set does contain the “SalePrice” target variable, but that the testing set does not.
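The same headline facts can be cross-checked directly in pandas. This is a rough sketch, not what Sweetviz runs internally: the helper `quick_summary` is hypothetical, and `duplicated()` counts fully identical rows, which may differ from how the report counts duplicates.

```python
import pandas as pd


def quick_summary(df, target="SalePrice"):
    """Return row/column counts, duplicate rows, and whether the target column exists."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # duplicated() flags every repeat of an earlier, fully identical row.
        "duplicate_rows": int(df.duplicated().sum()),
        "has_target": target in df.columns,
    }


# Stand-in frames with made-up values (NOT the Kaggle data).
train_demo = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [200_000, 150_000, 180_000]})
test_demo = pd.DataFrame({"Id": [4, 5]})

print(quick_summary(train_demo))
print(quick_summary(test_demo))
```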

2. ASSOCIATION :

Hovering your mouse over the “Associations” button in the summary will make the Associations graph appear on the right-hand side :

Figure: Association of Train dataset

In addition to the traditional numerical correlations, the graph brings together in a single view the uncertainty coefficient (for categorical-categorical pairs) and the correlation ratio (for categorical-numerical pairs). Squares mark associations that involve a categorical feature, and circles mark numerical-numerical correlations.

Finally, it is worth mentioning that these correlation/association methods shouldn’t be taken as gospel, as they make some assumptions about the underlying distribution of the feature data. However, they can be very useful for a quick overview.
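To make the two less familiar measures concrete, here is a plain-Python sketch of their textbook definitions. It is not Sweetviz's internal code, and the function names and toy lists are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical feature."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    overall_mean = sum(values) / len(values)
    # Spread of the group means around the overall mean, weighted by group size.
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
                  for g in groups.values())
    # Total spread of all values around the overall mean.
    total = sum((v - overall_mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def uncertainty_coefficient(x, y):
    """Theil's U(x|y): share of the uncertainty in x removed by knowing y."""
    h_x = entropy(x)
    if h_x == 0:
        return 1.0  # convention used here: x is constant, nothing left to explain
    by_y = defaultdict(list)
    for xi, yi in zip(x, y):
        by_y[yi].append(xi)
    # Conditional entropy H(x|y): entropy of x within each y group, weighted.
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in by_y.values())
    return (h_x - h_x_given_y) / h_x


# Toy data, made up for illustration.
zone = ["a", "a", "b", "b"]
price = [1, 1, 3, 3]
house_id = [1, 2, 3, 4]

print(correlation_ratio(zone, price))
print(uncertainty_coefficient(house_id, zone))
print(uncertainty_coefficient(zone, house_id))
```

The last two calls give different values: unlike an ordinary correlation, the uncertainty coefficient is asymmetric, so how well A predicts B is not the same as how well B predicts A.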

3. TARGET VARIABLE :

Figure: Target variable analysis (SalePrice)

When a target variable is specified, it will show up at the top, in a black box.

NOTE: only numerical and boolean features can be targets currently.

We can gather from this summary that “SalePrice” has no missing data in the training set (1460 values, 100%), that it takes 663 distinct values (a little under 46% of all values), and, judging from the graph, that roughly 40%-45% of the sale prices are around 200k.
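The counts in this box can be reproduced with a few pandas calls. This is a sketch only: the helper `target_summary` is hypothetical, and expressing the percentages relative to the number of rows is an assumption about how the report computes them.

```python
import pandas as pd


def target_summary(series):
    """Count present and distinct values of a column, with percentages of all rows."""
    n_rows = len(series)
    n_present = int(series.notna().sum())
    n_distinct = int(series.nunique())  # missing values are not counted as a value
    return {
        "present": n_present,
        "present_pct": round(100 * n_present / n_rows, 1),
        "distinct": n_distinct,
        "distinct_pct": round(100 * n_distinct / n_rows, 1),
    }


# Stand-in column with made-up values (NOT the Kaggle data).
demo = pd.Series([200_000, 200_000, 150_000, None], name="SalePrice")
print(target_summary(demo))
```

With the figures quoted above, 663 distinct values out of 1460 rows works out to about 45.4%.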

4. DETAIL AREA (CATEGORICAL FEATURES) :

Figure: Categorical feature details

Categorical features are the ones labelled as such in the report; unlike numerical features, they come without descriptive statistics.

5. DETAIL AREA (NUMERICAL FEATURES) :

Figure: Numerical feature details

Numerical features show more information in their summary because, unlike categorical and boolean features, they come with descriptive statistics.

Note that the target value (“SalePrice” in this case) is plotted as a line, right over the distribution graph. This enables an instant analysis of the target distribution with regard to other variables.

So that’s how you can gain extremely meaningful insights from your data with just 2 lines of code!

I need to leave the more detailed discussion here as it is beyond the scope of this blog, but make sure to visit my GitHub repository, where I have uploaded this Jupyter Notebook and the HTML reports, to study and gain insights from the EDA!

GitHub link: https://github.com/Sukanya-git464/AUTOMATED-QUICK-EDA-WITH-SWEETVIZ-

CONCLUSION :

I hope y’all had fun analyzing all this information from just two lines of code!

Using Sweetviz gives me a significant jump-start whenever I start looking at a new dataset. It’s worth pointing out that I also find it useful later in the analysis process, for example during feature engineering and feature selection, to get a quick overview of how various features play out. I hope you will find it an amazing tool in your own data analysis.

But remember, you should also carry out your own Data Analysis, apart from trying out these automated methods, as beautiful and insightful data analysis is actually an ART and you should be an ARTIST in it!

Thank you, and,

Happy Pythoning!

Until next time… :)

I love to teach Machine Learning in simple words! All links at bio.link/sukannya