Automated EDA Using Pandas Profiling, Sweetviz, and AutoViz

Guhanesvar · Published in Analytics Vidhya · Aug 19, 2021 · 7 min read

Exploratory data analysis (EDA) describes what the data looks like, how its attributes relate to one another, and more. EDA is not just about dealing with numbers; it is also about understanding the meaningful relationships between attributes in the data, and about analyzing those attributes visually with graphs suited to each purpose. An EDA helps us draw conclusions about how the data's attributes are distributed, and it should be done before modeling, to understand and prepare the data. A meaningful EDA must help in identifying patterns in the dataset.

Photo by Luke Chesser on Unsplash

Analyzing and cleaning data takes up about 80% of the work in preparing a machine learning model, and a correspondingly large share of your time. Python provides open-source modules that can automate much of the EDA process and save a lot of time. Three that I have used and found very resourceful are listed below:

  1. Pandas Profiling
  2. Sweetviz
  3. Autoviz

Pandas Profiling

EDA can be automated using a Python library called Pandas Profiling. It is a great tool for creating reports in an interactive HTML format that is easy to understand and analyze. Let's explore Pandas Profiling to do EDA in very little time, with just a single line of code.

Here I've taken a dataset on heart failure from Kaggle that contains about 300 records and 13 columns (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). I've used a fairly small dataset since pandas profiling consumes more time than sweetviz and autoviz.

Implementation

First, install pandas profiling:

pip install pandas-profiling

import pandas as pd
from pandas_profiling import ProfileReport

# Load the heart failure dataset downloaded from Kaggle
df = pd.read_csv(r'C:\Users\Guhanesvar\Downloads\heart_failure_clinical_records_dataset.csv')

# Generate the profiling report and write it out as HTML
design_report = ProfileReport(df)
design_report.to_file(output_file='report.html')

This generates an output report in HTML format in the current working folder.
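If you are working in a Jupyter notebook, the report can also be rendered inline, and for larger datasets pandas profiling offers a lighter minimal mode that skips the most expensive computations. A short sketch, assuming the df and design_report objects from above and a reasonably recent pandas-profiling version:

from pandas_profiling import ProfileReport

# Render the existing report inline in a notebook cell
design_report.to_notebook_iframe()

# Minimal mode skips costly computations (correlations, interactions)
# and runs much faster on large datasets
minimal_report = ProfileReport(df, minimal=True)
minimal_report.to_file(output_file='report_minimal.html')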

Understanding the report

The generated report contains many sections; we'll explore them one by one.

Overview

The Warnings tab flags noteworthy issues, such as highly correlated variables.

Variables

Next, scroll down to the Variables section, which lists all the variables in the dataset along with their properties.

The Toggle details tab contains general statistics about the attribute, its most common and extreme values, and a histogram.

Interactions

In the Interactions section, you can see how each variable relates to the others: a scatter plot is drawn for the pair of variables you select.

Correlations

The Correlations section contains different correlation plots, namely Pearson's r, Spearman's ρ, Kendall's τ, Phik (φk), and Cramér's V (φc), for each attribute.
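Depending on your pandas-profiling version, you can also choose which of these correlation measures get computed; turning off the slower ones speeds up report generation on bigger data. A hedged sketch of that configuration (the key names below follow the settings documented for pandas-profiling v2/v3; verify against your installed version):

from pandas_profiling import ProfileReport

# Toggle individual correlation computations on or off
report = ProfileReport(
    df,
    correlations={
        "pearson":  {"calculate": True},
        "spearman": {"calculate": True},
        "kendall":  {"calculate": False},  # skip the slower measures
        "phi_k":    {"calculate": False},
        "cramers":  {"calculate": False},
    },
)
report.to_file(output_file='report_fast.html')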

The generated report is really helpful for identifying patterns in the data and finding out its characteristics.

Sweetviz

One of the latest options is an open-source Python library called Sweetviz, built for just that purpose. It takes pandas dataframes and creates a self-contained HTML report that can be viewed on its own in a browser or integrated into notebooks.

It packs a powerful punch: in addition to creating insightful and beautiful visualizations with just two lines of code, it provides analysis that would take much more time to generate manually.

Implementation

First, install sweetviz:

pip install sweetviz

import sweetviz as sv

# Analyze the dataframe, with DEATH_EVENT as the target variable
sweet_report = sv.analyze(df, "DEATH_EVENT")
sweet_report.show_html('sweetviz_report.html')

Here ‘DEATH_EVENT’ is our target variable. This generates an output report in HTML format in the current working folder.
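If you prefer to stay inside a Jupyter notebook, recent sweetviz versions can also embed the report directly in the notebook instead of writing an HTML file:

import sweetviz as sv

sweet_report = sv.analyze(df, "DEATH_EVENT")
# Embed the report in the notebook output (sweetviz 2.x)
sweet_report.show_notebook()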

Understanding the report

The top of the report contains information about the data, such as how many rows it has and how many of its features are categorical versus numerical.

When you hover over the Associations tab, a correlation matrix pops up. This graph is a composite of the visuals from Drazen Zaric's Better Heatmaps and Correlation Matrix Plots in Python and concepts from Shaked Zychlinski's The Search for Categorical Correlation.

Basically, in addition to showing the traditional numerical correlations, it unifies in a single graph the numerical correlation, the uncertainty coefficient (for categorical-categorical pairs), and the correlation ratio (for categorical-numerical pairs, sketched below). Note that the trivial diagonal is left empty, for clarity.

Squares: Categorical Association

Circles: Numerical Association

The larger the circle or square, the more strongly that variable is associated with the target variable.

Next, when you hover over each variable, you can see its numerical and categorical associations (correlations), ranked.
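To make the correlation ratio concrete, here is a minimal hand-rolled sketch of that measure; it is purely illustrative and not Sweetviz's internal code:

import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    # eta = sqrt(between-group variance / total variance)
    grand_mean = values.mean()
    between = sum(
        group.count() * (group.mean() - grand_mean) ** 2
        for _, group in values.groupby(categories)
    )
    total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(between / total)) if total > 0 else 0.0

# e.g. how strongly the categorical sex column explains a numeric one:
# correlation_ratio(df["sex"], df["ejection_fraction"])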

Target analysis

When analyzing a dataset that has a target variable, this feature is incredibly insightful.

If we specify a target variable (only boolean and numerical targets are currently supported; see the workaround sketch below for categorical targets), it is displayed prominently as the first variable and uses black coloring.
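If your target happens to be a categorical string column, a simple workaround is to map it to 0/1 first. The column name outcome below is hypothetical, for illustration only:

import sweetviz as sv

# Hypothetical string target "outcome" mapped to 0/1 so sweetviz accepts it
df["outcome_num"] = df["outcome"].map({"no": 0, "yes": 1})
report = sv.analyze(df, "outcome_num")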

Most importantly, the target's value is overlaid on top of every other graph, quickly giving an insight into the distribution of the target with regard to every other variable.

At a glance, you can immediately spot how the target value is influenced by other variables. As expected, this generally follows what is found in the “Associations” graph, but gives you specifics for each variable.

For numerical data, you can change the number of “bins” in the graph, to better gauge distribution, as well as how the target feature correlates.

You'll also find tables below that show the largest, smallest, and most frequent values of the attribute.

# Compare two slices of the dataset, e.g. as a stand-in for train vs. test
compare_report = sv.compare(df[210:], df[:90])
compare_report.show_html('sweetvizCompare2.html')

You can compare two different datasets, which can be extremely useful (e.g. train vs. test data). And even if you are only looking at a single dataset, you can study the characteristics of different subpopulations within it, as shown below.
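For the subpopulation case, sweetviz also offers compare_intra, which splits one dataframe on a boolean condition. A sketch using the sex column of the heart failure dataset (where 1 encodes male):

import sweetviz as sv

# Split the single dataframe on a condition and compare the two halves
intra_report = sv.compare_intra(df, df["sex"] == 1, ["Male", "Female"], "DEATH_EVENT")
intra_report.show_html('sweetviz_intra.html')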

With target analysis, dataset/intra-set comparisons, full feature analysis, and unified association/correlation data, Sweetviz provides an unparalleled wealth of insights with just 2 lines of code.

AutoViz

AutoViz performs automatic visualization of any dataset with just one line of code. AutoViz can find the most important features and plot impactful visualizations using only those automatically selected features. Also, AutoViz is incredibly fast, creating visualizations within seconds.

Implementation

First, install autoviz:

pip install autoviz

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
# Point AutoViz at the CSV file; it returns the analyzed dataframe
df = AV.AutoViz('heart_failure_clinical_records_dataset.csv')

AutoViz is relatively fast, but the output is generated within the notebook. AutoViz generates more visualizations than pandas profiling and sweetviz.
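AutoViz can also work from a dataframe that is already in memory instead of a CSV path, and it exposes a few useful knobs. A sketch, with parameter names as in recent autoviz versions:

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
# Pass an empty filename and hand the dataframe over via dfte;
# verbose controls how much is printed, max_rows_analyzed caps the sample
df_av = AV.AutoViz(
    "",
    dfte=df,
    verbose=1,
    chart_format='svg',
    max_rows_analyzed=150000,
)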

Understanding the results

Pairwise scatter plot of all continuous variables

For each pair of continuous variables, a scatter plot is produced.

Distplot, Boxplot, and Probability plot of all continuous variables

Distplot: shows the distribution of values in the variable

Boxplot: helps find outliers

Probability plot: shows how closely the variable follows a theoretical (e.g. normal) distribution
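For reference, here is roughly how you would draw the same three plots by hand for a single column (ejection_fraction from the heart failure dataset, assuming seaborn 0.11+); AutoViz simply automates this across all continuous variables:

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df["ejection_fraction"], kde=True, ax=axes[0])  # distribution
sns.boxplot(x=df["ejection_fraction"], ax=axes[1])           # outliers
stats.probplot(df["ejection_fraction"], plot=axes[2])        # vs. normal quantiles
plt.show()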

Violin Plots of all continuous variables

Violin plots are used to observe the distribution of numeric data, and they are especially useful for comparing distributions between multiple groups. The peaks, valleys, and tails of each group's density curve can be compared to see where groups are similar or different.
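As a quick manual equivalent, a single violin plot comparing a numeric column across the two target groups might look like this (column names from the heart failure dataset):

import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of platelets for each DEATH_EVENT group
sns.violinplot(x="DEATH_EVENT", y="platelets", data=df)
plt.show()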

Furthermore, it will generate a correlation heatmap and a few more bar charts.

If we know the dependent variable in the dataset, i.e. the variable that depends on the others, we can pass it as an argument and visualize the data in the context of that dependent variable:

df = AV.AutoViz('heart_failure_clinical_records_dataset.csv', depVar='DEATH_EVENT')

This creates the same report as above, but in the context of the dependent variable, i.e. DEATH_EVENT.

Conclusion

Summing it all up, these three automated EDA libraries each have their own advantages and disadvantages over the others. By using them you can save a lot of time and get results quickly. The overall purpose of these libraries is to help with:

  • Feature engineering: visualize how engineered features perform/correlate relative to other features and the target variable
  • Interpretation/communication: the generated graphs can provide insights that are easily interpretable and can be passed amongst a team or to clients quickly, without any extra work
  • Testing: confirm the makeup & balance of testing/validation sets
