Automated EDA for Classification

Published in Analytics Vidhya · 8 min read · Jul 28, 2020

Exploratory Data Analysis made simple in a few lines of code!

Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

However, EDA generally takes a lot of time and effort in the ML workflow, and as we all know, time is money! As beginners in data science, most of us have faced the challenge of doing EDA comprehensively. Covering a variety of visualizations while exploring the data takes many lines of code. Also, the guidance on what a good EDA requires is scattered across many different data science resources.

This motivated me to write a Python module called ClfAutoEDA for performing automated EDA on virtually any classification problem. Although there are a few automated EDA libraries out there, they do not cover all aspects of EDA and are not very flexible. I am sharing this module so that you can customize it and add functions of your choice.

For the source code, visit the GitHub repository:

github.com/jatinkataria94/EDA-Classification

I will demonstrate how ClfAutoEDA works using two popular classification datasets: Titanic and Iris.

The datasets are available in the same GitHub repository (jatinkataria94/EDA-Classification).

Importing Module

Download the ClfAutoEDA.py file from the repository and store it in your working directory (the location where your other Python files and datasets are stored).

Create a new Python file where you want to work and import the module in that file.

# import the autoEDA module
from ClfAutoEDA import *

The statement above imports all the functions defined in the ClfAutoEDA.py file.

Loading the Data

# Load the iris dataset from the csv file using pandas
import pandas as pd  # explicit import, so pd does not depend on the wildcard import
df = pd.read_csv('iris_small.csv')

EDA Parameters

Once the module is imported, you can call the EDA function in your Python file, which automatically performs the required exploration on your dataset. Before calling the function, we must understand its parameters:

EDA(df,labels,target_variable_name,
        data_summary_figsize,corr_matrix_figsize,
        data_summary_figcol,corr_matrix_figcol,
        corr_matrix_annot,
        pairplt_col,pairplt,
        feature_division_figsize)

df: dataframe. (The dataframe containing your classification data)

labels: 1D list of str. (The names given to the labels of the target variable)

target_variable_name: str. (The name of the target variable in the dataset)

data_summary_figsize: tuple. (The size of the figure containing the data summary), default: (16,16)

corr_matrix_figsize: tuple. (The size of the figure containing the correlation matrix), default: (16,16)

data_summary_figcol: str. (The color name of the figure containing the data summary), default: 'Reds_r'

corr_matrix_figcol: str. (The color name of the figure containing the correlation matrix), default: 'Blues'

corr_matrix_annot: boolean. (True if you want to display annotations/coefficients on the correlation matrix), default: False

pairplt_col: 1D list of str. (The names of the columns/features for which pairplots are required), default: 'all' (all the features are used)

pairplt: boolean. (True if you want pairplots), default: False

feature_division_figsize: tuple. (The size of the figure containing the bar and pie plots of the proportions of the target and categorical feature labels), default: (12,12)
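To make the list above easier to scan, here is a minimal sketch of a stub that only mirrors the documented parameters and defaults. The name eda_signature_sketch is hypothetical, the stub does no analysis, and it is not the code from ClfAutoEDA.py. Treating the first three parameters as required is an assumption based on the fact that no defaults are listed for them.

```python
import inspect


# Hypothetical stub: it only mirrors the documented parameters and defaults.
# It is NOT the implementation found in ClfAutoEDA.py.
def eda_signature_sketch(df, labels, target_variable_name,
                         data_summary_figsize=(16, 16),
                         corr_matrix_figsize=(16, 16),
                         data_summary_figcol="Reds_r",
                         corr_matrix_figcol="Blues",
                         corr_matrix_annot=False,
                         pairplt_col="all",
                         pairplt=False,
                         feature_division_figsize=(12, 12)):
    """Placeholder with the documented argument list; performs no analysis."""
    return None


# Split the parameters into required ones and optional ones with defaults.
params = inspect.signature(eda_signature_sketch).parameters
required = [name for name, p in params.items()
            if p.default is inspect.Parameter.empty]
optional = {name: p.default for name, p in params.items()
            if p.default is not inspect.Parameter.empty}

print("Required:", required)
print("Optional:", optional)
```

Under these assumptions, only df, labels and target_variable_name have to be passed, and everything else can be overridden by keyword when the defaults do not suit your data.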

NOTE: This function returns 3 items (the dataframe after removing null values, the list of numerical features, and the list of categorical features) in addition to the EDA plots and data description. The EDA plots include heatmaps of the correlation matrix, data summary, missing values, skewness plots, violin plots, pairplots, boxplots, and bar and pie charts of the categorical features' distributions.
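To give a feel for the three returned items, here is a rough sketch of how such a split could be done with pandas. The function name split_features_sketch and the toy data are made up for illustration, and the real module may clean the data and detect feature types differently.

```python
import pandas as pd


def split_features_sketch(df):
    """Illustrative only: drop rows with nulls, then split columns by dtype."""
    cleaned = df.dropna()
    # Numeric dtypes (int, float) are treated as numerical features.
    num_features = cleaned.select_dtypes(include="number").columns.tolist()
    # Everything else (for example, strings) is treated as categorical.
    cat_features = cleaned.select_dtypes(exclude="number").columns.tolist()
    return cleaned, num_features, cat_features


# Toy data, invented for this example.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

cleaned, num_features, cat_features = split_features_sketch(toy)
print(cleaned.shape)
print(num_features)
print(cat_features)
```

The three values are unpacked in the same order as in the EDA calls in this article: the processed dataframe first, then the numerical and categorical feature lists.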

Run the program for iris data

We will set the values of EDA function parameters and then run the program.

# Setting parameter values
target_variable_name='species'
labels=['F1','F2','F3']

# Calling EDA function with parameters of choice
df_processed,num_features,cat_features=EDA(df,labels,
                                         target_variable_name,
                                         data_summary_figsize=(6,6),
                                         corr_matrix_figsize=(6,6),
                                         corr_matrix_annot=True,
                                         pairplt=True)

Voila! With hardly 4–5 lines of code, it automatically gives you the following data description and plots.

The data looks like this: 
    sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2       0
1           4.9          3.0           1.4          0.2       0
2           4.7          3.2           1.3          0.2       0
3           4.6          3.1           1.5          0.2       0
4           5.0          3.6           1.4          0.2       0

The shape of data is:  (150, 5)

The missing values in data are: 
 species         7
petal_width     0
petal_length    0
sepal_width     0
sepal_length    0
dtype: int64

The summary of data is: 
        sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

Some useful data information: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       143 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

The columns in data are: 
 ['sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species']

The target variable is divided into: 
 0    50
1    45
2    48
Name: species, dtype: int64

The numerical features are: 
 ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

The categorical features are: 
 []

Execution Time for EDA: 0.11 minutes
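Note how the numbers in the output fit together: the data has 150 rows, species has 7 missing values, and the class counts (50 + 45 + 48) add up to the remaining 143 rows. A small sketch of the same bookkeeping with plain pandas is shown below. The toy data is invented, and this is not necessarily how ClfAutoEDA computes these figures.

```python
import pandas as pd

# Toy data, invented for this example: two rows have no target value.
toy = pd.DataFrame({
    "petal_width": [0.2, 0.2, 1.3, 1.8, 2.5],
    "species": ["0", None, "1", "2", None],
})

# Missing values per column, largest count first.
missing = toy.isnull().sum().sort_values(ascending=False)
print(missing)

# Class counts computed on the rows that remain after dropping nulls.
target_counts = toy.dropna()["species"].value_counts()
print(target_counts)

# The class counts account for every row that is not missing a target.
print(len(toy) - missing["species"] == target_counts.sum())
```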

Run the program for titanic data

from ClfAutoEDA import *
import pandas as pd

df=pd.read_csv('titanic_train.csv')

# Dropping Id related columns
df.drop(['PassengerId','Ticket'],axis=1,inplace=True)

# Setting parameter values
labels=["not survived","survived"]
target_variable_name='Survived'

df_processed,num_features,cat_features=EDA(df,labels,
                                         target_variable_name,
                                         data_summary_figsize=(6,6),
                                         corr_matrix_figsize=(6,6),
                                         corr_matrix_annot=True,
                                         pairplt=True)

Let us see how the EDA looks for the Titanic dataset.

The data looks like this: 
    Survived  Pclass                                               Name  \
0         0       3                            Braund, Mr. Owen Harris   
1         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2         1       3                             Heikkinen, Miss. Laina   
3         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4         0       3                           Allen, Mr. William Henry   

      Sex   Age  SibSp  Parch     Fare Cabin Embarked  
0    male  22.0      1      0   7.2500   NaN        S  
1  female  38.0      1      0  71.2833   C85        C  
2  female  26.0      0      0   7.9250   NaN        S  
3  female  35.0      1      0  53.1000  C123        S  
4    male  35.0      0      0   8.0500   NaN        S  

The shape of data is:  (891, 10)

The missing values in data are: 
 Cabin       687
Age         177
Embarked      2
Fare          0
Parch         0
SibSp         0
Sex           0
Name          0
Pclass        0
Survived      0
dtype: int64

The summary of data is: 
          Survived      Pclass         Age       SibSp       Parch        Fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Some useful data information: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Cabin     204 non-null    object 
 9   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB
None

The columns in data are: 
 ['Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Fare' 'Cabin'
 'Embarked']

The target variable is divided into: 
 0    424
1    288
Name: Survived, dtype: int64

The numerical features are: 
 ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

The categorical features are: 
 ['Name', 'Sex', 'Embarked']

The categorical variable is divided into: 
 Karlsson, Mr. Nils August           1
Rosblom, Mr. Viktor Richard         1
Shutes, Miss. Elizabeth W           1
Sutton, Mr. Frederick               1
Silverthorne, Mr. Spencer Victor    1
                                   ..
Jansson, Mr. Carl Olof              1
Dean, Master. Bertram Vere          1
Greenberg, Mr. Samuel               1
Leinonen, Mr. Antti Gustaf          1
Homer, Mr. Harry ("Mr E Haven")     1
Name: Name, Length: 712, dtype: int64

The categorical variable Name has too many divisions to plot 


The categorical variable is divided into: 
 male      453
female    259
Name: Sex, dtype: int64

The categorical variable is divided into: 
 S    554
C    130
Q     28
Name: Embarked, dtype: int64

Execution Time for EDA: 0.15 minutes
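The message about Name in the output above suggests that categorical variables with too many distinct values are skipped when drawing bar and pie charts. One way to sketch such a check is shown below. The cutoff MAX_LEVELS and the helper plottable_categoricals are hypothetical, and the actual rule used by the module may differ.

```python
import pandas as pd

MAX_LEVELS = 10  # hypothetical cutoff, chosen only for illustration


def plottable_categoricals(df, cat_features, max_levels=MAX_LEVELS):
    """Return the categorical columns with few enough distinct values to chart."""
    keep = []
    for col in cat_features:
        n_levels = df[col].nunique()
        if n_levels > max_levels:
            print(f"The categorical variable {col} has too many divisions to plot")
        else:
            keep.append(col)
    return keep


# Toy data, invented for this example: Name is unique per row.
toy = pd.DataFrame({
    "Name": [f"Passenger {i}" for i in range(20)],
    "Sex": ["male", "female"] * 10,
    "Embarked": ["S", "C", "Q", "S"] * 5,
})

print(plottable_categoricals(toy, ["Name", "Sex", "Embarked"]))
```

A column such as Name, where almost every value is unique, carries little information in a bar or pie chart, so skipping it keeps the plots readable.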

To understand what each of the above plots means, kindly refer to these wonderful articles:

5 reasons you should use a violin graph - BioTuring's Blog (blog.bioturing.com)

Violin Plot (datavizcatalogue.com)

Visualizing the patterns of missing value occurrence with Python (dev.to)

Understanding Boxplots (towardsdatascience.com)

Hot or Not? Heatmaps and Correlation Matrix Plots (medium.com)

Histograms and Density Plots in Python (towardsdatascience.com)

Before You Go

Thanks for reading! Feel free to use this automated EDA module for your classification problems. If you run into any difficulty or have any questions, please comment below. Your support is always highly appreciated. If you want to get in touch, reach me at jatin.kataria94@gmail.com.


Written by Jatin Kataria