Automated EDA for Classification

Jatin Kataria
Published in Analytics Vidhya
8 min read · Jul 28, 2020

Exploratory Data Analysis made simple in a few lines of code!

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

However, EDA usually takes a lot of time and effort in the ML workflow, and as we all know, time is money! As beginners in data science, I am sure most of us have struggled to do EDA comprehensively. Covering a variety of visualizations while exploring the data takes many lines of code. Also, guidance on what a good EDA requires is scattered in bits and pieces across the data science world.

This motivated me to write a Python module called ClfAutoEDA that performs automated EDA on virtually any classification problem. There are a few automated EDA libraries out there, but they do not cover every aspect of EDA and are not very flexible. I am sharing this module so that you can customize it and add functions of your own.

For the source code please visit the following link:

I will demonstrate how ClfAutoEDA works using two popular classification datasets: Titanic and Iris.

For accessing datasets, visit the following link:

Importing Module

Download the ClfAutoEDA.py file from the link shared above and store it in your working directory (the location where your other Python files and datasets are stored).

Create a new Python file where you want to work and import the module in that file.

# import the autoEDA module
from ClfAutoEDA import *

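The statement above imports all the functions defined in the ClfAutoEDA.py file. The snippets in this article also use pd (pandas) without importing it separately, so they rely on this wildcard import to make it available.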

Loading the Data

# Load the iris dataset from the csv file using pandas
df=pd.read_csv('iris_small.csv')

EDA Parameters

Once the module is imported, you can call the EDA function in your Python file, and it will automatically run all the required exploration on your dataset. Before calling the function, we must understand its parameters:

EDA(df,labels,target_variable_name,
data_summary_figsize,corr_matrix_figsize,
data_summary_figcol,corr_matrix_figcol,
corr_matrix_annot,
pairplt_col,pairplt,
feature_division_figsize)

df : dataframe containing your classification data

labels: 1D list of str. (The names given to the labels of the target variable)

target_variable_name: str. (The name of the target variable in the dataset)

data_summary_figsize: tuple. (The size of figure containing data summary), default: (16,16)

corr_matrix_figsize: tuple. (The size of figure containing correlation matrix), default: (16,16)

data_summary_figcol: str. (The colormap name for the figure containing data summary), default: 'Reds_r'

corr_matrix_figcol: str. (The colormap name for the figure containing correlation matrix), default: 'Blues'

corr_matrix_annot: boolean. (True if you want to display annotations/coefficients on the correlation matrix), default: False

pairplt_col: 1D list of str. (The names of the columns/features for which pairplots are required), default: 'all' (All the features will be taken)

pairplt: boolean. (True if you want pairplots), default: False

feature_division_figsize: tuple. (The size of figure containing bar and pie plot of proportion of target and categorical feature labels), default: (12,12)
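To make the correlation-matrix parameters more concrete, here is a minimal sketch of the kind of table such a figure is drawn from. It is only an illustration built on plain pandas, not the code inside ClfAutoEDA; the helper name correlation_table and the toy columns are my own assumptions. Setting corr_matrix_annot=True would amount to printing coefficients like these on the heatmap cells.

```python
import pandas as pd

def correlation_table(df):
    """Hypothetical helper: pairwise Pearson correlations of the numeric columns."""
    # Non-numeric columns cannot be correlated, so keep only the numeric ones.
    numeric = df.select_dtypes(include="number")
    return numeric.corr().round(2)

# Toy data (made up for illustration, not from the iris or titanic files)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * x
    "z": [4.0, 3.0, 2.0, 1.0],   # decreases as x increases
    "name": ["a", "b", "c", "d"],
})
corr = correlation_table(toy)
print(corr)
```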

NOTE: Apart from printing the data description and drawing the EDA plots, this function returns 3 items (the dataframe after removing null values, the list of numerical features, and the list of categorical features). The EDA plots include heatmaps of the correlation matrix, data summary, missing values, skewness plots, violin plots, pairplots, boxplots, and bar and pie charts of the categorical features' distributions.
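If you want a feel for what those three returned items could look like, the sketch below builds them with plain pandas. This is an assumption about the general idea, not the module's actual logic: the helper name split_features is hypothetical, and ClfAutoEDA may treat nulls and column types differently.

```python
import pandas as pd

def split_features(df):
    """Hypothetical helper: drop rows with nulls, then split columns by dtype."""
    # Remove every row that contains at least one missing value.
    cleaned = df.dropna().reset_index(drop=True)
    # Numeric dtypes go to one list, everything else to the other.
    num_features = cleaned.select_dtypes(include="number").columns.tolist()
    cat_features = cleaned.select_dtypes(exclude="number").columns.tolist()
    return cleaned, num_features, cat_features

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", None],
})
cleaned, num_features, cat_features = split_features(toy)
print(cleaned.shape, num_features, cat_features)
```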

Run the program for iris data

We will set the values of EDA function parameters and then run the program.

# Setting parameter values
target_variable_name = 'species'
labels = ['F1', 'F2', 'F3']
# Calling the EDA function with parameters of choice
df_processed, num_features, cat_features = EDA(
    df, labels, target_variable_name,
    data_summary_figsize=(6, 6),
    corr_matrix_figsize=(6, 6),
    corr_matrix_annot=True,
    pairplt=True)

Voila! With barely 4–5 lines of code, it automatically gives you the following data description and plots.

The data looks like this: 
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

The shape of data is: (150, 5)

The missing values in data are:
species 7
petal_width 0
petal_length 0
sepal_width 0
sepal_length 0
dtype: int64

The summary of data is:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

Some useful data information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sepal_length 150 non-null float64
1 sepal_width 150 non-null float64
2 petal_length 150 non-null float64
3 petal_width 150 non-null float64
4 species 143 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None

The columns in data are:
['sepal_length' 'sepal_width' 'petal_length' 'petal_width' 'species']

The target variable is divided into:
0 50
1 45
2 48
Name: species, dtype: int64

The numerical features are:
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

The categorical features are:
[]

Execution Time for EDA: 0.11 minutes
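The text part of the report above (shape, missing values sorted from most to least, and the split of the target variable) can be approximated with a few pandas calls. The sketch below is a rough illustration under my own assumptions; the helper name text_summary and the toy data are hypothetical, and this is not the source of ClfAutoEDA.

```python
import pandas as pd

def text_summary(df, target_variable_name):
    """Hypothetical helper: collect a few of the printed statistics in a dict."""
    # Missing values per column, largest count first
    missing = df.isnull().sum().sort_values(ascending=False)
    # Rows per target label (value_counts ignores missing labels by default)
    target_counts = df[target_variable_name].value_counts()
    return {"shape": df.shape, "missing": missing, "target_counts": target_counts}

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "petal_length": [1.4, 4.7, 5.1, 1.3],
    "species": ["setosa", "versicolor", None, "setosa"],
})
summary = text_summary(toy, "species")
print("The shape of data is:", summary["shape"])
print("The missing values in data are:")
print(summary["missing"])
print("The target variable is divided into:")
print(summary["target_counts"])
```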

Run the program for titanic data

from ClfAutoEDA import *
df = pd.read_csv('titanic_train.csv')
# Dropping Id related columns
df.drop(['PassengerId', 'Ticket'], axis=1, inplace=True)
# Setting parameter values
labels = ["not survived", "survived"]
target_variable_name = 'Survived'
# Calling the EDA function with parameters of choice
df_processed, num_features, cat_features = EDA(
    df, labels, target_variable_name,
    data_summary_figsize=(6, 6),
    corr_matrix_figsize=(6, 6),
    corr_matrix_annot=True,
    pairplt=True)

Let us see how the EDA looks for the Titanic dataset.

The data looks like this: 
Survived Pclass Name \
0 0 3 Braund, Mr. Owen Harris
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 1 3 Heikkinen, Miss. Laina
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 0 3 Allen, Mr. William Henry

Sex Age SibSp Parch Fare Cabin Embarked
0 male 22.0 1 0 7.2500 NaN S
1 female 38.0 1 0 71.2833 C85 C
2 female 26.0 0 0 7.9250 NaN S
3 female 35.0 1 0 53.1000 C123 S
4 male 35.0 0 0 8.0500 NaN S

The shape of data is: (891, 10)

The missing values in data are:
Cabin 687
Age 177
Embarked 2
Fare 0
Parch 0
SibSp 0
Sex 0
Name 0
Pclass 0
Survived 0
dtype: int64

The summary of data is:
Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Some useful data information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Fare 891 non-null float64
8 Cabin 204 non-null object
9 Embarked 889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB
None

The columns in data are:
['Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Fare' 'Cabin'
'Embarked']

The target variable is divided into:
0 424
1 288
Name: Survived, dtype: int64

The numerical features are:
['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']

The categorical features are:
['Name', 'Sex', 'Embarked']

The categorical variable is divided into:
Karlsson, Mr. Nils August 1
Rosblom, Mr. Viktor Richard 1
Shutes, Miss. Elizabeth W 1
Sutton, Mr. Frederick 1
Silverthorne, Mr. Spencer Victor 1
..
Jansson, Mr. Carl Olof 1
Dean, Master. Bertram Vere 1
Greenberg, Mr. Samuel 1
Leinonen, Mr. Antti Gustaf 1
Homer, Mr. Harry ("Mr E Haven") 1
Name: Name, Length: 712, dtype: int64

The categorical variable Name has too many divisions to plot


The categorical variable is divided into:
male 453
female 259
Name: Sex, dtype: int64

The categorical variable is divided into:
S 554
C 130
Q 28
Name: Embarked, dtype: int64

Execution Time for EDA: 0.15 minutes
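Notice that Name was reported as having too many divisions to plot, while Sex and Embarked were not. One simple way to get that behaviour is to count the distinct values of each categorical column and skip the ones above a cut-off. The sketch below shows the idea; the helper name plottable_categoricals and the max_levels cut-off are my own assumptions, since the threshold used by ClfAutoEDA is not stated here.

```python
import pandas as pd

def plottable_categoricals(df, cat_features, max_levels=10):
    """Hypothetical helper: separate low-cardinality columns from high-cardinality ones."""
    plottable, skipped = [], []
    for col in cat_features:
        # nunique() counts distinct non-null values in the column
        if df[col].nunique() > max_levels:
            skipped.append(col)      # too many bars or pie slices to read
        else:
            plottable.append(col)
    return plottable, skipped

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
})
plottable, skipped = plottable_categoricals(toy, ["Name", "Sex"], max_levels=3)
for col in skipped:
    print(f"The categorical variable {col} has too many divisions to plot")
```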
