Why Exploratory Data Analysis?

Published in The Startup, Oct 12, 2020

When I first started studying machine learning, I read blogs and watched videos, and everywhere I was told that EDA is one of the essential steps. That raised the question: why?

You’ll find out soon!

Imagine you are planning to buy a product. What do you do first?
You try to understand its features: which features the product has and which ones it needs to have to meet your needs.
Similarly, exploratory data analysis is the task an analyst performs to get familiar with the data.

Definition:

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue: it is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more natural process of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques. EDA is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call “statistical graphics”, but it is not identical to statistical graphics per se.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

Check whether the data is fit to use in machine learning algorithms.
Choose the most suitable algorithms for your data set.
Determine which features are potentially ideal for a machine learning algorithm.
Maximize insight into a data set.
Uncover the underlying structure.
Extract important variables (features).
Detect outliers and anomalies in the data set.
Determine optimal factor settings.

Exploratory Data Analysis explained using a sample data set:

As an example, I’ll take Haberman’s Survival data set, which is available on the UCI Machine Learning Repository, and try to draw as many insights from it as possible using EDA.

Before starting with EDA, I imported the necessary libraries (pandas, seaborn, NumPy, matplotlib).
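A minimal sketch of the loading step, assuming the data file is a header-less, comma-separated file with four columns. The column names (age, year, nodes, status) and the helper `load_haberman` are my own labels for illustration, and the rows below are made up only to show the call; with the real file you would pass its local path instead.

```python
import io

import pandas as pd

# Assumed column labels; the raw file is assumed to have no header row.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(source):
    """Read a header-less, comma-separated file into a labelled DataFrame."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# Made-up rows in the same four-column layout (NOT the real data).
sample = io.StringIO("30,64,1,1\n34,59,0,2\n38,69,21,2\n42,60,1,1\n")
df = load_haberman(sample)

print(df.head())    # first rows, to eyeball the values
print(df.dtypes)    # every column should be read as an integer
```

Naming the columns up front makes every later plot and group-by easier to read than working with positional indices.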

NOTE: I’ll mention the observations below each image.

The data set contains 306 data points.
The data set contains 4 columns: three features (age, year of operation, number of positive nodes) and the class label.

The data set is imbalanced.
There are two classes.
Class 1: the patient survived
Class 2: the patient did not survive
Class 1 has 225 data points and Class 2 has 81 data points.
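The imbalance can be quantified with `value_counts`. The sketch below rebuilds only the class column from the counts reported above (225 and 81), so it runs without the data file; on the full DataFrame, `df.shape` would give the number of rows and columns, and the same `value_counts` calls apply to the class column (here assumed to be named `status`).

```python
import pandas as pd

# Class column rebuilt from the counts reported above: 225 of class 1, 81 of class 2.
status = pd.Series([1] * 225 + [2] * 81, name="status")

counts = status.value_counts()                # absolute count per class
shares = status.value_counts(normalize=True)  # fraction of the data set per class

print(len(status))                 # 306
print(counts.to_dict())            # {1: 225, 2: 81}
print(shares.round(3).to_dict())   # {1: 0.735, 2: 0.265}
```

Roughly three out of four patients belong to class 1, so a model that always predicts "survived" would already be right about 73.5% of the time. That is the baseline any classifier on this data has to beat.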

Univariate analysis: PDF

Based on this graph we can’t conclude anything, because the two classes overlap each other tightly.

Based on this graph as well, we can’t conclude anything, because the two classes overlap each other tightly.

Based on this graph we can’t conclude anything with certainty, but:
Nodes 0–4: more than a 50% chance of surviving, but also a 10–15% chance of dying, so we can’t be sure.
Nodes 4–60: the chance of survival is significantly lower.
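One way to get the numbers behind such a per-class PDF plot is to histogram each class over shared bins and normalise the counts. This is a sketch, not necessarily how the plots above were produced: the helper `class_pdfs`, the column names, and the synthetic stand-in data are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def class_pdfs(df, feature, label="status", bins=10):
    """Return {class: (pdf, bin_edges)}; pdf is the fraction of points per bin."""
    # Shared edges so the per-class curves are directly comparable.
    edges = np.histogram_bin_edges(df[feature], bins=bins)
    result = {}
    for cls, group in df.groupby(label):
        counts, _ = np.histogram(group[feature], bins=edges)
        result[cls] = (counts / counts.sum(), edges)
    return result


# Synthetic stand-in data (NOT the real Haberman values): 45 + 15 rows.
rng = np.random.default_rng(0)
status = np.repeat([1, 2], [45, 15])
df = pd.DataFrame({"nodes": rng.integers(0, 53, size=status.size),
                   "status": status})

pdfs = class_pdfs(df, "nodes", bins=10)
for cls, (pdf, edges) in pdfs.items():
    print(cls, np.round(pdf, 2))
```

If the two arrays are nearly equal in every bin, the feature on its own cannot separate the classes, which is what "overlapping tightly" means in the observations above.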

Univariate analysis: CDF

In this data, no patient with more than 47 nodes survived.

Patients younger than 38 account for about 16–17% of the survivors.
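A CDF can be built from the same histogram counts by taking a running sum. The sketch below uses a hypothetical helper `empirical_cdf` and a handful of made-up node counts; it only illustrates how statements such as "x% of patients have at most n nodes" are read off the curve.

```python
import numpy as np


def empirical_cdf(values, bins=10):
    """Return (upper_bin_edges, cdf): cdf[i] is the share of values up to edge i."""
    counts, edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of points in each bin
    cdf = np.cumsum(pdf)          # running total of those fractions
    return edges[1:], cdf


# Made-up node counts for one class (NOT the real data).
nodes = np.array([0, 0, 1, 2, 3, 4, 7, 10, 22, 46])
upper_edges, cdf = empirical_cdf(nodes, bins=5)

for edge, share in zip(upper_edges, cdf):
    print(f"nodes <= {edge:.1f}: {share:.0%} of patients")
```

Plotting `cdf` against `upper_edges` with `plt.plot` for each class gives curves of the kind described above; the point where a class's curve reaches 1.0 is the largest value observed for that class.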

Univariate analysis: Boxplot

There is a 50% chance of surviving if the nodes are fewer than three,
there is a 15–20% chance of not surviving if the nodes are >3 and <5, and
if the nodes are >5, surviving is very rare.
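A box plot draws the quartiles of each class, so the same information can be checked numerically. The following is a small sketch with made-up values; the helper `quartiles_by_class` and the column names are assumptions for illustration.

```python
import pandas as pd


def quartiles_by_class(df, feature, label="status"):
    """Return the 25th, 50th and 75th percentiles of `feature` for each class."""
    return df.groupby(label)[feature].quantile([0.25, 0.5, 0.75]).unstack()


# Made-up values (NOT the real data): five patients in class 1, four in class 2.
df = pd.DataFrame({
    "nodes": [0, 1, 2, 3, 4, 2, 6, 10, 14],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 2],
})

q = quartiles_by_class(df, "nodes")
print(q)
```

Each row of the result corresponds to one box: the 0.5 column is the line inside the box and the 0.25 and 0.75 columns are its lower and upper edges. With seaborn, `sns.boxplot(data=df, x="status", y="nodes")` draws the same summary.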

Univariate analysis: Violin plots

If the nodes are between 0 and 5 and the age is less than 32, the patient survived.
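A violin plot combines the box plot's summary with a density estimate for each class. A minimal seaborn sketch, run on synthetic stand-in data rather than the real file; the column names and the `outcome` labels are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (NOT the real Haberman values): 45 + 15 rows.
rng = np.random.default_rng(0)
status = np.repeat([1, 2], [45, 15])
df = pd.DataFrame({
    "age": rng.integers(30, 84, size=status.size),
    "nodes": rng.integers(0, 53, size=status.size),
    "status": status,
})
# Readable class labels: 1 = survived, 2 = did not survive.
df["outcome"] = df["status"].map({1: "survived", 2: "did not survive"})

fig, ax = plt.subplots(figsize=(6, 4))
sns.violinplot(data=df, x="outcome", y="nodes", ax=ax)
ax.set_title("Nodes by outcome (synthetic data)")
```

In a notebook, drop the `matplotlib.use("Agg")` line and the figure renders inline. Swapping `y="nodes"` for `y="age"` gives the corresponding plot for age.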

Bi-variate analysis: Pair plots

As the image above shows, the two classes overlap very tightly, so pair plots are not useful for this data.

Conclusion:

The two classes overlap heavily.
If the nodes are in the range 0–4, the chances of survival are high.
In this data, no patient with more than 47 nodes survived.
The data set is imbalanced.
We can’t build a useful model based on this data alone.
This data is not sufficient; we need more insights to classify.

Lastly, I recommend that before jumping to ML algorithms, you perform EDA to understand the behaviour of the data and analyze whether the data is useful or not.

You can find the above Python notebook here. Try performing all the operations on a different data set to understand the concepts.

Do let me know in the comments if I missed something or mistakenly included wrong information.