Why Exploratory Data Analysis?

naveen singh
Published in The Startup
4 min read · Oct 12, 2020

When I first started studying machine learning, I read blogs and watched videos, and everywhere I learned that EDA is one of the essential steps. That raises the question: why?

You’ll find out soon!

Imagine you are planning to buy a product. What do you do first?
You try to understand its features: which features the product has and which features you need it to have.
Similarly, exploratory data analysis is the task of getting familiar with the data before you use it.

Definition:

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques, all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue. EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow, in favor of the more natural process of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques. EDA is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call “statistical graphics”, but it is not identical to statistical graphics per se.

What is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:

  1. Check that the data is fit to use in machine learning algorithms.
  2. Help you choose the most suitable algorithms for your data set.
  3. Determine which features are potentially ideal for a machine learning algorithm.
  4. Maximize insight into a data set.
  5. Uncover the underlying structure.
  6. Extract important variables (features).
  7. Detect outliers and anomalies in the data set.
  8. Determine optimal factor settings.

Exploratory Data Analysis explained using a sample data set:

I’ll take Haberman’s Survival Data Set, which is available on the UCI Machine Learning Repository, as an example and try to extract as many insights from it as possible using EDA.

Before starting with EDA, I imported the necessary libraries (pandas, seaborn, NumPy, matplotlib).

NOTE: I’ll list the observations below each image.

  • The data set contains 306 data points
  • The data set contains 4 columns: 3 features (age, year, nodes) and the class label
  • The data set is imbalanced
  • There are two classes
  • Class 1: the patient survived
  • Class 2: the patient did not survive
  • Class 1 has 225 data points and Class 2 has 81 data points
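
The observations above come from a few simple checks on the loaded table. Here is a minimal sketch of how they could be reproduced with pandas. The column names (`age`, `year`, `nodes`, `status`) and the helper names `load_haberman` and `summarize` are assumptions made for this sketch, because the raw UCI file has no header row. It is not necessarily the code from the original notebook.

```python
import pandas as pd

# Assumed column names: the raw UCI file ships without a header row.
COLUMNS = ["age", "year", "nodes", "status"]


def load_haberman(path):
    """Read the headerless, comma-separated Haberman file into a DataFrame."""
    return pd.read_csv(path, header=None, names=COLUMNS)


def summarize(df, label="status"):
    """Return the row count, column count, and per-class counts and shares."""
    counts = df[label].value_counts().sort_index()
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],  # features plus the class label
        "class_counts": counts.to_dict(),
        "class_share": (counts / len(df)).round(3).to_dict(),
    }
```

Calling `summarize(load_haberman("haberman.data"))` on the downloaded file (the path is a placeholder) should report the 306 rows and the 225/81 class split listed above, which is roughly a 74% to 26% imbalance.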

Univariate analysis: PDF

Histogram of Age
  • Based on this graph we can’t draw a conclusion, because the two classes overlap heavily
Histogram of Year
  • Based on this graph we can’t draw a conclusion either, because the two classes overlap heavily
Histogram of Node
  • Based on this graph we can’t conclude anything with certainty, but:
  • With 0–4 nodes there is a 50+% chance of surviving, but also a 10–15% chance of dying, so we can’t be sure
  • With 4–60 nodes the chance of survival is significantly lower
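
"Overlapping tightly" can also be expressed as a number. The sketch below, which assumes NumPy and uses the hypothetical helper names `class_densities` and `overlap`, histograms one feature per class on shared bin edges and measures how much area the two density curves have in common. It illustrates the idea behind the histograms above and is not the notebook's own code.

```python
import numpy as np


def class_densities(values, labels, bins=10):
    """Histogram one feature per class on shared bin edges.

    Returns (edges, {label: density}); each density integrates to 1.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    # Shared edges make the per-class histograms directly comparable.
    edges = np.histogram_bin_edges(values, bins=bins)
    densities = {}
    for label in np.unique(labels):
        hist, _ = np.histogram(values[labels == label], bins=edges, density=True)
        densities[label.item()] = hist
    return edges, densities


def overlap(p, q, edges):
    """Shared area under two densities: 0 means separable, 1 means identical."""
    return float(np.sum(np.minimum(p, q) * np.diff(edges)))
```

For example, with the assumed column names `age` and `status`, `edges, d = class_densities(df["age"], df["status"])` followed by `overlap(d[1], d[2], edges)` returns a value close to 1 when the classes are hard to tell apart on that feature. Passing each class's values to `matplotlib.pyplot.hist` with `density=True` and the same `bins` draws the equivalent picture.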

Univariate analysis: CDF

PDF vs CDF of Nodes
  • In this data set, no patient with more than 47 nodes survived
PDF vs CDF of ages
  • (16–17)% of patients whose age is less than 38 have a chance to survive
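
Thresholds like "more than 47 nodes" or "age less than 38" are read off the CDF curve. Here is a small sketch, assuming NumPy, of how a binned PDF and its CDF can be computed, plus a direct way to evaluate the empirical CDF at one threshold. The function names `pdf_and_cdf` and `fraction_at_most` are made up for illustration, and "PDF" here means the fraction of samples per bin.

```python
import numpy as np


def pdf_and_cdf(values, bins=10):
    """Return (edges, pdf, cdf) for one feature.

    pdf[i] is the fraction of samples in bin i; cdf[i] is the fraction of
    samples at or below the right edge of bin i.
    """
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    pdf = counts / counts.sum()
    cdf = np.cumsum(pdf)
    return edges, pdf, cdf


def fraction_at_most(values, threshold):
    """Empirical CDF at one point: share of samples <= threshold."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values <= threshold))
```

Plotting `pdf` and `cdf` against `edges[1:]` for each class gives the kind of PDF vs CDF chart described above. `fraction_at_most` answers a single question exactly, for example what share of survivors has at most a given number of nodes.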

Univariate analysis: Box plot

Box plot of nodes
  • There is a 50% chance of surviving if there are fewer than three nodes
  • There is a 15–20% chance of not surviving if the number of nodes is >3 and <5
  • If the number of nodes is >5, survival is very rare
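
A box plot is a drawing of a handful of numbers, so it can help to print them instead of estimating them by eye. The sketch below assumes pandas and the column names used in this sketch series (`nodes`, `status`); `box_stats` is a hypothetical helper, and the 1.5 × IQR fences follow the usual box plot convention.

```python
import pandas as pd


def box_stats(df, feature, label="status"):
    """Quartiles, IQR, and 1.5*IQR fences of `feature` for each class."""
    q = df.groupby(label)[feature].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]
    # Tukey fences: points beyond these are drawn as outliers in a box plot.
    q["lower_fence"] = q["q1"] - 1.5 * q["iqr"]
    q["upper_fence"] = q["q3"] + 1.5 * q["iqr"]
    return q
```

`box_stats(df, "nodes")` returns one row per class, so the medians and quartiles of survivors and non-survivors can be compared side by side.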

Univariate analysis: Violin plots

Scatter plot of ages, nodes
  • If the number of nodes is between 0 and 5 and the age is less than 32, the patient survived
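
A rule read off a scatter plot is easy to verify by filtering the table. This is a minimal sketch, assuming pandas, the assumed column names `age`, `nodes`, and `status`, and class 1 meaning "survived" as described earlier; `survival_rate` is a hypothetical helper name.

```python
import pandas as pd


def survival_rate(df: pd.DataFrame, mask: pd.Series, label="status", survived=1):
    """Return (share of survivors, number of rows) among rows where mask is True."""
    subset = df[mask]
    if subset.empty:
        # No rows match the rule, so there is no rate to report.
        return float("nan"), 0
    rate = (subset[label] == survived).mean()
    return float(rate), len(subset)
```

For the rule above, `survival_rate(df, (df["nodes"] <= 5) & (df["age"] < 32))` returns the share of survivors together with how many patients the rule covers. The second number matters: a rule that holds for only a handful of rows is weak evidence.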

Bivariate analysis: pair plots

  • As the image above shows, the classes overlap heavily in every pair of features, so pair plots are not useful for separating the classes in this data.

Conclusion:

  • The two classes overlap heavily on every feature
  • If the number of nodes is in the range 0–4, the chance of survival is high
  • In this data set, no patient with more than 47 nodes survived
  • The data set is imbalanced
  • We can’t build a useful model from this data alone
  • We need more information to classify patients reliably

Lastly, I recommend that before jumping to ML algorithms you perform EDA to understand the behaviour of the data and to judge whether the data is useful or not.

You can find the above Python notebook here, run all the operations on a different data set, and practise all the concepts.

Do let me know in the comments if I missed something or got any information wrong.
