Basics to know before you even start exploratory data analysis (EDA)

Amit Bhardwaj
Published in Analytics Vidhya
3 min read · Feb 19, 2022

Data enthusiasts just love EDA. I am sure people who have worked through lots of data now have their own pathways or templates, which save them a lot of time and help them reach conclusions quickly.

But for data aspirants who are just starting out, EDA can be exhausting if you do not keep repeating your questions in your mind so that you do not get lost in the data.

In this article, I am going to list a few things that can help guide you through EDA.

Broadly, EDA has the following objectives:

  1. Maximise Insights
  2. Uncover underlying structure
  3. Extract important variables
  4. Detect Anomalies
  5. Test underlying assumptions

Your objective may differ from the ones I have listed, but you should have one in mind before starting.

EDA FLOW CHART

Univariate Analysis

This means looking at one variable at a time. In this analysis, a five-point summary (minimum, first quartile, median, third quartile and maximum) is typically calculated.

Measures of central tendency: Mean, Median and Mode

Measure of Dispersion: Standard deviation, Variance

Measures of shape: Skewness (right-skewed or left-skewed) and Kurtosis (tailedness)
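
As a rough illustration of the measures listed above, here is a minimal sketch using pandas. The sample values and the name `values` are made up for this example and are not from any real dataset.

```python
import pandas as pd

# Hypothetical sample: one numeric column with a long right tail.
values = pd.Series([2, 3, 3, 4, 5, 6, 30], name="value")

summary = {
    # Central tendency
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),  # mode() can return more than one value
    # Dispersion (pandas uses the sample formulas, ddof=1, by default)
    "std": values.std(),
    "variance": values.var(),
    # Shape
    "skewness": values.skew(),  # > 0 suggests right skew, < 0 left skew
    "kurtosis": values.kurt(),  # excess kurtosis: about 0 for a normal distribution
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the single large value pulls the mean well above the median, which is one quick hint that the distribution is right-skewed.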

Bivariate Analysis

This means looking at the relationship between two variables. The thing to take care of is that, while analysing, we should compare means or proportions rather than absolute row counts.
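
To see why means and proportions matter more than raw counts, consider this small sketch. The DataFrame and its column names (`group`, `churned`, `spend`) are hypothetical and exist only to show the idea when group sizes differ.

```python
import pandas as pd

# Hypothetical data: group "A" has many more rows than group "B".
df = pd.DataFrame({
    "group":   ["A"] * 8 + ["B"] * 2,
    "churned": [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    "spend":   [10, 12, 11, 9, 10, 13, 12, 11, 30, 28],
})

# Absolute counts: "A" looks worse simply because it has more rows.
counts = df.groupby("group")["churned"].sum()

# Proportions: the share of churned rows within each group.
rates = df.groupby("group")["churned"].mean()

# Means of a continuous variable per group.
mean_spend = df.groupby("group")["spend"].mean()

print(counts, rates, mean_spend, sep="\n\n")
```

By count, group A has more churned rows (2 against 1), but by proportion group B churns twice as often (0.5 against 0.25).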

Types of variables:

Continuous variable: A continuous variable is a quantitative variable that is obtained by measuring and can take any value within a range. If your data measures height, weight, or time, then you have a continuous variable.

Categorical variables: Categorical variables contain a finite number of categories or distinct groups. Categorical data might not have a logical order. For example, categorical predictors include gender, material type, and payment method.
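
A first pass at separating the two types can be sketched with pandas as below. The column names are invented for illustration, and the split by dtype is only a starting assumption: a category stored as integer codes would be counted as numeric here, so it is worth checking the number of distinct values as well.

```python
import pandas as pd

# Hypothetical data with two measured columns and two categorical ones.
df = pd.DataFrame({
    "height_cm": [170.2, 165.0, 180.5],
    "weight_kg": [68.0, 59.5, 82.3],
    "gender": ["F", "M", "F"],
    "payment_method": ["card", "cash", "card"],
})

# Numeric dtypes are treated as continuous candidates, everything else as categorical.
continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Few distinct values in a column is a hint that it may really be categorical.
distinct_counts = df.nunique()

print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
print(distinct_counts)
```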

Univariate Visualisation

Now let’s see some code in action for step-by-step EDA:

The following code snippet will give us the five-point summary for all the continuous variables:

FIVE POINT SUMMARY
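
The original snippet is not reproduced here, so the following is only a sketch of one way to get a five-point summary with pandas. The DataFrame and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [21, 25, 30, 35, 60],
    "income": [20, 30, 40, 50, 200],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Keep only numeric columns, then take the five quantiles.
numeric = df.select_dtypes(include="number")
five_point = numeric.quantile([0, 0.25, 0.5, 0.75, 1.0])
five_point.index = ["min", "Q1", "median", "Q3", "max"]

print(five_point)
```

`df.describe()` gives the same five values along with the count, mean and standard deviation, so it is a common shortcut for this step.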

Visualising mean, median and mode

MEASURE OF CENTRAL TENDENCY
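
One possible way to draw this, assuming matplotlib is available, is a histogram with a vertical line for each measure. The sample values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical right-skewed sample.
values = pd.Series([2, 3, 3, 4, 5, 6, 30], name="value")

mean = values.mean()
median = values.median()
mode = values.mode().iloc[0]  # first mode if there are several

fig, ax = plt.subplots()
ax.hist(values, bins=10, color="lightgray", edgecolor="black")
ax.axvline(mean, color="red", linestyle="--", label=f"mean = {mean:.2f}")
ax.axvline(median, color="green", linestyle="-", label=f"median = {median:.2f}")
ax.axvline(mode, color="blue", linestyle=":", label=f"mode = {mode:.2f}")
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.legend()
```

When the three lines sit close together the distribution is roughly symmetric. A mean far to the right of the median, as in this sample, points to right skew.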

Bivariate Visualisation

CORRELATION BETWEEN VARIABLES
VISUALISATION OF CORRELATION
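
A sketch of both steps, computing the correlation matrix and drawing it as a heatmap, could look like this. It assumes seaborn is installed, and the data and column names are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data with three numeric columns.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 74],
    "hours_slept": [9, 8, 8, 6, 5],
})

# Pairwise Pearson correlation (the pandas default).
corr = df.corr()
print(corr.round(2))

# Heatmap with the coefficient printed in each cell.
fig, ax = plt.subplots()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the colour scale comparable across datasets.
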
JOINTPLOT
VIOLIN PLOT
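
A violin plot compares the distribution of a continuous variable across the levels of a categorical one. The sketch below uses seaborn's `violinplot`. The columns `payment_method` and `amount` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one categorical and one continuous variable.
df = pd.DataFrame({
    "payment_method": ["card"] * 5 + ["cash"] * 5,
    "amount": [20, 22, 25, 27, 60, 10, 12, 13, 15, 16],
})

# One violin per category; the width shows where values are concentrated.
fig, ax = plt.subplots()
sns.violinplot(data=df, x="payment_method", y="amount", ax=ax)
```
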
BOXPLOT FOR COMPARING MEANS
BOXPLOT
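
A box plot marks the median of each group by default, so to compare means as well, one option is matplotlib's `showmeans=True`, which adds a marker for the mean. This is a sketch on hypothetical data, not the original snippet.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one categorical and one continuous variable.
df = pd.DataFrame({
    "payment_method": ["card"] * 5 + ["cash"] * 5,
    "amount": [20, 22, 25, 27, 60, 10, 12, 13, 15, 16],
})

# One array of values per category, in a fixed order.
names = sorted(df["payment_method"].unique())
groups = [df.loc[df["payment_method"] == n, "amount"].to_numpy() for n in names]

fig, ax = plt.subplots()
ax.boxplot(groups, showmeans=True)  # line = median, marker = mean
ax.set_xticklabels(names)
ax.set_ylabel("amount")

# The same comparison as numbers.
group_stats = df.groupby("payment_method")["amount"].agg(["mean", "median"])
print(group_stats)
```

In the `card` group the outlier of 60 lifts the mean above the median, and the plot shows this as a mean marker sitting above the median line.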

Thanks!