What is Exploratory Data Analysis (EDA)?

Lala Ibadullayeva
3 min read · Apr 27, 2022


If you work with data, you have almost certainly encountered EDA at least once, because this process gives us direction and saves us time.

Data does not always come to us complete and self-explanatory. Before we can operate on it, we first have to get to know it.

EDA is a process that conveys information to us, usually visually, by summarizing the main characteristics of a dataset. We see this process mostly in machine learning.

Whenever we analyze data, the first step we take is EDA.

In outline, the EDA process is as follows (many sources describe it in rather lofty terms; I will try to keep this article as simple as I can).
Before anything else, make sure you have a clear purpose.
1. Get to know your data: identify the variables.
2. Identify the structure: univariate analysis, column by column.
3. Extract the important variables: bivariate analysis of relationships.
4. Detect anomalies: missing (null) values.
5. Test your assumptions.

EDA focuses on two main kinds of analysis.
+ Univariate: We look at each variable on its own, one at a time.

  • Central tendency: mean, median and mode.
  • Dispersion: standard deviation and variance.
  • Asymmetry (skewness): right-skewed or left-skewed.
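
As a small illustration of these univariate measures, here is a minimal sketch with pandas. The `ages` values are made up for this example and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
ages = pd.Series([22, 25, 25, 28, 30, 31, 60], name="age")

# Central tendency
mean_age = ages.mean()
median_age = ages.median()
mode_age = ages.mode().iloc[0]  # mode() returns a Series because ties are possible

# Dispersion (pandas uses the sample formulas, ddof=1, by default)
std_age = ages.std()
var_age = ages.var()

# Skewness: positive means a longer right tail, negative a longer left tail
skew_age = ages.skew()

print(f"mean={mean_age:.2f}, median={median_age}, mode={mode_age}")
print(f"std={std_age:.2f}, var={var_age:.2f}, skew={skew_age:.2f}")
```

Here the single large value (60) pulls the mean above the median, which is the typical sign of a right-skewed variable.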

+ Bivariate: We look at the relationship between two different variables.
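
A bivariate check can be as simple as a correlation between two numeric columns, or a comparison of one column's mean across groups. The sketch below assumes a small made-up table; the column names `hours`, `score` and `group` are hypothetical.

```python
import pandas as pd

# Hypothetical data: hours studied and exam score for six students.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 70, 74, 83],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Two numeric variables: Pearson correlation, a value between -1 and 1
corr = df["hours"].corr(df["score"])

# One grouping variable and one numeric variable: compare the group means
mean_by_group = df.groupby("group")["score"].mean()

print(f"correlation between hours and score: {corr:.3f}")
print(mean_by_group)
```

A correlation close to 1 or -1 suggests a strong linear relationship, but it is only a starting point; a scatter plot of the same two columns is usually worth a look as well.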

There are two main types of variables:
- A continuous variable is a quantitative variable that can take any value within a range (for example, height or temperature).
- A categorical variable takes one of a finite number of categories or groups (for example, a city or a blood type).
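
In pandas, a rough first split between the two types can be made from the column dtypes. This is only a heuristic sketch on made-up data: a numeric dtype does not guarantee a continuous variable (integer codes can be categories), so the result should be checked by eye.

```python
import pandas as pd

# Hypothetical data with one numeric and one text column.
df = pd.DataFrame({
    "height_cm": [162.5, 175.0, 180.2, 158.9],
    "city": ["Baku", "Ganja", "Baku", "Sumqayit"],
})

# Numeric columns are candidates for continuous variables
continuous_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# For a categorical column, the frequency of each category is the basic summary
city_counts = df["city"].value_counts()

print("continuous:", continuous_cols)
print("categorical:", categorical_cols)
print(city_counts)
```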

We can understand the EDA process more easily by working on a dataset.
You can follow along step by step by downloading any CSV file from Kaggle.
First, load the data into Google Colab, a Jupyter notebook or VS Code.
Then import the Python libraries you need.

In this case, you will need:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

After importing our libraries, we read our data:

data = pd.read_csv("your data path or name")

Then, to get a general feel for the data, we look at its first 5 and last 5 rows.

data.head()
data.tail()

Then we summarize our data with descriptive statistics.

data.describe()

For more information about our data, such as column types and non-null counts, we use this function:

data.info()

To see the shape of our data (the number of rows and columns), we use data.shape. Note that shape is an attribute, not a method, so it is written without parentheses.

Now that we know something about our data, let's check for null values:

data.isnull().sum()

Then we count the duplicate rows.

data.duplicated().sum()

Now it is time to get rid of these two problems, duplicates and null values, which slow down our work with the data.

# dropping duplicates:
data.drop_duplicates(inplace=True)

and filling in the null values:

# placeholders: use your own column name and the value to fill it with
data.fillna({"column name": fill_value}, inplace=True)

Then check for missing values again by calling the same function:

data.isnull().sum()
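
To see the cleaning steps above from start to finish, here is a minimal sketch on a tiny made-up table. The column names and the fill values (the median age, the label "unknown") are assumptions for illustration; the right choice depends on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one duplicated row and two missing values.
data = pd.DataFrame({
    "name": ["Ali", "Leyla", "Leyla", "Murad", "Nigar"],
    "age": [30, 25, 25, np.nan, 41],
    "city": ["Baku", "Ganja", "Ganja", "Baku", None],
})

# Count the problems before cleaning
n_duplicates = data.duplicated().sum()
nulls_before = data.isnull().sum()

# Remove exact duplicate rows and renumber the index
data = data.drop_duplicates().reset_index(drop=True)

# Fill each column with a value that suits it (assumed choices)
data = data.fillna({"age": data["age"].median(), "city": "unknown"})

# Check again: every count should now be zero
nulls_after = data.isnull().sum()

print("duplicate rows found:", n_duplicates)
print(nulls_before)
print(nulls_after)
```

Assigning the result back to `data` instead of using inplace=True does the same job here and makes each step easier to follow.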

Now that we have worked on our data to some extent, we can visualize it in various ways with the help of the matplotlib and seaborn libraries.
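
As a minimal sketch of such a visualization, the example below draws a histogram (univariate) and a scatter plot (bivariate) with matplotlib only. The data, the column names and the output file name `eda_plots.png` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: hours studied and exam score for eight students.
data = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [50, 54, 61, 63, 70, 76, 79, 88],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Univariate view: distribution of a single variable
ax_hist.hist(data["score"], bins=5)
ax_hist.set_title("Distribution of score")
ax_hist.set_xlabel("score")

# Bivariate view: relationship between two variables
ax_scatter.scatter(data["hours"], data["score"])
ax_scatter.set_title("score vs. hours")
ax_scatter.set_xlabel("hours")
ax_scatter.set_ylabel("score")

fig.tight_layout()
fig.savefig("eda_plots.png")  # in a notebook, plt.show() displays the figure instead
```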

In conclusion, this process may seem difficult to you at first. However, over time you will feel that the EDA process is the best gift for those of us who work with data.

To understand the EDA process more easily, you can check my GitHub account.


Lala Ibadullayeva

Programming Engineer @ CSB Lab - IMBB / PhD candidate of Molecular Biology / Google WTM Ambassador