Introduction to Exploratory Data Analysis

Published in

Backprop Lab

3 min read · Apr 24, 2020

Understand the concept of EDA in simple terms.

While learning data science, you will encounter Exploratory Data Analysis (EDA), and you will often practice it without being fully aware that you are doing so. As its name suggests, EDA is the practice of manipulating datasets, exploring them in the right way, understanding them, representing that understanding, and modeling and implementing algorithms in order to understand the behavior of the data.

This definition makes EDA sound like the same thing as data analysis or data mining. It is in fact one step within that process, so let me first list the tasks you carry out during data analysis/mining.

1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. EDA
6. Modeling and algorithm
7. Data product
8. Communication

In step 5, which is EDA, we prepare the data for the actual analysis. We select the relevant data from the overall dataset, delete the irrelevant data, transform the data, and divide it into the chunks required for analysis. Then we work on the most crucial step, which deals with descriptive statistics: we summarize the data and find the hidden correlations and relationships within it. We can then visualize these correlations and regression relationships using different visualization tools. Next, we deepen our understanding of the data's behavior by developing predictive models, evaluating them, and calculating their accuracy. Here are the four stages of EDA that I just described:

Problem definition
Data preparation
Data analysis
Development and representation of the results
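The data preparation and data analysis stages can be sketched in a few lines of pandas. This is only a minimal illustration: the toy values and the column names (`height_cm`, `weight_kg`, `record_id`) are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are invented purely for illustration.
raw = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 69.0, 80.0, 92.0],
    "record_id": ["a1", "a2", "a3", "a4", "a5"],  # not relevant to the analysis
})

# Data preparation: drop the irrelevant column and derive a transformed one.
data = raw.drop(columns=["record_id"])
data["height_m"] = data["height_cm"] / 100

# Data analysis: descriptive statistics and pairwise correlations.
summary = data.describe()
correlations = data.corr()

print(summary)
print(correlations)
```

`describe()` reports the count, mean, standard deviation, and quartiles of each numeric column, while `corr()` gives the pairwise Pearson correlation, which is often the first place hidden relationships show up.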

While working on EDA, we must recognize certain facts about our dataset, such as data types and measurement scales, because they affect the conclusions we draw after analyzing it. A dataset contains many observations (rows) about a particular object, and each observation is described by variables (columns). These variables may hold numerical or categorical data, and numerical data can be either discrete or continuous. The measurement scale can be nominal, ordinal, interval, or ratio.
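As a rough sketch of how these distinctions look in practice, the snippet below builds a small pandas DataFrame with one column per kind of data. The patient-style columns and their values are hypothetical, chosen only to show each data type and scale.

```python
import pandas as pd

# Hypothetical columns, one per kind of data discussed above.
df = pd.DataFrame({
    "blood_type": ["A", "B", "O", "A"],              # categorical, nominal scale
    "pain_level": ["low", "high", "medium", "low"],  # categorical, ordinal scale
    "visits": [1, 3, 2, 5],                          # numerical, discrete
    "temperature_c": [36.6, 38.2, 37.1, 36.9],       # numerical, continuous, interval scale
    "weight_kg": [60.5, 82.0, 71.3, 55.8],           # numerical, continuous, ratio scale
})

# Nominal data: categories with no inherent order.
df["blood_type"] = df["blood_type"].astype("category")

# Ordinal data: categories with a meaningful order.
df["pain_level"] = pd.Categorical(
    df["pain_level"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(df["pain_level"].max())  # the order makes min/max meaningful
```

Declaring the order explicitly matters: comparisons such as maximum or sorting only make sense for ordinal data, not for nominal data like blood type.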

We generally use Python as the main tool for data analysis. Python has been consistently ranked among the top 10 programming languages and is widely adopted for data analysis and data mining by data science experts. Some of the libraries used for EDA, along with the topics worth learning in each, are listed below.

NumPy:

Create arrays with NumPy, copy arrays, and divide arrays
Perform different operations on NumPy arrays
Understand array selections, advanced indexing, and expanding
Working with multi-dimensional arrays
Linear algebraic functions and built-in NumPy functions
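A short sketch touching each of the NumPy topics above; the array contents are arbitrary and serve only to illustrate the operations.

```python
import numpy as np

# Create a one-dimensional array and reshape it into a 3x4 matrix.
a = np.arange(12)
m = a.reshape(3, 4)

# Copy an array: changes to the copy leave the original untouched.
c = m.copy()
c[0, 0] = 99

# Divide an array into three equal parts.
parts = np.array_split(a, 3)

# Selection and advanced (boolean) indexing.
second_column = m[:, 1]
evens = a[a % 2 == 0]

# Expanding: add a new axis to turn shape (12,) into (12, 1).
expanded = a[:, np.newaxis]

# Built-in and linear algebra functions.
column_sums = m.sum(axis=0)
square = np.array([[2.0, 1.0], [1.0, 3.0]])
inverse = np.linalg.inv(square)

print(column_sums)
print(square @ inverse)
```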

Pandas:

Understand and create DataFrame objects
Subsetting data and indexing data
Arithmetic functions, and mapping with pandas
Managing index
Building style for visual analysis
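The pandas topics above might look like the following in code. The city names and temperatures are made-up sample values, and styling is left out to keep the sketch short.

```python
import pandas as pd

# Create a DataFrame from a dictionary (hypothetical sample values).
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Cairo"],
    "temp_c": [4.0, 19.0, 28.0],
})

# Managing the index: use the city name as the row label.
df = df.set_index("city")

# Subsetting and indexing: boolean filter, label-based, and position-based.
warm = df[df["temp_c"] > 10]
lima_temp = df.loc["Lima", "temp_c"]
first_row = df.iloc[0]

# Arithmetic on a whole column, and element-wise mapping.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32
df["label"] = df["temp_c"].map(lambda t: "warm" if t > 10 else "cold")

# Restore the default integer index.
df = df.reset_index()

print(df)
```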

Matplotlib:

Loading linear datasets
Adjusting axes, grids, labels, titles, and legends
Saving plots
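A minimal Matplotlib sketch of these steps, assuming a simple generated line (y = 2x + 1) stands in for a linear dataset. The output file name is arbitrary, and the non-interactive `Agg` backend is selected so the script also runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display is needed
import matplotlib.pyplot as plt

# A simple linear dataset generated for illustration.
x = list(range(10))
y = [2 * v + 1 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, label="y = 2x + 1")

# Adjust axes, grid, labels, title, and legend.
ax.set_xlim(0, 9)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple line plot")
ax.grid(True)
ax.legend()

# Save the plot to a temporary location (file name chosen arbitrarily).
out_path = os.path.join(tempfile.gettempdir(), "eda_line_plot.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```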

SciPy:

Importing the package
Using statistical packages from SciPy
Performing descriptive statistics
Inference and data analysis
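The SciPy topics can be illustrated with `scipy.stats`. The sample values and the hypothesized mean of 2.0 below are invented for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of measurements.
sample = np.array([2.1, 2.5, 2.2, 2.8, 3.0, 2.4, 2.6])

# Descriptive statistics: size, min/max, mean, variance, skewness, kurtosis.
desc = stats.describe(sample)

# Inference: one-sample t-test of whether the mean differs from 2.0.
result = stats.ttest_1samp(sample, popmean=2.0)

print(desc)
print(result.statistic, result.pvalue)
```

A small p-value suggests that the sample mean is unlikely to equal the hypothesized value, which is the kind of inference step that follows descriptive statistics in EDA.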

This article is heavily inspired by the book Hands-On Exploratory Data Analysis with Python [1]. The book takes you from an introduction to EDA through to implementing machine learning models on large datasets.

[1]: Suresh Kumar Mukhiya and Usman Ahmed, Hands-On Exploratory Data Analysis with Python, Packt Publishing, 2020.

Written by asha gaire