Introduction to Exploratory Data Analysis

asha gaire
Published in Backprop Lab
3 min read · Apr 24, 2020

Understand the concept of EDA in simple words.

Photo by Isaac Smith on Unsplash

While learning data science, you will encounter Exploratory Data Analysis (EDA), and you will often practice it without being fully aware of it. As its name suggests, EDA is the technique of manipulating datasets, exploring them in the right way, understanding them, representing what you learn, and modeling and implementing algorithms in order to understand the behavior of the data.

This definition makes EDA and data mining/analysis sound like the same thing. They are not: EDA is one step within the larger process. Here is the list of tasks you carry out during data analysis/mining:

  1. Data requirements
  2. Data collection
  3. Data processing
  4. Data cleaning
  5. EDA
  6. Modeling and algorithms
  7. Data product
  8. Communication

Step 5, EDA, is where we prepare the data for the actual analysis. We select the relevant data from the overall dataset, delete irrelevant data, transform the data, and divide it into the chunks required for analysis. Then we work on the most crucial step, descriptive statistics: we summarize the data and look for hidden correlations and relationships within it. We can then visualize these correlation and regression relationships using different visualization tools. Next, we deepen our understanding of the data's behavior by developing predictive models, evaluating them, and calculating their accuracy. These activities map onto the four stages of EDA:

  1. Problem definition
  2. Data preparation
  3. Data analysis
  4. Development and representation of the results
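As a rough illustration of the data preparation and data analysis stages, here is a minimal sketch using pandas. The dataset and its column names (`hours_studied`, `exam_score`, `student_id`) are made up for this example and do not come from any real source.

```python
import pandas as pd

# Hypothetical toy dataset; the values and column names are assumptions.
raw = pd.DataFrame({
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0, None],
    "exam_score": [52.0, 58.0, 65.0, 70.0, 78.0, 80.0],
    "student_id": ["a1", "a2", "a3", "a4", "a5", "a6"],
})

# Data preparation: keep only the relevant columns and drop incomplete rows.
prepared = raw[["hours_studied", "exam_score"]].dropna()

# Data analysis: descriptive statistics and pairwise correlation.
summary = prepared.describe()
correlation = prepared.corr()

print(summary)
print(correlation)
```

The `describe()` table summarizes each column (count, mean, spread, quartiles), while `corr()` shows how strongly the two columns move together.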

While working on EDA we must recognize properties of our dataset such as data types and measurement scales, because they affect the conclusions we can draw from the analysis. A dataset contains many observations about a particular object, described by variables (columns). These variables may hold numerical or categorical data. Numerical data can in turn be discrete or continuous. The measurement scale can be nominal, ordinal, interval, or ratio.
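To make these distinctions concrete, the sketch below encodes one column per measurement scale in a pandas DataFrame. The columns and values are invented for illustration; the point is only that a nominal column is stored as an unordered category, an ordinal column as an ordered category, and numerical columns as floats (continuous) or integers (discrete).

```python
import pandas as pd

# Hypothetical patient records; every column name and value is an assumption.
df = pd.DataFrame({
    "blood_type": ["A", "O", "B", "O"],                  # nominal scale
    "satisfaction": ["low", "high", "medium", "high"],   # ordinal scale
    "temperature_c": [36.6, 37.1, 38.0, 36.9],           # interval scale, continuous
    "visits": [1, 3, 0, 2],                              # ratio scale, discrete
})

# Nominal data: categories with no inherent order.
df["blood_type"] = df["blood_type"].astype("category")

# Ordinal data: categories with a meaningful order.
df["satisfaction"] = pd.Categorical(
    df["satisfaction"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(df.dtypes)
# Ordered categories support comparisons such as min and max.
print(df["satisfaction"].max())
```

Declaring the order explicitly matters: without it, pandas treats the labels as unordered and refuses to compute a maximum.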

We generally use Python as the main tool for data analysis. Python has consistently ranked among the top 10 programming languages and is widely adopted by data science experts for data analysis and data mining. Some of the libraries used for EDA are listed below, with the topics to learn for each.

NumPy:

  • Create arrays with NumPy, copy arrays, and divide arrays
  • Perform different operations on NumPy arrays
  • Understand array selections, advanced indexing, and expanding
  • Work with multi-dimensional arrays
  • Linear algebraic functions and built-in NumPy functions
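The NumPy topics above can be sketched in a few lines. This is only a small tour with arbitrary example values, not a complete reference.

```python
import numpy as np

# Create an array, copy it, and divide (split) it.
a = np.arange(1, 7)              # values 1 through 6
b = a.copy()                     # independent copy; changing b leaves a untouched
b[0] = 100
first, second = np.split(a, 2)   # two equal halves

# Operations on arrays and multi-dimensional arrays.
m = a.reshape(2, 3)              # 2 rows, 3 columns
doubled = m * 2                  # element-wise arithmetic

# Selections, advanced indexing, and expanding.
evens = a[a % 2 == 0]                  # boolean mask selection
picked = m[[0, 1], [2, 0]]             # elements at (0, 2) and (1, 0)
expanded = np.expand_dims(a, axis=0)   # adds a leading axis

# Linear algebra with built-in functions.
sq = np.array([[2.0, 0.0], [0.0, 4.0]])
inv = np.linalg.inv(sq)
det = np.linalg.det(sq)

print(first, second)
print(doubled)
print(evens, picked, expanded.shape)
print(inv, det)
```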

Pandas:

  • Understand and create DataFrame objects
  • Subsetting data and indexing data
  • Arithmetic functions, and mapping with pandas
  • Managing index
  • Building style for visual analysis
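Here is a minimal sketch of most of the pandas topics above, using an invented table of cities (the names and numbers are assumptions). Styling is left out because pandas' `DataFrame.style` relies on the optional jinja2 dependency.

```python
import pandas as pd

# Create a DataFrame from hypothetical data.
df = pd.DataFrame({
    "city": ["Alpha", "Beta", "Gamma"],
    "population_k": [1200, 450, 800],   # population in thousands
    "area_km2": [300, 150, 200],
})

# Managing the index: use the city name as the row label.
indexed = df.set_index("city")

# Subsetting and indexing.
large = indexed[indexed["population_k"] > 500]   # rows matching a condition
beta_area = indexed.loc["Beta", "area_km2"]      # a single cell by label

# Arithmetic on columns and mapping values.
indexed["density"] = indexed["population_k"] * 1000 / indexed["area_km2"]
indexed["size"] = indexed["population_k"].map(
    lambda p: "large" if p > 500 else "small"
)

# Move the index back into a regular column.
restored = indexed.reset_index()

print(large)
print(beta_area)
print(restored)
```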

Matplotlib:

  • Loading linear datasets
  • Adjusting axes, grids, labels, titles, and legends
  • Saving plots
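The Matplotlib topics above might look like the sketch below. The data is a made-up linear relationship, and the output file name `linear_plot.png` is an arbitrary choice; the `Agg` backend is selected only so the script also runs on machines without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove to view plots on screen
import matplotlib.pyplot as plt

# A small linear dataset (hypothetical values).
x = [0, 1, 2, 3, 4]
y = [2 * v + 1 for v in x]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = 2x + 1")

# Adjust axes, grid, labels, title, and legend.
ax.set_xlim(0, 4)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A simple linear relationship")
ax.grid(True)
ax.legend()

# Save the plot to a temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "linear_plot.png")
fig.savefig(out_path)
plt.close(fig)

print(out_path)
```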

SciPy:

  • Importing the package
  • Using statistical packages from SciPy
  • Performing descriptive statistics
  • Inference and data analysis
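As a rough sketch of the SciPy topics above, the example below computes descriptive statistics for one sample and then runs a simple inference step: Welch's two-sample t-test, picked here only as one common choice. Both samples are invented measurements.

```python
import numpy as np
from scipy import stats

# Two hypothetical samples of measurements.
group_a = np.array([5.1, 4.9, 5.0, 5.2, 4.8, 5.0])
group_b = np.array([5.6, 5.8, 5.5, 5.7, 5.9, 5.6])

# Descriptive statistics: size, min/max, mean, variance, skewness, kurtosis.
desc = stats.describe(group_a)
print(desc)

# Inference: Welch's t-test, which does not assume equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)
```

A small p-value suggests the two group means differ by more than chance alone would explain, although with samples this small the result should be read with caution.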

This article is strongly inspired by the book Hands-On Exploratory Data Analysis with Python [1]. The book takes you from an introduction to EDA through to implementing machine learning models on large datasets.

[1]: Suresh Kumar Mukhiya and Usman Ahmed, Hands-On Exploratory Data Analysis with Python, Packt Publishing, 2020.

asha gaire
Backprop Lab

Practicing Data Science, AI Enthusiastic, Forthcoming ML Engineer