Exploratory Data Analysis (EDA) and Data Preprocessing: A Beginner’s Guide

Naresh Thakur
Artificial Intelligence in Plain English
5 min read · Apr 10, 2020

Exploratory Data Analysis (EDA) was promoted in the 1970s by John W. Tukey, a renowned American statistician. In data science, it is the first step towards solving a real-world problem. EDA done right is half the battle won, as it is the key to building high-performance data models.

Although EDA and Data Preprocessing are two distinct terms, they involve many overlapping subtasks. At times, they are even used interchangeably.

The picture below demonstrates how EDA and Data Preprocessing fit within a data science process.

Image Credits: Wikipedia

In this post, we will shed light on both EDA and the data preprocessing steps.

Exploratory Data Analysis (EDA)

Before diving deeper into the concept of EDA, ponder upon the following questions:

  • How do you make sure that you are ready to apply machine learning algorithms to a data set?
  • How do you pick the best algorithm for a data set?
  • How can you define and refine various feature variables that can potentially be used for data modeling?

EDA can help answer all such questions.

It is the process of summarizing, visualizing and getting deeply acquainted with the important traits of a data set. When you carry out EDA, domain knowledge (e.g. about the business or social impact category) can help a great deal in understanding the data and extracting insights from it.

EDA is extremely valuable in the data science pipeline as it allows you to get closer to the certainty that future results obtained from a model will be valid, accurately interpreted, and applicable to the desired contexts.

To achieve this level of certainty, here’s what you can do with EDA:

  • Understand how the raw data was collected
  • Get familiar with different characteristics of the data
  • Learn about the individual features and their mutual relationships (or lack thereof)
  • Check and validate the data for anomalies, outliers, missing values, human errors, etc.
  • Extract insights that weren’t so evident to business stakeholders but can provide useful information about the business
  • Discover hidden patterns in the data that allow for better comprehension of the business problem
  • Validate if the data has been generated in an expected manner

When EDA is complete, data scientists have a firm feature set at their disposal that can be used for data modeling.

Once the data is fully understood, data scientists generally need to go back to the data collection and data cleaning phases of the pipeline to transform the data set in line with the expected business outcomes.

Skills Required for EDA

Data scientists normally use one of the following data visualization libraries on a daily basis:

  • Python: Matplotlib, Seaborn, and Plotly
  • R: ggplot2

In order to perform quick and effective EDA, you should learn to use one of these data visualization libraries.
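
For instance, a first look at a data set often needs only a few lines of pandas and Seaborn. The sketch below assumes the data lives in a file called data.csv (a placeholder name):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data ("data.csv" is a placeholder file name)
df = pd.read_csv("data.csv")

# Summary statistics, data types, and missing-value counts
print(df.describe())
print(df.info())
print(df.isna().sum())

# Pairwise relationships between numerical features
sns.pairplot(df)
plt.show()

# Correlation heatmap of the numerical features
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True, cmap="coolwarm")
plt.show()
```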

Data Preprocessing Steps

Data preprocessing is highly recommended before you begin with the modeling phase. The steps involved are:

1. Get Rid of Duplicate Data-Points:

Identical data-points can repeat many times over if the training data is huge in size. Therefore, to prevent bias during modeling, it is important to remove duplicate data-points.
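
With pandas, for example, this can be sketched in a couple of lines (df and the file name are placeholders):

```python
import pandas as pd

df = pd.read_csv("data.csv")          # placeholder file name

# Count exact duplicate rows
print(df.duplicated().sum())

# Keep the first occurrence of each duplicate and drop the rest
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```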

2. Handling Highly Correlated Features:

Clustering and correlation plots can help find out if two features are strongly correlated or offer the same information.

As a general rule, if the correlation between the two features is higher than 99%, you can safely remove one of them.

The threshold (for correlation) percentage can be decided on the basis of the problem at hand.
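
One way to sketch this with pandas, using the 99% rule of thumb above (df is assumed to be a DataFrame that holds only numerical features):

```python
import numpy as np
import pandas as pd

# Absolute pairwise correlations between numerical features
corr = df.corr().abs()

# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above the chosen threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
df = df.drop(columns=to_drop)
```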

3. Handling Low-Variance Features:

You can remove a feature if its variance is too low.

Such a feature stays (nearly) constant across the data set and therefore cannot explain or influence variation in the target variable.
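
A minimal sketch with scikit-learn's VarianceThreshold (the 0.01 cut-off is only an example; X is a placeholder feature matrix):

```python
from sklearn.feature_selection import VarianceThreshold

# X is a placeholder for the numerical feature matrix
selector = VarianceThreshold(threshold=0.01)   # example cut-off, tune per problem
X_reduced = selector.fit_transform(X)

# Column indices of the features that were kept
kept = selector.get_support(indices=True)
```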

4. Handling Imbalanced Data:

In the case of imbalanced data sets, you can

  • Oversample the class with fewer data points (you can use SMOTE or create duplicate data points)
  • Undersample the class with more data points (you can remove a few similar data points)
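
As a sketch, the imbalanced-learn package offers SMOTE for oversampling and RandomUnderSampler for undersampling; the synthetic data set below is only for illustration:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic imbalanced data set (roughly 90% / 10%), just for illustration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Oversample the minority class with synthetic examples
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)

# Or undersample the majority class instead
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
```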

5. Handling Missing Values:

Once the data set has been loaded, there are different ways to handle its missing values.

  • High Percentage of Missing Values: You can drop a feature having more than 40–50% missing values.
  • Low Percentage of Missing Values: If the missing values for a feature are very low, you can drop the rows that contain missing values.
  • Imputation: Data is rarely complete; values can be missing for numerous reasons (not captured, captured but not available, etc.). In this scenario, you can continue with the analysis after estimating the missing values; this process is called imputation. You can impute missing values with the mean or median for a numerical feature and the mode for a categorical feature.
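
A minimal pandas sketch of these three strategies (the 40% cut-off and the column names are illustrative):

```python
import pandas as pd

# df is a placeholder DataFrame; column names below are hypothetical
missing_ratio = df.isna().mean()        # fraction of missing values per feature

# Drop features with a high share of missing values (e.g. above 40%)
df = df.drop(columns=missing_ratio[missing_ratio > 0.4].index)

# If only a few rows have gaps, you can drop those rows instead:
# df = df.dropna()

# Or impute: median for a numerical column, mode for a categorical one
df["age"] = df["age"].fillna(df["age"].median())       # "age" is a hypothetical column
df["city"] = df["city"].fillna(df["city"].mode()[0])   # "city" is a hypothetical column
```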

6. Encoding Categorical Features:

At times, some features are qualitative: their values are categories expressed as text.

Such categorical features need to be converted into numerical data, as most models are based on mathematical equations and calculations and take only numerical input.

You can use one-hot encoding (a binary indicator column per category) or label encoding if there aren’t too many categorical features. Otherwise, you may need a supervised encoding technique such as supervised-ratio or target encoding.
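
For example, one-hot and label encoding can be sketched with pandas and scikit-learn ("city" and "size" are hypothetical text columns):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# df is a placeholder DataFrame containing the columns below

# One-hot encoding: one binary indicator column per category
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# Label encoding: map each category to an integer code
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])
```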

7. Feature Scaling:

Scaling is a method deployed to standardize the range of features or independent variables.

Various features in a data set will vary in their scale.

Since features with larger ranges can otherwise dominate the rest, it is recommended to bring all of them onto the same scale.
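
With scikit-learn, standardization and min-max scaling can be sketched as follows (X is a placeholder feature matrix):

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# X is a placeholder for the numerical feature matrix
X_std = StandardScaler().fit_transform(X)       # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)      # every feature rescaled to [0, 1]
```

In practice, fit the scaler on the training set only and reuse it to transform the test set, so that no information leaks from the test data.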

8. Dimensionality Reduction:

This preprocessing step is important when you’re dealing with big data sets having hundreds or thousands of features.

You can use the Principal Component Analysis (PCA) technique here.

In this technique, the original features are linearly combined into a new, smaller set of features (principal components), reducing the size of the feature space while retaining as much information (variance) as possible.
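
A minimal scikit-learn sketch, keeping enough components to explain 95% of the variance (an arbitrary choice; X is a placeholder feature matrix):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Keep the smallest number of components that explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```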

9. Train and Test Sets:

Check that the train and test sets follow similar distributions; otherwise, the evaluation will not be meaningful.

As a general rule, 20% of the data set is allocated to the test set and the remaining 80% is allocated to the training set. You will train a machine learning model on the training set and test it on the test set to check how well it can predict.

Shuffle the data set so that the model is exposed to a varied mix of data points rather than long runs of similar ones.
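
With scikit-learn, an 80/20 shuffled split can be sketched as follows (X and y are placeholders for the features and the target):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,        # 20% test, 80% train
    shuffle=True,         # shuffle before splitting (the default)
    random_state=42,      # reproducible split
    # stratify=y,         # for classification, keeps class proportions similar
)
```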

Final Words

Do keep in mind that the data preprocessing steps outlined above are meant for tabular data sets; preprocessing for text or images is done differently.

Follow me on: LinkedIn. Twitter.

If you have any questions about this topic, please drop them in the comment section below. I will be glad to answer any questions or clear up doubts that you may have about EDA and data preprocessing.

Don’t forget to hit the ‘follow’ button to receive updates on my upcoming posts.
