Data Preprocessing — The first step in Data Science

Bhartendu Dubey
Published in Analytics Vidhya · Jan 2, 2020 · 5 min read

We all know that the volume of data around us is increasing rapidly, and so is the probability of anomalies in that data. Such anomalies can arise for many reasons, such as data originating from heterogeneous sources or simply its enormous size. Given that high-quality data leads to better models and predictions, data preprocessing has become a vital, fundamental step in data science and machine learning.

The quality of the data being gathered can be assessed by three main factors:

  1. Accuracy: erroneous values that deviate from the expected.
  2. Completeness: missing attribute/feature values.
  3. Consistency: aggregated data that does not agree across sources.

Some reasons that affect the factors mentioned above are:

  • Human/computer errors which might occur during data entry and transmission.
  • Users submitting incorrect values.
  • Incorrect formats for input fields.
  • Unavailability of data.

To ensure the high quality of data, it’s crucial to preprocess it.

Data preprocessing is divided into four stages:

  • Data cleaning
  • Data integration
  • Data reduction
  • Data transformation

Data Cleaning

Data cleaning can be explained as a process to ‘clean’ data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data.

-> Handling Missing values

To deal with missing data, we can use several approaches:

  1. Removing the training example: The easiest and simplest way is to ignore the training example whose output label is missing (in a classification problem). This is usually discouraged, as it leads to loss of data.
  2. Filling in the missing value manually: Another way is to enter the missing values into the system by hand, but this is time-consuming and not recommended for huge data sets.
  3. Using a standard value to replace the missing value: The missing value can be replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a very basic approach, usually used to prevent data loss while still accounting for the missing values.
  4. Using a measure of central tendency (mean, median, mode): Depending on the data distribution, the mean (for a roughly normal distribution) or the median (for a skewed distribution) can be used to fill in the missing value; a pandas sketch follows this list.
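
Below is a minimal pandas sketch of approaches 1, 3 and 4. The DataFrame and its column names are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with missing values; column names are illustrative only.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "salary": [50000, 62000, np.nan, 81000, 58000],
    "city":   ["Delhi", None, "Mumbai", "Pune", "Delhi"],
})

# 1. Drop training examples that contain any missing value (loses data).
dropped = df.dropna()

# 3. Replace missing categorical values with a global constant.
df["city"] = df["city"].fillna("Unknown")

# 4. Fill numeric attributes with a measure of central tendency:
#    mean for a roughly normal column, median for a skewed one.
df["age"] = df["age"].fillna(df["age"].mean())
df["salary"] = df["salary"].fillna(df["salary"].median())

print(df)
```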

-> Handling Noisy data

Noise in data is a random variance in a measured variable. In simple words, noisy data contains a large amount of additional meaningless information (the data may be corrupted or distorted).

To deal with these anomalous values, we use the data smoothing techniques described below:

  1. Binning: The continuous attribute is divided into discrete intervals called bins, and smoothing is then applied within each bin. There are several variants: in smoothing by bin means, each value in a bin is replaced by the bin’s mean; in smoothing by bin medians, each value is replaced by the bin’s median (see the sketch after this list).
  2. Regression: Linear regression and multiple linear regression can be used to smooth the data by fitting the values to a function.
  3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them accordingly.
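
Here is a small pandas sketch of smoothing by bin means and bin medians. The values are made up, and `pd.cut` gives equal-width bins (one could use `pd.qcut` for equal-frequency bins instead).

```python
import pandas as pd

# Made-up sorted measurements with some noise.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition the values into 3 equal-width bins.
bins = pd.cut(prices, bins=3)

# Smoothing by bin means: every value is replaced by the mean of its bin.
by_mean = prices.groupby(bins, observed=True).transform("mean")

# Smoothing by bin medians: every value is replaced by the median of its bin.
by_median = prices.groupby(bins, observed=True).transform("median")

print(pd.DataFrame({"raw": prices, "bin": bins,
                    "by_mean": by_mean, "by_median": by_median}))
```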

Data Integration

Since data is collected from multiple sources, data integration has become a vital part of the process. Combining sources can introduce redundant and inconsistent data, which hurts the accuracy and speed of the resulting model. To deal with these issues and maintain data integrity, approaches such as tuple duplication detection and data conflict detection are used. The most common approaches to integrating data are:

  1. Data consolidation: The data is physically brought together into one data store. This usually involves data warehousing (a toy pandas example follows this list).
  2. Data propagation: Copying data from one location to another using applications is called data propagation. It can be synchronous or asynchronous and is event-driven.
  3. Data virtualization: An interface is used to provide a real-time and unified view of data from multiple sources. The data can be viewed from a single point of access.
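
Below is a toy pandas sketch of data consolidation together with tuple duplication detection. The two source tables and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical customer records arriving from two different sources.
source_a = pd.DataFrame({"customer_id": [1, 2, 3],
                         "city": ["Delhi", "Mumbai", "Pune"]})
source_b = pd.DataFrame({"customer_id": [2, 3, 4],
                         "city": ["Mumbai", "Pune", "Chennai"]})

# Consolidation: physically bring the records together into one store.
combined = pd.concat([source_a, source_b], ignore_index=True)

# Tuple duplication detection: flag rows repeated across sources, then drop them.
duplicates = combined[combined.duplicated()]
consolidated = combined.drop_duplicates().reset_index(drop=True)

print(duplicates)
print(consolidated)
```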

Data Reduction

The purpose of data reduction is to obtain a condensed representation of the data set that is smaller in volume while maintaining the integrity of the original. Models built on the reduced data are more efficient yet give similar results. A few methods to reduce the volume of data are:

  1. Missing values ratio: Attributes that have more missing values than a threshold are removed.
  2. Low variance filter: Normalized attributes whose variance is below a threshold are removed, since little variation in the data means it carries little information.
  3. High correlation filter: Normalized attributes whose correlation with another attribute exceeds a threshold are removed, since similar trends mean similar information is carried. The correlation coefficient is usually computed with measures such as Pearson’s product-moment coefficient (or the chi-square statistic for nominal attributes).
  4. Principal component analysis: PCA is a statistical method that reduces the number of attributes by combining correlated attributes. The original features are projected onto principal components, each capturing as much of the remaining variance as possible while staying uncorrelated with the preceding components. This method suits only features with numerical values. A rough sketch of these reduction steps follows this list.
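
Here is a rough sketch of the first four reduction methods on a synthetic table, using pandas and scikit-learn. The feature names and thresholds are illustrative, not prescriptive.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Synthetic numeric table with deliberately problematic columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "f3": rng.normal(size=100),
    "f4": np.full(100, 5.0),                                    # constant: carries no information
})
df["f5"] = 0.95 * df["f1"] + rng.normal(scale=0.05, size=100)   # near-duplicate of f1
df.loc[rng.choice(100, 60, replace=False), "f2"] = np.nan       # mostly missing

# 1. Missing values ratio: drop columns with more than 50% missing values.
df = df.loc[:, df.isna().mean() <= 0.5]

# 2. Low variance filter: drop near-constant columns
#    (applied on the raw values here; one could min-max scale first).
selector = VarianceThreshold(threshold=1e-3).fit(df)
df = df.loc[:, selector.get_support()]

# 3. High correlation filter: drop one of each pair of features with |r| > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# 4. PCA: project the standardized survivors onto uncorrelated components.
components = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df))
print(df.columns.tolist(), components.shape)
```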

Data Transformation

The final step of data preprocessing is transforming the data into a form appropriate for data modeling. This can be done by:

  1. Smoothing: Noise is removed from the data, as described under data cleaning.
  2. Attribute/feature construction: New attributes are constructed from the given set of attributes.
  3. Aggregation: Summary and Aggregation operations are applied on the given set of attributes to come up with new attributes.
  4. Normalization: The data in each attribute is scaled to a smaller range (e.g. 0 to 1 or -1 to 1), as in the sketch after this list.
  5. Discretization: Raw values of numeric attributes are replaced by discrete or conceptual intervals, which can in turn be organized into higher-level intervals.
  6. Concept hierarchy generation for nominal data: Values of nominal attributes are generalized to higher-level concepts.
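
Below is a short sketch of normalization, discretization and aggregation on a toy attribute, using pandas and scikit-learn’s MinMaxScaler. The column names, bin edges and labels are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric attribute; values, bin edges and labels are illustrative.
df = pd.DataFrame({"age": [18, 22, 25, 31, 40, 47, 55, 63, 72]})

# Normalization: min-max scale the attribute into the range 0 to 1.
df["age_scaled"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()

# Discretization: replace raw values with conceptual intervals, which could
# later be generalized further into a concept hierarchy.
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 25, 45, 65, 120],
                         labels=["youth", "adult", "middle_aged", "senior"])

# Aggregation: summarize the raw attribute per derived group.
print(df.groupby("age_group", observed=False)["age"].mean())
```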

Despite the many approaches to preprocessing data, it is still an actively researched field because of the amount of incoherent data being generated every day at tremendous rates.

References:

You may try an implementation of data preprocessing using pandas here: https://github.com/bhartendudubey/Data-Preprocessing

For better understanding, one may consider the following articles:

  1. Data Preprocessing — Machine Learning
  2. Data Pre Processing Techniques You Should Know
  3. Data Preprocessing in Python
