WHY Data Preprocessing?

Abu Qais
Published in Nerd For Tech
3 min read · Jun 16, 2021

In the machine learning process, data preprocessing is the step in which raw data is transformed or encoded so that a machine can easily parse it.

WHY Data Preprocessing?

Machines don’t understand text, image, or video data as it is; they only understand numbers. So if we feed in several folders of raw images and expect our machine learning model to get trained on them, IT WILL NOT HAPPEN.

In the real world, data are generally incomplete (lacking attribute values, containing duplicates, or containing only aggregate data) and noisy (containing errors or outliers caused by human error, or by false or manipulated survey responses).

By Preprocessing Data:

We make our dataset more precise and accurate by eliminating the incorrect values introduced through human error.

We can fill in missing attribute or feature values where needed, making the dataset more complete.

We smooth the data, which will make it easier to use and interpret.

Steps in Data Preprocessing

Examine our Data

First, we take a close look at our dataset — checking its size and looking for null values and outliers — since a random collection of data often contains irrelevant bits. In pandas, df.info() reports the data types and non-null counts per column, and df.describe() summarizes the numeric features.
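As a minimal sketch of this first look, here is a small made-up DataFrame (the column names and values are illustrative, not from the article) examined with the pandas methods mentioned above:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with one null value and one obvious outlier.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 29, 200],   # 200 looks like an outlier
    "salary": [50000, 64000, 58000, 52000, 61000],
})

df.info()                  # column dtypes and non-null counts
print(df.describe())       # count, mean, std, min, max, quartiles
print(df.isnull().sum())   # explicit count of nulls per column
```

The min/max rows of describe() are often the quickest way to spot values (like the age of 200) that deserve a second look.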

Data Quality Assessment

It is the process of scientifically and statistically evaluating data to determine whether they meet the criteria required by the model. This process describes the data and helps us assess and improve their quality.

(I) Missing Values: It is quite common to have missing values in a dataset; they usually arise during data collection. We can get rid of missing values by:

Eliminating rows with null values (NaN values). This works effectively for data with few missing values.

Estimating missing values by filling them with the mean, median, or mode of their respective feature.
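Both strategies can be sketched in pandas; the DataFrame here is a hypothetical example, not data from the article:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 32, np.nan, 29],
    "city": ["A", "B", "B", None],
})

# Option 1: drop any row that contains a null value.
dropped = df.dropna()

# Option 2: impute — mean for the numeric feature,
# mode (most frequent value) for the categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
```

Dropping loses two of the four rows here, while imputation keeps them all — which is why imputation is usually preferred when many rows have a missing value somewhere.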

(II) Duplicate Values: A dataset may contain data objects that are duplicates. We get rid of them by eliminating the extra copies, since duplicates bias the model toward those particular data objects.
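In pandas this is a one-liner; the small DataFrame below is again just an illustration:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [10, 20, 20, 30]})

# drop_duplicates keeps the first occurrence of each repeated row.
deduped = df.drop_duplicates()
```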

Data Aggregation or Data Reduction

Working with the complete dataset often turns out to be expensive in terms of time and memory.

Aggregation gives us a more stable view of the data, as the behavior of grouped data is much smoother than that of individual data objects.
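A common way to do this in pandas is groupby followed by an aggregation; the sales data below is a made-up example to show the idea:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "amount": [100, 150, 90, 110],
})

# Collapse individual transactions into one summary row per region.
summary = sales.groupby("region")["amount"].agg(["mean", "sum"])
```

Four individual rows become two aggregate rows — smaller to store, and the per-region means fluctuate far less than individual transactions do.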

Data Transformation

As noted earlier, the whole point of preprocessing is to encode the data in such a way that a machine can easily parse and understand it. Transforming the data converts it into the numeric input that the learning algorithm can actually consume.
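One standard transformation — one-hot encoding — turns a text category into numeric 0/1 columns. A minimal sketch with a hypothetical "color" feature:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"]})

# get_dummies creates one indicator column per category value.
encoded = pd.get_dummies(df, columns=["color"])
```

The resulting columns (color_green, color_red) contain only 0s and 1s, which any learning algorithm can take as input.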

Normalization: It is done to scale the feature values into a specific range, such as -1 to 1.
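Min-max scaling into [-1, 1] can be written directly in pandas: map the values to [0, 1] first, then stretch them. The series below is an illustrative example:

```python
import pandas as pd

s = pd.Series([10.0, 20.0, 30.0, 40.0])

# Step 1: min-max scale into [0, 1].
scaled01 = (s - s.min()) / (s.max() - s.min())
# Step 2: stretch [0, 1] into [-1, 1].
scaled = 2 * scaled01 - 1
```

Libraries such as scikit-learn offer ready-made scalers (e.g. MinMaxScaler) that do the same arithmetic and also remember the fitted range for later data.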

Train–Test Data Split

After feature encoding is done, our dataset is ready for the machine learning algorithms. But before we start deciding which algorithm to use, we should split the dataset into two parts: a training set used to fit the model and a test set held back to evaluate it on unseen data.
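A minimal sketch of such a split, done by hand with pandas (the DataFrame, the 80/20 ratio, and the random seed are illustrative choices, not from the article):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": [i % 2 for i in range(10)]})

# Shuffle the rows, then hold out 20% of them as the test set.
shuffled = df.sample(frac=1, random_state=42)
test_size = int(len(shuffled) * 0.2)
test = shuffled.iloc[:test_size]
train = shuffled.iloc[test_size:]
```

In practice most people reach for scikit-learn's train_test_split, which does the same shuffle-and-slice and can additionally stratify by the label.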

Thank you for reading. ☺
