Published in Nerd For Tech

Why Data Preprocessing?

In the machine learning process, data preprocessing is the step in which data is transformed, or encoded, so that a machine can easily parse it.

Why Data Preprocessing?

Machines don’t understand text, image, or video data as it is; they only understand 0s and 1s. If we simply hand several folders of raw images to a machine learning model and expect it to train, it will not happen.

In the real world, data are generally incomplete (lacking attribute values, containing duplicate values, or containing only aggregate data) and noisy (containing errors or outliers caused by human error or by false or manipulated survey data).

By preprocessing data:

We make our dataset more precise and accurate by eliminating the incorrect or missing values that result from human error.

We can fill in attribute or feature values that are missing where needed, which makes the dataset more complete.

We smooth the data, which makes it easier to use and interpret.

Steps in Data Preprocessing

Examine our Data

First, we have to take a close look at our dataset: its size, its null values, and its outliers, since a random collection of data often has irrelevant bits. Built-in pandas functions help here; for example, df.describe() reports the non-null count and summary statistics for each numeric column.
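As a rough illustration of this first look, here is a minimal sketch in pandas. The toy DataFrame and its column names are made up for the example, and the 1.5 × IQR rule used to flag outliers is just one common rule of thumb, not something the steps above prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the columns and values are invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 120],
    "salary": [50000, 62000, 58000, np.nan, 54000, 61000],
    "city": ["Pune", "Delhi", "Delhi", None, "Pune", "Mumbai"],
})

print(df.shape)                    # (number of rows, number of columns)

null_counts = df.isnull().sum()    # missing values per column
print(null_counts)

print(df.describe())               # count, mean, std, min, quartiles, max (numeric columns)

# Assumed outlier rule: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print(outliers)
```

Here the row with an age of 120 is the only one flagged, which is the kind of value worth checking by hand before deciding whether it is an error.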

Data Quality Assessment

Data quality assessment is the process of scientifically and statistically evaluating data to determine whether they meet the criteria required by the model. This process describes the data and helps us assess and improve their quality.

(I) Missing Values: It is very common to have missing values in a dataset; they usually creep in during data collection. We can get rid of missing values by:

Eliminating rows with null (NaN) values. This works well when only a few values are missing.

Estimating missing values by filling them with the mean, median, or mode of their respective feature.
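Both options can be sketched in a few lines of pandas. The DataFrame below is a made-up example; which statistic to fill with (mean, median, or mode) depends on the feature, so treat the choices here as assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "grade": ["A", "B", None, "B"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill each gap with a statistic of the same column.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())            # or .median()
filled["grade"] = filled["grade"].fillna(filled["grade"].mode()[0])   # most frequent value

print(dropped)
print(filled)
```

Dropping keeps only the two complete rows, while filling keeps all four rows at the cost of introducing estimated values.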

(II) Duplicate Values: A dataset may contain data objects that are duplicates of one another. We can get rid of duplicates by removing them, since they bias the data toward the repeated objects.
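A minimal pandas sketch of finding and removing duplicates, using an invented table in which one row is repeated exactly:

```python
import pandas as pd

# Hypothetical records; the third row repeats the first.
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Asha", "Meera"],
    "score": [88, 92, 88, 75],
})

n_duplicates = int(df.duplicated().sum())              # rows identical to an earlier row
deduped = df.drop_duplicates().reset_index(drop=True)  # keep the first occurrence only

print(n_duplicates)
print(deduped)
```

By default both calls compare entire rows; passing subset=[...] would restrict the comparison to chosen columns, which may or may not be what a given dataset needs.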

Data Aggregation or Data Reduction

Working with the complete dataset can be very expensive in terms of time and memory.

Aggregation gives us a more stable view of the data, since the behavior of grouped data is much smoother than that of individual data objects.
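One way to picture aggregation is to collapse many individual records into one summary row per group. The sketch below assumes a hypothetical table of per-day sales and groups it by store; the column names are illustrative only.

```python
import pandas as pd

# Hypothetical daily sales records for two stores.
sales = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "units": [10, 30, 20, 5, 7, 6],
})

# Six individual records are reduced to one summary row per store.
summary = sales.groupby("store")["units"].agg(["mean", "sum", "count"])
print(summary)
```

The per-store means (20 and 6) vary far less than the individual daily values, and the table to process is smaller.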

Data Transformation

As noted earlier, the whole point of preprocessing is to encode the data in such a way that a machine can easily parse or understand it. Transforming the data puts it in a form that a learning algorithm can accept as input.
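As one example of such an encoding, text categories can be turned into numeric columns. The sketch below uses one-hot encoding via pandas' get_dummies; this is an assumed choice of technique, and the "color" and "size" columns are made up.

```python
import pandas as pd

# Hypothetical data with one text (categorical) feature.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 2],
})

# Each distinct color becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```

After this step every column is numeric, so the table can be passed to a learning algorithm that expects numbers.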

Normalization: This scales the feature values into a specific range, for example -1 to 1.
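A simple way to do this is min-max scaling, where the smallest value maps to the lower bound and the largest to the upper bound: x' = low + (x - min) * (high - low) / (max - min). The helper below is a hypothetical sketch of that formula, not a library function, and the heights are invented.

```python
import pandas as pd

def min_max_scale(series, low=-1.0, high=1.0):
    """Linearly rescale a numeric Series so its min maps to `low` and its max to `high`."""
    span = series.max() - series.min()
    if span == 0:
        # A constant column has no spread; place it in the middle of the range.
        return pd.Series((low + high) / 2, index=series.index, dtype=float)
    return low + (series - series.min()) * (high - low) / span

heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])
scaled = min_max_scale(heights)
print(scaled.tolist())
```

In a real project the minimum and maximum should be computed on the training data only and then reused for the test data, so that no information leaks from the test set.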

Train-Test Data Split

After feature encoding is done, our dataset is ready for machine learning algorithms. But before we decide which algorithm to use, we should split the dataset into two parts: a training set and a test set.
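A minimal sketch of such a split using only pandas is shown below. The 80/20 ratio, the fixed random seed, and the toy columns are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical dataset with ten rows.
df = pd.DataFrame({
    "feature": range(10),
    "label": [0, 1] * 5,
})

# Randomly pick 80% of the rows for training; the remaining rows form the test set.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(len(train), len(test))
```

Many projects use scikit-learn's train_test_split for the same job; the pandas version is shown here only to keep the example dependency-free.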

Thank you for reading. ☺




Abu Qais
