Data Preprocessing — (in short)

Pankaj kanyal
2 min read · Jun 7, 2022

Data preprocessing is the step that follows data collection. It involves understanding the data and converting it into a format that suits the machine learning model, so that the model can perform as well as possible.

It is often said to take around 70% of the time in a data science project. The collected data may be structured or unstructured, and it may contain outliers and missing values.

Things To Do Before Starting Data Preprocessing

To make the data preprocessing step more effective, we must be clear on the following points.

  1. Understand the data very well.
  2. Be clear about the problem statement.
  3. Make sure the data comes from an authentic, trustworthy source.
  4. Understand what types and properties the data should have so that the machine learning model can work efficiently.
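As a rough illustration of points 1 and 4, here is a minimal sketch that profiles a small dataset by listing the value types and the number of missing values in each column. The records, the column names, and the `profile` helper are all made up for this example, and it uses only the Python standard library to keep the mechanics visible. In practice a library such as pandas is commonly used for this kind of inspection.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "city": "Delhi", "income": 42000.0},
    {"age": None, "city": "Mumbai", "income": 58000.0},
    {"age": 31, "city": None, "income": None},
    {"age": 47, "city": "Delhi", "income": 61000.0},
]


def profile(records):
    """Return, per column, the observed value types and the missing count."""
    columns = sorted({key for record in records for key in record})
    report = {}
    for column in columns:
        values = [record.get(column) for record in records]
        present = [value for value in values if value is not None]
        report[column] = {
            "types": dict(Counter(type(value).__name__ for value in present)),
            "missing": len(values) - len(present),
        }
    return report


summary = profile(rows)
for column, info in summary.items():
    print(column, info)
```

A column that mixes several types, or one with many missing values, is usually a sign that it needs attention during cleaning.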

Steps Involved in Data Preprocessing

Data preprocessing involves four major steps:

  1. Data Cleaning: the process of handling duplicate values, outliers, and missing values in the dataset. Data cleaning may require domain expertise to decide which unwanted features to remove from the dataset.
  2. Data Integration: a way to combine data collected from different sources and show or store it in a unified format. This step can be challenging, because integration may introduce redundant, inconsistent, or conflicting data points into the dataset.
  3. Data Reduction: the process of reducing or removing irrelevant features from the data. The more features the input has, the more time the machine learning algorithm needs, so it becomes important to remove the features that don't contribute much to predicting the target variable. Dimensionality reduction and feature selection are the two main techniques for data reduction.
  4. Data Transformation: the process of converting data from one format to another. Data transformation is important because it puts the data into a standard format that the algorithm can work with. Examples include rescaling, normalization, one-hot encoding, and label encoding.
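The four steps above can be sketched on a tiny example. This is only an illustration under stated assumptions, not a complete pipeline: the two data sources, the column names, and the specific choices (median imputation, dropping constant columns as a very simple form of feature selection, min-max scaling, one-hot encoding) are all hypothetical and were picked for brevity. It uses only the Python standard library; real projects typically rely on libraries such as pandas or scikit-learn.

```python
from statistics import median

# Hypothetical toy data from two sources that share an "id" key.
source_a = [
    {"id": 1, "age": 25, "city": "Delhi", "country": "India"},
    {"id": 2, "age": None, "city": "Mumbai", "country": "India"},
    {"id": 2, "age": None, "city": "Mumbai", "country": "India"},  # duplicate
    {"id": 3, "age": 31, "city": "Delhi", "country": "India"},
]
source_b = {1: {"income": 40000.0}, 2: {"income": 60000.0}, 3: {"income": 50000.0}}

# 1. Data cleaning: drop exact duplicates, then fill missing ages with the median.
seen, cleaned = set(), []
for row in source_a:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))
age_median = median(r["age"] for r in cleaned if r["age"] is not None)
for r in cleaned:
    if r["age"] is None:
        r["age"] = age_median

# 2. Data integration: join the second source onto the first by "id".
integrated = [{**r, **source_b.get(r["id"], {})} for r in cleaned]

# 3. Data reduction: drop columns that hold a single constant value,
#    since they carry no information about the target.
constant = {c for c in integrated[0] if len({r[c] for r in integrated}) == 1}
reduced = [{c: v for c, v in r.items() if c not in constant} for r in integrated]


# 4. Data transformation: min-max scale the numeric columns to [0, 1]
#    and one-hot encode the categorical "city" column.
def min_max(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [(v - lo) / span for v in values]


cities = sorted({r["city"] for r in reduced})
ages = min_max([r["age"] for r in reduced])
incomes = min_max([r["income"] for r in reduced])
transformed = []
for r, age, income in zip(reduced, ages, incomes):
    row = {"id": r["id"], "age": age, "income": income}
    for city in cities:
        row[f"city_{city}"] = 1 if r["city"] == city else 0
    transformed.append(row)

for row in transformed:
    print(row)
```

The order matters here: cleaning before integration keeps the duplicate row from being joined twice, and reduction before transformation avoids scaling a column that is about to be dropped.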

The topics covered above are only meant to give an overview of data preprocessing.

Thank you for reading :)

Pankaj kanyal

Data Science Enthusiast, Learning to walk in the field of Computer Science and Machine Learning.