Data preprocessing

Nikunj Joshi · Published in Analytics Vidhya · May 18, 2020 · 4 min read

In machine learning, data preprocessing is one of the most important steps. So the question is: what exactly is data preprocessing? In this article I will try to explain data preprocessing in the simplest way possible.

Real-world data is often incomplete, inconsistent or lacking in certain behaviours or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. It is the step in which the data gets transformed to bring it to a state that a machine can easily analyse. In other words, the features of the data can now be easily elucidated by the algorithm.

Let us try to understand why we have to preprocess data before feeding it as input to a machine learning model.

  • Machine learning models have specific requirements for input data; for example, it usually has to be in numeric form. If we have a column in our data with week names, we have to convert it into numbers like 0 for Monday, 1 for Tuesday and so on (see the sketch after this list).
  • Preprocessing also helps to improve the accuracy of the model. This includes dealing with missing values, normalizing, creating dummy variables, etc.
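Here is a minimal sketch of that weekday conversion in Python with pandas (the column name `day` and the toy data are made up for illustration):

```python
import pandas as pd

# Toy data: a categorical column of week names (hypothetical example)
df = pd.DataFrame({"day": ["monday", "tuesday", "wednesday", "monday"]})

# Convert week names into numbers: 0 for monday, 1 for tuesday, and so on
day_to_num = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
              "friday": 4, "saturday": 5, "sunday": 6}
df["day"] = df["day"].map(day_to_num)

print(df["day"].tolist())  # [0, 1, 2, 0]
```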

Here are some things we have to keep in mind while preprocessing:

  • Missing values — Missing values are a frequent phenomenon, and we need to have a strategy for treating them. A missing value can signify different things in our data. Perhaps the data was not available or not applicable, or the event did not happen. It could be that the person who entered the data did not know the right value, or missed filling it in. Data preprocessing methods vary in the way they treat missing values. Typically, they ignore the missing values, exclude any records containing missing values, replace missing values with the mean, or infer missing values from existing values (a sketch of some of these fixes follows this list).
  • Noisy data — Noisy data means the data contains errors or outlier values that deviate from what is expected. Incorrect data may also result from inconsistencies in naming conventions or data codes, or from inconsistent formats for input fields, such as dates. This problem can be corrected by binning, regression and clustering.
  • Duplicate data — A dataset may include data entries which are duplicates of each other. This may happen when a person submits a form more than once. The term deduplication often refers to the process of dealing with duplicate entries.
  • Dimensionality of data — Most real-world datasets have a large number of features. Dimensionality reduction (or dimension reduction) is the process of reducing the number of features under consideration. The disadvantage of high dimensionality is that data analysis tasks become harder as the dimensionality of the data increases: the data becomes increasingly sparse across the feature space, which makes it difficult to model and visualize. The dimensionality of a dataset is reduced by creating new features which are combinations of the old features. In other words, the higher-dimensional feature space is mapped to a lower-dimensional feature space. The common techniques used for this are Principal Component Analysis and Singular Value Decomposition. Models built on top of lower-dimensional data are more accurate and understandable, and data visualization also gets easier.
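To make these ideas concrete, here is a minimal sketch of three of the fixes above on a toy pandas DataFrame: replacing missing values with the column mean, dropping duplicate rows, and reducing dimensionality with Principal Component Analysis (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Toy dataset with one missing value and one duplicate row (hypothetical)
df = pd.DataFrame({
    "height": [170.0, None, 165.0, 170.0],
    "weight": [65.0, 70.0, 55.0, 65.0],
})

# Missing values: one common strategy is to replace them with the column mean
df["height"] = df["height"].fillna(df["height"].mean())

# Duplicate data: drop exact duplicate rows (deduplication)
df = df.drop_duplicates()

# Dimensionality: map the two features onto a single principal component
pca = PCA(n_components=1)
reduced = pca.fit_transform(df[["height", "weight"]])
print(reduced.shape)  # (3, 1) -- same rows, fewer features
```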

Feature encoding

Feature encoding is the process of transforming a categorical variable into a numeric representation so it can be used in the model.

1- One hot encoding and label encoding

Let’s say we have ‘eggs’, ‘butter’ and ‘milk’ in a categorical variable.

  • One hot encoding will produce three columns, and the presence of a class will be represented in binary format. The three classes are separated out into three different features. The algorithm only cares about their presence/absence, without making any assumptions about their relationship.
  • Label encoding gives numerical aliases to the classes. So the resultant label-encoded feature will have 0, 1 and 2. The problem with this approach is that there is no relation between these three classes, yet our algorithm might consider them to be ordered (that is, that there is some relation between them), maybe 0 < 1 < 2, that is ‘eggs’ < ‘butter’ < ‘milk’. The sketch below illustrates both encodings.
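Here is a minimal sketch of both encodings with pandas on the grocery example above (`pd.factorize` stands in for a label encoder here; scikit-learn’s `LabelEncoder` would work too):

```python
import pandas as pd

items = pd.Series(["eggs", "butter", "milk"], name="item")

# One hot encoding: one binary column per class, no implied order
one_hot = pd.get_dummies(items)
print(one_hot.columns.tolist())  # ['butter', 'eggs', 'milk']

# Label encoding: each class gets a numerical alias (0, 1, 2)
codes, classes = pd.factorize(items)
print(codes.tolist())    # [0, 1, 2] -- a model may wrongly read this as ordered
print(classes.tolist())  # ['eggs', 'butter', 'milk']
```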

2- Frequency encoding — This is a way to utilize the frequency of the categories as labels. In cases where the frequency is somewhat related to the target variable, it helps the model assign weight in direct or inverse proportion, depending on the nature of the data.
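A minimal sketch of frequency encoding with pandas (the items and their counts are made up for illustration):

```python
import pandas as pd

items = pd.Series(["eggs", "butter", "milk", "eggs", "eggs", "milk"])

# Frequency encoding: replace each category with how often it occurs
freq = items.value_counts()   # eggs: 3, milk: 2, butter: 1
encoded = items.map(freq)

print(encoded.tolist())  # [3, 1, 2, 3, 3, 2]
```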

After feature encoding is done, our dataset is ready for the machine learning algorithms.
But before we start deciding which algorithm to use, we have to split the data into 3 parts (training data, validation data, testing data).

Machine learning algorithms have to be first trained on the training data, then validated and tested on the validation and testing datasets respectively.
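One common way to do this split, sketched with scikit-learn’s `train_test_split` on made-up data (the 60/20/20 proportions are just one reasonable choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed data: 100 samples, 5 numeric features
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First split off 20% as the test set ...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# ... then take 25% of the remainder as validation (0.25 * 0.8 = 0.2)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```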

Thank you for reading!
