INTRODUCTION TO PRE-PROCESSING IN MACHINE LEARNING

Mahesh Singh Dasila
6 min read · Jun 1, 2019

--

Stages in Machine Learning

Before starting with data pre-processing, I want you all to have a basic idea of how a model gets trained in machine learning and what steps are involved.

The steps are:

1) Loading the dataset

2) Pre-processing of the dataset

3) Splitting of data into Training and Testing set

4) Creating the machine learning model and applying the algorithm to it

5) Prediction

These are the basic steps followed when creating a machine learning project. Now, coming to pre-processing.

Pre-processing is one of the most important steps: we clean the raw data and give it structure so that the model can make predictions much more efficiently.

The concepts I will cover in this article are:

1) Handling Null values

2) Standardization

3) Normalization

4) Label encoder

5) One-hot encoding

1) Handling Null values

In real-world datasets you will often see null values. Whether your model is for classification, regression, or anything else, you will come across them.

As the heading suggests, here we will be dealing with the null values that occur in our dataset.

The reason we handle null values is that most machine learning algorithms cannot train on missing entries, and leaving them in skews the statistics of the dataset, which eventually reduces the quality of the model.

Firstly we have to check for the null values. In Python, a null value is represented as NaN (nan).

To give you a better understanding, I'll show you the code, so that you'll get to know how it actually works.

* df.isnull() or df.isna(): returns a boolean mask, True where the value is NaN and False otherwise.

* df.isnull().sum() or df.isna().sum(): returns, for each column, the number of NaN values in that column.
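To make this concrete, here is a small sketch with a made-up DataFrame (the column names are just for illustration):

```python
import numpy as np
import pandas as pd

# A tiny made-up DataFrame with two missing values
df = pd.DataFrame({
    "age": [25, np.nan, 34],
    "salary": [50000, 60000, np.nan],
})

print(df.isnull())        # boolean mask: True wherever a value is NaN
print(df.isnull().sum())  # per-column count of NaN values
```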

We can solve this problem in several ways:

i) By dropping the column

ii) Imputation

iii) Slicing through positions if NaN values occur in an array

i) By dropping column

We can simply drop rows containing NaN using df.dropna() (or columns with df.dropna(axis=1)). This is sensible only for large datasets with many data points.

However, removing rows and columns is not always the best option, as it can lead to loss of valuable information. If you have a large dataset, say around 10,000 data points, then removing 2-3 rows won't affect it much; but if you only have 100 data points, 30 of which have NaN values for a particular field, then you can't simply drop those rows.

ii) Imputation

Imputation is simply the process of substituting the missing values of our dataset. We can do this by defining our own customised function, or we can simply use the imputer classes provided by scikit-learn (in recent versions, the old Imputer class has been replaced by SimpleImputer).
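As a minimal sketch of mean imputation with scikit-learn's SimpleImputer (the array below is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array with one NaN per column
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)  # NaNs become 4.0 (mean of 1 and 7) and 2.5 (mean of 2 and 3)
```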

iii) Slicing through positions if NaN values occur in an array

Here we simply slice away the rows and columns where we were getting NaN values.

Suppose the generated array has its NaN values confined to, say, the first row and first column.

Now, with simple slicing of rows and columns:

Here, for X the first index is the row and the second is the column. So with X[1:, 1:5] I am accessing rows from index 1 onwards to the end of the array, and columns 1 to 4 (a slice stops just before its end index, 5).
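A small sketch of that slicing, with a made-up array whose NaN values sit in the first row and first column:

```python
import numpy as np

# Hypothetical array: NaNs occupy row 0 and column 0
X = np.array([[np.nan, np.nan, np.nan, np.nan, np.nan],
              [np.nan, 1.0, 2.0, 3.0, 4.0],
              [np.nan, 5.0, 6.0, 7.0, 8.0]])

# Keep rows from index 1 onward and columns 1 through 4
X_clean = X[1:, 1:5]
print(X_clean)         # the NaN row and column are gone
print(X_clean.shape)   # (2, 4)
```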

2) Standardization

Standardization is simply scaling the larger and smaller values in a dataset so that they fall within roughly the same range, so that training does not fluctuate later.

Without it, features measured on very different scales can dominate the training, and results swing between looking too good and very bad; both outcomes are harmful, since they feed into the over-fitting and under-fitting problems.

One thing to remember: split the data into training and testing sets first, fit the scaler on the training set only, and then apply the same transformation to the test set, so that no information leaks from the test set into training.

As an example, take a small dataset where one feature sits in the single digits and another in the hundreds.

Applying scaling and then comparing each data point with its original value, you'll see the values come out in roughly the same range.
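A sketch of what that looks like with scikit-learn's StandardScaler (the numbers are made up: one feature in single digits, one in the hundreds):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Subtract each column's mean and divide by its standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)  # each column now has mean 0 and standard deviation 1
```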

3) Normalization

Normalization is another kind of pre-processing, where we divide the data by the maximum value among the data points so that everything falls onto a common scale.

Suppose the array we obtain after conversion has a maximum value of 255, as with 8-bit image pixels.

So what I will do is normalize the data by dividing every data point by 255, so that each value lands between 0 and 1. The goal of normalization is to bring the numeric columns of the dataset (or vector) to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is needed only when features have different ranges.

After normalizing, every value lies between 0 and 1.
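As a sketch, assuming the array holds 8-bit pixel values (the numbers below are made up):

```python
import numpy as np

# Hypothetical pixel values; the maximum possible value is 255
pixels = np.array([[0.0, 51.0, 102.0],
                   [153.0, 204.0, 255.0]])

# Divide every data point by the maximum value
normalized = pixels / 255.0
print(normalized)  # every value now lies between 0 and 1
```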

Some of you might now be wondering: what is the difference between scaling (standardization) and normalization?

They serve a similar purpose, pulling the data points into a comparable range, but they differ in how they transform the data.

In algebra, normalization refers to dividing a vector by its maximum (or its length), and it transforms your data into a range between 0 and 1.

In statistics, standardization refers to subtracting the mean and then dividing by the SD (standard deviation). Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1.

4) Label Encoder

Label encoding and one-hot encoding share the same goal: they turn string values in a column into numeric values so that the model further down the pipeline doesn't throw errors. That's why we do this during pre-processing. Both are part of the scikit-learn library in Python, but they are not identical: a label encoder assigns one integer per category, while one-hot encoding expands each category into its own binary column.

For example, suppose we encode a set of messages labelled spam or ham into numerical data. This is categorical data, and there is no relation of any kind between the categories; but the model doesn't know that from plain integers, so we often follow up with a one-hot encoder.
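A minimal sketch with scikit-learn's LabelEncoder on the spam/ham example (the labels are made up):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up message labels
labels = ["ham", "spam", "ham", "spam", "ham"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoded)           # [0 1 0 1 0] -- classes are sorted, so ham=0, spam=1
print(encoder.classes_)  # ['ham' 'spam']
```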

5) One-hot encoding

In a dataset, what one-hot encoding does is take a column of categorical data that has been label encoded and split it into multiple columns. The numbers are replaced by 1s and 0s, depending on which column holds which value.

In our example I have made use of an array, and as you can see, it converts each numeric label into a vector of 0s and 1s.

So that's it from my side; hope you liked it. And if you did, don't forget to clap.

-Mahesh Singh Dasila
