Six steps to hone your Data: Data Preprocessing, Part 6

Anushkad · Published in GoDataScience · 5 min read · Oct 13, 2020

Finally, this part marks the end of our Data Preprocessing journey. So far in this series, we have seen what Data Preprocessing is, why it is important, and how the first five steps are carried out.

If you have not checked out the previous tutorials in the series, I suggest you do so to get a clear idea about Step 6.

This tutorial answers all the following questions:

  1. What is feature scaling?
  2. Why is feature scaling necessary?
  3. When to scale?
  4. How to scale?

In the last tutorial, we split our dataset into a training set and a test set, but before passing these sets to our ML model, we need to ensure that all the feature magnitudes are on a comparable scale.

Real-world data is often unscaled and spans all kinds of ranges. We therefore need to scale the features so that no single feature dominates the model simply because of its large magnitude.

And we will be learning how to do that in Step 6.

Step 6] Feature Scaling

Feature scaling in machine learning is one of the most crucial steps during the preprocessing of data before creating a machine learning model.

Feature scaling often helps a model train faster and perform better.

What is Feature scaling?

Feature Scaling is a method used to standardize the range of independent variables or features of data.

Let us consider a dataset.

In the dataset, you can notice that the age and salary columns do not have the same scale.
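For illustration, here is a small, made-up dataset along the lines of the one used earlier in this series (the column names and values are assumptions, not the exact file from the previous parts):

```python
import pandas as pd

# A small, made-up dataset similar in spirit to the one used in the
# earlier parts of this series (values are illustrative only).
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany"],
    "Age": [44, 27, 30, 38, 40],
    "Salary": [72000, 48000, 54000, 61000, 63800],
    "Purchased": ["No", "Yes", "No", "No", "Yes"],
})

print(data)
```

Here Age spans roughly 27 to 44, while Salary spans 48,000 to 72,000: two very different magnitudes.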

Why is Feature Scaling necessary?

Well, look at our data. We have one column with ages ranging from 21 to 80 and another with salaries ranging from about 45,000 to 200,000!

Not only is the salary column made up of much higher numbers than the age column, but it also covers a much wider range of values.

That means the Euclidean distance will be dominated by the salary values, and the age data will barely contribute.

This unscaled data may make visualization difficult and, more importantly, it can degrade the predictive performance of many machine learning algorithms.

Therefore, we need to scale our data before employing ML algorithms so that all the features contribute equally to the result, and the result generated is accurate.

When to Scale?

A lot of machine learning models depend on Euclidean distance.

If, for example, the values in one column (x) are much larger than the values in another column (y), then (x2 − x1)² will be far larger than (y2 − y1)².

So clearly, one square difference dominates over the other square difference.

In the machine learning equations, the smaller squared difference will be treated as almost equal to zero compared to the far greater one.

We do not want that to happen. And that is when we need to scale.
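To see this concretely, here is a small worked example (with made-up numbers) comparing how much age and salary each contribute to a Euclidean distance:

```python
import numpy as np

# Two hypothetical people, represented as (age, salary)
p1 = np.array([25, 50000])
p2 = np.array([60, 150000])

diff = p2 - p1
squared = diff ** 2
print(squared)                  # [1225, 10000000000]
print(np.sqrt(squared.sum()))   # ~100000.0 — driven almost entirely by salary
```

The age difference contributes 1,225 to the sum, while the salary difference contributes 10 billion, so the distance is effectively decided by salary alone.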

It is not always necessary to apply feature scaling for all ML models. Some models can compensate for a high-valued feature by assigning it a low coefficient.

As we build different models throughout our ML journey, we will learn to distinguish the models that need feature scaling from those that do not.

How to scale?

The most common techniques of feature scaling are Normalization and Standardization.

What is Normalization?

Normalization means to scale the variables to have values between 0 and 1.

It is also known as Min-Max Normalization. For every feature, the minimum value of that feature gets transformed into 0, and the maximum value gets transformed into 1.
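Expressed as a formula, min-max normalization maps each value x of a feature to (x − min) / (max − min). A minimal sketch with made-up salary values:

```python
import numpy as np

salary = np.array([48000, 54000, 61000, 72000], dtype=float)

# Min-max normalization: (x - min) / (max - min)
salary_scaled = (salary - salary.min()) / (salary.max() - salary.min())
print(salary_scaled)  # [0.     0.25   0.5417 1.    ]
```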

What is Standardization?

Standardization means rescaling the data so that each feature has a mean of 0 and a standard deviation of 1.
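The corresponding formula is (x − mean) / standard deviation, often called the z-score. A minimal sketch with the same made-up salaries:

```python
import numpy as np

salary = np.array([48000, 54000, 61000, 72000], dtype=float)

# Standardization (z-score): (x - mean) / std
salary_standardized = (salary - salary.mean()) / salary.std()
print(salary_standardized.mean())  # ~0.0
print(salary_standardized.std())   # 1.0
```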

(For most models, using either of the above techniques gives similar results.)

How to implement feature scaling?

  1. Normalization

Let us first consider the technique of normalization using the same dataset as above.

To accomplish this task, we need to import MinMaxScaler from the scikit-learn preprocessing library and create an object of that class.

Once we import all the necessary libraries, we need to carry out the five previous steps to ensure consistency. We will then apply scaling to our feature matrix X, as sketched below.

Thus, on applying Normalization, every feature is rescaled to lie between 0 and 1.
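The original post showed this step as an image; below is a hedged sketch of what the code typically looks like, assuming X_train and X_test are the numeric feature sets produced by the train/test split in the previous step:

```python
from sklearn.preprocessing import MinMaxScaler

# Assumes X_train and X_test already exist from the earlier preprocessing
# steps and contain only numeric columns (e.g., Age and Salary).
mm_scaler = MinMaxScaler()

# Fit on the training data only, then apply the same scaling to both sets
X_train_scaled = mm_scaler.fit_transform(X_train)
X_test_scaled = mm_scaler.transform(X_test)

print(X_train_scaled)  # every feature now lies between 0 and 1
```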

  2. Standardization

Let us consider the same dataset as above and apply Standardization to scale features.

To accomplish this, we will first import the class StandardScaler from the scikit-learn preprocessing library and create an object of that class.

Now we will fit and transform our X_train set. That will transform all the data to the same standardized scale.

Thus, you can now see that the values are transformed so that each feature is centered around 0 with a standard deviation of 1 (most values fall roughly between -3 and 3).
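Again, the original showed this step as an image; here is a minimal sketch with StandardScaler, under the same assumption that X_train and X_test come from the earlier split:

```python
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

# Fit the scaler on the training set and reuse its mean/std for the test set
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

print(X_train_std)  # each column now has mean ~0 and standard deviation 1
```

Fitting only on the training set and reusing the same scaler for the test set keeps the test data from leaking information into the preprocessing.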

These are the general six steps of Data Preprocessing for machine learning. Depending on the condition of your dataset, you may or may not have to go through all of them.

With practice, you will be able to judge which of these steps a given model and dataset need.

Our Data preprocessing journey ends here. Congratulations, you can now clean your data and are prepared to apply it to your models. Welcome to the ML world!

Stay Tuned. Keep Practicing.

