Why Feature Scaling is required for Training ML Models

Om Rastogi
Analytics Vidhya
Published in
3 min readJun 7, 2020


A question that all the Data Science learners have is, why is there a need to normalize data. Here we will discuss the answer and get the intuition of the mechanics.

What is Normalization?

Feature scaling is a method used to normalize the range of independent variables or features of data. While, data normalization refers to shifting and scaling of values. There are many ways, Features can be scaled-

  1. Min Max Scaling
  2. Standard Scaling
  3. Max abs Scaling
  4. Z-Score Scaling

The details of whose, most of the readers are aware and can be kept for later.

Here we are looking at the Z-score formula. Mean is subtracted by the value, then divided by the standard deviation of the dataset.

Z-score Normalization

Why Scale the data ?

We understand that models are capable to train, no matter what type of values are provided. However the answer to this question, lies deep in the working of optimization functions of learning models.

A normalized data with comparable ranges, leads to better and efficient optimization of the learning algorithms.


We can get some intuition from the most famous optimization algorithm:
The Gradient Descent Algorithm.

A complex visualization for gradient descent plane

To know more about gradient descent click here.

Let us look at a simplified version.

Now the size and complexity of this curve would depend on the range of data being trained.

For two feature learning model and data with similar range data, the gradient descent can look like this.

For a optimized learning rate, the algorithm smoothly reaches the global minima hence optimization is fast and effective.

Now let us consider unscaled data:
X1: Range -(1,100)
X2: Range -(0,1)


Let us see how it reached minima for a given learning rate.(X1 perspective)

WHAT !! BUT WHY!! Why Zig zag?

This looks like unreal but, lets look at this from another perspective(of X2), for the same learning rate.

Sure we can decrease the learning rate, but then the leaning in X1 will decrease too. Hence it is best to normalize and make the ranges comparable.



Om Rastogi
Analytics Vidhya

I believe in an altruistic world, where creativity and imagination replace repetitive work