Why Do We Need Feature Scaling? | Overview of Standardization and Normalization | Machine Learning

Why do we need feature scaling? When should we use it, and which feature scaling method should we choose?

Ashwin Prasad
Analytics Vidhya
7 min read · Jun 23, 2021


Feature Scaling is a pre-processing technique used to bring all the columns or features of a dataset to the same scale. This is done for several reasons.
It is needed for algorithms that involve gradient descent, as well as for distance-based algorithms like K-Means clustering and K-Nearest Neighbors.

Why Does Gradient Descent Require Feature Scaling?

Let’s consider linear regression as an example to understand why feature scaling is required for algorithms that use gradient descent.

fig 1.1: gradient descent in linear regression

fig 1.1 represents the working of the gradient descent algorithm in linear regression.
Here, to update the weight associated with a particular feature, we subtract from the original weight a term that involves a sum taken over all the rows of that feature.
To put it more clearly, updating theta(j) requires us to subtract the sum, over every row, of the value in the j-th column multiplied by the prediction error (predicted value minus actual value), scaled by the learning rate.
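For reference, the batch gradient-descent update for linear regression (which is presumably what fig 1.1 shows) can be written as:

theta_j := theta_j - (alpha / m) * sum over i of (h(x_i) - y_i) * x_i_j

where alpha is the learning rate, m is the number of rows, h(x_i) is the prediction for row i, and x_i_j is the value of the j-th feature in row i. That sum is exactly where the scale of the j-th column enters the update.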

fig 1.2: scaled vs un-scaled.

In most courses and blogs, you might have encountered a comparison picture like figure 1.2: the contours are elongated (skinny) when the features are not scaled, so the updates overshoot to the other side and move away from the global minimum. But I found it difficult to build intuition from this skinny-vs-circular contour plot alone. So, here is my interpretation of why the problem occurs.

Note: by overshoots, I mean that the value of w1 in the skinny contour plot of figure 1.2 is expected to move towards the point that gives us the global minimum of the loss function. Instead, at each iteration it bounces in a zig-zag motion from left to right and back again, rather than settling at the point in the middle.

Because updating theta(j) involves subtracting a term that sums over the values of the j-th column, while the same alpha is used for every weight, some theta values overshoot to the other side instead of heading directly towards the global minimum. This happens because some columns have values ranging from 0–10 while others might have values ranging from 10,000–20,000.
Obviously, the column or feature with the larger range of values will produce a larger sum. Hence, the weight associated with that column overshoots to the other side when the update is applied.
Due to this, gradient descent takes much longer to converge to the global minimum, or in some cases keeps oscillating around it without ever converging.

fig 1.3: sample data

Let us take the dataset shown in figure 1.3 as an example. The values of the Population column are in the range of 50,000–130,000, whereas the values of the Avg age column are in the range of 40–55. Let us assume that the weight associated with Population is w1 and the one for Avg age is w2.
When gradient descent runs as shown in figure 1.1, the sum over the Population column is far higher than the sum over the Avg age column. Since we use the same alpha for both weight updates, the gradient for Population also takes a very large value, and on update this throws w1 drastically to the other side, far past the optimal w1 value that would give us the global minimum of the loss function.
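As a rough numerical sketch of this, here is a tiny made-up dataset matching the ranges in fig 1.3 (the target values are arbitrary, purely for illustration). The gradient component for Population ends up several orders of magnitude larger than the one for Avg age, even though both weights share the same alpha:

```python
import numpy as np

# Hypothetical data mimicking fig 1.3: Population in the tens of thousands,
# Avg age in the tens. Targets are made up, just for illustration.
X = np.array([[55_000, 42.0],
              [90_000, 48.0],
              [130_000, 54.0]])
y = np.array([1.0, 2.0, 3.0])

theta = np.zeros(2)          # w1 (Population), w2 (Avg age)
alpha = 1e-3                 # same learning rate for both weights
m = len(y)

predictions = X @ theta
errors = predictions - y     # (predicted - actual) for each row

# Gradient for each weight: (1/m) * sum of (error * feature value)
grad = (X.T @ errors) / m
print(grad)                  # the Population component dwarfs the Avg age one

# A single update with the shared alpha moves w1 by a huge amount
theta -= alpha * grad
```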

But on performing feature scaling and bringing all the features of the dataset to the same scale, we can prevent this overshooting. When all the columns are on the same scale, the column sums that appear in the gradient update are comparable, so with a reasonable alpha value no weight associated with any particular feature overshoots, and we get the situation shown on the right-hand side of fig 1.2. With a little thought, it is not hard to see that feature scaling is exactly what solves this problem.
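Continuing the same made-up example, standardizing the two columns first (a minimal manual version of what a standard scaler does) brings the two gradient components to a comparable magnitude:

```python
import numpy as np

# Same hypothetical data as above, standardized column-wise before the update
X = np.array([[55_000, 42.0],
              [90_000, 48.0],
              [130_000, 54.0]])
y = np.array([1.0, 2.0, 3.0])

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit std per column

theta = np.zeros(2)
errors = X_scaled @ theta - y
grad = (X_scaled.T @ errors) / len(y)
print(grad)   # both gradient components are now of comparable magnitude
```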

In conclusion, by preventing this zig-zag overshooting of particular weights across iterations, feature scaling helps gradient descent converge much faster.

Why Do Algorithms Like K-Nearest Neighbors Require Feature Scaling?

K-Nearest Neighbors is an algorithm used for classification and regression.
K-Means is a clustering algorithm. But irrespective of their purpose, both of these algorithms rely on Euclidean distance to do their job.
In both KNN and K-Means, we would like to give equal importance to all the variables or features in our dataset. At the very least, we would not want a feature's importance to depend on the range of values it happens to take. But when the features are not on the same scale, that is exactly what happens.

It’s easier to explain this with an example. So, let’s consider the same sample data from fig 1.3 and take the two columns Population and Average Age. When we compute the Euclidean distance between city A and city B, the Population column dominates the result to the point where the Average Age column has almost no influence on it.

fig 2.1

As we can see from fig 2.1, the Average Age feature becomes almost insignificant in the Euclidean distance and contributes very little to the result. This happens simply because the features are not on the same scale. So, without a doubt, applying feature scaling solves this problem, too.
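As a concrete (made-up) illustration consistent with the ranges in fig 1.3, here is a small sketch comparing the Euclidean distance between two hypothetical cities before and after min-max scaling; the city values and column ranges are assumptions:

```python
import numpy as np

# Hypothetical cities, as (Population, Avg age)
city_a = np.array([60_000.0, 41.0])
city_b = np.array([120_000.0, 55.0])

# Unscaled: the Population difference completely dominates the distance
dist_raw = np.linalg.norm(city_a - city_b)
print(dist_raw)            # ~60000.0016 — Avg age barely changes the result

# Min-max scale both features to [0, 1] using assumed column ranges
pop_min, pop_max = 50_000, 130_000
age_min, age_max = 40, 55

def scale(city):
    return np.array([(city[0] - pop_min) / (pop_max - pop_min),
                     (city[1] - age_min) / (age_max - age_min)])

dist_scaled = np.linalg.norm(scale(city_a) - scale(city_b))
print(dist_scaled)         # now both features contribute meaningfully
```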

Normalization vs Standardization

Normalization: Normalization is the process of re-scaling the values of a feature to lie between 0 and 1.
This is done by subtracting the feature's minimum value from each value and then dividing by the difference between the maximum and the minimum values of the feature.
X_changed = (X_i - X_min)/(X_max - X_min)
For example, if the ages of employees range from 20 (minimum) to 60 (maximum), an age of 30 becomes (30 - 20)/(60 - 20) = 0.25.

In predictive modeling, x_min and x_max are noted at training time. When new data is given as input, this attribute (the age of employees, say) goes through the same process using the already stored x_min and x_max.
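A minimal sketch of this workflow using scikit-learn's MinMaxScaler; the employee ages are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training ages
ages_train = np.array([[22], [35], [47], [58], [60]])

scaler = MinMaxScaler()                          # rescales to the [0, 1] range
ages_scaled = scaler.fit_transform(ages_train)   # learns x_min=22, x_max=60

# A new employee's age at prediction time is transformed with the SAME
# stored x_min and x_max — the scaler is never re-fit on the new data.
new_age = np.array([[41]])
print(scaler.transform(new_age))                 # (41 - 22) / (60 - 22) = 0.5
```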

Standardization: It is the process that re-scales a feature to have a mean of 0 and a standard deviation of 1.
This is done by subtracting the mean of the feature from each value and then dividing by the standard deviation of the feature.
X_changed = (X_i - mean(X))/std(X)
For example, if the ages have a mean of 40 and a standard deviation of 10, an age of 55 becomes (55 - 40)/10 = 1.5.

Similar to normalization, in standardization we store the mean and the standard deviation of the feature, and when a new instance or data point is used for prediction, it goes through the same standardization process with the stored mean and standard deviation.
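And a matching sketch with scikit-learn's StandardScaler, again with made-up ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training ages
ages_train = np.array([[30.0], [40.0], [50.0]])

scaler = StandardScaler()                        # rescales to mean 0, std 1
ages_scaled = scaler.fit_transform(ages_train)   # learns mean=40, std≈8.165

# A new data point is standardized with the stored mean and std
new_age = np.array([[55.0]])
print(scaler.transform(new_age))                 # (55 - 40) / 8.165 ≈ 1.84
```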

In spite of the differences between these two scalers, when to use which has always been a question mark for many people, and they are often used interchangeably.
Even though there are no hard restrictions or constraints that tell you when to use which scaler, there are certain things to consider.

Standardization is preferable when you know that the data follows a Gaussian distribution. Normalization is preferred when you do not know the distribution of the data and are not sure whether it is normal or not. Data scientist Krish Naik says that, in his experience, standardization has proven more useful for machine learning algorithms like linear regression and K-Means, while normalization helps bring values into a specific range for convolutional and artificial neural networks (he also notes that this is based purely on his experience and is not theoretically proven).

Conclusion

To conclude, feature scaling is very important for algorithms that use gradient descent, like linear regression, logistic regression, and neural networks, and also for algorithms that use Euclidean distance, like K-Nearest Neighbors and K-Means clustering. Tree-based algorithms like decision trees, regression trees, random forests, and boosting do not require feature scaling, though. But for algorithms that do require it, feature scaling improves performance considerably and is a thoroughly recommended pre-processing step.
