Feature Scaling: Why and How?

Vivekpandian · Published in Analytics Vidhya · Jul 8, 2020

Feature scaling refers to the process of changing the range (normalization) of numerical features. It is also known as “Data Normalization” and is usually performed in the data pre-processing step of the machine learning pipeline.

[Figure: attaining the global minimum before and after scaling]

There are different methods to do feature scaling. But first, why do you need to do it? (stop reading and guess the answer)

Yes, you are right. When machine learning algorithms measure distances between data points, features with larger magnitudes (scales) can dominate the result, regardless of how informative they are. Scaling the features to a common range fixes this problem. For gradient-based algorithms, normalization also improves convergence speed.

Use feature scaling when the algorithm computes distances (e.g., K-Nearest Neighbors, Support Vector Machines) or is trained with gradient descent (e.g., linear and logistic regression).
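
To make this concrete, here is a quick sketch (not from the original post; the feature names and numbers are made up for illustration) showing how an unscaled feature dominates a Euclidean distance:

```python
# Hypothetical example: Euclidean distance between two people described by
# [age in years, income in dollars]. The income feature, which has a much
# larger scale, dominates the distance almost entirely.
import numpy as np

a = np.array([25.0, 50_000.0])  # person A: [age, income]
b = np.array([60.0, 52_000.0])  # person B: [age, income]

print(np.linalg.norm(a - b))  # ~2000.3: the 35-year age gap barely matters
```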

How can we do feature scaling? We’ll use the following synthetic data to compare the different scaling techniques.
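
Here is a minimal sketch for building such data. The exact distributions behind the plots in this post aren’t specified, so the choices below (a beta, an exponential, and a normal at different scales) are assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Three features with very different ranges and shapes (assumed for this sketch).
df = pd.DataFrame({
    "beta": rng.beta(a=2.0, b=5.0, size=1_000),              # bounded in [0, 1]
    "exponential": rng.exponential(scale=10.0, size=1_000),  # long right tail
    "normal": rng.normal(loc=50.0, scale=5.0, size=1_000),   # larger location
})

print(df.describe().loc[["min", "max"]])
```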

Min-Max Normalization:

Min-Max scaling is one of the simplest and most widely used normalization methods. It scales each variable/feature to the [0, 1] range.

Min-Max Scaling: x_scaled = (x - min(x)) / (max(x) - min(x))
Example: Min-Max Scaler
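
In code, a minimal sketch with scikit-learn’s MinMaxScaler, reusing the `df` frame and imports from the sketch above:

```python
from sklearn.preprocessing import MinMaxScaler

# Fit on the data and rescale every feature to [0, 1].
scaler = MinMaxScaler()  # feature_range=(0, 1) by default
df_minmax = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_minmax.min())  # 0.0 for every column
print(df_minmax.max())  # 1.0 for every column
```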

The scaled distributions do not overlap as much, and their shapes remain the same (except for the normal one).
This method preserves the shape of the original distribution but is sensitive to outliers: a single extreme value stretches the observed range and squeezes the rest of the data into a narrow band.

Standardization:

This method re-scales a feature by removing the mean and dividing by the standard deviation. It produces a distribution centered at 0 with a standard deviation of 1. Some machine learning algorithms (e.g., SVMs) assume features are centered and on a comparable scale.

Standard Scaling: z = (x - μ) / σ, where μ and σ are the feature’s mean and standard deviation
Example: Standard Scaling
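
A minimal sketch with scikit-learn’s StandardScaler, again assuming the `df` frame from above:

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean and divide by the standard deviation of each feature.
scaler = StandardScaler()
df_std = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_std.mean().round(2))  # ~0.0 for every column
print(df_std.std().round(2))   # ~1.0 for every column
```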

The resulting distributions overlap heavily, and their shapes are much narrower.

Note that this method does not truly “make” a feature normally distributed; it only shifts and rescales the existing shape. With outliers, the mean and standard deviation are inflated, so the bulk of the data gets scaled into a small interval.

Robust Scaling:

This method is very similar to the Min-Max approach. Each feature is scaled with:

Robust Scaling: x_scaled = (x - Q2(x)) / (Q3(x) - Q1(x))

where Q1(x), Q2(x), and Q3(x) are the first, second (the median), and third quartiles of the feature. The interquartile range (Q3 - Q1) makes this method robust to outliers (hence the name).

Example: Robust Scaling
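
A minimal sketch with scikit-learn’s RobustScaler, once more assuming the `df` frame from above:

```python
from sklearn.preprocessing import RobustScaler

# Subtract the median and divide by the interquartile range (Q3 - Q1).
scaler = RobustScaler()  # centering and IQR scaling are the defaults
df_robust = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(df_robust.median())  # 0.0 for every column
print(df_robust.quantile(0.75) - df_robust.quantile(0.25))  # 1.0 for every column
```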

All distributions now have most of their density around 0, and their shapes remain more or less the same.
Use this method when you have outliers and want to minimize their influence.

I hope this gave you a good idea of why and how we scale features in machine learning. Thanks for reading.

Code: https://github.com/vivekpandian08/feature-scaling-why-and-how-

