My take on: Why do we need to standardize the data?

I have read a lot of articles on why standardization is so important in solving machine learning problems. No offense, but I didn't really find a practical, hands-on explanation in any of them.

Alright, let me take an example of three data points, each with two features:

Say the range of feature 1 is 0–100 and the range of feature 2 is 0–1. Now let's look at the Euclidean distance between DP1 and DP2:

Since the range of feature 1 is 100, a difference of 3 between two data points is only 3% of that range. But a difference of 0.1 on feature 2 is 10% of its range. Now let's take the Euclidean distance between DP1 and DP3:

The distance between (DP1, DP3) should be greater than the distance between (DP1, DP2), because the 70% (0.7) variation on feature 2 is far larger than the variation between (DP1, DP2). But the Euclidean distance fails to capture this logic: the raw difference of 0.7 is numerically tiny next to even a small difference on feature 1, so it barely affects the result. This matters for any problem that uses distance as a metric.
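The original table of points isn't reproduced here, so as a sketch, assume three hypothetical points consistent with the differences described above (3 and 0.1 between DP1 and DP2; a 0.7 jump on feature 2 between DP1 and DP3):

```python
import math

# Hypothetical data points (assumed values, consistent with the
# differences described in the text): (feature 1, feature 2)
dp1 = (50.0, 0.2)
dp2 = (53.0, 0.3)   # differs by 3 on feature 1, 0.1 on feature 2
dp3 = (52.0, 0.9)   # differs by 2 on feature 1, 0.7 on feature 2

def euclidean(a, b):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean(dp1, dp2))  # ~3.002 -- dominated by feature 1
print(euclidean(dp1, dp3))  # ~2.119 -- the 0.7 jump barely registers
```

Despite the much larger relative change on feature 2, DP3 ends up looking closer to DP1 than DP2 does, which is exactly the failure described above.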

This is due to the scale difference between the features. If we standardize the data, each feature column will have a mean of 0 and a standard deviation of 1, which eliminates the scale discrepancy caused by the difference in ranges. This is how we standardize a value x: z = (x − μ) / σ, where mu (μ) is the mean and sigma (σ) is the standard deviation of that feature's values. For starters, the standard deviation basically tells you the average spread of the values around the mean.
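Here is a minimal sketch of that formula applied per feature column, using hypothetical values (feature 1 in the 0–100 range, feature 2 in the 0–1 range):

```python
import numpy as np

# Hypothetical feature matrix (assumed values): rows are data points,
# columns are feature 1 (range ~0-100) and feature 2 (range ~0-1).
X = np.array([[50.0, 0.2],
              [53.0, 0.3],
              [52.0, 0.9]])

# z = (x - mu) / sigma, computed per column
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

print(Z.mean(axis=0))  # ~[0, 0] -- each column now has mean 0
print(Z.std(axis=0))   # [1, 1]  -- and standard deviation 1

# On the standardized data, the large jump on feature 2 finally matters:
d12 = np.linalg.norm(Z[0] - Z[1])
d13 = np.linalg.norm(Z[0] - Z[2])
print(d13 > d12)  # True -- the third point is now farthest from the first
```

After standardization, a difference of one unit means "one standard deviation" on every feature, so both features contribute on equal footing.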


And this is why a machine learning algorithm works well, or learns well, when we feed it standardized data. For problems solved using distance as a metric, standardizing the data comes in especially handy. If you guessed it, yes: most kNN problems use standardization to improve the model's accuracy.
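To see the effect on a nearest-neighbour problem, here is a toy 1-NN sketch; all the data in it is made up for illustration, with feature 1 spanning roughly 0–100 and feature 2 roughly 0–1:

```python
import numpy as np

# Hypothetical training set (assumed values): two features, two classes
train_X = np.array([[50.0, 0.2],
                    [53.0, 0.3],
                    [52.0, 0.9]])
train_y = np.array(["A", "A", "B"])

# Query point: close to class A on feature 1, but its feature 2 value
# is near the class-B point
query = np.array([53.5, 0.85])

def nearest_label(X, y, q):
    """Label of the training point closest to q (Euclidean 1-NN)."""
    dists = np.linalg.norm(X - q, axis=1)
    return y[np.argmin(dists)]

# Raw features: feature 1 dominates the distance
print(nearest_label(train_X, train_y, query))  # "A"

# Standardize with the training set's statistics; apply the same
# transform to the query
mu, sigma = train_X.mean(axis=0), train_X.std(axis=0)
print(nearest_label((train_X - mu) / sigma, train_y,
                    (query - mu) / sigma))      # "B"
```

The predicted label flips once the features are on the same scale, because feature 2 is finally allowed to influence which neighbour is "nearest". Note that the query is standardized with the training set's μ and σ, not its own.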

Pro tip: if the data points nearly follow a Gaussian distribution, a machine learning algorithm might work very well when you apply standardization.