My take on: Why do we need to standardize the data?

Image for post
Image for post

I have read a lot of articles on why standardization is so important in solving machine learning problems. No offense but, I didn’t really find an appropriate practical explanation. So I thought of writing this blog to..whatever… Let’s cut the BS and get to the point.

Alright, let me take an example of three data points:

Image for post
Image for post

Now say the range of feature 1 and feature 2 are 0–100 and 0–1 respectively. Now if we see the Euclidean distance between DP1 and DP2:

Image for post
Image for post

Now if the range of feature 1 is 100, a difference of 3 between two data points is 3% overall. But the difference of 0.1 on feature 2 between data points is 10% overall. If we take the Euclidean distance of data points between DP1 and DP3:

Image for post
Image for post

The distance between DP1 and DP3 should be more than the distance between DP1 and DP2 because there is 70% ( 0.7 ) variation on feature 2 which is very greater than the variation of DP1 and DP2. But Euclidean distance measure fails to capture this logic for the problems involving distance as a metric.

This is due to the scale difference between the features. Now if we standardize the data, each feature column will have a mean of 0 and the standard deviation of 1 which eliminates the scale difference(discrepancy caused by difference of range). So this is how ( see below image ) we standardize data points, mu( μ ) is the mean and sigma( σ ) is the standard deviation of the values. For starters, Standard deviation basically tells you the average spread of the values from the mean.

Image for post
Image for post
Image credits: https://www.thoughtco.com/z-scores-worksheet-3126534

And this is why Machine learning algorithm works well or learns well when we feed in standardized data. So problems which are solved using distance as a metric, standardizing the data comes in handy. So if you guessed it, yes, most of the kNN problems use standardization techniques to improve the model’s accuracy.

Protip: If data points nearly follow a Gaussian distribution, Machine learning algorithm might work very well if you apply standardization.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store