My take on: Why do we need to standardize the data?
I have read a lot of articles on why standardization is so important in solving machine learning problems. No offense, but I didn't really find a practical explanation that worked for me.
Alright, let me take an example of three data points, DP1, DP2, and DP3, each described by two features (feature 1 and feature 2).
Now say the range of feature 1 is 0–100 and the range of feature 2 is 0–1, and suppose DP1 and DP2 differ by 3 on feature 1 and by 0.1 on feature 2. The Euclidean distance between DP1 and DP2 is then √(3² + 0.1²) ≈ 3.0.
A difference of 3 on feature 1, whose range is 100, is only 3% of that range, while a difference of 0.1 on feature 2 is 10% of its range. Now take DP1 and DP3: say they have the same value on feature 1 but differ by 0.7 on feature 2, so their Euclidean distance is √(0² + 0.7²) = 0.7.
The distance between (DP1, DP3) should be larger than the distance between (DP1, DP2), because the 70% (0.7) variation on feature 2 is far greater than the 10% variation between (DP1, DP2). But the raw Euclidean distance says the opposite (3.0 versus 0.7), so any problem that uses distance as a metric is misled by it.
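To make this concrete, here is a small Python sketch. The exact coordinates of DP1, DP2, and DP3 are my own made-up values; only the differences (3 on feature 1, and 0.1 and 0.7 on feature 2) match the discussion above.

```python
import numpy as np

# Hypothetical data points: [feature 1, feature 2].
# The absolute values are assumptions; only the differences match the example.
dp1 = np.array([50.0, 0.2])
dp2 = np.array([53.0, 0.3])   # differs from DP1 by 3 on feature 1 and 0.1 on feature 2
dp3 = np.array([50.0, 0.9])   # differs from DP1 by 0.7 on feature 2 only

print(np.linalg.norm(dp1 - dp2))  # ≈ 3.00, dominated by feature 1's large scale
print(np.linalg.norm(dp1 - dp3))  # 0.70, the relatively "bigger" change looks smaller
```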
This is due to the scale difference between the features. If we standardize the data, each feature column will have a mean of 0 and a standard deviation of 1, which eliminates the scale difference (the discrepancy caused by the difference in ranges). This is how we standardize a value x: z = (x − μ) / σ, where mu (μ) is the mean and sigma (σ) is the standard deviation of the feature's values. For starters, the standard deviation basically tells you the average spread of the values from the mean.
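Sticking with the same made-up points, here is what standardization does to those distances. Note that with only three rows the estimates of μ and σ are crude; in a real problem you would compute them on the full training set.

```python
X = np.stack([dp1, dp2, dp3])

# Standardize each feature column: z = (x - mu) / sigma
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

print(np.linalg.norm(Z[0] - Z[1]))  # distance DP1-DP2 after standardization
print(np.linalg.norm(Z[0] - Z[2]))  # distance DP1-DP3, now the larger of the two
```

After standardization the ordering flips: DP1 and DP3 end up farther apart than DP1 and DP2, which matches the intuition about relative variation.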
And this is why a machine learning algorithm learns well when we feed it standardized data, especially for problems that use distance as a metric, where standardizing the data really comes in handy. If you guessed it, yes, most kNN problems use standardization to improve the model's accuracy.
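Here is a minimal sketch of how that usually looks in practice with scikit-learn. The dataset and the choice of k are just placeholders for illustration; the point is that the scaler learns μ and σ on the training split and the kNN model then measures distances in standardized space.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# kNN on raw features: columns with large ranges dominate the distance.
raw_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# kNN after standardization: every feature contributes on the same scale.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

print("accuracy without scaling:", raw_knn.score(X_test, y_test))
print("accuracy with scaling:   ", scaled_knn.score(X_test, y_test))
```

On datasets whose features sit on very different scales, the standardized pipeline typically scores noticeably higher.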
Protip: If the data points roughly follow a Gaussian distribution, a machine learning algorithm may work particularly well once you apply standardization.