Scaling the data for Machine Learning: Standardisation and Normalisation
I am currently participating to a foundation course on Machine Learning (ML) with Microsoft Azure Cloud and I am revising many essential steps to prepare the data before starting with model training.
One fundamental data pre-processing step that often gets overlooked is feature scaling. Feature scaling is a technique for standardising in a fixed range the magnitude of the features present in the data. With feature scaling, data features get transformed so they are represented at the same level of magnitude.
Some ML algorithms are very sensitive to feature scaling, while others are unaffected by it.
For example one feature might be expressed in grams and the other in kilometres, and the magnitude of 3,000 grams would be incorrectly considered greater than 5 km. This is problematic because many ML algorithms (e.g. KNN, K-means and SVM) require the computation of Euclidean distances and use features’ magnitudes rather than their units to perform this computation. Data needs to be scaled before using a distance-based algorithm in order to have all features contributing equally to the final result.
Furthermore, ML methods that use the gradient descent algorithm converge more quickly and smoothly towards the minima if all data’s features are expressed on a similar scale.
ML algorithms insensitive to features’ scaling are tree-based algorithms, as the decision trees get split at a node only on the basis of a single feature (thus the scale of the other features does not affect one specific node split).
There are different methods to perform data scaling, here I will discuss data Standardisation and Normalisation.
Standardisation rescales the features so that they all have a mean value = 0 and a standard deviation value = 1. This transformation is carried out converting each feature value to their Z-score:
Standardisation, which makes features unitless, is a requirement of many ML algorithms, such as gradient descent (an optimisation algorithm used in regression, SVM, neural networks, etc…). Looking at the gradient descent equation used to update the features’ weights (θj), it is evident that certain weights may get updated faster than others because the magnitude of the feature (xj) is included in the weights’ calculation:
Data standardisation ensures that the steps of the gradient descent are updated at an equal rate for all the features’ weights in the dataset.
This scaling technique is also called Min-Max Scaling. Features’ values get transformed so they range between 0 and 1. After this transformation, the minimum value of the feature will be equal to 0 and the maximum to 1.
Normalisation is best to be used when the data does not follow a Gaussian distribution, and then it is best applied with algorithms that do not make this assumption (e.g. KNN and neural networks).
This image (from Quora) nicely represents how standardisation and normalisation transform the data:
References for further reading:
About Feature Scaling and Normalization
Sections The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll…