Feature Scaling :- Normalization, Standardization and Scaling!

Nishant Kumar
Published in Analytics Vidhya · 4 min read · Apr 5, 2020


When your data is composed of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all have the same scale. This is often referred to as normalization, and attributes are often rescaled into the range between 0 and 1.

Scaling to the same magnitude in gradient descent

Feature scaling is a technique to standardise the independent variables to a fixed range in order to bring all values to the same magnitude. It is generally performed during the data pre-processing step and also helps speed up the calculations in an algorithm. It is used in Linear Regression, K-means, KNN, PCA, Gradient Descent, etc.

Why is scaling needed?

Feature scaling should be performed on independent variables that vary in magnitude, units, and range, so that they are standardised to a fixed range.
Without scaling, a machine learning algorithm assigns higher weight to greater values regardless of their units. Because the range of values in raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example:

the model treats the value 1000 (grams) as greater than 2 (kilograms), or 3000 (metres) as greater than 5 (kilometres), and hence the algorithm will give wrong predictions.

Feature weight mismatch

Many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature.
Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Another reason feature scaling is applied is that gradient descent converges much faster with feature scaling than without it. In short, we scale everything down to the same scale.
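As a rough illustration (a minimal sketch with made-up salary/age numbers, not taken from the article), the large-range salary feature dominates the Euclidean distance until both features are scaled:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical samples: [salary in rupees, age in years]
X = np.array([[50000.0, 25.0],
              [52000.0, 55.0]])

# Unscaled Euclidean distance is dominated by the salary difference
print(np.linalg.norm(X[0] - X[1]))   # ~2000.2, the 30-year age gap barely matters

# After min-max scaling, both features contribute proportionately
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))   # ~1.41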

Algorithms which are NOT distance based are not affected by feature scaling, e.g. Naive Bayes.

There are mainly five types of feature scaling :-

1. Min-Max Scaling (Scaling) :- It differs from normalisation/standardisation in that its sole motive is to change the range of the data, whereas in normalisation/standardisation the sole motive is to reshape the distribution curve towards a perfect Gaussian curve. Here the data is scaled to a fixed range, usually 0 to 1. The cost of having this bounded range, in contrast to standardization, is that we end up with smaller standard deviations, which can suppress the effect of outliers.

It is used in deep learning, image processing and convolutional neural networks.
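A minimal sketch of min-max scaling with scikit-learn's MinMaxScaler (the sample values are made up for illustration); the transformation applied is x' = (x - min) / (max - min):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column with a wide spread of values
X = np.array([[1.0], [5.0], [10.0], [100.0]])

scaler = MinMaxScaler(feature_range=(0, 1))   # (0, 1) is the default range
X_scaled = scaler.fit_transform(X)            # applies (x - min) / (max - min)
print(X_scaled.ravel())                       # [0.     0.0404 0.0909 1.    ]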

2. Mean Normalization :- The point of normalization is to change your observations so that they can be described by a normal distribution. The normal (Gaussian) distribution, also known as the bell curve, is a specific statistical distribution where roughly equal numbers of observations fall above and below the mean. Mean-normalized data varies between -1 and 1 with mean = 0.
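scikit-learn has no dedicated mean-normalization transformer, so a minimal sketch (with made-up values) applies x' = (x - mean) / (max - min) directly:

import numpy as np

# Hypothetical feature column
x = np.array([1.0, 5.0, 10.0, 100.0])

# Mean normalization: result lies roughly in [-1, 1] with mean 0
x_norm = (x - x.mean()) / (x.max() - x.min())
print(x_norm)         # [-0.2828 -0.2424 -0.1919  0.7172]
print(x_norm.mean())  # ~0.0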

3. Standardization (Z-score normalization) :- Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1 (μ = 0 and σ = 1). It is mainly used in KNN and K-means.

The standard scores (also called z-scores) of the samples are calculated as z = (x - μ) / σ, where μ is the mean (average) and σ is the standard deviation from the mean.
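A minimal sketch of z-score standardization with scikit-learn's StandardScaler (sample values made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column
X = np.array([[1.0], [5.0], [10.0], [100.0]])

scaler = StandardScaler()            # computes z = (x - mean) / std per feature
X_std = scaler.fit_transform(X)
print(X_std.ravel())                 # standardized values
print(X_std.mean(), X_std.std())     # approximately 0 and 1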

4. Binarize Data (Make Binary) :-
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all values equal to or below it are marked 0. This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful. You can create new binary attributes in Python using scikit-learn's Binarizer class.

# binarization
from sklearn.preprocessing import Binarizer
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html
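A minimal sketch of the Binarizer class in use (the threshold and the probability-like values are made up for illustration):

import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical probability-like feature values
X = np.array([[0.1], [0.4], [0.5], [0.9]])

binarizer = Binarizer(threshold=0.5)   # values > 0.5 become 1, values <= 0.5 become 0
X_bin = binarizer.fit_transform(X)
print(X_bin.ravel())                   # [0. 0. 0. 1.]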

5. Unit Vector :- Scaling is done by considering the whole feature vector of a sample to be of unit length. This is quite useful when dealing with features that have hard boundaries. For example, with image data, the colours can only range from 0 to 255.
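A minimal sketch of unit-vector scaling with scikit-learn's Normalizer, which rescales each sample (row) to unit length (the sample values are made up for illustration):

import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples, e.g. two pixel-intensity features in the range 0-255
X = np.array([[3.0, 4.0],
              [100.0, 200.0]])

normalizer = Normalizer(norm='l2')     # divide each row by its Euclidean (L2) norm
X_unit = normalizer.fit_transform(X)
print(X_unit)                          # [[0.6 0.8], [0.4472 0.8944]]
print(np.linalg.norm(X_unit, axis=1))  # each row now has unit length: [1. 1.]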

You can follow my example Jupyter notebook code here :- github

All the best! Happy coding…

Facebook | Instagram
