Standardization Vs Normalization in Machine Learning

Vaishno Kumar
5 min read · Oct 19, 2021


In this article we look at standardization and normalization: what they are, and where, when, and why to use them with real-world datasets.

Standardization-

In machine learning, standardization is a technique where the values are centered around the mean with a unit standard deviation (µ = 0 and σ = 1). In other words, the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.

The formula for standardization:

X’ = ( Xi - µ ) / σ

where µ is the mean, σ is the standard deviation, and Xi is an input feature value.
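The formula can be applied directly with NumPy. This is a minimal sketch using a made-up feature column (the values are not from the article's dataset):

```python
import numpy as np

# A hypothetical feature column, used only to illustrate the formula
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = x.mean()      # µ = 30.0
sigma = x.std()    # σ ≈ 14.142

# X' = (Xi - µ) / σ
x_std = (x - mu) / sigma

print(x_std.mean())  # ≈ 0.0
print(x_std.std())   # 1.0
```

After the transformation the column has mean 0 and standard deviation 1, which is exactly the property standardization promises.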

Example with a real-world dataset -

Here we take a simple dataset, keep only the numeric columns, and apply train_test_split.

Simple data sets
X_train and X_test size after applying train_test_split
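A sketch of this step, assuming a small made-up dataset in place of the one shown in the figures (the column names and values are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A toy dataset standing in for the one in the figures above
df = pd.DataFrame({
    "Age":       [25, 32, 47, 51, 62, 23, 36, 44],
    "Salary":    [30000, 48000, 52000, 61000, 75000, 28000, 45000, 58000],
    "Purchased": [0, 1, 1, 0, 1, 0, 1, 1],
})

X = df[["Age", "Salary"]]  # numeric feature columns only
y = df["Purchased"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)
```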

For standardization, scikit-learn provides the sklearn.preprocessing.StandardScaler class.

Standard Scaler

After applying StandardScaler, the dataset looks like the one given below.

The left figure shows the data before scaling and the right figure shows it after scaling

In the figure above, the mean is 0 and the standard deviation is 1, which matches the definition of standardization.
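The scaling step can be sketched as follows, again with toy train/test splits standing in for X_train and X_test above. Note that the scaler is fitted on the training data only and then reused on the test data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy splits standing in for X_train / X_test from the earlier step
X_train = np.array([[25, 30000], [32, 48000], [47, 52000],
                    [51, 61000], [62, 75000], [23, 28000]], dtype=float)
X_test = np.array([[36, 45000], [44, 58000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns µ and σ from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the same µ and σ

print(X_train_scaled.mean(axis=0))  # ≈ [0, 0]
print(X_train_scaled.std(axis=0))   # [1, 1]
```

Each column of the scaled training data ends up with mean 0 and standard deviation 1, as in the figure described above.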

Effect of Standardization -

As the scatter plots before and after scaling show, standardization changes only the scale of the data; the shape of the data and the relationships between points are unchanged.

Scatter Plot

After standardization, the probability density function graph looks like the one given below.

PDF graph

Why is standardization important?

In machine learning, our aim is to improve the model and its accuracy score. Standardization helps many algorithms reach better scores and build a better predictive model.

Accuracy Score

In the figure above there are two accuracy_score values, computed without and with standardization respectively, and the score improves noticeably with standardization, which is exactly what we want.
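A comparison of this kind can be sketched as below. This example uses scikit-learn's built-in wine dataset and a k-nearest-neighbors classifier rather than the article's own data, so the exact numbers will differ, but scale-sensitive models like KNN typically benefit from standardization:

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Illustrative data: the wine dataset has features on very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Accuracy without standardization
knn = KNeighborsClassifier().fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn.predict(X_test))

# Accuracy with standardization (scaler fitted on the training split only)
scaler = StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(scaler.transform(X_test)))

print(f"without scaling: {acc_raw:.3f}, with scaling: {acc_scaled:.3f}")
```

Because KNN relies on distances, the feature with the largest raw range dominates unless the columns are put on a common scale first.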

Normalization -

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to use a common scale, without distorting differences in the ranges of values or losing information. Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

The formula for normalization:

X’ = ( X - Xmin) / (Xmax - Xmin)

Here, Xmax and Xmin are the maximum and the minimum values of the feature respectively.

  • When the value of X is the minimum value in the column, the numerator will be 0, and hence X’ is 0
  • On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator and thus the value of X’ is 1
  • If the value of X is between the minimum and the maximum value, then the value of X’ is between 0 and 1
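The three cases above can be verified directly with NumPy, using a made-up feature column for illustration:

```python
import numpy as np

# A hypothetical feature column to illustrate the formula
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0])

x_min, x_max = x.min(), x.max()

# X' = (X - Xmin) / (Xmax - Xmin)
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
```

The minimum maps to 0, the maximum maps to 1, and everything in between lands inside (0, 1), matching the three bullet points.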

Example with a real-world dataset -

Here we take a simple dataset and keep only the numeric columns.

Datasets

Apply train_test_split and the MinMaxScaler class, which comes from sklearn.

After running the code, the output looks like the one given below.

The left figure shows the data before normalization and the right figure shows it after normalization
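A sketch of this step, with the same toy splits used earlier standing in for X_train and X_test:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy splits standing in for X_train / X_test from the earlier step
X_train = np.array([[25, 30000], [32, 48000], [47, 52000],
                    [51, 61000], [62, 75000], [23, 28000]], dtype=float)
X_test = np.array([[36, 45000], [44, 58000]], dtype=float)

scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)  # learns Xmin and Xmax per column
X_test_norm = scaler.transform(X_test)        # reuses the training Xmin and Xmax

print(X_train_norm.min(axis=0))  # [0. 0.]
print(X_train_norm.max(axis=0))  # [1. 1.]
```

Every training column now spans exactly [0, 1]. Test values are rescaled with the training minimum and maximum, so they are not guaranteed to stay inside [0, 1].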

Effect of Normalization -

Have a look at the graphs and see whether anything changes between before and after normalization.

Scatter Plot

As the scatter plots before and after scaling show, normalization changes only the scale of the data; the shape of the data is unchanged.

After normalization, the probability density function graph looks like the one given below.

PDF Graph

Standardization vs Normalization -

  • Normalization is good to use when you know that the distribution of your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data like K-Nearest Neighbors and Neural Networks.
  • Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution, although this does not have to be strictly true. Also, unlike normalization, standardization does not have a bounding range, so outliers in your data affect it less severely than they affect Min-Max scaling, where a single extreme value compresses everything else into a narrow band.

Conclusion -

In the end, the choice between normalization and standardization depends on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always fit your model to the raw, normalized, and standardized data and compare the performance to see which works best.

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.
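One convenient way to guarantee the scaler never sees the test data is scikit-learn's Pipeline, which fits all steps on the training data only. A minimal sketch, again using the built-in wine dataset for illustration:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on X_train only, then applies it to X_test,
# so no statistics from the test data leak into training
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```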

Learning and Resources:

[1] For code and concept: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

[2] For code and concept: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

[3] For code and concept: https://www.analyticsvidhya.com/blog/
