# Scaling the data for Machine Learning: Standardisation and Normalisation

Jul 15, 2020 · 3 min read

I am currently taking a foundation course on Machine Learning (ML) with Microsoft Azure Cloud, and I am revising many of the essential steps for preparing data before model training.

One fundamental data pre-processing step that often gets overlooked is feature scaling. Feature scaling is a technique for bringing the magnitudes of the features in a dataset into a common range, so that all features are represented at a comparable level of magnitude.

Some ML algorithms are very sensitive to feature scaling, while others are unaffected by it.

For example, one feature might be expressed in grams and another in kilometres. Compared purely as numbers, 3,000 (grams) would be treated as much larger than 5 (km), even though the two quantities are not comparable. This is problematic because many ML algorithms (e.g. KNN, K-means and SVM) rely on Euclidean distances, which are computed from the features’ numeric magnitudes regardless of their units. Data needs to be scaled before using a distance-based algorithm so that all features contribute equally to the final result.
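To make the grams-versus-kilometres problem concrete, here is a minimal sketch in plain Python. The three samples and the `euclidean` helper are made up for illustration; they are not from the course material.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each sample is (weight in grams, distance in kilometres); values are invented.
a = (3000.0, 5.0)
b = (3100.0, 5.0)   # 100 g heavier than a, same distance
c = (3000.0, 50.0)  # same weight as a, 45 km further away

d_ab = euclidean(a, b)  # driven only by the 100 g difference
d_ac = euclidean(a, c)  # driven only by the 45 km difference

print(d_ab, d_ac)
```

On the raw values, the 100 g difference puts `b` further from `a` than the 45 km difference puts `c`, simply because grams produce larger numbers.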

Furthermore, ML methods that use the gradient descent algorithm converge more quickly and smoothly towards the minimum if all the features are expressed on a similar scale.

Tree-based algorithms are insensitive to feature scaling, because a decision tree splits each node on the basis of a single feature (thus the scale of the other features does not affect that node split). Rescaling the chosen feature only moves the split threshold; it does not change which samples fall on each side.
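The following toy example sketches why a single-feature split is unaffected by rescaling. It is a deliberately simplified decision stump, not how any particular library builds trees: `best_split` is a hypothetical helper that counts misclassifications instead of using Gini impurity or entropy, and it assumes binary labels and distinct feature values.

```python
def best_split(values, labels):
    """Try every split of the sorted samples and return (errors, left_indices)
    for the split with the fewest misclassifications (binary 0/1 labels)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = None
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            # Each side predicts its majority class; the minority are errors.
            errors += min(ones, len(side) - ones)
        if best is None or errors < best[0]:
            best = (errors, frozenset(left))
    return best

weights_g = [2500.0, 3000.0, 3500.0, 4000.0, 4500.0]  # invented values
labels = [0, 0, 0, 1, 1]

# Min-max scale the same feature to the range [0, 1].
lo, hi = min(weights_g), max(weights_g)
weights_scaled = [(w - lo) / (hi - lo) for w in weights_g]

split_raw = best_split(weights_g, labels)
split_scaled = best_split(weights_scaled, labels)
print(split_raw == split_scaled)
```

Because scaling preserves the ordering of the values, the same samples end up on each side of the split before and after scaling.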

There are different methods to perform data scaling; here I will discuss data standardisation and normalisation.

# Data Standardisation

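Standardisation rescales the features so that they all have a mean of 0 and a standard deviation of 1. This transformation is carried out by converting each feature value to its Z-score:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the original value, $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.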
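A minimal sketch of this transformation using only Python's standard library is shown below. The `standardise` helper and the sample weights are illustrative. The sketch uses the population standard deviation, which, to my knowledge, is also what scikit-learn's `StandardScaler` does.

```python
import statistics

def standardise(values):
    """Convert each value to its Z-score: (value - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Note: a constant feature has std == 0 and would need special handling.
    return [(v - mean) / std for v in values]

weights_g = [2500.0, 3000.0, 3500.0, 4000.0, 4500.0]  # invented values
z_scores = standardise(weights_g)

print(z_scores)
print(statistics.fmean(z_scores), statistics.pstdev(z_scores))
```

After the transformation, the feature has a mean of (approximately) 0 and a standard deviation of (approximately) 1, whatever its original unit was.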

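Standardisation, which makes features unitless, is important for many ML algorithms that are trained with gradient descent (an optimisation algorithm used in regression, SVM, neural networks, etc.). Looking at the gradient descent equation used to update the features’ weights (θj), written here for linear regression, it is evident that certain weights may get updated faster than others because the magnitude of the feature (xj) is included in the weights’ calculation:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

where $\alpha$ is the learning rate, $m$ is the number of samples and $h_\theta(x^{(i)})$ is the model's prediction for sample $i$.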

Data standardisation helps gradient descent update all the features’ weights at a similar rate.
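The effect of feature magnitude on the weight updates can be checked numerically. The sketch below assumes a linear model with squared-error loss and no intercept (omitted for brevity), for which the gradient of weight j is the average of (prediction − target) × xj. The data and helper names are invented for illustration.

```python
import statistics

def gradient(X, y, theta):
    """Gradient of the mean squared-error loss (with the usual 1/2 factor)
    for a linear model without intercept."""
    m = len(X)
    grads = [0.0] * len(theta)
    for row, target in zip(X, y):
        error = sum(t * x for t, x in zip(theta, row)) - target
        for j, x_j in enumerate(row):
            grads[j] += error * x_j / m
    return grads

def standardise_columns(X):
    """Z-score every column of X (population standard deviation)."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((x - mu) / sd for x, mu, sd in zip(row, means, stds))
            for row in X]

# Columns: (weight in grams, distance in kilometres); values are invented.
X_raw = [(2500.0, 2.0), (3000.0, 5.0), (3500.0, 3.0), (4000.0, 8.0)]
y = [1.0, 2.0, 2.0, 4.0]
theta = [0.0, 0.0]

g_raw = gradient(X_raw, y, theta)
g_std = gradient(standardise_columns(X_raw), y, theta)

print(g_raw)  # the grams component is far larger than the km component
print(g_std)  # both components have a similar magnitude
```

With the raw features, the first step changes the weight of the grams feature hundreds of times more than the weight of the kilometres feature. After standardisation, the two gradient components are of comparable size.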

# Data Normalisation

This scaling technique is also called Min-Max Scaling. Features’ values get transformed so they range between 0 and 1. After this transformation, the minimum value of the feature will be equal to 0 and the maximum to 1.
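Min-max scaling is commonly written as x' = (x − min) / (max − min), where min and max are the smallest and largest values of the feature. A minimal sketch in plain Python follows; the `min_max_scale` helper and the sample distances are illustrative.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo  # Note: zero for a constant feature, which needs special handling.
    return [(v - lo) / span for v in values]

distances_km = [5.0, 12.0, 20.0, 50.0]  # invented values
scaled = min_max_scale(distances_km)

print(scaled)
```

Because the range is set by the minimum and maximum alone, a single extreme value (such as the 50 km sample here) compresses the remaining values towards 0.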

Normalisation is most useful when the data does not follow a Gaussian distribution, and it is best applied with algorithms that do not assume one (e.g. KNN and neural networks).

[Image (from Quora): how standardisation and normalisation transform the data]


## Let’s Deploy Data.

Let’s dive deeper into the world of machine learning!
