Feature Scaling in Scikit-learn

Shobhit Srivastava
Feb 11, 2019 · 4 min read

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

Why Scaling

Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since most machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.

If left alone, these algorithms only take in the magnitude of the features, neglecting the units. The results would vary greatly between different units, for example 5 kg versus 5000 g. Features with high magnitudes will weigh in far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude, which can be achieved by scaling.
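To make this concrete, here is a minimal sketch (the weights and heights are made up for illustration) of how the choice of units alone can swamp a Euclidean distance calculation:

import numpy as np

# Two people described by (weight, height in metres)
a_grams = np.array([70000.0, 1.80])
b_grams = np.array([72000.0, 1.60])
print(np.linalg.norm(a_grams - b_grams))  # ~2000.0, dominated entirely by the weight

# The same two people with weight expressed in kilograms
a_kg = np.array([70.0, 1.80])
b_kg = np.array([72.0, 1.60])
print(np.linalg.norm(a_kg - b_kg))        # ~2.01, the height difference now matters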

In this post we explore four methods of feature scaling that are implemented in scikit-learn:

  • StandardScaler
  • MinMaxScaler
  • RobustScaler
  • Normalizer

Standard Scaler

The StandardScaler assumes your data is normally distributed within each feature and will scale each feature such that its distribution is centered around 0, with a standard deviation of 1.

The mean and standard deviation are calculated for each feature, and the feature is then scaled as:

x_scaled = (xi − mean(x)) / stdev(x)

If data is not normally distributed, this is not the best scaler to use.

Syntax:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
# where df is your DataFrame of features
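As a quick sanity check (a small sketch with made-up numbers, not from the original article), each transformed column should come out with mean ≈ 0 and standard deviation ≈ 1:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column DataFrame with very different magnitudes
df = pd.DataFrame({"weight_g": [60000, 72000, 85000], "height_m": [1.5, 1.7, 1.9]})
scaled_df = StandardScaler().fit_transform(df)

print(scaled_df.mean(axis=0))  # approximately [0. 0.]
print(scaled_df.std(axis=0))   # approximately [1. 1.]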

Min-Max Scaler

The MinMaxScaler is probably the most well-known scaling algorithm, and follows this formula for each feature:

x_scaled = (xi − min(x)) / (max(x) − min(x))

It essentially shrinks the range of each feature so that it now lies between 0 and 1 (or any other range passed via the feature_range parameter).

This scaler works well in cases where the StandardScaler might not: if the distribution is not Gaussian, or the standard deviation is very small, the Min-Max scaler is often the better choice.

However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider the Robust Scaler below.

Syntax:

from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaled_df = scaler.fit_transform(df)
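For illustration (made-up numbers, not from the article), every column is mapped onto the 0 to 1 range, with the column minimum landing at 0 and the maximum at 1:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({"x": [1.0, 5.0, 10.0], "y": [-3.0, 0.0, 3.0]})
scaled_df = preprocessing.MinMaxScaler().fit_transform(df)

print(scaled_df)
# approximately:
# [[0.    0.  ]
#  [0.444 0.5 ]
#  [1.    1.  ]]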

Robust Scaler

The RobustScaler uses a similar method to the Min-Max scaler, but it centers on the median and scales by the interquartile range rather than the min and max, so that it is robust to outliers. For each feature it therefore follows the formula:

x_scaled = (xi − median(x)) / (Q3(x) − Q1(x))

Of course, this means it uses less of the data for scaling, so it is more suitable when there are outliers in the data.

Syntax:

from sklearn import preprocessing
scaler = preprocessing.RobustScaler()
robust_scaled_df = scaler.fit_transform(df)
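To see the difference (an illustrative sketch with a made-up outlier, not from the article), compare how the RobustScaler and the Min-Max scaler treat the same column when it contains one extreme value:

import numpy as np
from sklearn import preprocessing

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

robust_scaled = preprocessing.RobustScaler().fit_transform(x)
minmax_scaled = preprocessing.MinMaxScaler().fit_transform(x)

print(robust_scaled.ravel())  # inliers stay spread out: [-1. -0.5 0. 0.5 48.5]
print(minmax_scaled.ravel())  # inliers squashed near 0: [0. 0.01 0.02 0.03 1.]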

Normalizer

The Normalizer works row-wise rather than column-wise: it scales each sample by dividing it by its magnitude (its Euclidean length) in n-dimensional space, where n is the number of features. Say your features were x, y, and z Cartesian co-ordinates; your scaled value for x would be:

X = xi / np.sqrt(xi**2 + yi**2 + zi**2)

Each point then lies exactly 1 unit from the origin of this Cartesian coordinate system.

Syntax:

from sklearn import preprocessing
scaler = preprocessing.Normalizer()
scaled_df = scaler.fit_transform(df)
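For illustration (made-up rows, not from the article), note that the Normalizer works sample by sample, so every row ends up with Euclidean length 1:

import numpy as np
from sklearn import preprocessing

df = np.array([[3.0, 4.0, 0.0],
               [1.0, 1.0, 1.0]])

scaled_df = preprocessing.Normalizer().fit_transform(df)
print(scaled_df)                          # [[0.6 0.8 0. ], [0.577 0.577 0.577]]
print(np.linalg.norm(scaled_df, axis=1))  # [1. 1.]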

Where and when to Scale

The rule of thumb I follow here: if an algorithm computes distances or assumes normality, scale your features!

Some examples of algorithms where feature scaling matters are:

  • k-nearest neighbors with a Euclidean distance measure is sensitive to magnitudes, so the features should be scaled for all of them to weigh in equally (see the pipeline sketch after this list).
  • Scaling is critical while performing Principal Component Analysis (PCA). PCA tries to find the directions of maximum variance, and variance is higher for high-magnitude features, which skews PCA towards those features.
  • We can speed up gradient descent by scaling. The parameters descend quickly on small ranges and slowly on large ranges, so they oscillate inefficiently down to the optimum when the features are very uneven.
  • Tree-based models are not distance-based and can handle varying ranges of features, so scaling is not required when modeling trees.
  • Algorithms like Linear Discriminant Analysis (LDA) and Naive Bayes are by design equipped to handle this and weight the features accordingly; performing feature scaling with these algorithms may not have much effect.
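As referenced in the k-nearest neighbors point above, here is a minimal sketch of the usual pattern: put the scaler and a distance-based model in a single pipeline, so the scaling parameters are learned from the training data only. The wine dataset and the train/test split are illustrative choices, not part of the original article.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a dataset whose features have very different ranges
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN without scaling vs. kNN with a StandardScaler in front of it
unscaled_knn = KNeighborsClassifier().fit(X_train, y_train)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

print("accuracy without scaling:", unscaled_knn.score(X_test, y_test))
print("accuracy with scaling:   ", scaled_knn.score(X_test, y_test))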

Conclusion:

I hope this article has given you an idea of what feature scaling is, how the different techniques work, and which one to use where.

Want to know more about data normalization? Do read more on Analytics Vidhya: https://www.analyticsvidhya.com

