# Feature Scaling in Scikit-learn

**Feature scaling** is a method used to standardize the range of independent variables or **features** of data. In data processing, it is also known as data **normalization** and is generally performed during the data preprocessing step.

# Why Scaling

Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. This is a problem because many machine learning algorithms use the Euclidean distance between two data points in their computations.

If left alone, these algorithms take in only the magnitude of the features and ignore the units. The results would then vary greatly between units, for example 5 kg versus 5000 g. Features with high magnitudes weigh in far more in the distance calculations than features with low magnitudes. To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
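To make the unit problem concrete, here is a small sketch using only the Python standard library. The two "people" and their measurements are made-up values for illustration; the point is just that rewriting the weight in grams makes it swamp the height difference.

```python
import math

# Two hypothetical people described by (weight, height in metres).
# The same weights are written once in kilograms and once in grams.
a_kg, b_kg = (60.0, 1.60), (61.0, 1.90)
a_g, b_g = (60000.0, 1.60), (61000.0, 1.90)

# Euclidean distance between the two people in each unit system
dist_kg = math.dist(a_kg, b_kg)
dist_g = math.dist(a_g, b_g)

print(f"distance with weight in kg: {dist_kg:.4f}")
print(f"distance with weight in g:  {dist_g:.4f}")
```

Nothing about the people changed between the two lines, yet in grams the distance is almost entirely the weight difference and the height difference barely registers.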

In this post we explore four methods of feature scaling that are implemented in scikit-learn:

`StandardScaler`

`MinMaxScaler`

`RobustScaler`

`Normalizer`

# Standard Scaler

The `StandardScaler` scales each feature so that its distribution is centered around 0 with a standard deviation of 1. It works best when the data within each feature is roughly normally distributed.

The mean and standard deviation are calculated for the feature, and then the feature is scaled based on:

$$z_i = \frac{x_i - \text{mean}(x)}{\text{stdev}(x)}$$

If the data is not normally distributed, this is not the best scaler to use.

**Syntax :**

`from sklearn.preprocessing import StandardScaler`

`scaler = StandardScaler()`

`scaled_df = scaler.fit_transform(df)`

Here `df` is your data.
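As a quick check of what the scaler does, the sketch below runs it on a tiny made-up array and compares the result with the same calculation done by hand in NumPy. The array `X` and the name `X_manual` are illustrative assumptions, not part of any dataset used in this post.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same result by hand: subtract the column mean, divide by the column
# standard deviation. StandardScaler uses the population standard
# deviation, which is NumPy's default (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```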

# Min-Max Scaler

The `MinMaxScaler` is probably the most famous scaling algorithm, and uses the following formula for each feature:

$$x_{scaled} = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

It essentially shrinks the range such that it is now between 0 and 1. In scikit-learn a different target range, such as -1 to 1, can be requested through the `feature_range` parameter.

This scaler works better for cases in which the standard scaler might not work so well. If the distribution is not Gaussian or the standard deviation is very small, the min-max scaler works better.

However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider the `RobustScaler` below.

**Syntax :**

`from sklearn import preprocessing`

scaler = preprocessing.MinMaxScaler()

scaled_df = scaler.fit_transform(df)
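The following sketch shows the scaler on a small made-up array that includes negative values. With scikit-learn's default settings the output lands in 0 to 1 regardless of sign; a different range has to be asked for through the `feature_range` parameter. The array and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up); the first feature contains negative values.
X = np.array([[-5.0, 10.0],
              [0.0, 20.0],
              [5.0, 40.0]])

# Default target range is (0, 1)
X_01 = MinMaxScaler().fit_transform(X)

# Explicitly request the range (-1, 1) instead
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

# The default behaviour computed by hand, column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_01)
print(X_pm1)
```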

# Robust Scaler

The `RobustScaler` uses a similar method to the min-max scaler, but it uses the interquartile range rather than the minimum and maximum, so that it is robust to outliers. In scikit-learn it centers each feature on its median, so it follows the formula:

$$x_{scaled} = \frac{x_i - \text{median}(x)}{Q_3(x) - Q_1(x)}$$

for each feature.

Of course, this means it uses less of the data for scaling, so it is more suitable when there are outliers in the data.

**Syntax :**

`from sklearn import preprocessing`

scaler = preprocessing.RobustScaler()

robust_scaled_df = scaler.fit_transform(df)
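To see the effect of an outlier, the sketch below scales one made-up feature containing a single extreme value with both the robust and the min-max scaler. With its default settings, scikit-learn's `RobustScaler` subtracts the median and divides by the interquartile range, which the hand calculation reproduces. All values and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# A single made-up feature with one extreme outlier (1000).
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

X_robust = RobustScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

# Default RobustScaler behaviour by hand: centre on the median,
# divide by the interquartile range (Q3 - Q1).
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
X_manual = (X - median) / (q3 - q1)

print("robust :", X_robust.ravel())
print("min-max:", X_minmax.ravel())
```

In the min-max output the five ordinary values are squeezed into a tiny interval near 0 because the outlier defines the maximum, while the robust output keeps them spread out.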

# Normalizer

The `Normalizer` scales each sample (row) by dividing every value in it by the sample's magnitude in *n*-dimensional space, for *n* features. Say your features were x, y, and z Cartesian coordinates; your scaled value for x would be:

$$x_{scaled} = \frac{x_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}$$

Each non-zero point now lies exactly 1 unit from the origin of this Cartesian coordinate system.

**Syntax :**

`from sklearn import preprocessing`

scaler = preprocessing.Normalizer()

scaled_df = scaler.fit_transform(df)
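Unlike the other three scalers, `Normalizer` works across each row rather than down each column. The sketch below, on a made-up array of three points, checks that every row ends up with length 1 under the default L2 norm. The array and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three made-up points with x, y, z coordinates (one point per row).
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 2.0, 2.0],
              [10.0, 0.0, 0.0]])

# The default norm is "l2", i.e. the Euclidean length of each row
X_norm = Normalizer().fit_transform(X)

# Length of each row after normalization
row_lengths = np.linalg.norm(X_norm, axis=1)

# The same operation by hand: divide each row by its own length
X_manual = X / np.linalg.norm(X, axis=1, keepdims=True)

print(X_norm)
print(row_lengths)
```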

# Where and when to Scale

The rule of thumb I follow here: if an algorithm computes distances or assumes normality, **scale your features!**

Some examples of algorithms where feature scaling does or does not matter:

- **k-nearest neighbors** with a Euclidean distance measure is sensitive to magnitudes, and hence all features should be scaled so that they weigh in equally.
- Scaling is critical while performing **Principal Component Analysis (PCA)**. PCA tries to get the features with maximum variance, and the variance is high for high-magnitude features. This skews the PCA towards high-magnitude features.
- We can speed up **gradient descent** by scaling. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
- **Tree-based models** are not distance-based models and can handle varying ranges of features. Hence, scaling is not required while modeling trees.
- Algorithms like **Linear Discriminant Analysis (LDA)** and **Naive Bayes** are by design equipped to handle this and give weights to the features accordingly. Performing feature scaling for these algorithms may not have much effect.
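For the k-nearest neighbors case, one way to try this out is to fit the same classifier with and without a `StandardScaler` in front of it. The sketch below is only an illustration under my own assumptions: it uses the wine dataset bundled with scikit-learn, a 70/30 split with a fixed random seed, and 5 neighbors, none of which come from this post. Putting the scaler in a pipeline means it is fitted on the training rows only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled example dataset: 13 numeric features on very different scales
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k-NN on the raw features
unscaled_knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# The same k-NN, but with standardization applied first; the pipeline
# fits the scaler on the training data and reuses it at prediction time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

unscaled_score = unscaled_knn.score(X_test, y_test)
scaled_score = scaled_knn.score(X_test, y_test)

print(f"accuracy without scaling: {unscaled_score:.3f}")
print(f"accuracy with scaling:    {scaled_score:.3f}")
```

Comparing the two printed accuracies gives a feel for how much a distance-based model depends on the features being on a common scale.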

**Conclusion :**

I hope this article has given you an idea of what data preprocessing is, how it works with the different scaling techniques, and which technique to use where.
