# Feature Scaling with Python’s scikit-learn

Oct 7 · 5 min read

One of the primary objectives of normalization is to bring the data close to zero. That makes the optimization problem more “numerically stable”.

Scaling by the mean and standard deviation works best when the data is roughly normally distributed, that is, when most of the data lies sufficiently close to the mean. Shifting the mean to zero then ensures that most components of most data points are close to 0. Specifically, for normally distributed data, about 68% of the scaled values fall between -1 and 1, the band within one standard deviation of the mean on the normal curve.
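The 68% figure can be checked empirically. The following is a minimal sketch, assuming a synthetic feature drawn from a normal distribution; the mean of 50, the standard deviation of 10, and the variable names are illustrative choices, not values from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature: normally distributed with mean 50 and standard deviation 10
x = rng.normal(loc=50, scale=10, size=100_000)

# Shift the mean to zero and divide by the standard deviation
z = (x - x.mean()) / x.std()

# Fraction of scaled values that fall between -1 and 1
fraction_within_one = np.mean(np.abs(z) < 1)
print(f"Fraction within one standard deviation: {fraction_within_one:.3f}")
```

With a sample this large, the printed fraction should land close to 0.68. For data that is far from normal, the fraction can differ noticeably.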

In this post we explore four methods of feature scaling that are implemented in scikit-learn:

• `StandardScaler`
• `MinMaxScaler`
• `RobustScaler`
• `Normalizer`

# Standard Scaler

The `StandardScaler` works best when the data within each feature is roughly normally distributed. It rescales each feature so that its distribution is centered around 0, with a standard deviation of 1.

The mean and standard deviation are calculated for each feature, and the feature is then scaled based on:

$$z_i = \frac{x_i - \operatorname{mean}(x)}{\operatorname{stdev}(x)}$$

If data is not normally distributed, this is not the best scaler to use.
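To make the calculation concrete, here is a small sketch that standardizes each column of an array by hand with NumPy. The array `X` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical data: 3 samples (rows) of 2 features (columns) on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Per-feature statistics are computed down each column (axis=0)
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the column mean, then divide by the column standard deviation
X_scaled = (X - mean) / std

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, even though they started on very different scales. `StandardScaler` also uses the population standard deviation, so calling `fit_transform` on the same array should give matching values.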

Let’s take a look at it in action:

```python
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter-only magic: render plots inline in the notebook
%matplotlib inline
matplotlib.style.use('ggplot')
```

```python
np.random.seed(1)

# Three normally distributed features with different means and spreads
df = pd.DataFrame({
    'x1': np.random.normal(0, 2, 10000),
    'x2': np.random.normal(5, 3, 10000),
    'x3': np.random.normal(-5, 5, 10000)
})

scaler = preprocessing.StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(df['x1'], ax=ax1)
sns.kdeplot(df['x2'], ax=ax1)
sns.kdeplot(df['x3'], ax=ax1)

ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_df['x1'], ax=ax2)
sns.kdeplot(scaled_df['x2'], ax=ax2)
sns.kdeplot(scaled_df['x3'], ax=ax2)

plt.show()
```

All features are now on the same scale relative to one another.

# Min-Max Scaler

The `MinMaxScaler` is probably the best-known scaling algorithm. For each feature, it applies the following formula:

$$\frac{x_i - \min(x)}{\max(x) - \min(x)}$$

It essentially shrinks the range of each feature so that it lies between 0 and 1. This holds even when the feature contains negative values; a different target range, such as -1 to 1, can be requested through the `feature_range` parameter.

This scaler can be a better choice in cases where the standard scaler does not work so well, for example when the distribution is not Gaussian or the standard deviation is very small.

However, it is sensitive to outliers, so if there are outliers in the data, you might want to consider the `RobustScaler` below.
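The sensitivity to outliers is easy to see on a toy example. The sketch below applies the min-max formula by hand; the helper `min_max_scale` and the sample values are hypothetical, introduced only for illustration.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to the range [0, 1] with the min-max formula."""
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical feature values, with and without a single large outlier
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
with_outlier = np.append(values, 1000.0)

scaled_clean = min_max_scale(values)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_clean)    # spread evenly across the 0 to 1 range
print(scaled_outlier)  # the original five values are squeezed near 0
```

A single extreme value sets the maximum, so the rest of the data ends up occupying only a tiny slice of the 0 to 1 range.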

For now, let’s see the `MinMaxScaler` in action:

```python
df = pd.DataFrame({
    # positive skew
    'x1': np.random.chisquare(8, 1000),
    # negative skew
    'x2': np.random.beta(8, 2, 1000) * 40,
    # no skew
    'x3': np.random.normal(50, 3, 1000)
})

scaler = preprocessing.MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(df['x1'], ax=ax1)
sns.kdeplot(df['x2'], ax=ax1)
sns.kdeplot(df['x3'], ax=ax1)

ax2.set_title('After Min-Max Scaling')
sns.kdeplot(scaled_df['x1'], ax=ax2)
sns.kdeplot(scaled_df['x2'], ax=ax2)
sns.kdeplot(scaled_df['x3'], ax=ax2)

plt.show()
```

Notice that the skewness of the distribution is maintained but the 3 distributions are brought into the same scale so that they overlap.

# Robust Scaler

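The `RobustScaler` uses a similar method to the Min-Max scaler, but it centers each feature on its median and scales by the interquartile range rather than the min and max, so that it is robust to outliers. For each feature, it follows the formula:

$$\frac{x_i - Q_2(x)}{Q_3(x) - Q_1(x)}$$

where $Q_1$, $Q_2$ (the median), and $Q_3$ are the quartiles of the feature.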

Of course, this means the scaling statistics ignore the extremes of the data, which makes this scaler more suitable when there are outliers.
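As a rough illustration, the median and interquartile range can be computed directly with NumPy. The helper `robust_scale` and the sample values below are hypothetical, and the sketch assumes the default settings of `RobustScaler` (centering on the median, scaling by the 25th to 75th percentile range).

```python
import numpy as np

def robust_scale(x):
    """Center a 1-D array on its median and scale by its interquartile range."""
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    return (x - median) / (q3 - q1)

# Hypothetical feature values with one large outlier at the end
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 1000.0])

scaled = robust_scale(values)
print(scaled)
```

The five ordinary values stay spread out around 0, while the outlier is mapped far away from them instead of dictating the scale of everything else.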

Let’s take a look at this one in action on some data with outliers

```python
x = pd.DataFrame({
    # Distribution with lower outliers
    'x1': np.concatenate([np.random.normal(20, 1, 1000), np.random.normal(1, 1, 25)]),
    # Distribution with higher outliers
    'x2': np.concatenate([np.random.normal(30, 1, 1000), np.random.normal(50, 1, 25)]),
})

scaler = preprocessing.RobustScaler()
robust_scaled_df = scaler.fit_transform(x)
robust_scaled_df = pd.DataFrame(robust_scaled_df, columns=['x1', 'x2'])

scaler = preprocessing.MinMaxScaler()
minmax_scaled_df = scaler.fit_transform(x)
minmax_scaled_df = pd.DataFrame(minmax_scaled_df, columns=['x1', 'x2'])

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(x['x1'], ax=ax1)
sns.kdeplot(x['x2'], ax=ax1)

ax2.set_title('After Robust Scaling')
sns.kdeplot(robust_scaled_df['x1'], ax=ax2)
sns.kdeplot(robust_scaled_df['x2'], ax=ax2)

ax3.set_title('After Min-Max Scaling')
sns.kdeplot(minmax_scaled_df['x1'], ax=ax3)
sns.kdeplot(minmax_scaled_df['x2'], ax=ax3)

plt.show()
```

Notice that after Robust scaling, the distributions are brought into the same scale and overlap, but the outliers remain outside of the bulk of the new distributions.

With Min-Max scaling, however, the outliers are squeezed into the 0–1 range along with everything else, which keeps the two normal distributions apart.

# Normalizer

The `Normalizer` scales each sample (row) by dividing every value in it by the sample's magnitude in n-dimensional space, where n is the number of features.

Say your features were x, y, and z Cartesian coordinates. Your scaled value for x would be:

$$\frac{x_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}$$

With the default Euclidean norm, each point is now exactly 1 unit from the origin of this Cartesian coordinate system.
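The row-wise nature of this scaling can be sketched by hand with NumPy. The points below are hypothetical, and the sketch assumes the Euclidean (L2) norm, which is the default for `Normalizer`.

```python
import numpy as np

# Hypothetical points: each row is one sample with x, y, z coordinates
points = np.array([[3.0, 4.0, 0.0],
                   [1.0, 2.0, 2.0],
                   [-6.0, 0.0, 8.0]])

# Euclidean length of each row, kept as a column so it broadcasts across the row
norms = np.linalg.norm(points, axis=1, keepdims=True)

# Divide every coordinate by the length of its own row
unit_points = points / norms

print(unit_points)
print(np.linalg.norm(unit_points, axis=1))  # length of each scaled row
```

Unlike the other scalers in this post, the statistics here are computed per row rather than per column, so each sample is scaled independently of every other sample.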

```python
from mpl_toolkits.mplot3d import Axes3D

df = pd.DataFrame({
    'x1': np.random.randint(-100, 100, 1000).astype(float),
    'y1': np.random.randint(-80, 80, 1000).astype(float),
    'z1': np.random.randint(-150, 150, 1000).astype(float),
})

scaler = preprocessing.Normalizer()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

fig = plt.figure(figsize=(9, 5))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')

ax1.scatter(df['x1'], df['y1'], df['z1'])
ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])

plt.show()
```

Note that the points now all lie on the surface of a sphere of radius 1 centered on the origin. Also, the axes that were previously on different scales now all share one scale.

Written by

## Towards AI

