Methods to scale numerical features

StandardScaler, MinMaxScaler, Winsorizing, and RobustScaler explained

Mehul Gupta
Data Science in your pocket



While working on data science projects, you must have scaled your numerical features to some particular range at least once, using methods like StandardScaler or Min-Max Scaler (names you have probably heard), while handling features with different scales. This post is dedicated to understanding:

Why Scaling is important

Different types of Scaling

Examples using a dummy dataset

Before jumping into anything complex, let’s start off with the root question,

What is Scaling?

It is nothing but bringing the features of a dataset onto a common scale. The target range is usually [0,1] or [-1,1], though some methods produce unbounded values.

But why do we need it? The reasons below should suffice.

While training a model, if the magnitude of a feature is high, its gradient will also be very high, leading to big updates to the model’s weights during training and hence slow convergence.

Some models like KMeans or KNN (which depend on distances) are very sensitive to features of high magnitude (say, income), and their results may end up driven entirely by such features, overshadowing features with relatively smaller magnitudes (like age).
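For instance, here is a quick sketch of how a feature on an income scale dominates a Euclidean distance (the numbers are made up for illustration):

import numpy as np

# two people described by (age, income)
p1 = np.array([25.0, 50_000.0])
p2 = np.array([60.0, 51_000.0])

# the 35-year age gap barely registers next to the income gap
print(np.linalg.norm(p1 - p2))  # ~1000.6, driven almost entirely by income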

Before we move on with the discussion of the different scaling methods, let’s take a look at the dataset we will use to test them.
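The dataset itself was shown as an image, so here is a minimal stand-in, assuming three numeric columns 'a', 'b', and 'c' on very different scales (these are the column names used by the code snippets below):

import numpy as np
import pandas as pd

# a dummy DataFrame with columns on three different scales
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'a': rng.normal(50, 10, 500),      # tens
    'b': rng.normal(5000, 1500, 500),  # thousands
    'c': rng.normal(0.5, 0.1, 500),    # fractions
})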

Looking at the distribution of the 3 columns

As evident, the 3 variables are on different scales. So let’s see the different methods for bringing these variables onto a similar scale.

StandardScaler/ Z-Score Normalization →

x_scaled(i) = (x(i) - mean(X)) / standard_deviation(X)

So, to get the scaled value, we subtract the variable’s mean from the original value and then divide by its standard deviation. A few key features of StandardScaler:

  • It is unbounded, i.e. the scaled values are not restricted to a particular range.
  • It helps with outlier detection as well: if |scaled_value| is greater than 3, the value is a potential outlier (see the sketch below).
  • The scaled data has mean = 0 and variance = 1.
  • It works best for normally distributed variables.
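As a quick illustration of the outlier rule from the list above, a minimal sketch with made-up data:

import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 10, 500), 500.0)  # 500 is an obvious outlier

# manual z-score: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# flag values whose absolute z-score exceeds 3
print(x[np.abs(z) > 3])  # [500.]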

Let’s see the Python implementation and the output

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler

# fit on the data and transform it to z-scores
scaler = StandardScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['a', 'b', 'c'])

# one line plot per scaled column
fig, axs = plt.subplots(1, 3, figsize=(20, 4))
for x, y in enumerate(scaled_df.columns):
    axs[x].plot(scaled_df[y])

As evident from the figure, the values are scaled into a range of roughly -3 to 3 (slightly over 3 for column ‘b’).

Min-Max Scaler →

The most commonly used scaling method, the min-max scaler uses the below formula for scaling.

x_scaled(i) = (x(i) - min(X)) / (max(X) - min(X))

  • Min-Max Scaler is sensitive to outliers, as the maximum or minimum values are often themselves potential outliers
  • It scales the data to [0,1]
  • It works well for uniformly distributed variables

To counter this outlier sensitivity, what we can do is (sketched in code below)

  • Choose a suitable lower & upper bound (and not the min/max)
  • Follow the above formula for scaling, using these bounds
  • Clip any value that crosses the bounds back into [0,1]

This is called clipping.
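Here is a minimal sketch of min-max scaling with clipping; the function name and the 5th/95th percentile bounds are illustrative choices, not fixed rules:

import numpy as np

def minmax_with_clipping(x, lower_pct=5, upper_pct=95):
    # use chosen percentile bounds instead of the raw min/max
    lower, upper = np.percentile(x, [lower_pct, upper_pct])
    scaled = (x - lower) / (upper - lower)
    # clip anything that falls outside the bounds back into [0, 1]
    return np.clip(scaled, 0.0, 1.0)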

For now, below is the general implementation of MinMax Scaler (without clipping)

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# fit on the data and transform it to the [0, 1] range
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['a', 'b', 'c'])

fig, axs = plt.subplots(1, 3, figsize=(20, 4))
for x, y in enumerate(scaled_df.columns):
    axs[x].plot(scaled_df[y])

Now let’s see the output for the general MinMaxScaler

This method has scaled every variable into the range 0–1

Winsorizing →

Winsorizing is done to curb the effect of outliers in the data while performing Min-Max scaling.

  • Choose lower & upper percentile bounds (like 5th-95th, 10th-90th, etc.)
  • Replace the values lying outside the chosen percentiles with the closest bound percentile value.
  • Apply the Min-Max scaling formula over this newly transformed data.

For example →

Suppose we choose the 10th and 90th percentiles as the lower & upper bounds: any value lying below the 10th percentile (say, at the 4th percentile) or above the 90th percentile (say, at the 97th) will be replaced by the 10th and 90th percentile values respectively. For example, assume we have the following data

1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20

If the bounds selected are the 20th & 80th percentiles, then the 20th percentile = 4 and the 80th = 16

Then the new data is

4,4,4,4,5,6,7,8,9,10,11,12,13,14,15,16,16,16,16,16

Hence, the values lying beyond the chosen lower & upper percentile bounds are replaced by the values at those bounds. Next, we repeat what we did in the Min-Max Scaler.
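A minimal sketch reproducing this toy example with NumPy (the exact percentile values depend on the interpolation method; method='lower', available in NumPy 1.22+, matches the 4 and 16 above):

import numpy as np

data = np.arange(1, 21)  # 1..20

# 20th & 80th percentile bounds; method='lower' picks actual data points
lo, hi = np.percentile(data, [20, 80], method='lower')
print(lo, hi)  # 4 16

# replace everything beyond the bounds with the bound values
print(np.clip(data, lo, hi))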

So, to show how winsorizing helps us reduce the impact of outliers, we will first introduce an outlier into column ‘a’

Introduced a random outlier in column ‘a’

Now, if we use MinMaxScaler, the presence of the outlier impacts the range of the scaled data, as can be seen below

Here, the rest of the values (apart from the outlier) have fallen into a slim range of 0–0.05. Winsorizing helps us curb this effect while scaling the data.
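To see the squashing numerically before we fix it, a tiny illustration with made-up values:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # 100 is an outlier

# plain min-max scaling: the outlier stretches the denominator
print((x - x.min()) / (x.max() - x.min()))  # inliers land below ~0.05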

import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import MinMaxScaler

# winsorize column 'a', setting the limits at the 5th & 95th percentiles
df['a'] = pd.Series(winsorize(df['a'], limits=[0.05, 0.05]))

# then scale the winsorized data as usual
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['a', 'b', 'c'])

fig, axs = plt.subplots(1, 3, figsize=(20, 4))
for x, y in enumerate(scaled_df.columns):
    axs[x].plot(scaled_df[y])

As you can see, the values have spread uniformly throughout the expected range of 0–1 (figure 1).

Robust Scaler →

As the name suggests, this method is robust to outliers: it uses the interquartile range in a formula similar to the Min-Max Scaler’s

x_scaled(i) = (x(i) - median(X)) / (75th_percentile(X) - 25th_percentile(X))

Why is it robust to outliers? If you notice, it uses the median instead of the minimum or maximum values (as in Min-Max Scaler), and the median and interquartile range are far less sensitive to outliers than the min and max.
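A quick illustration of that robustness with made-up values: the median and IQR barely move when an outlier appears, while the maximum explodes.

import numpy as np

x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_out = np.array([1.0, 2.0, 3.0, 4.0, 500.0])  # same data, one outlier

print(np.median(x), np.median(x_out))  # 3.0 3.0 (unchanged)
print(np.percentile(x, 75) - np.percentile(x, 25),
      np.percentile(x_out, 75) - np.percentile(x_out, 25))  # 2.0 2.0
print(x.max(), x_out.max())  # 5.0 500.0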

This time, we will use this scaler on the new dataset, where we have introduced an outlier into column ‘a’

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import RobustScaler

# fit on the data and transform it using the median & IQR
scaler = RobustScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['a', 'b', 'c'])

fig, axs = plt.subplots(1, 3, figsize=(20, 4))
for x, y in enumerate(scaled_df.columns):
    axs[x].plot(scaled_df[y])

If you look at column ‘a’ (1st plot), the distribution looks similar to what we got for the min-max scaler. But let’s look closer and replot column ‘a’ without the outlier value in the scaled version

As you can see, the scaling done by a robust scaler is not affected by the outlier, which wasn't the case with the min-max scaler.

With this, it’s a wrap !!
