Feature Scaling with Scikit-Learn

Deepanshu Anand · Published in CodeX · Feb 27, 2023 · 4 min read

Feature scaling is a part of data preprocessing, the most important stage in the data science lifecycle. It is the process of normalizing the range of values of the independent variables, or features.

Feature scaling includes various techniques such as Standardization, Normalization, Robust Scaling, and Maximum Absolute Scaling. We will try to understand these techniques and their implementation in Python using the scikit-learn library.

Standardization

Standardization, also known as Z-score normalization, rescales the values of a feature so that their mean becomes 0 and their standard deviation becomes 1. In other words, the variable takes on the mean and spread of a standard normal variate (the name given to a random variable with mean 0 and standard deviation 1). Note that standardization only shifts and rescales the values; it does not change the shape of their distribution.

The formula used in standardization is x_new = (x - mean)/std
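
Applied by hand with NumPy, the formula looks like this (a minimal sketch on a small hypothetical matrix; the mean and standard deviation are taken column-wise, since each feature is standardized independently):

import numpy as np

# Hypothetical toy data: 3 samples, 3 features on very different scales
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 300.0, 0.7],
              [3.0, 400.0, 0.9]])

# x_new = (x - mean) / std, computed per column (axis=0)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # approximately [0, 0, 0]
print(X_std.std(axis=0))   # [1, 1, 1]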

Implementation using scikit-learn

import numpy as np
from sklearn.preprocessing import StandardScaler

arr1 = np.random.rand(10,3)*100
print(arr1)
print("Mean :",arr1.mean())
print("Standard Deviation:",np.sqrt(arr1.var()))

# Sample output
"""[[ 8.5408117 13.23422504 90.54782091]
[11.27108582 39.13818841 48.70205528]
[21.99748897 87.68092973 1.49939192]
[47.07141238 15.4003094 49.42024796]
[21.2729465 4.48959385 47.0832372 ]
[23.46886539 27.46516811 73.10104384]
[18.94886643 13.92315365 2.14299997]
[79.61608123 86.92512652 24.66128341]
[98.79356963 93.94796718 73.08859248]
[ 2.14529107 86.89583035 34.08777342]]
Mean: 41.55204525853225
Standard Deviation: 32.03876393042097"""

scaler = StandardScaler()
scaler.fit(arr1)               # learns the per-feature mean and standard deviation
arr1 = scaler.transform(arr1)  # applies x_new = (x - mean)/std to each feature
print(arr1)
print("Mean :",arr1.mean())
print("Standard Deviation:",np.sqrt(arr1.var()))

# sample output
"""[[-0.813695 -0.95143266 1.63557206]
[-0.72401207 -0.21957598 0.15139791]
[-0.37167554 1.1518871 -1.52277322]
[ 0.45194249 -0.89023495 0.17687057]
[-0.39547502 -1.19849203 0.09398212]
[-0.32334437 -0.54937022 1.01677455]
[-0.47181545 -0.93196857 -1.49994591]
[ 1.52095653 1.13053362 -0.7012736 ]
[ 2.15089087 1.32894777 1.01633293]
[-1.02377246 1.12970593 -0.36693743]]
Mean: -2.2204460492503132e-17
Standard Deviation: 1.0"""
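
The fit and transform steps are kept separate on purpose: in a typical project the scaler is fit on the training split only, and the learned mean and standard deviation are then reused to transform the test split. A minimal sketch (the train/test arrays below are hypothetical):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 3) * 100  # hypothetical training data
X_test = np.random.rand(20, 3) * 100   # hypothetical test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test data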

Normalization

Normalization, also known as min-max scaling, is used when the values are spread across a wide range. Since it is easier to work with values that lie in a small range, normalization rescales each feature to [0, 1] or sometimes [-1, 1]. It should not be used when there are outliers in the data, because the minimum and maximum are very sensitive to them.

The formula used is x_new = (x - x_min)/(x_max - x_min)

Implementation using scikit-learn

import numpy as np
from sklearn.preprocessing import MinMaxScaler

arr2 = np.random.rand(10,3)*100
print(arr2)
print("The range before normalization is : [{}, {}]".format(arr2.min(),arr2.max()))

# sample output
"""[[21.00801612 77.74704759 25.63122722]
[54.83292694 16.3455772 32.04866766]
[96.61986889 6.29165261 62.87900849]
[97.42033905 38.17507848 16.48163798]
[64.04436996 91.8359815 34.13440205]
[82.38084647 44.97368287 65.74681869]
[74.09448645 59.98532993 8.00089344]
[24.47881137 23.38872966 48.198353 ]
[64.27830359 15.47658312 92.51492281]
[ 5.51491864 88.43507128 70.30311959]]
The range before normalization is : [5.514918638080635, 97.42033905450633]"""

minmax_scaler = MinMaxScaler()
minmax_scaler.fit(arr2)               # learns the per-feature minimum and maximum
arr2 = minmax_scaler.transform(arr2)  # applies x_new = (x - x_min)/(x_max - x_min)
print(arr2)
print("The range after normalization is : [{}, {}]".format(arr2.min(),arr2.max()))

# sample output
"""[[0.16857654 0.83530254 0.20860837]
[0.53661697 0.11752883 0.2845418 ]
[0.99129028 0. 0.64933734]
[1. 0.37271233 0.10034718]
[0.63684439 1. 0.30922095]
[0.83635903 0.45218696 0.68327029]
[0.7461972 0.6276708 0. ]
[0.2063414 0.19986219 0.47563061]
[0.63938976 0.10737042 1. ]
[0. 0.96024388 0.73718206]]
The range after normalization is : [0.0, 1.0]"""
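
As noted above, the target range is sometimes [-1, 1] instead of [0, 1]; MinMaxScaler exposes this through its feature_range argument. A quick sketch:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

arr = np.random.rand(10,3)*100
minmax_scaler = MinMaxScaler(feature_range=(-1, 1))  # scale each feature to [-1, 1]
arr = minmax_scaler.fit_transform(arr)
print("The range after normalization is : [{}, {}]".format(arr.min(),arr.max()))  # [-1.0, 1.0]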

Robust Scaling

When there are outliers in the data, we should use robust scaling instead of standard scaling. It centers the data on the median and expresses how far each value is from the median in terms of the interquartile range (IQR), which is the distance between the third and the first quartile. It is similar to standard scaling; the only difference is that standard scaling uses the mean and standard deviation, while robust scaling uses the median and interquartile range, which are far less sensitive to extreme values.

The formula used is x_new = (x - median)/(Q3 - Q1)

Implementation using scikit-learn

import numpy as np
from sklearn.preprocessing import RobustScaler

arr3 = np.random.rand(10,3)*100
print(arr3)
print("Median: {}".format(np.median(arr3)))
print("Q3: {}".format(np.quantile(arr3,.75)))
print("Q1: {}".format(np.quantile(arr3,.25)))

# sample output
"""[[94.12711485 47.64717879 84.83212595]
[41.54166062 32.6974467 82.17632652]
[75.79008789 98.86690653 41.5592884 ]
[72.20724272 33.32484797 17.01262311]
[34.84750494 88.74303599 87.04309146]
[41.68641477 48.63308258 47.53952529]
[55.34183396 45.21925429 17.65331303]
[29.10045669 34.38965987 98.34283707]
[94.16786313 62.21966107 73.30601754]
[69.66391057 81.33398152 10.37885324]]
Median: 51.987458269791546
Q3: 81.96574026584611
Q1: 36.52104385689809"""

rb_scaler = RobustScaler()
rb_scaler.fit(arr3)               # learns the per-feature median and interquartile range
arr3 = rb_scaler.transform(arr3)  # applies x_new = (x - median)/(Q3 - Q1)
print(arr3)
print("Median: {}".format(np.median(arr3)))
print("Q3: {}".format(np.quantile(arr3,.75)))
print("Q1: {}".format(np.quantile(arr3,.25)))

# sample output
"""[[ 0.94920585 -0.01249297 0.40320469]
[-0.62915355 -0.39136676 0.359335 ]
[ 0.39881754 1.28557796 -0.31159549]
[ 0.29127797 -0.37546642 -0.71706835]
[-0.83007953 1.02900685 0.43972641]
[-0.62480874 0.01249297 -0.21281125]
[-0.21493952 -0.0740243 -0.70648514]
[-1.00257794 -0.3484807 0.62638069]
[ 0.95042891 0.35682011 0.21281125]
[ 0.21493952 0.84123783 -0.82664794]]
Median: 0.0
Q3: 0.40210790508105054
Q1: -0.3873916777517996"""
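
To see why the median and IQR matter, here is a small sketch comparing the two scalers on a hypothetical feature that contains one extreme outlier. The outlier inflates the mean and standard deviation, so StandardScaler squeezes the ordinary values into a narrow band, whereas RobustScaler keeps them spread out:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature, with one extreme outlier (1000) among ordinary values
x = np.array([[10.0], [12.0], [11.0], [13.0], [9.0], [1000.0]])

print(StandardScaler().fit_transform(x).ravel())
# the five ordinary values all land around -0.45, almost indistinguishable

print(RobustScaler().fit_transform(x).ravel())
# the ordinary values stay spread between -1.0 and 0.6, with the outlier at ~395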

Maximum-Absolute Scaling

In this technique, we find the maximum absolute value of each feature and divide that feature's values by it, which scales the values into the range [-1, 1]. Like min-max scaling, it is not used when there are outliers in the data. Because it does not shift or center the values, this scaler does not change the sparsity of the data: zero entries stay zero.

The formula used is x_new = x / max(abs(x))

Implementation using scikit-learn

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

arr4 = np.random.rand(10,3)*100
print(arr4)
print("The maximum absolute value is: {}".format(max(arr4.min(),arr4.max(),key=abs)))
print("The range before normalization is : [{}, {}]".format(arr4.min(),arr4.max()))

# sample output
"""[[30.20506438 84.42085986 25.43487487]
[ 9.27023388 50.64318179 56.62785835]
[39.3084322 38.35417837 70.30945385]
[23.06047447 44.86597981 95.47736575]
[18.36337079 40.77672149 31.39624294]
[35.40362476 52.15319276 8.6091533 ]
[96.74240177 50.66992418 6.8491357 ]
[45.72171635 88.37444765 89.34686885]
[56.15921269 0.76480578 62.55225838]
[ 9.03338013 65.67922772 82.45366699]]
The maximum absolute value is: 96.74240176773162
The range before normalization is : [0.7648057777124961, 96.74240176773162]"""

mabs = MaxAbsScaler()
mabs.fit(arr4)               # learns the per-feature maximum absolute value
arr4 = mabs.transform(arr4)  # applies x_new = x / max(abs(x))
print(arr4)
print("The range after normalization is : [{}, {}]".format(arr4.min(),arr4.max()))

# sample output
"""[[0.31222157 0.95526322 0.26639691]
[0.0958239 0.57305231 0.59310244]
[0.40632062 0.43399624 0.73639918]
[0.23836988 0.50768046 1. ]
[0.18981719 0.4614085 0.32883441]
[0.36595768 0.59013883 0.09016957]
[1. 0.57335492 0.0717357 ]
[0.472613 1. 0.9357911 ]
[0.58050257 0.00865415 0.65515275]
[0.0933756 0.74319251 0.86359386]]
The range after normalization is : [0.00865415058343911, 1.0]"""
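
The sparsity claim can be verified directly: MaxAbsScaler accepts SciPy sparse matrices, and since it only divides by the per-feature maximum absolute value (it never shifts or centers the data), zero entries remain zero. A small sketch:

from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A mostly-zero (sparse) matrix with a handful of non-zero entries
X = sparse.csr_matrix([[0.0, 5.0, 0.0],
                       [0.0, 0.0, -2.0],
                       [10.0, 0.0, 0.0]])

X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.nnz == X.nnz)  # True: the number of non-zero entries is unchanged
print(X_scaled.toarray())
# [[ 0.  1.  0.]
#  [ 0.  0. -1.]
#  [ 1.  0.  0.]]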

Importance of feature scaling

Many machine learning algorithms rely on Euclidean distances, and when the features vary over very different ranges this becomes a problem: the features with the largest magnitudes dominate the distances. To compute things sensibly and efficiently, we need to bring all the values to the same order of magnitude. Algorithms that are highly affected by scaling include clustering (for example k-means) and PCA (principal component analysis), because they involve calculating distances or variances across features.
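
A brief sketch of this effect, using hypothetical data in which one feature spans a far larger range than the other: without scaling, the Euclidean distances that k-means relies on are driven almost entirely by the large-magnitude feature, so standardizing first (for example inside a Pipeline) lets both features contribute:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans roughly [0, 1], feature 1 spans roughly [0, 10000]
X = np.column_stack([rng.random(100), rng.random(100) * 10_000])

# Scaling first puts both features on the same footing before clustering
model = make_pipeline(StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0))
labels = model.fit_predict(X)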

To know more about data science lifecycle processes and the underlying techniques, stay tuned for more such blogs.
