Normalization in Data Science: A comprehensive guide for feature scaling

Pallavi Padav
Women in Technology
5 min read · Jun 24, 2024
Here is an interesting article about the tallest man meeting the shortest man on earth.

https://nypost.com/2023/01/23/worlds-tallest-man-feared-he-would-step-on-worlds-shortest/

While running a machine learning model, we often assume that all features contribute equally to the model's prediction. In reality, datasets frequently contain features with vastly different ranges, and a mismatch much like the one in that article arises: the largest-scale features tower over the rest.

How to deal with it? Normalization comes to the rescue!!

What is Normalization?

Normalization, or feature scaling, is a data transformation technique performed during the data preprocessing stage of the machine learning life cycle. It brings all the features to a similar range, based on the technique we choose.

Why Normalize Data?

Distance-based machine learning algorithms like KNN, hierarchical clustering, etc., rely on the magnitudes of the features. Features on larger scales can dominate the rest of the features in the dataset and are likely to result in substandard performance (see the short sketch at the end of this section).

The presence of outliers can skew the data; robust normalization techniques help reduce their influence on the scaled values.

Aside from these advantages, normalization improves interpretability: once the features share a similar scale, it is easier to compare their distributions and draw insights.
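To see the dominance effect concretely, here is a minimal sketch with made-up numbers (ages and salaries invented for illustration). With raw values, salary drowns out age; after Min-Max scaling, both features contribute comparably to the distance.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows: (age in years, salary)
X = np.array([[25.0, 50000.0],
              [45.0, 52000.0],
              [26.0, 90000.0]])

# Raw distances: salary dominates, so row 1 looks ~20x closer to row 0
print(np.linalg.norm(X[0] - X[1]))  # ~2000.1
print(np.linalg.norm(X[0] - X[2]))  # ~40000.0

# After scaling, the large age gap in row 1 finally counts
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # ~1.001
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))  # ~1.001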

Types of Normalization Techniques:

1. Min-Max Scaling:

It's a commonly used normalization technique that transforms features by scaling them to a given range, typically [0, 1], using the formula X_scaled = (X - X_min) / (X_max - X_min). It is applicable to continuous numeric features and can't be used for binary or categorical values.

https://www.oreilly.com/library/view/hands-on-machine-learning/9781788393485/fd5b8a44-e9d3-4c19-bebb-c2fa5a5ebfee.xhtml
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, normalize, RobustScaler

# Sample data: a mix of categorical and numeric columns.
# Building the DataFrame from a list (rather than a NumPy array,
# which would coerce everything to strings) keeps the numeric
# dtypes that the scalers below need.
data = [['M', 81.4, 82.2, 44, 6.1, 120000, 'no'],
        ['M', 75.2, 86.2, 40, 5.9, 80000, 'no'],
        ['F', 80.0, 83.2, 34, 5.4, 210000, 'yes'],
        ['F', 85.4, 72.2, 46, 5.6, 50000, 'yes'],
        ['M', 68.4, 87.2, 28, 5.11, 70000, 'no']]
df = pd.DataFrame(data, columns=['gender', 'hsc_p', 'ssc_p', 'age',
                                 'height', 'salary', 'suffer_from_disease'])
df

Here is the box plot of features [‘hsc_p’, ‘ssc_p’, ‘height’, ‘salary’]

# MinMaxScaler expects a 2D array of shape (n_samples, n_features),
# so reshape the single column into a 2D array with one column
# before transforming
scaler = MinMaxScaler()
df['hsc_p'] = scaler.fit_transform(np.array(df['hsc_p']).reshape(-1, 1))
df

By default, MinMaxScaler ranges from 0 to 1, but the range can be altered, e.g. scaler = MinMaxScaler(feature_range=(0, 2)).
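As a sanity check, here is a small sketch (using the hsc_p values from the sample data above) showing that MinMaxScaler matches the manual formula, and that a custom feature_range just stretches the same result:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([81.4, 75.2, 80.0, 85.4, 68.4]).reshape(-1, 1)

# Manual formula: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())
assert np.allclose(manual, MinMaxScaler().fit_transform(x))

# feature_range=(0, 2) maps the same result onto [0, 2]
scaled_0_2 = MinMaxScaler(feature_range=(0, 2)).fit_transform(x)
assert np.allclose(scaled_0_2, manual * 2)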

2. StandardScaler:

StandardScaler makes use of the Z-score, or standardization, technique: it transforms the data so that each feature has a mean of 0 and a standard deviation of 1, using z = (x - mean) / standard deviation.

https://vitalflux.com/minmaxscaler-standardscaler-python-examples/
# Standardize ssc_p: subtract the mean, divide by the standard deviation
std_scaler = StandardScaler()
df['ssc_p'] = std_scaler.fit_transform(np.array(df['ssc_p']).reshape(-1, 1))
df
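To connect the output back to the formula, here is a small check (using the ssc_p values from the sample data) that StandardScaler is exactly the manual Z-score; note that scikit-learn uses the population standard deviation (ddof=0):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([82.2, 86.2, 83.2, 72.2, 87.2]).reshape(-1, 1)

# Manual Z-score: (x - mean) / std, with the population std (ddof=0)
manual = (x - x.mean()) / x.std(ddof=0)
assert np.allclose(manual, StandardScaler().fit_transform(x))

# The transformed column has mean ~0 and standard deviation 1
z = StandardScaler().fit_transform(x)
print(z.mean(), z.std())  # ~0.0, 1.0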

3. Normalize:

It makes use of the L1 or L2 norm. By default, normalize() normalizes values along rows, so we need to convert the column into an array and pass it as a single row before we apply the method. Although using the normalize() function results in values between 0 and 1 (for non-negative data), it's not the same as simply scaling the values to fall between 0 and 1.

https://stackoverflow.com/questions/55498772/is-there-a-more-efficient-way-to-normalize-a-set-of-data-in-sklearn-or-other-pyt
# Pass the column as a single row, since normalize() works row-wise
ht = np.array(df['height'])
normalized_ht = normalize([ht])  # L2 norm by default
df['height'] = normalized_ht.reshape(-1, 1)
df
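A quick way to see that difference: L2 normalization divides every value by the vector's Euclidean norm, which is not what Min-Max scaling does. A minimal sketch with made-up numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

v = np.array([3.0, 4.0, 12.0])

# L2 normalization divides by the Euclidean norm: sqrt(9 + 16 + 144) = 13
print(normalize([v]))         # [[0.2308 0.3077 0.9231]]
print(v / np.linalg.norm(v))  # same values

# Min-Max scaling of the same numbers gives a different result
print(MinMaxScaler().fit_transform(v.reshape(-1, 1)).ravel())
# [0.     0.1111 1.    ]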

Uniqueness of the normalize() function: it provides the flexibility to normalize along columns (features) as well as along samples (rows). Here is how it works. I have considered the example of L1, but the same applies to the L2 norm as well.

nom = np.array([[0.10, 1.0, 10, 100],
                [0.11, 1.1, 11, 110],
                [0.12, 1.2, 12, 120],
                [0.13, 1.3, 13, 130]])
df_nom = pd.DataFrame(nom)
df_nom.columns = ['a', 'b', 'c', 'd']
df_nom

Case 1: L1 norm of samples

# L1 normalization of samples (rows); axis defaults to 1
norm_default_l1 = normalize(df_nom, norm='l1')
df_l1_row = pd.DataFrame(norm_default_l1, columns=['a', 'b', 'c', 'd'])
# For display purposes: append row and column totals
df_l1_row['Row_Total'] = df_l1_row.sum(axis=1)
df_l1_row.loc['Column_Total'] = df_l1_row.sum(axis=0)
df_l1_row

Case 2: L1 norm of columns

# L1 normalization of features (columns); override the default axis=1
norm_default_l1 = normalize(df_nom, norm='l1', axis=0)
df_l1_col = pd.DataFrame(norm_default_l1, columns=['a', 'b', 'c', 'd'])
# For display purposes: append row and column totals
df_l1_col['Row_Total'] = df_l1_col.sum(axis=1)
df_l1_col.loc['Column_Total'] = df_l1_col.sum(axis=0)
df_l1_col

Since it is the L1 norm, you can observe that the normalized values sum to 1 along the respective axis: the Row_Total column in Case 1 and the Column_Total row in Case 2.
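Under the hood, L1 normalization simply divides each value by the sum of absolute values along the chosen axis. A minimal check against the first row of df_nom (same numbers as above):

import numpy as np
from sklearn.preprocessing import normalize

row = np.array([[0.10, 1.0, 10, 100]])

# L1 norm of the row = 0.10 + 1 + 10 + 100 = 111.1
manual = row / np.abs(row).sum()
assert np.allclose(manual, normalize(row, norm='l1'))
print(manual.sum())  # 1.0 -- the normalized row sums to 1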

4. RobustScaler:

RobustScaler standardizes features using statistics that are robust to outliers. It is useful when the data contains outliers or is not normally distributed, situations where the traditional scaling methods above can be misleading.

This method removes the median and scales the data by the interquartile range (IQR), the spread between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile): X_scaled = (X - median) / IQR.

https://www.geeksforgeeks.org/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/
# Scale salary, whose top value (210000) sits far above the rest
scaler = RobustScaler()
df['salary'] = scaler.fit_transform(np.array(df['salary']).reshape(-1, 1))
df

Use RobustScaler if you want to reduce the effects of outliers, relative to MinMaxScaler.
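To make that concrete, here is a minimal sketch using the salary values from the sample table above. It first reproduces RobustScaler by hand with the (X - median) / IQR formula, then contrasts it with MinMaxScaler, which lets the single large salary squeeze all the other values toward 0:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

salary = np.array([50000, 70000, 80000, 120000, 210000.0]).reshape(-1, 1)

# Manual robust scaling: (x - median) / IQR
q1, med, q3 = np.percentile(salary, [25, 50, 75])
manual = (salary - med) / (q3 - q1)
assert np.allclose(manual, RobustScaler().fit_transform(salary))

# MinMaxScaler crams the four ordinary salaries into [0, 0.44],
# while RobustScaler keeps them spread out around 0
print(MinMaxScaler().fit_transform(salary).ravel())
print(RobustScaler().fit_transform(salary).ravel())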

Box plot of features before and after feature scaling

From the above graph, it’s evident that normalized data gives better insight than the raw data.

Min-Max Scaling v/s StandardScaler v/s normalize v/s RobustScaler
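As a wrap-up, here is a small sketch (my own addition, using the salary column from the sample data) that applies all four techniques to the same values so their outputs can be compared side by side:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, normalize

col = np.array([120000, 80000, 210000, 50000, 70000.0]).reshape(-1, 1)

comparison = pd.DataFrame({
    'raw': col.ravel(),
    'min_max': MinMaxScaler().fit_transform(col).ravel(),    # range [0, 1]
    'z_score': StandardScaler().fit_transform(col).ravel(),  # mean 0, std 1
    'l2_normalized': normalize(col, axis=0).ravel(),         # unit L2 norm
    'robust': RobustScaler().fit_transform(col).ravel(),     # median 0, IQR 1
})
print(comparison)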

EndNote:

Thanks for reading the blog. Have thoughts or questions? I'd love to hear from you! Feel free to leave a comment below.

Looking forward to staying in touch through LinkedIn. Mail me here for any queries.

Stay tuned for more exciting content. Till then, happy reading!!

I believe in the power of continuous learning and sharing knowledge with the community. Your contributions are invaluable in helping me create meaningful content and resources that benefit everyone. Join me on this journey of exploration and innovation in the fascinating world of data science by donating to Buy Me a Coffee.
