16 Data Normalization Methods Using Python (With Examples) — Part 1 of 6

Reina
6 min read · Jan 11, 2024


Designed by Reina

Data Normalization Methods (Part 1 of 6):

Scaling to a Range

  1. Min-Max Normalization
  2. Max Abs Scaling
  3. Hyperbolic Tangent (Tanh) Normalization

Introduction

Different models have different requirements for feature scaling. For instance, tree-based models like Random Forests and Gradient Boosting Machines do not require feature normalization. Neural networks, on the other hand, often benefit from normalization, as it can help the model converge more quickly.

Determining which features to normalize in a dataset and selecting appropriate normalization techniques is an important step in data preprocessing.

In this article, I will explore 16 normalization techniques, implementing the mathematical formula of each method as a Python function (even though built-in equivalents exist in packages like Scikit-learn). The techniques will be applied to selected features of the Automobile dataset from the UC Irvine Machine Learning Repository, guided by a simple one-step EDA of the features only.

Import dataset using Python code

Install the ucimlrepo package

pip install ucimlrepo

Fetch dataset and view a summary of the variables

from ucimlrepo import fetch_ucirepo 

# fetch dataset
automobile = fetch_ucirepo(id=10)

# data (as pandas dataframes)
X = automobile.data.features
y = automobile.data.targets

# variable information
print(automobile.variables)

Output

    name               role     type         description                                          missing_values
0   price              Feature  Continuous   continuous from 5118 to 45400                        yes
1   highway-mpg        Feature  Continuous   continuous from 16 to 54                             no
2   city-mpg           Feature  Continuous   continuous from 13 to 49                             no
3   peak-rpm           Feature  Continuous   continuous from 4150 to 6600                         yes
4   horsepower         Feature  Continuous   continuous from 48 to 288                            yes
5   compression-ratio  Feature  Continuous   continuous from 7 to 23                              no
6   stroke             Feature  Continuous   continuous from 2.07 to 4.17                         yes
7   bore               Feature  Continuous   continuous from 2.54 to 3.94                         yes
8   fuel-system        Feature  Categorical  1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi         no
9   engine-size        Feature  Continuous   continuous from 61 to 326                            no
10  num-of-cylinders   Feature  Integer      eight, five, four, six, three, twelve, two           no
11  engine-type        Feature  Categorical  dohc, dohcv, l, ohc, ohcf, ohcv, rotor               no
12  curb-weight        Feature  Continuous   continuous from 1488 to 4066                         no
13  height             Feature  Continuous   continuous from 47.8 to 59.8                         no
14  width              Feature  Continuous   continuous from 60.3 to 72.3                         no
15  length             Feature  Continuous   continuous from 141.1 to 208.1                       no
16  wheel-base         Feature  Continuous   continuous from 86.6 to 120.9                        no
17  engine-location    Feature  Binary       front, rear                                          no
18  drive-wheels       Feature  Categorical  4wd, fwd, rwd                                        no
19  body-style         Feature  Categorical  hardtop, wagon, sedan, hatchback, convertible        no
20  num-of-doors       Feature  Integer      four, two                                            yes
21  aspiration         Feature  Binary       std, turbo                                           no
22  fuel-type          Feature  Binary       diesel, gas                                          no
23  make               Feature  Categorical  alfa-romero, audi, bmw, chevrolet, dodge, hond...    no
24  normalized-losses  Feature  Continuous   continuous from 65 to 256                            yes
25  symboling          Target   Integer      -3, -2, -1, 0, 1, 2, 3                               no

(The demographic and units columns are None for every variable and are omitted above.)

Here, I will plot a 5 x 5 matrix of histograms (continuous values) and bar charts (discrete values) to visualize the distributions of each of the 25 features in the automobile dataset.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Generating a list of distinct colors
colors = plt.cm.viridis(np.linspace(0, 1, len(X.columns)))

plt.figure(figsize=(20, 20)) # Setting the figure size

# Setting the style for no grid and white background for cleaner aesthetics
sns.set(style="white", palette="muted")

# Looping over the features to create subplots
for i, col in enumerate(X.columns):
    plt.subplot(5, 5, i + 1)  # Creating a subplot for each feature
    sns.histplot(X[col], kde=True, color=colors[i], edgecolor="black")  # Plotting the histogram with a unique color
    plt.title(col)  # Setting the title of each subplot as the feature name
    plt.tight_layout()

plt.show()  # Display the plots

5 x 5 matrix of histograms and bar charts with kernel density estimates

Notice that there are categorical and binary variable types? For categorical features with no inherent order (nominal categories), like fuel-system, engine-type, drive-wheels, body-style, and make, we can use one-hot encoding to convert each categorical variable into multiple binary variables (also known as dummy variables), each representing one category of the original variable. For binary features (like engine-location, aspiration, and fuel-type), we can encode them as 0 and 1 (this is often done automatically by many machine learning libraries, but it's good practice to explicitly convert them to ensure consistency).
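
As a rough sketch (assuming the column groupings above and that X is the features DataFrame loaded earlier), the encoding could be done with pandas like this:

import pandas as pd

# Nominal columns to one-hot encode (grouping assumed from the discussion above)
nominal_cols = ['fuel-system', 'engine-type', 'drive-wheels', 'body-style', 'make']

# One-hot encode each nominal column into 0/1 dummy variables
X_encoded = pd.get_dummies(X, columns=nominal_cols)

# Explicitly map each binary column to 0/1 (the particular 0/1 assignment here is arbitrary)
binary_maps = {
    'engine-location': {'front': 0, 'rear': 1},
    'aspiration': {'std': 0, 'turbo': 1},
    'fuel-type': {'diesel': 0, 'gas': 1},
}
for col, mapping in binary_maps.items():
    X_encoded[col] = X_encoded[col].map(mapping)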

Advantages of One-Hot Encoding:

  • It removes any spurious ordinal relationship between categories that have no natural order. This is important because many machine learning models, like linear regression and logistic regression, treat numerical inputs as having order and magnitude.
  • It is easy to understand and implement.

Disadvantages of One-Hot Encoding:

  • It can lead to a high dimensionality increase, especially if the categorical variable has many categories (known as the “curse of dimensionality”). This can be problematic for models that struggle with high-dimensional spaces and can significantly increase memory and computational requirements.
  • It does not capture any information about the categories that might be related to the target variable, such as a natural ordering.

1. Min-Max Normalization

Min-Max Normalization is a scaling technique that transforms features to a specific range, usually [0, 1].

When to Use:

  • When you want to preserve the relationships among the original data points but need them in a scaled format
  • When you need to scale features to a bounded interval
  • When algorithms require data in a fixed range, like neural networks
  • Suitable for most continuous features without significant outliers, as min-max normalization is sensitive to them

Mathematical Formula:

x_normalized = (x - min(x)) / (max(x) - min(x))

Where:

  • x_normalized​ is the normalized value
  • x is the original value
  • min(x) is the minimum value of the feature across all data points
  • max(x) is the maximum value of the feature across all data points

Python code for applying Min-Max Normalization to the ‘highway-mpg’ feature in the dataset:

def min_max_normalize(series):
    return (series - series.min()) / (series.max() - series.min())

normalized_highwaympg = min_max_normalize(X['highway-mpg'])
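
For reference, the built-in MinMaxScaler from Scikit-learn (mentioned in the introduction) produces the same result; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# fit_transform applies (x - min) / (max - min) column-wise and returns a NumPy array
scaler = MinMaxScaler()
normalized_highwaympg_sk = scaler.fit_transform(X[['highway-mpg']])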

Comparison:

Comparison of highway-mpg before and after min-max normalization
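
A side-by-side comparison like the one above can be plotted by reusing the matplotlib/seaborn imports from the EDA step; for example:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(X['highway-mpg'], kde=True, ax=axes[0], color="steelblue")
axes[0].set_title("highway-mpg (original)")
sns.histplot(normalized_highwaympg, kde=True, ax=axes[1], color="seagreen")
axes[1].set_title("highway-mpg (min-max normalized)")
plt.tight_layout()
plt.show()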

2. Max Abs Scaling

Max Abs Scaling scales each feature by its maximum absolute value. This ensures that each feature is within the range [-1, 1].

When to Use:

Ideal for data that is already centered at zero or sparse data. Useful in preserving zero entries in sparse datasets.

Mathematical Formula:

x_scaled = x / max(abs(x))

Where:

  • x is the original value
  • abs(x) is the absolute value of x
  • max(abs(x)) is the maximum absolute value of the feature across all data points

Python code for Max Abs Scaling:

def max_abs_scaling(series):
    return series / series.abs().max()

This method is not particularly suitable for any feature in this dataset: all values are strictly positive, so nothing is centered at zero or sparse, and Max Abs Scaling would simply compress the data into (0, 1] without the minimum reaching zero.
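
For illustration only, here is how it behaves on a small, hypothetical zero-centered and sparse series (not a feature from the automobile dataset):

import pandas as pd

# Hypothetical sparse, zero-centered data (illustration only)
sparse_series = pd.Series([0.0, -2.5, 0.0, 5.0, 0.0, -1.0])

print(max_abs_scaling(sparse_series).tolist())
# [0.0, -0.5, 0.0, 1.0, 0.0, -0.2]  -> zero entries are preserved, values land in [-1, 1]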

3. Hyperbolic Tangent (Tanh) Normalization

Tanh Normalization applies the hyperbolic tangent function, squashing values into the range (-1, 1). It is similar in spirit to Min-Max scaling, but the output range is centered around zero.

Comparison with Max Abs Scaling

  • Range: Both methods scale data to the range [-1, 1].
  • Handling of Outliers: Tanh Normalization is more robust against outliers than Max Abs Scaling.
  • Linearity: Tanh Normalization is non-linear, whereas Max Abs Scaling is linear.
  • Data Centering: Tanh Normalization tends to center data around zero, but Max Abs Scaling does not.
  • Preservation of Sparsity: Max Abs Scaling is beneficial for sparse datasets as it maintains zero values.

The choice between the two depends on the specific characteristics of the dataset and the requirements of the subsequent analysis or machine learning algorithms.

When to Use:

Ideal for cases where you want to maintain the distribution of your variables but scale them to a fixed range centered around zero.

Python code for Tanh Normalization:

def tanh_normalization(series):
    return np.tanh(series)

This method is not particularly suitable for any feature in this dataset either: the raw feature values are large and positive, so tanh maps almost all of them to values essentially equal to 1 (see the sketch below).
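
As a quick check, and as one possible workaround (a standardize-then-tanh variant, which is an assumption on my part rather than one of this article's 16 methods), the saturation can be avoided by standardizing the values before applying tanh:

def tanh_after_standardization(series):
    # Center and scale first so values sit near zero, then squash with tanh
    standardized = (series - series.mean()) / series.std()
    return np.tanh(standardized)

# Raw tanh saturates: highway-mpg values (16 to 54) all map to essentially 1
print(tanh_normalization(X['highway-mpg']).min())

# The standardized variant keeps a usable spread inside (-1, 1)
print(tanh_after_standardization(X['highway-mpg']).describe())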


Reina

📍 Singapore | 📊 Data Science and Analytics