16 Data Normalization Methods Using Python (With Examples) — Part 1 of 6

Reina
6 min read · Jan 11, 2024


Designed by Reina

Data Normalization Methods (Part 1 of 6):

Scaling to a Range

  1. Min-Max Normalization
  2. Max Abs Scaling
  3. Hyperbolic Tangent (Tanh) Normalization

Introduction

Different models have different requirements for feature scaling. For instance, tree-based models like Random Forests and Gradient Boosting Machines do not require feature normalization. Neural networks, on the other hand, often benefit from normalization, as it can help the model converge more quickly.

Determining which features to normalize in a dataset and selecting appropriate normalization techniques is an important step in data preprocessing.

In this article, I will explore 16 normalization techniques, implementing the mathematical formula of each method as a Python function (even though built-in equivalents exist in packages like Scikit-learn). The techniques will be applied to selected features of the Automobile dataset from the UC Irvine Machine Learning Repository, guided by a simple one-step EDA of the features only.

Import dataset using Python code

Install the ucimlrepo package

pip install ucimlrepo

Fetch dataset and view a summary of the variables

from ucimlrepo import fetch_ucirepo 

# fetch dataset
automobile = fetch_ucirepo(id=10)

# data (as pandas dataframes)
X = automobile.data.features
y = automobile.data.targets

# variable information
print(automobile.variables)

Output

    name               role     type         description                                          missing_values
0   price              Feature  Continuous   continuous from 5118 to 45400                        yes
1   highway-mpg        Feature  Continuous   continuous from 16 to 54                             no
2   city-mpg           Feature  Continuous   continuous from 13 to 49                             no
3   peak-rpm           Feature  Continuous   continuous from 4150 to 6600                         yes
4   horsepower         Feature  Continuous   continuous from 48 to 288                            yes
5   compression-ratio  Feature  Continuous   continuous from 7 to 23                              no
6   stroke             Feature  Continuous   continuous from 2.07 to 4.17                         yes
7   bore               Feature  Continuous   continuous from 2.54 to 3.94                         yes
8   fuel-system        Feature  Categorical  1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi         no
9   engine-size        Feature  Continuous   continuous from 61 to 326                            no
10  num-of-cylinders   Feature  Integer      eight, five, four, six, three, twelve, two           no
11  engine-type        Feature  Categorical  dohc, dohcv, l, ohc, ohcf, ohcv, rotor               no
12  curb-weight        Feature  Continuous   continuous from 1488 to 4066                         no
13  height             Feature  Continuous   continuous from 47.8 to 59.8                         no
14  width              Feature  Continuous   continuous from 60.3 to 72.3                         no
15  length             Feature  Continuous   continuous from 141.1 to 208.1                       no
16  wheel-base         Feature  Continuous   continuous from 86.6 to 120.9                        no
17  engine-location    Feature  Binary       front, rear                                          no
18  drive-wheels       Feature  Categorical  4wd, fwd, rwd                                        no
19  body-style         Feature  Categorical  hardtop, wagon, sedan, hatchback, convertible        no
20  num-of-doors       Feature  Integer      four, two                                            yes
21  aspiration         Feature  Binary       std, turbo                                           no
22  fuel-type          Feature  Binary       diesel, gas                                          no
23  make               Feature  Categorical  alfa-romero, audi, bmw, chevrolet, dodge, hond...    no
24  normalized-losses  Feature  Continuous   continuous from 65 to 256                            yes
25  symboling          Target   Integer      -3, -2, -1, 0, 1, 2, 3                               no

(The demographic and units columns are None for every variable and are omitted above.)

Here, I will plot a 5 x 5 matrix of histograms (continuous values) and bar charts (discrete values) to visualize the distributions of each of the 25 features in the automobile dataset.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Generating a list of distinct colors
colors = plt.cm.viridis(np.linspace(0, 1, len(X.columns)))

plt.figure(figsize=(20, 20)) # Setting the figure size

# Setting the style for no grid and white background for cleaner aesthetics
sns.set(style="white", palette="muted")

# Looping over the features to create subplots
for i, col in enumerate(X.columns):
    plt.subplot(5, 5, i + 1)  # Creating a subplot for each feature
    sns.histplot(X[col], kde=True, color=colors[i], edgecolor="black")  # Plotting the histogram with a unique color
    plt.title(col)  # Setting the title of each subplot as the feature name
    plt.tight_layout()

plt.show()  # Display the plots

5 x 5 matrix of histograms and bar charts with kernel density estimates

Notice that there are categorical and binary variable types? For categorical features with no inherent order (nominal categories), like fuel-system, engine-type, drive-wheels, body-style, and make, we can use one-hot encoding to convert each categorical variable into multiple binary variables (also known as dummy variables), each representing one category of the original variable. For binary features (like engine-location, aspiration, and fuel-type), we can encode them as 0 and 1 (this is often done automatically by many machine learning libraries, but it's good practice to explicitly convert them to ensure consistency).
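
As a rough sketch (assuming the column groupings above and that X is the features DataFrame loaded earlier), the encoding could be done with pandas like this:

import pandas as pd

# Nominal columns to one-hot encode (grouping assumed from the discussion above)
nominal_cols = ['fuel-system', 'engine-type', 'drive-wheels', 'body-style', 'make']

# One-hot encode each nominal column into 0/1 dummy variables
X_encoded = pd.get_dummies(X, columns=nominal_cols)

# Explicitly map each binary column to 0/1 (the particular 0/1 assignment here is arbitrary)
binary_maps = {
    'engine-location': {'front': 0, 'rear': 1},
    'aspiration': {'std': 0, 'turbo': 1},
    'fuel-type': {'diesel': 0, 'gas': 1},
}
for col, mapping in binary_maps.items():
    X_encoded[col] = X_encoded[col].map(mapping)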

Advantages of One-Hot Encoding:

  • It removes any spurious ordinal relationship between categories that have no natural order. This is important because many machine learning models, like linear regression and logistic regression, treat numerical inputs as having order and magnitude.
  • It is easy to understand and implement.

Disadvantages of One-Hot Encoding:

  • It can lead to a high dimensionality increase, especially if the categorical variable has many categories (known as the “curse of dimensionality”). This can be problematic for models that struggle with high-dimensional spaces and can significantly increase memory and computational requirements.
  • It does not capture any information about the categories that might be related to the target variable, such as a natural ordering.

1. Min-Max Normalization

Min-Max Normalization is a scaling technique that transforms features to a specific range, usually [0, 1].

When to Use:

  • When you want to preserve the relationships among the original data points but need them in a scaled format
  • When you need to scale features to a bounded interval
  • When algorithms require data in a fixed range, like neural networks
  • Suitable for most continuous features without significant outliers, as min-max normalization is sensitive to them

Mathematical Formula:

x_normalized = (x - min(x)) / (max(x) - min(x))

Where:

  • x_normalized​ is the normalized value
  • x is the original value
  • min(x) is the minimum value of the feature across all data points
  • max(x) is the maximum value of the feature across all data points

Python code for applying Min-Max Normalization to the ‘highway-mpg’ feature in the dataset:

def min_max_normalize(series):
    return (series - series.min()) / (series.max() - series.min())

normalized_highwaympg = min_max_normalize(X['highway-mpg'])
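
For reference, the built-in MinMaxScaler from Scikit-learn (mentioned in the introduction) produces the same result; a minimal sketch:

from sklearn.preprocessing import MinMaxScaler

# fit_transform applies (x - min) / (max - min) column-wise and returns a NumPy array
scaler = MinMaxScaler()
normalized_highwaympg_sk = scaler.fit_transform(X[['highway-mpg']])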

Comparison:

Comparison of highway-mpg before and after min-max normalization
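
A side-by-side comparison like the one above can be plotted by reusing the matplotlib/seaborn imports from the EDA step; for example:

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(X['highway-mpg'], kde=True, ax=axes[0], color="steelblue")
axes[0].set_title("highway-mpg (original)")
sns.histplot(normalized_highwaympg, kde=True, ax=axes[1], color="seagreen")
axes[1].set_title("highway-mpg (min-max normalized)")
plt.tight_layout()
plt.show()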

2. Max Abs Scaling

Max Abs Scaling scales each feature by its maximum absolute value. This ensures that each feature is within the range [-1, 1].

When to Use:

Ideal for data that is already centered at zero or sparse data. Useful in preserving zero entries in sparse datasets.

Mathematical Formula:

x_scaled = x / max(abs(x))

Where:

  • x is the original value
  • abs(x) is the absolute value of x
  • max(abs(x)) is the maximum absolute value of the feature across all data points

Python code for Max Abs Scaling:

def max_abs_scaling(series):
    return series / series.abs().max()

This method is not particularly suitable for any feature in this dataset: all values are strictly positive, so nothing is centered at zero or sparse, and Max Abs Scaling would simply compress the data into (0, 1] without the minimum reaching zero.
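
For illustration only, here is how it behaves on a small, hypothetical zero-centered and sparse series (not a feature from the automobile dataset):

import pandas as pd

# Hypothetical sparse, zero-centered data (illustration only)
sparse_series = pd.Series([0.0, -2.5, 0.0, 5.0, 0.0, -1.0])

print(max_abs_scaling(sparse_series).tolist())
# [0.0, -0.5, 0.0, 1.0, 0.0, -0.2]  -> zero entries are preserved, values land in [-1, 1]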

3. Hyperbolic Tangent (Tanh) Normalization

Tanh Normalization applies the hyperbolic tangent function, squashing values into the range (-1, 1). It is similar in spirit to Min-Max scaling, but the output range is centered around zero.

Comparison with Max Abs Scaling

  • Range: Both methods scale data to the range [-1, 1].
  • Handling of Outliers: Tanh Normalization is more robust against outliers than Max Abs Scaling.
  • Linearity: Tanh Normalization is non-linear, whereas Max Abs Scaling is linear.
  • Data Centering: Tanh Normalization tends to center data around zero, but Max Abs Scaling does not.
  • Preservation of Sparsity: Max Abs Scaling is beneficial for sparse datasets as it maintains zero values.

The choice between the two depends on the specific characteristics of the dataset and the requirements of the subsequent analysis or machine learning algorithms.

When to Use:

Ideal for cases where you want to maintain the distribution of your variables but scale them to a fixed range centered around zero.

Python code for Tanh Normalization:

def tanh_normalization(series):
    return np.tanh(series)

This method is not particularly suitable for any feature in this dataset either: the raw feature values are large and positive, so tanh maps almost all of them to values essentially equal to 1 (see the sketch below).
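
As a quick check, and as one possible workaround (a standardize-then-tanh variant, which is an assumption on my part rather than one of this article's 16 methods), the saturation can be avoided by standardizing the values before applying tanh:

def tanh_after_standardization(series):
    # Center and scale first so values sit near zero, then squash with tanh
    standardized = (series - series.mean()) / series.std()
    return np.tanh(standardized)

# Raw tanh saturates: highway-mpg values (16 to 54) all map to essentially 1
print(tanh_normalization(X['highway-mpg']).min())

# The standardized variant keeps a usable spread inside (-1, 1)
print(tanh_after_standardization(X['highway-mpg']).describe())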


Reina

📍 Singapore | 📊 Data Science and Analytics