Demystifying Data Normalization in Machine Learning

Dagang Wei
Feb 5, 2024



This article is part of the series Demystifying Machine Learning.

Introduction

In the ever-evolving world of machine learning, the process of preparing data for analysis plays a pivotal role in achieving accurate and efficient results. One crucial step in this preparation is normalization, a technique that often goes unnoticed but significantly impacts the performance of machine learning models. This blog post aims to demystify the concept of normalization, exploring what it is, why it’s essential, and how it can be implemented, all through the lens of a practical example with a generated housing prices dataset.

What is Normalization?

Normalization is a data preprocessing technique used to transform the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information. It’s about adjusting the scale of your data to level the playing field for all the features in your dataset.

There are several methods to normalize data, but the most common ones include min-max normalization and z-score normalization (standardization). Min-max normalization scales the data between a specified range (usually 0 and 1), while z-score normalization scales the data so that it has a mean of 0 and a standard deviation of 1.

Why Normalize Data?

The importance of normalization becomes apparent in various scenarios, such as:

  • Improving Model Accuracy: Many machine learning algorithms, especially those trained with gradient descent, converge faster on normalized data. Without normalization, features with larger magnitudes can dominate the learning process, leading to less accurate models.
  • Facilitating Model Training: Normalized data helps ensure that each feature contributes equally to the model training process, making it easier for the model to learn the patterns.
  • Enhancing Compatibility: Some algorithms, especially those involving distance calculations like k-nearest neighbors (KNN) and k-means clustering, require normalized data to function correctly because they are sensitive to the magnitude of the data.
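
To make the last point concrete, here is a minimal sketch (using NumPy and scikit-learn, the same stack as the example later in this post, with made-up feature values) of how a large-magnitude feature dominates a Euclidean distance until the features are rescaled:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical houses described by (size in sq ft, number of bedrooms)
X = np.array([[2000.0, 1.0],
              [3000.0, 3.0],
              [3010.0, 5.0],
              [4000.0, 6.0]])

# Raw distance between the 2nd and 3rd houses: the size gap (10 sq ft)
# swamps the bedroom gap (2 bedrooms)
print(np.linalg.norm(X[1] - X[2]))                # ~10.2, driven almost entirely by size

# After min-max scaling, both features contribute on a comparable scale
X_scaled = MinMaxScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[1] - X_scaled[2]))  # ~0.4, now dominated by bedrooms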

Normalization Techniques

There are several common methods of normalization, each with its specific application and advantages. Below are some widely used techniques:

1. Min-Max Scaling

Min-Max Scaling (or Min-Max Normalization) is one of the simplest methods, rescaling values to a fixed range, usually 0 to 1. The formula is:

𝑋ₙₒᵣₘ = (𝑋 − 𝑋ₘᵢₙ) / (𝑋ₘₐₓ − 𝑋ₘᵢₙ)

where 𝑋ₘᵢₙ and 𝑋ₘₐₓ are the minimum and maximum values of the feature, respectively. This method works best when the distribution is not Gaussian or when the standard deviation is very small. However, it is sensitive to outliers.
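
As a minimal, self-contained sketch (the column of house sizes below is made up for illustration), Min-Max Scaling can be applied either directly with NumPy or with scikit-learn's MinMaxScaler:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1200.0], [1850.0], [2400.0], [3100.0]])  # hypothetical house sizes in sq ft

# Direct application of the formula
x_norm = (x - x.min()) / (x.max() - x.min())

# Equivalent result with scikit-learn (feature_range defaults to (0, 1))
x_norm_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(x_norm, x_norm_sklearn))  # True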

2. Z-Score Normalization (Standardization)

Standardization (or Z-Score Normalization) transforms the features so they have the properties of a standard normal distribution with a mean of 0 and a standard deviation of 1:

𝒁 = (𝑋 − μ) / σ

where μ is the mean of the feature and σ is the standard deviation. This method is less affected by outliers than Min-Max Scaling (though the mean and standard deviation are themselves pulled by extreme values) and is suitable for algorithms that assume the input data is approximately normally distributed.
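
A minimal sketch, again with made-up values, showing the formula side by side with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1200.0], [1850.0], [2400.0], [3100.0]])  # hypothetical house sizes

# Direct application of the formula: Z = (X - mean) / std
z = (x - x.mean()) / x.std()

# Equivalent result with scikit-learn (which also uses the population standard deviation)
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z, z_sklearn))  # True
print(round(z.mean(), 6), round(z.std(), 6))  # ~0.0 and 1.0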

3. Robust Scaling

Robust Scaling uses the median and the interquartile range (instead of mean and standard deviation in Z-score normalization). The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. The 25th percentile is the value below which 25% of the data falls, and the 75th percentile is the value below which 75% of the data falls. The IQR thus represents the middle 50% of the data and is used because it measures the variability in the data while ignoring the influence of extreme outliers. It subtracts the median from the data points and divides by the IQR:

𝑋ᵣₒᵦᵤₛₜ = (𝑋 − Median) / IQR

This method is robust to outliers and is preferred if the data contains many outliers or is skewed.
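
A minimal sketch of the formula next to scikit-learn's RobustScaler, using a small made-up sample with one extreme outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical data with one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Direct application of the formula: (X - median) / IQR
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - median) / (q3 - q1)

# Equivalent result with scikit-learn's default settings
x_robust_sklearn = RobustScaler().fit_transform(x)

print(np.allclose(x_robust, x_robust_sklearn))  # True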

4. Decimal Scaling

Decimal scaling normalizes by moving the decimal point of values of the feature. The number of decimal places moved depends on the maximum absolute value of the feature. This method is less common and is used when the range of data needs to be arbitrarily scaled down.
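
scikit-learn does not provide a decimal-scaling transformer, so here is a minimal NumPy sketch of the idea (the decimal_scale helper and the sample values are made up for illustration):

import numpy as np

def decimal_scale(x):
    # Choose j as the smallest integer such that max(|x|) / 10**j < 1,
    # then shift the decimal point of every value by j places
    j = int(np.ceil(np.log10(np.abs(x).max() + 1e-12)))
    return x / (10 ** j)

x = np.array([120.0, -450.0, 75.0, 990.0])
print(decimal_scale(x))  # [ 0.12  -0.45   0.075  0.99 ]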

5. Log Scaling

Logarithmic scaling transforms the data using the logarithm function. This is particularly useful when dealing with data that follows a power law distribution. It can also help in managing skewed data, making it more linear and suitable for linear models.
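
As a minimal sketch with made-up skewed values, NumPy's log1p (which computes log(1 + x) and so remains defined at zero) compresses a wide, power-law-like range into a much narrower one:

import numpy as np

# Hypothetical heavily skewed data (e.g. view counts or incomes)
x = np.array([3.0, 12.0, 150.0, 2400.0, 98000.0])

x_log = np.log1p(x)
print(np.round(x_log, 2))  # values now span roughly 1.4 to 11.5 instead of 3 to 98,000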

6. L2 Normalization

L2 Normalization, also known as Euclidean normalization, scales the input vector so that the Euclidean length of the vector is 1. It’s commonly used in text classification and clustering.
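
A minimal sketch using scikit-learn's normalize helper on two made-up row vectors (for example, term counts for two documents):

import numpy as np
from sklearn.preprocessing import normalize

# Each row is a feature vector, e.g. term counts for a document
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])

# Rescale each row to unit Euclidean (L2) length
X_l2 = normalize(X, norm='l2')

print(X_l2)                          # [[0.6, 0.8], [0.707..., 0.707...]]
print(np.linalg.norm(X_l2, axis=1))  # [1. 1.]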

The choice of normalization technique depends on the specific dataset and the machine learning algorithm being used. For example, Min-Max Scaling and Z-Score Normalization are broadly applicable and often used as a starting point. In contrast, Robust Scaling is preferred for datasets with many outliers. It’s also worth experimenting with different normalization methods as part of the model selection process to determine which method yields the best performance for your specific problem.

Example

To illustrate the normalization process, let’s consider a simple example using a generated housing prices dataset. This dataset includes features such as the size of the house (in square feet), the number of bedrooms, and the age of the house (in years), with the target variable being the house price. First, we’ll generate a synthetic housing prices dataset and then demonstrate how to normalize it with min-max scaling.

The code is available in this colab notebook.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns

# Seed for reproducibility
np.random.seed(42)

# Generate synthetic housing data with noise
size = np.random.normal(3000, 750, 100) + np.random.normal(0, 200, 100) # Adding noise
bedrooms = np.random.randint(1, 5, 100)
age = np.random.randint(1, 30, 100)

# Introduce outliers
size[98:100] += 5000 # Extreme size values for outliers
bedrooms[98:100] = 6 # More bedrooms than typical houses
age[98:100] += 25 # Significantly older houses (adding also avoids negative ages)

# Simulate house prices with added noise
prices = size * 200 + bedrooms * 5000 + age * -1000 + np.random.normal(0, 15000, 100)

# Create a DataFrame
housing_data = pd.DataFrame({
'Size': size,
'Bedrooms': bedrooms,
'Age': age,
'Price': prices
})

# Initialize the MinMaxScaler
scaler_features = MinMaxScaler()
scaler_price = MinMaxScaler()

# Normalize the features
features_to_normalize = ['Size', 'Bedrooms', 'Age']
housing_data[features_to_normalize] = scaler_features.fit_transform(housing_data[features_to_normalize])

# Normalize the target variable
housing_data['Price'] = scaler_price.fit_transform(housing_data[['Price']])

# Split the dataset into training and testing sets
X = housing_data.drop('Price', axis=1)
y = housing_data['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
predictions = model.predict(X_test)

# Scale back the predictions to the original price scale
predictions_original_scale = scaler_price.inverse_transform(predictions.reshape(-1, 1))
y_test_original_scale = scaler_price.inverse_transform(y_test.values.reshape(-1, 1))

# Evaluate the model using mean squared error in the original price scale
mse = mean_squared_error(y_test_original_scale, predictions_original_scale)
print(f'Mean Squared Error: {mse}')

# Visualize actual vs. predicted prices in the original scale
plt.figure(figsize=(10, 6))
plt.scatter(y_test_original_scale, predictions_original_scale)
plt.plot([y_test_original_scale.min(), y_test_original_scale.max()], [y_test_original_scale.min(), y_test_original_scale.max()], 'k--', lw=4)
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs. Predicted Prices')
plt.show()

Conclusion

Normalization is a critical step in the data preprocessing phase of machine learning projects. By ensuring that each feature contributes equally to the model, normalization can lead to more accurate predictions, faster convergence, and overall better model performance. Through the example of a generated housing prices dataset, we’ve seen how straightforward it is to implement normalization using Python’s Scikit-Learn library, making it an accessible practice for data scientists and analysts alike.

Remember, while normalization is powerful, it’s also essential to understand your data and the requirements of the machine learning algorithms you’re using, as this will guide you in choosing the most appropriate normalization technique.
