Unleashing the Power of Feature Scaling: A Comprehensive Exploration of Every Kind

Muhammad Farhan Salimuddin
Aug 8, 2023


Written by:

Arvin Melvillo, Made Ary Widanthi, Mehdi Mursalat Ismail, Muhammad Farhan Salimuddin, Risanto Darmawan

Have you scaled your data? Do you know which scaler to use? If not, you have come to the right place! In data preprocessing for machine learning, feature scaling is a step whose importance cannot be overstated: it plays a crucial role in enhancing the performance and accuracy of machine learning algorithms.

What is Feature Scaling?

Feature scaling is a data preprocessing technique used in machine learning to bring all numerical features in a dataset to a consistent scale or range. In many real-world datasets, features may have different units of measurement or widely varying magnitudes.

Why use Feature Scaling?

As datasets become increasingly complex and diverse, consisting of numerous features with varying scales, it becomes essential to normalize or standardize these features to a common scale. Feature scaling makes the range and distribution of features consistent, helping machine learning models make decisions without bias caused by differences in magnitude.

In this article, we will explain different feature scaling techniques, such as the Min-Max Scaler, the Standard Scaler, and others, discuss where each scaler is used in real-life cases, and show how each one works.

Content:

  • Min-Max Scaler
  • Standard Scaler
  • MaxAbs Scaler
  • Robust Scaler
  • Quantile Transformer Scaler
  • Log Transformation
  • Power Transformer Scaler
  • Unit Vector Scaler

Min-Max Scaler

What is Min-Max Scaler

Min-Max Scaler, known as min-max scaling or min-max normalization, is a simple method used to rescale the range of features, bringing them within a specified scale such as [0, 1] or [−1, 1]. This estimator works by scaling and translating each feature independently so that it falls within the desired range on the training set, for example, between zero and one.

When to use Min-Max Scaler

We can use the Min-Max Scaler when the data has a bounded range and contains no outliers. If your data does contain outliers, scaling to a fixed range like [0, 1] compresses the majority of data points into a narrow band, making them hard to tell apart.

Example of data before (left) and after the Min-Max Scaler (right). Notice the Col_1 axis: its range changes from [0, 60] to [0, 1].
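
As a minimal sketch of how this looks in practice, scikit-learn's MinMaxScaler can be applied as follows; the data values here are hypothetical, chosen only to mirror the 0-to-60 range above:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Hypothetical single feature ranging from 0 to 60
X = np.array([[0.0], [15.0], [30.0], [45.0], [60.0]])
# Rescale to the default range [0, 1]; use feature_range=(-1, 1) for [-1, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # [0.   0.25 0.5  0.75 1.  ]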

Standard Scaler

What is Standard Scaler?

Standard Scaler is a method used to rescale the distribution of the data so that the observed values have a mean of 0 and a standard deviation of 1. This type of scaler has the formula:

y = (x - mean) / standard_deviation

where x is a sample of the data.

When to use Standard Scaler?

We use the Standard Scaler when the input features differ greatly in their ranges, or simply when they are measured in different units. However, note that outliers influence the empirical mean and standard deviation, which narrows the range of the scaled values. Therefore, do not use the Standard Scaler when the data does not follow a Gaussian distribution.

Example of using Standard Scaler

Image ref : https://hersanyagci.medium.com/feature-scaling-with-scikit-learn-for-data-science-8c4cbcf2daff
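
A minimal sketch with scikit-learn's StandardScaler, using two hypothetical features measured in very different units (for illustration, something like height in cm and income in dollars):

import numpy as np
from sklearn.preprocessing import StandardScaler
# Hypothetical features with very different scales
X = np.array([[170.0, 60000.0],
              [165.0, 45000.0],
              [180.0, 80000.0],
              [175.0, 52000.0]])
# Each column becomes (x - mean) / standard_deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately [0. 0.]
print(X_scaled.std(axis=0))   # approximately [1. 1.]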

MaxAbs Scaler

What is MaxAbs Scaler?

MaxAbs Scaler is a method that scales the data so that its maximum absolute value becomes 1. Because it only divides by a constant, this scaler does not change or shift the distribution of the data. If there are negative values in the dataset, the scaled data will fall between -1 and 1. The MaxAbs Scaler formula is:

x_scaled = x / max(|x|)

When to use MaxAbs Scaler?

Like the other scalers, the MaxAbs Scaler is applied as a preprocessing step. It is particularly useful when dealing with sparse matrices, where other scaling techniques may not work as expected due to the presence of many zero values. In such cases, the MaxAbs Scaler preserves the sparsity of the data while still ensuring that the features are properly scaled.

Difference between the MaxAbs Scaler and the Min-Max Scaler
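
As a minimal sketch of the sparse-matrix use case described above (the values are made up for illustration), scikit-learn's MaxAbsScaler keeps the zeros intact:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler
# Hypothetical sparse matrix with positive and negative values
X = csr_matrix([[ 0.0,  2.0, -4.0],
                [ 1.0,  0.0,  0.0],
                [-3.0,  4.0,  8.0]])
# Each column is divided by its maximum absolute value; zeros stay zero
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.toarray())  # every column now lies within [-1, 1]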

Robust Scaler

What is Robust Scaler?

Robust Scaler applies a transformation to the data by first removing the median and then scaling it based on the interquartile range (IQR) — the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

For each feature, the centering and scaling are computed independently using the relevant statistics from the samples in the training set. The computed median and interquartile range are then stored for later use with the transform method on new data.

Standardizing a dataset is a common requirement for many machine learning estimators. Traditionally, this involves removing the mean and scaling to unit variance. However, outliers can significantly influence the sample mean and variance, leading to suboptimal results. In such cases, using the median and interquartile range often yields better results.

When to use Robust Scaler?

When your dataset contains outliers or extreme values, traditional scaling techniques like Min-Max scaling or z-score scaling (StandardScaler) can be heavily influenced by these outliers, leading to misleading results. The RobustScaler is designed to handle outliers and provide more robust scaling, making it a suitable choice for such data.
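
A minimal sketch with scikit-learn's RobustScaler, using a hypothetical feature that contains one extreme outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler
# Hypothetical feature with one extreme outlier (1000.0)
X = np.array([[10.0], [12.0], [14.0], [16.0], [18.0], [1000.0]])
# The median is removed and the values are divided by the IQR (75th - 25th percentile)
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # the non-outlier values keep a sensible spread around 0

For comparison, a Min-Max Scaler on the same data would squeeze the five regular values below roughly 0.01, because the single outlier defines the top of the range.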

Quantile Transformer

What is Quantile Transformer?

This method aims to normalize the features by transforming them into either a uniform or normal distribution. Consequently, it spreads out the most frequent values within a given feature and diminishes the influence of (marginal) outliers, making it a robust preprocessing approach.

Each feature undergoes independent transformation. Initially, an estimate of the cumulative distribution function for a feature is utilized to map the original values to a uniform distribution. Subsequently, the obtained values are further mapped to the desired output distribution using the corresponding quantile function. For new or unseen data, feature values falling below or above the fitted range will be mapped to the bounds of the output distribution. It’s important to note that this transformation is non-linear, which may introduce some distortion in linear correlations between variables measured at the same scale. Nevertheless, it enhances the comparability of variables measured at different scales.

When to use Quantile Transformer?

The Quantile Transformer is a powerful data transformation technique useful for non-Gaussian data, data normalization, addressing skewness, handling outliers, ensuring monotonic transformations, and privacy-preserving rank-based transformations. Its robustness to outliers and ability to maintain data ranking make it valuable in various scenarios.
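
As a minimal sketch, scikit-learn's QuantileTransformer can map a hypothetical, heavily skewed feature onto a standard normal distribution:

import numpy as np
from sklearn.preprocessing import QuantileTransformer
# Hypothetical, heavily right-skewed data
rng = np.random.default_rng(42)
X = rng.lognormal(mean=1, sigma=1, size=(1000, 1))
# Map the empirical quantiles of the feature onto a normal distribution
# (output_distribution='uniform', the default, would map to [0, 1] instead)
qt = QuantileTransformer(output_distribution='normal', n_quantiles=100, random_state=0)
X_normal = qt.fit_transform(X)
print(round(X_normal.mean(), 2), round(X_normal.std(), 2))  # roughly 0.0 and 1.0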

Log Transformation

What is Log transformation?

Log transformation is a technique for altering data using the logarithmic function. It is typically employed to mitigate skewness in positively skewed data, with the objective of converting data that originally exhibited a skewed distribution into something closer to a normal distribution. Log transformation is commonly applied to data with exclusively positive values, as the logarithm is only defined for positive numbers. In practice, the transformation can be carried out with np.log() from NumPy or the .apply() method from pandas.

When to use Log transformation?

Log transformation is a mathematical operation applied to data, commonly used in various fields like statistics, data analysis, and machine learning. It is used to modify the scale of the data by taking the logarithm of the original values. The log transformation is particularly useful in the following situations:

Skewed data: When you have data with a skewed distribution (either positively or negatively skewed), a log transformation can help in reducing the impact of extreme values and make the distribution more symmetric. This is especially beneficial for statistical analyses that assume normality.

Stabilizing variance: In some cases, data may have a varying variance across the range of values, known as heteroscedasticity. Log transformation can stabilize the variance, making the data more suitable for certain statistical models and assumptions.

Multiplicative relationships: When the relationship between variables is multiplicative rather than additive, taking the logarithm can convert the relationship into an additive one. This can simplify data analysis and make interpretation easier.

Percentage changes: Log transformation is commonly used when dealing with variables that represent percentage changes or rates, such as growth rates or interest rates.

Data compression: In certain applications, such as image processing, log transformation can be used to compress dynamic ranges, making it easier to visualize and analyze the data.

Outlier detection: Log transformation can sometimes highlight outliers more effectively, making them stand out in the transformed data, which can aid in outlier detection.

Data normalization: Log transformation can be used to normalize data, bringing extreme values closer to the mean and reducing the effect of scale differences between variables.

It is important to note that while log transformation can be beneficial in the scenarios mentioned above, it may not always be appropriate for all data types or analyses. The decision to use log transformation should be made based on a thorough understanding of the data and the specific analysis or modeling goals. Additionally, when applying a log transformation, make sure that all values are strictly positive since the logarithm of non-positive values is undefined.

The following script describes the log transformation using random data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Generate random data with lognormal distribution
data_lognormal = np.random.lognormal(mean=1, sigma=1, size=1000)
# Log Transformation using NumPy
data_log_transformed = np.log(data_lognormal)
# Plot histograms of the original and log-transformed data using sns.histplot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.histplot(data_lognormal, bins=30, kde=True, color='skyblue')
plt.title('Original Data (Lognormal)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.subplot(1, 2, 2)
sns.histplot(data_log_transformed, bins=30, kde=True, color='salmon')
plt.title('Log Transformation')
plt.xlabel('Value (Log Transformed)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

Reference: numpy.log — NumPy v1.25 Manual

Log Transformation is a specific case of Power Transformation that occurs when the exponent (lambda) is equal to 0.

Power Transformer Scaler

What is Power Transformer Scaler ?

The Power Transformer Scaler is a method in scikit-learn used to transform data using a power function in general, not just the logarithm. It can apply either the Box-Cox Transformation or the Yeo-Johnson Transformation. Its goal is to bring the data closer to a normal distribution by estimating an optimal power transformation from the data characteristics. The Power Transformer Scaler is not limited to strictly positive data: with the Yeo-Johnson method it can also be used on data that contain negative values.

So, when using the Power Transformer Scaler, you can choose between two transformation methods, Box-Cox or Yeo-Johnson; the Log Transformation is a special case of the Box-Cox Transformation (lambda = 0).

In more detail, here is a brief explanation of the two transformation methods that can be used in the Power Transformer Scaler:

Box-Cox Transformation:

The Box-Cox Transformation is a power transformation defined as follows: T(Y) = (Y^lambda - 1) / lambda if lambda is not equal to zero, and T(Y) = log(Y) if lambda is equal to zero.

Box-Cox Transformation can only be used for data that has positive values and cannot contain zero or negative values.

If lambda = 0, then this transformation will be a Log Transformation.

Yeo-Johnson Transformation:

The Yeo-Johnson Transformation is a more flexible power transformation than Box-Cox and is defined as follows: T(Y) = ((1 + Y)^lambda - 1) / lambda if lambda is not equal to zero and Y >= 0, and T(Y) = -((1 - Y)^(2 - lambda) - 1) / (2 - lambda) if lambda is not equal to 2 and Y < 0.

Yeo-Johnson Transformation can be used for data that contains positive and/or negative values.

If lambda = 0 and Y >= 0, this transformation reduces to log(1 + Y), i.e. a Log Transformation of the shifted data.

Using Power Transformer Scaler

The following script describes the Power Transformer Scaler with the Box-Cox and Yeo-Johnson Transformations using random data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PowerTransformer
# Generate random data
original_data= np.random.lognormal(mean=1, sigma=1, size=1000)
# Box-Cox Transformation using PowerTransformer with method='box-cox'
boxcox_transformer = PowerTransformer(method='box-cox')
data_boxcox_transformed = boxcox_transformer.fit_transform(original_data.reshape(-1, 1))
# Yeo-Johnson Transformation using PowerTransformer with method='yeo-johnson'
yeojohnson_transformer = PowerTransformer(method='yeo-johnson')
data_yeojohnson_transformed = yeojohnson_transformer.fit_transform(original_data.reshape(-1, 1))
# Plot histograms of the original and transformed data using sns.histplot
plt.figure(figsize=(12, 6))
plt.subplot(2, 3, 1)
sns.histplot(original_data, bins=30, kde=True, color='skyblue')
plt.title('Original Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.subplot(2, 3, 4)
sns.histplot(data_boxcox_transformed, bins=30, kde=True, color='salmon')
plt.title('Box-Cox Transformation')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.subplot(2, 3, 5)
sns.histplot(data_yeojohnson_transformed, bins=30, kde=True, color='green')
plt.title('Yeo-Johnson Transformation')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

Reference: sklearn.preprocessing.PowerTransformer — scikit-learn 1.3.0 documentation

So, in summary, when using Power Transformer Scaler, you can apply either Box-Cox Transformation or Yeo-Johnson Transformation, and if the estimated lambda is close to 0, it will essentially be performing Log Transformation.

Unit Vector Scaler (L1 and L2 Normalization)

In addition to regularization, the L1 and L2 norms can be used to normalize data. The concepts of L1/L2 regularization and L1/L2 normalization are distinct: regularization is a technique used during model training to control its complexity, whereas normalization is a preprocessing step used to transform the dataset's features. They rely on comparable mathematical principles but serve different objectives at different phases of developing a machine learning model.

L1 Normalization

The L1 norm is calculated as the sum of the absolute values of the vector.

L2 Normalization

The L2 norm is calculated as the square root of the sum of the squared vector values.

When to Use: L1 normalization can be useful when you want to make your data invariant to the scale of the magnitudes but still preserve the sign of each value in your feature vector. It scales the data so that the sum of the absolute values equals 1.

When to Use: L2 normalization is generally chosen when the direction of the data matters and you want to maintain the relative distances between the points. It scales the data so that the sum of the squares of the elements in a vector equals 1.

The decision between L1 and L2 normalization is frequently influenced by your comprehension of your data and the problem you’re attempting to solve. Experimenting with both and comparing their effects on model performance via cross-validation can provide empirical evidence to help make the proper choice.
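
As a minimal sketch, both variants are available through scikit-learn's Normalizer, which scales each sample (row) rather than each feature; the vectors below are made up for illustration:

import numpy as np
from sklearn.preprocessing import Normalizer
# Hypothetical feature vectors; Normalizer works row by row
X = np.array([[3.0, 4.0],
              [1.0, 1.0]])
X_l1 = Normalizer(norm='l1').fit_transform(X)  # each row's absolute values sum to 1
X_l2 = Normalizer(norm='l2').fit_transform(X)  # each row has Euclidean length 1
print(X_l1)  # approximately [[0.4286 0.5714], [0.5 0.5]]
print(X_l2)  # approximately [[0.6 0.8], [0.7071 0.7071]]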

Conclusion

Min-Max Scaler

Suitable when you want to scale data to a fixed range like [0, 1]. However, if your data contains outliers, this scaler may compress the majority of data points, making outliers indistinguishable.

Standard Scaler

Best used when the data follows a Gaussian distribution. It standardizes the features by removing the mean and scaling to unit variance.

MaxAbs Scaler

Particularly useful for scaling sparse matrices. It scales the data in the range [-1, 1] by dividing each value by the maximum absolute value.

Robust Scaler

If your data has many outliers, this scaler removes the median and scales the data based on the interquartile range (IQR), making it more robust to outliers.

Quantile Transformer

Normalizes features by mapping them to a uniform or normal distribution, reducing the influence of outliers and enhancing the comparability of variables at different scales. It’s particularly useful for non-Gaussian data, handling skewness, and maintaining data ranking, making it a robust preprocessing choice for various scenarios.

Log Transformation

Utilizes the logarithmic function to alter data. Aimed at converting skewed distribution data to a more normal distribution.

Power Transformer Scaler

Adjusts the data to be closer to a normal distribution using an optimal power transformation. Suitable for both positive and negative values, it offers methods like Box-Cox or Yeo-Johnson transformations.

Unit Vector Scaler

Includes L1 and L2 normalization:

  • L1 Normalization: Scales the vector so that the sum of its absolute values is 1. It maintains the original sign of each value.
  • L2 Normalization: Scales the vector so that the square root of the sum of the squared values is 1, preserving the relative distances between data points.

Closing

Machine learning is a complex field with intricate relationships between various mathematical concepts. Understanding the various concepts of feature scaling is pivotal for anyone delving into machine learning and data science; each method serves a different purpose in the preprocessing phase. Feature scaling can greatly influence the performance and behavior of many machine learning algorithms. It is a tool that should be in the toolkit of every data scientist and machine learning practitioner, used judiciously and with an understanding of its impact on the modeling process. Whether you're a seasoned data scientist or just starting your journey, remember that the foundational concepts remain key to unlocking the full potential of machine learning. Happy learning!
