From Raw to Rescaled: A Guide to Z-Score, Normalization, and Standardization in Data Preprocessing
In the world of data preprocessing, ensuring that your data is in the right format and scale is a crucial step in building robust and accurate machine-learning models. This is where techniques like Z-score, normalization, and standardization come into play to scale our data.
Why do we need to scale our data?
Imagine you’re comparing the heights of people in two different countries: one measures heights in feet and the other in centimeters. If you want to find an average height, the results will be skewed because of the different units. Scaling the data would involve converting all heights to the same unit, like centimeters. This makes the comparison fair and accurate, just like in machine learning where we scale features so that they’re on the same level playing field for algorithms to work with effectively.
What is scaling in machine learning?
Scaling is a process in which you adjust the range or size of your data to ensure that all features are on a similar scale. This is crucial for machine learning algorithms to work effectively because some algorithms are sensitive to the scale of input features. Scaling helps prevent certain features from dominating the learning process and ensures that different features contribute equally to the model’s performance. Three commonly used scalers are:
- MinMax Scaler
- Standard Scaler
- Robust Scaler
MinMax Scaler
The Min-Max Scaling (Normalization) technique works by transforming the original data into a new range, typically between 0 and 1.
It subtracts the minimum value from each data point to start the data at zero and then divides it by the range (the difference between the maximum and minimum values) to bring the data within the desired range of 0 to 1.
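As a quick sketch (the heights_cm array below is a made-up example, not part of the dataset used later), the rule is scaled = (x - min) / (max - min), and scikit-learn's MinMaxScaler produces the same result:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
heights_cm = np.array([[150.0], [160.0], [175.0], [190.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(heights_cm)

# Manual equivalent: (x - min) / (max - min)
manual = (heights_cm - heights_cm.min()) / (heights_cm.max() - heights_cm.min())

print(scaled.ravel())  # [0.    0.25  0.625 1.   ]
print(manual.ravel())  # identical values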
Standard Scaler
The standard normal distribution, also known as the Z distribution, is a specific type of probability distribution. It is a special case of the normal (Gaussian) distribution where the mean (average) is 0 and the standard deviation is 1. In this distribution, the values are symmetrically distributed around the mean, forming a bell-shaped curve.
The standard normal distribution is commonly used in statistics and probability theory as a reference distribution. Many statistical methods and hypothesis tests assume that the data follows a normal distribution. By converting data to the standard normal distribution, we can use standardized values (Z-scores) to compare and analyze data regardless of its original scale. Z-scores tell us how many standard deviations a data point is away from the mean.
Standard Scaling transforms the data such that the mean becomes 0 and the standard deviation becomes 1.
It subtracts the mean from each data point to center the data around zero and then divides it by the standard deviation to normalize the spread. This method is suitable when the data is approximately normally distributed.
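To make the z-score concrete, here is a minimal sketch (the values array is hypothetical): StandardScaler gives the same result as computing z = (x - mean) / std by hand.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature column
values = np.array([[2.0], [4.0], [6.0], [8.0]])

scaler = StandardScaler()
z_scores = scaler.fit_transform(values)

# Manual z-score: (x - mean) / std (population std, matching StandardScaler)
manual = (values - values.mean()) / values.std()

print(z_scores.ravel())  # approx [-1.342 -0.447  0.447  1.342]
print(manual.ravel())    # same values
print(round(z_scores.mean(), 2), round(z_scores.std(), 2))  # 0.0 and 1.0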
Robust Scaler
A Robust Scaler is similar to the Standard Scaler, but it is robust to outliers. RobustScaler centers each feature on its median and scales it by the interquartile range (IQR), so extreme values have far less influence on the transformation than they would with StandardScaler or MinMaxScaler. This makes it a suitable choice for datasets with skewed or non-normal distributions, or where a handful of outliers could otherwise dominate the scaling.
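As a small sketch (the array is made up, with one deliberate outlier at 100), RobustScaler subtracts the median and divides by the IQR, so the extreme value barely affects how the other points are scaled:
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with one extreme outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # defaults: center on the median, scale by the 25th-75th percentile range
robust = scaler.fit_transform(values)

# Manual equivalent: (x - median) / (Q3 - Q1)
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(robust.ravel())  # [-1.  -0.5  0.   0.5 48.5] -- the outlier stays extreme, the rest stay compact
print(manual.ravel())  # same values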
Code intuition
Step 1: Import libraries
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Generate sample data
The sample data is a mixture of two normal distributions: 1,000 points drawn from a distribution with mean 20 and standard deviation 5 (the main cluster), plus 200 outlier points drawn from a distribution with mean 1 and standard deviation 2.
x = np.concatenate([np.random.normal(20, 5, 1000), np.random.normal(1, 2, 200)]).reshape(-1, 1)
Step 3: Robust scaling. Because it normalizes using the interquartile range, you can specify the quantile range explicitly; the values below are scikit-learn's defaults.
robust_scaler = preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
robust_scaled_x = robust_scaler.fit_transform(x)
Step 4: Standard Scaling
standard_scaler = preprocessing.StandardScaler()
standard_scaled_x = standard_scaler.fit_transform(x)
Step 5: Min-Max Scaling
minmax_scaler = preprocessing.MinMaxScaler()
minmax_scaled_x = minmax_scaler.fit_transform(x)
Step 6: Plot the original data
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize=(20, 5))
ax1.set_title('Original Data')
sns.kdeplot(x.flatten(), ax=ax1, color='blue')
Step 7: Plot the robust-scaled data
ax2.set_title('After Robust Scaling')
sns.kdeplot(robust_scaled_x.flatten(), ax=ax2, color='red')
Step 8: Plot the standard-scaled data
ax3.set_title('After Standard Scaling')
sns.kdeplot(standard_scaled_x.flatten(), ax=ax3, color='black')
Step 9: Plot the Min-Max-scaled data
ax4.set_title('After Min-Max Scaling')
sns.kdeplot(minmax_scaled_x.flatten(), ax=ax4, color='g')
Step 10: Display the visualization
plt.tight_layout()
plt.show()
In the output graph, the smaller bell-shaped curve represents the outlier cluster. Looking at the robust- and standard-scaled plots, both are centered around 0 (standard scaling sets the mean to exactly 0, while robust scaling centers on the median). The distinction lies in the distribution to the right of 0, where the standard-scaled data shows higher density than the robust-scaled data; this difference highlights how the robust approach handles outliers. Min-Max scaling, on the other hand, squeezes everything into the range 0 to 1, and in this example the outliers end up compressed within roughly 0 to 0.25.
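If you want to check the numbers behind this observation, one quick sketch (reusing the arrays from the steps above, and relying on the fact that the 200 outliers were concatenated last in Step 2) is to slice them out and print their range after each transform:
# The outliers occupy the final 200 rows of x, so the same rows in each scaled array
outliers_mm = minmax_scaled_x[-200:]
outliers_std = standard_scaled_x[-200:]
outliers_rob = robust_scaled_x[-200:]

print("Min-Max scaled outliers:", outliers_mm.min(), "to", outliers_mm.max())
print("Standard scaled outliers:", outliers_std.min(), "to", outliers_std.max())
print("Robust scaled outliers:", outliers_rob.min(), "to", outliers_rob.max())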