Data Scaling Techniques in Python

avish arora — Mon, 07 Jul 2025 16:58:27 GMT

When you’re building a machine learning model, one crucial step is preparing your data. Features in a dataset often have very different scales, such as height in centimeters versus salary in dollars. Algorithms that rely on distances or gradient descent implicitly treat larger numbers as more important, regardless of their actual meaning, so they become biased towards features with bigger values and can perform poorly. Scaling puts all features on a similar scale so that each one contributes comparably to the model’s decisions, which often improves performance and speeds up convergence.
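To make the distance bias concrete, here is a minimal sketch with three made-up people (the names and numbers are purely illustrative and not from any real dataset). It compares Euclidean distances before and after standardizing the two features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: [height in cm, salary in dollars]
people = np.array([
    [170.0, 50_000.0],  # A
    [190.0, 50_100.0],  # B: very different height, almost the same salary
    [171.0, 52_000.0],  # C: almost the same height, different salary
])

def distance(points, i, j):
    """Euclidean distance between rows i and j."""
    return float(np.linalg.norm(points[i] - points[j]))

# On the raw data the salary column dominates the distance.
raw_ab = distance(people, 0, 1)
raw_ac = distance(people, 0, 2)
print(f"raw:    d(A,B)={raw_ab:.1f}  d(A,C)={raw_ac:.1f}")

# After scaling, both features contribute on comparable terms.
scaled = StandardScaler().fit_transform(people)
scaled_ab = distance(scaled, 0, 1)
scaled_ac = distance(scaled, 0, 2)
print(f"scaled: d(A,B)={scaled_ab:.3f}  d(A,C)={scaled_ac:.3f}")
```

On the raw data, A looks far closer to B than to C only because salaries are numerically larger than heights. After scaling, the 20 cm height gap and the 2,000 dollar salary gap carry similar weight.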

Feature scaling adjusts the scale of features to a similar range, ensuring that no single feature dominates the learning process. It also helps gradient descent by making the optimization process more efficient and stable.

Here’s how:

Faster Convergence: When features are on very different scales, the cost function’s contour can be elongated. Gradient descent updates weights based on gradients, and with varying scales, the steps can be uneven. This causes a zig-zagging path towards the minimum, slowing down convergence. Feature scaling helps to round out the cost function, allowing gradient descent to take a more direct path to the minimum, resulting in faster convergence.
Balanced Weight Updates: Feature scaling ensures that no feature dominates the weight updates just because it has a larger numerical range. This leads to more stable and balanced adjustments to the model’s parameters.
Avoiding Getting Stuck: Without scaling, the elongated cost function contours can cause gradient descent to oscillate across the narrow direction, make very slow progress along the flat direction, and stop at a suboptimal solution when the iteration budget runs out. Scaling helps create a more spherical cost function, enabling the algorithm to reach the minimum more effectively.
Helps with Regularization: Scaling is also beneficial when using regularization techniques like Ridge or Lasso, which penalize large weights. A feature measured in small units needs a larger weight to have the same effect, so without scaling the penalty depends on the units of each feature rather than on its importance. Scaling ensures that the penalty is applied fairly to all features.
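One way to put a number on how "elongated" the cost function is: for least-squares regression, the curvature of the cost with respect to the weights is proportional to X^T X, and its condition number measures how stretched the contours are (1 means perfectly round). The sketch below uses synthetic data with made-up height and salary columns, and the helper name hessian_condition_number is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic, illustrative features on very different scales.
rng = np.random.default_rng(0)
height_cm = rng.normal(170, 10, size=200)
salary_usd = rng.normal(50_000, 15_000, size=200)
X = np.column_stack([height_cm, salary_usd])

def hessian_condition_number(features):
    """Condition number of X^T X / n for centered features.

    For least squares this matrix is (up to a constant) the Hessian
    of the cost with respect to the weights.
    """
    centered = features - features.mean(axis=0)
    hessian = centered.T @ centered / len(features)
    return float(np.linalg.cond(hessian))

cond_raw = hessian_condition_number(X)
cond_scaled = hessian_condition_number(StandardScaler().fit_transform(X))

print(f"condition number, raw features:    {cond_raw:,.0f}")
print(f"condition number, scaled features: {cond_scaled:.2f}")
```

A very large condition number on the raw features corresponds to long, narrow contours and the zig-zagging path described above. After standardization the value is close to 1, so a single learning rate suits both weights.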

Several feature scaling techniques are explained in this post:

MinMax Scaler
Standardization (Z-score scaling)
Normalization
Robust Scaler
MaxAbs Scaler

Each technique is explained below in simple terms, with an implementation using Python’s scikit-learn library.

1. MinMax Scaler (Normalization)

MinMax Scaler, also known as min-max normalization, transforms each feature to a specific range, usually between 0 and 1.

How it works:

Each feature is scaled and translated individually based on its minimum and maximum values in the training set. Here is the formula:

X_scaled = (X - X_min) / (X_max - X_min)

Where X is the feature value, X_min is the minimum value in the feature column, and X_max is the maximum value.

When to use it:

When values are needed within a bounded interval, like in image processing.
When the data has few or no outliers.
For algorithms sensitive to the scale of features, like K-Means clustering or algorithms using distance calculations.

Python code example:

from sklearn.preprocessing import MinMaxScaler

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])

scaler = MinMaxScaler()

scaled_data = scaler.fit_transform(data)

print(scaled_data)

Output:

[[0.  0. ]
 [0.5 0.5]
 [1.  1. ]]

The data is scaled to the range [0, 1].
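Because the minimum and maximum come from the training set, new data is scaled with those same values. The sketch below (the X_train and X_test arrays are made up for illustration) applies the formula by hand and checks it against MinMaxScaler. It also shows that unseen values outside the training range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: the scaler only ever sees X_train during fit().
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 8.0]])  # 8.0 is above the training maximum of 6.0

scaler = MinMaxScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# The same computation written out with the formula above.
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
manual = (X_test - x_min) / (x_max - x_min)

print(X_test_scaled)                       # values 0.25 and 1.5
print(np.allclose(manual, X_test_scaled))  # True
```

Fitting on the training set only and reusing the fitted scaler for test data keeps information about the test set out of the preprocessing step.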

2. Standardization (Z-score scaling)

Standardization, or Z-score scaling, rescales each feature so that it has a mean of 0 and a standard deviation of 1.

How it works:

The mean of each feature is subtracted from its values, and the result is divided by the feature’s standard deviation. The formula is:

X_scaled = (X - mean) / standard_deviation

When to use it:

When the data has an approximately Gaussian (normal) distribution.
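For linear models such as linear regression or logistic regression, especially when they are regularized or trained with gradient descent. These models do not require normally distributed features, but they benefit from features that are centered and on a comparable scale.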
Standardization maintains useful information about outliers and makes algorithms less sensitive to them compared to MinMax scaling.

Python code example:

from sklearn.preprocessing import StandardScaler

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])

scaler = StandardScaler()

scaled_data = scaler.fit_transform(data)

print(scaled_data)

Output:

[[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]

Here, the data is transformed to have a mean of 0 and a standard deviation of 1.
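The formula can also be applied by hand with NumPy, as in the minimal sketch below. One detail to be aware of: StandardScaler uses the population standard deviation (ddof=0), which is also NumPy’s default, so the manual result matches. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Per-feature (column-wise) statistics.
mean = data.mean(axis=0)
std = data.std(axis=0)  # population standard deviation (ddof=0)

manual = (data - mean) / std
sklearn_scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, sklearn_scaled))  # True
print(manual.mean(axis=0))                  # approximately 0 for each feature
print(manual.std(axis=0))                   # approximately 1 for each feature
```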

3. Normalization (L1 or L2)

Normalization, in the context of scaling, often refers to scaling each sample (row) to have unit norm. This is different from the feature scaling discussed earlier. There are two common types:

L1 normalization: Scales the values so that the sum of the absolute values of the components is 1.
L2 normalization: Scales the values so that the sum of the squares of the components is 1.

How it works:

Each data vector (sample) is divided by its own norm, independently of the other samples and of the feature distributions.

When to use it:

In scenarios where you want to compare similarities between samples based on distance measures.
In text mining, where L2-normalized vectors (for example, TF-IDF vectors) are commonly compared using cosine similarity.

Python code example (using L2 normalization):

from sklearn.preprocessing import Normalizer

import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])

scaler = Normalizer(norm='l2')

scaled_data = scaler.fit_transform(data)

print(scaled_data)

Output:

[[0.4472136  0.89442719]
 [0.6        0.8       ]
 [0.6401844  0.76822128]]

Each row is now a unit vector.
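Both variants can be reproduced by dividing every row by its own norm. This is a minimal sketch (the manual_l1 and manual_l2 names are illustrative) that checks the manual computation against Normalizer and confirms the row-wise properties described above.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# L2: divide each row by the square root of the sum of its squared values.
l2_norms = np.linalg.norm(data, ord=2, axis=1, keepdims=True)
manual_l2 = data / l2_norms

# L1: divide each row by the sum of its absolute values.
l1_norms = np.abs(data).sum(axis=1, keepdims=True)
manual_l1 = data / l1_norms

sk_l2 = Normalizer(norm="l2").fit_transform(data)
sk_l1 = Normalizer(norm="l1").fit_transform(data)

print(np.allclose(manual_l2, sk_l2), np.allclose(manual_l1, sk_l1))
print((manual_l2 ** 2).sum(axis=1))   # each row sums to 1 after squaring
print(np.abs(manual_l1).sum(axis=1))  # each row's absolute values sum to 1
```

Note that the statistics are computed per row, so Normalizer has nothing to learn from the training set. Its fit step exists mainly so that it can be used like the other scalers.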

4. Robust Scaler

Robust Scaler is designed to handle datasets with outliers.

How it works:

The median is subtracted from each feature, and the result is divided by the interquartile range (IQR), which is the range between the 25th and 75th percentiles.

When to use it:

When the data contains outliers.
To maintain the relative distances between non-outlier data points.

Python code example:

from sklearn.preprocessing import RobustScaler

import numpy as np

data = np.array([[1, 2], [3, 4], [1000, 6]])

scaler = RobustScaler()

scaled_data = scaler.fit_transform(data)

print(scaled_data)

Output:

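[[-0.004004 -1.      ]
 [ 0.        0.      ]
 [ 1.995996  1.      ]]

For the first feature, the median is 3 and the IQR is 499.5 (the 25th percentile is 2 and the 75th percentile is 501.5). With only three samples, the outlier (1000) still inflates the IQR, so this tiny example understates the benefit. On larger datasets where outliers are rare, the median and IQR barely change when an outlier is present, so the outlier doesn’t drastically skew the scaling of the other values the way it does with StandardScaler.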
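The effect is easier to see with a few more samples. The sketch below uses a made-up column of ten values, one of which is an outlier. It computes the median and IQR by hand, checks the result against RobustScaler, and compares how much of the scaled range the nine ordinary values occupy under RobustScaler and StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative data: nine ordinary values and one outlier, as a single feature.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

# Manual robust scaling: subtract the median, divide by the IQR.
median = np.median(values, axis=0)
q1, q3 = np.percentile(values, [25, 75], axis=0)
manual = (values - median) / (q3 - q1)

robust = RobustScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)
print(np.allclose(manual, robust))  # True

# Spread (max minus min) of the nine non-outlier values after scaling.
robust_spread = float(robust[:9].max() - robust[:9].min())
standard_spread = float(standard[:9].max() - standard[:9].min())
print(f"RobustScaler spread of inliers:   {robust_spread:.3f}")
print(f"StandardScaler spread of inliers: {standard_spread:.3f}")
```

With RobustScaler the ordinary values stay well separated, while StandardScaler squeezes them into a very narrow band because the outlier inflates both the mean and the standard deviation.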

5. MaxAbs Scaler

MaxAbs Scaler scales each feature by its maximum absolute value.

How it works:

Each feature is divided by its maximum absolute value in the training set, so that the maximal absolute value of each feature becomes 1.0. The data is not shifted or centered, so zero entries stay zero and sparsity is preserved.

When to use it:

For sparse data matrices, such as in text mining.
When you want to keep the zero values as zeros.

Python code example:

from sklearn.preprocessing import MaxAbsScaler

import numpy as np

data = np.array([[1, 2], [3, 4], [-5, 6]])

scaler = MaxAbsScaler()

scaled_data = scaler.fit_transform(data)

print(scaled_data)

Output:

[[ 0.2         0.33333333]
 [ 0.6         0.66666667]
 [-1.          1.        ]]

Each feature is scaled such that its maximum absolute value is 1.
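The sparse-data use case can be sketched as follows, assuming SciPy is installed and using a small made-up matrix. MaxAbsScaler accepts a SciPy sparse matrix directly and returns a sparse result with the same number of stored non-zero entries.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Illustrative sparse matrix: most entries are zero.
dense = np.array([
    [0.0,  2.0, 0.0],
    [4.0,  0.0, 0.0],
    [0.0, -8.0, 1.0],
])
X = sparse.csr_matrix(dense)

scaled = MaxAbsScaler().fit_transform(X)

print(sparse.issparse(scaled))  # True: the result is still sparse
print(X.nnz, scaled.nnz)        # the number of stored non-zeros is unchanged
print(scaled.toarray())         # each column now has a maximum absolute value of 1
```

Scalers that subtract a mean or a minimum would turn most zeros into non-zero values, which is why scaling without shifting is a natural fit for sparse inputs.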

Conclusion: Choosing the Right Scaler

The best scaling method depends on the data and the machine learning algorithm being used.

If the data has a normal distribution, StandardScaler might be a good choice.
If the data has outliers, RobustScaler is a better option.
If the data needs to be in a specific range and doesn’t have many outliers, MinMaxScaler might be appropriate.
MaxAbsScaler is useful for sparse data.
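When in doubt, it can help to run the candidate scalers on the same data and look at the results side by side. The following is a minimal sketch on a made-up single feature with one outlier. The scalers dictionary and the results variable are just illustrative names.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Illustrative feature column with one outlier (100).
column = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

scalers = {
    "MinMaxScaler": MinMaxScaler(),
    "StandardScaler": StandardScaler(),
    "RobustScaler": RobustScaler(),
    "MaxAbsScaler": MaxAbsScaler(),
}

results = {}
for name, scaler in scalers.items():
    # Each scaler learns its statistics from the column, then rescales it.
    results[name] = scaler.fit_transform(column).ravel()
    print(f"{name:>14}: {np.round(results[name], 2)}")
```

Printing the scaled values next to each other makes it easy to see which methods compress the ordinary values when an outlier is present.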
