Data Standardization — A Brief Explanation

Wojtek Fulmyk, Data Scientist
3 min read · Jul 29, 2023


Article level: Beginner

My clients often ask me about the specifics of certain data preprocessing methods, why they’re needed, and when to use them. I will discuss a few common (and not-so-common) preprocessing methods in a series of articles on the topic.

In this preprocessing series:

Data Standardization — A Brief Explanation — Beginner
Data Normalization — A Brief Explanation — Beginner
One-hot Encoding — A Brief Explanation — Beginner
Ordinal Encoding — A Brief Explanation — Beginner
Missing Values in Dataset Preprocessing — Intermediate
Text Tokenization and Vectorization in NLP — Intermediate
Outlier Detection in Dataset Preprocessing — Intermediate
Feature Selection in Data Preprocessing — Advanced

In this short writeup I will explain what standardizing data is generally about. This article is not overly technical, but some understanding of specific terms will be helpful, so I have attached short explanations of the more complicated terminology. Give it a go, and if you need more info, just ask in the comments section!

preprocessing technique — Transforming raw data before modeling to improve performance.

standardization — Rescaling data to have zero mean and unit variance.

features — Variables or attributes in the data.

mean — Average value of a dataset.

outliers — Observations far outside the normal range.

model coefficients — Estimated parameters in a model.

Gaussian — Normal distribution bell curve.

data leakage — Information from the test set leaks into the training process, resulting in tainted model performance estimates.

Data Standardization

The Why

Standardization is a crucial preprocessing step for many machine learning algorithms. It rescales the features to have a mean of 0 and standard deviation of 1. This provides two main benefits:

A) Reducing the distorting effect of extreme values — Features may contain outliers that disproportionately distort their distributions. Standardization centers and rescales each feature, which limits how strongly extreme values dominate a feature’s scale, although it does not remove the outliers themselves.

B) Allowing direct comparison of model coefficients — Because standardized features share the same scale, their model coefficients become directly comparable, which helps assess feature importance (see the short code sketch after this list).
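
As a quick illustration of point B, here is a minimal sketch, assuming two made-up features (income in dollars, age in years) on very different scales, showing how standardization makes linear-regression coefficients comparable:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# two made-up features on very different scales
income = rng.normal(50_000, 15_000, size=200)   # dollars
age = rng.normal(40, 10, size=200)              # years
X = np.column_stack([income, age])

# target that depends equally on both features (in standardized terms)
y = (income - 50_000) / 15_000 + (age - 40) / 10

# raw coefficients sit on different scales and are hard to compare
print(LinearRegression().fit(X, y).coef_)

# after standardization, the two coefficients are directly comparable
X_std = StandardScaler().fit_transform(X)
print(LinearRegression().fit(X_std, y).coef_)

With the raw features, the income coefficient looks tiny simply because income is measured in dollars; after standardization, both coefficients come out roughly equal, reflecting their equal contributions.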

The How

The most common approach is mean standardization using this formula:

x_standardized = (x - m) / sd

where x is the original value, m is the mean, and sd is the standard deviation.
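
Here is the same formula applied by hand with NumPy on a small made-up feature column, so you can see exactly what the transformation does:

import numpy as np

x = np.array([12.0, 15.0, 9.0, 20.0, 14.0])   # made-up feature values

m = x.mean()    # mean
sd = x.std()    # standard deviation

x_standardized = (x - m) / sd

print(x_standardized.mean())   # approximately 0
print(x_standardized.std())    # approximately 1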

Other standardization techniques include scaling by the interquartile range (which is more robust to outliers) and transformations that reshape the data to better fit a Gaussian distribution.
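
If you want to try these alternatives, scikit-learn provides RobustScaler (which centers on the median and scales by the interquartile range) and PowerTransformer (which reshapes data toward a Gaussian and then standardizes it). A minimal sketch on made-up data:

import numpy as np
from sklearn.preprocessing import RobustScaler, PowerTransformer

# made-up feature column with one extreme outlier
x = np.array([[1.0], [2.0], [2.5], [3.0], [100.0]])

# centers on the median and scales by the interquartile range,
# so the outlier has less influence than with StandardScaler
print(RobustScaler().fit_transform(x).ravel())

# applies a power transform (Yeo-Johnson by default) to make the
# data more Gaussian-like, then standardizes the result
print(PowerTransformer().fit_transform(x).ravel())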

Additional Considerations

  1. Calculate the mean and standard deviation for standardization using the training data only. Test-set statistics should not be allowed to influence the transformation; otherwise you introduce data leakage and inflate performance estimates (see the code sketch after this list).
  2. Standardization is more commonly used with machine learning algorithms, such as linear regression, that work best when features are roughly Gaussian. If you are using neural networks, consider normalization instead: it rescales values into a fixed, bounded range, whereas standardization preserves the unbounded shape of the original distribution.
  3. Some advanced ML models are becoming more powerful and are getting better at learning directly from raw, unmodified data. These models can automatically identify patterns in the raw data distribution, reducing the need for explicit preprocessing steps like standardization.
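
To make point 1 concrete, here is a minimal sketch, assuming an arbitrary random dataset and an 80/20 split, of fitting the scaler on the training split only:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 3)   # made-up dataset
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # statistics come from the training data only
X_test_std = scaler.transform(X_test)         # the same statistics are reused on the test data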

Useful Python Code

Using the scikit-learn library’s StandardScaler

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# example df of 6 rows and 3 columns
df = pd.DataFrame(np.random.randn(6, 3))

scaler = StandardScaler()
scaler.fit(df)  # learns each column's mean and standard deviation

# transforms the df to zero mean and unit variance per column
standardized = scaler.transform(df)

print(standardized)

This will output the following (actual values will vary):

[[-0.83491433 -0.51622402 -1.83606008]
 [-0.94568884 -1.56635145  0.73071368]
 [ 1.58298178  0.81657703  0.08833971]
 [ 0.57211209 -0.16847201  0.51508208]
 [ 0.68031602  1.59928463 -0.66986973]
 [-1.05480673 -0.16481419  1.17179434]]
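
To double-check the result, you can print the column means and standard deviations of the standardized array (continuing from the snippet above):

print(standardized.mean(axis=0))   # each column's mean is approximately 0
print(standardized.std(axis=0))    # each column's standard deviation is approximately 1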

And that’s all! I will leave you with some “fun” trivia 😊

Trivia

  • Standardization was used in early statistical models like linear discriminant analysis developed by Fisher in 1936. Fisher recognized that standardizing features aids classification by making coefficients more directly comparable to assess their contributions.
  • The term “standard score” was introduced by psychologist Raymond Cattell in the 1950s for a standardized scale in psychological testing. His work helped popularize standardization in the behavioral sciences.


Wojtek Fulmyk, Data Scientist

Data Scientist, University Instructor, and Chess enthusiast. ML specialist.