Exploring Popular Normalization Techniques: CRISP-DM Data Preparation

Data Mastery Series — Episode 8: Normalization Techniques

Donato_TH
Donato Story
3 min read · Mar 12, 2023


If you are interested in articles related to my experience, please feel free to contact me: linkedin.com/in/nattapong-thanngam

CRISP-DM framework (Image by Author)

Normalization is a fundamental data preprocessing technique in data science that transforms features onto a common scale or range. It is widely used to improve the accuracy and performance of machine learning algorithms and to make data easier to interpret. Normalization makes features more directly comparable, and some methods also reduce the influence of outliers or extreme values. Each method has its strengths and weaknesses, and choosing the appropriate one depends on the nature of the data and the task at hand.

Data Set:

  • The dataset comprises 10,000 randomly generated samples drawn from a Gamma distribution with a shape parameter of 3 and a scale parameter of 3.
Dataset distribution (Image by Author)
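A sample like the one described above can be reconstructed with NumPy (the seed and variable names here are illustrative, not from the original article):

```python
import numpy as np

# Illustrative reconstruction of the dataset described above:
# 10,000 samples from a Gamma distribution with shape=3, scale=3.
rng = np.random.default_rng(42)  # seed chosen only for reproducibility
data = rng.gamma(shape=3.0, scale=3.0, size=(10_000, 1))

# For Gamma(k, theta): mean = k * theta = 9, variance = k * theta**2 = 27,
# so the sample statistics should land close to those values.
print(data.mean(), data.var())
```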

7 Popular Normalization Techniques

  1. StandardScaler Method
  • The StandardScaler method, also known as Z-score normalization or Standardization, scales the data to have a mean of 0 and a standard deviation of 1
StandardScaler Method (Image by Author)
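A minimal scikit-learn sketch of this method, applied to a Gamma sample like the one above (seed and names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))  # right-skewed sample, as above

# z = (x - mean) / std: the result has mean ~0 and standard deviation ~1.
scaled = StandardScaler().fit_transform(X)
# Note: the shape (skewness) of the distribution is unchanged;
# standardization only shifts and rescales.
```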

2. Yeo-Johnson Transformation

  • The Yeo-Johnson transformation extends the Box-Cox transformation, which requires strictly positive data, so that it can also handle zero and negative values. It applies a power transformation whose parameter is chosen by maximum likelihood estimation, making non-normal data more nearly normal.
Yeo-Johnson Transformation (Image by Author)
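In scikit-learn this is available as `PowerTransformer`; a short sketch on the same kind of skewed sample (seed and names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))

# The power parameter lambda is fit by maximum likelihood;
# standardize=True (the default) also rescales the output
# to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(X)
print(pt.lambdas_)  # the fitted power parameter per feature
```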

3. Min-Max Scaler

  • The Min-Max Scaler, also known as Linear normalization or Scaling to a range, is a method for scaling data to a fixed range of values, typically between 0 and 1.
Min-Max Scaler (Image by Author)
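A quick sketch with scikit-learn's `MinMaxScaler` (sample and names assumed as before):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))

# (x - min) / (max - min): minimum maps to 0, maximum maps to 1.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
# Caveat: a single extreme value stretches the range and compresses
# everything else, so this scaler is sensitive to outliers.
```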

4. Robust Scaler

  • The Robust Scaler method centers the data on the median and scales it by the interquartile range (IQR), i.e. (x − median) / IQR. Because the median and IQR are insensitive to extreme values, this method is robust to outliers.
Robust Scaler (Image by Author)
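The scikit-learn version, `RobustScaler`, uses the 25th to 75th percentile range by default (sample and names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))

# (x - median) / IQR: after scaling, the median is ~0
# and the interquartile range is ~1.
scaled = RobustScaler().fit_transform(X)
```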

5. Max Absolute Scaler

  • The Max Absolute Scaler method divides each feature by its maximum absolute value, so the scaled values lie in the range [−1, 1]. It does not shift or center the data, which preserves the sign of each value and keeps sparse data sparse.
Max Absolute Scaler (Image by Author)
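A minimal `MaxAbsScaler` sketch (sample and names assumed, as in the earlier examples):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))  # all values positive here

# x / max(|x|): each feature is divided by its maximum absolute value,
# so results lie in [-1, 1] (here in (0, 1], since the data is positive).
scaled = MaxAbsScaler().fit_transform(X)
```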

6. Log Transformation

  • The Log Transformation applies a logarithmic function to the data to compress the range of values and reduce right skew. It requires positive values; log(1 + x) is a common variant when the data contains zeros.
Log Transformation (Image by Author)
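With NumPy this is a one-liner; `np.log1p` is shown here as the zero-safe variant (sample and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))

# log(1 + x): defined at x = 0, compresses large values.
logged = np.log1p(X)
# The transform is invertible: np.expm1 recovers the original values.
```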

7. Root Transformation

  • Root transformation takes the nth root of each value in a dataset, most commonly the square root (n = 2) or cube root (n = 3). Like the log transform, it compresses large values and reduces right skew, but less aggressively.
Root Transformation (Image by Author)
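Both common roots in NumPy, on the same kind of sample (names and seed assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.gamma(3.0, 3.0, size=(10_000, 1))

sqrt_t = np.sqrt(X)  # square root (n = 2); requires non-negative values
cbrt_t = np.cbrt(X)  # cube root (n = 3); also defined for negative values
# Both compress large values and so reduce right skew,
# the square root more strongly than the cube root... wait, other way:
```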

Summary:

Summary of 7 Popular Normalization Techniques (Image by Author)

Normalization methods play an important role in data preparation and feature engineering. Selecting the right method depends on the characteristics of the data and the goals of the analysis. Understanding the pros, cons, and limitations of each method can help in making an informed decision.

Please feel free to contact me, I am willing to share and exchange on topics related to Data Science and Supply Chain.
Facebook:
facebook.com/nattapong.thanngam
Linkedin:
linkedin.com/in/nattapong-thanngam


Data Science Team Lead at Data Cafe, Project Manager (PMP #3563199), Black Belt-Lean Six Sigma certificate