Data Normalization Techniques

Devanshi Patel
Published in CodeX
4 min read · Mar 30, 2022

What is it, why is it needed and how can it be done?

Photo by Chris Liverani on Unsplash

Edit: Python libraries for normalization

What is Data Normalization?

According to Wikipedia, normalization can have a range of meanings. In the simplest cases, normalization means adjusting values measured on different scales to a notionally common scale.

Normalization is the process of rescaling data so that all attributes share the same scale.

The measurement unit used can affect the data analysis. Hence, data are scaled to fall within a smaller range, such as 0.0 to 1.0. Transforming or mapping the data to a smaller or common range helps give all attributes equal weight. This is known as normalization.

Why is Normalization needed?

Need for normalization:

  1. If the data is not normalized, one feature might completely dominate the others. Normalization puts every feature on the same scale so that each one contributes comparably.
  2. It avoids dependence on the choice of measurement units.
  3. It makes applying data mining algorithms easier, more effective, and more efficient.
  4. More specific data analysis methods can be applied to normalized data.
  5. It prevents attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).
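
To make the "one feature dominates" point concrete, here is a minimal sketch using NumPy. The three rows of income and home-ownership values are made up for illustration and are not from the Boston dataset; the helper name `euclidean` is also just an assumption for this example.

```python
import numpy as np

# Hypothetical toy data: columns are [income in dollars, owns_home (0 or 1)]
X = np.array([
    [30_000.0, 0.0],
    [32_000.0, 1.0],
    [90_000.0, 0.0],
])

def euclidean(a, b):
    """Plain Euclidean distance between two rows."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the income column swamps the binary column.
raw_d01 = euclidean(X[0], X[1])   # differs in home ownership, similar income
raw_d02 = euclidean(X[0], X[2])   # same home ownership, very different income

# Rescale each column to [0, 1] (min-max normalization, covered later).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", raw_d01, raw_d02)
print("scaled:", scaled_d01, scaled_d02)
```

On the raw data the binary attribute barely changes the distance, while after rescaling a difference in home ownership counts about as much as the full income range.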

How can it be done?

We will discuss three types of normalization techniques:

  1. Z-Score Normalization/Standardization
  2. Min-Max Normalization
  3. Decimal Scaling

Note: I have used the Boston House Pricing dataset from Kaggle for code demonstration.

I. Z-Score Normalization

Z-Score Normalization is the process of rescaling features so that they have a mean (μ) of 0 and a standard deviation (σ) of 1, the same mean and standard deviation as a standard normal distribution.

The formula for the same is:

$$z = \frac{X - \mu}{\sigma}$$

where:

  • X is the data point
  • μ is the mean of the attribute values
  • σ is the standard deviation of the attribute values

Features:

  • It scales the variance to 1.
  • It centers the mean at 0.
  • It preserves the shape of the original distribution.
  • It preserves outliers if they exist.
  • The minimum and maximum values are not fixed; they vary from attribute to attribute.

Example:

Suppose the mean and standard deviation of attribute A are 18 and 4.5, respectively. Normalize the value 27 using Z-score normalization.

Solution: z = (27 − 18) / 4.5 = 2

Code:

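# Applying z-score on attribute AGE
# (df is assumed to be a pandas DataFrame already loaded with the dataset)
mean = df["AGE"].mean()
std_dev = df["AGE"].std()  # pandas uses the sample standard deviation (ddof=1) by default
df["z_score"] = (df["AGE"] - mean) / std_dev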

Using a Python library

from scipy.stats import zscore

# scipy's zscore uses the population standard deviation (ddof=0) by default
df["AGE"] = zscore(df["AGE"])
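
The manual pandas version and `scipy.stats.zscore` can give slightly different numbers because their default standard deviations differ (sample, `ddof=1`, versus population, `ddof=0`). The following is a small self-contained sketch of that difference using only NumPy; the `ages` values and the `z_score` helper are hypothetical and chosen just for illustration.

```python
import numpy as np

# Hypothetical attribute values; 100 plays the role of an outlier.
ages = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

def z_score(values, ddof=0):
    """Subtract the mean and divide by the standard deviation.

    ddof=0 uses the population standard deviation,
    ddof=1 uses the sample standard deviation.
    """
    return (values - values.mean()) / values.std(ddof=ddof)

z_pop = z_score(ages, ddof=0)
z_sample = z_score(ages, ddof=1)

print("population version:", z_pop)
print("sample version:    ", z_sample)
print("mean:", z_pop.mean(), "std:", z_pop.std())
```

Both versions center the mean at 0 and keep the outlier as the largest value; they differ only by a constant factor, which matters mostly for small datasets.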

II. Min-Max Normalization

Min-max normalization performs a linear transformation on the original data, mapping it to a target range such as [0, 1] or [−1, 1]. The choice of target range depends on the nature of the data.

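If $\min_A$ and $\max_A$ are the minimum and maximum values of an attribute A, min-max normalization maps a value $v_i$ of A to $v_i'$ in the range $[\text{new\_min}_A, \text{new\_max}_A]$ by computing:

$$v_i' = \frac{v_i - \min_A}{\max_A - \min_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$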

Min-max normalization preserves the relationships among the original data values. However, it will encounter an “out-of-bounds” error if a future input case for normalization falls outside the original data range for A.
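
The out-of-bounds situation can be sketched as follows. The `train` values, the new value 120, and the `min_max` helper are assumptions made up for this illustration; clipping is shown as just one possible way to handle such a value, not the only one.

```python
import numpy as np

# Hypothetical income values seen when the normalization was set up.
train = np.array([12.0, 40.0, 73.0, 98.0])
old_min, old_max = train.min(), train.max()

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

in_range = min_max(73.0, old_min, old_max)       # inside the original range
out_of_range = min_max(120.0, old_min, old_max)  # larger than any value seen so far

# One possible remedy (an assumption, not the only option): clip to the target range.
clipped = float(np.clip(out_of_range, 0.0, 1.0))

print(in_range, out_of_range, clipped)
```

The value 120 lands above 1 because the minimum and maximum were taken from the earlier data only.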

Features:

  • It does not center the mean at 0.
  • The resulting variance differs from one variable to another.
  • Because the transformation is linear, it keeps the shape of the original distribution and changes only its location and scale.
  • The minimum and maximum values become 0 and 1 when the target range is [0, 1].
  • This method is very sensitive to outliers.
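
The sensitivity to outliers can be seen in a short sketch. The two small lists below are hypothetical, and `min_max` is a helper written only for this example.

```python
import numpy as np

def min_max(values):
    """Rescale values to [0, 1] using their own minimum and maximum."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

clean = min_max([10, 20, 30, 40, 50])
with_outlier = min_max([10, 20, 30, 40, 1000])

print(clean)         # evenly spread over [0, 1]
print(with_outlier)  # the first four values are squeezed close to 0
```

A single extreme value stretches the range, so the remaining values end up packed into a narrow band near 0.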

Example:

Let income range from 12 to 98. Map income to the range [0.0, 1.0]. Using min-max normalization, what does an income value of 73 become?

Solution: (73 − 12) / (98 − 12) × (1.0 − 0.0) + 0.0 = 61/86 ≈ 0.709

Code:

# Applying min-max on attribute TAX
max_tax = df["TAX"].max()
min_tax = df["TAX"].min()
new_max = 1
new_min = 0
df["min_max"] = (df["TAX"] - min_tax) / (max_tax - min_tax) * (new_max - new_min) + new_min

Using a Python library

from sklearn.preprocessing import MinMaxScaler

# define min max scaler
scaler = MinMaxScaler()
# transform data; the scaler expects 2-D input, so select the column as a DataFrame
scaled = scaler.fit_transform(df[["TAX"]])

III. Decimal Scaling

Decimal Scaling normalizes the values of attribute A by moving their decimal point. The number of places the decimal point moves depends on the maximum absolute value of the attribute.

$$v_i' = \frac{v_i}{10^j}$$

where:

  • j is the smallest integer such that max(|vᵢ/10ʲ|) < 1

Example:

The observed values for attribute A lie in the range from -986 to 917 and the maximum absolute value for attribute A is 986. Normalize the data using Decimal Scaling.

Solution:

Here, to normalize each value of attribute A using decimal scaling, we have to divide each value by 1000, i.e. j = 3.

So, the value -986 would get normalized to -0.986 and 917 would get normalized to 0.917.

Code:

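# Applying decimal scaling on attribute B
# Count the digits in the integer part of the largest absolute value (assumes it is at least 1)
max_abs_b = str(int(df["B"].abs().max()))
df["decimal_scaling"] = df["B"] / (10 ** len(max_abs_b))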
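
As a cross-check of the worked example, here is a minimal sketch that searches for j directly from its definition instead of counting digits. The function name `decimal_scale` is hypothetical, and the sketch only considers j ≥ 0, so it leaves data whose maximum absolute value is already below 1 unchanged.

```python
import numpy as np

def decimal_scale(values):
    """Divide by 10**j, with j the smallest non-negative integer
    such that the largest absolute scaled value is below 1."""
    values = np.asarray(values, dtype=float)
    max_abs = np.abs(values).max()
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return values / 10 ** j, j

# The range from the example above: -986 to 917
scaled, j = decimal_scale([-986, 917])
print(j, scaled)
```

Using the absolute value matters here: the largest magnitude is 986 (from −986), not the maximum value 917, although both happen to have three digits in this example.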

Thank you for reading!
