What is Normalization In Machine Learning…?

Brian Daniel Thomas
5 min read · Aug 13, 2022

Before you read about Normalization, I suggest you read about Standardization as well. (Since the two topics are quite similar, I’ve kept most of the content the same.)

Link to my GitHub - Jupyter Notebook!

Just like we read in Standardization about the data being rescaled to a scale common to all the values (centered on a mean of 0 with a standard deviation of 1), Normalization is similar but distinct, and in some cases the better choice. Normalization (also called min-max scaling) rescales the values into the range 0 to +1.
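To make the contrast concrete, here is a minimal sketch using only the Python standard library. The helper names `standardize` and `normalize` are illustrative (they are not taken from the notebook), and the sample values are the ones used later in this post.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

data = [1, 99, 789, 5, 6859, 541, 94142, 7, 50826, 35464]

standardized = standardize(data)
normalized = normalize(data)

# Standardized values are centered on 0 and are not confined to a fixed interval.
print("standardized min/max:", min(standardized), max(standardized))
# Normalized values always start at 0 and end at 1.
print("normalized   min/max:", min(normalized), max(normalized))
```

The standardized output has no fixed lower or upper bound, while the normalized output is always bounded by 0 and 1.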

What actually happens when we downscale the Values…?

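Many Machine Learning Models, especially those trained with gradient descent or based on distances between points, tend to work better when the input numbers are small in magnitude (not in count) and on a comparable scale. Hence, if we are able to downscale the values to the range 0 to 1, that usually helps.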
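One commonly cited reason is that a feature with large numbers can swamp a feature with small numbers in any distance calculation. The sketch below uses two made-up samples and assumed feature ranges (both are hypothetical, chosen only for illustration) to show the effect before and after scaling.

```python
import math

# Two hypothetical samples with two features each:
# feature A lives roughly in 1..95,000, feature B lives in 0..10.
sample_1 = (1_000.0, 2.0)
sample_2 = (1_500.0, 9.0)

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max(value, lo, hi):
    """Rescale a single value into 0..1 given its feature's min and max."""
    return (value - lo) / (hi - lo)

# Assumed (min, max) per feature, used only for this illustration.
ranges = [(1.0, 95_000.0), (0.0, 10.0)]

raw_distance = euclidean(sample_1, sample_2)

scaled_1 = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(sample_1, ranges))
scaled_2 = tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(sample_2, ranges))
scaled_distance = euclidean(scaled_1, scaled_2)

print("raw distance   :", raw_distance)     # dominated by feature A
print("scaled distance:", scaled_distance)  # feature B now contributes most
```

On the raw values, the gap of 500 in feature A hides the gap of 7 in feature B, even though the latter covers most of feature B's range. After scaling, both features are compared on equal terms.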

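But another problem arises: what about the Range of the values, i.e., the relative spacing between them? If we’re downscaling, won’t it affect that?

This is also handled by Normalization! Every value is shifted and divided by the same two constants, so the order of the values and the relative gaps between them are maintained.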
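A quick way to check this claim is to compare the ordering and the ratio of two gaps before and after scaling. This is a small sketch on the sample values used later in the post; the choice of which gaps to compare is arbitrary.

```python
import math

data = [1, 99, 789, 5, 6859, 541, 94142, 7, 50826, 35464]
lo, hi = min(data), max(data)
scaled = [(v - lo) / (hi - lo) for v in data]

# Ordering: sorting positions by original value or by scaled value gives the same result.
order_original = sorted(range(len(data)), key=lambda i: data[i])
order_scaled = sorted(range(len(data)), key=lambda i: scaled[i])
same_order = order_original == order_scaled

# Relative gaps: the ratio between two gaps is unchanged (up to floating-point rounding).
gap_ratio_original = (data[8] - data[9]) / (data[6] - data[8])
gap_ratio_scaled = (scaled[8] - scaled[9]) / (scaled[6] - scaled[8])

print("same ordering:", same_order)
print("gap ratio (original):", gap_ratio_original)
print("gap ratio (scaled)  :", gap_ratio_scaled)
```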

1] We have collected our data:

Our data can be in various formats, e.g., numbers (integers) & words (strings). For now we’ll consider only the numbers in our Dataset.

Assume our dataset has random numeric values in the range 1 to 95,000. Obviously in random order though. Just for our understanding consider a small Dataset of barely 10 values with numbers in the given range and randomized order.

Original Values for our Input Dataset

If we just look at these values, their range is very wide. If we train a model on 10,000 such values, training can take a lot of time. So we have a problem.

2] Munging/Cleaning the Data by performing Feature Engineering

3] Pre-Processing the Data:

But we have a solution for the same and a very prominent one! (Even Better than Standardization at times)

Normalization helps us solve this by

  • Down Scaling the Values to a scale common to all (always in the range 0 to +1).
  • And keeping the Range between the values intact.

So, how do we do that? Well there’s a mathematical formula for the same i.e.,

Normalization = (Current_value - Min) / (Max - Min)

Using this formula, we replace each and every input value with its Normalized value.

Hence we get values ranging from 0 to +1, keeping the range intact.
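As a compact illustration, the formula can be wrapped in a small helper. This is a minimal sketch: the function name `min_max_normalize` and the five sample values are made up for this example, and it assumes the values are not all identical (otherwise Max - Min would be 0).

```python
def min_max_normalize(values):
    """Return a new list with each value rescaled to the range 0 to 1."""
    lo = min(values)
    hi = max(values)
    span = hi - lo  # assumed non-zero, i.e. at least two distinct values
    return [(v - lo) / span for v in values]

small = [10, 20, 30, 40, 50]
print(min_max_normalize(small))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Evenly spaced inputs come out evenly spaced between 0 and 1, which is the "range kept intact" property in action.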

Normalization performs the following:

1) Downscales all the Values to a range between 0 and 1

2) Keeps the Range intact even after downscaling

NOTE : (Just for Better Understanding)

When Current_value is the Smallest Value (the Min), subtracting the Min makes the Numerator 0. Hence we get (0) as Output.
When Current_value is the Largest Value (the Max), subtracting the Min makes the Numerator == Denominator. Hence we get (1) as Output.

Hence, all our Normalized Values will ALWAYS be in the range 0–1.
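The two boundary cases can be checked directly. The sketch below also guards against one case the formula does not cover: when every value is the same, Max - Min is 0 and the division fails. Returning 0.0 for every entry in that case is just one possible convention, assumed here for illustration; the function name is hypothetical.

```python
def min_max_normalize_safe(values):
    """Min-max scaling with a guard for the case where all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Max - Min would be 0; fall back to all zeros (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

data = [1, 99, 789, 5, 6859, 541, 94142, 7, 50826, 35464]
scaled = min_max_normalize_safe(data)

print(scaled[data.index(min(data))])         # 0.0 (the smallest value)
print(scaled[data.index(max(data))])         # 1.0 (the largest value)
print(all(0.0 <= s <= 1.0 for s in scaled))  # True
print(min_max_normalize_safe([7, 7, 7]))     # [0.0, 0.0, 0.0]
```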

Let’s look at the Execution/ Implementation now

Here we are doing the Following:

1) Calculating the Min of all the Values

2) Calculating the Max of all the Values

3) Substituting the same and Calculating the Normalized Values

We’ll check out the Basic Implementation

Whole Program:

#Input Values for our Dataset
dataset_0 = [10,5,6,1,3,7,9,4,8,2]
dataset_1 = [1,99,789,5,6859,541,94142,7,50826,35464]
n = len(dataset_1)

min_dataset_1 = min(dataset_1)
max_dataset_1 = max(dataset_1)

print("Min(dataset_1) : ", min_dataset_1 )
print("Max(dataset_1) : ", max_dataset_1 )
final_normalization = []
#Calculating the Normalized Values for all the Input Values
print("\nNormalization = (Current_Value - Min) / (Max - Min)\n")

for i in dataset_1:
    normalization = (i - min_dataset_1) / (max_dataset_1 - min_dataset_1)
    final_normalization.append("{:.20f}".format(normalization))

    print("For", i)
    print("\t (", i,"-",min_dataset_1,") / (", max_dataset_1, "-", min_dataset_1,")")
    print("\tNormalization(", i, ") : {:.20f}".format(normalization))
    print("\n")
#Comparing the Original Values and the Normalized Values
print(" Original DataSet | Normalization ")
print()
for i in range(len(dataset_1)):
    print(" ", dataset_1[i], " | ", final_normalization[i])
#Comparing the Ranges of the Original Values and the Normalized Values
import matplotlib
import matplotlib.pyplot as plt
#Scatter Plot of the Original Values
plt.scatter(dataset_0, dataset_1, label= "stars", color= "blue", marker= "*", s=40)

plt.xlabel('Index')
plt.ylabel('Original Values')

plt.title('Graph of Original Values')
plt.legend()

plt.show()
#Scatter Plot of the Normalized Values
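#The Normalized Values were stored as formatted strings, so convert them to floats before plotting
plt.scatter(dataset_0, [float(v) for v in final_normalization], label= "stars", color= "blue", marker= "*", s=30)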

plt.xlabel('Index')
plt.ylabel('Normalization Values')

plt.title('Graph of Normalized Values')
plt.legend()

plt.show()

Calculating the basic values

1) No. of Input Data

2) Min of Dataset

3) Max of Dataset

Python3

dataset_0 = [10,5,6,1,3,7,9,4,8,2]
dataset_1 = [1,99,789,5,6859,541,94142,7,50826,35464]
n = len(dataset_1)
min_dataset_1 = min(dataset_1)
max_dataset_1 = max(dataset_1)
print("n : ", n)
print("Min(dataset_1) : ", min_dataset_1 )
print("Max(dataset_1) : ", max_dataset_1 )

Finding the Normalized Values for Each Input Value in the Dataset

Python3

final_normalization = []
print("\nNormalization = (Current_Value - Min) / (Max - Min)\n")
for i in dataset_1:
    normalization = (i - min_dataset_1) / (max_dataset_1 - min_dataset_1)
    final_normalization.append("{:.20f}".format(normalization))

    print("For", i)
    print("\t (", i,"-",min_dataset_1,") / (", max_dataset_1, "-", min_dataset_1,")")
    print("\tNormalization(", i, ") : {:.20f}".format(normalization))
    print("\n")

Normalized Values for our Dataset

Comparing the Input Values and the Normalized Values in the Dataset

Python3

print("  Original DataSet   |         Normalization ")
print()
for i in range(len(dataset_1)):
    print(" ", dataset_1[i], " | ", final_normalization[i])

Comparing the Original and Normalized Values

Comparing the Ranges of the Input Values and the Normalized Values:

Python3

import matplotlib
import matplotlib.pyplot as plt

Scatter Plot of the Input Values:

Python3

plt.scatter(dataset_0, dataset_1, label= "stars", color= "blue", marker= "*", s=40)

plt.xlabel('Index')
plt.ylabel('Original Values')
plt.title('Graph of Original Values')
plt.legend()
plt.show()

Scatter Plot of Input Variables

Scatter Plot of the Normalized Values:

Python3

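#The Normalized Values were stored as formatted strings, so convert them to floats before plotting
plt.scatter(dataset_0, [float(v) for v in final_normalization], label= "stars", color= "blue", marker= "*", s=30)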

plt.xlabel('Index')
plt.ylabel('Normalization Values')

plt.title('Graph of Normalized Values')
plt.legend()

plt.show()

Scatter Plot of Normalized Variables

Hence we have Reviewed and Understood the Concept of Normalization in Machine Learning, and Implemented it as well.
