What is Normalization in Machine Learning?
Before you read about Normalization, I suggest you read about Standardization as well. (Since the two topics are quite similar, I’ve kept most of the content the same.)
Link to my GitHub - Jupyter Notebook!
Just like we read in Standardization about the data being rescaled to a scale common to all the values (zero mean and unit standard deviation, so most values end up within a few units of 0), Normalization is similar but distinct. Normalization, also called min-max scaling, downscales the values into the fixed range 0 to +1.
What actually happens when we downscale the Values?
Many Machine Learning Models, especially ones trained with gradient descent or based on distances between points, tend to train faster and more stably when the numbers are small in magnitude (not in count). Hence if we are able to downscale the values to 0 to 1, it helps a lot.
But another problem arises: what about the Range of all the values? i.e., if we’re downscaling, won’t it distort how the values are spaced relative to each other?
This is also handled by Normalization! It is a linear rescaling, so the order of the values and the relative gaps between them (their Range) are maintained.
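To make the "Range is maintained" claim concrete, here is a minimal sketch. The four sample values are made up just for illustration, and it uses the min-max formula that is explained further down in this post. It checks that the sorted order and the ratio between neighbouring gaps stay the same after downscaling.

```python
# Hypothetical sample values, chosen only for illustration.
values = [10, 20, 40, 100]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]

# The sorting order is unchanged by the rescaling.
order_before = sorted(range(len(values)), key=lambda i: values[i])
order_after = sorted(range(len(scaled)), key=lambda i: scaled[i])

# The ratio between neighbouring gaps is unchanged too
# (up to floating-point rounding): (40 - 20) / (20 - 10).
gap_ratio_before = (values[2] - values[1]) / (values[1] - values[0])
gap_ratio_after = (scaled[2] - scaled[1]) / (scaled[1] - scaled[0])

print(scaled)
print(order_before == order_after)
print(gap_ratio_before, gap_ratio_after)
```

The absolute gaps shrink, of course, but every gap shrinks by the same factor, which is what "keeping the Range intact" means here.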
1] We have collected our data:
Our data can be in various formats, i.e., numbers (integers) & words (strings). For now we’ll consider only the numbers in our Dataset.
Assume our dataset has numeric values in the range 1 to 95,000, in random order. Just for our understanding, consider a small Dataset of barely 10 values in that range.
If we just look at these values, their range is very large. If we train a model on 10,000 such values, training can take a lot of time. So we have a problem.
2] Munging/Cleaning the Data by performing Feature Engineering
3] Pre-Processing the Data:
But we have a well-known solution for this! (At times it works even better than Standardization.)
Normalization helps us solve this by
- Downscaling the values to a scale common to all (always in the range 0 to +1).
- And keeping the Range (the relative spacing) between the values intact.
So, how do we do that? Well, there’s a mathematical formula for it:
Normalization = (Current_value - Min) / (Max - Min)
Using this formula, we replace each and every input value with its Normalized value.
Hence we get values ranging from 0 to +1, keeping the Range intact.
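The formula can be wrapped in a small helper. This is only a sketch: the function name `min_max_normalize` is my own, and returning 0.0 when all values are identical is an assumption made to avoid dividing by zero (the formula itself is undefined when Max equals Min).

```python
def min_max_normalize(values):
    """Rescale a list of numbers into the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with identical values the formula would divide by zero,
        # so this sketch maps every value to 0.0 instead.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```

Here Min is 2 and Max is 10, so 4 becomes (4 - 2) / (10 - 2) = 0.25 and 6 becomes (6 - 2) / (10 - 2) = 0.5.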
Normalization performs the following:
1) Downscales all the Values to a range between 0 and 1
2) Keeps the Range intact even after downscaling
NOTE: (Just for Better Understanding)
When the Current_value is the Smallest Value (the Min), subtracting Min from it makes the Numerator 0. Hence we get (0) as Output.
When the Current_value is the Largest Value (the Max), subtracting Min from it makes the Numerator == Denominator. Hence we get (1) as Output.
Hence, all our Normalized Values will ALWAYS be in the range 0 to 1, as long as Max and Min are different (otherwise the Denominator would be 0).
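The same note written out as a short derivation may help. The symbols x (the Current_value) and x' (its Normalized value) are my own notation, and the sketch assumes Max is strictly greater than Min.

```latex
x' = \frac{x - \min}{\max - \min}

x = \min \;\Rightarrow\; x' = \frac{\min - \min}{\max - \min} = 0
\qquad
x = \max \;\Rightarrow\; x' = \frac{\max - \min}{\max - \min} = 1

\min \le x \le \max
\;\Rightarrow\; 0 \le x - \min \le \max - \min
\;\Rightarrow\; 0 \le x' \le 1
```

The last line is why every value between the Min and the Max lands between 0 and 1.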
Let’s look at the Execution/ Implementation now
Here we are doing the Following:
1) Calculating the Min of all the Values
2) Calculating the Max of all the Values
3) Substituting them into the formula and Calculating the Normalized Values
We’ll check out the Basic Implementation
Whole Program:
#Input Values for our Dataset
dataset_0 = [10,5,6,1,3,7,9,4,8,2]
dataset_1 = [1,99,789,5,6859,541,94142,7,50826,35464]
n = len(dataset_1)
min_dataset_1 = min(dataset_1)
max_dataset_1 = max(dataset_1)
print("Min(dataset_1) : ", min_dataset_1 )
print("Max(dataset_1) : ", max_dataset_1 )
final_normalization = []
#Calculating the Normalized Values for all the Input Values
print("\nNormalization = (Current_Value - Min) / (Max - Min)\n")
for i in dataset_1:
    normalization = (i - min_dataset_1) / (max_dataset_1 - min_dataset_1 )
    #Stored as a 20-decimal string so it prints neatly in the comparison below
    final_normalization.append("{:.20f}".format(normalization))
    print("For", i)
    print("\t (", i,"-",min_dataset_1,") / (", max_dataset_1, "-", min_dataset_1,")")
    print("\tNormalization(", i, ") : {:.20f}".format(normalization))
    print("\n")
#Comparing the Original Values and the Normalized Values
print(" Original DataSet | Normalization ")
print()
for i in range (len(dataset_1)):
    print(" ", dataset_1[i], " | ", final_normalization[i])
#Comparing the Ranges of the Original Values and the Normalized Values
import matplotlib
import matplotlib.pyplot as plt
#Scatter Plot of the Original Values
plt.scatter(dataset_0, dataset_1, label= "stars", color= "blue", marker= "*", s=40)
plt.xlabel('Index')
plt.ylabel('Original Values')
plt.title('Graph of Original Values')
plt.legend()
plt.show()
#Scatter Plot of the Normalized Values
#final_normalization holds strings, so convert them back to floats;
#otherwise matplotlib treats them as category labels instead of numbers
normalized_floats = [float(v) for v in final_normalization]
plt.scatter(dataset_0, normalized_floats, label= "stars", color= "blue", marker= "*", s=30)
plt.xlabel('Index')
plt.ylabel('Normalization Values')
plt.title('Graph of Normalized Values')
plt.legend()
plt.show()
Calculating the basic values
1) No. of Input Data
2) Min of Dataset
3) Max of Dataset
Python3
dataset_0 = [10,5,6,1,3,7,9,4,8,2]
dataset_1 = [1,99,789,5,6859,541,94142,7,50826,35464]
n = len(dataset_1)
min_dataset_1 = min(dataset_1)
max_dataset_1 = max(dataset_1)
print("n : ", n)
print("Min(dataset_1) : ", min_dataset_1 )
print("Max(dataset_1) : ", max_dataset_1 )
Finding the Normalized Values for Each Input Value in the Dataset
Python3
final_normalization = []
print("\nNormalization = (Current_Value - Min) / (Max - Min)\n")
for i in dataset_1:
    normalization = (i - min_dataset_1) / (max_dataset_1 - min_dataset_1 )
    final_normalization.append("{:.20f}".format(normalization))
    print("For", i)
    print("\t (", i,"-",min_dataset_1,") / (", max_dataset_1, "-", min_dataset_1,")")
    print("\tNormalization(", i, ") : {:.20f}".format(normalization))
    print("\n")
Comparing the Input Values and the Normalized Values in the Dataset
Python3
print(" Original DataSet | Normalization ")
print()
for i in range (len(dataset_1)):
    print(" ", dataset_1[i], " | ", final_normalization[i])
Comparing the Ranges of the Input Values and the Normalized Values:
Python3
import matplotlib
import matplotlib.pyplot as plt
Scatter Plot of the Input Values:
Python3
plt.scatter(dataset_0, dataset_1, label= "stars", color= "blue", marker= "*", s=40)
plt.xlabel('Index')
plt.ylabel('Original Values')
plt.title('Graph of Original Values')
plt.legend()
plt.show()
Scatter Plot of the Normalized Values:
Python3
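#final_normalization holds formatted strings, so convert them back to floats;
#otherwise matplotlib treats them as category labels instead of numbers
normalized_floats = [float(v) for v in final_normalization]
plt.scatter(dataset_0, normalized_floats, label= "stars", color= "blue", marker= "*", s=30)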
plt.xlabel('Index')
plt.ylabel('Normalization Values')
plt.title('Graph of Normalized Values')
plt.legend()
plt.show()