Data Normalization With R

Nikhita Singh Shiv Kalpana
Published in The Startup · Jul 5, 2020

Preprocessing is one of the crucial steps of data analysis, and one of its preliminary steps is feature scaling. Programmers new to data science often neglect or bypass this step and go straight to analysing the data; this introduces bias and, in turn, hurts prediction accuracy.


What is Data Normalization?

Data Normalization is a data preprocessing step in which we adjust the scales of the features so that they share a common scale of measurement. In machine learning, it is also known as feature scaling.

Why do we need Data Normalization?

Machine learning algorithms such as distance-based algorithms and gradient-descent-based algorithms expect the features to be on comparable scales.

Why do these algorithms need the features to be scaled? To answer this, let's look at an example. We have data on 35 job holders with two variables: years of experience, ranging from 1.1 to 10.5 years, and the corresponding salary, ranging from $37k to $122k.

This data was downloaded from this link: https://www.kaggle.com/rsadiq/salary
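
The loading step in the original post is shown only as a screenshot; here is a minimal sketch of how the data could be read in R, assuming the Kaggle file has been saved locally as Salary.csv and that its columns are named YearsExperience and Salary (both the file name and the column names are assumptions).

```r
# Assumption: the Kaggle file is saved locally as "Salary.csv"
# with the columns YearsExperience and Salary.
salary <- read.csv("Salary.csv")

head(salary, 13)  # first 13 rows of the salary data
summary(salary)   # quick descriptive statistics
```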

First 13 rows of the salary data

Distance-based algorithms, such as SVM, K-Means, and KNN, classify objects by measuring the similarity between them with distance functions. These algorithms are sensitive to the scale of the variables. If we don't scale, the feature with the larger magnitude (Salary) will have more influence than the feature with the smaller magnitude (years of experience); this biases the model and hurts its accuracy.

On the other hand, we have gradient-descent-based algorithms, which iteratively update a parameter vector θ using a single learning rate. When the scales of the variables differ, the loss surface is elongated: updates along the large-scale feature's direction are much bigger than along the small-scale feature's, so the algorithm zig-zags and converges slowly in the small-scale direction. Scaling the variables lets the same learning rate work well in every direction, which makes convergence towards the minimum faster. This can be seen in the diagram below: the left graph, with unscaled data, takes longer to converge than the scaled data in the right-hand graph.

Diagram: gradient descent on unscaled features (left) vs scaled features (right)

Note: If we solve for the regression parameters in closed form, we don't need to apply scaling, though it is still good practice. Scaling becomes essential when the regression is fit with regularisation or with gradient descent.

We also have Principal Component Analysis (PCA), which requires scaling because it tries to capture the directions of maximum variance. If the data are not scaled, the feature with the larger scale dominates the variance and biases the components towards itself.

Tree-based models split on one feature at a time using threshold rules for classification and regression, so they are unaffected by the scale of the features. Hence, scaling is not required for tree-based models.

Methods of Data Normalization

  1. Z-score Normalization (Standardization)
  2. Robust Scaler
  3. Min-Max Normalization
  4. Mean Normalization
  5. Unit Length

Z-score Normalization (Standardization)

Z-score Normalization transforms x to x' by subtracting the sample mean from each value of the feature and then dividing by the sample standard deviation. The resulting standardized values have a mean of 0 and a standard deviation of 1.

The formula for Standardization is:

x' = (x - mean(x)) / sd(x)
Code for Z-score Normalization of Salary Data
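
The original code appears as a screenshot; a minimal sketch of the standardization step is shown below, using the salary data frame loaded earlier. Base R's scale() centres and scales each column by default.

```r
# Z-score normalization: (x - mean(x)) / sd(x), applied column-wise by scale()
salary_z <- as.data.frame(scale(salary))

head(salary_z)        # first 6 rows of standardized data
colMeans(salary_z)    # sample means are ~ 0
sapply(salary_z, sd)  # sample standard deviations are 1
```
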
First 6 rows of Standardized data (Z-score Normalization)
Descriptive Statistics table of Standardized Years of Experience Variable
Descriptive Statistics table of Standardized Salary Variable

As we can see from the descriptive statistics tables, the sample means are approximately 0 and the sample standard deviations are 1.

We can also see that the minimum and maximum values under z-score standardisation are not bounded to a fixed range. Even though standardisation is not bounded, the sample mean and variance are still affected by outliers. When we have outliers in the data, it is best to use the Robust Scaler.

Robust Scaler

As mentioned above, the Robust Scaler can be used when we have outliers in the data, as it is robust to outliers.

The Robust Scaler transforms x to x' by subtracting the median from each value of the feature and dividing by the interquartile range, the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile).

The formula for the Robust Scaler is:

x' = (x - median(x)) / (Q3(x) - Q1(x))
Code for Robust Scaler of Salary Data
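
A sketch of the robust scaling step, again assuming the salary data frame from earlier; median() and IQR() from base R give the column-wise median and interquartile range.

```r
# Robust scaling: (x - median(x)) / IQR(x), where IQR(x) = Q3(x) - Q1(x)
robust_scale <- function(x) (x - median(x)) / IQR(x)

salary_robust <- as.data.frame(lapply(salary, robust_scale))

head(salary_robust)     # first 6 rows
summary(salary_robust)  # medians are 0; min and max are not bounded
```
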
First 6 rows of Robust Scaler data
Descriptive Statistics table of Robust Scaler of Years of Experience Variable
Descriptive Statistics table of Robust Scaler of Salary Variable

As we can see from the descriptive statistics tables, the median is 0. We can also see that the sample mean and standard deviation are not 0 and 1, respectively, and that the minimum and maximum values are not bounded.

Min-Max Normalization

Min-Max Normalization transforms x to x' by rescaling each value of the feature to lie between 0 and 1, which is why it is also known as (0–1) Normalization. (A variant of the formula, shown further below, can rescale to any other range, such as -1 to 1.)
The formula for Min-Max Normalization is:

x' = (x - min(x)) / (max(x) - min(x))

If we want to scale to an arbitrary bounded range [a, b], for example in image processing, where pixel intensities are often required to lie in a fixed range such as [0, 255] or [0, 1], we can use the following formula:

x' = a + (x - min(x)) · (b - a) / (max(x) - min(x))
Min-Max Normalization of Salary Data
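
A sketch of the min-max step on the salary data frame, together with the arbitrary-range variant from the second formula (the helper names are my own).

```r
# Min-max normalization: (x - min(x)) / (max(x) - min(x)), giving values in [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Variant that rescales to an arbitrary range [a, b]
min_max_range <- function(x, a, b) a + (x - min(x)) * (b - a) / (max(x) - min(x))

salary_minmax <- as.data.frame(lapply(salary, min_max))

head(salary_minmax)     # first 6 rows of (0-1) normalized data
summary(salary_minmax)  # min is 0 and max is 1 for each column
```
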
First 6 rows of Normalized data (Min-Max Normalization)
Descriptive Statistics table of Normalized Years of Experience Variable
Descriptive Statistics table of Normalized Salary Variable

As we can see from the descriptive statistics tables, the minimum value of each feature gets normalized to 0, the maximum value gets normalized to 1, and the remaining values fall between 0 and 1.

Mean Normalization

Mean Normalization transforms x to x' in the same way as Min-Max Normalization, except that each value of the feature is centred by subtracting the sample mean instead of the minimum.
The formula for Mean Normalization is:

x' = (x - mean(x)) / (max(x) - min(x))
Code for Mean Normalization of Salary Data
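
A sketch of the mean normalization step on the same data frame:

```r
# Mean normalization: (x - mean(x)) / (max(x) - min(x))
mean_norm <- function(x) (x - mean(x)) / (max(x) - min(x))

salary_mean <- as.data.frame(lapply(salary, mean_norm))

head(salary_mean)  # first 6 rows of mean normalized data
```
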
First 6 rows of Mean Normalised data

Unit Length Normalization

Unit Length Normalization transforms x to x' by dividing each value of the feature vector by the Euclidean length of that vector.

x' = x / ||x||, where ||x|| is the Euclidean length of the feature vector
Code for Unit Length of Salary Data
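
A sketch of the unit length step; here the scaling is applied column-wise, so that each feature column ends up with Euclidean norm 1 (the original code is not shown, so this is an assumption).

```r
# Unit length scaling: divide each feature column by its Euclidean length
unit_length <- function(x) x / sqrt(sum(x^2))

salary_unit <- as.data.frame(lapply(salary, unit_length))

head(salary_unit)  # first 6 rows of unit length data
```
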
First 6 rows of Unit Length data

Min-Max Normalization and Unit Length scaling both produce values bounded in [0, 1]. This is a disadvantage when we have outliers in the data; if the data contain outliers, it is best to use the Robust Scaler.

Scatter Plot
Now we plot scatter plots to see the distribution of the original, min-max normalized, standardized, mean normalized, and unit length data.

Plotting the original data:

Code for scatter plot of Original Data
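
A minimal base-graphics sketch of the original-data scatter plot, assuming the column names from earlier:

```r
# Scatter plot of the original (unscaled) data
plot(salary$YearsExperience, salary$Salary,
     xlab = "Years of Experience", ylab = "Salary",
     main = "Original Data", pch = 19, col = "steelblue")
```
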
Scatter plot of Original Data

Plotting all the normalized data:

Code for scatter plot of different methods of scaled data
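
A sketch that draws one scatter plot per scaling method on a single grid, using the scaled data frames built in the earlier snippets (base graphics; the original post may have used a different plotting package).

```r
# One scatter plot per scaling method, drawn on a 2 x 3 grid
scaled <- list(
  "Min-Max"       = salary_minmax,
  "Z-score"       = salary_z,
  "Mean Norm"     = salary_mean,
  "Unit Length"   = salary_unit,
  "Robust Scaler" = salary_robust
)

op <- par(mfrow = c(2, 3))
for (nm in names(scaled)) {
  d <- scaled[[nm]]
  plot(d$YearsExperience, d$Salary, main = nm,
       xlab = "Years of Experience (scaled)", ylab = "Salary (scaled)",
       pch = 19, col = "darkorange")
}
par(op)
```
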
Scatter Plot of different methods of scaled data

As we can see by comparing the original data and the scaled data, scaling did not affect the distribution of, or the relationship between, salary and years of experience.

Normalization vs Standardization

The most commonly used scaling methods are min-max normalization and standardization. Let's see how the normalized and the standardized data are scattered.

Code for Scatter plot for Standardization vs Normalization
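
A sketch that overlays the two scaled versions in one plot (base graphics, again an assumption about the original setup):

```r
# Overlay the standardized and the min-max normalized data in one plot
plot(salary_z$YearsExperience, salary_z$Salary,
     col = "firebrick", pch = 19,
     xlab = "Years of Experience (scaled)", ylab = "Salary (scaled)",
     main = "Standardization vs Min-Max Normalization")
points(salary_minmax$YearsExperience, salary_minmax$Salary,
       col = "steelblue", pch = 17)
legend("topleft",
       legend = c("Z-score standardized", "Min-max normalized"),
       col = c("firebrick", "steelblue"), pch = c(19, 17))
```
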
Scatter plot for Standardisation vs Normalization

As we can see, the min-max normalized data are bounded between 0 and 1, while the standardised data have no fixed bounds.

The effect of Normalization vs Standardization on KNN Algorithm

Let's see the impact of scaling on an algorithm such as KNN, where we require the features to be scaled. This example shows the importance of scaling the features.

Code for KNN
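
The KNN code is not reproduced in the post; below is a sketch of one way to set up the comparison, assuming the FNN package, k = 3, and a simple 70/30 train/test split (all of these are assumptions, and the RMSE table above comes from the author's own run). Note that each RMSE is computed on that data set's own scale of Salary.

```r
# KNN regression of Salary on YearsExperience for the original and scaled data,
# using FNN::knn.reg (an assumption; the original code is not shown)
library(FNN)

set.seed(42)
train_idx <- sample(seq_len(nrow(salary)), size = round(0.7 * nrow(salary)))

knn_rmse <- function(df, k = 3) {
  train <- df[train_idx, ]
  test  <- df[-train_idx, ]
  pred  <- knn.reg(train = train["YearsExperience"],
                   test  = test["YearsExperience"],
                   y     = train$Salary, k = k)$pred
  sqrt(mean((test$Salary - pred)^2))  # RMSE on this data set's own scale
}

c(Original = knn_rmse(salary),
  MinMax   = knn_rmse(salary_minmax),
  Zscore   = knn_rmse(salary_z),
  Robust   = knn_rmse(salary_robust))
```
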
RMSE for different methods

We calculate the root mean square error (RMSE) for each version of the data. The lower the RMSE, the better the fit of our model. We can see that the scaled data performed much better than the original data, and here min-max normalization worked best with KNN.

Choosing which scaling method to apply can be confusing. There is no universal rule for which method should be used; it often depends on the type of problem we have at hand.

  • If our problem needs the data to be bounded to some fixed range, for example in image processing, where pixel intensities must lie in a fixed range such as [0, 255] or [0, 1], we use min-max normalization.
  • If our problem needs the data to be centred at zero with a standard deviation of one, we use z-score standardisation.
  • If we don't know the distribution of our data, or we know it does not follow a Gaussian distribution, we use min-max normalization.
  • If we know that the distribution of our data follows a Gaussian distribution, we use standardisation.

References

  1. https://sebastianraschka.com/Articles/2014_about_feature_scaling.html#z-score-standardization-or-min-max-scaling
  2. https://en.wikipedia.org/wiki/Feature_scaling
  3. https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
  4. https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html

I look forward to your comments and feedback.

Thank you for reading :)
