Normal/Gaussian Distribution/Bell Curve

Jayesh Rao
Published in Analytics Vidhya
3 min read · Oct 5, 2020

You can check out my analytics projects at https://github.com/jay6445/Data-Analytics-Projects.git

A normal distribution is symmetric around its mean: in a sample of normally distributed data points, roughly equal numbers of points fall on either side of the mean. Working with variables on a common scale makes outliers easier to spot and makes inferential calculations much easier, because it is simpler to compare two variables that follow similar distributions than two that follow different distributions, which is very common in real-world analysis. The process of rescaling a variable in this way is called normalization. Keep in mind that the rescaling methods described below change the range of the data, not its shape, so they do not by themselves make a skewed variable normal. One important characteristic of a normal distribution is that it has no skew, which is what we mean when we say the distribution is symmetric. In this distribution, Mean = Median = Mode. Plotted, a normal distribution forms the familiar bell-shaped curve.
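As a quick check of the Mean = Median property, here is a minimal sketch using only Python's standard library. The parameters (mean 100, standard deviation 15), the sample size and the seed are arbitrary values chosen for illustration; they do not come from this post.

```python
import statistics

# Hypothetical parameters, chosen only for illustration.
dist = statistics.NormalDist(mu=100, sigma=15)

# Draw a reproducible sample from the normal distribution.
sample = dist.samples(10_000, seed=42)

sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# For a symmetric distribution these two should be close to each other.
print(f"mean   = {sample_mean:.2f}")
print(f"median = {sample_median:.2f}")
```

The mode is left out here because it is not well defined for a finite sample of continuous values without binning them first.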

Normalizing a distribution:

1. Min Max Scaling

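Each new value is calculated from the original distribution x using the following formula:

(x1 - min(x)) / (max(x) - min(x))

Here min(x) and max(x) are the smallest and largest values in the distribution, so the result always lies between 0 and 1. The values for x2, x3, x4, …, xn are calculated in the same way. This method is applied to the losses data later in this post.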
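The Min Max Scaling formula could be sketched in plain Python roughly as follows. The function name `min_max_scale` and the input values are made up for illustration.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # The formula divides by (max - min), which would be zero here.
        raise ValueError("all values are identical, so the range is zero")
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical data, not taken from the post.
min_max_scaled = min_max_scale([50, 100, 150, 250])
print(min_max_scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The guard for identical values is an addition for safety; the formula itself does not define what should happen in that case.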

2. Standard Score

The standard score is calculated by subtracting the mean of the distribution from every value and dividing the result by the standard deviation of the distribution.

(x1 - μ) / σ

where μ is the mean and σ is the standard deviation. We will calculate all the other values in the same way.
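A minimal sketch of the standard score in Python is shown here. The post does not say whether σ is the population or the sample standard deviation, so this example assumes the population version (`statistics.pstdev`). The function name and the data are hypothetical.

```python
import statistics


def standard_scores(values):
    """Return (x - mean) / standard deviation for every x in values."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero, scores are undefined")
    return [(v - mu) / sigma for v in values]


# Hypothetical data with mean 5 and population standard deviation 2.
z_scores = standard_scores([2, 4, 4, 4, 5, 5, 7, 9])
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After this transformation the values have mean 0 and standard deviation 1, whatever their original units were.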

3. Divide by Max

One of the simplest ways to normalize a distribution is to divide every value by the maximum value:

x1 / max(x)

This method is applied to the prices data later in this post.
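Divide by Max might look like this in Python. The function name and the sample values are illustrative, and the sketch assumes the data are non-negative with a positive maximum.

```python
def divide_by_max(values):
    """Divide every value by the largest value, so the maximum becomes 1."""
    peak = max(values)
    if peak == 0:
        raise ValueError("maximum is zero, cannot divide by it")
    return [v / peak for v in values]


# Hypothetical data, not taken from the post.
max_scaled = divide_by_max([5000, 12500, 25000, 50000])
print(max_scaled)  # [0.1, 0.25, 0.5, 1.0]
```

Unlike Min Max Scaling, the smallest value does not map to 0 unless it was 0 to begin with.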

WHEN IS NORMALIZATION IMPORTANT?

Look at the x-axis of both distributions below. The first one, prices, ranges from 0 to 50000, and the second, losses, ranges from 50 to 250. Distributions with such substantially different ranges are difficult to analyse together, especially when you are investigating how the two variables relate. When you start asking questions such as whether an increase in price is causing the losses, or whether the losses are causing an increase in price, the comparison is much easier when both variables are on a similar range.

[Figure: distribution of prices]
[Figure: distribution of losses]

We will therefore normalize the prices distribution using Divide by Max, as follows:

[Figure: distribution of prices after Divide by Max]

We will use the Min Max Scaling method for the second distribution of losses.

[Figure: distribution of losses after Min Max Scaling]

Comparing the distributions before and after, the ranges of prices and losses have become similar after normalization, so the two variables are now easier to compare and to draw inferences from.
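To make the before and after comparison concrete, here is a small sketch that applies the same two methods to made-up numbers. The `prices` and `losses` lists are hypothetical values picked to span the ranges mentioned earlier; they are not the dataset behind the original plots.

```python
# Hypothetical values spanning the ranges mentioned above; not the post's data.
prices = [5000, 12000, 18500, 30000, 50000]
losses = [50, 90, 120, 180, 250]

# Divide by Max for prices.
max_price = max(prices)
prices_scaled = [p / max_price for p in prices]

# Min Max Scaling for losses.
loss_min, loss_max = min(losses), max(losses)
losses_scaled = [(x - loss_min) / (loss_max - loss_min) for x in losses]

print("prices before:", min(prices), "to", max(prices))
print("losses before:", min(losses), "to", max(losses))
print("prices after: ", min(prices_scaled), "to", max(prices_scaled))
print("losses after: ", min(losses_scaled), "to", max(losses_scaled))
```

After scaling, both variables fall within the interval from 0 to 1, which is what makes them directly comparable.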

Note: There is one more important idea, where we draw many samples from a variable and build a new distribution from the means of those samples. As the size of each sample grows, the distribution of these means approaches a normal distribution, regardless of the shape of the original variable (provided it has a finite variance). This is called the Central Limit Theorem. We will look at the CLT in my next post.
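As a small preview of the idea in the note above, the following sketch simulates it with the standard library. The choice of an exponential distribution as the skewed starting variable, the sample sizes, the seed and the helper names are all assumptions made for illustration.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed standard scores."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so the run is reproducible

# A strongly right-skewed variable (exponential with mean 1); a hypothetical choice.
raw_values = [rng.expovariate(1.0) for _ in range(5000)]

# Build a new distribution from the means of many samples of size 50.
sample_means = [
    statistics.fmean([rng.expovariate(1.0) for _ in range(50)])
    for _ in range(2000)
]

print(f"skew of raw values:   {skewness(raw_values):.2f}")
print(f"skew of sample means: {skewness(sample_means):.2f}")
```

The sample means should come out far less skewed than the raw values, and the skew shrinks further as the size of each sample increases.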

Get in touch here https://www.linkedin.com/in/jayeshrao


Hi good to see y'all, I am an aspiring data analyst and will be posting stuff about Statistics, Python and R and also some interesting projects I do. B-)