Normal Distribution and Machine Learning

Abhishek Barai
Nov 19, 2020 · 6 min read

Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A Data Scientist needs to know about Normal Distribution when working with linear models (which perform well when the data is normally distributed), the Central Limit Theorem, and exploratory data analysis.

Named after Carl Friedrich Gauss, the Normal Distribution (or Gaussian Distribution) is a continuous probability distribution. It has a bell-shaped curve that is symmetric about the mean.

A continuous random variable “x” is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < \infty$$

Such a variable is also called a normal variate.
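As a rough check of the normal density, the sketch below writes out the usual closed form, f(x) = exp(-(x - μ)² / (2σ²)) / (σ√(2π)), and compares it with `statistics.NormalDist` from the Python standard library. The function name `normal_pdf` and the values μ = 10, σ = 2 are illustrative assumptions, not part of the original analysis.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


# Illustrative parameters (assumed for this example only).
mu, sigma = 10.0, 2.0
reference = NormalDist(mu, sigma)

# The hand-written formula and the library value should agree at every point.
for x in (6.0, 10.0, 13.0):
    print(x, normal_pdf(x, mu, sigma), reference.pdf(x))
```

The density is highest at x = μ and takes equal values at equal distances on either side of it.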

If “x” is a normal variable with mean μ and standard deviation σ, then

$$z = \frac{x - \mu}{\sigma}$$

where z is the standard normal variate.
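A small sketch of standardization, assuming the usual definition z = (x - μ) / σ. The exam-score numbers are hypothetical and only serve to show that a probability computed on the original scale equals the one computed on the standard normal scale.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma


# Hypothetical example: scores assumed to follow a normal distribution
# with mean 70 and standard deviation 10.
mu, sigma = 70.0, 10.0
x = 85.0

z = z_score(x, mu, sigma)
p_direct = NormalDist(mu, sigma).cdf(x)   # P(X <= 85) on the original scale
p_standard = NormalDist().cdf(z)          # P(Z <= z) on the standard scale

print(z, p_direct, p_standard)
```

Because the two probabilities coincide, one table (or one function) for the standard normal distribution is enough to answer questions about any normal variable.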

The simplest case of the normal distribution, known as the Standard Normal Distribution, has mean μ = 0 and standard deviation σ = 1, and is described by this probability density function,

$$\varphi(z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}}$$

The normal distribution has the following properties:

  1. The total area under the normal curve is equal to 1.
  2. It is a continuous distribution.
  3. It is symmetrical about the mean. Each half of the distribution is a mirror image of the other half.
  4. It is asymptotic to the horizontal axis.
  5. It is unimodal.
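Some of these properties can be checked numerically. The sketch below is one possible way to do so for the standard normal distribution, using only the Python standard library; the helper `trapezoid_area`, the integration range of ±10 and the grid spacing are assumptions made for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def trapezoid_area(pdf, lo, hi, steps=20_000):
    """Approximate the integral of pdf over [lo, hi] with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (pdf(lo) + pdf(hi))
    for i in range(1, steps):
        total += pdf(lo + i * width)
    return total * width


# Property 1: the total area is 1 (the tails beyond +/-10 are negligible).
area = trapezoid_area(std_normal.pdf, -10.0, 10.0)

# Property 3: symmetry about the mean, f(mu - d) equals f(mu + d).
symmetric = all(
    abs(std_normal.pdf(-d) - std_normal.pdf(d)) < 1e-15 for d in (0.5, 1.0, 2.5)
)

# Property 5: unimodal, with the single peak located at the mean.
grid = [i / 100.0 for i in range(-500, 501)]
peak = max(grid, key=std_normal.pdf)

print(area, symmetric, peak)
```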

The normal distribution is completely specified by two parameters: the mean and the standard deviation. If both are known, the density at every point on the curve can be computed.

The empirical rule is a handy quick estimate of the data's spread given the mean and standard deviation of a data set that follows a normal distribution. It states that:

  • 68.27% of the data will fall within 1 sd of the mean (μ±1σ)
  • 95.45% of the data will fall within 2 sd of the mean (μ±2σ)
  • 99.73% of the data will fall within 3 sd of the mean (μ±3σ)
  • 95% of the data will fall within 1.96 sd of the mean (μ±1.96σ)
  • 99% of the data will fall within 2.576 sd of the mean (μ±2.576σ)

Thus, almost all the data lies within 3 standard deviations of the mean. This rule lets us flag potential outliers and is very helpful when assessing the normality of a distribution.
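The coverage figures of the empirical rule can be recomputed from the standard normal CDF, and the 3 sd cut-off can be turned into a simple outlier check. The sketch below is one way to do both; the function names, the synthetic data and the two injected extreme values are assumptions for illustration only.

```python
import random
from statistics import NormalDist, fmean, pstdev

std_normal = NormalDist()


def coverage(k):
    """Probability that a normal variable lies within k sd of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3, 1.96):
    print(f"within {k} sd: {coverage(k):.4%}")

# Multiplier that gives 99% two-sided coverage (0.5% left in each tail).
k_99 = std_normal.inv_cdf(0.995)


def three_sigma_outliers(values):
    """Return the values lying more than 3 sd away from the sample mean."""
    mean, sd = fmean(values), pstdev(values)
    return [v for v in values if abs(v - mean) > 3 * sd]


# Synthetic data (assumed): 1,000 normal draws plus two injected extremes.
rng = random.Random(0)
data = [rng.gauss(50.0, 5.0) for _ in range(1000)] + [95.0, 5.0]
flagged = three_sigma_outliers(data)
print(k_99, flagged)
```

Note that extreme values inflate the sample standard deviation themselves, so in practice a robust spread estimate may be preferable to the plain 3 sd rule.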

Application in Machine Learning:

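In Machine Learning, normally distributed data is beneficial for model building because it makes the math easier. Models such as LDA and Gaussian Naive Bayes are derived explicitly from the assumption that the features follow a univariate or multivariate normal distribution within each class, and the classical inference for Linear Regression assumes normally distributed errors. Logistic Regression does not require normally distributed features, but the sigmoid function arises naturally from them: when each class is Gaussian with a shared covariance, the class posterior probability takes exactly the sigmoid form.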

Many natural phenomena approximately follow a log-normal distribution, such as financial data and forecasting data. By applying a transformation such as the logarithm, we can convert such data into an approximately normal distribution. Many processes also follow normality directly, such as measurement errors in an experiment or the position of a particle that experiences diffusion.
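A minimal sketch of such a transformation, assuming the data is log-normal so that the natural logarithm is the appropriate choice. The sample is synthetic, and `skewness` is a helper written for this example; skewness close to 0 indicates a symmetric distribution.

```python
import math
import random
from statistics import fmean, pstdev


def skewness(values):
    """Sample skewness: about 0 for symmetric data, > 0 for a long right tail."""
    mean, sd = fmean(values), pstdev(values)
    return fmean([((v - mean) / sd) ** 3 for v in values])


# Synthetic log-normal sample (assumed parameters, for illustration only).
rng = random.Random(42)
raw = [rng.lognormvariate(0.0, 0.8) for _ in range(5000)]

# The logarithm of a log-normal variable is normal by definition.
transformed = [math.log(v) for v in raw]

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(transformed))
```

For real data the right transformation is not known in advance, so the result should still be checked with plots such as a histogram or a Q-Q plot.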

So it is better to explore the data critically and check the underlying distribution of each variable before fitting a model.

Note: Normality is an assumption of some ML models, not a requirement of the data. Many ML models work very well on non-normally distributed data. Models such as decision trees and XGBoost do not assume any normality and work on raw data as well. Also, the statistical guarantees of linear regression only require the model errors to be Gaussian, not the entire dataset.

Here I have analyzed the Boston Housing Price Dataset. I explain the visualization and transformation techniques, along with plots that can validate the normality of a distribution.

The dataset contains 13 numerical features and 1 categorical feature (chas).

Histograms: A histogram is a kind of bar graph that estimates the probability distribution of a continuous variable. It divides the range of the numerical data into uniform bins, which are consecutive, non-overlapping intervals, and counts the observations that fall into each bin.

Figure: histograms of all numerical features
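To make the binning step concrete, here is a small standard-library sketch that counts values in uniform, non-overlapping bins and prints a text histogram. The `histogram` helper and the synthetic sample are assumptions for illustration; they are not the plotting code or the data behind the figure.

```python
import random


def histogram(values, n_bins):
    """Count values in n_bins uniform, consecutive, non-overlapping bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


# Synthetic stand-in for a roughly bell-shaped feature (assumed values).
rng = random.Random(1)
sample = [rng.gauss(6.3, 0.7) for _ in range(500)]

counts, edges = histogram(sample, n_bins=10)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"[{left:5.2f}, {right:5.2f}) {'#' * (count // 5)}")
```

The number of bins is a free choice: too few bins hide the shape of the distribution, too many make it noisy.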

kdeplot: A Kernel Density Estimation (KDE) plot depicts a smooth, non-parametric estimate of the probability density function of a continuous variable. It can be drawn for a single variable or for several variables together.

Figure: kdeplots of all numerical features
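Behind a KDE plot, the density estimate is an average of small kernels centred on the observations. The sketch below shows the idea with a Gaussian kernel. The helper `make_kde`, the synthetic sample and the normal-reference bandwidth rule are assumptions for this example; plotting libraries may use different defaults.

```python
import random
from statistics import NormalDist, pstdev


def make_kde(sample, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    kernel = NormalDist()  # standard normal kernel
    n = len(sample)

    def density(x):
        return sum(kernel.pdf((x - xi) / bandwidth) for xi in sample) / (n * bandwidth)

    return density


# Synthetic data (assumed for illustration).
rng = random.Random(7)
sample = [rng.gauss(0.0, 1.0) for _ in range(400)]

# Normal-reference rule of thumb for the bandwidth (one common choice among several).
bandwidth = 1.06 * pstdev(sample) * len(sample) ** (-1 / 5)

kde = make_kde(sample, bandwidth)
for x in (-2.0, 0.0, 2.0):
    print(x, kde(x), NormalDist().pdf(x))
```

A smaller bandwidth follows the data more closely but gives a bumpier curve; a larger one gives a smoother but flatter curve.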

Let’s take the feature rm (average number of rooms per dwelling), which closely resembles a normal distribution, as an example.

Though it has some distortion in the right tail, we need to check how closely it resembles a normal distribution. For that, we can use a Q-Q plot.

When the quantiles of two distributions are plotted against each other, the resulting plot is known as a quantile-quantile plot, or qqplot. It summarizes whether the two distributions are similar: if they are, the points lie close to a straight line. To check normality, the quantiles of the sample are plotted against those of a normal distribution.

Note: the “rm” feature is standardized before plotting the qqplot.
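The points of a normal Q-Q plot can be computed without any plotting library. The sketch below pairs the standardized, sorted sample with the matching standard normal quantiles and uses their correlation as a rough measure of straightness. The helpers, the plotting positions (i + 0.5) / n and the two synthetic samples are assumptions for illustration, not the code behind the figure.

```python
import random
from statistics import NormalDist, fmean, pstdev


def qq_points(sample):
    """Pair theoretical standard-normal quantiles with standardized sample quantiles."""
    mean, sd = fmean(sample), pstdev(sample)
    ordered = sorted((v - mean) / sd for v in sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def correlation(pairs):
    """Pearson correlation of the paired quantiles; near 1 means near a straight line."""
    xs, ys = zip(*pairs)
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Synthetic samples (assumed): one normal, one right-skewed.
rng = random.Random(3)
normal_sample = [rng.gauss(6.3, 0.7) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

r_normal = correlation(qq_points(normal_sample))
r_skewed = correlation(qq_points(skewed_sample))
print(r_normal, r_skewed)
```

The correlation is only a summary; the plot itself shows where the departure from normality occurs, for example in one of the tails.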

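Here we can see that the feature is not exactly normally distributed, but it comes reasonably close. Since the deviation is mild, standardizing this feature (StandardScaler) before feeding it to a model can be expected to give a good result. Note that standardizing only rescales the feature; it does not change the shape of its distribution.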

Central Limit Theorem and Normal Distribution:

The CLT states that when we add up a large number of independent random variables, then irrespective of their original distribution (provided it has finite variance), their normalized sum tends towards a Gaussian distribution.
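A short simulation can illustrate the theorem. The sketch below sums 30 independent Uniform(0, 1) draws, normalizes the sum, and compares how often the result falls within 1, 2 and 3 standard deviations with the values expected for a normal distribution. The number of terms, the number of repetitions and the seed are arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean, pstdev


def normalized_sum(rng, n_terms):
    """Sum n_terms Uniform(0, 1) draws, then center and scale to mean 0, sd 1."""
    total = sum(rng.random() for _ in range(n_terms))
    mean = n_terms * 0.5              # each term has mean 1/2
    sd = (n_terms / 12.0) ** 0.5      # each term has variance 1/12
    return (total - mean) / sd


def share_within(values, k):
    """Fraction of values whose absolute value is at most k."""
    return sum(abs(v) <= k for v in values) / len(values)


rng = random.Random(2020)
sums = [normalized_sum(rng, n_terms=30) for _ in range(5000)]

std_normal = NormalDist()
for k in (1, 2, 3):
    expected = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, share_within(sums, k), expected)
```

Although each individual term is uniform, not bell-shaped, the observed shares come out close to the normal ones.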

Machine Learning models generally treat training data as a mix of deterministic and random parts. Let the dependent variable (Y) consist of these two parts. A model tries to express Y as some function of several independent variables (X). If that function is a sum (or can be expressed as a sum of other functions) and the number of X variables is large, then by the CLT Y should be approximately normally distributed.

Here ML models try to express the deterministic part as a sum of functions of the deterministic independent variables (X):

deterministic + random = func(deterministic(1)) +…+ func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then model_error captures only the random part and should have a normal distribution.

So if the error distribution is normal, we may take it as a sign that the model is successful. Otherwise, either some features that have a large influence on Y are absent from the model, or the model itself is incorrect.
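The effect of a missing feature on the residuals can be sketched with synthetic data. Below, the same straight-line model is fitted twice: once where x explains the whole deterministic part, and once where a skewed hidden feature also drives Y. All names, coefficients and distributions are assumptions for illustration; skewness is used here as a simple stand-in for a fuller normality check.

```python
import random
from statistics import fmean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(xs, ys):
    """Differences between the observed values and the fitted line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def skewness(values):
    """Sample skewness: about 0 for symmetric (for example normal) data."""
    mean, sd = fmean(values), pstdev(values)
    return fmean([((v - mean) / sd) ** 3 for v in values])


rng = random.Random(11)
xs = [rng.uniform(0.0, 10.0) for _ in range(2000)]
noise = [rng.gauss(0.0, 1.0) for _ in xs]
hidden = [rng.expovariate(1.0) for _ in xs]  # influential feature the model never sees

y_complete = [1.0 + 2.0 * x + e for x, e in zip(xs, noise)]
y_missing = [1.0 + 2.0 * x + 3.0 * h + e for x, h, e in zip(xs, hidden, noise)]

skew_complete = skewness(residuals(xs, y_complete))
skew_missing = skewness(residuals(xs, y_missing))
print(skew_complete, skew_missing)
```

In the first case the residuals are roughly symmetric; in the second, the unexplained hidden feature leaks into the residuals and makes them clearly right-skewed.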

Written by Abhishek Barai (Data Scientist | NLP Engineer | Quantitative Researcher | Blogger) for Analytics Vidhya, a community of Analytics and Data Science professionals. https://www.analyticsvidhya.com
