Ever Wondered Why Normal Distribution Is So Important?

Explaining why the Gaussian distribution is such a successful and widely used probability distribution

Farhad Malik
Jun 20, 2019 · 9 min read

What is so special about the normal probability distribution? Why do so many data science and machine learning articles revolve around it?

The world of machine learning and data science revolves around the concept of probability distributions, and the normal distribution sits at the core of that concept. This article illustrates what the normal distribution is and why it is so widely used, in particular by data scientists and machine learning experts.

I have decided to write an article that attempts to explain the concept of the normal probability distribution in an easy-to-understand manner.

I will explain everything from the very basics so that readers understand the importance of the normal distribution.

Article Structure

This article will explain:

  1. What probability distribution is?

A Little Background First

  1. Firstly, the most important point to note is that the normal distribution is also known as the Gaussian distribution.

Normal distribution is also known as Gaussian distribution.

  1. Lastly, an important point to note is that simple predictive models are usually the most widely used models, because they can be explained and are well understood. The normal distribution is simple, and that simplicity makes it extremely popular.

Hence it is worth understanding what the normal probability distribution is.

But First, What Does Probability Distribution Mean?

Let me explain by building the appropriate building blocks first.

  • If we want to predict a variable accurately then the first task we need to perform is to understand the underlying behaviour of our target variable.

The higher the probability, the more likely it is for the event to occur.

  • For instance, we can repeat an experiment a large number of times and note the values we observe for the variable.
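As a rough sketch of that idea, the snippet below repeats a simple experiment many times and turns the observed counts into an estimated probability distribution. The experiment (the sum of two fair dice) and the names `roll_two_dice` and `empirical_distribution` are assumptions chosen only for illustration.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is repeatable


def roll_two_dice():
    """One run of the experiment: the sum of two fair six-sided dice."""
    return random.randint(1, 6) + random.randint(1, 6)


n_trials = 100_000
counts = Counter(roll_two_dice() for _ in range(n_trials))

# Relative frequency of each outcome = estimated probability of that outcome.
empirical_distribution = {
    outcome: counts[outcome] / n_trials for outcome in sorted(counts)
}

for outcome, probability in empirical_distribution.items():
    print(f"{outcome:2d}: {probability:.3f}")
```

The more often the experiment is repeated, the closer these relative frequencies should get to the true probabilities of each outcome.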

There are a large number of probability distributions and the most widely used probability distribution is known as “normal distribution”. Let’s understand normal distribution now.

Let’s Now Move Onto Normal Probability Distribution

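If we plot the probability distribution and it forms a symmetric, bell-shaped curve in which the mean, mode, and median of the sample are equal, then the variable is likely to be normally distributed. Other distributions can share these traits, so this is a first check rather than a proof.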

This is an example of a normal distribution bell-shaped curve:

[Figure: bell-shaped curve of a normal distribution]

It is important to understand and estimate the probability distribution of your target variable.

The following variables are close to normally distributed variables:

  1. Height of a population

Additionally, there are a large number of variables around us that can be treated as normal with x% confidence, where x < 100.


What Is Normal Distribution?

A normal distribution is a distribution that depends solely on two parameters: the mean and the standard deviation.

  • Mean — This is the average value of all the points in the sample that is computed by summing the values and then dividing by the total number of the values in a sample.

This characteristic of the distribution makes it extremely simple for statisticians to work with, and hence any variable that exhibits a normal distribution can often be forecast with higher accuracy. Essentially, it can help in simplifying the model.
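To make the two parameters concrete, here is a minimal sketch that computes the mean and the sample standard deviation by hand and then checks them against Python's standard library. The `heights_cm` values are made up for the example.

```python
import statistics

# Hypothetical sample of heights in centimetres (made-up values).
heights_cm = [162.0, 175.5, 168.2, 181.0, 170.3, 165.8, 177.1, 172.4]

# Mean: sum of the values divided by how many values there are.
mean_manual = sum(heights_cm) / len(heights_cm)

# Sample standard deviation: square root of the summed squared distances
# from the mean, divided by n - 1 because this is a sample.
squared_distances = [(x - mean_manual) ** 2 for x in heights_cm]
std_manual = (sum(squared_distances) / (len(heights_cm) - 1)) ** 0.5

# The standard library should give the same numbers.
mean_lib = statistics.mean(heights_cm)
std_lib = statistics.stdev(heights_cm)

print(f"mean: {mean_manual:.2f} (library: {mean_lib:.2f})")
print(f"std:  {std_manual:.2f} (library: {std_lib:.2f})")
```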

Now, what is phenomenal to note is that once you find the probability distributions of variables in nature, many of them approximately follow a normal distribution.

The normal distribution is simple to explain. The reasons are:

  1. The mean, mode, and median of the distribution are equal.

Normal Distribution Is Simply … The Normal Behaviour That We Are Just So Familiar With

But how are so many variables approximately normally distributed? What is the logic behind it?

For the sake of simplicity, let's consider that there is a random variable, such as the blood pressure of a human population, that has a mean m and standard deviation s.

Traditionally, we would gather samples to represent the random variable. Each sample has its own mean. If we keep repeating the experiment, gathering more samples and calculating the mean of each one, then the sample means will have their own probability distribution, and this distribution converges towards the normal distribution as the size of each sample increases.

Let's consider that the height of a population is a random variable. We can take a sample of heights, plot its distribution and calculate the sample mean. When we repeat this experiment many times with reasonably large samples, the distribution of the sample means ends up being very close to normal, even if the heights themselves are not.

This is known as the Central Limit Theorem.
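The theorem can be checked with a small simulation. The sketch below is only an illustration under assumed settings: it draws samples from a deliberately skewed exponential population (my choice, not something from the article) and measures how the skewness of the sample means shrinks towards 0, the value for a symmetric bell curve, as the sample size grows.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(sample_size):
    """Mean of one sample drawn from a skewed (exponential) population."""
    # expovariate(1.0) has population mean 1 and standard deviation 1.
    return statistics.fmean(
        [random.expovariate(1.0) for _ in range(sample_size)]
    )


def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


n_samples = 5_000
results = {}
for sample_size in (1, 5, 30, 200):
    means = [sample_mean(sample_size) for _ in range(n_samples)]
    results[sample_size] = {
        "mean": statistics.fmean(means),
        "std": statistics.pstdev(means),
        "skew": skewness(means),
    }
    print(sample_size, results[sample_size])
```

The spread of the sample means should also shrink roughly like s divided by the square root of the sample size, which is the other half of what the theorem describes.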

This brings us to the core of the article:

If we plot the normal distribution density function, its curve has the following characteristics:

[Figure: bell-shaped curve of a normal distribution]

The bell-shaped curve above has a mean of 100 and a standard deviation of 1.

  • The mean is the center of the curve. This is the highest point of the curve, as most of the points lie close to the mean.
[Figure: grey bell-shaped curve of the normal distribution]
  • Approximately 68.2% of all of the points are within one standard deviation of the mean, on either side.

This allows us to easily estimate how volatile a variable is and, given a confidence level, what its likely value is going to be.

For instance, in the grey bell-shaped curve above, there is approximately a 68.2% chance that the value of the variable will be between 99 and 101.
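That figure can be reproduced with Python's standard library. This is a small sketch using `statistics.NormalDist` (available from Python 3.8); the helper name `probability_within` is my own. The rounded 68.2% quoted above corresponds to roughly 0.6827.

```python
from statistics import NormalDist

# The curve from the text: mean 100, standard deviation 1.
curve = NormalDist(mu=100, sigma=1)


def probability_within(dist, k):
    """Probability that a value lies within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


for k in (1, 2, 3):
    share = probability_within(curve, k)
    print(f"within {k} standard deviation(s) of the mean: {share:.4f}")
```

With k = 1 the range is 99 to 101, which is the interval used in the example above.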

Imagine the confidence data scientists have when making future decisions once they understand the probability distribution of a target variable.

Normal Probability Distribution Function

The probability density function of the normal distribution is:

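$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Here $\mu$ is the mean and $\sigma$ is the standard deviation of the distribution.

The probability density function describes how likely a continuous random variable is to take a value near a given point. The density itself is not a probability: the probability of any single exact value is zero, and probabilities come from the area under the curve.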

Normal distribution is a bell-shaped curve where mean=mode=median.

  • If you plot the probability distribution curve using its computed probability density function then the area under the curve for a given range gives the probability of the target variable being in that range.

The larger the standard deviation, the greater the volatility in the sample.
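The link between the density and the area under the curve can be sketched numerically. The code below writes out the standard normal density formula, approximates the area over a range with the trapezoid rule, and compares it with the cumulative distribution from the standard library. The function names and the choice of 10,000 steps are assumptions for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mean, std):
    """Density of the normal distribution at x."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return coefficient * math.exp(exponent)


def area_under_pdf(low, high, mean, std, steps=10_000):
    """Trapezoid-rule approximation of the area under the density on [low, high]."""
    width = (high - low) / steps
    total = 0.5 * (normal_pdf(low, mean, std) + normal_pdf(high, mean, std))
    for i in range(1, steps):
        total += normal_pdf(low + i * width, mean, std)
    return total * width


mean, std = 100.0, 1.0
area = area_under_pdf(99.0, 101.0, mean, std)
exact = NormalDist(mean, std).cdf(101.0) - NormalDist(mean, std).cdf(99.0)
print(f"area by integration: {area:.6f}, from the cdf: {exact:.6f}")

# A density is not a probability: with a small standard deviation the peak
# of the curve rises above 1, while the total area under it is still 1.
peak_narrow = normal_pdf(0.0, 0.0, 0.1)
print(f"peak density when the standard deviation is 0.1: {peak_narrow:.3f}")
```

A smaller standard deviation gives a taller, narrower curve and a larger one gives a flatter, wider curve, which is the volatility point made above.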

How Do I Find Feature Distribution In Python?

The simplest method I follow is to load all of the features into a pandas DataFrame and then plot a histogram of each one:

df.hist(bins=10)  # df is your DataFrame; draws one histogram per numeric column

The histograms give us an approximate picture of the probability distribution of each variable.
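Here is a slightly fuller sketch of that step. The DataFrame and its column names (`height_cm`, `income`) are invented so the example runs on its own, and it assumes pandas, NumPy and matplotlib are installed. The `Agg` backend line is only there so the code also runs without a display; in a notebook you would normally drop it.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; not needed in a notebook

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical features, generated only so the example is self-contained.
df = pd.DataFrame(
    {
        "height_cm": rng.normal(loc=170, scale=8, size=1_000),    # roughly bell-shaped
        "income": rng.lognormal(mean=10, sigma=0.5, size=1_000),  # right-skewed
    }
)

# One histogram per numeric column; pandas returns an array of matplotlib Axes.
axes = df.hist(bins=10)

# Skewness near 0 is consistent with a symmetric shape such as the bell curve.
print(df.skew())
```

A histogram that looks lopsided, like the made-up `income` column here, is a hint that the variable should not be treated as normal without further work.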

What Does It Mean For A Variable To Have Normal Distribution?

A sum or linear combination of independent, normally distributed random variables is itself normally distributed. For instance, if A and B are two independent variables with normal distributions then:

  • A + B is normally distributed

As a result, it is extremely simple to forecast a variable and find the probability of it falling within a range of values, because of the well-known probability distribution function.
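A quick simulation can illustrate the A + B case. The parameters below (A with mean 10 and standard deviation 3, B with mean -4 and standard deviation 4) are assumed values picked for the example. For independent normal variables the means add and the variances add, so the sum should have mean 6 and standard deviation 5.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable
n = 100_000

# Assumed example parameters: A ~ N(10, 3^2) and B ~ N(-4, 4^2), independent.
a = [random.gauss(10, 3) for _ in range(n)]
b = [random.gauss(-4, 4) for _ in range(n)]
total = [x + y for x, y in zip(a, b)]

# Theory for independent normals: means add, variances add.
expected_mean = 10 + (-4)                # 6
expected_std = (3 ** 2 + 4 ** 2) ** 0.5  # 5.0

observed_mean = statistics.fmean(total)
observed_std = statistics.pstdev(total)

# Share of the sums within one standard deviation of the mean; for a
# normal distribution this should be close to 68%.
share_within_one_std = sum(
    abs(t - expected_mean) <= expected_std for t in total
) / n

print(f"mean: {observed_mean:.3f} (expected {expected_mean})")
print(f"std:  {observed_std:.3f} (expected {expected_std})")
print(f"share within one standard deviation: {share_within_one_std:.3f}")
```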

What If The Sample Distribution Is Not Normal?

This section will briefly highlight a few techniques we can utilise.

1. Linear Transformation

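The linear transformation focuses on computing the z-score (also known as the standard score) of a sample. It rescales the sample to a mean of 0 and a standard deviation of 1 but does not change the shape of its distribution.

Once we gather a sample for a variable, we can compute the z-score of each value by linearly transforming the sample using the formula below:

  1. Calculate the mean

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is a value in the sample, $\mu$ is the sample mean, and $\sigma$ is the sample standard deviation.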
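The z-score calculation can be sketched in a few lines: subtract the mean from every value and divide by the standard deviation. The `sample` values are made up, and the population form of the standard deviation is used here as an assumption; `statistics.stdev` would give the n - 1 form instead.

```python
import statistics

# Hypothetical sample (made-up values).
sample = [12.0, 15.5, 9.8, 14.1, 11.3, 13.7, 10.6]

mean = statistics.fmean(sample)
std = statistics.pstdev(sample)  # population form; stdev() gives the n - 1 form

# z-score: how many standard deviations each value sits from the mean.
z_scores = [(x - mean) / std for x in sample]

for x, z in zip(sample, z_scores):
    print(f"{x:5.1f} -> z = {z:+.3f}")
```

After the transformation the values have mean 0 and standard deviation 1, but a skewed sample stays skewed.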

We can also attempt to transform a distribution to a normal distribution. These techniques require assessing the data and their behaviour carefully up-front.

2. Using Box-Cox Transformation

You can use the SciPy package to apply a Box-Cox transformation, which tries to make the data more normally distributed. The input data must be strictly positive:

scipy.stats.boxcox(x, lmbda=None, alpha=None)
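A minimal usage sketch, assuming NumPy and SciPy are installed. The input is synthetic, right-skewed and strictly positive data generated only for this example. When `lmbda` is left as `None`, `boxcox` estimates the parameter itself and returns it alongside the transformed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical right-skewed, strictly positive data (Box-Cox requires x > 0).
x = rng.lognormal(mean=0.0, sigma=0.8, size=2_000)

# With lmbda=None, boxcox estimates lambda by maximum likelihood and
# returns the transformed data together with that lambda.
x_transformed, fitted_lambda = stats.boxcox(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_transformed)

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

A skewness much closer to 0 after the transformation suggests the data now looks more symmetric, although it is still worth plotting the result before relying on it.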

3. Using Yeo-Johnson Transformation

Additionally, the Yeo-Johnson power transformer can be used. Python's scikit-learn provides the appropriate class:

sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)

Note, it is recommended to understand when to use each of the power transformers. An explanation of power transformers such as Box-Cox and Yeo-Johnson and their use cases is beyond the scope of this article. Both of these transformers have their own use cases and both work in a different manner.
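For completeness, here is a hedged usage sketch of the scikit-learn transformer, assuming NumPy and scikit-learn are installed. The `feature` array is synthetic and deliberately contains negative values, which Yeo-Johnson accepts; the `skewness` helper is my own.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=0)

# Hypothetical skewed feature that includes negative values.
# scikit-learn expects a 2-D array of shape (n_samples, n_features).
feature = rng.exponential(scale=2.0, size=(2_000, 1)) - 1.0


def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric distribution."""
    flat = values.ravel()
    return float(np.mean(((flat - flat.mean()) / flat.std()) ** 3))


transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = transformer.fit_transform(feature)

# One fitted lambda per feature column.
fitted_lambda = transformer.lambdas_[0]

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skewness(feature):.3f}, after: {skewness(transformed):.3f}")
```

Because `standardize=True`, the output is also rescaled to zero mean and unit variance.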

Problems With Normality

As the normal distribution is simple and well understood, it is also overused in predictive projects. Assuming normality has its own flaws. For instance, we cannot assume that a stock price follows a normal distribution, as the price cannot be negative. The stock price is therefore often modelled with a log-normal distribution, which ensures it is never below zero.

Daily returns, on the other hand, can be negative, therefore the returns can at times follow a normal distribution.
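The difference between prices and returns can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the starting price, the number of days, and the idea that daily log-returns are independent draws from a normal distribution. It is not a claim about how real markets behave.

```python
import math
import random

random.seed(7)  # fixed seed so the run is repeatable

start_price = 100.0
n_days = 250

# Assumption for illustration: daily log-returns are normally distributed,
# so individual returns can be negative.
daily_log_returns = [random.gauss(0.0005, 0.02) for _ in range(n_days)]

prices = [start_price]
for log_return in daily_log_returns:
    # exp() of any real number is positive, so the price stays above zero.
    prices.append(prices[-1] * math.exp(log_return))

negative_days = sum(r < 0 for r in daily_log_returns)
print(f"days with a negative return: {negative_days} of {n_days}")
print(f"lowest simulated price: {min(prices):.2f}")
```

The returns dip below zero on many days, while the price built from them never does, which is the property a log-normal model provides.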

It is not wise to assume that the variable follows a normal distribution without any analysis.

A variable can follow a Poisson, Student-t, or Binomial distribution, for instance, and falsely assuming that it follows a normal distribution can lead to inaccurate results.
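One possible form of that analysis is a formal normality test. The sketch below uses the Shapiro-Wilk test from SciPy, which is my choice of technique rather than something the article prescribes, on two synthetic samples invented for the example. A very small p-value is evidence against normality; a large one does not prove the data is normal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Hypothetical samples, generated only for illustration.
samples = {
    "heights": rng.normal(loc=170, scale=8, size=500),  # drawn from a normal
    "poisson_counts": rng.poisson(lam=2.0, size=500),   # clearly not normal
}

p_values = {}
for name, values in samples.items():
    statistic, p_value = stats.shapiro(values)
    p_values[name] = p_value
    print(f"{name}: W = {statistic:.3f}, p-value = {p_value:.4g}")
```

It is usually worth pairing a test like this with a histogram, since with large samples even tiny departures from normality can produce small p-values.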

Summary

This article illustrated what normal distribution is and why it is so important, in particular for a data scientist and a machine learning expert.

FinTechExplained

This blog aims to bridge the gap between technologists…

Farhad Malik

Written by

My personal blog, aiming to explain complex mathematical, financial and technological concepts in simple terms. Contact: FarhadMalik84@googlemail.com

FinTechExplained

This blog aims to bridge the gap between technologists, mathematicians and financial experts and helps them understand how fundamental concepts work within each field. Articles
