Ever Wondered Why Normal Distribution Is So Important?
Explaining the reasons why the Gaussian distribution is such a successful and widely used probability distribution
What is so special about the normal probability distribution? Why do so many data science and machine learning articles revolve around it?
The world of machine learning and data science revolves around the concepts of probability distributions and the core of the probability distribution concept is focused on Normal distributions. This article illustrates what normal distribution is and why it is widely used, in particular for a data scientist and a machine learning expert.
I have decided to write an article that explains the concept of normal probability distribution in an easy-to-understand manner.
I will explain everything from the very basics so that readers understand the importance of the normal distribution.
This article will explain:
- What is a probability distribution?
- What does normal distribution mean?
- Which variables exhibit normal distribution?
- How can you check the distribution of your data set in Python?
- How can you make a variable normally distributed in Python?
- Problems with normality
A Little Background First
- Firstly, the most important point to note is that the normal distribution is also known as the Gaussian distribution.
- Secondly, it is named after the mathematician Carl Friedrich Gauss.
- Lastly, an important point to note is that simple predictive models are usually the most used models, because they are well understood and easy to explain. The normal distribution is simple, and its simplicity makes it extremely popular.
Hence it’s worth understanding what normal probability distribution is.
But First, What Does Probability Distribution Mean?
Let me explain by building the appropriate building blocks first.
- If we want to predict a variable accurately then the first task we need to perform is to understand the underlying behaviour of our target variable.
- What we need to do first is to determine the possible outcomes of the target variable and whether the underlying outcomes are discrete (distinct values) or continuous (infinite values). For the sake of simplicity, if we are estimating the behaviour of a die, the first step is to note that it can take any value from 1 to 6 (discrete).
- The next step is to start assigning probabilities to the events (values). If a value cannot occur, it is assigned a probability of 0%; if an event always occurs, it gets a probability of 100%. The probabilities of all possible outcomes always sum to 100%.
The higher the probability, the more likely it is for the event to occur.
- For instance, we can repeat an experiment a large number of times and record the values we obtain for the variable.
- We can then group the values into categories/buckets, and for each bucket record the number of times the variable took a value in that bucket. For example, we can throw a die 10,000 times; as there are 6 possible values, we create 6 buckets and, after every throw, record the occurrence of each value.
- We can then plot a line chart where the x-axis shows the bucket values and the y-axis shows how often each bucket value occurred. The resulting curve is known as the probability distribution curve, and the likelihood of the target variable taking each value is the probability distribution of the variable.
- Once we understand how the values are distributed then we can start estimating the probabilities of the events, even by the means of using formulas (known as probability distribution functions). As a consequence, we can start understanding the behaviour of our target variables better. The probability distribution is dependent on the moments of the sample such as mean, standard deviation, skewness, and/or kurtosis.
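The dice-throwing experiment described above can be sketched in a few lines of Python (a minimal simulation; the exact counts vary with the seed):

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the experiment is repeatable

# Throw a fair six-sided die 10,000 times and bucket the outcomes.
throws = [random.randint(1, 6) for _ in range(10_000)]
buckets = Counter(throws)

# Convert counts to empirical probabilities: they sum to 1 (100%)
# and each is close to the theoretical 1/6 of roughly 16.7%.
probabilities = {face: count / len(throws) for face, count in buckets.items()}
```

Plotting the bucket counts would give the (flat, discrete) probability distribution of a fair die.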
There are a large number of probability distributions and the most widely used probability distribution is known as “normal distribution”. Let’s understand normal distribution now.
Let’s Now Move Onto Normal Probability Distribution
If we plot the probability distribution and it forms a bell-shaped curve and the mean, mode, and median of the sample are equal then the variable has normal distribution.
This is an example of a normal distribution bell-shaped curve:
It is important to understand and estimate the probability distribution of your target variable.
The following variables are approximately normally distributed:
- Height of a population
- Blood pressure of adult humans
- Position of a particle that experiences diffusion
- Measurement errors
- Residuals in regression
- Shoe size of a population
- Amount of time it takes for employees to reach home
- A large number of educational measures
Additionally, many variables around us are approximately normal, i.e. normal at some confidence level below 100%.
What Is Normal Distribution?
A normal distribution is a distribution that depends on just two parameters of the data set: the mean and the standard deviation of the sample.
- Mean — This is the average value of all the points in the sample that is computed by summing the values and then dividing by the total number of the values in a sample.
- Standard Deviation — This indicates how much the data set deviates from the mean of the sample.
This characteristic of the distribution makes it extremely simple for statisticians to work with, and hence any variable that exhibits a normal distribution is feasible to forecast with higher accuracy. Essentially, it can help in simplifying the model.
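As a quick illustration, the two parameters can be computed with Python's standard library (the height-like values below are made up):

```python
import statistics

# A small made-up sample, e.g. measured heights in centimetres.
sample = [162.0, 168.5, 171.2, 175.0, 158.3, 169.9, 173.4]

mean = statistics.mean(sample)      # sum of values / number of values
std_dev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
```

These two numbers alone fully describe a normal distribution fitted to the sample.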
Now, what’s phenomenal to note is that a remarkable number of variables in nature, once you plot their probability distributions, approximately follow a normal distribution.
The normal distribution is simple to explain. The reasons are:
- The mean, mode, and median of the distribution are equal.
- We only need to use the mean and standard deviation to explain the entire distribution.
Normal Distribution Is Simply … The Normal Behaviour That We Are Just So Familiar With
But how are so many variables approximately normally distributed? What is the logic behind it?
For the sake of simplicity, let’s consider a random variable, such as the blood pressure of the human population, that has a mean m and standard deviation s.
Traditionally, we would gather samples to represent the random variable. Each sample has its own mean. If we repeat the experiment, gathering more samples and calculating the mean of each sample, then the sample means have their own probability distribution, and this distribution converges towards the normal distribution as the number of samples increases.
Let's consider that the height of a population is a random variable. We can take a sample of heights, plot its distribution and calculate the sample mean. When we repeat this experiment with more and more samples, the distribution of the sample means ends up very close to normal.
This is known as the Central Limit Theorem.
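A minimal Python sketch of this convergence, using die throws as the clearly non-normal underlying variable (the sample size and number of samples below are arbitrary choices):

```python
import random
import statistics

random.seed(0)

# Underlying variable: a uniform die throw -- clearly NOT normal.
def sample_mean(sample_size):
    return statistics.mean(random.randint(1, 6) for _ in range(sample_size))

# Collect the means of many independent samples of 100 throws each.
means = [sample_mean(100) for _ in range(2_000)]

# Central Limit Theorem: the sample means cluster around the population
# mean (3.5) and their distribution approaches a normal distribution.
grand_mean = statistics.mean(means)
spread = statistics.stdev(means)
```

Plotting a histogram of `means` would show the familiar bell shape even though a single die throw is uniformly distributed.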
This brings us to the core of the article:
If we plot the normal distribution density function, its curve has the following characteristics:
The bell-shaped curve described above has a mean of 100 and a standard deviation of 1.
- Mean is the center of the curve. This is the highest point of the curve as most of the points are at the mean.
- There is an equal number of points on each side of the curve. The center of the curve has the most number of points.
- The total area under the curve is the total probability of all of the values that the variable can take.
- The total curve area is therefore 100%
- Approximately 68.3% of all of the points lie within one standard deviation of the mean.
- About 95.4% lie within two standard deviations of the mean.
- About 99.7% lie within three standard deviations of the mean.
This allows us to easily estimate how volatile a variable is and given a confidence level, what its likely value is going to be.
For instance, in the bell-shaped curve described above (mean 100, standard deviation 1), there is roughly a 68.3% chance that the value of the variable will lie between 99 and 101.
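These ranges can be checked numerically with SciPy, assuming the mean-100, standard-deviation-1 curve discussed above:

```python
from scipy.stats import norm

# Normal distribution with mean 100 and standard deviation 1,
# matching the bell-shaped curve described in the text.
mu, sigma = 100, 1

def within(k):
    # Probability mass within k standard deviations of the mean:
    # the area under the density between mu - k*sigma and mu + k*sigma.
    return norm.cdf(mu + k * sigma, mu, sigma) - norm.cdf(mu - k * sigma, mu, sigma)

p1, p2, p3 = within(1), within(2), within(3)  # ~0.683, ~0.954, ~0.997
```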
Imagine the confidence data scientists have when making decisions about the future once they understand the probability distribution of a target variable.
Normal Probability Distribution Function
The probability density function of the normal distribution is:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation of the distribution.
The probability density function essentially gives the relative likelihood of a continuous random variable taking a given value; actual probabilities are obtained as areas under the curve.
Normal distribution is a bell-shaped curve where mean=mode=median.
- If you plot the probability distribution curve using its computed probability density function then the area under the curve for a given range gives the probability of the target variable being in that range.
- This probability distribution curve is based on a probability distribution function which itself is computed on a number of parameters such as mean, or standard deviation of the variable.
- We could use this probability distribution function to find the relative chance of a random variable taking a value within a range. For instance, we could record the daily returns of a stock, group them into appropriate buckets and then estimate the probability of the stock making a 20–40% gain in the future.
The larger the standard deviation, the more the volatility in the sample.
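As a sketch of that idea: if we assume a stock's annual return is approximately normal with made-up parameters (an 8% mean and a 20% standard deviation, purely illustrative, not real market estimates), the probability of a 20–40% gain is the area under the density between those bounds:

```python
from scipy.stats import norm

# Illustrative assumptions only: mean annual return 8%, std 20%.
mean_return, std_return = 0.08, 0.20

# P(20% <= return <= 40%) = CDF(0.40) - CDF(0.20),
# i.e. the area under the density between the two bounds.
p_gain = norm.cdf(0.40, mean_return, std_return) - norm.cdf(0.20, mean_return, std_return)
```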
How Do I Find Feature Distribution In Python?
The simplest method I follow is to load all of the features in the data frame and then write this script:
Use the Python Pandas library:
DataFrame.hist(bins=10)  # make a histogram of each column of the DataFrame
It shows us the probability distributions of all of the variables.
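Beyond eyeballing the histograms, SciPy offers a quick statistical check. The sketch below uses synthetic data and the normaltest function (the D'Agostino-Pearson test); a small p-value is evidence against normality:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# A toy data frame: one roughly normal feature, one clearly skewed one.
df = pd.DataFrame({
    "normal_feature": rng.normal(loc=0.0, scale=1.0, size=1_000),
    "skewed_feature": rng.exponential(scale=1.0, size=1_000),
})

# D'Agostino-Pearson normality test; complements df.hist(bins=10).
_, p_normal = stats.normaltest(df["normal_feature"])
_, p_skewed = stats.normaltest(df["skewed_feature"])
```

The skewed feature gets a vanishingly small p-value, while the normal one does not.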
What Does It Mean For A Variable To Have Normal Distribution?
Sums of independent normally distributed random variables are themselves normally distributed. For instance, if A and B are two independent variables with normal distributions then:
- A + B is normally distributed
As a result, it is extremely simple to forecast a variable and find the probability of it within a range of values because of the well-known probability distribution function.
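This property is easy to verify empirically (a sketch with arbitrary illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(7)

# Two independent normal variables (illustrative parameters).
a = rng.normal(loc=2.0, scale=3.0, size=100_000)  # mean 2, std 3
b = rng.normal(loc=5.0, scale=4.0, size=100_000)  # mean 5, std 4

# Their sum is normal with mean 2 + 5 = 7
# and std sqrt(3^2 + 4^2) = 5 (variances add for independent variables).
s = a + b
sum_mean, sum_std = s.mean(), s.std()
```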
What If The Sample Distribution Is Not Normal?
This section will briefly highlight a few techniques we can utilise.
1. Linear Transformation
The linear transformation focuses on computing the z-score (also known as the standard score) of a sample.
Once we gather a sample for a variable, we can compute the Z-score by linearly transforming each value using Z = (x − mean) / standard deviation:
- Calculate the mean
- Calculate the standard deviation
- For each value x, compute Z by subtracting the mean and dividing by the standard deviation
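The steps above can be written directly in Python:

```python
import statistics

# Linearly transform a sample to z-scores: Z = (x - mean) / std.
sample = [12.0, 15.5, 9.8, 14.1, 11.6, 13.0]
mean = statistics.mean(sample)
std = statistics.stdev(sample)

z_scores = [(x - mean) / std for x in sample]

# The standardised sample has mean 0 and (sample) standard deviation 1.
```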
We can also attempt to transform a distribution to a normal distribution. These techniques require assessing the data and their behaviour carefully up-front.
2. Using Box-Cox Transformation
You can use the SciPy package in Python to transform data towards the normal distribution:
scipy.stats.boxcox(x, lmbda=None, alpha=None)
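A short sketch of Box-Cox in action on skewed synthetic data (note that Box-Cox requires strictly positive inputs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Box-Cox requires strictly positive data; an exponential sample is
# strictly positive and heavily right-skewed.
skewed = rng.exponential(scale=2.0, size=5_000)

# With lmbda=None, boxcox returns the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

# The transform pulls the heavy right tail in, reducing skewness towards 0.
skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
```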
3. Using Yeo-Johnson Transformation
Additionally, the power transformer yeo-johnson can be used. Python’s sci-kit learn provides the appropriate function:
sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
Note that it is important to understand when to use each power transformer. A full explanation of power transformers such as Box-Cox and Yeo-Johnson and their use cases is beyond the scope of this article; the two work in different ways and have their own use cases (for example, Box-Cox requires strictly positive input data, while Yeo-Johnson also accepts zero and negative values).
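A minimal Yeo-Johnson sketch with synthetic data that includes negative values (parameters are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(5)

# Unlike Box-Cox, Yeo-Johnson also handles zero and negative values;
# shifting an exponential sample down produces some negatives.
skewed = rng.exponential(scale=2.0, size=(5_000, 1)) - 0.5

pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(skewed)

# With standardize=True the output has zero mean and unit variance.
out_mean, out_std = transformed.mean(), transformed.std()
```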
Problems With Normality
As the normal distribution is simple and well understood, it is also overused in predictive projects. Assuming normality has its own flaws. For instance, we cannot assume that a stock price follows a normal distribution, because the price cannot be negative. The stock price therefore potentially follows a log-normal distribution, which ensures it is never below zero.
Daily returns, on the other hand, can be negative, so returns are sometimes reasonably modelled as approximately normal.
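A toy simulation makes the distinction concrete: if daily log-returns are modelled as normal (with made-up parameters below), the compounded price stays strictly positive:

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy model with illustrative numbers: one year (252 trading days) of
# daily log-returns drawn from a normal distribution.
log_returns = rng.normal(loc=0.0005, scale=0.02, size=252)

# Price path: start at 100 and compound the returns. Because the price
# is an exponential of a normal variable (i.e. log-normal), it can
# never fall below zero, unlike a normally distributed price.
prices = 100 * np.exp(np.cumsum(log_returns))
```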
It is not wise to assume that the variable follows a normal distribution without any analysis.
A variable might instead follow a Poisson, Student's t, or binomial distribution, for instance; falsely assuming that a variable follows a normal distribution can lead to inaccurate results.
This article illustrated what normal distribution is and why it is so important, in particular for a data scientist and a machine learning expert.