Ever Wondered Why Normal Distribution Is So Important?
Explaining why the Gaussian distribution is such a successful and widely used probability distribution
What is so special about the normal probability distribution? Why do so many data science and machine learning articles revolve around it?
I decided to write an article that attempts to explain the concept of normal probability distribution in an easy to understand manner.
Much of machine learning is built on probability distributions, and the normal distribution sits at the core of many of them. This article illustrates what the normal distribution is and why it is so widely used, in particular by data scientists and machine learning practitioners.
I will explain everything from the very basics so that readers understand why the normal distribution is SO important.
This article will explain:
- What is a probability distribution?
- What does normal distribution mean?
- Which variables exhibit a normal distribution?
- How do you check the distribution of your data set in Python?
- How do you make a variable normally distributed in Python?
- Problems with normality
A Little Background First
- Firstly, the most important point to note is that the normal distribution is also known as the Gaussian distribution.
- It is named after the mathematician Carl Friedrich Gauss.
- Lastly, simple predictive models are usually the most widely used because they can be explained and are well understood. The normal distribution is simple, and that simplicity makes it extremely popular.
Hence it’s worth understanding what normal probability distribution is.
But First, What Does Probability Distribution Mean?
Let me explain by building the appropriate building blocks first.
Consider the predictive models we might be interested in building in our data science projects.
- If we want to predict a variable accurately then the first task we need to perform is to understand the underlying behaviour of our target variable.
- What we need to do first is determine the possible outcomes of the target variable and whether those outcomes are discrete (distinct, countable values) or continuous (any value within a range). For the sake of simplicity, if we are estimating the behaviour of a die, then the first step is to know that it can take any whole value from 1 to 6 (discrete).
- The next step is to start assigning probabilities to the events (values). If a value cannot occur, it is assigned a probability of 0%.
The higher the probability, the more likely it is for the event to occur.
- For instance, we can repeat an experiment a large number of times and note the values we observe for the variable.
- We can then group the values into categories/buckets and, for each bucket, record the number of times the variable took a value in that bucket. For example, we can throw a die 10000 times and, as there are 6 possible values that a die can take, create 6 buckets and record the number of occurrences of each value.
- We can plot these counts as a chart. Dividing each count by the total number of throws gives relative frequencies, and the shape they trace out approximates the probability distribution of the variable: how likely the variable is to take each value.
- Once we understand how the values are distributed, we can start estimating the probabilities of events, even by means of formulas (known as probability distribution functions). As a result, we can understand the variable’s behaviour better. The shape of a distribution is summarised by its moments, such as the mean, standard deviation, skewness and kurtosis.
- If you add up the probabilities of all possible outcomes, they sum to 100%.
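The die experiment described above can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_die_distribution`, the seed and the number of rolls are assumptions made for the example, and the die is simulated as fair.

```python
import random
from collections import Counter


def empirical_die_distribution(n_rolls=10_000, seed=0):
    """Roll a simulated fair six-sided die and return relative frequencies."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    # One bucket per face; a face that never appears gets probability 0.
    return {face: counts.get(face, 0) / n_rolls for face in range(1, 7)}


distribution = empirical_die_distribution()
for face, prob in distribution.items():
    print(f"{face}: {prob:.3f}")
print("total:", sum(distribution.values()))
```

Each relative frequency should land close to 1/6, and together they add up to 1 (that is, 100%).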
There are a large number of probability distributions and the most widely used probability distribution is known as “normal distribution”.
Let’s Now Move Onto Normal Probability Distribution
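If you plot the probability distribution and it forms a symmetric bell-shaped curve, with the mean, mode and median of the sample equal, then the variable is a good candidate for a normal distribution. These properties are necessary but not sufficient on their own, since other symmetric distributions such as Student-t share them.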
This is an example normal distribution bell shaped curve:
It is important to understand and estimate the probability distribution of your target variable.
The following variables are close to normally distributed:
- Height of a population
- Blood pressure of adult humans
- Position of a particle that experiences diffusion
- Measurement errors
- Residuals in regression
- Shoe size of a population
- Amount of time it takes for employees to reach home
- A large number of educational measures
Additionally, a large number of variables around us are approximately, though not exactly, normal.
What Is Normal Distribution?
A normal distribution is a distribution that is fully determined by two parameters: its mean and its standard deviation.
- Mean — This is the average value of all the points in the sample.
- Standard Deviation — This indicates how much the data set deviates from the mean of the sample.
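As a small sketch of these two parameters, the snippet below computes them with Python’s standard `statistics` module. The sample values are made up for illustration.

```python
import statistics

# Made-up sample, chosen so the results are round numbers.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(sample)
# pstdev treats the data as the whole population (divides by n);
# statistics.stdev would divide by n - 1 for a sample estimate.
std_dev = statistics.pstdev(sample)

print("mean:", mean)
print("standard deviation:", std_dev)
```

Whether to divide by n or n - 1 depends on whether the data is the whole population or a sample drawn from it; the choice here is only for the example.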
This characteristic makes the distribution extremely simple for statisticians to work with, and a variable that is approximately normal is often easier to model and forecast.
What is remarkable is that the probability distributions of many variables in nature turn out to be approximately normal.
Normal distribution is simple to explain. The reasons are:
- The mean, mode and median of the distribution are equal.
- We only need to use the mean and standard deviation to explain the entire distribution.
Normal Distribution Is Simply … The Normal Behaviour That We Are Just So Familiar With
But why are so many variables approximately normally distributed? What is the logic behind it?
The idea revolves around a theorem: when you add up a large number of independent random variables, each with finite variance, the distribution of their sum (suitably scaled) will be very close to normal, whatever the distributions of the individual variables.
As the height of a person is a random variable that depends on many other random variables, such as the amount of nutrition a person consumes, the environment they live in, their genetics and so on, the combined effect of these variables ends up being very close to normal.
This is known as the Central Limit Theorem.
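A quick simulation can make the theorem more tangible. The sketch below adds up independent uniform random variables, which individually look nothing like a bell curve, and checks how the sums behave. The number of terms, the sample size and the seed are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

n_terms = 30        # how many random variables are added together (assumed)
n_samples = 50_000  # how many sums we draw

# Each term is uniform on [0, 1): flat, clearly not bell-shaped.
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# Theory for a sum of n independent uniform(0, 1) variables.
expected_mean = n_terms * 0.5
expected_std = np.sqrt(n_terms / 12.0)

# Fraction of sums within one standard deviation of the mean;
# a normal distribution would give roughly 0.683.
within_one_std = np.mean(np.abs(sums - expected_mean) <= expected_std)

print("mean of sums:", sums.mean())
print("std of sums:", sums.std())
print("share within 1 std:", within_one_std)
```

The share within one standard deviation should come out close to the value expected for a normal distribution, even though no individual term is normal.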
This brings us to the core of the article:
We understood from the section above that the normal distribution arises as the limit of sums of many random variables. If we plot the normal distribution density function, its curve has the following characteristics:
The bell-shaped curve above has a mean of 100 and a standard deviation of 1
- The mean is the center of the curve. This is the highest point of the curve, as values cluster most densely around the mean.
- The curve is symmetric: half of the probability lies on each side of the mean.
- The total area under the curve is the total probability of all of the values that the variable can take.
- The total curve area is therefore 100%
- Approximately 68.3% of all of the points are within 1 standard deviation of the mean.
- About 95.4% of all of the points are within 2 standard deviations of the mean.
- About 99.7% of all of the points are within 3 standard deviations of the mean.
This allows us to easily estimate how volatile a variable is and, given a confidence level, the range its value is likely to fall in.
For instance, in the gray bell-shaped curve above, there is a roughly 68% chance that the value of the variable will be between 99 and 101.
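These percentages can be reproduced with the standard library alone. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2); the helper name `prob_within` below is just a label chosen for this sketch.

```python
import math


def prob_within(k):
    """P(|X - mean| <= k * std) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")

# Example from the text: mean 100, standard deviation 1.
mean, std = 100, 1
print(f"P({mean - std} <= X <= {mean + std}) = {prob_within(1):.3f}")
```

The result does not depend on the particular mean or standard deviation, only on how many standard deviations wide the interval is.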
Imagine the confidence you can now have when making future decisions with that information!!!
Normal Probability Distribution Function
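The probability density function of the normal distribution, with mean μ and standard deviation σ, is:
f(x) = 1 / (σ √(2π)) × exp(−(x − μ)² / (2σ²))
The probability density function describes the relative likelihood of a continuous random variable taking values near a given point. The density itself is not a probability: the probability of any single exact value is zero, and probabilities come from the area under the curve over a range.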
Normal distribution is a bell-shaped curve where mean=mode=median.
- If you plot the probability distribution curve using its computed probability density function then the area under the curve for a given range gives the probability of the target variable being in that range.
- This probability distribution curve is based on a probability distribution function, which is itself determined by parameters such as the mean and standard deviation of the variable.
- We could use this probability distribution function to find the chance of a random variable taking a value within a range. For instance, we could record the daily returns of a stock, group them into appropriate buckets and then estimate the probability of the stock making a 20–40% gain in the future.
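The relationship between the curve and the probability of a range can be sketched with `statistics.NormalDist` from the standard library. The return figures below are hypothetical, and treating returns as normal is an assumption made purely for the example. It is worth keeping the height of the curve and the area under it apart, which the last lines illustrate.

```python
from statistics import NormalDist

# Hypothetical daily returns (in percent) with an assumed mean and spread.
returns = NormalDist(mu=0.05, sigma=1.2)

# Probability of a return between 1% and 2%: the area under the curve,
# computed as a difference of cumulative probabilities.
p_between = returns.cdf(2.0) - returns.cdf(1.0)

# The same area approximated by summing thin rectangles under the density.
steps = 10_000
width = (2.0 - 1.0) / steps
p_numeric = sum(returns.pdf(1.0 + (i + 0.5) * width) for i in range(steps)) * width

# The height of the curve is a density, not a probability:
# for a narrow distribution it can exceed 1.
narrow_peak = NormalDist(mu=0.0, sigma=0.1).pdf(0.0)

print("P(1 <= return <= 2) via cdf:", p_between)
print("P(1 <= return <= 2) via area:", p_numeric)
print("peak density of a narrow normal:", narrow_peak)
```

The two probability estimates should agree closely, since both measure the same area under the density curve.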
The larger the standard deviation, the more the volatility in the sample.
How Do I Find Feature Distribution In Python?
The simplest method I follow is to load all of the features into a data frame and then run this script, which uses the Python pandas library:
df.hist(bins=10)  # Make a histogram of each numeric column; df is your DataFrame.
It shows us the empirical distribution of each numeric variable.
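A histogram can be complemented with a few numeric checks. The sketch below is one possible approach, not the only one: the data frame and its column names are invented for the example, and the D'Agostino-Pearson test from SciPy is just one of several normality tests that could be used.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data frame: one roughly normal feature, one right-skewed feature.
df = pd.DataFrame({
    "normal_feature": rng.normal(loc=50.0, scale=5.0, size=2_000),
    "skewed_feature": rng.exponential(scale=2.0, size=2_000),
})

# Skewness and excess kurtosis are both close to 0 for normal data.
summary = pd.DataFrame({"skew": df.skew(), "excess_kurtosis": df.kurt()})
print(summary)

# D'Agostino-Pearson test: a small p-value is evidence against normality.
p_values = {col: stats.normaltest(df[col]).pvalue for col in df.columns}
print(p_values)
```

A large p-value does not prove that a feature is normal; it only means the test found no strong evidence against it.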
What Does It Mean For A Variable To Have Normal Distribution?
What is even more fascinating is that once you add together a large number of independent random variables with differing distributions, the new variable will end up having an approximately normal distribution. This is essentially the Central Limit Theorem.
Normally distributed variables stay normally distributed under addition and linear transformation. For instance, if A and B are two independent variables with normal distributions, then:
- aA + b is normally distributed for any constants a ≠ 0 and b
- A + B is normally distributed
Note that the product A x B is, in general, not normally distributed.
As a result, it is extremely simple to forecast a variable and find the probability of it within a range of values because of the well-known probability distribution function.
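The behaviour of a sum of independent normal variables can be checked with a short simulation. The means and standard deviations below are arbitrary values picked for the sketch; the point is that the means add and, for independent variables, the variances add.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two independent normal variables with assumed parameters.
a = rng.normal(loc=10.0, scale=3.0, size=n)
b = rng.normal(loc=-4.0, scale=4.0, size=n)
total = a + b

# Theory: means add, and for independent variables variances add.
expected_mean = 10.0 + (-4.0)
expected_std = np.sqrt(3.0**2 + 4.0**2)  # 5.0

# Share of values within one standard deviation; about 0.683 if normal.
within_one_std = np.mean(np.abs(total - expected_mean) <= expected_std)

print("mean:", total.mean(), "expected:", expected_mean)
print("std:", total.std(), "expected:", expected_std)
print("share within 1 std:", within_one_std)
```

If A and B were correlated, the variance of the sum would also include a covariance term, so the simple formula here relies on independence.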
What If The Sample Distribution Is Not Normal?
You can often transform a feature so that its distribution is closer to normal.
I have used a number of techniques when preparing a feature:
1. Linear Transformation
Once we gather a sample for a variable, we can compute the Z-score by linearly transforming the sample. This rescales the data to a mean of 0 and a standard deviation of 1, but it does not change the shape of the distribution, so a skewed feature stays skewed:
- Calculate the mean
- Calculate the standard deviation
- For each value x, compute Z using: Z = (x − mean) / standard deviation
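The three steps can be sketched with NumPy as follows. The sample values are made up, and the population standard deviation (NumPy's default) is used here as an assumption; a sample estimate would pass `ddof=1`.

```python
import numpy as np

# Made-up sample values for illustration.
sample = np.array([12.0, 15.0, 9.0, 20.0, 14.0, 11.0, 17.0])

mean = sample.mean()
std = sample.std()  # population standard deviation (ddof=0)
z_scores = (sample - mean) / std

print("z-scores:", z_scores.round(3))
print("mean of z-scores:", z_scores.mean().round(6))
print("std of z-scores:", z_scores.std().round(6))
```

Standardizing changes only the location and scale of the data: the Z-scores have mean 0 and standard deviation 1, while the ordering and relative spacing of the values stay the same.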
2. Using Boxcox Transformation
You can use the SciPy package for Python to transform strictly positive data so that it is closer to a normal distribution:
scipy.stats.boxcox(x, lmbda=None, alpha=None)
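A minimal usage sketch, assuming a right-skewed, strictly positive sample (generated here from a log-normal distribution purely for illustration): when `lmbda` is left as `None`, `boxcox` estimates the transformation parameter and returns it alongside the transformed data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Right-skewed, strictly positive sample (Box-Cox requires x > 0).
x = rng.lognormal(mean=0.0, sigma=0.8, size=5_000)

# With lmbda=None, boxcox estimates lambda by maximum likelihood
# and returns the transformed data together with that estimate.
x_transformed, fitted_lambda = stats.boxcox(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_transformed)

print("fitted lambda:", fitted_lambda)
print("skewness before:", skew_before)
print("skewness after:", skew_after)
```

For log-normal data the fitted lambda should be near 0, which corresponds to a plain log transform, and the skewness should shrink towards 0.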
3. Using Yeo-Johnson Transformation
Additionally, the Yeo-Johnson power transformer can be used, which also handles zero and negative values. Python’s scikit-learn provides the appropriate class:
sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
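A short usage sketch follows, assuming a single skewed feature that includes negative values. The data is synthetic and the variable names are chosen only for this example.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Skewed feature that includes negative values, which Box-Cox cannot handle.
feature = rng.exponential(scale=2.0, size=5_000) - 1.0
X = feature.reshape(-1, 1)  # scikit-learn expects a 2-D array

transformer = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = transformer.fit_transform(X)

print("fitted lambda:", transformer.lambdas_[0])
print("mean:", X_transformed.mean(), "std:", X_transformed.std())
```

With `standardize=True` the output is also rescaled to zero mean and unit variance. In a modelling pipeline the transformer would normally be fitted on training data only and then applied to new data with `transform`.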
Note that it is recommended to understand when to use each of the power transformers. An explanation of power transformers such as Box-Cox and Yeo-Johnson and their use cases is beyond the scope of this article.
Problems With Normality
As the normal distribution is simple and well understood, it is also overused in predictive projects. Assuming normality has its own flaws. For instance, we cannot assume that a stock price follows a normal distribution, as the price cannot be negative. Therefore the stock price is often modelled with a log-normal distribution, which ensures it is never below zero.
Returns, on the other hand, can be negative, so a normal distribution is a more plausible starting point for them.
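The difference between prices and returns can be sketched with a small simulation. Everything here is an assumption for illustration: the starting price, the number of days, and the idea that daily log returns are normal with the chosen parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

start_price = 100.0
n_days = 252

# Assumption for illustration: daily log returns are normal with these parameters.
log_returns = rng.normal(loc=0.0003, scale=0.02, size=n_days)

# Compounding normal log returns gives log-normally distributed prices.
prices = start_price * np.exp(np.cumsum(log_returns))

print("most negative daily log return:", log_returns.min())
print("lowest simulated price:", prices.min())
```

The returns dip below zero on many days, yet the simulated price stays positive throughout, because the exponential of any real number is positive.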
It is not wise to assume that a variable follows a normal distribution without any analysis.
A variable can follow a Poisson, Student-t or Binomial distribution, for instance, and falsely assuming that it follows a normal distribution can lead to inaccurate results.
This article illustrated what normal distribution is and why it is so important, in particular for a data scientist and a machine learning expert.
Hope it helps.