Data Scientist Must Know: Introduction to the normal distribution
There are two types of hypothesis tests: parametric and nonparametric. Parametric tests assume that the data follow a known distribution, while nonparametric tests do not. The distribution most widely used in parametric tests is the normal distribution, because many natural phenomena can be modeled by it. This post introduces the basics of the normal distribution: its unique properties, related theorems, and more.
The bell-shaped curve
The normal distribution is often referred to as the Gaussian distribution, in honor of the mathematician Carl Friedrich Gauss, who derived its equation while researching measurement errors. It is often associated with a bell-shaped curve called the normal curve. The normal distribution relies on two parameters, the mean and the variance. The mean determines the location of the curve, while the variance determines how spread out it is.
Some of its properties
The normal distribution is useful well beyond hypothesis testing and appears in many different fields. Its bell-shaped curve gives it several properties:
- The mode, which is the value that occurs most often, is the point where the curve is at its maximum, and it coincides with the mean.
- The curve is symmetric about the mean.
- Approximately 95% of the population lies within two standard deviations of the mean.
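These properties can be checked numerically. The following is a minimal sketch using `NormalDist` from Python's standard library; the mean of 10 and standard deviation of 2 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 10.0, 2.0
dist = NormalDist(mu=mu, sigma=sigma)

# Symmetry: the density is equal at the same distance on either side of the mean.
symmetric = abs(dist.pdf(mu - 1.5) - dist.pdf(mu + 1.5)) < 1e-12

# Mode: on a grid around the mean, no point has a higher density than the mean.
grid = [mu + k * 0.01 for k in range(-500, 501)]
mode_at_mean = all(dist.pdf(mu) >= dist.pdf(x) for x in grid)

# Share of the population within two standard deviations of the mean.
within_two_sd = dist.cdf(mu + 2 * sigma) - dist.cdf(mu - 2 * sigma)

print(symmetric, mode_at_mean, round(within_two_sd, 4))
```

The last value comes out at roughly 0.9545, which is where the "approximately 95%" figure comes from.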
The Equation
To calculate the probability that a normally distributed random variable falls in a given range, we need the area under the curve over that range. To do that, we need to know the equation of the curve. Developed by Abraham de Moivre in 1733, the equation of the normal distribution with mean $\mu$ and variance $\sigma^2$ is given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$
To find the area under the curve between two points $x_1$ and $x_2$, all we need to do is calculate
$$P(x_1 < X < x_2) = \int_{x_1}^{x_2} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx$$
While we have the computing power to evaluate this integral, there is a much easier way to find the area under the normal curve.
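To make the brute-force route concrete, here is a rough sketch that approximates the area with the trapezoidal rule and compares it with the value from the standard library. The helper names `normal_pdf` and `area_between`, and the example values (mean 50, standard deviation 10, interval 45 to 62), are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def area_between(x1, x2, mu, sigma, steps=10_000):
    """Approximate the integral of the density from x1 to x2 (trapezoidal rule)."""
    width = (x2 - x1) / steps
    total = 0.5 * (normal_pdf(x1, mu, sigma) + normal_pdf(x2, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(x1 + i * width, mu, sigma)
    return total * width

# Hypothetical example: mean 50, standard deviation 10, area between 45 and 62.
approx = area_between(45.0, 62.0, mu=50.0, sigma=10.0)
reference = NormalDist(50.0, 10.0).cdf(62.0) - NormalDist(50.0, 10.0).cdf(45.0)
print(round(approx, 4), round(reference, 4))
```

Both numbers should agree to several decimal places, but the loop has to be rerun for every new mean, variance, and interval.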
The Standard Normal Distribution
Look at the integral again. Notice that it would be much simpler if the mean were 0 and the variance were 1. The normal distribution with these parameters is called the standard normal distribution. The area under the standard normal curve between two points $z_1$ and $z_2$ is therefore
$$P(z_1 < Z < z_2) = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \, dz$$
While this is still difficult to compute by hand, computers can evaluate the integral easily. To obtain the probability without a computer, we can use a table of probabilities instead.
Each value in the table is the area under the curve between negative infinity and a given point (denoted as z). Here is how to read the table:
Let X be a random variable with the standard normal distribution, and suppose we want to find the probability that X is greater than 0.62, or in notation form
$$P(X > 0.62)$$
Using the table, we can only get the probability that X is less than 0.62. But remember that
$$P(X > 0.62) = 1 - P(X \le 0.62)$$
From the table, we can see that the probability that X is less than 0.62 is 0.7324. Therefore
$$P(X > 0.62) = 1 - 0.7324 = 0.2676$$
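The same lookup can be reproduced in a few lines of code. This sketch rebuilds one row of the table with `math.erf` and then applies the complement rule; the helper name `phi` is an assumption made for illustration.

```python
import math

def phi(z):
    """Area under the standard normal curve from negative infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rebuild one row of the table: z = 0.60, 0.61, ..., 0.69, rounded to 4 places.
row = {round(0.6 + k / 100, 2): round(phi(0.6 + k / 100), 4) for k in range(10)}

p_less = row[0.62]                  # table entry for z = 0.62
p_greater = round(1.0 - p_less, 4)  # complement rule: P(X > z) = 1 - P(X <= z)
print(p_less, p_greater)
```

This prints the table entry 0.7324 followed by the tail probability 0.2676.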
Using the standard normal distribution
Now the question is, how do we find probabilities for a normal distribution with a different mean and variance using only the standard normal distribution? To do this, we need to apply a transformation. If X is normally distributed with mean $\mu$ and variance $\sigma^2$, define
$$Z = \frac{X - \mu}{\sigma}$$
We state without proof that the random variable Z has the standard normal distribution. So, just by applying this transformation, we can compute probabilities for a normally distributed random variable with any mean and variance.
To better understand this concept, let's look at an example.
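As an illustration with made-up numbers, suppose X is normally distributed with mean 3.0 and standard deviation 0.5, and we want the probability that X is less than 2.3. These values are hypothetical and not taken from any data set. The sketch below standardizes the value and checks the result against a direct calculation.

```python
from statistics import NormalDist

# Hypothetical parameters and cutoff, chosen only for illustration.
mu, sigma = 3.0, 0.5
x = 2.3

# Standardize: subtract the mean, then divide by the standard deviation.
z = (x - mu) / sigma  # approximately -1.4

# Probability from the standard normal distribution at the standardized value.
p_via_z = NormalDist().cdf(z)

# Same probability computed directly on the original scale, as a cross-check.
p_direct = NormalDist(mu, sigma).cdf(x)

print(round(z, 2), round(p_via_z, 4), round(p_direct, 4))
```

Both routes give about 0.0808, the value a standard normal table lists for z = -1.40.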
The central limit theorem
Hypothesis testing is closely tied to sampling theory. When doing research, we take samples from the population to gather data. The central limit theorem is a powerful result that underlies many of the tests researchers use in hypothesis testing. It states that if $\bar{X}$ is the mean of a random sample of size n taken from a population with mean $\mu$ and finite variance $\sigma^2$, then as n grows, the distribution of
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
approaches the standard normal distribution.
The theorem tells us that the sample mean of data from almost any distribution (any with a finite variance) is approximately normally distributed, so after this transformation it approximately follows the standard normal distribution. However, keep in mind that the approximation is only good for large values of n. There is no strict threshold for how big n needs to be, but a common rule of thumb is that n greater than or equal to 30 gives a good approximation.
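A small simulation can make the theorem tangible. The sketch below draws many samples of size 30 from an exponential distribution, which is strongly skewed, standardizes each sample mean, and checks how often the result falls within 1.96 of zero. The choice of distribution, the number of trials, and the seed are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the run is repeatable

n = 30            # sample size, matching the rule-of-thumb threshold
trials = 20_000   # number of independent samples to draw
mu = sigma = 1.0  # an exponential distribution with rate 1 has mean 1 and standard deviation 1

z_values = []
for _ in range(trials):
    sample = [random.expovariate(1.0) for _ in range(n)]
    # Standardize the sample mean using the population mean and standard deviation.
    z_values.append((fmean(sample) - mu) / (sigma / math.sqrt(n)))

# Share of standardized means within 1.96 of zero, versus the standard normal value.
simulated = sum(abs(z) <= 1.96 for z in z_values) / trials
expected = NormalDist().cdf(1.96) - NormalDist().cdf(-1.96)
print(round(simulated, 3), round(expected, 3))
```

If the approximation holds, the simulated share should land close to the standard normal value of about 0.95, even though the underlying data are far from bell-shaped. Lowering `n` to 5 or so is an easy way to see the approximation degrade.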