Data Scientist Must Know: Introduction to the normal distribution

Insufficient
5 min read · Dec 15, 2022


Photo by Mark Basarab on Unsplash

There are two types of hypothesis tests: parametric and nonparametric. Parametric tests assume that the data follow a known distribution, while nonparametric tests do not. The distribution most widely used in parametric tests is the normal distribution. Many natural phenomena can be modeled by it, hence its wide use in testing. This post introduces the basics of the normal distribution: its unique properties, related theorems, and more.

The bell-shaped curve

The ‘bell shaped’ normal distribution

The normal distribution is often referred to as the Gaussian distribution, in honor of the mathematician Karl Friedrich Gauss, who derived the equation for the normal distribution while researching measurement errors. The normal distribution is associated with a bell-shaped curve called the normal curve. It relies on two parameters, the mean and the variance. The mean determines the location of the curve, while the variance determines its shape: a larger variance gives a wider, flatter curve, and a smaller variance gives a narrower, taller one.

The mean of the normal distribution determines its location
The variance determines its shape
Different shapes and locations of the normal distribution
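
To make the roles of the two parameters concrete, here is a minimal sketch using `NormalDist` from Python's standard library. The three parameter choices are assumptions picked only for illustration. Note that `NormalDist` takes the standard deviation (the square root of the variance), not the variance itself.

```python
from statistics import NormalDist

# Hypothetical parameter choices, picked only for illustration.
# NormalDist expects sigma, the square root of the variance.
curves = {
    "mean 0, variance 1": NormalDist(mu=0.0, sigma=1.0),
    "mean 2, variance 1": NormalDist(mu=2.0, sigma=1.0),
    "mean 0, variance 4": NormalDist(mu=0.0, sigma=2.0),
}

for label, dist in curves.items():
    # The curve reaches its highest point at the mean.
    peak_height = dist.pdf(dist.mean)
    print(f"{label}: peak at x = {dist.mean}, peak height = {peak_height:.4f}")
```

Changing the mean moves the peak along the x-axis without changing its height, while increasing the variance lowers the peak and spreads the curve out.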

Some of its properties

The normal distribution is useful well beyond hypothesis testing; it appears in many different fields. Its bell-shaped curve gives it the following properties:

  1. The mode, which is the value that occurs most often, is the point where the curve is at its maximum, which is the mean.
  2. The curve is symmetric about the mean.
  3. Approximately 95% of the population lies within two standard deviations of the mean (the standard deviation is the square root of the variance).
One feature of the normal distribution is that most values are around the mean
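
These properties can be checked numerically. The sketch below uses `NormalDist` from Python's standard library with a hypothetical mean of 10 and standard deviation of 3 (both are assumptions for illustration), and measures distance from the mean in standard deviations, that is, in square roots of the variance.

```python
import math
from statistics import NormalDist

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 10.0, 3.0
dist = NormalDist(mu=mu, sigma=sigma)

# 1. The curve is highest at the mean: points just beside it are lower.
peak_is_at_mean = dist.pdf(mu) > dist.pdf(mu - 0.1) and dist.pdf(mu) > dist.pdf(mu + 0.1)

# 2. Symmetry: equal distances below and above the mean give equal heights.
offset = 2.5
is_symmetric = math.isclose(dist.pdf(mu - offset), dist.pdf(mu + offset))

# 3. Share of the population within two standard deviations of the mean.
within_two_sd = dist.cdf(mu + 2 * sigma) - dist.cdf(mu - 2 * sigma)

print(f"peak at the mean: {peak_is_at_mean}")
print(f"symmetric about the mean: {is_symmetric}")
print(f"share within two standard deviations: {within_two_sd:.4f}")
```

The third value comes out slightly above 0.95, which is why "approximately 95%" is the usual shorthand.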

The Equation

To calculate the probability that a normally distributed random variable falls within some range, we need the area under the curve over that range. To do that, we need to know the equation of the curve. Developed by Abraham de Moivre in 1733, the mathematical equation of the normal distribution is given by

The equation for the normal distribution
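
For reference, the density is usually written as follows. The symbols are assumptions of this sketch, since the post does not name them: μ stands for the mean and σ² for the variance, so σ is the standard deviation.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left( -\frac{(x - \mu)^{2}}{2\sigma^{2}} \right),
\qquad -\infty < x < \infty
```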

To find the area under the curve between two points a and b, all we need to do is integrate the density function from a to b. The result is the probability P(a < X < b).

While we have the computational power to evaluate this complex integral, there is a much easier way to find the area under the graph of the normal distribution.
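
For comparison, the direct computation can be sketched in a few lines. The helper names `normal_pdf` and `area_between` and the input values are hypothetical, introduced only for this illustration. The sketch approximates the integral with the trapezoidal rule and checks it against the cumulative distribution function from Python's standard library.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mean, variance):
    """Height of the normal curve at x for the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def area_between(a, b, mean, variance, steps=10_000):
    """Approximate the area under the normal curve on [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    # End points count half, interior points count fully.
    total = 0.5 * (normal_pdf(a, mean, variance) + normal_pdf(b, mean, variance))
    for i in range(1, steps):
        total += normal_pdf(a + i * width, mean, variance)
    return total * width

# Hypothetical inputs, chosen only for illustration.
mean, variance = 5.0, 4.0
a, b = 4.0, 8.0

numeric = area_between(a, b, mean, variance)

# Reference value from the library's cumulative distribution function.
reference = NormalDist(mu=mean, sigma=math.sqrt(variance))
exact = reference.cdf(b) - reference.cdf(a)

print(f"trapezoidal rule: {numeric:.6f}")
print(f"library value:    {exact:.6f}")
```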

The Standard Normal Distribution

Look at the integral again. Notice that it would be much simpler if the mean were 0 and the variance were 1. The normal distribution with these parameters is called the standard normal distribution. The area under its curve between a and b is then the integral of exp(−x²/2) / √(2π) from a to b.

While this is still difficult to compute by hand, computers can evaluate the integral easily. To obtain the probability without a computer, we can look it up in a table of precomputed probabilities.

The normal distribution table

Each value in the table is the area under the standard normal curve from negative infinity up to a given point (denoted z). Here is how to read the table:

Let X be a random variable with the standard normal distribution, and suppose we want the probability that X is greater than 0.62, or in notation form, P(X > 0.62).

The table only gives us the probability that X is less than 0.62 directly. But remember that P(X > 0.62) = 1 − P(X < 0.62).

From the table, we can see that the probability that X is less than 0.62 is 0.7324. Therefore, P(X > 0.62) = 1 − 0.7324 = 0.2676.
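
The same lookup can be reproduced in code. This is a minimal sketch using `NormalDist` from Python's standard library, where calling it with no arguments gives the standard normal distribution and `cdf` plays the role of the table.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # defaults to mean 0 and standard deviation 1

z = 0.62
p_less = standard_normal.cdf(z)  # area from negative infinity up to z, as in the table
p_greater = 1 - p_less           # complement: area to the right of z

print(f"P(X < {z}) = {p_less:.4f}")
print(f"P(X > {z}) = {p_greater:.4f}")
```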

Using the standard normal distribution

Now the question is: how do we find probabilities for a normal distribution with a different mean and variance using only the standard normal distribution? To do this, we apply a transformation. If X is normally distributed with mean μ and standard deviation σ (the square root of the variance), define Z = (X − μ) / σ.

We state without proof that the random variable Z has the standard normal distribution. So, just by applying this transformation, we can compute probabilities for a normally distributed random variable with any mean and variance.

To better understand this concept, let's look at an example.

An example
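
As a further hypothetical illustration (the numbers below are assumptions chosen for this sketch), suppose X is normally distributed with mean 50 and standard deviation 10, and we want the probability that X lies between 45 and 62. Subtracting the mean and dividing by the standard deviation turns both end points into standard normal values, which can then be looked up in the usual way.

```python
from statistics import NormalDist

# Hypothetical problem, chosen only for illustration.
mean, std_dev = 50.0, 10.0
lower, upper = 45.0, 62.0

# Standardize both end points: subtract the mean, divide by the standard deviation.
z_lower = (lower - mean) / std_dev
z_upper = (upper - mean) / std_dev

# Look up the standardized values in the standard normal distribution.
standard_normal = NormalDist()
probability = standard_normal.cdf(z_upper) - standard_normal.cdf(z_lower)

# Cross-check without standardizing, using the original mean and standard deviation.
direct = NormalDist(mu=mean, sigma=std_dev)
check = direct.cdf(upper) - direct.cdf(lower)

print(f"z values: {z_lower:.2f} and {z_upper:.2f}")
print(f"P({lower} < X < {upper}) via the standard normal: {probability:.4f}")
print(f"same probability computed directly:           {check:.4f}")
```

Both routes give the same probability, which is the point of the transformation: one table for the standard normal is enough for every mean and variance.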

The central limit theorem

Hypothesis testing is associated with sampling theory. While doing research, we take samples from the population to gather data. The central limit theorem is a powerful theorem which is the basis of a lot of tests used by researchers during hypothesis testing.

The central limit theorem

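This theorem is powerful: it tells us that, for independent samples from almost any distribution (as long as it has a finite mean and variance), the sample mean is approximately normally distributed. Standardizing it as Z = (X̄ − μ) / (σ / √n) gives a variable that approximately follows the standard normal distribution, where X̄ is the sample mean, μ and σ are the population mean and standard deviation, and n is the sample size. However, keep in mind that this approximation is only good for large values of n. There is no hard threshold for how large n must be, but a common rule of thumb is that n ≥ 30 usually gives a good approximation, with more skewed populations needing larger samples.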
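
A small simulation can make this tangible. The sketch below is an illustration under stated assumptions, not a proof: it draws samples of size 30 from an exponential distribution with rate 1 (a strongly skewed distribution whose mean and standard deviation are both 1), standardizes each sample mean, and compares the share of standardized values within ±1.96 to what the standard normal distribution predicts. The seed, the choice of distribution, and the number of trials are all arbitrary choices for this example.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(42)  # fixed seed so the run is repeatable

# Exponential distribution with rate 1: mean 1, standard deviation 1, clearly not normal.
population_mean, population_sd = 1.0, 1.0
n = 30           # size of each sample
trials = 10_000  # number of samples drawn

z_values = []
for _ in range(trials):
    sample_mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
    # Standardize the sample mean: subtract the population mean,
    # divide by the standard deviation of the sample mean (sd / sqrt(n)).
    z_values.append((sample_mean - population_mean) / (population_sd / math.sqrt(n)))

# Share of standardized sample means within +/- 1.96.
observed = sum(abs(z) <= 1.96 for z in z_values) / trials

# Share the standard normal distribution predicts for the same interval.
standard_normal = NormalDist()
expected = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(f"observed share within +/-1.96: {observed:.3f}")
print(f"standard normal prediction:    {expected:.3f}")
```

Rerunning with a much smaller n (say, 3) should show a visibly worse match, which is the practical meaning of "only good for large values of n".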
