Normal (or Gaussian) Distribution

Introduction

The Normal (or Gaussian) Distribution gets its name from how often real-valued random variables in nature can be modeled by it.

This article discusses the same concepts as covered in the video "Part-I: Series on Normal Distribution".

Examples of Normal Distribution

  • Height, weight or size of objects
  • Errors produced by a machine
  • Test scores

Parameters of Normal Distribution

In terms of its parameters, the distribution is written as X ~ N(μ, σ²), where:

  • mean (μ): the value around which the data is centered
  • standard deviation (σ): the spread of the data; σ² represents the variance

To visualize the parameters, let’s look at the following illustrations:

  • Cookies manufactured in a plant: Let’s say you visit a plant where cookies are manufactured, pick three of them, and measure their diameters.

This makes you curious, so you take a sample of 500 cookies and plot a histogram. You realize that the histogram does follow a normal distribution, with a mean of 5 cm.

  • Height of males in a city: You take another example and measure the heights of a sample of 500 males in a city. You again plot a histogram and find out that this also follows a normal distribution with a mean of 172 cm.

We notice that for both of these examples, the mean is the central characteristic of the distribution and most of the readings stay close to it. Some readings lie farther away, and this spread in the data is precisely what the standard deviation captures.
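To make this concrete, here is a minimal Python sketch (NumPy and Matplotlib assumed installed; the standard deviation of 0.05 cm is borrowed from Application 1 below, and the seed is arbitrary) that simulates 500 cookie diameters and plots their histogram:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate diameters of 500 cookies from N(mu = 5 cm, sigma = 0.05 cm).
# The 0.05 cm standard deviation is borrowed from Application 1 below.
rng = np.random.default_rng(seed=42)
diameters = rng.normal(loc=5.0, scale=0.05, size=500)

# Plot the histogram; it should look roughly bell-shaped around 5 cm.
plt.hist(diameters, bins=30, edgecolor="black")
plt.xlabel("Cookie diameter (cm)")
plt.ylabel("Count")
plt.title("Sample of 500 cookie diameters")
plt.show()

print(f"sample mean = {diameters.mean():.3f} cm, "
      f"sample SD = {diameters.std(ddof=1):.3f} cm")
```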

The Empirical 68–95–99.7 Rule

This empirical rule states that:

  • 68% of the observations fall within one σ of μ, i.e., (μ − σ, μ + σ).
  • 95% of the observations fall within two σ of μ, i.e., (μ − 2σ, μ + 2σ).
  • 99.7% of the observations fall within three σ of μ, i.e., (μ − 3σ, μ + 3σ).
Figure 2: Standard Normal Distribution
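As a quick numerical check (assuming SciPy is available), these percentages can be computed directly from the standard normal CDF:

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean.
# The result is the same for any mu and sigma, so the standard
# normal (mu = 0, sigma = 1) suffices.
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4f}")

# Output:
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```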

Properties of Normal Distribution

  • Because of its shape, it’s also known as the bell curve.
  • Normal distribution is symmetric about its mean.
  • The shape of the normal density curve remains the same when the mean changes; the curve only shifts along the axis, i.e., it is translation invariant.
  • An increase in standard deviation flattens the curve.
  • A decrease in standard deviation concentrates the curve.
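A small sketch (SciPy and Matplotlib assumed installed) that illustrates these properties by plotting the density for a few choices of μ and σ:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-6, 6, 500)

# Same sigma, different means: the curve keeps its shape and only shifts.
plt.plot(x, norm.pdf(x, loc=0, scale=1), label="mu = 0, sigma = 1")
plt.plot(x, norm.pdf(x, loc=2, scale=1), label="mu = 2, sigma = 1")

# Same mean, larger sigma: the curve flattens and spreads out.
plt.plot(x, norm.pdf(x, loc=0, scale=2), label="mu = 0, sigma = 2")

plt.legend()
plt.title("Normal density for different parameters")
plt.show()
```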

Applications

  • Application 1: Say, you are a consultant and are hired by a company that manufactures cookies. The requirement is to design a wrapper for the cookies so that at least 99% of them can be packed.

Step 1: You gather a large sample of cookies (say 500 to 1000) and estimate the mean diameter as 5 cm and the standard deviation as 0.05 cm.

Step 2: To ensure that at least 99% of the cookies can be packed, you use the 68–95–99.7 rule, which tells us that 99.7% of the cookies will fall within 3 standard deviations, i.e., within 0.15 cm of the mean of 5 cm.

Thus, a good estimate for the wrapper diameter is μ + 3σ = 5.15 cm, and you report this figure.
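A minimal sketch of this calculation (SciPy assumed installed), using the estimates above:

```python
from scipy.stats import norm

mu, sigma = 5.0, 0.05   # estimated mean and SD of cookie diameter (cm)

# Wrapper diameter from the 68-95-99.7 rule: mu + 3 * sigma.
wrapper = mu + 3 * sigma
print(f"wrapper diameter = {wrapper:.2f} cm")      # 5.15 cm

# Fraction of cookies with diameter at most the wrapper size.
print(f"fraction covered = {norm.cdf(wrapper, mu, sigma):.4f}")   # 0.9987
```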

  • Application 2: Pleased with your last report, you are asked for another design suggestion. You need to report an estimate for the sieve size such that the food grains larger than 80% of all the grains are retained and the rest, below this threshold, pass through.

Step 1: You again take a large sample and estimate the mean size of the food grains as 4 mm and the SD as 0.2 mm.

Step 2: You’re stuck because you do not know how many standard deviations to the right of the mean the threshold must be placed so that it covers 80% of the area under the distribution.

For this, let’s learn about standard normal curves and z-scores.

Standard Normal Curve

Normal distribution with μ = 0 and σ = 1

z-scores

A z-score represents how many standard deviations (σ) below or above the mean (μ) a data point (x) lies. Mathematically, it is calculated as z = (x − μ) / σ.

It lets us analyze the distribution of z-scores (a standard normal curve) in place of the raw data points.

Why do we analyze the distribution of z-scores instead of the raw data points x?

Standard normal curves are easier to analyze, and many computational tools are available for them. Since the distribution of z-scores is nothing but a standard normal distribution, we convert our data points x to z-scores for analysis and inference, and can convert the z-scores back to x afterwards.
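As a rough illustration of this back-and-forth conversion (the helper names to_z and from_z are just for this example, and the μ = 4 mm, σ = 0.2 mm values are borrowed from Application 2):

```python
mu, sigma = 4.0, 0.2   # example values: grain-size mean and SD (mm)

def to_z(x, mu, sigma):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Convert a z-score back to the original scale."""
    return mu + z * sigma

print(to_z(4.17, mu, sigma))    # ~0.85
print(from_z(0.85, mu, sigma))  # 4.17
```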

Table of z-scores

We use this table to find the area under the normal distribution to the left of a given z-score.

Let’s find the area under the standard normal distribution to the left of the z-score 0.55. From the table, this area is approximately 0.7088, i.e., the shaded region under the curve to the left of z = 0.55.

We can also use statistical packages in languages such as Python and R to find this.
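For example, a minimal Python snippet (SciPy assumed installed; the equivalent R command is shown as a comment):

```python
from scipy.stats import norm

# Area under the standard normal curve to the left of z = 0.55.
print(norm.cdf(0.55))   # ~0.7088

# The equivalent command in R is: pnorm(0.55)
```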

Application-2 continued …

Recall the task: you need to report a sieve size such that the grains larger than 80% of all the grains are retained, with the mean grain size estimated as 4 mm and the SD as 0.2 mm.

Step 2: Using either of the methods discussed, you find the z-score to the left of which lies 80% of the area under the curve.

From the table, an area of 0.8023 corresponds to a z-score of 0.85, which we take as our estimate. For more precise calculations, we can use the packages above.

We know that z = (x − μ) / σ,

therefore x = μ + zσ.

The grain sizes that should pass through the sieve are those less than or equal to 4 + 0.85 × 0.2 = 4.17 mm.

Therefore, you report the sieve size as 4.17 mm.
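The whole calculation can also be reproduced with SciPy’s inverse CDF (norm.ppf); this sketch reuses the μ = 4 mm, σ = 0.2 mm estimates:

```python
from scipy.stats import norm

mu, sigma = 4.0, 0.2          # estimated mean and SD of grain size (mm)

# z-score to the left of which 80% of the area lies (inverse CDF).
z = norm.ppf(0.80)            # ~0.8416 (the table lookup gave ~0.85)

# Convert back to the original scale: x = mu + z * sigma.
sieve_size = mu + z * sigma
print(f"sieve size = {sieve_size:.2f} mm")   # 4.17 mm
```

Note that the inverse CDF gives z ≈ 0.8416, slightly below the table-based 0.85, but the reported sieve size still rounds to 4.17 mm.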

