Properties of the Normal Distribution

Justin Olson · Published in Analytics Vidhya · Feb 16, 2020 · 13 min read

Grok through Practical Experiments in Python

The powerful K2 mountain.

Introduction

The Normal distribution (ND), also known as the Gaussian distribution, is a fundamental concept in statistics, and for good reason. It is the most frequently observed of all distribution types and is present in nearly every domain of study.

If data are found to be normally distributed, one can make use of important qualities of the data, such as the mean and standard deviation. The Normal distribution underpins an entire class of statistical hypothesis tests. Additionally, checking that a model's residuals are normally distributed provides evidence that the model has captured the explanatory variability available to it.

The concept of the ND is also applied to model training in the form of data normalization, where data are rescaled to have a mean of zero and a standard deviation of one. This often improves the performance of a model: because all input variables share the same scale, they can affect the output comparably, which makes it easier for the model to learn the right things.
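
As a concrete illustration (not from the original article), here is a minimal sketch of that normalization step using numpy; the array features is a hypothetical stand-in for a model's input variables.

import numpy as np

# hypothetical input: 1,000 observations of 3 features on very different scales
features = np.random.uniform([0, 0, 0], [1, 100, 10000], size=(1000, 3))

# rescale each column to a mean of zero and a standard deviation of one
normalized = (features - features.mean(axis=0)) / features.std(axis=0)

print(normalized.mean(axis=0).round(3))  # approximately [0, 0, 0]
print(normalized.std(axis=0).round(3))   # approximately [1, 1, 1]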

Data normalization is particularly important to the field of deep learning, where it can significantly reduce training time and improve performance of models [1].

This article describes properties of the Normal distribution, the influence of sample size, and how different sources of variance can all yield ND data. Concepts are described using practical experiments coded in Python (all code available here).

Central Tendency & Dispersion

A Normal distribution is observed when continuous numerical data take on a symmetrical, bell-shaped curve (Figure 1). This curve can be characterized by two qualities: central tendency and dispersion.

In Figure 1, we can see that the data are distributed such that the majority of values cluster near the center. This characteristic is called central tendency. For a perfectly Gaussian distribution, the measures of central tendency (mean, median, and mode) are all equal, but the mean is the most commonly used descriptor.

Figure 1: A Gaussian distribution and standard deviations from the mean. This histogram was produced by randomly sampling one million values from a Gaussian distribution with a mean of 0 (solid vertical line) and a standard deviation of 1 (tallest dashed line). The medium- and short-dashed lines indicate 2 and 3 standard deviations, respectively. For our randomly generated data, 68.3% of values fall within 1 standard deviation, 95.5% fall within 2 standard deviations, and 99.7% fall within 3 standard deviations.

Dispersion, or variance, describes the tendency for the frequency of values to taper off as they stray from the center: the further from the mean, the fewer values will be present. The values are symmetrical about the mean, so if Figure 1 were folded vertically at the mean, the frequencies on either side would match up. This allows us to predictably estimate the frequency of values any distance away from the mean and to expect an equal frequency of values on both sides of the mean.
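
The percentages cited in Figure 1 can be checked with a short simulation. A minimal sketch, assuming the same one million draws from a standard Normal distribution described in the caption:

import numpy as np

values = np.random.normal(loc=0, scale=1, size=1_000_000)

# fraction of values within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    within = np.mean(np.abs(values) <= k)
    print(f"within {k} standard deviation(s): {within:.3f}")  # ~0.683, ~0.955, ~0.997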

When data are Normally distributed, we have confidence that the mean and standard deviation thoroughly describe the distribution of the data. Further analyses, such as parametric tests, can then use the mean as a basis for comparison between groups.

Central Limit Theorem

The central limit theorem (CLT) is a driving force behind the usefulness of Normal distributions. Many probabilistic and statistical methods operate on the assumption that data come from a normally distributed population. Often, one checks this by evaluating the distribution of the data. If data are ND, one can use an entire array of statistical methods that are helpful in data analysis. However, anyone who has worked with real-world data will notice that data are rarely perfectly Normal.

The CLT introduces flexibility into the Normality assumption, making it easier to assume that data originated from a ND and enabling one to use those statistical methods more often (provided sufficient sample size).

The CLT says that when one has enough data, the distribution of the sample means will approximate the Normal distribution [2]. Let’s attempt to verify this with code.

Figure 2: The results of a single experiment with 30 observations. Samples were generated from a uniform random distribution with values ranging from zero to one.
Figure 3: The distribution of means from 100 experiments, each containing 30 observations. Samples were generated from a uniform random distribution of values between zero and one. The solid line is the mean of all 100 means. The dashed line is the mean of the single experiment seen in Figure 2.

Imagine one conducted an experiment and calculated the mean of the results. The only criterion here is that the scientist gathers a sufficient number of data points to make the mean a good measure of central tendency. For this example, a sample size of thirty is used and the mean of those observations is calculated (Figure 2).

It’s apparent that the data are non-normal for this single experiment.

Now imagine this experiment is repeated 100 times so that the scientist collects a distribution of means. The CLT states that the distribution of sample means should approximate a normal distribution. In Figure 3, one can observe this to be true.
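
A minimal sketch of the kind of simulation behind Figures 2 and 3 (the exact code lives in the linked repository; this version assumes numpy's uniform generator for the random values between zero and one):

import numpy as np

n_experiments = 100
n_observations = 30

# each experiment draws 30 values from a uniform distribution on [0, 1)
# and records the sample mean
sample_means = [np.random.uniform(0, 1, n_observations).mean()
                for _ in range(n_experiments)]

single_mean = sample_means[0]          # the mean of one experiment (dashed line in Figure 3)
mean_of_means = np.mean(sample_means)  # the mean of all 100 means (solid line in Figure 3)
print(single_mean, mean_of_means)

Plotting a histogram of sample_means reproduces the approximately Normal shape seen in Figure 3.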

It’s interesting that a Normal distribution emerges despite the fact that the underlying distribution is not Normal at all, but uniformly random.

One can think of the mean from a single experiment as having originated from a distribution of means. If one knows the distribution of means, one can expect the mean from any one experiment to fall within the distribution of all experimental means. Furthermore, one can build expectations about where a single mean is likely to fall, given that it is more likely to fall towards the center of the distribution (towards the mean of the means). In Figure 3, one can observe that the mean from the independent experiment (dashed line) falls within the distribution of means and in reasonable proximity to the mean of the means (solid line).

One can also think about this from a different perspective. If one has a sample mean (generated from a sufficient amount of data), one can calculate estimates about what the distribution of the population of means would look like. More formally, this means that one can gauge the precision of a population mean estimate.
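
One common way to quantify that precision is the standard error of the mean (the sample standard deviation divided by the square root of the sample size). A brief sketch, reusing the 30-observation uniform sample assumed above:

import numpy as np

sample = np.random.uniform(0, 1, 30)             # observations from a single experiment
sem = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean

# an approximate 95% interval for the population mean
lower = sample.mean() - 1.96 * sem
upper = sample.mean() + 1.96 * sem
print(f"mean = {sample.mean():.3f}, 95% CI ≈ ({lower:.3f}, {upper:.3f})")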

The implications of our findings regarding the central limit theorem are:

  • The CLT allows one to conduct parametric hypothesis tests when data are not perfectly Normal. If the data are not perfectly ND, meaning they are skewed, contain outliers, have multiple peaks, or show other asymmetries, one can still evaluate hypotheses using parametric statistical tests (provided sufficient sample size, usually n ≥ 30) [4].
  • It allows one to use a sample to estimate the mean of an entire population. Since the sample mean becomes a better estimate of the population mean as the sample size increases, the CLT allows one to gauge the precision of this population mean estimate [3].

This means that if a dataset passes a normality test (see scipy.stats.normaltest), a whole host of parametric (as opposed to nonparametric) methods become available for frequentist hypothesis testing and evaluation: tests for differences, similarity, and correlation, along with the statistician’s tools for weighing evidence of an effect (e.g., confidence intervals, effect sizes, and the much-debated p-values) to determine statistical significance and, later, scientific importance.
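
For example, a minimal sketch of such a check with scipy.stats.normaltest, which tests the null hypothesis that a sample was drawn from a Normal distribution (the two example datasets here are made up for illustration):

import numpy as np
from scipy import stats

normal_data = np.random.normal(loc=0, scale=1, size=1000)
skewed_data = np.random.exponential(scale=1, size=1000)

# normaltest returns a test statistic and a p-value; a small p-value
# (e.g. < 0.05) is evidence against Normality
for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    statistic, p_value = stats.normaltest(data)
    print(f"{name}: p = {p_value:.4f}")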

The Influence of Sample Size on the Normal Curve

Imagine a perfect world with a population of normally distributed data. What do samples from this population look like? How does sample size relate to the distribution of these samples? This will be explored using a hypothetical study of the heights of space alien males. The data set contains all of the heights for all 10,000 space alien males in the universe. The distribution of heights can be observed in Figure 4, and is clearly Gaussian.

Figure 4: The population of alien male heights is Normally distributed. A histogram of 10,000 alien male heights and their respective probability density (y-axis). The 10,000 values were generated using numpy.random.normal() with a mean of 177.8 cm and a standard deviation of 12.7 cm. The red dashed line is the line of best fit.

Samples drawn from the alien population can be assessed for the effect of sample size on the sample distribution:

Figure 5: The shape of a sample distribution changes with sample size. Histograms of data with n number of values (20, 50, 100, and 500) sampled from a population of 10,000 pseudo-randomly generated heights (mean: 177.8 cm, st dev: 12.7 cm). Five samples were taken at each sample size and they are listed row-wise. This figure was inspired by the work of Altman & Bland [8].

Consider the impact of sample size on the observed distribution (Figure 5). At smaller sample sizes, many of the distributions do not look Gaussian. Even at n=100, distributions are not consistently Gaussian across the board. This could cause one to question whether a particular sample is normal.
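
A minimal sketch of how such samples can be drawn, assuming the population is generated as described in the Figure 4 caption:

import numpy as np

# the full population of 10,000 alien male heights (cm)
population = np.random.normal(loc=177.8, scale=12.7, size=10_000)

# draw five samples (without replacement) at each sample size, as in Figure 5
for n in (20, 50, 100, 500):
    samples = [np.random.choice(population, size=n, replace=False)
               for _ in range(5)]
    print(f"n = {n}: sample means = {[round(s.mean(), 1) for s in samples]}")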

When the distribution of data causes one to question the Normality assumption, one may conduct hypothesis testing with non-parametric methods. These methods are ‘distribution-free’, meaning they don’t assume the data follow any specific distribution. Instead of the mean, the median is used for comparison between groups.

Non-parametric methods can be used in cases where parametric methods cannot, including when the outcome is a rank or an ordinal scale, when there are extreme outliers, and when measurements are taken using imprecise methods.

Non-parametric tests make fewer assumptions about the distribution of data, but this comes at a cost. They are often less powerful than their parametric counterparts, making it preferable to use the latter when data are Normally distributed.
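
As an illustration of that trade-off (the groups and effect size here are invented for the example), the following sketch compares a parametric test that uses the mean (the independent-samples t-test) with its non-parametric, rank-based counterpart (the Mann-Whitney U test) on the same two samples:

import numpy as np
from scipy import stats

group_a = np.random.normal(loc=177.8, scale=12.7, size=30)
group_b = np.random.normal(loc=184.0, scale=12.7, size=30)

# parametric: compares means and assumes approximately Normal data
t_stat, t_p = stats.ttest_ind(group_a, group_b)
# non-parametric: compares ranks and makes no distributional assumption
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test p = {t_p:.4f}, Mann-Whitney U p = {u_p:.4f}")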

In the case of predictive modeling with small sample sizes, one could consider tree-based or neural-network-based methods as they don’t make any assumptions about the underlying distribution of the data.

A Look into the Prevalence of Gaussian Distributions

Gaussian distributions are the most common type of distribution because there are more processes in nature that will produce them than processes that will produce other distributions. One reason for their prevalence is that the scientific selection process tends to select for Normal distributions; another is that variables may interact in multiple ways to produce Normal distributions (Figure 6).

Figure 6: Two Ways Gaussian Distributions can Arise. The Gaussian distribution may arise from a combination of the scientific selection process and variable interactions. The scientific selection process involves formulating a hypothesis, a process that innately limits the scope of the data from many possible values to a select group(s) of interest. This selection of a subset of data eliminates some of the chaos and makes it more likely to observe a Normal distribution. Variable interactions are all the perturbations that introduce variability into the data, causing variation in height, reaction time, or whatever phenomenon is present. Variables can interact in a variety of ways to produce Normally distributed data, including interactions by addition, multiplication, and combinations of the two.

The Role of Variance

Data almost always has variance. We aren’t all the same height (there is variance in height) because there are underlying processes that affect height, causing it to vary. Common sources of variance include our explanatory variables, interactions between explanatory variables, randomness, and measurement error.

Given the example of adult human height, variance in height may originate from genetic influences, age, nutrition status during development, the time of day the individual was measured (one’s height decreases slightly throughout the day due to compression of the spine), and measurement error.

No matter the case, variance is almost always present in data. Gaussian distributions can result under all sorts of different processes that create variance and this is one reason for their prevalence.

Central tendency (the mean) is produced from cancelled variations. One of the reasons that Normal distributions are so prevalent is that there are often many sources of variance for a given phenomenon. This makes it likely that some of the variances will cancel one another, leaving ‘neutral’ data points with little to no net variation. These data points cluster in the center and make up the mean.

Likewise, data points at the tails occur less frequently because they require a large amount of variation, all pointing in the same direction. Getting a value far to the left of the mean requires that the majority of the variation affecting that value decreases it; any influence that increases the value would push it away from the tail and back towards the mean.

For an intuitive description of how variance can create normal distributions, see this sister article: Many Paths to the Center: How Variances can interact to create Normal distributions.

Normal from Addition

Processes with an additive effect can cancel one another out, producing sums that are Normally distributed [6].

To illustrate this process, imagine that we’re studying the heights of adult men in the United States. Pretend we have a sample of 10,000 adult male heights. Without any outside influence, the heights of these males would all be 200 cm.

Also imagine that there are ten variables that affect height, each working to increase or decrease height in an additive manner. Each of the ten variables will be a randomly generated number between -10 cm and +10 cm (note these variables are not generated from a ND). So any given height can be calculated as 200 cm plus the sum of the ten randomly generated additive perturbations.

To produce a distribution of heights, the above process will be conducted 10,000 times, with the following code snippet:

import numpy as np

heights = []

# Conduct the following loop 10,000 times:
for _ in range(10000):
    # the baseline height is 200 cm
    baseline_height = 200
    # randomly sample 10 perturbations between -10 and 10 cm
    height_perturbations = np.random.uniform(-10, 10, 10)

    # total = baseline + sum of perturbations
    total_height = baseline_height + height_perturbations.sum()

    # add the new height to the list of heights
    heights.append(total_height.round(2))

The resulting distribution of 10,000 heights that were created from additive perturbations:

As we can see above, when random variables interact in an additive manner, the results approximate a Normal distribution. This is because we are adjusting our baseline height (200 cm) by the sum of a list of random positive and negative numbers (our variance). Values cluster around a mean of 200 (our baseline) because the perturbations tend to cancel one another out; but they don’t always cancel completely, which is why values spread out around the mean.

Normal From Small Multiplicative Effects

Now imagine that the ten variables from the above experiment interact via multiplication instead of addition. This effect will be referred to as ‘small multiplicative’, meaning that we will multiply by factors that change the baseline value only slightly. Randomly generated factors in the range of 0.95 to 1.05 will be used. These small variations will perturb the baseline height but won’t change it in a drastic, completely unreasonable manner.

Example code:

import numpy as np

heights = []

# Conduct the following loop 10,000 times:
for _ in range(10000):
    baseline_height = 200
    # randomly sample 10 multiplicative factors between 0.95 and 1.05
    height_perturbations = np.random.uniform(0.95, 1.05, 10)
    total_height = baseline_height * height_perturbations.prod()
    heights.append(total_height)

Under these conditions the minimum and maximum possible heights are approximately 120 and 326, respectively. While the range of possible values isn’t balanced evenly around the mean, these values are still within the realm of possibility for heights.

Simulating the above conditions 10,000 times, the data produce the following distribution:

Processes that combine small multiplicative effects approximate the Normal curve. This is because multiplying small effects together is similar to addition.

For example:

200 + 10 = 210
200 * 1.05 = 210

So small random effects will produce something akin to a Normal distribution [6]. However, due to compounding multiplicative effects, the curve is not quite symmetrical.
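
A quick numerical check of that claim (not part of the original experiments): for factors close to one, a product behaves approximately like one plus the sum of the small offsets.

import numpy as np

factors = np.random.uniform(0.95, 1.05, 10)

product = factors.prod()                   # the true multiplicative effect
additive_approx = 1 + (factors - 1).sum()  # treating each factor as a small addition

# for effects this small, the two values are typically very close
print(f"product = {product:.4f}, additive approximation = {additive_approx:.4f}")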

Normal From Large Multiplicative Effects

The same experiment will be conducted once again. This time, the effect of ten large multiplicative effects (factors ranging from 0.75 to 1.25) on height will be explored.

The upper histogram represents the resultant heights, while the lower histogram represents the log-transformed heights.
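
A minimal sketch of this third simulation and the log transform, assuming the same structure as the earlier loops:

import numpy as np

heights = []
# Conduct the following loop 10,000 times:
for _ in range(10000):
    baseline_height = 200
    # ten large multiplicative effects between 0.75 and 1.25
    height_perturbations = np.random.uniform(0.75, 1.25, 10)
    heights.append(baseline_height * height_perturbations.prod())

heights = np.array(heights)
log_heights = np.log(heights)  # the log-transformed heights shown in the lower histogram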

It is readily apparent that large multiplicative effects contort the height distribution, producing a higher frequency of values immediately below the mean, while values above the mean are less dense and extend over a much wider range. This asymmetry is due to compounding multiplicative effects that cause exponential growth and decay (like compound interest): large multiplicative effects can compound, producing a large value that grows even larger, whereas effects that shrink the original height have a diminishing impact as the value shrinks. On the log scale the multiplicative effects become additive, which is why the log-transformed heights (lower histogram) look much more symmetrical. This can also be observed in the histogram of small multiplicative effects, although to a much smaller extent.

Normal Distributions can Emerge from the Scientific Selection Process

The above methods describe different ways random variations can interact to produce a Normal curve. However, there is another process that results in Normally distributed data with important implications.

Scientific hypotheses significantly influence the nature of our data. Normality can also emerge from experimental conditions that constrain the scope of the data [7]. For example, the heights of people tend to be Normally distributed, so it wouldn’t be surprising if a hypothesis concerning the heights of adult American men found that height was Normally distributed.

Now imagine studying the heights of all living things, as first considered by J. Simon [7]. Such a study may include many creatures, from ants to giraffes. Many decisions need to be made before we start the data acquisition process. What counts as living? Viruses? Humans are living creatures composed of eukaryotic cells; how about each of those 200+ different cell types? Intestinal bacteria? How does one measure the height of a fungus: is it better to measure the span of the organism, or from the ground up?

When formulating a scientific hypothesis, a lot of additional questions come along for the ride, and these questions affect the nature of the data. Widening the scope of the analysis to all living beings makes it unlikely that heights take on a Normal curve. Scientists select and tune hypotheses in ways that constrain the question, making a Normal curve more likely; sometimes limitations of the available data have the same effect. The take-away is that experimental designs can and do influence the nature of the data.

Conclusion

The Normal (Gaussian) distribution can be characterized by two qualities: central tendency and variance. The central limit theorem, the driving force behind the usefulness of the ND, allows us to use parametric statistical tests even when data are not perfectly Normal. It also allows us to estimate the mean of a population and to gauge the precision of that estimate.

Data drawn from a normal distribution do not always appear normal, especially at small sample sizes. In these cases, alternative methods may be considered for statistical hypothesis testing.

The Normal distribution’s prevalence is due in part to scientific selection processes that restrict the scope of analysis, making the ND more likely to be observed. Additionally, there are multiple types of variable interactions that will yield a Normal distribution.

Thanks for reading!

Get Connected: Join Data Science USA on Facebook

Github: https://github.com/jsolson4/Properties-of-the-Normal-Distribution

Acknowledgements: A sincere thank you to Charin Polpanumas (https://github.com/cstorm125) and to Kevin Hulke of Chippewa Valley Technical College Department of Mathematics for their review of this article. Thank you to Richard McElreath for inspiring this piece.

References:

1. Ioffe, S. & Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs] (2015).

2. Central Limit Theorem. http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html.

3. Central Limit Theorem Explained. Statistics By Jim http://statisticsbyjim.com/basics/central-limit-theorem/ (2018).

4. Taylor, C. K. What Is So Important About the Central Limit Theorem? ThoughtCo. https://www.thoughtco.com/importance-of-the-central-limit-theorem-3126556.

5. Public conversation, Peter Scully, Ph.D., February 15th, 2020.

6. McElreath, R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. (CRC Press, 2016).

7. Simon, J. L. What Does the Normal Curve “Mean”? J. Educ. Res. 61, 435–438 (1968).

8. Altman, D. G. & Bland, J. M. Statistics notes: The normal distribution. BMJ 310, 298 (1995).
