The Normal Distribution, Confidence Intervals, and Their Deceptive Simplicity

Brayton Hall
10 min read · Aug 16, 2020


[Header image: the Normal distribution, from LibreTexts]

Sometimes the simplest refreshers are best, and when it comes to statistics, concepts like parameter, statistic, z-score, t-test, Student’s t-distribution, standard deviation, Chebyshev’s rule, and confidence interval can tend to merge into disorienting word salads, much like this sentence itself.

The purpose of this post is to provide a quick refresher on these basic concepts for others (myself) when I inevitably forget how exactly to interpret a confidence interval, by remembering never to say there is a 95% chance a specific interval contains the true mean, but to instead say Oh god, when will I remember to just keep my mouth shut around statisticians.

It may also be correct to say that 95% of the time the interval will capture the true mean, but let’s get into that below.

Generally speaking, statistics is often semantics, and the English (or whatever human language) interpretation of a result often hinges heavily on connotations and assumptions present in the framing of a problem. See my post on probability via a Monty Hall-type problem for the probability version of this post.

The Basics

First, let’s address some of those salad ingredients.

Normal (or Gaussian) distribution. The Normal distribution, in short, can be described by the function:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

And it looks like the blue-green-yellow picture at the top of this post.

Don’t worry about where it comes from. Don’t worry about how it’s derived. In fact, don’t worry about using the formula, as it’s sufficient to know that it merely exists to give the shape to the thing we call a bell curve, another name for the Normal distribution. I think there’s often a confusing lack of transparency surrounding how the Normal distribution is taught, in that why it’s used has nothing to do with the probability density function (PDF) which happens to describe it.
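If you do want to see the formula breathe a little, here is a minimal sketch of that PDF in plain Python, written straight from the standard definition (the function name is mine, not from any library):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the Normal distribution at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# The curve peaks at the mean; for the standard Normal (mu=0, sigma=1)
# that peak height is 1/sqrt(2*pi), roughly 0.3989.
print(normal_pdf(0))
```

Notice there is nothing probabilistic in the code itself; it's just a curve. The probability comes from areas under it, which is where the rest of this post goes.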

The best way to think about a Normal distribution is as a pseudo-histogram of an infinite number of samples of some random phenomenon, like rolling dice. Take a look at the following simple histogram of outcomes from rolling two six-sided dice:

A histogram of the sums of two six-sided dice

This distribution is a histogram which displays every possible sum of outcomes of two six-sided dice on the x-axis, and the frequency of occurrence on the y-axis. For example, there is only one possible way to get the number ‘2’ with two dice: by rolling two ones, the proverbial snake eyes. Out of 36 possible combinations of dice outcomes, this is represented as 1/36 on the y-axis. There are many more ways to get sums of 6 and 8, and even more ways to get a sum of 7. This variety of ‘ways’, or outcomes, is the essence of the Normal distribution.
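That counting argument is easy to verify by brute force. A quick sketch, assuming two fair six-sided dice:

```python
from itertools import product
from collections import Counter

# Enumerate all 36 equally likely outcomes of two six-sided dice
# and count how many ways each sum can occur.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(total, sums[total], "/ 36")
```

Running this reproduces the histogram: exactly one way to roll a 2, six ways to roll a 7, and the counts taper symmetrically on either side.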

To repeat: the Normal distribution is simply the logical conclusion of sampling a phenomenon an infinite number of times and displaying it as a histogram. The difference here, and the main intuitive leap, is that a Normal distribution deals with continuous variables, as opposed to discrete variables. Dice outcomes are limited to [1–6], but human height presumably lies on the real number line. That is, humans may be 6 feet tall exactly, or 6.1 feet tall, or 6.314159… feet tall.

This leap from discrete to continuous variables is the main source of the headache for most students learning the Normal distribution, especially when the formula f(x) above pops out of nowhere from pure math and supposedly relates to things like human height, toenail length, and frog croaking time.

Take a look at the Normal distribution again, and take a guess at what the percents and symbols mean:

The Normal Distribution

Taking human height as an example, these percents would mean that 68% of people fall within the blue section, 95% of people fall within the green and blue section, and 99.7% of people fall within the yellow and green and blue section (there is a tiny bit of white at either end accounting for the remaining .3%). The character σ, called sigma, represents these intervals known as standard deviations.
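Those 68/95/99.7 percentages aren't arbitrary; they fall straight out of the area under the curve. Python's standard library can confirm them (statistics.NormalDist is available from Python 3.8 onward):

```python
from statistics import NormalDist

nd = NormalDist()  # standard Normal: mean 0, standard deviation 1

# Area under the curve within k standard deviations of the mean
for k in (1, 2, 3):
    area = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} sigma: {area:.4f}")
# within 1 sigma: 0.6827
# within 2 sigma: 0.9545
# within 3 sigma: 0.9973
```

This is the famous 68–95–99.7 rule, computed rather than memorized.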

Standard deviation. The standard deviation is a measure of how far, on average, any particular data point in a set of data lies from the mean of that data. This is closely related to variance, but the standard deviation is more informative for reasons explained below. Variance is the average of the squared differences between each point in the dataset and the mean. If one pile of apples has 4 apples, another pile has 5 apples, and a final pile has 9 apples, then the mean = (4 + 5 + 9)/3 = 6. The sum of squares of the difference of each pile’s apple-count from the mean, or 6, is:

((6–4)² + (6–5)² + (6–9)²) = (4 + 1 + 9) = 14

Dividing by the number of piles gives the variance: 14/3 ≈ 4.67. But so what? This number carries no relative significance compared to other types of data sets, since it lives in squared units. How would we compare a variance in squared apples to a variance in squared inches in a human height dataset? This is where standard deviation comes in. By taking the square root of the variance, we get the standard deviation (std), back in the original units: √4.67 ≈ 2.16 apples. By subtracting the mean from any given data point and dividing the result by the standard deviation, we end up with what’s called a z-score, which is the number of standard deviations that point lies from the mean.
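Here is the apple arithmetic, step by step, in plain Python (using the population-variance convention of dividing by the number of piles):

```python
import math

data = [4, 5, 9]  # apples per pile
mean = sum(data) / len(data)                     # (4 + 5 + 9) / 3 = 6.0
squared_diffs = [(x - mean) ** 2 for x in data]  # [4.0, 1.0, 9.0]
variance = sum(squared_diffs) / len(data)        # 14 / 3, about 4.67
std = math.sqrt(variance)                        # about 2.16

# z-score of the 9-apple pile: how many standard deviations it
# lies above the mean.
z = (9 - mean) / std
print(mean, variance, std, z)
```

The z-score at the end is the whole payoff: it converts "3 apples above the mean" into a unitless "about 1.39 standard deviations above the mean", which is comparable across datasets.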

Because a z-score is the conversion of any data point into a format relative to its own standard deviation and mean, this results in all z-scores falling into the same grand, relativized scope of comparison via…

The Normal distribution!

This is wild and unintuitive. Truly.

Why in the world do z-scores, the simple act of converting data points into standardized numbers by subtracting the mean and dividing by the standard deviation, have anything to do with the probability density function (PDF) for the Normal distribution?

All I’ll say here, for the sake of brevity and simplicity, is that the Normal distribution fundamentally involves circles and the fact that pi is the same for all circles: the pi in the PDF comes from the Gaussian integral that normalizes it, which is evaluated by moving to polar coordinates. And because the standard deviation is built from squared differences of each data point from the mean, the value of pi is implicitly involved in the standardization of all data sets through z-score conversion.
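Pi's role is easiest to see in the normalization: the total area under e^(−x²/2) is exactly √(2π), which is precisely what forces pi into the PDF. A crude numerical check (a trapezoidal sum over a wide interval; an illustration, not a proof):

```python
import math

# Approximate the integral of exp(-x^2 / 2) over [-10, 10] with a
# trapezoidal sum; the tails beyond +/-10 are vanishingly small.
h = 0.001
xs = [-10 + i * h for i in range(20001)]
ys = [math.exp(-x * x / 2) for x in xs]
integral = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

print(integral, math.sqrt(2 * math.pi))  # both are about 2.5066
```

Dividing the bell-shaped curve by that √(2π) is exactly what makes the total area 1, i.e., what makes it a probability distribution.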

The fact that the infinite sampling of all continuous data sets converges to the Normal distribution is due in part to the Central Limit Theorem, which I will again avoid expositing for the sake of brevity and simplicity. However, as you can see, things have gotten quite complex from just a few deceptively simple acts. All we’ve actually done is squared some differences involving means and data points, and then added them (variance), taken the square root of the variance (standard deviation), and then divided any given data point by the standard deviation to obtain a z-score!

This is why I said, earlier, that there’s often a confusing lack of transparency surrounding how the Normal distribution is taught, in that why it’s used has nothing to do with the probability density function (PDF) which happens to describe it. The fact that the Normal distribution is the logical conclusion of infinite sampling, and that its mathematical derivation is extremely involved and unintuitive, explains the bizarre and almost magical appearance of the Normal distribution, as well as the frustration of students trying to understand how it relates to the relatively simple concepts of mean, variance, and standard deviations.
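The Central Limit Theorem is easier to watch than to derive. Below is a small simulation, with a fixed seed so it's reproducible: individual dice rolls are flat and uniform, but the means of many rolls pile up in a bell shape around the true mean of 3.5.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed for reproducibility

# Each trial: average 30 rolls of a fair six-sided die.
sample_means = [
    mean(random.randint(1, 6) for _ in range(30))
    for _ in range(2000)
]

# The sample means cluster tightly around 3.5, with a spread close
# to the single-roll std divided by sqrt(30).
print(mean(sample_means), stdev(sample_means))
```

Plot a histogram of sample_means and you'll see the bell curve emerge, even though no single die roll is remotely Normal.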

Making the Jump to Confidence Intervals

Finally, we’ve reached the titular topic.

A confidence interval is simply an interval estimate of a true population parameter (such as the mean) based on a sample. We’re just going backwards by taking a z-score of a sample and then making a guess about the relative likelihood of whether or not a particular interval on the Normal distribution contains the mean!

This is, again, deceptively simple.

There is actually quite a philosophical leap surrounding confidence intervals, since we are making an assumption that the population in question can, in fact, be described by the Normal distribution. The underlying faith in the Central Limit Theorem is what makes estimation possible. I say faith, because applying the Central Limit Theorem rests on the law of large numbers, which implicitly invokes the concept of Almost Surely from probability theory.

Everything about confidence intervals involves this fundamental leap, which is not itself a statistical concept. I.e., one can never provide a statistical measure of how likely it is that the Central Limit Theorem applies in this particular case. That’s why the CLT is such a crucial assumption, and why statistics gives people headaches: it’s such a fraught alchemical combination of mathematically-derived functions forged with fundamentally wild assumptions which are philosophical in nature, and which quickly descend into further theories of natural law and induction, most notably David Hume’s problem of induction.

David Hume, God of Headaches

But that’s a different topic.

Let’s finally look at how to construct a confidence interval.

The formula for a confidence interval:

x̄ ± z(s/√n)

That’s it! We’ve already talked about everything involved in this formula. The x̄ is the mean of a sample, z is the z-score for the chosen confidence level, s is the standard deviation of the sample (though we should use σ if we happen to know the population standard deviation, which we often don’t), and n is the size of the sample.

Again, we’re simply going backwards from a z-score to the population mean via the Normal distribution. Earlier, we noticed that the means of repeated samples converge to the Normal distribution. So, if we take a sample mean, we can make a pretty good guess about how close that sample mean is to the true mean! (taking into account the fundamental assumption that our population is, in fact, described by the Normal distribution).

First, we decide what level of confidence we want our estimation to involve. The standard trio is 90%, 95%, and 99%. We then subtract this confidence from 100% and call it alpha, or α, after converting into decimal format. So for a 95% CI, we have α =1.00 - .95 = .05. We then split α into two: α/2, since our confidence interval will be symmetric around the presumed true mean: .05/2 = .025. The standard Stats 101 strategy at this point is to look up this value in the completely arcane table of High Magic, the dreaded z-table:
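If you'd rather skip the table entirely, the lookup is one line of Python: the inverse CDF of the standard Normal evaluated at 1 − α/2 (statistics.NormalDist is in the standard library from Python 3.8):

```python
from statistics import NormalDist

confidence = 0.95
alpha = 1 - confidence                   # 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # inverse CDF at 0.975

print(round(z, 2))  # 1.96
```

Swap in 0.90 or 0.99 for confidence and you get the other two members of the standard trio (about 1.64 and 2.58).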

[A standard z-table]

If α/2 = .025, then that means the area under the curve of the Normal distribution to the left of our critical value will be .975, which we find at the z-score of 1.96 in the table above (the columns represent the hundredths place). I call this table arcane and mysterious because, once again, it is the standard Stats 101 strategy not to emphasize that all of these values are literally just areas under the probability density function f(x) from earlier.

This is the same computation performed when you call a Normal CDF function in Python or on a calculator (each table entry is the cumulative area under the PDF), but it looks scary for Intro students, and so instructors often let the z-table itself become the One Source of True Knowledge since ‘you won’t be expected to calculate those values by hand on the midterm’, and thus sounds the death knell for didactic integrity.
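You can reproduce any z-table entry yourself; each one is just the cumulative area under the curve up to a given z. A quick sketch:

```python
from statistics import NormalDist

nd = NormalDist()  # standard Normal

# The z-table entry for z = 1.96: the area under the curve to its left.
print(round(nd.cdf(1.96), 4))  # 0.975
```

The entire dreaded table is just this one function evaluated at a grid of z values.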

Moving on:

Once this z-value is obtained, we multiply it by the standard deviation of the sample (or σ if we know it) divided by the square root of n (the sample size), and in the process we denormalize the z-score back into the context of the original data, such as inches of human height.

For example, if a sample of 50 human heights resulted in a mean of 70 inches, with a sample standard deviation of 2, we use the CI formula:

70 +/- 1.96(2/√50) = 70 +/- 1.96(2/7.07) ≈ 70 +/- .55 = (69.45, 70.55)
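The same worked example in code, using the numbers above (n = 50, sample mean 70, sample standard deviation 2):

```python
import math

n = 50
sample_mean = 70.0
sample_std = 2.0
z = 1.96  # critical value for 95% confidence

# Margin of error: z * s / sqrt(n), denormalized back into inches.
margin = z * sample_std / math.sqrt(n)
ci = (sample_mean - margin, sample_mean + margin)

print(round(ci[0], 2), round(ci[1], 2))  # 69.45 70.55
```

Note how small the margin is: dividing by √n means larger samples shrink the interval, which is why n = 50 already pins the mean down to about half an inch either way.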

We can then state with 95% confidence that the interval (69.45 inches, 70.55 inches) captured the true population height mean. We are not supposed to say that there is a 95% chance that the true population mean lies between 69.45 inches and 70.55 inches, because an entirely different sample mean at 95% confidence could result in an entirely different interval. The true population mean could be hiding at the lower end of this interval, or the higher end, but there’s no way to tell without taking another sample. A confidence interval says far more about the variation between samples than about the population mean itself, since the population mean never changes.

Grammar is quite important here, and fine distinctions in this phrasing are often the cause of pedantic scorn, since it is important not to imply that estimations are capable, just from a single sample, of making population parameter-level claims. We can only make sample-level claims, and interpret those claims on a sample-to-sample basis.

As always, you’re welcome to instead say Oh god, when will I remember to just keep my mouth shut around statisticians, but hopefully after reading this post you’ll be slightly more capable of making confident claims about, well, your confidence.

Thanks for reading!
