A Beginner-Friendly, yet Mathematical Introduction to Confidence Intervals

Irene Markelic
19 min read · Dec 19, 2023


Photo by Matthew Brodeur on Unsplash

This is a follow-up to an article which I recently wrote, where I tried to explain the concept using structure and color. In this article I give much more detailed introductory information. As the title says, it is beginner-friendly, as it almost starts from scratch. You only need to know what random variables and probability distributions are. So let's get started. (BTW the picture has nothing to do with this text; I only liked the beach atmosphere and hoped for some clicks.)

Here is a rough overview of what is to come: The entire idea of confidence intervals is that you want to make an estimate of an unknown quantity, a single number, which is called a point estimate. But then you go a step further and specify an interval around that estimate such that you are relatively sure (or confident) that the sought value lies within this interval. This is called a range estimate, and the interval is the confidence interval, of course. But how do you come up with the point estimate and the interval? And how do you quantify "relatively sure"? How sure? Statisticians developed a rigorous framework for this. You can compute an estimate of something of which you only have partial information. Intuitively, the less information you have, the less accurate your estimate will be. To formalize this idea, the definitions of a few concepts are necessary. This is like a new vocabulary to learn. The first things we need to get out of the way are the concepts of mean, variance, population, and sample. These will be tackled in the first few subsections. If you are familiar with them, feel free to skip those parts. Then we will talk about sampling distributions, which is THE idea we need to understand to achieve our goal. Another ingredient we need is the (standard) normal distribution, so we also spend a bit of time on that. And eventually we will look at the sampling distribution of the sample mean, to finally introduce confidence intervals. Figure 1 indicates our roadmap.

Figure 1: Rough roadmap to confidence intervals. We will look at the left square only, the sampling distribution of the sample mean; a follow-up article that covers the second square, the sampling distribution of the sample variance, can be found here.

0.1
Descriptive Measures

The probability distribution of a random variable (r.v.), given by its probability density function (pdf) or probability mass function (pmf), can be complex. Sometimes you want to concisely summarize it, for example with a single number. There are various such quantitative, descriptive measures; the most common ones are measures of central tendency and dispersion. Famous measures of central tendency are mean, median, and mode, and famous measures of dispersion are variance, standard deviation, and interquartile range. (You can find more such measures here and here.) Describing a pdf, or data sets in general, is the subject of descriptive statistics.

0.1.1
The Expected Value, or Mean, or Expectation

We need to understand the expected value, which is also called the mean, or just expectation. If you are familiar with this concept, feel free to skip this subsection. We already know that the expected value is a simple, descriptive summary of a pdf/pmf. As a measure of central tendency it tells us which outcome we would observe on average if we ran an experiment an infinite number of times. The expected value is defined as a function of a r.v., e.g. X. If X is discrete it is usually written as:

E[X] = Σⱼ xⱼ · P(X = xⱼ)

Expected value for a discrete random variable.

where P(X = xⱼ) denotes the probability that X takes the value xⱼ. For a continuous X the sum changes to an integral:

E[X] = ∫ x · f_X(x) dx

Expected value for a continuous random variable.

where f_X(x) denotes the corresponding pdf. The computed expected value is often denoted as µ_X, where the subscript is just a reference to the corresponding r.v..

Numerical example expected value: If the r.v. X describes the result of rolling a fair 6-sided die, its distribution is discrete, since it can only take on the integer values 1 through 6, and the probability of each number occurring is 1/6. Therefore,

E[X] = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 21/6 = 3.5

Numerical example for the computation of an expected value.

Code example expected value: We write the above calculation in Python as follows:

import numpy as np

# the possible outcomes of the die: 1, 2, ..., 6
X = np.arange(1, 7, 1)
# each outcome has probability 1/6
print((1/6) * X.sum())

Alternatively, we can define a custom pmf with the scipy.stats module and then let scipy compute the expected value for us. The following code is copied (and adapted) from the scipy docs.

import numpy as np
from scipy import stats

#the support, i. e. the possible values X
#can take, goes from 1 to 6
xk = np.arange(1,7,1)
#the probability for all 6 events is 1/6
pk = (1.0/6.0,)*6
custm = stats.rv_discrete(name='custm', values=(xk, pk))
print(custm.expect())

Expectation is linear. This means (without proof):

E[X + Y] = E[X] + E[Y]    and    E[cX] = c · E[X]

Linearity of Expectation.

where X and Y are r.v.s, and c denotes a constant.
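To make linearity tangible, here is a minimal simulation sketch (the constant c = 3 and the number of simulated rolls are arbitrary illustrative choices). It checks that the average of cX + Y matches c·E[X] + E[Y] for two simulated fair dice:

import numpy as np

rng = np.random.default_rng(0)

# simulate a large number of rolls of two fair dice
n = 1_000_000
X = rng.integers(1, 7, size=n)
Y = rng.integers(1, 7, size=n)
c = 3

# E[cX + Y] should be close to c*E[X] + E[Y] = 3*3.5 + 3.5 = 14
print(np.mean(c * X + Y))
print(c * X.mean() + Y.mean())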

0.1.2
The Variance and Standard Deviation

Here is the definition of variance:

Var(X) = E[(X − E[X])²]
       = E[X² − 2X·E[X] + (E[X])²]
       = E[X²] − 2E[X]·E[X] + (E[X])²
       = E[X²] − (E[X])²

In the second step we expand the square (binomial theorem), and in the remaining steps we use the linearity of expectation from above. As you can see, variance is defined in terms of expectation. It is the expected value of the squared distance of X to its expected value. The variance describes in a single number how spread out a distribution is. Variance is not linear. Thus,

Var(X + Y) ≠ Var(X) + Var(Y)

if X and Y are dependent. But

Var(X + Y) = Var(X) + Var(Y)

if X and Y are independent of each other. The variance of a r.v. multiplied by a factor is that factor squared times the variance of the r.v.:

Var(cX) = c² · Var(X)

Also, adding a constant, c, does not change the variance:

Var(X + c) = Var(X)

I omit the proofs for the above statements. Please consult any book about probability for them; my favorite is Blitzstein and Hwang [1].
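To make these properties a bit more tangible, here is a small simulation sketch (the fair die as population and the constant c = 3 are arbitrary choices) that checks Var(cX) = c²·Var(X) and Var(X + c) = Var(X) numerically:

import numpy as np

rng = np.random.default_rng(0)

# a large simulated sample of fair die rolls
X = rng.integers(1, 7, size=1_000_000).astype(float)
c = 3.0

# Var(cX) should be close to c^2 * Var(X)
print(np.var(c * X), c**2 * np.var(X))
# Var(X + c) should be close to Var(X): a constant shift does not spread the data
print(np.var(X + c), np.var(X))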

Numerical example variance: Again, we use the die example from above. Thus the r.v. X describes the result of rolling a fair 6-sided die, its distribution is discrete (it can only take on the values 1 to 6), and the probability of each number occurring is 1/6. To compute Var(X) we need E[X] and E[X²]. We already know that E[X] = 3.5. What is missing is E[X²]. This can be computed using LOTUS (the law of the unconscious statistician), which says that we can compute the expected value of a function g of a r.v. X by plugging that function into the formula for the expected value. So

E[X] = Σⱼ xⱼ · P(X = xⱼ)

becomes:

E[g(X)] = Σⱼ g(xⱼ) · P(X = xⱼ)

(Note the g(xⱼ) in place of xⱼ, compared to the formula for the expected value above.) In our example the function of X is X², so we need to plug this in here:

E[X²] = 1²·(1/6) + 2²·(1/6) + … + 6²·(1/6) = 91/6 ≈ 15.17

Now we can compute the variance:

Var(X) = E[X²] − (E[X])² = 91/6 − 3.5² ≈ 15.17 − 12.25 ≈ 2.92

Code example variance: Assuming we extend our Python script from the expected value example, we can write the above calculation in Python as follows:

# E[X^2]: plug x^2 into the expected value formula (LOTUS)
E_of_X_squared = np.power(xk, 2).sum() * (1/6)
# (E[X])^2 with E[X] = 3.5 from above
EX_squared = 3.5**2
# Var(X) = E[X^2] - (E[X])^2
var_x = E_of_X_squared - EX_squared
print(var_x)
# compare to scipy's output
print(custm.var())

The standard deviation is simply the square root of the variance.
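Continuing the script above (this assumes var_x and custm are still defined), here is a quick check of that relationship:

# the standard deviation is the square root of the variance
print(np.sqrt(var_x))
# scipy's value, for comparison
print(custm.std())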

0.2
Population and Sample

In inferential statistics you often deal with the problem that you do not know the underlying pdf/pmf of a r.v. of interest. You are only able to observe, or measure, a few values for X. For example, you want to know the average body height of women between 20 and 50 years old. You probably cannot measure all women on this planet in this age range. What you can do is measure a few women. This subset is called a sample. The entirety of all women on the planet in this age range would be called the population. So a population is the total collection of interest. The population has a pdf, which is usually unknown. This pdf has certain parameters, for example mean and variance. The goal is to estimate the values of these population parameters from the sample that you took. Those estimates are called statistics. Before we tackle this, we take a closer look at what we can do with a sample.

(Test yourself: What is a population? What is a sample? What is the difference between a parameter and a statistic?)

(Answer: The population is the total collection of interest. A sample is a subset of a population. A parameter always refers to the population. It is a measurable quantity of the population pdf. A statistic is also a measurable quantity, but it is computed from the sample.)

0.2.1
Sample Mean

You can compute the sample mean from the sample you took. Note that to compute the expected value of a r.v. (see above) we needed the pmf/pdf. As just stated, we do not have this information, therefore we cannot use that formula to compute the sample mean. The best we can do is the following (what "best" means will be explained in a bit):

X̄ = (1/n) · Σⱼ₌₁ⁿ Xⱼ

where n is the sample size, and the Xⱼ are interpreted as i.i.d. r.v.s. This only means that we interpret each value that we observe from the experiment as a r.v., and that one observation does not give us any information about the other observations we may make (that is the independence assumption), but we do believe that all observations come from the same process and as such are all subject to the same underlying probability distribution (this is the "identically distributed" idea). This is the same as the arithmetic mean, which you are probably familiar with. So if you had taken the following measurements in cm, [160, 168, 159, 176, 160], for a sample size n = 5, the sample mean would be:

X̄ = (160 + 168 + 159 + 176 + 160) / 5 = 164.6

Note that the sample mean is (usually) abbreviated as X̄ and the (population) mean as µ_X or simply µ. So even the symbols tell you whether you are dealing with a population or just a sample. (Note, the abbreviations might slightly differ depending on the author. In the book Lock et al. [2] the abbreviation x̄ is used for the sample mean. In the book Blitzstein and Hwang [1] X̄ is used (with a capital X), and in the book Ott and Longnecker [3] the symbol ȳ. So do not get too focused on the exact letter, but rather on the fact that different symbols will be used for parameters and statistics.)
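As a minimal sketch, the same sample mean can also be computed directly with numpy:

import numpy as np

# the five height measurements (in cm) from the example above
X = np.array([160, 168, 159, 176, 160])
print(X.mean())
# 164.6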

0.2.2
Sample Variance

Analogous to the concept of the sample mean, there is also the concept of the sample variance. Resuming the example of measuring the body height of women from above, we now want to estimate the variance of the population based on the information we gained from the sample. The definition of the sample variance is:

Sₙ² = (1/(n − 1)) · Σⱼ₌₁ⁿ (Xⱼ − X̄)²
Again, the Xⱼ denote i.i.d. r.v.s. (You probably wonder where this formula comes from. Statisticians work on finding good estimation procedures for population parameters. This formula is one of many, and it comes with a few useful properties and some not so useful ones. For example, E[Sₙ²] = σ², but unfortunately E[√(Sₙ²)] ≠ σ. In words: this formula is an unbiased estimator for the population variance (yay!), but its square root is not an unbiased estimator for the population standard deviation. :( )
Recall that our measurements were X = [160, 168, 159, 176, 160]. We already computed the sample mean for the body-height example above, X̄ = 164.6. We can now compute Sₙ² for this example:

Sₙ² = ((160 − 164.6)² + (168 − 164.6)² + (159 − 164.6)² + (176 − 164.6)² + (160 − 164.6)²) / 4 = 215.2 / 4 = 53.8

In Python we can compute the sample variance with numpy by setting the parameter ddof=1. If you don't set this parameter, numpy computes the population variance:

X = np.array([160, 168, 159, 176, 160])
# ddof=1 divides by n-1 and gives the sample variance
print(X.var(ddof=1))
# 53.8

The sample standard deviation is simply the square root of the sample variance.
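Extending the snippet above, the sample standard deviation can be obtained either by taking the square root explicitly or by calling numpy's std with ddof=1 (a small illustrative check):

# sample standard deviation = square root of the sample variance
print(np.sqrt(X.var(ddof=1)))
# the same value, computed directly
print(X.std(ddof=1))
# roughly 7.33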

Distinguishing between the population and the sample is important. Statisticians not only use different words for this (parameter and statistic), but also different symbols. For example, the expected value or mean of a (population) distribution is usually denoted µ, but the sample mean is denoted X̄. Below is a table taken from p. 162 of the book Lock et al. [2] that nicely shows the distinctions.

Figure 2: Visual description of the relationship between population and sample (table taken from p. 162 of Lock et al. [2]).

0.3
Sampling Distributions

The idea of sampling distributions lays the theoretical foundation of what is to come, so this is an important subsection. Let's focus on the sample mean, here denoted by X̄. When you take one sample from a population you can compute the sample mean for that sample. If you take another sample of the same size from the same population you can compute the sample mean of that sample, and it probably won't be exactly the same value as the first. For example, if you measure 3 randomly chosen women and compute their mean height, and then again randomly choose 3 women and compute their mean height, these two mean heights are probably different. If you take many such samples, keep computing their sample means, and plot how often each computed sample mean value occurs, you would see that the shape of this plot is almost symmetrical, centering around some particular value. This is not a coincidence but is explained by the central limit theorem, which says that as the sample size grows the sampling distribution of the sample mean approaches a normal distribution. In other words, the statistic (here the sample mean) you computed from various samples is distributed over a range of possible values. Such a plot is called the sampling distribution of a statistic, here the sampling distribution of the sample mean. (An article about the sampling distribution of the sample variance can be found here.) Again: a sampling distribution is the distribution of the values you obtain from repeatedly taking samples of the same size from a population and computing the same statistic for each sample.

Since we assume that the samples are chosen randomly, i.e. the value of the sample statistic includes randomness, we can model a sample statistic as a r.v.! Or you could argue that the measurements of a sample are all i.i.d. r.v.s, and since X̄ is a function of those r.v.s it is a r.v. itself. Now, this is where things can get confusing if you don't pay attention, so slow down here! A statistic, e.g. the sample mean, is a proper r.v. and we can assign a distribution to it: the sampling distribution of the statistic of interest. Of course, we can compute descriptive measures like mean, mode, variance etc. for this distribution, too! For example, we can compute the mean of the sampling distribution of the sample mean, E[X̄], and we can also compute the variance of the sample mean, Var(X̄). Take a second to digest this.
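To see a sampling distribution emerge, here is a minimal simulation sketch. The "population" of women's heights is entirely made up (normal with mean 165 cm and standard deviation 7 cm); the point is only to illustrate repeatedly drawing samples of the same size and collecting their sample means:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# hypothetical population of heights: N(165, 7^2); the parameters are made up
mu, sigma = 165, 7
n = 3               # size of each sample
n_samples = 10_000  # how many samples we draw

# draw many samples of size n and compute each sample's mean
sample_means = rng.normal(mu, sigma, size=(n_samples, n)).mean(axis=1)

# the histogram approximates the sampling distribution of the sample mean
plt.hist(sample_means, bins=50)
plt.xlabel("sample mean")
plt.ylabel("frequency")
plt.show()

# descriptive measures of this sampling distribution
print(sample_means.mean())  # close to mu
print(sample_means.var())   # close to sigma^2 / n (derived in the next subsection)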

0.3.1
Mean and Variance of the Sample Mean

The sample mean is:

X̄ = (1/n) · Σᵢ₌₁ⁿ Xᵢ
with Xᵢ being the r.v. representing the i-th measurement. For the expected value of the sample mean we get:

E[X̄] = E[(1/n) · Σᵢ Xᵢ] = (1/n) · Σᵢ E[Xᵢ] = (1/n) · n · µ = µ
In the second step we used the fact that expectation is linear, and in the last two steps we used the assumption that all Xᵢ are identically distributed, so each has mean µ. Therefore, the expected value of the sample mean equals the population mean. Again: the mean of the sampling distribution of the sample mean corresponds to the mean of the population! Thus, to estimate the population mean µ from a sample, we can simply use the sample mean. In math: E[X̄] = µ. Because of this we say that the sample mean is an unbiased estimator (hence I earlier called it our "best guess"). Unbiased here means that the expected value of the estimator equals the population parameter.

Now we look at the variance of the sample mean:

Var(X̄) = Var((1/n) · Σᵢ Xᵢ) = (1/n²) · Var(Σᵢ Xᵢ) = (1/n²) · Σᵢ Var(Xᵢ) = (1/n²) · n · σ² = σ²/n
In the second step we use the fact that Var(cX) = c²·Var(X). In the third step we use the fact that the variance of a sum of r.v.s equals the sum of the variances of the individual r.v.s if the r.v.s are independent, which is our assumption here. Note that since you divide by the sample size n, the larger n is, the smaller the variance of the sample mean.

We can compute the sample standard deviation for a single sample. But we can also compute the standard deviation of the sampling distribution of the mean (or of another statistic). If we compute the standard deviation of a sampling distribution of a statistic, we call it the standard error. So if we see the term "standard error" we know that it is a standard deviation, but with respect to a sampling distribution. We can even be more explicit: if the statistic of our sampling distribution is the mean, we call the standard deviation not only standard error, but "standard error of the mean" (SEM).
(Test yourself: What is the difference between the term ”sample standard deviation” and ”standard error”?)

(Answer: The computations are the same, only that ”standard error” implies that this is the sample standard deviation of a sampling distribution.)

The standard error is often abbreviated as ”SE”.
All of the above is summarized with numerical examples in figure 3. Note that although we talked about taking "many samples", which is how we derived the idea of sampling distributions, in reality we often only have a single sample, based on which we compute our estimates.
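As a small illustration (reusing the height sample from above), the SEM estimated from a single sample is the sample standard deviation divided by √n, which is also what scipy.stats.sem computes:

import numpy as np
from scipy.stats import sem

# the height sample from the earlier examples
X = np.array([160, 168, 159, 176, 160])

# estimated standard error of the mean: sample standard deviation / sqrt(n)
print(X.std(ddof=1) / np.sqrt(len(X)))
# scipy computes the same quantity
print(sem(X))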

Figure 3: Visual description of the relationship between population and sample.

0.4
The (Standard) Normal Distribution

First, we do a quick review of the normal distribution. We already know from the central limit theorem that, if we take enough samples, the sampling distribution of the sample mean will be normal. Also many other r.v.s have a probability distribution which can be approximated by the normal distribution, so it is very important. It is a continuous probability distribution, determined by 2 parameters, µ and σ, its mean and standard
deviation. A r.v. X with this distribution is denoted by X ∼ N(µ, σ²), and if µ = 0 and σ = 1, the r.v. follows a standard normal distribution and is usually denoted by Z. Here is the pdf for X, denoted f_X:

f_X(x) = 1/(σ·√(2π)) · exp(−(x − µ)² / (2σ²))
The function can look daunting. To familiarize yourself with it you can check out this article that I wrote some time ago, or consult any other source. Fortunately, we don't really need the function definition here, and I merely include it for completeness. Any normal distribution, X, is a shifted and scaled standard normal distribution: X = σZ + µ. We call the reverse process, Z = (X − µ)/σ, standardization.
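Here is a small simulation sketch of this shift-and-scale relationship (the parameters µ = 3 and σ = 2 are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 3, 2

# draw from N(mu, sigma^2) by shifting and scaling standard normal draws
Z = rng.standard_normal(1_000_000)
X = sigma * Z + mu
print(X.mean(), X.std())            # roughly 3 and 2

# standardizing X recovers (approximately) standard normal values
Z_back = (X - mu) / sigma
print(Z_back.mean(), Z_back.std())  # roughly 0 and 1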

Recall that the cdf of a continuous r.v. X is usually written as F_X, and that it is used to compute the probability that the r.v. is less than or equal to a certain value. For example, P(X ≤ 0.3) = 0.4 means the probability that the r.v. X takes a value less than or equal to 0.3 is 0.4 (I made up these numbers, they only serve as an example of the notation). Furthermore, the cdf of a r.v. is the integral of its pdf. Unfortunately, there is no closed-form solution for the cdf of the standard normal distribution (try to compute the integral of the formula above). No closed form in essence means there is no nice, simple formula we can use to compute its value. Therefore, to compute probabilities for a normal r.v., we need to resort to looking up values in a large table, often called a z-table. Luckily, we can of course also use software. For example, to determine P(Z ≤ −1.96) we can use Python:

#import the normal distribution r.v.
#from the scipy stats module
from scipy.stats import norm

#calling norm without parameters defaults
#to the standard normal distribution
#F(-1.96) = P(X<-1.96)
print(norm.cdf(-1.96))
#0.025
#F(-1.96) for X with mu=3, sigma=2
print(norm(loc=3,scale=2).cdf(-1.96))
#0.0066

A particular property of the normal distribution is that 95% of the area under its density curve lies in the range [µ − 1.96σ, µ + 1.96σ], as shown in fig. 4.

Figure 4: Area under the normal curve. The blue part constitutes 95% of the area. The red parts constitute the remaining 100% − 95% = 5%.

In probability parlance: "The probability that X lies within 1.96 standard deviations of the mean equals .95." In math: for X ∼ N(µ, σ²), P(µ − 1.96σ ≤ X ≤ µ + 1.96σ) = .95. Because the normal distribution is a valid pdf, we know that the total area under the curve (its integral, given by the cdf) sums to one. Thus, we can conclude that the red parts under the normal curve in fig. 4 equal 1 minus the area (= probability) that falls into the blue interval of the plot. We denote this remaining probability as α. Thus, α = 1 − .95 = 0.05. In other words, the probability that an event located in the far left or far right tail of our r.v. occurs is 0.05. Also, since the curve is symmetric around the mean, we know that the red area under the left tail equals the red area under the right tail and is 0.05/2 = α/2. The red area under the left tail is given by P(X < µ − 1.96σ) = α/2, and the red area under the right tail is given by P(X > µ + 1.96σ) = α/2 = 1 − P(X < µ + 1.96σ). Note that in the case of a standard normal distribution µ = 0 and σ = 1, so that P(Z < −1.96) = α/2.

Above, we computed the probability that the r.v. is less than or equal to a certain value: P(X ≤ x). Now pretend we have the inverse problem: we know the probability, but not the x-value. For example, we might be interested in the x-value whose left tail covers 20% of the area under the pdf (which is, of course, a probability of .2). We want to know the corresponding x-value on the horizontal axis, or in other words P(Z ≤ ?) = .2, where ? = F⁻¹(.2). The inverse of the cdf, F⁻¹, is the percentile function. Again, we can use the z-table to find such values for any given X after standardization. (As explained before, standardization means we subtract the mean and divide by the standard deviation; the resulting value is referred to as the z-score.) But, also as said before, instead of using a table we can use software, and the software can also do the standardization for us. Here is the code in Python.

from scipy.stats import norm

#for the standard normal distribution
#F^(-1)(.2)
print(norm.ppf(.2))
#-0.842
#for X with mu=3, sigma=2
print(norm(loc=3,scale=2).ppf(.2))
#1.317
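As a quick numeric check of the 95% claim from fig. 4, we can also compute the probability mass between µ − 1.96σ and µ + 1.96σ directly from the cdf (a minimal sketch, reusing the example parameters µ = 3, σ = 2 from above):

from scipy.stats import norm

# standard normal: mass between -1.96 and 1.96
print(norm.cdf(1.96) - norm.cdf(-1.96))
# ~0.95

# the same holds for any normal distribution around mu +/- 1.96*sigma
mu, sigma = 3, 2
X = norm(loc=mu, scale=sigma)
print(X.cdf(mu + 1.96 * sigma) - X.cdf(mu - 1.96 * sigma))
# ~0.95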

This was our review of the normal distribution. We are now well prepared to understand and construct confidence intervals.

0.5
Confidence Intervals

Maybe you can already see where all this is heading. The normal distribution has all these nice properties that we just looked at: symmetry, only 2 parameters, etc. And the great thing is that under either of 2 conditions, the sampling distribution of the sample mean is also normally distributed! The 2 conditions are: 1) the sample size n is large, or 2) the population distribution is already known to be normal. Recall that the sampling distribution of the sample mean is the distribution over all possible sample means, and that the mean of this distribution is the population mean (which we are after). Unfortunately we do not really know the sampling distribution of the sample mean, we only know a few possible sample means. But here is the crux: We do know (from the characteristics of the normal distribution) that 95% of all possible sample means are within the range population mean ± 1.96 SEM.

Figure 5: Visual description of the relationship between a standard normal distribution
and the sampling distribution of the sample mean.

To make this even more explicit, compare figure 5: We know that for a standard normal r.v. Z, P(Z ≤ −1.96) = .025, and thus P(|Z| ≤ 1.96) = .95. Since we know that X̄, the r.v. for the sampling distribution of the sample mean, is normal, we can standardize X̄ and write:

P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = .95

We can then manipulate this further (rearranging the inequalities so that µ stands in the middle):

P(X̄ − 1.96 · σ/√n ≤ µ ≤ X̄ + 1.96 · σ/√n) = .95

Figure 6: Rearranging the probability statement so that µ is bracketed by X̄ ± 1.96σ/√n.

Therefore, we know that 95% of all our sample means are within the range µ ± 1.96σ/√n, we just don't know which ones. If we put the same interval around a given sample mean (compare figure 6), then this interval, x̄ ± 1.96σ/√n, contains the true mean µ with high probability. Or, more exactly, we believe that for 95 out of 100 sample means this interval contains the true mean. Finally: this interval is called a 95% confidence interval for µ. As you have seen, the confidence interval equals the sample statistic, here the sample mean, ± a margin of error. The margin of error consists of a factor z_(α/2) and the deviation of the sampling distribution. For the sampling distribution of the sample mean the deviation is σ/√n, as discussed in subsection "Mean and Variance of the Sample Mean". The factor z_(α/2) is computed based on the desired confidence level. In the previous example we computed a 95% confidence interval, thus the confidence level was 95%. If we let α = 1 − the confidence level (converted to a probability), then this is the probability mass that is left for the left and right tail of the normal distribution, and since the normal distribution is symmetric each tail gets half of it. The value z_(α/2) is the positive z-value for which P(Z ≤ −z) = α/2 (equivalently, P(Z > z) = α/2). If you use Python you can type:

from scipy.stats import norm

confidence_level = .95
# probability mass in each tail
alpha_2 = (1-confidence_level) / 2
# the positive z-value for which P(Z <= -z) = alpha/2
z_alpha_2 = abs(norm.ppf(alpha_2))

So now we know what a confidence interval is, and how to compute it for the sampling distribution of the sample mean:

x̄ ± z_(α/2) · σ/√n
However, notice that the above formula contains the population parameter σ. We do not know the population variance!
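For illustration only, here is a minimal sketch that computes a 95% confidence interval for the height sample, pretending for a moment that the population standard deviation were known; the value σ = 7 cm is a made-up assumption:

import numpy as np
from scipy.stats import norm

# the height sample from the earlier examples
X = np.array([160, 168, 159, 176, 160])
n = len(X)
x_bar = X.mean()

# hypothetical: pretend we knew the population standard deviation
sigma = 7.0

confidence_level = 0.95
z_alpha_2 = abs(norm.ppf((1 - confidence_level) / 2))  # ~1.96

margin = z_alpha_2 * sigma / np.sqrt(n)
print(x_bar - margin, x_bar + margin)  # the 95% confidence interval for mu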

In figure 7 we look at how to estimate this value.
Finally, when using the estimated variance, the distribution is not normal anymore, but was shown to follow the t-distribution. This is summed up in figure 8.

References
[1] Joseph K. Blitzstein and Jessica Hwang. Introduction to Probability, Second Edition. 2019. URL: https://drive.google.com/file/d/1VmkAAGOYCTORq1wxSQqy255qLJjTNvBI/view.

[2] R.H. Lock et al. Statistics: Unlocking the Power of Data. Wiley, 2020. ISBN: 9781119682165. URL: https://books.google.de/books?id=UiQGEAAAQBAJ.

[3] R.L. Ott and M.T. Longnecker. An Introduction to Statistical Methods and Data Analysis. Cengage Learning, 2015. ISBN: 9781305465527. URL: https://books.google.de/books?id=VAuyBQAAQBAJ.
