On Understanding the Central Limit Theorem (visually)

Marat Kopytjuk
Published in Analytics Vidhya
Oct 28, 2019


During your statistics lectures you surely heard of the famous and fundamental Central Limit Theorem. Especially for political surveys, drug effectiveness evaluations and A/B testing of digital goods, this theorem allows us to draw conclusions from relatively small sample sizes in our tests. "Relatively small" is meant in relation to the whole population of objects/people/customers, which is usually the target of our investigation.

In this article I will introduce the most important concepts for applying the Central Limit Theorem in inductive statistics and show, with a simple visual example, how we get a normal distribution from the sum of arbitrarily distributed random variables.

You will probably need some basic statistics knowledge to follow the storyline, but I promise you will get a new point of view on a fundamental tool that many people in finance, engineering and medicine apply every day!

Monthly spending on clothes — Photo by Becca McHaffie on Unsplash

How can we describe populations quantitatively? Let's take an example. For monthly spending on clothes in Central Europe, a good descriptive metric would be the arithmetic mean (i.e. the average, or, to be super precise, the expected value µ). We have two options to find it out:

  1. Organize a survey of all adults in the region of interest (in our example whole Central Europe) and calculate the mean.
  2. Take a sample, let’s say 1000 randomly selected adults, and calculate the average.

It is obvious that the first option would be the most precise, but very expensive to organize, operate and evaluate. Just think of the amount of paperwork we would have to handle! Remember, not everyone is able to use computers and electronic forms, and we want to include everyone in our survey to get a precise number!

The second option seems manageable, but it is surely not very precise, since taking a different "batch" of 1000 people (formally: sample size N=1000) over and over again will lead to different mean values. You could accidentally grab a group of fashion bloggers, or some tech people who spend their money on gadgets instead of clothes (yes, yes, I know, stereotypes everywhere …).

What happens statistically? Randomly selecting a single adult and asking for her/his monthly spending is equivalent to drawing from the random variable X, i.e. the measured spending x is a single realization of the random variable X.

Asking N adults and calculating the mean of their spending is the random variable S, defined as:

$$ S = \frac{1}{N} \sum_{i=1}^{N} X_i $$

Random variable S: the arithmetic mean of N equally distributed random variables Xᵢ
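To make this concrete, here is a minimal sketch of a single survey in Python. The gamma-distributed spending is purely a hypothetical stand-in, since we do not know the real distribution of X:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical stand-in for the unknown spending distribution X
# (a gamma distribution, chosen purely for illustration)
N = 1000
x = rng.gamma(shape=2.0, scale=40.0, size=N)  # N realizations x_i of X

# The average of the batch is one realization s of the random variable S
s = x.mean()
print(f"sample mean s = {s:.2f} EUR")
```

Running this "survey" again with a fresh seed would give a different s, which is exactly the randomness of S we are talking about.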

Let’s take a short look at the notation. We use capital letters (e.g. X and S) for random variables and small letters for their realizations (i.e. measurements).

Note that the sample distribution S is different from X. When you are reading the article, always clarify whether S or X is meant.

Additionally, note that we do not know how X is distributed. Is it a normal distribution? Poisson? Maybe an F-distribution? We do not know, but it doesn't really matter!

Thanks to the Central Limit Theorem we know that the sum of N equally distributed random variables X (no matter what kind of distribution, as long as its variance is finite) will approach a normal distribution for big sample sizes N.

What does this mean in our example? Taking a batch of adults and calculating the arithmetic mean s is, in other, more complicated words, drawing a realization from the normally distributed S. The shape and position of this normal distribution are important for us, since we will infer characteristics of X (the average monthly spending per capita in Central Europe) from it.

Namely, the central limit theorem states that the expected value of S equals the expected value of X:

$$ E[S] = E[X] = \mu $$

The variance of S equals the variance of X divided by the sample size N, i.e.:

$$ \sigma_S^2 = \frac{\sigma_X^2}{N} $$

This feels natural: with increasing batch size we get closer and closer to observing the whole population, and the calculated mean will vary less and less between experiments.

In case you are interested in the theoretical background of the above formula, take a look at the Bienaymé formula.
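If you want to convince yourself numerically, a small simulation (again with a hypothetical, non-normal X) shows that the empirical variance of the sample means matches σ²_X/N quite well:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

N = 1000            # sample size per survey
experiments = 5000  # number of repeated surveys

# Hypothetical X: exponential spending with std. deviation of 50 EUR
samples = rng.exponential(scale=50.0, size=(experiments, N))
sample_means = samples.mean(axis=1)  # 5000 realizations of S

print(f"theoretical Var(S) = Var(X)/N : {50.0**2 / N:.3f}")
print(f"empirical   Var(S)            : {sample_means.var():.3f}")
```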

A closer look at CLT

Now you surely ask why and how a sum of equally distributed random variables leads to a normal distribution without taking the shape of X's distribution into account. This is actually the reason why this article was written.

We find the answer deep in statistics theory, namely in the convolution of probability distributions. I described a simple sum of two random variables in my last article. There you'll find a step-by-step example of how the convolution operation works and how it is calculated numerically.

Before dividing the sum of our measurements by N, we have a sum of N random variables, S*:

$$ S^* = \sum_{i=1}^{N} X_i $$

Imagine the distribution of X would be a uniform distribution on the range [0, 1]. We will increase the number of samples and see what effect this has on the random variable S*.

The density of the resulting random variable is obtained by repeated convolution (we convolve the uniform density with our last result; a sum of N variables requires N-1 convolutions):

Sum of 4 uniformly distributed variables

The convolution operation smooths out the shape of the uniform distribution. No matter what kind of distribution you use (as long as its variance is finite), the result will approach a normal distribution!
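Here is a rough numerical sketch of this repeated convolution using NumPy's np.convolve; the discretization step dx and the uniform density on [0, 1] are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

dx = 0.001
density = np.ones(int(1 / dx))  # discretized uniform density on [0, 1]

result = density.copy()
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for n, ax in enumerate(axes, start=1):
    # result currently holds the (unnormalized) density of the sum of n uniforms
    grid = np.linspace(0, n, len(result))
    ax.plot(grid, result / (result.sum() * dx))  # renormalize to a proper density
    ax.set_title(f"sum of {n} uniform(s)")
    result = np.convolve(result, density) * dx  # convolve once more for n + 1
plt.tight_layout()
plt.show()
```

Already at four summands the curve is visually hard to distinguish from a Gaussian bell.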

Take a look at a very simple visual example from Wikipedia with an artificial distribution (read from top left to bottom right):

https://commons.wikimedia.org/wiki/File:Central_limit_thm.png

The more samples we take, the more the sample mean's distribution looks like a normal distribution.

Oh great, fancy plots, but how do they help me?

Great question! Since we know how our mean will be distributed, we can infer some useful information. After asking 1000 people about their spending, we can state that the true mean lies within certain bounds around our calculated mean with 95% confidence, since we know the variance of the resulting normal distribution of S.

Sounds complicated? Let's talk about what confidence intervals are and how we derive them.

Confidence intervals

The variance of S tells us the magnitude of the sample mean's spread. In other words, a small variance will lead to similar sample means across repeated experiments, and vice versa. Remember, the variance of S depends on the number of samples and on the variance of X itself. Given the variance of X, we can "adjust" the variance (standard deviation) of S by taking more or fewer measurements in our batch (i.e. we vary N). The effect is stated in the equation below:

$$ \sigma_S = \frac{\sigma_X}{\sqrt{N}} $$

Influence factors of σ_S: the standard deviation of X and the sample size N
Effect of different standard deviations on the shape of S: https://matheguru.com/stochastik/normalverteilung.html

Back to our example with a fixed sample size N: how do we calculate confidence intervals? Let's try to understand the concept step by step.

Imagine we knew the expected value µ in our experiment: theoretically, we would expect the sample mean s to fall in [µ-2σ_S, µ+2σ_S] in approx. 95% of all potentially possible experiments.

As a reminder, here is an illustration of the different σ intervals and their corresponding probabilities. There is a 34.1% + 34.1% + 13.6% + 13.6% ≈ 95.4% probability to draw a number in the interval [-2σ, +2σ].

By M. W. Toews, own work, based (in concept) on a figure by Jeremy Kemp, 2005-02-09, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1903871
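You can quickly verify these σ-interval probabilities yourself with the standard normal CDF from SciPy:

```python
from scipy.stats import norm

# Probability mass inside [µ - k·σ, µ + k·σ] for a normal distribution
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"P(µ - {k}σ ≤ X ≤ µ + {k}σ) = {p:.3f}")
# prints 0.683, 0.954, 0.997
```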

I know what you are thinking at the moment, and you are right! We do not have any clue about µ. But having our sample mean s, the interval [s-2σ_S, s+2σ_S] gives us a range which will include µ with 95% confidence. Does it make sense? If not really, take a look at the picture below!

First, you'll see the density curve, our sample mean's distribution. Below it, 25 different sample means with corresponding confidence intervals are illustrated (each of them is a single experiment with N participants). The interval length is derived from the density curve's standard deviation.

X-bar corresponds to S in our example

Sure, as you can see above, it is still possible to sample a mean somewhere far to the right (like the 5th sample), and then the [s-2σ_S, s+2σ_S] approach won't include µ. But since sampling a mean far to the right of the true mean is highly improbable, the approach is valid. Note that there is still a 5% chance of making a mistake (there is a reason why we gently say "with 95% confidence").

In a nutshell: the interval boundaries indicate that the true mean µ (i.e. the expected value we are looking for) will be within the confidence boundaries in 95% of all possible experiments we could potentially run (significance level p=0.05). If we repeated the experiment an infinite number of times and calculated the mean and confidence boundaries in the same manner as above, the expected value µ would be located within the confidence interval in 95% of the runs.

Anyway, it can still happen that we observe some extremely high mean (because we had the bad luck to select a lot of fashion bloggers in our sample), and then the confidence interval won't include the true expected value. The chance of this event is p=0.05=5%. Surely, we could theoretically push p towards 0 and require ever wider intervals (e.g. [s-10σ_S, s+10σ_S]), but a true 100% confidence interval would be useless for us, since it would reach from -inf to +inf.

In order to get smaller intervals at the same level of confidence, the one and only way is to increase N and thereby narrow the sample distribution of S. Sorry :(
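To close the loop, here is a small sketch computing 95% intervals for different N. The exponential spending distribution and its scale are, again, purely hypothetical, and 1.96 is the exact z-value behind the "roughly 2σ" rule:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

for N in (100, 1000, 10000):
    x = rng.exponential(scale=50.0, size=N)  # one survey of size N
    s = x.mean()
    sigma_s = x.std(ddof=1) / np.sqrt(N)     # estimated std. deviation of S
    lo, hi = s - 1.96 * sigma_s, s + 1.96 * sigma_s
    print(f"N={N:>5}: s={s:6.2f}, 95% CI = [{lo:6.2f}, {hi:6.2f}]")
```

Note how each tenfold increase in N shrinks the interval only by a factor of √10, which is exactly the σ_X/√N relation from above.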

Summary

Whoa! That was a lot of content! What you should definitely take away is the difference between the randomness of a single measurement X and the randomness of the sample mean S. Thanks to the central limit theorem and the convolution operation, the sum of N equally distributed random variables X approaches a normal distribution regardless of X's probabilistic characteristics (we don't even have to know the underlying distribution). The CLT allows us to do great things, such as estimating expected values and calculating intervals based on the confidence we specify beforehand.

Hope you had fun reading this article. I know it is not an easy concept to grasp, but I am sure the article provides a high-level overview of a fundamental tool used across different domains in industry and academia.

Credits

Thanks to Anna F. for her support and the patience reading and correcting this article.
