Please Stop Talking About the Central Limit Theorem

Patrick Martin
8 min read · Jul 3, 2022


Whenever your favorite online discussion forum asks about the best, cleanest, most underrated, or other superlative result in mathematics, you can be sure statistics’ Central Limit Theorem will be near the top of the list. Even the name itself highlights its stature: the Central refers to its importance among limit theorems, not that it is a theorem about “central limits” (although that is also somewhat accurate). It is a staple of early statistics courses, and students are often rightly confused about what it means: this is partly because the Central Limit Theorem does not say what statistics textbooks want it to say!

This article will explain why the popular conception of the Central Limit Theorem is wrong, and why the theorem itself is bad. But before doing that, let’s go over what the Central Limit Theorem actually says.

In the wild world of statistics, a fundamental object is the random variable: an object which can take various values (versus a standard variable, which is presumed to have a single, well-defined value); think of rolling a die. If you believe in the existence of some aleatoric “event space”, a random variable is a function from that event space to (typically) the real numbers. A probability distribution describes the values a random variable can take and how likely each one is, and there are a lot of different probability distributions. While there are some examples of random variables that are particularly bad to play with, in this article we’re going to assume that they all have some amount of courteousness. Over the past few centuries, mathematicians have identified certain distributions of particular interest and developed specific tools to aid in their analysis.

The density plot of the normal distribution, with the areas under segments (i.e. the probability the variable takes a value in that range) labeled. (Image from M. W. Toews on Wikipedia)

Perhaps the most well-studied distribution is the normal distribution. Mathematicians have written tables of values for and developed approximations to important properties of the distribution, and when people need to pick a random number, frequently they will pull it from a normal distribution. Of course, according to the Law of We Can’t Have Nice Things, random variables in the wild are rarely normally-distributed.

On the other hand, perhaps the most-used property of a random variable is its mean, its average value. We don’t generally get to know what its true (or population) mean is, but we can estimate it by just taking the average of the samples that we’ve seen so far — this is called the sample mean.
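
In symbols, if X₁, …, X_N are the samples, the sample mean is

$$
\bar{X}_N \;=\; \frac{1}{N}\sum_{i=1}^{N} X_i .
$$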

The sample mean is just another number, but its value depends on which samples we happened to see, and so it is itself a random variable. As our goal was to estimate the population mean, a natural question is then: what is the distribution of this random variable, and in particular, how likely is the sample mean to be close to the population mean?
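
As a minimal simulation sketch of that idea (the fair-die example and the choice of 30 samples per experiment are arbitrary choices of mine), we can repeat the “take N samples and average them” experiment many times and watch the result scatter around the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# One "experiment": roll a fair six-sided die N times and take the average.
N = 30
num_experiments = 10_000
rolls = rng.integers(1, 7, size=(num_experiments, N))  # values 1..6
sample_means = rolls.mean(axis=1)

# The population mean of a fair die is 3.5; each experiment produces a new
# draw from the (unknown) distribution of the sample mean.
print("population mean: 3.5")
print("first few sample means:", sample_means[:5])
print("std. dev. of the sample mean:", sample_means.std())
```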

Upper left, a probability density, and in subsequent plots the density of the sum of 2–4 copies of that variable. Note how the curve becomes more “bell-shaped” — this is a useful fact that is not due to the Central Limit Theorem! (Image by Fangz on Wikipedia)

This is a good question, but the Central Limit Theorem does not answer it. Here is what the Central Limit Theorem actually says:

There exists a normal distribution such that, given some tolerance ε, there exists a sample size N such that for any number t, the probability that √N times the difference between the sample mean (of N samples) and the population mean is greater than t is within ε of the probability that that normal distribution takes a value greater than t.

The Central Limit Theorem written in symbols. Z is the asymptotic normal distribution and μ is the mean. The probability that the scaled, centered, sample mean exceeds a given value is close to the probability the normal variable does the same, assuming N is large enough.
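
In symbols (my reconstruction of the caption above, writing X̄_N for the sample mean, μ and σ² for the population mean and variance, and Z for the limiting normal variable), the statement reads roughly:

$$
\Big|\, \Pr\!\big(\sqrt{N}\,(\bar{X}_N - \mu) > t\big) \;-\; \Pr(Z > t) \,\Big| \;<\; \varepsilon
\qquad \text{for every } t,\ \text{once } N \text{ is large enough},
$$

where Z ∼ 𝒩(0, σ²).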

I want to highlight the two most important parts of the theorem: the there exists and the within ε. But before I do, let’s take a detour to talk about what is called the “Central Limit Theorem for Sums”.

Using the Central Limit Theorem for sums is especially egregious. To see why, let’s peek under the hood of a standard proof of the Central Limit Theorem. For any random variable, we can consider the sequence of expected values: 𝔼[X], 𝔼[X²], 𝔼[X³], and so on. These values are called the moments of a random variable, and in certain circumstances, the infinite sequence of moments uniquely defines a random variable. One of these cases is the normal distribution: if every odd moment of a random variable is zero and every even moment satisfies a certain pattern, then that random variable must be normally distributed.
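
For reference, the pattern in question for a mean-zero normal variable Z with variance σ² is the double-factorial formula:

$$
\mathbb{E}\big[Z^{2k+1}\big] = 0,
\qquad
\mathbb{E}\big[Z^{2k}\big] = (2k-1)!!\,\sigma^{2k} = 1 \cdot 3 \cdot 5 \cdots (2k-1)\,\sigma^{2k}.
$$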

The proof of the Central Limit Theorem shows that the odd moments of √N times the difference between sample and population means — what I’ll refer to as the “central limit” for short — are on the order of 1/√N. This quantity tends to zero as N increases, and so the Central Limit Theorem holds. The centered sum, however, is √N times larger than the central limit, so its odd moments are not shown to tend to zero. Without that additional factor of √N keeping things in check, the distribution of sample sums is allowed to have moment behavior that is very different from that of a normal distribution.
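
To make that bookkeeping concrete, here is the third-moment case, writing X̃ᵢ = Xᵢ − μ and using independence so that cross terms containing a lone factor of some X̃ᵢ vanish:

$$
\mathbb{E}\Big[\big(\sqrt{N}\,(\bar{X}_N - \mu)\big)^{3}\Big]
= \frac{1}{N^{3/2}}\,\mathbb{E}\Big[\Big(\sum_{i=1}^{N}\tilde{X}_i\Big)^{3}\Big]
= \frac{N\,\mathbb{E}\big[\tilde{X}^{3}\big]}{N^{3/2}}
= \frac{\mathbb{E}\big[\tilde{X}^{3}\big]}{\sqrt{N}},
$$

which shrinks as N grows, while the corresponding moment of the centered sum, N·𝔼[X̃³], does not.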

As to why the Central Limit Theorem is bad, let’s go back to that there exists in the theorem statement. The Central Limit Theorem by itself says absolutely nothing about how many samples are needed to achieve its results. The proof gives us something of a hint: the distribution of the central limit approaches a normal distribution at a rate of about 1/√N. In fact, there is a non-asymptotic version of the Central Limit Theorem, the Berry-Esseen Theorem, which says that (with some additional assumptions) the value of ε — the bound on the difference in probabilities — can indeed be taken to be approximately 1/√N. However, while optimal for this problem, a 1/√N convergence rate is really really slow.
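
For completeness, one common statement of the Berry-Esseen bound (assuming the third absolute moment ρ = 𝔼|X − μ|³ is finite; C is an absolute constant known to be a bit under 0.5) is:

$$
\sup_{t}\, \Big|\, \Pr\!\Big(\tfrac{\sqrt{N}\,(\bar{X}_N - \mu)}{\sigma} \le t\Big) - \Phi(t) \,\Big|
\;\le\; \frac{C\,\rho}{\sigma^{3}\sqrt{N}},
$$

where Φ is the standard normal CDF.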

Let’s think about that for a moment. A classic rule of thumb is that the Central Limit Theorem “kicks in” after some number of samples, like 30. For reference, 1/√30 is about 0.18, but the Central Limit Theorem is hoping that that number is close to zero! While I’ve certainly seen larger numbers than 0.18, it’s a pretty big stretch to call that “zero”. That so many academic and professional sources present this rule of thumb as mathematical fact is frankly embarrassing.

Let’s go back to the question we want to answer about sample means: if the population mean is, say, 10, how confident are we that a sample mean over 30 samples will not be more than 10.5? This is computed by considering a normal distribution with mean 10 and variance σ²/30; if σ = 1, then the probability that that normal random variable is greater than 10.5 is about 0.3%. The Central Limit Theorem then tells us that the probability our sample mean exceeds 10.5 is at most that, plus some ε: by Berry-Esseen, the fudge factor in this case is on the order of 8.7% (a little less than half of 1/√30), for a total of about 9%. This is an additive error: using the Central Limit Theorem alone, we can never certify that the chance of being off by any given amount is below 8.7%.
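
Here is that arithmetic as a minimal sketch (the Berry-Esseen constant of 0.4748 and the assumption that 𝔼|X − μ|³ is about σ³ are choices of mine, made to reproduce the figures above):

```python
import math
from scipy.stats import norm

N = 30
mu, sigma, threshold = 10.0, 1.0, 10.5

# Naive use: pretend the sample mean is exactly Normal(mu, sigma**2 / N).
p_normal = norm.sf(threshold, loc=mu, scale=sigma / math.sqrt(N))
print(f"normal tail probability: {p_normal:.4f}")            # about 0.003 (0.3%)

# Berry-Esseen slack: C * rho / (sigma**3 * sqrt(N)), taking C = 0.4748
# and assuming rho = E|X - mu|**3 is roughly sigma**3.
C, rho = 0.4748, 1.0
epsilon = C * rho / (sigma**3 * math.sqrt(N))
print(f"Berry-Esseen slack:      {epsilon:.4f}")             # about 0.087 (8.7%)

print(f"rigorous CLT bound:      {p_normal + epsilon:.4f}")  # about 0.090 (9%)
```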

Plot of the probability of the sample mean in our example exceeding various values. Note that while the normal distribution’s curve (blue) gives near-zero probabilities for values moderately larger than the population mean of 10, the error bound from the Central Limit Theorem (orange) remains large.

In other words, because we only have additive control over the distributions, our attempts to answer our original question of “how likely is the sample mean to be close to the population mean” with the Central Limit Theorem are entirely in vain. Our confidence that the sample mean does not exceed 10.5 is practically the same as our confidence that it does not exceed 50! We simply have no tail control due to the additive nature of the Central Limit Theorem.

This is a little odd, however, because people have been citing the Central Limit Theorem for years without a problem. It turns out that what we want to be true almost is, but it goes by a different (and less snazzy) name: Hoeffding’s Inequality.

If we look back at what we wanted the Central Limit Theorem for, it was usually to be able to compute how likely it was that the sample mean was close to the population mean. This is what Hoeffding’s Inequality directly tells us:

There is some number c such that the probability that the difference between the sample and population means is greater than some value t is at most exp(−cNt²).

Hoeffding’s inequality written in symbols; μ is the population mean.
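
In symbols (again my reconstruction of the caption, with X̄_N the sample mean of N samples), the one-sided bound reads:

$$
\Pr\!\big(\bar{X}_N - \mu > t\big) \;\le\; \exp\!\big(-c\,N\,t^{2}\big),
$$

and the two-sided version (being off by more than t in either direction) picks up a factor of 2 on the right-hand side.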

The downside here is that the value of c is generally unknown to us; however, 0.5 divided by the variance is a reasonable estimate — this is the correct value for a Bernoulli random variable, i.e. one that takes exactly two values, and should be roughly correct for many real-world distributions. It should be noted also that while Hoeffding’s inequality is usually presented as only holding for bounded random variables, the form I have given here holds for a much wider class of variables, called sub-gaussian random variables. Similar results, such as Bernstein’s inequality, also hold for variables that are not sub-gaussian.

This bound carries the spirit of what we wanted from the Central Limit Theorem: the general form looks like the density function of the normal distribution with variance σ²/N, which is very similar to what the misguided use of the Central Limit Theorem would yield. There is also no additive slack here; instead, Hoeffding’s inequality puts the slack in the estimate of c, which means there is no confidence floor! In the previous problem with a population mean of 10, the Hoeffding bound on the probability that the sample mean exceeds 10.5 is 2.4%. That is much tighter than the 9% of a rigorous Central Limit Theorem, and it is better supported empirically than the 0.3% from the naïve use: in my experiments with several distributions, the true probability can certainly exceed 0.3% (for the folded normal distribution it appears to be around 0.5%, for example). Hoeffding’s inequality also holds for sums, unlike the Central Limit Theorem.
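
That 2.4% is easy to reproduce (a minimal sketch, using the 0.5/σ² estimate of c from above together with the σ = 1, N = 30, threshold-of-10.5 setup of the running example):

```python
import math

N, sigma = 30, 1.0
t = 10.5 - 10.0              # distance above the population mean

c = 0.5 / sigma**2           # rough sub-gaussian constant suggested above
hoeffding_bound = math.exp(-c * N * t**2)
print(f"Hoeffding bound: {hoeffding_bound:.4f}")   # about 0.024 (2.4%)
```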

While the bound from Hoeffding’s inequality (green) starts pretty loose, it quickly becomes better than the error bounds of the Central Limit Theorem (orange).

That this looks so similar to what we learn in statistics classes shouldn’t be surprising, however. Empirically, a normal approximation to the distribution of sample means has worked fairly well over the years. But this is just a rule of thumb, not a theorem. The Central Limit Theorem is bad. Please stop calling this useful trick that.

