Everything You Need To Know About Law of Large Numbers and Central Limit Theorem

Mert Atli
12 min read · May 8, 2023

Law of Large Numbers (LLN)

Imagine you’re at a spectacular magic show, and the magician decides to perform an incredible coin-flipping trick. The magician brings out a large, shiny coin and tells the audience that this magical coin will demonstrate the Law of Large Numbers.

He starts flipping the coin and announces that he’ll flip it not 10, not 100, but a whopping 1,000 times! The audience gasps in amazement. The magician proceeds to flip the coin rapidly, as the audience eagerly tracks the number of heads and tails. The tension in the room is palpable, as everyone is captivated by the unfolding trick.

As the flips continue, the crowd starts to notice a fascinating pattern. At first, the difference between the number of heads and tails seems random and unpredictable. But as the number of flips increases, something incredible happens. The proportion of heads (or tails) starts to get closer and closer to 50%, even though the flips themselves are still random.

The magician explains that this is the Law of Large Numbers in action! The LLN states that as the number of trials (coin flips, in this case) increases, the average of the outcomes (the proportion of heads) converges to the expected value (50% for a fair coin).

The audience bursts into applause, awed by the display of this powerful mathematical principle through a simple yet captivating coin-flipping trick. The magician takes a bow, leaving the audience with a newfound appreciation for the Law of Large Numbers and the magic of mathematics.
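The magician's trick is easy to reproduce in code. Here is a minimal Python sketch; the seed and flip counts are arbitrary choices for the illustration:

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

# Simulate 100,000 fair coin flips (True = heads).
flips = [random.random() < 0.5 for _ in range(100_000)]

# Watch the proportion of heads drift toward 0.5 as the flip count grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, sum(flips[:n]) / n)

prop_final = sum(flips) / len(flips)
```

Early proportions can wander noticeably, but by the last checkpoint the proportion typically sits very close to 0.5, exactly the behavior the LLN predicts.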

But how does the LLN actually work?

The Law of Large Numbers (LLN) is a fundamental theorem in probability theory that describes the behavior of the average of a large number of independent and identically distributed (i.i.d.) random variables. The LLN can be divided into two versions:

  • the Weak Law of Large Numbers (WLLN)
  • the Strong Law of Large Numbers (SLLN)

In a more rigorous mathematical sense, let’s consider a sequence of i.i.d. random variables X₁, X₂, X₃, …, Xₙ, where each Xᵢ has the same expected value E[Xᵢ] = μ and the same variance Var(Xᵢ) = σ².

We define the sample mean as

Sₙ = (X₁ + X₂ + … + Xₙ) / n

where n is the number of random variables in the sequence.

The Weak Law of Large Numbers states that,

for any ε > 0, lim (n → ∞) P(|Sₙ − μ| > ε) = 0

In other words, as the number of trials (n) approaches infinity, the probability that the difference between the sample mean (Sₙ) and the true expected value (μ) is greater than some small positive value (ε) goes to zero. This implies that the sample mean converges in probability to the true expected value:

Sₙ → μ (in probability) as n → ∞

The Strong Law of Large Numbers states that

P(lim (n → ∞) Sₙ = μ) = 1

This means that the sample mean almost surely (with probability 1) converges to the true expected value as the number of trials goes to infinity:

Sₙ → μ (almost surely) as n → ∞

Both the Weak Law and the Strong Law of Large Numbers highlight that, as the number of i.i.d. random variables in the sequence increases, the sample mean converges to the true expected value, either in probability or almost surely. This is a crucial result that underpins many concepts in statistics, including the idea that large samples can provide accurate estimates of population parameters.
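The weak law's statement, that P(|Sₙ − μ| > ε) shrinks toward zero as n grows, can be checked empirically. The sketch below estimates this probability for fair die rolls (μ = 3.5) by repeated simulation; the choice of ε, the sample sizes, and the number of repetitions are illustrative assumptions:

```python
import random

random.seed(1)

def sample_mean(n):
    """Mean of n fair six-sided die rolls; the true mean is mu = 3.5."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

mu, eps, reps = 3.5, 0.25, 2_000

# Monte Carlo estimate of P(|S_n - mu| > eps) for increasing n:
# the weak law says this probability should shrink toward zero.
probs = {}
for n in (10, 100, 1_000):
    exceed = sum(abs(sample_mean(n) - mu) > eps for _ in range(reps))
    probs[n] = exceed / reps
    print(n, probs[n])
```

The estimated probabilities fall sharply as n grows, which is the weak law made visible.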

Convergence In Probability vs Convergence Almost Surely

“Almost surely” and “in probability” are two different modes of convergence in probability theory, and they have different implications. Let’s use a sequence of random variables, X₁, X₂, …, Xₙ, converging to a random variable X, as an example to illustrate their differences intuitively.

Convergence in probability:
When we say that a sequence of random variables converges in probability to X, we mean that the probability of the difference between the random variables and X being larger than a small positive number (ε) goes to zero as the number of random variables (n) goes to infinity. In other words, as n increases, the random variables get closer and closer to X with a high probability.

However, convergence in probability does not guarantee that the sequence will converge to X for every possible outcome. There may still be some outcomes where the sequence does not converge to X, but the probability of those outcomes happening becomes vanishingly small as n goes to infinity.

Convergence almost surely:
When we say that a sequence of random variables converges almost surely to X, we mean that the probability of the sequence converging to X is equal to 1. In other words, for almost all possible outcomes, the random variables will converge to X as n goes to infinity. The term “almost surely” indicates that there might still be some outcomes where the sequence does not converge to X, but the probability of those outcomes happening is exactly zero.

In summary:

  • Convergence in probability is the weaker notion: it implies that the random variables get close to X with high probability at each large n, but not necessarily for all possible outcomes.
  • Convergence almost surely is the stronger notion: it implies that the random variables converge to X for almost all possible outcomes, with the probability of non-convergence being exactly zero.
  • Convergence almost surely implies convergence in probability, but the reverse is not necessarily true.

Let’s consider two intuitive examples to illustrate the difference between convergence in probability and convergence almost surely.

Example 1: Coin toss (convergence in probability)

Imagine a game where we toss a fair coin repeatedly. Let Xₙ be the proportion of heads after n tosses. As n goes to infinity, we would expect the proportion of heads to approach 1/2. This is because, in the long run, we expect the number of heads to be roughly equal to the number of tails. Therefore, Xₙ converges to 1/2 in probability.

Note carefully what "in probability" does and does not promise. One might imagine an extremely unlucky scenario in which the coin keeps landing heads indefinitely, so that the proportion of heads stays at 1 forever. Such outcomes can be described, but they form a set of probability exactly zero; in fact, by the Strong Law of Large Numbers, Xₙ also converges to 1/2 almost surely. The coin toss remains a good illustration of the weak law's guarantee: at each large n, the proportion of heads is close to 1/2 with high probability.

Example 2: Picking balls from an urn (convergence almost surely)

Imagine an urn with infinitely many balls, each labeled with a unique natural number (1, 2, 3, …). We draw balls from the urn one at a time, without replacement, and record the numbers.

Let Xₙ be an indicator random variable such that Xₙ = 1 if the number on the n-th ball is divisible by n, and Xₙ = 0 otherwise. We are interested in the proportion of balls with numbers divisible by their position in the sequence, up to the n-th draw.

As n goes to infinity, the proportion of balls with numbers divisible by their position converges to 0 almost surely. This is because, as we pick more and more balls, the probability of selecting a ball with a number divisible by its position becomes increasingly small, and the set of outcomes on which the proportion fails to converge to 0 has probability exactly zero, which is precisely what almost-sure convergence requires.

In summary, the coin toss example illustrates convergence in probability: at each large number of tosses, the proportion of heads is close to 1/2 with high probability, and the exceptional outcomes (such as an endless run of heads) form a set of probability zero.
The urn example illustrates convergence almost surely: the proportion of balls with numbers divisible by their position approaches 0 as the number of draws goes to infinity, and the set of outcomes on which this convergence fails has probability exactly zero.

Central Limit Theorem (CLT)

Imagine you’re participating in a thrilling, worldwide competition called “The Great Sampling Race.” The goal of the race is to estimate the average height of the entire human population using a limited number of samples.

You and thousands of other participants each have a high-tech drone at your disposal. These drones randomly sample people from all around the world, measuring their heights. Each participant’s drone collects a small sample of 30 heights.

Once all the drones have finished their measurements, the participants calculate the average height of their respective samples. Now comes the big reveal! All the sample averages are displayed on a gigantic screen at the race’s headquarters.

To everyone’s amazement, the screen shows a beautiful bell-shaped curve, a pattern that closely resembles the famous normal distribution. The competition organizers announce that the true average height of the human population lies close to the peak of this curve. The crowd erupts into applause, excited by the incredible power of the Central Limit Theorem (CLT) in action.

The CLT states that, for a sufficiently large sample size, the distribution of sample means approaches a normal distribution, regardless of the shape of the original population distribution. This is true as long as the underlying random variables are independent and identically distributed (i.i.d.) with a finite mean and variance. In our race, the sample size of 30 was large enough for the CLT to hold, revealing the true average height through the aggregation of many small samples.
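The race can be simulated. The sketch below draws many samples of size 30 from a deliberately skewed population (an exponential distribution with mean 1) and checks that the sample means cluster around the true mean with spread close to σ/√n; the seed and replication count are arbitrary:

```python
import random
import statistics

random.seed(7)

# Population: exponential with rate 1 (mean 1, standard deviation 1).
# It is heavily right-skewed, so any bell shape in the sample means
# is the CLT at work, not a property inherited from the population.
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

n, reps = 30, 20_000
means = [sample_mean(n) for _ in range(reps)]

center = statistics.mean(means)   # should land near mu = 1
spread = statistics.stdev(means)  # should land near sigma / sqrt(30)
print(center, spread)
```

Plotting a histogram of `means` would show the bell curve the race participants saw on the big screen.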

The Central Limit Theorem is a cornerstone of statistics, enabling us to make inferences about populations using a limited number of samples. This powerful and exciting concept is at the heart of many real-world applications, from election polling to data science and quality control in manufacturing.

But how does the CLT actually work?

The Central Limit Theorem (CLT) is a fundamental result in probability theory and statistics that describes the behavior of the distribution of the sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables. The CLT states that, under certain conditions, the distribution of the sum (or average) of these random variables approaches a normal distribution as the number of variables increases, regardless of the shape of the original population distribution.

Let’s consider a sequence of i.i.d. random variables X₁, X₂, X₃, …, Xₙ, where each Xᵢ has the same expected value E[Xᵢ] = μ and the same variance Var(Xᵢ) = σ².

We define the sum of the random variables as

Sₙ = X₁ + X₂ + … + Xₙ

and the sample mean as

Mₙ = Sₙ / n

where n is the number of random variables in the sequence.

The Central Limit Theorem states that, as n approaches infinity, the distribution of the normalized sum (Sₙ − nμ) / (σ√n) approaches a standard normal distribution, which has a mean of 0 and a variance of 1:

(Sₙ − nμ) / (σ√n) → N(0, 1) as n → ∞

In terms of the sample mean, Mₙ, the CLT can also be expressed as:

(Mₙ − μ) / (σ / √n) → N(0, 1) as n → ∞

This means that, for a sufficiently large sample size n, the distribution of the sample mean, Mₙ, is approximately normally distributed with a mean of μ and a variance of σ²/n:

Mₙ ~ N(μ, σ²/n)

Intuitively, as the sample size (n) increases in the Central Limit Theorem (CLT), the distribution of the sample means becomes more and more like a normal distribution, regardless of the shape of the original population distribution. Here’s what happens as n increases:

The shape of the distribution of the sample means becomes more “bell-shaped” and resembles the normal distribution. This occurs even if the original population distribution is not normal (e.g., skewed, bimodal, or uniform). The closer n gets to infinity, the closer the distribution of the sample means will be to a perfect normal distribution.

The variability of the sample means decreases. As n increases, the standard deviation of the distribution of the sample means (known as the standard error) becomes smaller. This is because the standard error is equal to the population standard deviation (σ) divided by the square root of the sample size (√n). As n increases, the denominator (√n) gets larger, causing the standard error to become smaller. This means that the sample means become more tightly clustered around the true population mean.
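The shrinking standard error is easy to see numerically: quadrupling the sample size should roughly halve the spread of the sample means. A sketch with Uniform(0, 1) draws, with sample sizes and replication count chosen purely for illustration:

```python
import random
import statistics

random.seed(3)

def mean_of_uniforms(n):
    """Mean of n Uniform(0, 1) draws; sigma = sqrt(1/12) ~ 0.2887."""
    return sum(random.random() for _ in range(n)) / n

reps = 20_000
se_25 = statistics.stdev(mean_of_uniforms(25) for _ in range(reps))
se_100 = statistics.stdev(mean_of_uniforms(100) for _ in range(reps))

# sigma / sqrt(25) versus sigma / sqrt(100): the ratio should be near 2.
print(se_25 / se_100)
```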

The accuracy of estimating the population mean (μ) improves. With a larger sample size, the sample mean becomes a better estimator of the true population mean. This is because the distribution of the sample means becomes more concentrated around the true population mean, reducing the likelihood of obtaining an extreme or unrepresentative sample mean.

In summary, as the sample size (n) increases in the CLT, the distribution of the sample means approaches a normal distribution, the variability of the sample means decreases, and the accuracy of estimating the population mean improves. This is a powerful result that allows us to make inferences about a population using a limited number of samples, even when the original population distribution is not normal.

The Central Limit Theorem plays a crucial role in many statistical methods, such as hypothesis testing, confidence intervals, and the estimation of population parameters. Its power lies in the fact that it allows us to make inferences about a population using a limited number of samples, even when the original population distribution is not normal.

Rigorous Proof of CLT

There are several versions of the Central Limit Theorem, and their proofs vary in complexity. Here, I’ll present the proof of the Lindeberg-Levy CLT, which is one of the most common versions of the theorem.

Let X₁, X₂, …, Xₙ be a sequence of independent and identically distributed (i.i.d.) random variables with expected value E[Xᵢ] = μ and variance Var(Xᵢ) = σ². We define the sum of the random variables as Sₙ = X₁ + X₂ + … + Xₙ and the sample mean as Mₙ = Sₙ / n.

The goal is to show that the distribution of the normalized sum (Sₙ − nμ) / (σ√n) approaches a standard normal distribution N(0, 1) as n approaches infinity:

(Sₙ − nμ) / (σ√n) → N(0, 1) as n → ∞

I’ll break down the proof of the Lindeberg-Levy Central Limit Theorem into simpler terms for better understanding. The steps remain the same, but I’ll provide more context and intuition for each step.

  1. Define the characteristic function of a random variable X:

ϕ(ω) = E[e^(iωX)]

where i is the imaginary unit and ω is a real number.

A characteristic function is a mathematical tool used in probability theory and statistics to describe the behavior of a random variable. It is a complex-valued function that completely captures the distribution of a random variable and contains all the information about its probability distribution. In the definition above:

  • ϕ(ω) is the characteristic function of the random variable X
  • E[·] denotes the expected value
  • e is the base of the natural logarithm (approximately 2.71828)
  • i is the imaginary unit (i² = −1)
  • ω is a real number, representing the frequency-domain variable

The characteristic function is particularly useful because it has several convenient properties:

  • The characteristic function of the sum of independent random variables is the product of their individual characteristic functions.
  • The characteristic function uniquely determines the probability distribution of a random variable: if two random variables have the same characteristic function, they have the same probability distribution.
  • The moments (e.g., mean, variance, skewness, kurtosis) of a distribution can be derived from the derivatives of the characteristic function evaluated at ω = 0.

These properties make characteristic functions particularly useful for working with sums or linear combinations of random variables, as well as for proving results like the Central Limit Theorem.
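The two properties most relevant to the proof can be verified numerically. The sketch below estimates characteristic functions by Monte Carlo for standard normal variables, whose exact characteristic function is e^(−ω²/2); the sample size and the evaluation point ω are arbitrary choices:

```python
import cmath
import random

random.seed(0)

def ecf(samples, w):
    """Empirical characteristic function: the sample average of e^{i*w*x}."""
    return sum(cmath.exp(1j * w * x) for x in samples) / len(samples)

n = 200_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
w = 1.0

# For X ~ N(0, 1) the exact characteristic function is e^{-w^2 / 2}.
err_exact = abs(ecf(x, w) - cmath.exp(-w * w / 2))

# For independent X and Y, the characteristic function of X + Y is the
# product of the individual characteristic functions.
err_product = abs(ecf([a + b for a, b in zip(x, y)], w) - ecf(x, w) * ecf(y, w))

print(err_exact, err_product)  # both small
```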

2. Since Xᵢ are i.i.d., their characteristic functions are the same:

ϕ₁(ω) = ϕ₂(ω) = … = ϕₙ(ω) = ϕ(ω)

Since our random variables are identically distributed, they share the same characteristic function.

3. The characteristic function of a linear combination of independent random variables is the product of their individual characteristic functions. Therefore, the characteristic function of the normalized sum is:

ϕₛ(ω) = E[e^(iω(Sₙ − nμ) / (σ√n))] = ∏ᵢ E[e^(iω(Xᵢ − μ) / (σ√n))] = ϕ(ω / (σ√n))^n

where ϕ now denotes the characteristic function of the centered variable Xᵢ − μ, which has mean 0 and variance σ². This step combines the characteristic functions of the individual random variables to find the characteristic function of the normalized sum: because the Xᵢ are independent, the characteristic function of the sum factors into the product of the individual characteristic functions.

4. Use Taylor series expansion on ϕ(ω / (σ√n)):

ϕ(ω / (σ√n)) = ϕ(0) + (ω / (σ√n)) ϕ′(0) + (1/2)(ω / (σ√n))² ϕ′′(0) + o((ω / (σ√n))²)

The Taylor series approximates a function near a point by a polynomial; here we expand the characteristic function of the centered variable Xᵢ − μ around 0. Its derivatives at 0 are ϕ(0) = 1, ϕ′(0) = iE[Xᵢ − μ] = 0, and ϕ′′(0) = −E[(Xᵢ − μ)²] = −σ². Substituting these values simplifies the expression and prepares it for the next step.

5. Substitute Taylor series expansion into the characteristic function of the normalized sum:

ϕₛ(ω) = (1 − (1/2)(ω² / n) + o(ω² / n))^n

We plug the Taylor series approximation into the expression for the characteristic function of the normalized sum.

6. Now take the limit as n approaches infinity:

lim (n → ∞) ϕₛ(ω) = lim (n → ∞) (1 − (1/2)(ω² / n) + o(ω² / n))^n

The goal is to understand how the characteristic function behaves when the sample size (n) goes to infinity.

7. Apply the limit definition of the exponential function:

lim (n → ∞) ϕₛ(ω) = e^(−ω²/2)

Applying the limit definition of the exponential function, lim (n → ∞) (1 + x/n)^n = e^x, with x = −ω²/2, the limit simplifies to e^(−ω²/2), which is exactly the characteristic function of a standard normal distribution.
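The limit being invoked here is lim (n → ∞) (1 + x/n)^n = e^x with x = −ω²/2. A quick numeric check; the value of ω is arbitrary:

```python
import math

w = 1.5
exact = math.exp(-w * w / 2)

# (1 - w^2 / (2n))^n approaches e^{-w^2 / 2} as n grows.
for n in (10, 100, 10_000):
    approx = (1 - w * w / (2 * n)) ** n
    print(n, approx, abs(approx - exact))
```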

Since the resulting characteristic function corresponds to a standard normal distribution, N(0, 1), we can conclude that the distribution of the normalized sum (Sₙ — nμ) / (σ√n) converges to a standard normal distribution as n approaches infinity:

(Sₙ − nμ) / (σ√n) → N(0, 1) as n → ∞
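Putting the whole theorem to a numerical test: the sketch below standardizes sums of Uniform(0, 1) draws and checks that roughly 95% of them land inside the standard normal interval [−1.96, 1.96]; the sample size, seed, and replication count are illustrative:

```python
import random

random.seed(11)

# Uniform(0, 1) has mu = 0.5 and sigma = sqrt(1/12).
mu, sigma = 0.5, (1 / 12) ** 0.5
n, reps = 50, 20_000

def z_score():
    """Standardized sum (S_n - n*mu) / (sigma * sqrt(n))."""
    s = sum(random.random() for _ in range(n))
    return (s - n * mu) / (sigma * n ** 0.5)

zs = [z_score() for _ in range(reps)]

# For a standard normal, P(-1.96 <= Z <= 1.96) is about 0.95.
coverage = sum(-1.96 <= z <= 1.96 for z in zs) / reps
print(coverage)
```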
