Understanding The Central Limit Theorem for Data Science

Hrithick Sen
Published in Analytics Vidhya
Sep 1, 2020

The central limit theorem is one of the most important and fundamental theorems in statistics, and it is used extensively in data science and related fields. In this blog we will work through the central limit theorem step by step, with some Python code snippets.

Before jumping into the core idea of the central limit theorem, let’s discuss some very basic ideas in statistics.

What is a population in statistics?

In probability and statistics, a population is the complete set of observations that could be made. It is the entire set from which samples are drawn. In the case of human heights, the population is the set of heights of all the humans in the world.

What are a sample and a sampling distribution in statistics?

In simple terms, a sample is a set of observations drawn from the population distribution. For example, from the population of all human heights we randomly pick 10 heights.
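As a small illustration, the sketch below simulates a population of heights and draws a random sample of 10 from it. The population is synthetic: a Normal distribution with mean 170 cm and standard deviation 10 cm is assumed purely for this example, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic "population" of 100,000 heights in cm (assumed values, illustration only)
population = rng.normal(loc=170, scale=10, size=100_000)

# A sample: 10 heights picked at random from the population, without replacement
sample = rng.choice(population, size=10, replace=False)

print("Population mean:", population.mean())
print("Sample mean:", sample.mean())
```

The sample mean will usually be close to the population mean but not equal to it, and it changes every time a new sample is drawn. That variation is what the sampling distribution describes.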

Let’s talk about the sampling distribution now. Suppose you are given a population distribution, you randomly pick a sample of size n from it, and you repeat this m times in total. You end up with m samples, each of size n. You then calculate the mean of each individual sample, which gives you m sample means. The distribution of those sample means is called the sampling distribution of the sample mean.

What is Central Limit Theorem?

Short answer:

The central limit theorem tells us that if the mean (μ) and variance (σ²) of the population distribution are finite, then the sampling distribution of the sample means approaches a Normal distribution N(μ, σ²/n) as n → ∞, where n is the size of each sample.
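Written a little more formally, this is the classical statement for independent, identically distributed observations X_1, ..., X_n with mean μ and finite variance σ² > 0. The notation for the sample mean, \bar{X}_n, is introduced here only for this formula.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma}
\xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

In practice this is usually read as: for large n, the sample mean is approximately distributed as N(μ, σ²/n).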

Long answer:

Assume you have a random variable X which can have any distribution, as long as X has a finite mean and variance.

Step 1: Randomly pick a sample of size n from X and repeat this m times in total. You end up with m samples, each of size n.

Step 2: Calculate the mean of each individual sample (size = n), which gives you m sample means. To be clear, you now have m numbers, and each of them is the mean of one sample.

Step 3: Plot the distribution of the m sample means, and we are done.

Let’s perform the above steps in Python and see the output.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# defining the sample size and the number of samples we want to have
sample_size = 30
sample_number = 1000
sample_means = []

for i in range(sample_number):
    # randomly picking a sample from the population distribution
    # In this case the population distribution is a Uniform distribution on [1, 20)
    sample = np.random.uniform(1, 20, sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)

plt.figure(figsize=(8, 6))
# histogram of the sample means with a KDE curve on top
# (sns.distplot is deprecated in recent seaborn versions; histplot with kde=True replaces it)
sns.histplot(sample_means, bins=12, kde=True, stat="density")
plt.show()

Output:

The Kernel Density Estimate (KDE) in the plot looks like a Normal distribution, right? That’s what the Central Limit Theorem (CLT) is all about.

The Central Limit Theorem says that, as the sample size tends to infinity, the sampling distribution of the sample means approaches a Normal distribution whose mean equals the population mean and whose variance equals the variance of the population distribution divided by the size of each sample. This holds irrespective of the type of the population distribution, provided its mean and variance are finite.
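To see the "irrespective of the population distribution" part in action, here is a minimal sketch that uses a strongly right-skewed population (an Exponential distribution, chosen only for illustration) and measures how skewed the sample means are for different sample sizes. The helper skewness and the chosen sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def skewness(values):
    """Sample skewness: roughly 0 for a symmetric, bell-shaped distribution."""
    values = np.asarray(values)
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

number_of_samples = 5_000
skew_by_size = {}
for sample_size in (2, 30, 500):
    # Each row is one sample drawn from an Exponential population (right-skewed)
    samples = rng.exponential(scale=1.0, size=(number_of_samples, sample_size))
    sample_means = samples.mean(axis=1)
    skew_by_size[sample_size] = skewness(sample_means)
    print(f"n = {sample_size:>3}: skewness of sample means = {skew_by_size[sample_size]:.3f}")
```

For an Exponential population the skewness of the sample mean is 2/√n in theory, so the printed values should shrink toward 0 as n grows, which is the sampling distribution becoming more symmetric and more Normal-looking.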

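So, if the population distribution has mean μ and variance σ² (it does not have to be Normal), then the sampling distribution of the sample means approaches N(μ, σ²/n) as n → ∞, where n is the size of each sample. In practice, n ≥ 30 is a commonly used rule of thumb for the approximation to be reasonable, although heavily skewed populations can need larger samples.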
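The claim about the mean and variance can be checked numerically. The sketch below reuses the Uniform population on [1, 20) from the earlier snippet, for which μ = (1 + 20)/2 = 10.5 and σ² = (20 − 1)²/12 ≈ 30.08, and compares these with the mean and variance of the simulated sample means. The larger number of samples (100,000) is an assumption made here so that the estimates are stable.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

low, high = 1, 20
sample_size = 30
sample_number = 100_000

# Theoretical mean and variance of a Uniform(low, high) population
population_mean = (low + high) / 2            # 10.5
population_variance = (high - low) ** 2 / 12  # about 30.08

# Draw all samples at once: one row per sample, one column per observation
samples = rng.uniform(low, high, size=(sample_number, sample_size))
sample_means = samples.mean(axis=1)

print("Mean of sample means:    ", sample_means.mean())
print("Population mean (μ):     ", population_mean)
print("Variance of sample means:", sample_means.var())
print("σ² / n:                  ", population_variance / sample_size)
```

The two means should agree closely, and so should the two variances (σ²/n is about 1.0 here).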

But why should we care about the Central Limit Theorem?

The Central Limit Theorem is used a lot in data analysis tasks. By using it we can estimate the mean of a population with any type of distribution, as long as the mean and the variance of the population distribution are finite.

Suppose you want to find the average salary of all the humans in the world. It is not feasible to collect everyone’s salary, sum the values and divide by the total number of humans, right? With the CLT you don’t have to. You can take a reasonably large random sample and compute its mean, and because the sample mean is approximately Normally distributed around the population mean with variance σ²/n, you can also judge how close your estimate is likely to be. That is why the central limit theorem is powerful and important to know.
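Here is a minimal sketch of that idea. The salary "population" is synthetic (a log-normal distribution with arbitrarily chosen parameters, assumed only for illustration), and the approximate 95% interval uses the usual Normal-approximation recipe of sample mean ± 1.96 standard errors.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Synthetic, right-skewed "population" of salaries (assumed values, illustration only)
population = rng.lognormal(mean=10, sigma=0.5, size=1_000_000)

# In reality we only get to observe a random sample, here 1,000 salaries
sample_size = 1_000
sample = rng.choice(population, size=sample_size, replace=False)

sample_mean = sample.mean()
# Standard error: estimated σ / sqrt(n), using the sample standard deviation
standard_error = sample.std(ddof=1) / np.sqrt(sample_size)

# Approximate 95% confidence interval based on the Normal approximation from the CLT
z = 1.96
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error

print(f"Sample mean: {sample_mean:.2f}")
print(f"Approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
print(f"True population mean (normally unknown): {population.mean():.2f}")
```

In a real study the last line is not available. The interval is what tells you how far the sample mean is likely to be from the population mean.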

