Verifying Central Limit Theorem using Python

S Joel Franklin
Published in Analytics Vidhya
3 min read · Nov 10, 2019
Photo by Luke Chesser on Unsplash

Let’s first look at what Wikipedia says about the Central Limit Theorem: ‘The Central Limit Theorem states that in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed.’ Sounds confusing?

Let’s put it in simple words. If we repeatedly draw samples from a population with replacement (replacement keeps the draws independent of each other), the distribution of the sample means (also known as the sampling distribution of the sample mean) approaches a normal distribution as the sample size gets larger, whatever the shape of the population distribution, provided the population has a finite variance. Hard to believe? Let’s try verifying the Central Limit Theorem using Python.
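
For readers who prefer symbols, the classical form of the theorem can be sketched as follows. It assumes the draws X_1, ..., X_n are independent and identically distributed with mean μ and finite variance σ²; these symbols are notation introduced here and do not appear in the rest of the article.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

Informally, for large n the sample mean is approximately normal with mean μ and standard deviation σ/√n.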

We import the necessary packages and define a population of 1,000,000 random numbers. The numbers are drawn uniformly from the interval [0, 1), so the population itself is far from normally distributed.

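# Note: numpy.random is imported under the alias np, so np.rand is numpy.random.rand
import numpy.random as np
import seaborn as sns
import matplotlib.pyplot as plt
population_size = 1000000
# Uniform random numbers in [0, 1)
population = np.rand(population_size)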
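
Before resampling, it can help to confirm what the population looks like. The sketch below is an illustration rather than part of the original walkthrough: it imports NumPy under its usual alias (unlike the listing above), uses a seeded `default_rng` generator purely for reproducibility, and compares the population mean and variance with the values expected for a uniform distribution on [0, 1), namely 0.5 and 1/12.

```python
import numpy as np

# Assumption: a seeded Generator is used only to make this sketch reproducible.
rng = np.random.default_rng(0)
population = rng.random(1_000_000)  # uniform draws on [0, 1)

population_mean = population.mean()
population_var = population.var()

print(f"mean     = {population_mean:.4f} (theory: 0.5)")
print(f"variance = {population_var:.4f} (theory: {1/12:.4f})")
```

A flat population like this makes a convenient test case, because any bell shape that appears later must come from the averaging and not from the population.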

We set the number of samples drawn from the population (with replacement) to 10000. For now, ‘sample_means’ is filled with random placeholder values; later it will store the mean of each sample drawn from the population. We start with a ‘sample_size’ of 1 and will experiment with other values later.

number_of_samples = 10000
sample_means = np.rand(number_of_samples)
sample_size = 1

We run a ‘for’ loop 10000 times. In each iteration, ‘c’ holds ‘sample_size’ random integer indices into the population. The sample is drawn from the population at those indices and its mean is stored in ‘sample_means’.

for i in range(number_of_samples):
    # randint excludes the upper bound, so indices run from 0 to population_size - 1
    c = np.randint(0, population_size, sample_size)
    sample_means[i] = population[c].mean()
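
The same resampling can be done without an explicit Python loop. The sketch below is one possible vectorised variant, not the author's code: it draws all indices at once as a two-dimensional array and averages along each row. The seeded generator and the choice of `sample_size = 30` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is an arbitrary choice for reproducibility
population = rng.random(1_000_000)

number_of_samples = 10_000
sample_size = 30

# One row of indices per sample; integers() excludes the upper bound,
# so every index from 0 to population.size - 1 can be drawn.
indices = rng.integers(0, population.size, size=(number_of_samples, sample_size))

# Fancy indexing gives a (number_of_samples, sample_size) array of values;
# averaging along axis 1 yields one mean per sample.
sample_means = population[indices].mean(axis=1)

print(sample_means.shape)
```

This trades memory for speed: the index array holds number_of_samples × sample_size integers, which is small here but grows with both settings.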

The following lines of code plot the histogram and the density of the sample means.

# Left panel: histogram of the sample means
plt.subplot(1,2,1)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
# distplot is deprecated in recent seaborn releases; histplot (seaborn 0.11+) replaces it
sns.histplot(sample_means, bins=int(180/5))
plt.title('Histogram of Sample mean',fontsize=20)
plt.xlabel('Sample mean',fontsize=20)
plt.ylabel('Count',fontsize=20)
# Right panel: kernel density estimate of the sample means
plt.subplot(1,2,2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
sns.kdeplot(sample_means)
plt.title('Density of Sample mean',fontsize=20)
plt.xlabel('Sample mean',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.subplots_adjust(bottom=0.1, right=2, top=0.9)
plt.show()

Now that we understand the code, let us look at the plots of the ‘sampling distribution of the sample mean’ for different values of the sample size.

Sample size = 1

Sample size = 2

Sample size = 5

Sample size = 10

Sample size = 30
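
The narrowing of the distribution across the sample sizes listed above can also be checked numerically. The sketch below is an illustration under assumptions (a seeded generator, a freshly generated uniform population, and vectorised index sampling rather than the loop shown earlier): it compares the observed standard deviation of the sample means with σ/√n, where σ is the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is an arbitrary choice for reproducibility
population = rng.random(1_000_000)
sigma = population.std()

number_of_samples = 10_000
spreads = {}
for sample_size in (1, 2, 5, 10, 30):
    # Draw all samples for this size at once, with replacement.
    idx = rng.integers(0, population.size, size=(number_of_samples, sample_size))
    sample_means = population[idx].mean(axis=1)
    spreads[sample_size] = sample_means.std()
    print(f"n={sample_size:>2}  observed std={spreads[sample_size]:.4f}  "
          f"sigma/sqrt(n)={sigma / np.sqrt(sample_size):.4f}")
```

If the two columns track each other closely, the shrinking width seen in the plots matches what the theorem leads us to expect.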

We can see that the distribution approaches normal as the sample size gets larger. In theory, the distribution is exactly normal only in the limit as the sample size tends to infinity. In practice, a common rule of thumb is to treat the distribution as approximately normal when the sample size is at least 30, although strongly skewed populations may need larger samples.
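
One way to probe how good the normal approximation is at a given sample size is to measure the skewness of the sample means, which is 0 for a perfectly symmetric distribution. The sketch below is an illustration and not part of the original experiment: the exponential population, the helper `skewness`, the seed, and the extra sample size of 200 are all assumptions chosen to show a case where the population is strongly skewed.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the third standardized moment (0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)  # the seed is an arbitrary choice for reproducibility
# Assumption: an exponential population, chosen because it is strongly right-skewed.
population = rng.exponential(scale=1.0, size=1_000_000)

number_of_samples = 10_000
skew_by_size = {}
for sample_size in (1, 5, 30, 200):
    # Draw all samples for this size at once, with replacement.
    idx = rng.integers(0, population.size, size=(number_of_samples, sample_size))
    skew_by_size[sample_size] = skewness(population[idx].mean(axis=1))
    print(f"n={sample_size:>3}  skewness of sample means = {skew_by_size[sample_size]:.3f}")
```

For independent draws, the skewness of the sample mean shrinks in proportion to 1/√n, so with a population this skewed some asymmetry should still be measurable at a sample size of 30.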
