Central Limit Theorem made easy!

Ankita Banerji
Published in Analytics Vidhya
6 min read · May 31, 2021

The Central Limit Theorem (CLT) is one of the most important topics in statistics, and it comes in handy in many real-world problems. In this blog, we will see what the Central Limit Theorem is and why it matters. We will also verify its properties by going through a Python demonstration.

Let us first discuss some terms:

A population is the entire group of similar items or individuals that we want to study. For example, all the people in India, if we are studying blood sugar levels in India.

A sample is a subset of the population, ideally chosen at random so that its characteristics resemble those of the population as a whole.

Suppose you are studying health care data, and you want to determine the average blood sugar level of all people in India. This means that you would have to analyse data on around 1.2 billion people. You cannot possibly ask every single person what their blood sugar level is. That would be extremely costly and time-consuming. So how do we get insights about the population when we don’t have information on everyone? The answer is that we can draw samples and use the sampling distribution.

Image by author

Sampling Distribution

Let us understand sampling distributions with the help of an example. Suppose there are 75 students in a class, and we need to find the average math score obtained by the students in the class. The exact mean score is 68.06; however, for this example, let’s assume that the mean is not known to us. The math scores of the 75 students are given below. Out of the 75 students, 5 students are chosen at random and their mean score is calculated.

Image by author

The students marked in red are the ones randomly selected. Let’s compute the mean of the math scores obtained by them, as shown below.

When we choose a different sample, the mean changes. If we repeat this for a large number of samples, say 100 (each of size 5), we obtain 100 sample means. The mean of all 100 sample means gives the mean of the sampling distribution.
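The procedure can be sketched in a few lines of Python. This is only an illustration: the 75 scores below are randomly generated stand-ins (the actual class scores are not reproduced here), and names such as `scores` and `sample_means` are made up for this sketch. It uses only the standard library.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical scores for a class of 75 students (NOT the article's data).
scores = [random.randint(35, 100) for _ in range(75)]
population_mean = statistics.mean(scores)

# Draw 100 samples of 5 students each (no student repeated within a sample)
# and record the mean score of every sample.
sample_means = [statistics.mean(random.sample(scores, 5)) for _ in range(100)]

# The mean of the 100 sample means is the mean of the sampling distribution.
mean_of_sample_means = statistics.mean(sample_means)

print(f"Population mean:      {population_mean:.2f}")
print(f"Mean of sample means: {mean_of_sample_means:.2f}")
```

Individual sample means can land far from the class mean, but their average should sit close to it, which is the behaviour the next sections examine on real data.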

Image by author

The image given below shows 100 such means and their distribution.

Sampling distribution

So, the sampling distribution of the sample mean is the probability distribution of the means of samples of a fixed size drawn from a population.

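A sampling distribution possesses certain useful properties, which are collectively called the Central Limit Theorem. The theorem states that no matter how the original population is distributed (as long as its variance is finite), the sampling distribution of the sample mean has the following three properties:

  1. Its mean equals the population mean.
  2. Its standard deviation (the standard error) equals the population standard deviation divided by the square root of the sample size, σ/√n.
  3. For sufficiently large samples (n > 30 is a common rule of thumb), it is approximately normal.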

Let us verify these properties with the help of an example in Python.

For this demonstration, we will be using a data set containing information about NBA players. The data set contains over two decades of data on each player who has been part of an NBA team. It captures demographic variables such as age, height, weight and place of birth, as well as biographical details such as the team played for, draft year and round. For this demonstration, we will only use the height feature.

# Import packages
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
np.random.seed(42) # With the seed, same set of numbers will appear every time.

# Population
NBS_Dataset = pd.read_csv('all_seasons.csv')
NBS_Dataset.head()

NBS_Dataset contains 11,145 rows and 22 columns. For this demonstration we will only consider the ‘player_height’ column.

# Extract only the height column
NBS_Dataset_height = NBS_Dataset[['player_height']]

Let us see the distribution of NBA players’ heights.

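# Note: distplot is deprecated in seaborn >= 0.11; sns.histplot(..., kde=True) is the newer equivalent.
sns.distplot(NBS_Dataset_height.player_height)
plt.show()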
Probability distribution of NBA player’s height

We can see that the distribution is close to normal.

NBS_Dataset_height.player_height.mean()
# Output: 200.8128
NBS_Dataset_height.player_height.std()
# Output: 9.19097

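These are the true mean and standard deviation of the population. (By default, pandas’ std() uses the sample formula with ddof=1; with more than 11,000 rows, the difference from the population formula is negligible.)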

Let us take a random sample (size = 30) from this data to analyse the sample mean.

samp_size = 30
NBS_Dataset_height.player_height.sample(samp_size).mean()
# Output: 196.84999999999994
NBS_Dataset_height.player_height.sample(samp_size).mean()
# Output: 198.28933333333327

Every time we take a sample, our mean value is different. There is variability in the sample mean itself. Does the sample mean itself follow a distribution? Let’s assess this.

Let us pick 1,000 random samples of size 30 from the entire data set and calculate the mean of each sample.

sample_means = [NBS_Dataset_height.player_height.sample(samp_size).mean() for i in range(1000)]
sample_means = pd.Series(sample_means)

Plot the distribution of all these sample means (This is our sampling distribution).

sns.distplot(sample_means)
plt.show()
Sampling Distribution

We can observe that the sampling distribution is nearly normal. Now we will compute the mean and standard deviation of this sampling distribution.

sample_means.mean()
# Output: 200.77

The mean of this sampling distribution (in other words, the mean of all the sample means that we had taken) came out to be 200.77. As you can see, this value is very close to the original population mean of 200.81. This demonstrates the first property of the Central Limit Theorem, i.e., the mean of the sampling distribution equals the population mean.

However, it would not be fair to infer that the population mean is exactly equal to any single sample mean. Because a sample contains only part of the population, random sampling variation always introduces some error. Therefore, the sample mean’s value must be reported with some margin of error.
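One common way to report that margin of error is an approximate 95% interval around the sample mean. The sketch below is a minimal illustration under stated assumptions: the heights are synthetic (drawn from a normal distribution with mean 200 and standard deviation 9, not read from the NBA file), the variable names are invented for this example, and the multiplier 1.96 is the usual normal-approximation value.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 heights in cm (synthetic, roughly normal).
population = [random.gauss(200, 9) for _ in range(10_000)]

# One random sample of 30 observations.
sample = random.sample(population, 30)
n = len(sample)
sample_mean = statistics.mean(sample)

# Estimated standard error: sample standard deviation divided by sqrt(n).
standard_error = statistics.stdev(sample) / math.sqrt(n)

# 1.96 is the 97.5th percentile of the standard normal distribution,
# which gives an approximate 95% interval.
margin_of_error = 1.96 * standard_error
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

print(f"Sample mean: {sample_mean:.2f} ± {margin_of_error:.2f}")
print(f"Approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```

For a sample as small as 30, a t-distribution multiplier (about 2.05 for 29 degrees of freedom) would give a slightly wider interval; 1.96 is used here only to keep the sketch simple.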

sample_means.std()
# Output: 1.70
NBS_Dataset_height.player_height.std()/np.sqrt(samp_size)
# Output: 1.68

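Similarly, when we calculated the standard deviation of the sampling distribution, we observed that it (1.70) is very close to the population standard deviation divided by the square root of the sample size (9.19/√30 ≈ 1.68).

This verifies the second property of the Central Limit Theorem, i.e., the standard deviation of the sampling distribution (the standard error) equals the population standard deviation divided by the square root of the sample size.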
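For readers wondering where the square root comes from, here is a short derivation sketch. The symbols are assumptions introduced for this note: X_1, ..., X_n are independent draws from a population with standard deviation σ, and X̄ is their mean. Independence holds only approximately here, since pandas samples without replacement, but the sample of 30 is tiny compared with the 11,145 rows.

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^{2}}{n}
\quad\Longrightarrow\quad
\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
  = \frac{9.19097}{\sqrt{30}} \approx 1.68
```

The last step plugs in the population standard deviation and sample size used in the demonstration, which reproduces the 1.68 computed in the code.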

Now that we have verified these two properties, let us observe the effect of sample size on the resulting sampling distribution. In this demonstration, we will observe that as the sample size increases, the underlying sampling distribution will approximate a normal distribution.

sample_sizes = [3, 10, 30, 50, 100, 200]
plt.figure(figsize=[10,7])
for ind, samp_size in enumerate(sample_sizes):
    # 500 sample means for each sample size
    sample_means = [NBS_Dataset_height.player_height.sample(samp_size).mean() for i in range(500)]
    plt.subplot(2,3,ind+1)
    sns.distplot(sample_means, bins=25)
    plt.title("Sample size: "+str(samp_size))
plt.show()

We can observe that as the sample size grows, the sampling distribution looks more and more normal; once the sample size exceeds about 30, it is very close to a normal curve.

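This verifies the third property of the Central Limit Theorem, i.e., for sufficiently large samples (roughly more than 30), the sampling distribution is approximately normal.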
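The NBA heights are already close to normal, so the effect is easier to appreciate on a population that is clearly not normal. The sketch below is an illustration under assumptions, not part of the original demonstration: it uses a synthetic, strongly right-skewed population (exponential with mean 1), a hypothetical helper `skewness`, and only the standard library. Skewness is 0 for a symmetric distribution, so values moving toward 0 indicate a shape moving toward normal.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable


def skewness(values):
    """Third standardized moment; 0 for perfectly symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical, strongly right-skewed population (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)
print(f"Population: mean={pop_mean:.3f}, sd={pop_sd:.3f}, "
      f"skew={skewness(population):.2f}")

results = {}
for n in [3, 30, 200]:
    # 2,000 sample means for each sample size (drawn with replacement).
    means = [statistics.fmean(random.choices(population, k=n))
             for _ in range(2000)]
    results[n] = (statistics.fmean(means),
                  statistics.stdev(means),
                  skewness(means))
    print(f"n={n:>3}: mean={results[n][0]:.3f}, "
          f"sd={results[n][1]:.3f} "
          f"(sigma/sqrt(n)={pop_sd / math.sqrt(n):.3f}), "
          f"skew={results[n][2]:.2f}")
```

In a run of this sketch, the mean of the sample means should stay near the population mean, their standard deviation should track σ/√n, and the skewness should shrink as n grows, which mirrors the three properties on a population that starts out far from normal.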

Conclusion

Now that we have understood the properties of the Central Limit Theorem, we can use it to infer the population mean from the sample mean.

LinkedIn: https://www.linkedin.com/in/ankita-banerji-8940369b/

Recommended Articles

  1. Hypothesis Testing On Linear Regression
  2. Gradient Decent in Linear Regression
  3. Selecting Number Of Clusters in K-Mean Clustering
