Sampling Variability and Central Limit Theorem

Suppose we have a sample of the heights of 10,000 Pakistani men, with a mean height of 1.74 m and a standard deviation of 0.09 m. Then:

sample mean = 1.74 m

sample standard deviation = 0.09 m

But what we are really interested in is the population mean and the population standard deviation. This is where the Central Limit Theorem comes in.

Central Limit Theorem

The distribution of the sample mean is nearly normal, centered at the population mean, with a standard deviation (called the standard error, SE) equal to the population standard deviation divided by the square root of the sample size.

x̄ ~ N(mean = μ, SE = σ/√n)
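To make the formula concrete, here is a minimal sketch that computes the standard error for the heights example above. It assumes the sample standard deviation (0.09 m) is a reasonable stand-in for the unknown σ, and the function name `standard_error` is only illustrative.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sigma / math.sqrt(n)

# Heights example: s = 0.09 m stands in for the unknown sigma (an assumption)
se_heights = standard_error(0.09, 10_000)
print(f"SE of the mean height: {se_heights:.4f} m")
```

With 10,000 observations the standard error is a hundred times smaller than the standard deviation of the individual heights, which is why such a large sample pins down the mean so tightly.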

Sampling Statistics

Suppose we repeatedly draw random samples of size n from the population without replacement and calculate the mean of each sample. These means are called sample statistics. If we plot the distribution of these sample statistics, it will be nearly normal; this distribution is known as the sampling distribution.

So as the sample size increases, the standard error decreases, i.e. to get less variability around the mean in our sampling distribution, we can increase our sample size.
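A quick simulation can illustrate this. The sketch below uses a made-up, right-skewed population (exponential with mean 40, not the dataset used later in this post) and a hypothetical helper `empirical_se`, and compares the observed spread of the sample means with σ/√n for a few sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population, for illustration only
population = rng.exponential(scale=40, size=20_000)

def empirical_se(sample_size, n_repeats=1000):
    """Standard deviation of the means of n_repeats random samples."""
    means = [rng.choice(population, size=sample_size, replace=False).mean()
             for _ in range(n_repeats)]
    return float(np.std(means))

for n in (10, 40, 160):
    theory = population.std() / np.sqrt(n)
    print(f"n={n:>3}  empirical SE={empirical_se(n):.2f}  sigma/sqrt(n)={theory:.2f}")
```

Each fourfold increase in n should roughly halve the standard error, since the sample size sits under a square root.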

Note that the more skewed the population distribution is, the larger the sample size required to get a nearly normal sampling distribution.
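One way to see this is to measure how skewed the sample means still are at different sample sizes. The following is a rough sketch, assuming an exponential population (which is strongly right skewed); the helpers `skewness` and `sample_mean_skewness` are names made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

def skewness(values):
    """Sample skewness: mean cubed z-score (0 for a symmetric distribution)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))

def sample_mean_skewness(sample_size, n_repeats=5000):
    """Skewness of the means of n_repeats samples from an exponential population."""
    # Drawing straight from the distribution stands in for sampling a very large population
    draws = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
    return skewness(draws.mean(axis=1))

for n in (5, 30, 200):
    print(f"n={n:>3}  skewness of sample means={sample_mean_skewness(n):.2f}")
```

The skewness of the sample means should shrink toward 0 as n grows, but for a population this skewed it is still clearly visible at small n.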

Conditions For Central Limit Theorem

  1. Sampled observations must be independent.
  2. If sampling without replacement, then n < 10% of the population.
  3. The sample should not be too small. If the sample size is too small, we will not get a nearly normal sampling distribution. n > 30 is a general rule of thumb.
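The two size conditions above can be checked mechanically. Here is a small sketch with a hypothetical helper, `check_clt_conditions`, and made-up population sizes. Independence depends on how the data was collected, so it cannot be verified from the sizes alone and is left out.

```python
def check_clt_conditions(sample_size, population_size, with_replacement=False):
    """Return a dict of rule-of-thumb checks; True means the check passes."""
    return {
        # Without replacement, keep the sample under 10% of the population
        "under_10_percent": with_replacement or sample_size < 0.10 * population_size,
        # Rule of thumb for the minimum sample size
        "large_enough": sample_size > 30,
    }

print(check_clt_conditions(sample_size=1000, population_size=50_000))
print(check_clt_conditions(sample_size=25, population_size=200))
```

The first call passes both checks, while the second fails both: 25 is more than 10% of 200 and is also below the rule-of-thumb minimum.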

Demonstration of Central Limit Theorem in Python

Importing the necessary packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading our dataset

age = pd.read_csv("heights.csv")

age.head()

first five rows of our dataset

print('Population mean age: ' + str(age['age'].mean()), 'Population age standard deviation: ' + str(age['age'].std()))

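plt.title('Population Age Distribution')
# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(age['age'], kde=True)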

The population age distribution is right-skewed.

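Choosing 1000 random observations from our population and splitting them into samples of 30.

samples = age['age'].sample(n=1000, replace=False).tolist()

# Split the observations into samples of 30; the leftover incomplete group is dropped
n_samples = [samples[x:x+30] for x in range(0, len(samples) - 29, 30)]

# Mean of each sample of 30
n_samples_means = np.array([sum(i)/len(i) for i in n_samples])

# Mean of the sample means, i.e. the center of the sampling distribution
sampling_mean = n_samples_means.mean()

# Standard deviation of the sample means, i.e. the standard error
standard_error = n_samples_means.std()

print('Sampling distribution mean: ' + str(sampling_mean), 'Standard error: ' + str(standard_error))

sns.histplot(n_samples_means, kde=True)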

The sampling distribution mean (41.23) is approximately equal to the population mean (41.37), but the standard error (3.06) is much smaller than the population standard deviation (15.86), because averaging 30 observations at a time smooths out much of the variability in individual ages.

According to the Central Limit Theorem, Standard Error = σ/√n.

Standard Error = σ/√n = 15.86/√30 ≈ 2.90, which is approximately equal to the standard error we got, 3.06.
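The comparison can be reproduced with a few lines of arithmetic. This sketch simply plugs in the figures reported above; the variable names are illustrative.

```python
import math

population_sd = 15.86   # population standard deviation reported above
sample_size = 30        # number of observations in each sample
observed_se = 3.06      # standard error measured from the sampling distribution

theoretical_se = population_sd / math.sqrt(sample_size)
relative_gap = abs(observed_se - theoretical_se) / theoretical_se

print(f"sigma/sqrt(n) = {theoretical_se:.2f}")
print(f"observed SE is {relative_gap:.1%} away from the theoretical value")
```

A gap of a few percent is not surprising here: the observed standard error is itself estimated from only a few dozen sample means, so it carries some noise of its own.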

We have just seen a demonstration of the Central Limit Theorem. Our sampling distribution is nearly normal, with mean approximately equal to the population mean and standard error approximately equal to the population standard deviation divided by the square root of the sample size.

Why Central Limit Theorem?

  1. The Central Limit Theorem helps us estimate unknown population parameters, such as the population mean, from a sample.
  2. Since the sampling distribution is nearly normal, we can apply z-scores to it.
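As an illustration of both points, the sketch below goes back to the heights example from the start of this post. It assumes the sample standard deviation can stand in for the unknown σ, and the claimed population mean of 1.75 m is a made-up value used only to show a z-score calculation.

```python
import math

sample_mean = 1.74   # metres, from the heights example
sample_sd = 0.09     # metres; used in place of the unknown sigma (an assumption)
n = 10_000

se = sample_sd / math.sqrt(n)

# 95% confidence interval: sample mean +/- 1.96 standard errors
z_star = 1.96
ci_low, ci_high = sample_mean - z_star * se, sample_mean + z_star * se
print(f"95% CI for the mean height: ({ci_low:.4f}, {ci_high:.4f}) m")

# z-score of the sample mean under a hypothetical claimed population mean
claimed_mean = 1.75
z = (sample_mean - claimed_mean) / se
print(f"z-score against a claimed mean of {claimed_mean} m: {z:.2f}")
```

The interval is an estimate of the unknown population mean, and the z-score tells us how many standard errors the sample mean lies from the claimed value.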

End Note

This is just the first version. I will write in more detail about the Central Limit Theorem in the next version. Thanks for reading :)