Stat Digest: What is the idea behind the sampling distribution?

AI/Data Science Digest
Published in Geek Culture
7 min read, Feb 27, 2023

Why do we actually need sampling distributions in practice?

How to find the population parameter (i.e. the mean age of people living in the USA)? (Image by Author)

How can we find the mean age of people living in the USA?

Find out the age of each and every person living in the USA and take the average. Simple, isn’t it?

But there is one problem. It would be difficult, if not impossible, to find the age of each person in the population.

In other words, it is really difficult to measure a population parameter.

Instead, with the magic of statistics, we first compute a statistic (such as the mean) from a random sample and then use that statistic to estimate the population parameter.

Depending on the random sample we take, we may end up with a different sample mean each time. From these varying sample means, how do we estimate the population mean?

Estimating the population mean from sample means

This is where the sampling distribution of the sample mean helps.

We draw a sample of size m, and compute the sample mean. We repeat the process n times. Then, when we plot the frequencies of the means, we get the sampling distribution.

The mean of this sampling distribution is a good estimate of the population mean.

This is closely connected to the Central Limit Theorem (CLT). The CLT says that when you draw many samples, compute the mean of each one, and plot the frequencies of those means, the result is approximately a normal distribution. As the size of each sample increases, the distribution gets closer to normal.

This is a very powerful concept in statistics.

Let’s cement our understanding through a simulated experiment.

Assume that we have a population of 10,000 people and we want to find out the mean age of our population.

The following code simulates 10,000 people’s age:

import numpy as np
np.random.seed(10) #fix the randomness for reproducibility
population = np.random.randint(1, 85, size=10000) #randomly assign age
The age distribution of the population (frequency plot) (Image by Author)

In this case, we can actually compute the population mean which is 42.4. Let’s use this to check how well our sampling distribution estimates the same parameter.
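Here is a quick way to check that number. This is only a sketch: it re-creates the population with the same seed as above, and the names population_mean and theoretical_mean are mine, not part of the original code.

```python
import numpy as np

np.random.seed(10)  # same seed as above, so the population matches
population = np.random.randint(1, 85, size=10000)  # ages 1 to 84 (upper bound is exclusive)

# Mean over every member of the population: the parameter we want to estimate
population_mean = population.mean()

# For ages drawn uniformly from 1..84, the expected value is (1 + 84) / 2
theoretical_mean = (1 + 84) / 2

print(f"population mean: {population_mean:.2f}")
print(f"expected mean of the generator: {theoretical_mean}")
```

The simulated population mean should land close to the generator's expected value of 42.5, with a small gap due to random variation.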

Now let’s draw 1000 samples of size 100.

sample_means = []
for i in range(1000):
    sample = np.random.choice(population, size=100, replace=False)
    sample_mean = np.mean(sample)
    sample_means.append(sample_mean)

The sample means are as follows:

45.77, 37.79, 42.36, 43.17, 44.54, 42.17, 43.05, 42.96, 43.84, 43.13, 41.16, 41.96, 43.5, 43.25, 41.04, 42.9, 39.15, 41.33, 42.99, 46.53, 42.87, 43.47, 40.96, 45.16, 38.93, 45.7, 45.17, 42.81, 42.07, 43.21, 41.18, 49.63, 41.72, 42.71, 44.75, 41.51, 40.17, 38.96, 39.12, 42.69, 38.53, 44.66, 45.1, 40.97, 41.53, 39.56, 42.17, 38.73, 41.67, 43.09, 41.25, 43.29, 40.68, 43.75, 36.62, 44.37, 44.39, 42.73, 38.72, 43.76, 41.0, 39.95, 42.87, 43.85, 39.61, 41.73, 44.13, 38.96, 43.54, 45.68, 46.5, 37.48, 45.37, 44.28, 49.92, 38.67, 39.29, 38.28, 44.34, 40.38, 41.11, 40.85, 41.1, 44.3, 41.78, 39.93, 39.28, 42.58, 43.06, 46.03, 37.21, 45.47, 40.96, 45.9, 39.71, 41.38, 38.48, 41.53, 42.33, 42.38, 40.65, 39.34, 40.1, 43.11, 42.22, 37.73, 39.09, 37.65, 43.47, 41.74, 40.17, 42.07, 44.83, 45.41, 40.24, 41.25, 46.01, 44.55, 42.07, 42.88, 37.93, 43.43, 41.45, 42.63, 39.9, 41.66, 43.93, 45.65, 41.05, 42.77, 48.09, 41.74, 38.07, 41.71, 42.47, 41.31, 43.97, 42.83, 37.01, 42.77, 43.2, 42.42, 45.24, 39.13, 48.19, 40.04, 38.0, 42.31, 42.42, 37.03, 44.92, 40.9, 40.15, 43.18, 41.95, 43.09, 41.16, 43.62, 44.19, 42.42, 39.71, 46.16, 39.47, 40.78, 43.63, 42.45, 41.41, 35.61, 44.37, 41.03, 42.02, 41.53, 41.43, 42.81, 41.66, 42.11, 43.94, 44.11, 42.84, 47.02, 43.04, 42.1, 42.21, 41.77, 41.77, 43.17, 43.3, 40.74, 41.73, 43.19, 42.25, 44.5, 38.48, 44.3, 43.3, 43.71, 44.12, 43.85, 45.26, 42.94, 41.48, 39.01, 41.52, 44.36, 44.26, 38.2, 40.02, 38.73, 44.63, 44.66, 45.55, 39.02, 41.13, 45.17, 39.36, 38.84, 40.3, 42.6, 38.77, 40.67, 44.37, 40.05, 41.84, 42.87, 43.71, 40.2, 43.96, 42.3, 41.17, 46.58, 42.78, 39.36, 41.42, 41.52, 38.18, 43.24, 40.16, 41.25, 43.86, 43.66, 37.29, 39.41, 39.91, 44.81, 39.71, 41.59, 42.85, 39.05, 41.8, 43.67, 42.3, 42.31, 43.66, 48.58, 37.53, 40.81, 43.96, 45.4, 41.33, 41.92, 38.01, 44.05, 37.03, 47.64, 42.52, 42.45, 43.09, 45.76, 43.98, 47.14, 39.73, 47.8, 44.61, 42.72, 45.98, 40.12, 38.47, 42.26, 41.25, 43.52, 41.51, 42.3, 42.82, 43.39, 39.71, 41.74, 43.86, 40.7, 39.9, 41.99, 
42.23, 38.96, 43.9, 41.01, 44.84, 39.63, 40.33, 45.48, 41.5, 42.35, 46.83, 44.84, 43.8, 46.14, 39.43, 40.81, 44.73, 41.95, 45.47, 44.32, 38.56, 39.7, 40.38, 45.46, 45.52, 39.78, 38.39, 40.57, 40.0, 41.92, 38.54, 46.73, 43.3, 44.19, 43.89, 45.13, 42.29, 43.85, 46.19, 40.71, 38.3, 41.35, 44.23, 38.92, 46.5, 43.28, 43.77, 41.8, 40.46, 45.86, 38.31, 43.03, 42.74, 38.48, 44.22, 38.26, 40.71, 44.27, 42.62, 42.41, 39.51, 42.9, 40.34, 43.34, 46.75, 45.22, 41.1, 37.99, 42.39, 44.44, 42.18, 40.71, 38.73, 44.81, 44.8, 44.67, 43.1, 46.6, 44.27, 44.08, 44.25, 42.07, 43.87, 43.68, 40.41, 46.17, 39.84, 43.22, 38.33, 42.09, 41.34, 41.19, 42.45, 41.15, 44.04, 42.04, 43.11, 45.89, 36.52, 46.86, 41.52, 42.76, 39.55, 39.28, 43.63, 41.1, 41.98, 42.51, 40.26, 42.01, 40.58, 43.23, 43.53, 43.12, 42.7, 36.7, 43.67, 44.21, 38.76, 40.09, 41.69, 45.02, 44.48, 42.95, 42.54, 41.82, 42.85, 44.56, 45.24, 42.18, 42.12, 42.43, 44.39, 40.04, 40.94, 39.72, 38.22, 42.04, 43.12, 45.49, 41.99, 40.59, 43.54, 45.28, 41.13, 35.13, 40.37, 42.09, 45.92, 46.2, 42.03, 43.45, 40.21, 42.76, 44.27, 38.68, 43.43, 47.81, 43.14, 41.99, 42.75, 40.55, 44.09, 43.5, 40.55, 42.86, 40.81, 47.53, 42.69, 40.42, 42.16, 44.41, 39.52, 41.21, 44.01, 41.95, 44.84, 45.63, 44.35, 44.27, 40.68, 40.43, 40.22, 41.16, 46.33, 40.42, 43.61, 42.52, 42.97, 44.02, 46.23, 40.72, 45.9, 39.99, 36.99, 42.76, 42.38, 40.96, 40.98, 40.71, 45.24, 44.05, 40.2, 39.55, 40.89, 44.57, 43.91, 44.67, 45.23, 45.01, 40.44, 43.11, 43.77, 43.91, 40.79, 42.84, 45.26, 41.6, 43.8, 45.97, 38.88, 42.77, 41.28, 41.95, 40.86, 40.16, 43.1, 39.89, 40.24, 40.77, 41.31, 41.5, 45.53, 42.23, 43.31, 43.63, 42.05, 42.96, 45.48, 40.54, 40.1, 43.14, 43.48, 43.22, 46.38, 45.13, 40.45, 39.85, 38.83, 41.69, 41.44, 46.54, 40.92, 41.71, 39.84, 47.26, 43.57, 41.52, 36.78, 41.76, 44.84, 42.13, 39.66, 43.85, 39.84, 43.7, 41.0, 43.28, 39.79, 40.81, 43.26, 40.92, 44.76, 45.72, 40.86, 45.97, 42.03, 39.08, 43.9, 46.4, 40.09, 41.34, 44.33, 39.31, 47.33, 44.45, 42.36, 43.2, 41.31, 44.19, 
47.35, 44.31, 42.74, 43.8, 42.92, 42.5, 41.82, 45.28, 43.11, 40.06, 35.92, 43.35, 39.47, 43.57, 44.62, 43.55, 42.32, 44.07, 44.17, 44.61, 40.03, 43.39, 41.99, 42.19, 43.08, 46.9, 43.07, 38.7, 42.76, 43.86, 40.41, 41.59, 42.79, 45.48, 43.66, 46.7, 45.75, 42.02, 42.04, 42.55, 41.41, 43.69, 40.71, 43.35, 38.87, 38.71, 45.03, 42.79, 44.13, 43.37, 43.46, 40.68, 43.96, 42.17, 39.96, 44.09, 42.66, 42.14, 44.63, 47.1, 43.58, 44.18, 41.93, 41.01, 43.5, 40.69, 42.17, 42.85, 46.52, 44.07, 37.66, 43.34, 40.0, 43.27, 42.01, 42.67, 44.09, 42.73, 39.06, 38.67, 42.59, 45.0, 39.42, 40.08, 41.99, 44.67, 44.59, 45.83, 44.33, 45.57, 43.53, 41.68, 42.28, 39.61, 40.51, 41.37, 44.19, 44.01, 43.32, 45.8, 44.93, 37.13, 41.74, 41.75, 44.11, 42.66, 45.29, 37.28, 40.97, 41.07, 41.71, 40.3, 44.28, 42.83, 40.42, 41.67, 41.34, 46.39, 41.3, 38.77, 44.4, 39.44, 41.48, 44.38, 43.62, 43.71, 41.16, 41.42, 40.24, 44.9, 43.61, 39.94, 42.41, 42.75, 42.88, 45.93, 42.41, 46.56, 41.63, 41.91, 40.56, 39.95, 42.77, 44.0, 46.15, 43.65, 36.15, 41.63, 39.69, 45.78, 45.16, 48.48, 41.25, 43.4, 43.19, 43.42, 43.94, 38.53, 43.0, 40.81, 38.53, 41.22, 41.99, 41.52, 44.23, 44.52, 45.68, 39.79, 43.97, 43.53, 42.04, 41.32, 44.05, 41.56, 42.28, 41.46, 44.24, 42.93, 41.01, 42.74, 42.58, 43.0, 44.43, 43.06, 46.23, 44.38, 40.87, 40.54, 40.24, 40.69, 43.72, 39.36, 43.97, 43.91, 41.85, 42.1, 40.37, 43.51, 42.23, 41.8, 43.01, 43.13, 42.39, 40.28, 43.75, 39.42, 41.35, 39.76, 40.44, 45.25, 41.43, 42.62, 46.86, 38.8, 46.92, 41.44, 43.43, 42.61, 39.13, 40.74, 45.72, 46.25, 39.49, 40.91, 43.17, 39.12, 42.07, 42.23, 37.84, 42.7, 42.26, 43.43, 40.28, 41.47, 44.95, 44.13, 40.58, 34.72, 43.4, 39.91, 40.79, 43.73, 44.32, 43.99, 45.16, 45.03, 46.31, 43.15, 43.76, 44.39, 39.86, 43.68, 39.14, 45.06, 46.14, 45.06, 40.77, 42.28, 42.29, 42.48, 42.57, 45.61, 40.2, 40.42, 40.18, 43.61, 44.64, 41.06, 44.92, 39.27, 40.32, 42.48, 38.64, 41.85, 42.72, 44.36, 39.16, 41.26, 45.76, 46.78, 42.78, 42.86, 38.75, 45.37, 43.67, 43.23, 42.73, 46.28, 44.38, 
40.27, 43.9, 46.4, 40.27, 45.85, 47.91, 39.76, 46.48, 41.37, 37.14, 41.33, 44.99, 44.17, 40.77, 41.71, 37.73, 42.28, 44.14, 44.59, 40.46, 41.67, 42.32, 43.65, 40.08, 44.18, 43.17, 43.52, 38.55, 42.61, 40.12, 43.04, 39.04, 43.29, 43.61, 41.91, 42.67, 39.67, 43.06, 40.62, 41.9, 41.74, 42.4, 46.65, 42.39, 45.86, 48.0, 43.01, 45.4, 45.64, 44.54, 39.8, 46.54, 39.53, 45.24, 44.57, 38.43, 46.81, 43.85, 43.9, 41.21, 39.08, 40.38, 40.74, 41.64, 40.93, 45.65, 40.94, 40.86, 43.18, 47.58, 35.7, 39.52, 42.77, 41.57, 42.32, 40.11, 41.79, 39.49, 42.29, 43.09, 40.1, 41.79, 39.91, 47.14, 43.39, 40.86, 41.74, 43.76, 42.77, 44.76, 45.1, 46.77, 41.27, 42.94, 41.69, 43.08, 43.94, 43.21, 40.22, 40.31, 44.43, 43.39, 42.85, 40.39, 44.2, 40.6, 46.22, 40.06, 44.9, 44.87, 42.54, 40.71, 38.09, 44.29, 42.68, 43.35, 44.25, 41.96, 43.81, 39.72, 46.34, 41.79, 37.61, 42.24, 41.68, 43.73, 40.68, 40.25, 40.24, 44.99, 42.48

As the CLT suggests, our sampling distribution looks like a bell curve. The mean of this distribution is 42.39, very close to the population parameter of 42.4!
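If you want to go one step further, you can also compare the spread of the sample means with what theory predicts. The sketch below repeats the experiment in a self-contained way and compares the observed standard deviation of the sample means with the standard error, sigma / sqrt(m). Because we sample without replacement, I include a finite population correction. The variable names (mean_of_means, observed_se, theoretical_se) are my own choices for illustration.

```python
import numpy as np

np.random.seed(10)
population = np.random.randint(1, 85, size=10000)

m, n = 100, 1000  # sample size and number of samples, as in the experiment above

# Draw n samples of size m and keep only each sample's mean
sample_means = np.array([
    np.random.choice(population, size=m, replace=False).mean()
    for _ in range(n)
])

mean_of_means = sample_means.mean()
observed_se = sample_means.std(ddof=1)  # spread of the sampling distribution

# Theoretical standard error: sigma / sqrt(m), times a finite population
# correction because each sample is drawn without replacement
N = population.size
sigma = population.std()  # population standard deviation (ddof=0)
theoretical_se = sigma / np.sqrt(m) * np.sqrt((N - m) / (N - 1))

print(f"population mean:         {population.mean():.2f}")
print(f"mean of sample means:    {mean_of_means:.2f}")
print(f"observed std of means:   {observed_se:.2f}")
print(f"theoretical std error:   {theoretical_se:.2f}")
```

The two spread values should be close to each other, which is what lets us attach an uncertainty to an estimate even when we only have a single sample.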

Another important point here is that, irrespective of the distribution of the population (almost uniform in this case, and in general any distribution with finite variance), the resulting sampling distribution has the bell curve of a normal distribution. Isn’t that cool? Doesn’t that help explain why so many phenomena in nature follow the normal distribution?
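To see that this is not special to a near-uniform population, here is a small sketch with a strongly right-skewed population. The exponential population, the seed, and the helper functions skewness and sample_means_for are all assumptions I made for this illustration; they are not part of the experiment above. Skewness is about 0 for a symmetric bell curve, so it should shrink toward 0 as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# A strongly right-skewed population (hypothetical, not the age data above)
population = rng.exponential(scale=10.0, size=10000)

def skewness(x):
    """Sample skewness: about 0 for a symmetric, bell-shaped distribution."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.std(x)**3

def sample_means_for(population, sample_size, n_samples=2000):
    """Means of n_samples random samples drawn without replacement."""
    return np.array([
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])

print(f"skewness of the population: {skewness(population):.2f}")

skew_by_size = {}
for size in (2, 10, 100):
    skew_by_size[size] = skewness(sample_means_for(population, size))
    print(f"sample size {size:>3}: skewness of sample means = {skew_by_size[size]:.2f}")
```

Even though the population itself is far from symmetric, the sample means become more symmetric as the sample size increases.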

Bonus:

What if we take samples of size 1000 instead of 100?

Yes, you guessed it right. The variation in the sample means around the population mean will be lower with a larger sample size.

Notice that the sample means now fall mostly between 40.5 and 44.5, compared with roughly 35 to 49 in the previous case.
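If you want to try this yourself, the comparison can be sketched as follows. It reuses the same simulated population; the helper draw_sample_means is a name I introduced for illustration, and the exact minimum and maximum you see will depend on the random draws.

```python
import numpy as np

np.random.seed(10)
population = np.random.randint(1, 85, size=10000)

def draw_sample_means(population, sample_size, n_samples=1000):
    """Draw n_samples samples without replacement and return their means."""
    return np.array([
        np.random.choice(population, size=sample_size, replace=False).mean()
        for _ in range(n_samples)
    ])

means_100 = draw_sample_means(population, sample_size=100)
means_1000 = draw_sample_means(population, sample_size=1000)

# Compare the range and standard deviation of the two sampling distributions
for label, means in (("size 100", means_100), ("size 1000", means_1000)):
    print(f"{label:>9}: min={means.min():.2f}, max={means.max():.2f}, "
          f"std={means.std(ddof=1):.2f}")
```

Both sets of means center on the population mean, but the one built from samples of size 1000 is noticeably narrower.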

I’d love to hear your feedback on my post! Your feedback helps me improve my posts.

I hope it was helpful.

Please do share and like the post for greater visibility. Thank you!
