Sampling Distribution of the Sample Mean — Playing with UN World Population Data

Suraj Regmi
Probability and Statistics Stories
4 min read · Oct 14, 2019


Often we want to estimate a parameter (such as the arithmetic mean) of a large population. Because collecting data on the entire population may not be feasible, we can instead draw samples and estimate the population parameter from them. This article explains how and why such sampling can be used to estimate population parameters, and demonstrates it with a toy example using UN world population data.

Estimation of Parameters using Statistic Values

Suppose we want to know the average population per country (μ). Here, the set of all countries is the population, and the average population per country, calculated over the whole population, is the population parameter (μ).

So, μ = sum of the population of all the countries / total number of countries

Rather than calculating the average from the data of all the countries, we would like to estimate it using a sample of countries. The sample should be chosen in such a way that it is representative of the population. Out of the many ways of sampling, I choose simple random sampling here, which is both easy to do and a good sampling method.

Now, we take several samples (a sample here is a set of n countries) and calculate the sample mean (x̄) of each one. Then we take the mean of the sample means; that mean approximates the population mean, provided that enough samples are drawn and each sample contains enough data points (n).

This is an application of the central limit theorem. The central limit theorem states that the sum (and hence the mean) of a number of independent and identically distributed random variables with finite variance tends, after suitable standardization, to a normal distribution as the number of variables grows. As the mean of that limiting normal distribution is the same as the population mean, the mean of the sample means can be used as an approximation of the population mean.
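To see this in action, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the code behind this article: the population is synthetic (a skewed stand-in, not the UN data), the helper name `mean_of_sample_means` is made up for the example, and samples are drawn with replacement so that the draws are independent.

```python
import random
import statistics

def mean_of_sample_means(population, n, num_samples, seed=0):
    """Draw num_samples samples of size n (with replacement, so the draws
    are independent) and return the average of their sample means."""
    rng = random.Random(seed)
    sample_means = [
        statistics.fmean(rng.choices(population, k=n))
        for _ in range(num_samples)
    ]
    return statistics.fmean(sample_means)

# Synthetic, right-skewed stand-in population (hypothetical values, NOT the UN data).
data_rng = random.Random(42)
population = [data_rng.lognormvariate(8, 1.5) for _ in range(235)]

mu = statistics.fmean(population)
estimate = mean_of_sample_means(population, n=30, num_samples=5000)

print(f"population mean:      {mu:.1f}")
print(f"mean of sample means: {estimate:.1f}")
```

With enough samples, the two printed values should land close to each other, even though the stand-in population is far from normal.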

But what about variance?

Intuition says that the higher the value of n, the less the sample mean varies. So, the first guess would be an inverse relation between the variance of the sample mean and the sample size, n. Unsurprisingly, that is the case. The variance of the sample mean and the sample size (n) are related this way:

σ_x̄ ² = σ² / n

Why?

The random variable here is the sample mean, i.e., the sum of n independent and identically distributed random variables (X) divided by n.

So, its variance is:

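σ_x̄ ² = Variance (sum of n independent and identically distributed random variables / n)
σ_x̄ ² = Variance (sum of n independent and identically distributed random variables) / n²
As the n random variables are independent and identically distributed, the variance of their sum is the sum of their variances, so
σ_x̄ ² = n * Variance of one random variable / n²
σ_x̄ ² = n * σ² / n²
σ_x̄ ² = σ² / n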

So, the variance of the sample mean can be found from the variance of the population distribution. In case the population variance (σ²) is not available, the sample variance (s²) can be used as an approximation of it.
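The relation σ_x̄² = σ² / n can also be checked numerically. The sketch below is only an illustration: the population is synthetic (hypothetical values, not the UN data), and the samples are drawn with replacement so that the independence assumption used in the derivation holds.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic stand-in population (hypothetical values, NOT the UN data).
population = [rng.expovariate(1 / 100) for _ in range(1000)]
sigma2 = statistics.pvariance(population)  # population variance, sigma^2

n = 25               # sample size
num_samples = 20000  # how many samples we draw

# Sampling with replacement keeps the n draws independent,
# which is the assumption behind sigma_xbar^2 = sigma^2 / n.
sample_means = [
    statistics.fmean(rng.choices(population, k=n))
    for _ in range(num_samples)
]

observed = statistics.pvariance(sample_means)  # empirical variance of the sample mean
predicted = sigma2 / n                         # value given by the formula

print(f"observed variance of sample mean: {observed:.2f}")
print(f"predicted sigma^2 / n:            {predicted:.2f}")
```

The two numbers should agree to within a few percent; the gap shrinks as `num_samples` grows.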

UN World Population Data

Here, the 2019 population estimates for all countries from the United Nations, Department of Economic and Social Affairs are taken as the data. There are 235 countries/areas in total. We would like to take 50 samples and estimate the population parameter.


The following Python code is used for sampling and estimation.
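The original listing is not included in this text, so what follows is only a minimal sketch of how the sampling and estimation might be done, not the exact code used for this article. The variable names `mu_est` and `std_sample` come from the article; everything else is an assumption: the function name `estimate_population_mean`, the choice of n = 50 countries per sample and 1000 repetitions, and the placeholder data, which is synthetic and would be replaced by the 235 UN population figures (in thousands).

```python
import random
import statistics

def estimate_population_mean(populations, n, num_samples, seed=None):
    """Estimate the population mean from repeated simple random samples.

    Returns (mu_est, std_sample): the mean of the sample means and the
    standard deviation of the sample means.
    """
    rng = random.Random(seed)
    sample_means = []
    for _ in range(num_samples):
        # Simple random sample of n countries, without replacement.
        sample = rng.sample(populations, n)
        sample_means.append(statistics.fmean(sample))
    mu_est = statistics.fmean(sample_means)
    std_sample = statistics.stdev(sample_means)
    return mu_est, std_sample

# Placeholder data: synthetic values standing in for the 235 UN figures.
# Replace this list with the real country populations to reproduce the study.
placeholder_rng = random.Random(1)
populations = [placeholder_rng.lognormvariate(8, 1.5) for _ in range(235)]

mu_est, std_sample = estimate_population_mean(
    populations, n=50, num_samples=1000, seed=2019
)
print(f"mu_est     = {mu_est:.1f}")
print(f"std_sample = {std_sample:.1f}")
```

Because each sample here is drawn without replacement from a small population, `std_sample` will tend to come out somewhat below σ / √n; drawing with replacement instead (for example with `rng.choices`) would match the independence assumption of the derivation more closely.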

From the population data, we calculate:
Population mean (μ) = 32,823 thousand
Population standard deviation (σ) = 134,178 thousand

The high coefficient of variation of 409% shows the great spread of the data. Indeed, from small islands and Monaco-like countries to highly populated countries like China and India, there is a lot of variation.

The average of the sample means, μ_x̄ (mu_est in the code), is the estimate of the population mean, and σ_x̄ (std_sample in the code) is the standard deviation of the sample mean.

The exact figures are not mentioned here as the statistic values depend on the random state of the sampling process.

As the coefficient of variation of the population data is very high, it takes a large sample size (n) to estimate the population parameter accurately, that is, with a low standard deviation of the sample mean. In upcoming posts, further topics such as confidence intervals and significance levels will be discussed and applied to data to see how they can be used in statistical studies.
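To get a feel for how large n has to be, the population figures quoted above can be plugged into σ_x̄ = σ / √n. This is a rough sketch that assumes independent draws; it ignores the finite-population correction that would apply when sampling without replacement from only 235 countries, so it overstates the spread for large n. The function name `std_of_sample_mean` and the chosen values of n are just for illustration.

```python
import math

mu = 32823      # population mean, in thousands (figure quoted in the article)
sigma = 134178  # population standard deviation, in thousands (same source)

# Coefficient of variation: spread relative to the mean.
cv = sigma / mu
print(f"coefficient of variation: {cv:.0%}")

def std_of_sample_mean(sigma, n):
    """Standard deviation of the sample mean for n independent draws."""
    return sigma / math.sqrt(n)

for n in (10, 50, 100, 200):
    se = std_of_sample_mean(sigma, n)
    print(f"n = {n:>3}: sigma_xbar = {se:>9.0f} thousand ({se / mu:.0%} of mu)")
```

Even at n = 100 the standard deviation of the sample mean is still a sizeable fraction of μ, which is what a very high coefficient of variation does to the estimate.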

This post is part of a probability and statistics series, so it will be followed by more posts on related and follow-up topics. Stay tuned!


Data Scientist at Blue Cross and Blue Shield, MS CS from UAH — the views and the content here represent my own and not of my employers.