Confidence Intervals — Demonstration with the Facebook Data and UN Population Data

Suraj Regmi
Probability and Statistics Stories
4 min read · Oct 19, 2019

The last two blogs of the publication Probability and Statistics Stories discussed the sampling distribution of the sample proportion and the sampling distribution of the sample mean, with one toy example for each. Estimates of population parameters such as the mean and the proportion are called point estimates, as they give a single value as the estimate of the parameter. Interval estimates, on the other hand, give a range of values that is expected to contain the population parameter with a stated level of confidence. The confidence interval is one of the most prevalent forms of interval estimation. A 95% confidence interval, for example, comes from a procedure that, when repeated identically on new samples, produces intervals containing the true parameter about 95% of the time.
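To make the "repeated identically" reading concrete, here is a minimal simulation sketch. It is not taken from the earlier blogs: the population is a hypothetical normal one with a known mean and standard deviation (the values 10 and 3 are made up), so the z critical value is used here rather than t*. It only illustrates what the 95% refers to.

```python
import random
from statistics import NormalDist

def coverage(trials=2000, n=50, mu=10.0, sigma=3.0, level=0.95, seed=0):
    """Fraction of simulated intervals that contain the true mean mu.

    mu and sigma describe a hypothetical population, chosen only for illustration.
    """
    rng = random.Random(seed)
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # critical value, about 1.96 for 95%
    margin = z_star * sigma / n ** 0.5              # same margin for every sample (sigma is known)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

print(coverage())  # a value close to 0.95
```

Each simulated interval either contains the true mean or it does not. The 95% describes how often the procedure succeeds across many repetitions.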

Photo by M. B. M. on Unsplash

Population Proportion Estimation

The proportion of my Facebook friends with Nepali font names was estimated here. Readers are advised to read that blog before proceeding, as the rest of this section builds on it.

When estimating population parameters, the standard deviation of the population is usually unknown. So, we assume that we do not know the standard deviation of the population.

First, let’s see if the assumptions hold true for inference on this proportion.

  • As the online friends were sampled at different times and on different days, the sample can be assumed to be random.
  • The expected number of online friends with Nepali font names is 8.45 (on the basis of the sample). Though it is less than 10, let’s suppose that the sampling distribution of the sample proportion follows a normal distribution here for demonstration purposes.
  • As the number of online friends is assumed to be less than or equal to 10% of the population, the events can be supposed to be independent.
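The second check above can be sketched in a few lines of Python. The function name and the threshold argument are illustrative, not from the original analysis. The inputs n = 253 and p̂ = 0.0335 are the values reported in this post, so with the rounded estimate the expected count comes to about 8.5, close to the figure quoted above.

```python
def success_failure_check(n, p_hat, min_count=10):
    """Return expected successes, expected failures, and whether both reach min_count."""
    successes = n * p_hat
    failures = n * (1 - p_hat)
    return successes, failures, (successes >= min_count and failures >= min_count)

# n and p_hat are the values reported in this post; min_count=10 is the usual rule of thumb.
successes, failures, ok = success_failure_check(n=253, p_hat=0.0335)
print(f"expected successes: {successes:.2f}")
print(f"expected failures:  {failures:.2f}")
print(f"condition met:      {ok}")
```

The condition is not met for the successes, which is why the normal approximation is adopted here only for demonstration purposes.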

Taking the samples and doing the four experiments, we got the estimates:
μ_p̂ = 0.0335
σ_p̂ = 0.0116
(calculated from μ_p̂ )

As the standard deviation of the distribution is not known, we use Student's t-distribution to construct the confidence interval.

With a sample size of n = 253 and a confidence level of 95%, the value of t* is 1.97.

So, the confidence interval is:
= (μ_p̂ - t* × σ_p̂ , μ_p̂ + t* × σ_p̂ )
= (0.0335 - 1.97 × 0.0116, 0.0335 + 1.97 × 0.0116)
= (0.01, 0.056)
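The arithmetic above can be reproduced with a short Python sketch. The helper name t_interval is hypothetical, and t* is hardcoded to the value quoted in this post rather than looked up from a t-table or a statistics library.

```python
def t_interval(estimate, std_error, t_star):
    """Return (lower, upper) as estimate -/+ t_star * std_error."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin

# Values reported in this post: point estimate, its standard deviation, and t* for 95%.
lower, upper = t_interval(estimate=0.0335, std_error=0.0116, t_star=1.97)
print(f"({lower:.4f}, {upper:.4f})")
```

Unrounded, the bounds are roughly 0.0106 and 0.0564, which match the interval reported above after rounding.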

Hence, the 95% confidence interval for the population proportion is 1% to 5.6%. We can say with 95% confidence that the proportion parameter lies in the interval (1%, 5.6%).

Population Mean Estimation

The population per country was estimated here using samples of size 50 from the UN population data. Readers are advised to read that blog before proceeding, as the rest of this section builds on it.

As mentioned above, when estimating population parameters, the standard deviation of the population is usually unknown. So, we assume here too that we do not know the standard deviation of the population.

First, let’s see if the assumptions hold true for inference on this mean.

  • The samples are drawn randomly using Pandas. So, the experiment or sampling process is random.
  • As the sample size is 50 (which is greater than 30), the sampling distribution of the mean can be taken as approximately normal.
  • The sample size is more than 10% of the population, so the 10% rule for independence is not satisfied. But let’s suppose the independence of the events for demonstration and learning purposes. It is never recommended to assume independence in this way when doing an actual real-life study.

Taking the 50 samples and doing the ten experiments, we got the estimates:
μ_x̄ = 30899 thousands
σ_x̄ = 13030 thousands (calculated from the arithmetic average of the sample SDs of the 10 experiments)

As the standard deviation of the distribution is not known, we use Student's t-distribution to construct the confidence interval here too.

With a sample size of n = 50 and a confidence level of 95%, the value of t* is approximately 2.

So, the confidence interval is:
= (μ_x̄ - t* × σ_x̄, μ_x̄ + t* × σ_x̄)
= (30899 - 2 × 13030, 30899 + 2 × 13030)
= (4839, 56959)
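The same calculation can be sketched in Python, this time also reporting how large the margin is relative to the estimate. The helper name mean_interval is hypothetical, and the inputs are the rounded values reported in this post (in thousands).

```python
def mean_interval(x_bar, std_error, t_star):
    """Return (lower, upper, relative_margin) for x_bar -/+ t_star * std_error."""
    margin = t_star * std_error
    return x_bar - margin, x_bar + margin, margin / x_bar

# Values reported in this post, in thousands; t* is rounded to 2 as in the text.
lower, upper, rel = mean_interval(x_bar=30899, std_error=13030, t_star=2)
print(f"interval: ({lower}, {upper}) thousands")
print(f"margin as a share of the estimate: {rel:.0%}")
```

The margin is more than 80% of the point estimate, which gives a sense of how wide this interval is.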

Hence, the 95% confidence interval for the population mean is (4839 thousands, 56959 thousands). We can say with 95% confidence that the mean parameter lies in the interval (4839 thousands, 56959 thousands). The confidence interval is quite wide because the standard deviation of the samples was itself large. It is usually difficult to narrow down the confidence interval at such a high confidence level with highly variable data like this.

This blog is part of a probability and statistics series, so it will be followed by many other blogs on related and follow-up topics. Stay tuned!

Data Scientist at Blue Cross and Blue Shield, MS CS from UAH — the views and the content here represent my own and not of my employers.