Sampling Distribution of the Sample Proportion — Demonstration with my Facebook Friends Real-Time Data

Suraj Regmi
Probability and Statistics Stories
4 min read · Oct 13, 2019

Note: This story was published on my Medium previously here. It is added to this publication to collect all my probability and statistics articles under this publication head.

Story

Today, I was scrolling through my Facebook chat section. The chat list was sorted alphabetically, with names starting with ‘a’ at the top and names written in Nepali font at the bottom. The Nepali names were at the bottom because their characters have higher Unicode code points than Latin letters. Although names in Nepali font are harder for me to search for on Facebook (so I was not quite a fan of them), I loved the way the Nepali font was gleaming at me. And, suddenly, I was intrigued: what proportion of my friends have names in Nepali font?

Photo by Will Francis on Unsplash

Then, I started to wonder about two things: the sample proportion, and the sampling distribution of the sample proportion. I was scratching my head trying to find a good sample for estimating the proportion of Nepali names. What if I recalled my Facebook friends at random and sampled them? That would be quite a boring task, and the process of remembering might not be random either. What about sampling all the friends whose names start with ‘a’ (and the corresponding Nepali letters)? Sorry, no more alphabetical-order advantage for my a-ish friends. What about taking the friends who are online right now? It is an easy process, but can it be a good sample? Let’s find out.

Wondering its Mathematics

Assumptions

  1. There is always the same number of online friends on my Facebook while sampling.
  2. The number of online friends is always less than or equal to 10% of total friends.

Mathematics

Let p be the real proportion of the Nepali names. We are trying to estimate p using a sample of n online friends.

We take the n friends who are online on different days and at different times, and they act as our random sample. We define the random variable X as the number of online friends in the sample who have Nepali font names. As the sample size is at most 10% of the population size, the picks can be treated as approximately independent events.

So, X is approximately a binomial random variable with mean np and variance np(1 - p).
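To get a feel for why the 10% condition matters, here is a small sketch comparing the binomial variance with the exact variance for drawing without replacement (the hypergeometric case), which differs by the finite population correction (N - n) / (N - 1). N = 2244 and 76 Nepali names are the figures from the Observations section of this post; n = 224 (about 10% of N) is an assumed value chosen only for illustration.

```python
import math


def binomial_variance(n, p):
    """Variance of X when the n picks are treated as independent."""
    return n * p * (1 - p)


def hypergeometric_variance(n, p, N):
    """Exact variance of X when n friends are drawn without replacement from N."""
    return n * p * (1 - p) * (N - n) / (N - 1)


N = 2244      # total friends (from the Observations section)
p = 76 / N    # true proportion of Nepali names
n = 224       # assumed sample size, roughly 10% of N

var_binom = binomial_variance(n, p)
var_hyper = hypergeometric_variance(n, p, N)
sd_ratio = math.sqrt(var_hyper / var_binom)

print(f"binomial mean np          : {n * p:.2f}")
print(f"binomial variance         : {var_binom:.2f}")
print(f"exact variance            : {var_hyper:.2f}")
print(f"sd ratio (exact/binomial) : {sd_ratio:.3f}")
```

Right at the 10% boundary the two standard deviations differ by only about 5%, which is why treating X as binomial is a reasonable approximation here.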

Dividing X by n gives an estimate of p, which we denote as p̂. Averaging p̂ over several experiments,
μ_p̂ = sum(random variables X) / (no of experiments * n)

As the sum of the random variables X divided by the number of experiments approximates the mean of the distribution, np,
μ_p̂ ≈ np / n = p

So, the average of the sample experiments, μ_p̂, approximates the population proportion p.
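This claim can be checked with a quick simulation. The sketch below is an idealization: it assumes the online friends are a uniformly random subset of all friends, which real online behaviour may not satisfy. The population figures (2244 friends, 76 Nepali names) come from the Observations section; the sample size of 224, the 2000 experiments and the function name are assumptions made for the example.

```python
import random


def simulate_mean_p_hat(population, n, experiments, seed=0):
    """Average the sample proportion p_hat over many simulated experiments."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    p_hats = []
    for _ in range(experiments):
        sample = rng.sample(population, n)  # n friends drawn without replacement
        p_hats.append(sum(sample) / n)      # proportion of Nepali names in the sample
    return sum(p_hats) / experiments


N, nepali = 2244, 76
population = [1] * nepali + [0] * (N - nepali)  # 1 marks a Nepali font name
p = nepali / N

mean_p_hat = simulate_mean_p_hat(population, n=224, experiments=2000)
print(f"true p          : {p:.4f}")
print(f"average of p_hat: {mean_p_hat:.4f}")
```

With many experiments the average of p̂ should land very close to p, which is the behaviour the derivation above describes.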

Now, the standard deviation of the sample proportion p̂ is obtained by dividing the standard deviation of X, sqrt(np(1 - p)), by the sample size n, which gives sqrt(p(1 - p) / n).
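Written out step by step, and using only the binomial variance of X stated earlier together with the fact that n is a constant, the derivation looks like this:

```latex
\begin{aligned}
\operatorname{Var}(\hat{p})
  &= \operatorname{Var}\!\left(\frac{X}{n}\right)
   = \frac{\operatorname{Var}(X)}{n^{2}}
   = \frac{np(1-p)}{n^{2}}
   = \frac{p(1-p)}{n}, \\
\sigma_{\hat{p}}
  &= \sqrt{\frac{p(1-p)}{n}}.
\end{aligned}
```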

Experimentation

Data Extraction

For extracting the list of all my Facebook friends and the friends online in real time, I used the selenium library with Python 3. The code for the data extraction can be found in this repository.

Observations

Looking at my full Facebook friends list, I have 2244 friends, of whom 76 have Nepali names. That makes the value of p 76 / 2244 ≈ 0.034.

I ran experiments* on different days and at different times and found the following statistics.

9 Nepali names in 300 online friends, i.e., p̂ = 0.03 [10th October 2019 10:05]
4 Nepali names in 261 online friends, i.e., p̂ = 0.015 [11th October 2019 08:30]
8 Nepali names in 228 online friends, i.e., p̂ = 0.035 [11th October 2019 16:40]
12 Nepali names in 223 online friends, i.e., p̂ = 0.054 [11th October 2019 18:11]
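The p̂ values above can be reproduced directly from the counts. This is just a small sketch of the arithmetic; the variable names are mine and the timestamps are rewritten in ISO-like form.

```python
# (timestamp, Nepali names among online friends, online friends)
observations = [
    ("2019-10-10 10:05", 9, 300),
    ("2019-10-11 08:30", 4, 261),
    ("2019-10-11 16:40", 8, 228),
    ("2019-10-11 18:11", 12, 223),
]

# Sample proportion for each experiment: p_hat = X / n
p_hats = [nepali / online for _, nepali, online in observations]

for (when, nepali, online), p_hat in zip(observations, p_hats):
    print(f"{when}: {nepali}/{online} -> p_hat = {p_hat:.3f}")
```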

Analysis

Averaging the values of p̂, we get μ_p̂ = 0.0335, which is an excellent approximation of the population parameter p = 0.034 after just four experiments.

For calculating the standard deviation, let’s take the sample size n to be the average of the four sample sizes, i.e., 253.

The standard deviation of the four observed p̂ values,
σ_p̂ = 0.014

And the standard deviation calculated using the binomial distribution formula,
σ_p̂ = sqrt(p * (1 - p) / n) = sqrt(0.034 * (1 - 0.034) / 253) = 0.011
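Both numbers can be recomputed with Python's standard library. One assumption in this sketch: the observed 0.014 matches the population-style standard deviation (dividing by the number of experiments), so `statistics.pstdev` is used; the sample-style `statistics.stdev` would give a somewhat larger value. Small differences from the figures in the text come from rounding the p̂ values before averaging.

```python
import math
import statistics

p = 76 / 2244                        # population proportion of Nepali names
counts = [9, 4, 8, 12]               # Nepali names seen in each experiment
sizes = [300, 261, 228, 223]         # online friends in each experiment
p_hats = [x / n for x, n in zip(counts, sizes)]

mean_p_hat = statistics.mean(p_hats)        # average of the four sample proportions
observed_sd = statistics.pstdev(p_hats)     # spread of the four p_hat values
n_avg = statistics.mean(sizes)              # average sample size, 253
theoretical_sd = math.sqrt(p * (1 - p) / n_avg)  # binomial formula for sd of p_hat

print(f"mean of p_hat  : {mean_p_hat:.4f}")
print(f"observed sd    : {observed_sd:.3f}")
print(f"theoretical sd : {theoretical_sd:.3f}")
```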

We see that the standard deviation of our sample proportions (0.014) is already close to the standard deviation predicted by the binomial distribution (0.011).

So, the estimated values being really close to the real values, even with just four experiments, shows two things: that online friends can be a good sample for our task, and that sampling techniques are an effective way to estimate parameters.

[UPDATE] Four experiments have been done so far, and updated values will be reflected above.

*The number of experiments will be increased to 10 in the coming days. As the number of experiments increases, we should see the statistics get closer to the parameters.

This blog is part of the probability and statistics series, so it will be followed up with another blog, “Sampling Distribution of the Sample Mean”, and many more. Stay tuned!

Data Scientist at Blue Cross and Blue Shield, MS CS from UAH — the views and the content here represent my own and not of my employers.