Statistics & Probability — The CLT & CI

The central limit theorem and confidence interval

Omar Elgabry
OmarElgabry's Blog
Feb 24, 2019


Similarly, a kid just looks like his grandfather — source

This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.

The central limit theorem (CLT)

Often, we want to know what our population looks like based on our sample.

Could we use samples to direct us to the population? Yes.

The central limit theorem tells us that the more samples we take, the closer the mean of our sample means gets to the population mean.

The central limit theorem works not only with discrete values, but also with continuous values.

Sample vs Sampling Distribution

source

Sample Distribution. Say we have a population of interest and we take random samples from it. And based on these samples, we calculate a sample statistic (say, mean) for each sample.

Each one of the samples has its own distribution, which we call a sample distribution.

Sampling Distribution. The sample statistics (means) we recorded from each sample also form a new distribution. The distribution of these sample statistics is called the sampling distribution.

So the two terms, sample and sampling distribution, sound similar, but they are different concepts.

(~) The mean height of women in the US. Assume that we take random samples of a thousand women from each state, and calculate the mean for each.

  • The mean of the sample means will probably be close to the actual population mean.
  • The standard deviation of each sample will be close to that of the population because, after all, each of these samples is simply a subset of our population (assuming each sample has random observations).
  • The standard deviation of the sample means will be low, since we expect the average heights across states to be close to each other. We call the standard deviation of the sample means the standard error (SE).

Recognize that as the sample size N increases, we expect the sampling variability to decrease.

Conceptually, when the size of each sample is large, the sample means will be much more consistent across samples (close to each other).

Mathematically, SE = (SD of population) / sqrt(N).
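As a quick sanity check, we can verify the SE formula by simulation. This is a sketch with made-up numbers (a hypothetical population of heights with SD = 7), using only Python's standard library:

```python
import math
import random
import statistics

random.seed(42)  # reproducible

# Hypothetical population: heights with mean 165 and SD 7 (illustrative numbers)
population_sd = 7.0
n = 100

# Theoretical standard error: SE = (SD of population) / sqrt(N)
se_theory = population_sd / math.sqrt(n)   # 7 / 10 = 0.7

# Check by simulation: draw many samples of size n,
# then take the standard deviation of their means
means = [
    statistics.mean(random.gauss(165, population_sd) for _ in range(n))
    for _ in range(2000)
]
se_simulated = statistics.stdev(means)

print(round(se_theory, 2))     # 0.7
print(round(se_simulated, 2))  # close to 0.7
```

The simulated SD of the sample means lands very close to the theoretical SD / sqrt(N), which is exactly what the formula predicts.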

CLT: Shape, Center, & Spread

source
  • Shape: The distribution of sample means (the sampling distribution) is nearly normal.
  • Center: The mean of the sampling distribution is approximately equal to the population mean.
  • Spread: The standard error (the standard deviation of the sample means) is approximately (SD of population) / sqrt(N).

Since we often don’t know the population distribution, we assume the sample distribution mirrors the population. If the SD of the population is unknown, which is often the case, we use the SD of one sample instead.

And believe it or not, you can use a single sample. But the CLT is never about the individual sample values, only the sample mean(s).
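In practice you often have exactly one sample. A minimal sketch (with hypothetical data) of estimating the standard error from a single sample, using the sample SD as a stand-in for the unknown population SD:

```python
import math
import random
import statistics

random.seed(1)  # reproducible

# Hypothetical single sample of 100 heights; the population SD is unknown
sample = [random.gauss(165, 7) for _ in range(100)]

s = statistics.stdev(sample)              # sample SD stands in for the population SD
se_estimate = s / math.sqrt(len(sample))  # SE = s / sqrt(N)

print(round(se_estimate, 2))
```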

CLT: Conditions

— 1. Independence.

  • Observations in each sample must be independent. This is more likely to hold if we use random sampling (in an observational study) or random assignment (in an experiment).
  • If sampling without replacement, the sample size N should be less than 10% of the population.

“Without replacement” means we don’t include the same observation again in another sample. Samples that are too large will likely contain observations that are dependent.

That’s why, while we like large samples, we also want to keep the sample size below 10% of our population.

— 2. Skewed Distribution

  • If the population distribution is not normal, then the more skewed it is, the larger the sample size we need for the central limit theorem to apply (rule of thumb: n > 30).

Say, we have a right skewed population distribution

source

If N = 10 (small), the sample means will be quite variable, and the sampling distribution will look like the population distribution (right skewed).

If we increase N to 100, the sample means will be much more consistent across samples, which decreases the standard error, and the sampling distribution starts to look normal (unimodal and symmetric).

So, as the sample size N increases, we overcome the skewness of the population distribution and the central limit theorem kicks in.
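A sketch of this effect, sampling from a right-skewed exponential population (illustrative rate of 1, so the population SD is 1): as N grows, the spread of the sample means shrinks toward SD / sqrt(N).

```python
import random
import statistics

random.seed(0)  # reproducible

def sampling_dist_sd(n, reps=3000):
    """SD of the sample means (the standard error), estimated by simulation."""
    return statistics.stdev(
        statistics.mean(random.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    )

se_small = sampling_dist_sd(10)    # near 1/sqrt(10), about 0.32
se_large = sampling_dist_sd(100)   # near 1/sqrt(100) = 0.10

print(round(se_small, 2), round(se_large, 2))
```

The N = 100 sampling distribution is both tighter and visibly more symmetric than the N = 10 one, even though the population itself is strongly skewed.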

CLT (for the mean) examples

(~) I’m about to take a trip and the drive is 6 hours. I make a random playlist of 100 songs. Given that the mean length of songs is 3.45 minutes, what is the probability that my playlist lasts the entire drive?

6 hours = 360 minutes, so we want P(total length of the 100 songs > 360). Dividing both sides by 100, this is P(average song length > 3.6).

Assuming the conditions of the CLT about independence and skewness apply.

  • Center: The sampling distribution mean ~= population mean = 3.45.
  • Spread: SE = SD of population / sqrt(N) = 1.63 / sqrt(100) = 0.163.

We measure the variability of individual observations from the population with standard deviations. We measure the variability of sample means with standard errors.

Having the mean and the standard error:

z-score = (3.6 - 3.45) / 0.163 ≈ 0.92
P(Z > 0.92) ≈ 1 - 0.821 = 0.179 ≈ 0.18
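The same calculation in Python, using the standard library's NormalDist and the numbers from the example:

```python
from statistics import NormalDist

# Numbers from the example: population mean 3.45 min, SD 1.63 min, N = 100 songs
mean, sd, n = 3.45, 1.63, 100
se = sd / n ** 0.5               # standard error = 0.163

z = (3.6 - mean) / se            # z-score of the cutoff
p = 1 - NormalDist().cdf(z)      # P(average song length > 3.6)

print(round(z, 2))   # 0.92
print(round(p, 2))   # 0.18
```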

(~) Given a population histogram with mean = 10, sd = 7, and a strong right skew. Consider each of these plots:

source
  • Plot A: Has a mean = 10, and is slightly right skewed.
  • Plot B: Has a mean = 10, with sd (variability) and strong right skew just like the population.
  • Plot C: Has a mean = 10, and is normally distributed.

Which represents the following:

  • A single random sample of 100 observations from this population → B (since it resembles the population distribution).
  • A distribution of 100 sample means from random samples of size 7 → A (since N = 7 is small, so it is still slightly right skewed).
  • A distribution of 100 sample means from random samples of size 49 → C (since N = 49 is large, so it is normally distributed).

Confidence Interval (for a mean)

A plausible range of values for the population parameter is called a confidence interval.

If we report a point estimate (like mean of a sample), we probably won’t hit the exact population parameter. On the other hand, if we report a range of plausible values, we have a good shot at capturing the parameter.

Recalling: CLT & “68, 95, 99.7% rule”

The CLT says that the mean of random sample(s) is approximately equal to the population mean.

The CLT can be applied to a single sample mean. It applies to sample means (or other estimates), but NOT to the individual sample values.

The 68, 95, 99.7% rule says that, in the sampling distribution, roughly 95% of random samples will have sample means that are within 2SE of the population mean. Equivalently, the unknown population mean will be within 2SE of the sample mean.

So, the 95% confidence interval for the population mean is approximately the sample’s mean +/- 2SE.

2SE is also called the margin of error (ME).

(~) A study of 124 couples found that 64.5% turn their heads to the right when kissing. The standard error associated with this estimate is roughly 4%. Which of the statements below is false?

  • A) A higher sample size would yield a lower standard error. True
  • B) The margin of error for a 95% confidence interval for a percentage of kissers is roughly 8%. True
  • C) The 95% confidence interval for the percentage of kissers is roughly 64.5% plus or minus 4%. False
  • D) The 99.7% confidence interval for the percentage of kissers is roughly 64.5% plus or minus 12%. True
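A quick check of the margin-of-error arithmetic behind these answers, using the study's numbers (a 95% interval uses roughly 2SE, a 99.7% interval roughly 3SE):

```python
estimate, se = 64.5, 4.0   # point estimate (%) and standard error (%) from the study

me_95 = 2 * se             # 95% margin of error: 2SE = 8%, so B is true and C is false
me_997 = 3 * se            # 99.7% margin of error: 3SE = 12%, so D is true

ci_95 = (estimate - me_95, estimate + me_95)

print(ci_95)     # (56.5, 72.5)
print(me_997)    # 12.0
```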

Recap: Confidence Interval

To summarize, a confidence interval tells us two things:

  1. We are X% (i.e. 95%) confident that the unknown population mean μ is between sample’s mean +/- 2SE.
  2. If we draw an infinite number of random samples, and for each sample we compute the 95% confidence interval, then 95% of all these intervals will contain the unknown μ.

Since the confidence interval relies on the CLT, in order to use it we must meet the same conditions as for the CLT (independence and skewness).

Having a nearly normal sampling distribution, which relies on the central limit theorem, helps in doing statistical inference using confidence intervals and hypothesis tests.
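Claim 2 above can be checked by simulation. This sketch uses a made-up population (μ = 10, σ = 2): we repeatedly draw samples, build a 95% interval (mean +/- 2SE) from each, and count how often the interval captures μ.

```python
import math
import random
import statistics

random.seed(7)  # reproducible

mu, sigma, n = 10.0, 2.0, 50   # hypothetical population and sample size
reps = 2000
hits = 0

for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # SE from the sample SD
    lo, hi = x_bar - 2 * se, x_bar + 2 * se        # 95% CI ~= mean +/- 2SE
    if lo <= mu <= hi:
        hits += 1

coverage = hits / reps
print(round(coverage, 2))   # close to 0.95
```

Roughly 95% of the 2000 intervals contain the true μ, which is exactly what the confidence level promises.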

Finding the critical value z*

The rule for calculating the CI is: CI = x̄ (sample mean) +/- z* x SE, where z* is approximately equal to 2 for a 95% confidence interval.

z* (the critical value) is the Z-score for the confidence level.

If z* ~= 2, how do we get the exact value of z*?

source

For a 95% confidence interval: the middle 95% of the curve leaves 1 - 0.95 = 0.05 in the tails, with 0.05 / 2 = 0.025 on each side. What’s the Z-score if the percentile is 2.5%?

The exact z* is 1.96 (close to 2). This can be looked up in a Z-table or calculated using R. Now we can get the critical value for any X% confidence level.
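The article does this lookup in R; the same inverse-CDF calculation is available in Python's standard library, and generalizes to any confidence level:

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value: the Z-score that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

print(round(z_star(0.95), 2))   # 1.96
print(round(z_star(0.99), 2))   # 2.58
```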

CI (for the mean) examples

(~) A survey of 1,154 U.S. residents. A 95% confidence interval for the average number of hours Americans have to relax after an average workday was found to be 3.53 to 3.83 hours.

A) 95% of Americans spend between 3.53 to 3.83 hours relaxing after a work day.

False. The confidence interval is not about individuals in the population but about the true population parameter. This statement would be true if we said “We are 95% confident that Americans, on average, spend 3.53 to 3.83 hours ….”

B) 95% of random samples of 1,154 Americans will yield confidence intervals that contain the true average number of hours Americans spend relaxing after a work day.

True.

C) 95% of the time the true average number of hours Americans spend relaxing is between 3.53 and 3.83 hours.

False. The population mean (true average) is fixed while the interval is computed from a random sample and so it’s random.

D) We are 95% confident that Americans in this sample spend on average 3.53 to 3.83 hours relaxing after a work day.

False. The confidence interval is not about the sample mean, but is instead about the population mean. This statement can be true if we say that we are 100% confident that Americans “in this sample” spend on average …

(~) In a survey of 1,151 US residents, a 95% confidence interval of 3.40 to 4.24 unhealthy days in the last month was reported.

  • The confidence interval is always about the unknown population mean.
  • The confidence level tells us how confident we are that this particular interval captures the true population mean.
  • We are 95% confident that Americans, on average, have 3.40 to 4.24 bad mental health days per month.
  • 95% of confidence intervals created from random samples of size N from the population will contain the true population parameter.
  • 95% of random samples will yield confidence intervals that capture the true population mean number of unhealthy days per month.

(~) 50 students were asked about their number of relationships, with an average of 3.2 and sd of 1.74. The sample distribution was slightly right skewed. Estimate the true (population) average using a 95% CI.

We have the sample mean, N, and sd. But let’s check the conditions for the CI:

  • Independence: Since it’s a random sample, and assuming 50 is less than 10% of all college students, the observations are independent.
  • Skewed distribution: Since N > 30 and the sample is not very skewed, the sampling distribution is nearly normal.

If the data is highly skewed, then it’s better for N to be even larger.

CI = mean +/- z* (= 1.96) x SE (= s / sqrt(N)) = 3.2 +/- 0.48 = (2.72, 3.68)

Meaning that we are 95% confident that college students on average have been in 2.72 to 3.68 exclusive relationships.
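The same interval computed step by step in Python, with the numbers from the example:

```python
import math
from statistics import NormalDist

x_bar, s, n = 3.2, 1.74, 50            # sample mean, sample SD, sample size
z_star = NormalDist().inv_cdf(0.975)   # critical value for 95%, about 1.96

se = s / math.sqrt(n)                  # standard error, about 0.246
me = z_star * se                       # margin of error, about 0.48
ci = (x_bar - me, x_bar + me)

print(round(ci[0], 2), round(ci[1], 2))   # 2.72 3.68
```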

Thank you for reading! If you enjoyed it, please clap 👏 for it.
