Published in

Analytics Vidhya

# Understanding Confidence Intervals: using Covid-19 Vaccination Sample

The President of the United States announced a few weeks ago that fully vaccinated people no longer need to wear masks. But how do we know whether people are telling the truth about their vaccination status? We can’t go around asking everyone without a mask for a vaccination certificate.

We would like to estimate how many people in our city have been vaccinated, so that we know whether vaccinated people should still take precautions.

Wouldn’t it be great if we could estimate the city’s vaccination rate from data that we can easily collect? We are going to do exactly that in the next few minutes.

Let’s clear up some statistical jargon first.

# Sample Statistic

A sample statistic is any number computed from the data that tells us something about the data. We are going to use proportion as a sample statistic for our experiment.

Proportion can be calculated using:

Proportion = Number of people vaccinated / Total number of people in our sample

# Central Limit Theorem

If we take many samples, compute the same statistic for each one, and plot how often each value occurs, the resulting distribution will be nearly normal. Its mean will be very close to the population mean, and its standard deviation will be much smaller than the population standard deviation.

As an example, take 200 random samples, each of size 30, from a population with mean 0 and standard deviation 20.

If we calculate the mean of each of these samples and plot those means, the resulting sampling distribution looks nearly normal.

You can try changing the value of each variable and see how the distribution changes. You can use this tool to experiment more: https://gallery.shinyapps.io/CLT_mean/

The Central Limit Theorem states that if we take random samples of size n from a population, then for a sufficiently large n the distribution of the sample means (the sampling distribution) will be nearly normal, centered at the population mean μ, with standard deviation equal to the population standard deviation divided by the square root of the sample size, σ/sqrt(n),

where n is the size of each sample
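The experiment described above can be reproduced with a minimal sketch like the one below. It assumes the population itself is normal with mean 0 and standard deviation 20 (the text only gives the mean and standard deviation), and the seed is an arbitrary choice made for repeatability.

```python
import random
import statistics

random.seed(42)  # arbitrary seed so the run is repeatable

POP_MEAN, POP_SD = 0, 20          # population parameters from the text
N_SAMPLES, SAMPLE_SIZE = 200, 30  # 200 samples of size 30

# Draw each sample from an (assumed) normal population and keep its mean.
sample_means = [
    statistics.fmean(random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = POP_SD / SAMPLE_SIZE ** 0.5  # sigma / sqrt(n)

print(f"mean of the sample means: {mean_of_means:.2f}")
print(f"sd of the sample means:   {sd_of_means:.2f}")
print(f"sigma / sqrt(n):          {theoretical_sd:.2f}")
```

The spread of the sample means should come out close to σ/sqrt(n), which is about 3.65 here, and far below the population standard deviation of 20.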

In our experiment, people answer either yes or no when asked whether they have been vaccinated. Because our data is categorical, there is no mean to compute, so we are going to use proportions instead.

# Central Limit Theorem for Proportions

It states that if we take random samples of size n from a population, the distribution of the sample proportions will be nearly normal, centered at the population proportion p, with a standard error that is inversely proportional to the square root of the sample size:

$$SE = \sqrt{\frac{p(1-p)}{n}}$$

where n is the size of each sample and p is the proportion of the population that has been vaccinated.
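To see this in action, here is a small simulation sketch. The population proportion of 0.24, the number of simulated samples, and the seed are all assumptions made for illustration. It compares the spread of the simulated sample proportions with the usual standard error formula sqrt(p(1-p)/n).

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable

P_TRUE = 0.24      # assumed population proportion (hypothetical)
SAMPLE_SIZE = 100  # people per sample
N_SAMPLES = 2000   # number of simulated samples


def sample_proportion(p, n):
    """Share of 'yes' answers in one simulated sample of n people."""
    return sum(random.random() < p for _ in range(n)) / n


proportions = [sample_proportion(P_TRUE, SAMPLE_SIZE) for _ in range(N_SAMPLES)]

simulated_center = statistics.fmean(proportions)
simulated_spread = statistics.stdev(proportions)
theoretical_se = (P_TRUE * (1 - P_TRUE) / SAMPLE_SIZE) ** 0.5

print(f"center of sample proportions: {simulated_center:.3f}")
print(f"spread of sample proportions: {simulated_spread:.4f}")
print(f"sqrt(p(1-p)/n):               {theoretical_se:.4f}")
```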

As we can see, the formula does not depend on the number of samples we use. We can obtain a similar result even with a single sample if the following conditions are met:

1. Independence: Sampled observations must be independent. For this to hold, the sample must be drawn at random, and if we sample without replacement, the sample should be less than 10% of the population. In our experiment, we use a sample of 100 people, which is definitely less than 10% of the population of the city.
2. Sample Size: There should be at least 10 successes and 10 failures in the sample. We define success as being vaccinated. We have 24 vaccinated people out of 100 in our sample, so we have 24 successes and 76 failures.

As these conditions are met, we can assume that the sample proportion follows a nearly normal distribution.
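The two checks above can be sketched as a small helper. The function name and the city population of 500,000 are hypothetical, chosen only to make the example concrete.

```python
def clt_conditions_met(successes, sample_size, population_size, minimum=10):
    """Check the independence (10% rule) and sample size conditions."""
    failures = sample_size - successes
    # Independence: the sample should be under 10% of the population.
    independent = sample_size < 0.10 * population_size
    # Sample size: at least `minimum` successes and `minimum` failures.
    large_enough = successes >= minimum and failures >= minimum
    return independent and large_enough


CITY_POPULATION = 500_000  # hypothetical city size

print(clt_conditions_met(24, 100, CITY_POPULATION))  # True: 24 successes, 76 failures
print(clt_conditions_met(4, 100, CITY_POPULATION))   # False: only 4 successes
```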

# Confidence Interval

A confidence interval is a range of values above and below the sample proportion that is expected to contain the unknown population proportion. The confidence level is the share of such intervals, over repeated sampling, that would contain the population proportion. A confidence interval is calculated using:

Point Estimate ± Margin of Error

The point estimate in our case is the sample proportion, denoted by p̂.

# Margin of error

A margin of error tells you by how many percentage points your results may differ from the real population value. For example, a 95% confidence interval with a 4 percent margin of error means that your statistic will be within 4 percentage points of the real population value 95% of the time.

The margin of error is calculated as Z* × Standard Error. Let’s see what each of these means.

# Z* (Critical Value)

It describes how far from the mean of the distribution you have to go to cover a certain amount of the total variation in the data (i.e. 90%, 95%, 99%).

It depends on the confidence level we choose. For example, if we use a 95% confidence level, we get Z* = 1.96. Remember that a 95% confidence level refers to the middle 95% of the distribution.

The critical value is tied to the area under the standard normal curve: the area between -Z* and +Z* is equal to the confidence level.

You can check critical values for some of the commonly used confidence levels in the table below:

| Confidence level | Z* |
| --- | --- |
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
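If you would rather compute a critical value than look it up, a minimal sketch using Python's standard library `statistics.NormalDist` is shown below. The helper name `critical_value` is our own.

```python
from statistics import NormalDist


def critical_value(confidence_level):
    """Z* such that the middle `confidence_level` of the standard normal lies within +/- Z*."""
    tail = (1 - confidence_level) / 2  # area left in each tail
    return NormalDist().inv_cdf(1 - tail)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: {critical_value(level):.3f}")
# 90%: 1.645
# 95%: 1.960
# 99%: 2.576
```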

# Standard Error

The standard error of the proportion measures the spread of the sample proportion around the population proportion. More specifically, the standard error is an estimate of the standard deviation of a statistic. It is similar in nature to the standard deviation, as both are measures of dispersion, and it tells us how precisely the sample estimates the population value.

Consider that p is the population proportion and p̂ is the sample proportion.

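Since p is unknown, we estimate the standard error from the sample. The formula for the standard error of the proportion is:

$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where p̂ is defined as Number of Successes / Sample Size and n is the sample size.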
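As a quick sketch, the standard error formula sqrt(p̂(1-p̂)/n) translates directly into code. The function name is our own, and the numbers are the 24 successes out of 100 from our sample.

```python
import math


def standard_error(successes, sample_size):
    """Estimated standard error of a sample proportion."""
    p_hat = successes / sample_size
    return math.sqrt(p_hat * (1 - p_hat) / sample_size)


print(round(standard_error(24, 100), 4))  # 0.0427
```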

Estimating a proportion using a confidence interval:

Conducting health surveys with community-based random samples is essential to capture an otherwise unreachable population, but these surveys can be biased if the effort to reach participants is insufficient.

Let’s say we have data on the number of vaccinated people in our neighbourhood. Since the decision of when to get vaccinated is, as of now, entirely up to the individual, we will treat this as a random sample.

Suppose we surveyed 100 people living in our neighbourhood (all of whom have been eligible for vaccination for the same amount of time), of whom 24 are currently vaccinated. We want to estimate the percentage of vaccinated people in the city on the basis of this sample.

% of people who are vaccinated = (24/100) × 100 = 24%

The standard error can be calculated as:

$$SE = \sqrt{\frac{0.24 \times (1 - 0.24)}{100}} = \sqrt{0.001824}$$

It comes out to be ~0.0427.

The Z* value for 90% confidence level is 1.645 (check the table above)

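We can find the confidence interval at the 90% confidence level with the following steps:

Margin of Error = Z* × Standard Error = 1.645 × 0.0427 ≈ 0.0702

Confidence Interval = p̂ ± Margin of Error = 0.24 ± 0.0702 = (0.1698, 0.3102) ≈ (0.17, 0.31)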
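The whole calculation can be put together in a short sketch. The function name is hypothetical, and the critical value is computed with the standard library rather than read from a table.

```python
from statistics import NormalDist


def proportion_confidence_interval(successes, sample_size, confidence_level):
    """Point estimate +/- Z* x standard error for a sample proportion."""
    p_hat = successes / sample_size
    standard_error = (p_hat * (1 - p_hat) / sample_size) ** 0.5
    z_star = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin_of_error = z_star * standard_error
    return p_hat - margin_of_error, p_hat + margin_of_error


lower, upper = proportion_confidence_interval(24, 100, 0.90)
print(f"90% CI: ({lower:.2f}, {upper:.2f})")  # 90% CI: (0.17, 0.31)
```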

# Interpretation of Confidence Interval

We are 90% confident that between 17% and 31% of the city’s population has been vaccinated.
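One way to get a feel for what "90% confident" means is to simulate many surveys from a population whose true proportion we know, and count how often the interval contains it. The sketch below assumes a true proportion of 0.24, which is purely hypothetical, and an arbitrary seed.

```python
import random
from statistics import NormalDist

random.seed(1)  # arbitrary seed so the run is repeatable

P_TRUE = 0.24                        # assumed true proportion (hypothetical)
SAMPLE_SIZE = 100                    # people per survey
N_SURVEYS = 2000                     # number of simulated surveys
Z_STAR = NormalDist().inv_cdf(0.95)  # critical value for a 90% confidence level

covered = 0
for _ in range(N_SURVEYS):
    vaccinated = sum(random.random() < P_TRUE for _ in range(SAMPLE_SIZE))
    p_hat = vaccinated / SAMPLE_SIZE
    margin = Z_STAR * (p_hat * (1 - p_hat) / SAMPLE_SIZE) ** 0.5
    # Count the survey if its interval contains the true proportion.
    if p_hat - margin <= P_TRUE <= p_hat + margin:
        covered += 1

coverage = covered / N_SURVEYS
print(f"share of intervals containing the true proportion: {coverage:.3f}")
```

The printed share should land close to 0.90. Each individual interval either contains the true proportion or it does not. The 90% describes how often the procedure succeeds across many surveys.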

We can see that the confidence interval is very wide, and it will be difficult to make any decisions based on it.

If we increase the sample size to 500 and keep everything else the same (including the sample proportion of 24%), the interval shrinks to (0.21, 0.27).

This means that if we increase the sample size, the confidence interval shrinks. It happens because the standard error decreases as the sample size grows, so the sample proportion tends to land closer to the population proportion.
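A small sketch makes the effect of sample size visible. It holds the sample proportion fixed at 0.24 and the confidence level at 90%, and the sample sizes other than 100 and 500 are our own choices.

```python
from statistics import NormalDist

Z_STAR = NormalDist().inv_cdf(0.95)  # critical value for a 90% confidence level
P_HAT = 0.24                         # sample proportion held fixed


def interval_width(sample_size):
    """Full width of the confidence interval (2 x margin of error)."""
    standard_error = (P_HAT * (1 - P_HAT) / sample_size) ** 0.5
    return 2 * Z_STAR * standard_error


for n in (100, 500, 2000):
    print(f"n = {n:>4}: width = {interval_width(n):.3f}")
```

Because the standard error scales with 1/sqrt(n), we need four times as many people to cut the width of the interval in half.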

You can use this method for any problem statement like this one. We could do a similar analysis for Covid-19 infections and extrapolate it to the city. However, Covid-19 is contagious, so data from a single neighbourhood will be biased: one neighbourhood can have many infections due to community spread, while another might not have a single case. Whatever problem statement we choose, we need to ensure that random sampling takes place.

Happy learning!
