**Theory of Confidence Interval and Confidence Interval of Mean**

**Population and Sample**

The term population carries different meanings in different contexts. For example, consider the birth weight of newborns in a particular city over a one-year period, during which there were 10,000 childbirths. Here the population consists of all 10,000 newborns. Instead of measuring the weight of every newborn (which is not challenging in this case), let’s decide to measure the birth weight of only 50 randomly chosen newborns. This subset is what statistics calls a sample. We can use the sample as a proxy for the entire population and extrapolate measurements made on the sample to draw conclusions about the population.

In biomedical research, the population is not only a much larger set of target people (for example, “males in India”) but also includes future generations, so we assume that it is infinite. In experimental research, population means the ideal situation or underlying mechanism. For example, Gregor Mendel studied 929 pea plants in the F2 generation (his sample) to arrive at the generalized 3:1 phenotypic ratio (dominant:recessive) in the ideal population. His 3:1 ratio is a model: a mathematical description of a simplified view of nature. He used his sample to infer an underlying phenomenon (his model) that applies to diploid organisms in general.

Statistics goes from the sample (specific observations) to the population (general conclusions and models), and it is therefore considered an example of logical induction. Probability, on the other hand, is an example of logical deduction, as it goes the other way around, from general to specific.

**Population Vs. Sample Notations**

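In general, notations for statistical quantities use Greek letters for populations and Latin letters for samples. In this article, for example, the population mean is written µ, while the sample mean and sample standard deviation are written m and s.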

**Theory of Confidence Intervals**

Let’s return to our earlier example. We know precisely the birth weights of 10,000 newborns in a particular city over a year, and we treat them as our population. We have also calculated the mean birth weight, 2.7 kg (the population mean). Out of the 10,000 newborns, we randomly select 50 (our sample) and calculate the sample mean. The sample mean could be higher or lower than the actual population mean. Let the sample mean be 2.6 kg with a standard deviation of 1 kg. We first calculate a statistic called the t-ratio:

t = (m - µ)/(s/√n)

where m is the sample mean, µ is the population mean, s is the sample standard deviation, n is the sample size, and the denominator (s/√n) is the standard error of the mean (SEM) of the sample.

The t-ratio is the difference between the sample mean and the population mean, divided by the sample standard error of the mean.

t = (2.6 - 2.7)/(1/√50)

= -0.1/0.1414

= -0.707

The t-ratio is a bit below zero. This is expected: because the sample mean will usually be close to the population mean, the numerator of the t-ratio will be close to zero, so the ratio will hover around zero. Now we draw another random sample of 50 newborns from the same population and calculate its t-ratio. We repeat this 100 times and plot the distribution of the t-ratios as a histogram. The distribution would look like this:

This is one way to convert sample means from an approximately Gaussian population into a standardized, symmetric distribution, but for this method you need to know the true population mean (which mostly remains unknown, except in simulation studies of this kind). As explained, the t-ratios will be centered around zero (peak at zero), because most of the sample means will be close to the population mean. The shape of the t-distribution depends on the degrees of freedom (df, equal to n-1). If 2.5% of the total area is chopped off at each tail (i.e., the most unusual t-ratios, whether too low or too high), the remaining area covers the range of t-ratios that includes 95% of samples. To keep exactly 95% of the area, we have to cut the tails at a specific t score in both directions; this t score is called t* or t critical. In this case, with 49 degrees of freedom and a significance level of 0.05 (because we chop off the 5% most unusual values), t* is 2.01 (obtained from a calculator, as explained below). That means that if we cut the graph at -2.01 and +2.01 (shaded area in the figure), we get the range of t-ratios that includes 95% of samples. For a particular significance level and df, t* can be looked up in a t-distribution table or obtained from an online calculator.
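The repeated-sampling experiment described above can be sketched in Python. This is a minimal simulation under stated assumptions, not the procedure used to draw the figure: the population is generated from a Gaussian with mean 2.7 kg and an arbitrarily chosen standard deviation of 0.5 kg, the number of repetitions is raised to 1,000 so that the share is more stable, and the function names are hypothetical. The cutoff 2.01 is the t* quoted in the text for df = 49.

```python
import math
import random
import statistics

def t_ratio(sample, mu):
    """t = (m - mu) / (s / sqrt(n)) for one sample."""
    m = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample SD (n - 1 in the denominator)
    return (m - mu) / (s / math.sqrt(len(sample)))

def simulate_t_ratios(population, n, repeats, rng):
    """Draw `repeats` random samples of size n and return their t-ratios."""
    mu = statistics.mean(population)      # true mean, known only because this is a simulation
    return [t_ratio(rng.sample(population, n), mu) for _ in range(repeats)]

rng = random.Random(42)                   # fixed seed so the run is repeatable

# Assumed population: 10,000 birth weights, Gaussian, mean 2.7 kg, SD 0.5 kg (the SD is an assumption)
population = [rng.gauss(2.7, 0.5) for _ in range(10_000)]

ratios = simulate_t_ratios(population, n=50, repeats=1000, rng=rng)

T_STAR = 2.01                             # t* for df = 49 at the 95% level, as quoted in the text
share_inside = sum(abs(t) <= T_STAR for t in ratios) / len(ratios)
print(f"share of t-ratios within ±{T_STAR}: {share_inside:.3f}")
```

The printed share should come out close to 0.95, and a histogram of `ratios` should be roughly bell-shaped and centered on zero.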

As stated earlier, t ratio = (m-µ)/(s/√n)

We can rearrange this equation to solve for µ, the true population mean. For the sake of brevity, the derivation of the following formula is omitted.

µ= m ± t* (s/√n)

where m is sample mean, t* is constant from t-distribution and s/√n is sample SEM

Given the sample mean m, sample standard deviation s and sample size n, we can define a range that includes the true population mean with a stated confidence. As already explained, t* depends on the desired confidence, which by convention is 95%. If t* for 95% confidence is used, the ranges computed from repeated samples would include the true population mean 95% of the time. This range is called the 95% Confidence Interval.

**Confidence Limits, Levels and Intervals**

The desired amount of statistical confidence is called the Confidence Level (CL). For example, if you want very high confidence in your results, you should choose a 99% Confidence Level as part of your experimental design. The Confidence Level is chosen before the experiment is conducted, and it is not ethical to change it after the data are generated. If you chose a 99% CL, you will report a 99% Confidence Interval after the experiment. A Confidence Interval is a range (lower limit to upper limit) that expresses the precision of your sample measurement as an estimate of the true population value. The two values that bound this range, the lower limit and the upper limit, are called Confidence Limits.

**Confidence Interval of Mean**

Confidence interval of the mean tells you how precisely you have determined the sample mean as an estimate of the population mean. The width of the interval depends on the Confidence Level (CL): a CL of 99% gives more confidence that the interval contains the population mean than 95%, which in turn gives more than 90%. That extra confidence comes at the cost of precision: the higher the Confidence Level, the wider the Confidence Interval. For example, a 99% CI is a lot wider than a 90% CI.
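The effect of the Confidence Level on the width can be checked with a few lines of Python. This is only a sketch: the summary statistics (mean 12.81, s 4.905, n 24) are those of the test-marks example used later in this article, and the t* values for df = 23 are assumed from a standard t-distribution table.

```python
import math

m, s, n = 12.81, 4.905, 24                # summary statistics of the test-marks example
sem = s / math.sqrt(n)                    # standard error of the mean

# t* for df = 23 at each confidence level (assumed from a standard t-table)
T_STAR_DF23 = {90: 1.7139, 95: 2.0687, 99: 2.8073}

intervals = {level: (m - t * sem, m + t * sem) for level, t in T_STAR_DF23.items()}

for level, (lower, upper) in intervals.items():
    print(f"{level}% CI: {lower:.2f} to {upper:.2f} (width {upper - lower:.2f})")
```

The same data give a progressively wider interval as the level goes from 90% to 99%.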

Another statistic, the Standard Error of the Mean (SEM), is similar to the CI; it also tells you how precisely the sample mean estimates the population mean. However, the confidence level of the range mean ± SEM is low: roughly 68% for large samples and lower for small ones, so SEM on its own understates the uncertainty. The 95% CI is routinely used across the biological and environmental sciences.

In situations where the whole population is used to calculate a statistic, for example the population mean, a Confidence Interval makes no sense. Consider a class of 24 students with a mean mark of 11.8 out of 25. Here 11.8 is the population mean and we are 100% sure (confident) that the true population mean is this value, no question about it. However, if I randomly select 8 of the 24 students and calculate their mean mark, that would be a sample mean, and a CI makes sense in such situations.

The 95% CI of the sample mean can be calculated using two methods. The parametric (distribution-dependent) method assumes that our sample is drawn from a roughly Gaussian distribution. This method uses a constant from the t-distribution (t*) to calculate the 95% CI of the sample mean. The formula was introduced in the earlier section:

m ± t* (s/√n)

where m is sample mean, t* is a constant from t-distribution, s is sample standard deviation and n is sample size. Also note that s/√n = SEM (Standard Error of the Mean)

We can calculate 95% CI of sample mean given sample mean, sample standard deviation and sample size; raw data is not necessary.

A Confidence Interval is reported as a range, written as (Lower Limit to Upper Limit), where the limits are the sample mean minus and plus the margin t* (s/√n).

E.g. if the margin is 6 and m = 51, the 95% CI is (45 to 57). Notation like 51±6, which is commonly used to describe standard deviation, is not used to describe Confidence Intervals.

Let us consider an example of calculating the 95% CI of a sample mean: test marks data with mean = 12.81, s = 4.905 and n = 24. As n = 24, df is 23.

Let’s first look up t* for 95% Confidence Level with df=23:

t* is 2.0687

Let’s now calculate w, the half-width (margin) of the 95% CI

w = t* (s/√n)

= 2.0687 * (4.905 / √24)

= 2.0687 * (4.905 / 4.899)

= 2.0687 * 1.0012

= 2.0712

Finally, the 95% CI of the mean is

Mean - 2.0712 to Mean + 2.0712

(10.74 to 14.88)

If no standard deviation or mean is given and we are presented with raw data instead, we first have to calculate the mean and SD, and then the 95% CI.
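Both routes, from summary statistics and from raw data, can be sketched in Python. This is an illustration rather than a reference implementation: the function names are hypothetical, the raw data are the five test marks borrowed from the resampling example later in this article, and t* = 2.776 for df = 4 is assumed from a standard t-distribution table.

```python
import math
import statistics

def ci_from_summary(m, s, n, t_star):
    """Return (lower, upper) = m -/+ t* * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return m - margin, m + margin

def ci_from_raw(values, t_star):
    """Compute m, s and n from raw data, then reuse the summary formula."""
    return ci_from_summary(
        statistics.mean(values),
        statistics.stdev(values),         # sample SD (n - 1 in the denominator)
        len(values),
        t_star,
    )

# Test-marks example: mean 12.81, s 4.905, n 24, t* = 2.0687 (df = 23)
summary_ci = ci_from_summary(12.81, 4.905, 24, 2.0687)
print("from summary:  %.2f to %.2f" % summary_ci)

# Raw data: five marks, so df = 4 and t* = 2.776 (assumed from a t-table)
marks = [22, 14, 11, 2, 9]
raw_ci = ci_from_raw(marks, 2.776)
print("from raw data: %.2f to %.2f" % raw_ci)
```

With only five values the second interval is very wide, which is what a small sample with a large spread should produce.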

The Microsoft Excel formula for t* in the CI formula is

=TINV(alpha,n-1)

where alpha is the level of significance (0.05 to calculate the 95% CI)
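For readers without a spreadsheet, a rough Python counterpart of this lookup is sketched below using only the standard library. It is an illustration under stated assumptions, not a reference routine: the function names are hypothetical, the density of Student's t distribution is integrated numerically with Simpson's rule, and t* is found by bisection. In practice a library function such as SciPy's `scipy.stats.t.ppf` would normally be preferred.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the density."""
    if x == 0:
        return 0.5
    a = abs(x)
    steps = 2 * max(100, int(a * 50))     # even number of slices, about 100 per unit
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3                  # area between 0 and |x|
    return 0.5 + half if x > 0 else 0.5 - half

def t_critical(alpha, df):
    """Two-tailed critical value t*, i.e. P(|T| > t*) = alpha."""
    target = 1 - alpha / 2                # e.g. 0.975 for alpha = 0.05
    lo, hi = 0.0, 1000.0
    for _ in range(60):                   # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.05, 49), 2))     # df = 49, the birth-weight example
print(round(t_critical(0.05, 23), 4))     # df = 23, the test-marks example
```

The argument order mirrors the spreadsheet formula (alpha first, then degrees of freedom), and the two printed values should land close to the 2.01 and 2.0687 used in this article.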

An online calculator for calculating CI of mean is at

*http://www.sample-size.net/confidence-interval-mean*

Calculating the CI of a mean rests on a number of assumptions. A major assumption that is often overlooked is that the sample must come from a population that is Gaussian (or roughly Gaussian). If the distribution is, for example, lognormal, the CI cannot be calculated using the above method. The sample also has to be random; if the units are deliberately chosen in a non-random way, the CI cannot be calculated. The same applies if some cells in a suspension are clumped (and the suspension is therefore not homogeneous). Patients from a particular clinic are not a random sample either; in situations like this, where true randomization is impossible, we use a ‘convenience sample’ instead of a random one.

An alternative approach for calculating the CI of a mean is nonparametric and rank-based, and therefore does not make explicit assumptions about probability distributions.

For example, out of our 24 students, we randomly choose 5 students and their test marks are: {22, 14, 11, 2, 9}

- Step 1. Rank order these values: {2, 9, 11, 14, 22}.
- Step 2. Make a new subset by picking five random integers from 1 to 5 and taking the value with that rank; repeats are allowed. For example, you mark five pieces of paper with 1 through 5 and place them in a box (as in a lottery). Shuffle the box and randomly pick a paper, record its value, put the paper back in the box, shuffle, pick once again, and so on. You might get the same number multiple times, and that is allowed. For example, suppose you got 1, 3, 3, 4, 5. Now record the values at those ranks to make a subset (2, 11, 11, 14, 22). This new subset is called a pseudosample.
- Step 3. Do this many times (pseudoreplicates), say 500 times, to generate 500 *pseudosamples*. For each pseudosample, calculate the mean. Next, rank order those means (500 values from minimum to maximum) and pick the 2.5th and 97.5th percentiles. As 97.5 - 2.5 = 95, this range is the 95% CI of the mean!
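The steps above can be sketched in Python with the standard library. This is a minimal illustration, not a tuned implementation: the function name, the fixed seed and the way the percentiles are cut (dropping the same number of means from each end of the ranked list) are assumptions made for the example.

```python
import random
import statistics

def bootstrap_ci_of_mean(values, replicates=500, confidence=0.95, seed=0):
    """Percentile bootstrap CI of the mean, following the three steps in the text."""
    rng = random.Random(seed)             # fixed seed so the run is repeatable
    ranked = sorted(values)               # Step 1: rank order the values
    n = len(ranked)

    means = []
    for _ in range(replicates):
        # Step 2: draw n ranks (1..n) with replacement and take the values at those ranks
        pseudosample = [ranked[rng.randint(1, n) - 1] for _ in range(n)]
        means.append(statistics.mean(pseudosample))

    # Step 3: rank order the means and cut off the same number at each tail
    means.sort()
    k = int((1 - confidence) / 2 * replicates)   # 12 means dropped per tail for 500 replicates
    return means[k], means[-(k + 1)]

marks = [22, 14, 11, 2, 9]
lower, upper = bootstrap_ci_of_mean(marks)
print(f"bootstrap 95% CI of mean: {lower} to {upper}")
```

The exact limits change with the seed and the number of pseudoreplicates, so they are best read as approximate.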

This method is variously known as the resampling method, bootstrapping or the computer-intensive method, and it is extensively used in phylogenetics and genomics. A number of studies have reported that this method is far superior to the earlier method that uses a constant from the t-distribution. However, it is not suitable if you have to solve the question by hand in a test paper.

**Summary**

- A sample is a subset of a population. In most cases the properties of the population remain unknown, and we use samples as a proxy to make inferences about it.
- The 95% CI tells us how precisely the sample statistic estimates the true population mean. It is related to SEM: mean ± SEM is roughly a 68% CI for large samples, and the half-width of the 95% CI is approximately twice the SEM. It is different from SD, as SD captures only the scatter of the dataset.
- The 95% Confidence Interval of the mean can be calculated using the formula µ = m ± t* (s/√n)
- It is possible to calculate the 95% CI of the mean without making any assumptions about the distribution of the population from which the sample came. The approach is resampling (bootstrapping). Manual calculation is almost impossible, but it can easily be done on a computer.

## Say Hi!

Linkedin : *https://www.linkedin.com/in/riteshprataps/*