Quick Take: Confidence Intervals
Introduction to the Series
Data Science concepts can go deep and there is ALWAYS a long explanation.
This post aims to answer questions in a way that is by no means comprehensive, but that provides a starting point from which candidates can build their explanation.
Hopefully this will be useful for those who are trying to present big topics in a concise manner.
This article is the second in a series that discusses common questions that may be asked of a Data Practitioner / Analyst / Scientist / Statistician.
Today we will focus on Confidence Intervals.
What are Confidence Intervals? How can they help with understanding Data?
Confidence Intervals are a way to use sample data to infer some information about the population.
Methods that describe sample data effectively are important in data analysis, as collecting data from a whole population can get very difficult and expensive very fast.
Imagine I have to guess the age of the reader of this post. Yes, your age! How do you suggest I go about it?
I don’t think anyone will be impressed if my guess is “Anywhere from 0 to 100 years old”. I would be right (high confidence), but as you can imagine this information isn’t that useful to anyone, as the interval is too wide.
At the other extreme, what are the chances I am right if I guess 25–27 years old? I am more specific, but I have very low confidence that this is the right range. The margin of error is small, but so is my confidence.
What if I say there is a 95 percent chance you are between 25 and 55? This statement, if true, is more useful to us, as it gives us a better idea of what a typical reader of this post looks like, with good confidence.
As demonstrated above, by attaching a confidence level that reflects our uncertainty, we can shrink the interval to give a more meaningful description of the sample, resulting in a better inference about the population, which in the above example is ALL readers of this post.
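The trade-off between confidence and interval width can be sketched in a few lines of Python. This is only an illustration: the sample mean, standard deviation and sample size are made-up numbers, the helper name `margin_of_error` is invented here, and the interval is for the mean reader age (assuming a normal approximation with a known standard deviation), not for any single reader's age.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical sample summary (made-up numbers, for illustration only).
sample_mean = 40.0   # mean reader age in the sample
sample_sd = 12.0     # spread of ages, treated here as a known population sd
n = 50               # number of readers sampled

def margin_of_error(confidence, sd, size):
    """Half-width of a normal-based interval for the mean."""
    # Two-sided critical value, e.g. roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / sqrt(size)

# Higher confidence levels produce wider intervals around the same mean.
for level in (0.50, 0.80, 0.95, 0.99):
    m = margin_of_error(level, sample_sd, n)
    print(f"{level:.0%} interval: {sample_mean - m:.1f} to {sample_mean + m:.1f}")
```

Each step up in confidence widens the interval, which is the same tension as in the age-guessing example: more certainty costs precision.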
Confidence intervals can be used for any point estimate of a population parameter (the mean and the proportion being the most common) whose sampling distribution we can approximate as normal.
Sounds useful, doesn’t it? So, what do we need to construct a confidence interval?
We need to know the Point Estimate (for our example, the mean sample age), the Reliability Factor and the Standard Error. The sample mean is calculated the same way whether or not we know the population variance beforehand, but the formulas for the Reliability Factor and the Standard Error depend on whether the population variance is known.
The Reliability Factor is determined by the confidence level and, depending on whether the population variance is known, is taken from either the Z-distribution (known) or the Student's t-distribution (unknown).
Similarly, which form of the Standard Error we use depends on whether the population variance is known: it is based on the population standard deviation if it is, and on the sample standard deviation if it is not.
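Putting the three pieces together, the usual textbook form of the interval for a mean can be written as below. The symbols are assumptions introduced here rather than notation from this post: \bar{x} is the sample mean, \sigma the known population standard deviation, s the sample standard deviation, n the sample size, and 1 - \alpha the confidence level.

```latex
\text{CI} = \text{point estimate} \pm \text{reliability factor} \times \text{standard error}

% Population variance known: Z-statistic
\bar{x} \pm z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}

% Population variance unknown: Student's t with n - 1 degrees of freedom
\bar{x} \pm t_{\alpha/2,\, n-1} \, \frac{s}{\sqrt{n}}
```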
Additional talking points:
Knowing the population variance is very rare in practice, so we will end up using the Student's t-distribution more often.
Z-statistic vs Student's t-distribution comparison: the t-distribution has fatter tails to account for the extra uncertainty from not knowing the population variance, and it approaches the normal distribution as the sample size grows.
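One way to see why the fatter tails matter is a quick simulation. The sketch below is an illustration under assumed values (a normal population with a hypothetical mean of 40 and standard deviation of 12, and a helper name `z_interval_coverage` invented here). It builds nominal 95% intervals using the Z reliability factor together with the sample standard deviation, then counts how often they contain the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_interval_coverage(n, trials=5000, confidence=0.95, seed=0):
    """Share of z-based intervals (built with the sample sd) that contain the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    true_mean, true_sd = 40.0, 12.0   # hypothetical population values
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        m = sum(sample) / n
        # Sample standard deviation (n - 1 in the denominator).
        s = sqrt(sum((x - m) ** 2 for x in sample) / (n - 1))
        half_width = z * s / sqrt(n)
        if abs(m - true_mean) <= half_width:
            hits += 1
    return hits / trials

for n in (5, 30, 100):
    print(f"n = {n:>3}: coverage of nominal 95% z-interval = {z_interval_coverage(n):.3f}")
```

With small samples the share of intervals that capture the true mean should fall noticeably short of 95%, which is the gap the t-distribution's larger reliability factor is meant to close. As n grows, the two approaches give nearly the same answer.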
Conditions for a C.I. to be useful/valid:
Random Sampling and its importance as a condition for applicability of Confidence Intervals.
The point estimate’s distribution is approximately normal: the Central Limit Theorem is commonly used to justify the normal approximation. Nice article explaining CLT below.
Sample size should be smaller than 10% of the population size when sampling without replacement, e.g. sample size n < 100 if the population is 1000.
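The normality condition above can be explored with a small simulation. This is a rough sketch, not a formal test: it assumes a right-skewed exponential population (chosen here purely for illustration) and uses two helper names invented for this example, `skewness` and `sample_means`. Skewness close to 0 suggests a symmetric, more normal-looking distribution of sample means.

```python
import random

def skewness(values):
    """Moment-based skewness: near 0 for a symmetric distribution."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def sample_means(n, trials=5000, seed=0):
    """Means of `trials` samples of size n from a right-skewed exponential distribution."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]

# Larger samples should give sample means that look less skewed.
for n in (2, 10, 50):
    print(f"n = {n:>2}: skewness of sample means = {skewness(sample_means(n)):.2f}")
```

The skewness of the sample means should shrink toward 0 as n increases, even though the underlying data are far from normal.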
A better explanation on these conditions:
Disclaimer: This is a refresher on these topics, and I hope this quick read will help jolt some synapses. It is by no means comprehensive and is not meant to teach concepts from scratch. For that, I recommend the article below.
What Data Science question do you get often? How did you explain it quickly? Feel free to leave a comment or a clap if you liked it!