How Confidence and Prediction intervals work
Learning it the easy way
Confidence Intervals
Confidence intervals are ranges calculated from sample data that are likely to contain the population parameter (mean, standard deviation) of interest. For example, if our population is (2, 6), a confidence interval of 2 to 6 for the mean says that the population mean likely lies between 2 and 6. And how confidently can we say this? Obviously 100%, right? Because we know all the values, we can calculate it very easily.
But in real-life problems, this is not the case. It is not always feasible or possible to study the whole population. So what do we do? We take sample data. But can we rely on one sample? No, because different samples from the same population will produce different means.
So imagine taking numerous random samples (from the same population) and calculating a confidence interval for each sample. A certain percentage of these ranges will contain the true population parameter.
This percentage is called the confidence level. A 95% confidence level means that if I took 100 random samples, I would expect about 95 of the resulting confidence intervals to contain the true population parameter.
Still confused? Let’s understand this through an example.
Suppose you work at a sports goods company that manufactures basketballs. There are several manufacturing plants and each plant produces 10,000 balls every day. You have to test the balls before the company packages and ships them off.
One of the standard industry procedures is to check the first bounce height of the balls. Your colleague suggests taking a random sample (from the same manufacturing plant) and forming a confidence interval around the mean bounce height.
You take 100 balls, drop them from the first floor of your office and measure the bounce height of each. The 95% confidence interval for the mean bounce comes out as 110–120 cm. You can say that you are 95% confident that the mean bounce height of all the basketballs (the entire population from one plant) falls in this range.
Caution: This does not mean that 95% of individual balls will bounce to a height within this range. The interval is about the mean, not about single bounces.
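To make the basketball calculation concrete, here is a minimal sketch of how such an interval could be computed. The bounce heights are simulated (made-up numbers, not real measurements), the function name is my own, and the interval uses a normal approximation, which is reasonable for a sample of 100 but is only one of several ways to do it.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, level=0.95):
    """Return (low, high) for the population mean, using a normal approximation."""
    n = len(sample)
    mean = statistics.mean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% level
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_err, mean + z * std_err

# Simulated bounce heights in cm for 100 balls (assumed mean 115, sd 25)
rng = random.Random(42)
heights = [rng.gauss(115, 25) for _ in range(100)]

low, high = mean_confidence_interval(heights)
print(f"95% CI for the mean bounce height: {low:.1f} to {high:.1f} cm")
```

For small samples, a t critical value would usually replace the normal one and give a slightly wider interval.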
As we can see in the picture above, there are 100 confidence intervals for 100 different samples. The red line is the true value of the population parameter. 95 of these intervals contain the true value; the other 5 do not.
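The same idea can be checked with a small simulation. The sketch below assumes a population with a known mean and standard deviation (both values are invented for illustration), draws many samples of 100 balls, and counts how often the 95% interval contains the true mean.

```python
import math
import random
import statistics

TRUE_MEAN = 115.0  # assumed population mean bounce height in cm
TRUE_SD = 25.0     # assumed population standard deviation in cm

def covers_true_mean(rng, n=100, level=0.95):
    """Draw one sample of n balls and report whether its CI contains TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_err <= TRUE_MEAN <= mean + z * std_err

rng = random.Random(0)
trials = 2000
hits = sum(covers_true_mean(rng) for _ in range(trials))
coverage = hits / trials
print(f"{hits} of {trials} intervals contain the true mean ({coverage:.1%})")
```

The fraction should land close to 95%, though not exactly on it. With only 100 samples, seeing 93 or 97 intervals cover the true value would be perfectly normal.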
Some important considerations:
- As we draw larger and larger samples, our confidence interval becomes narrower. A larger sample reduces the sampling error and gives more precise estimates, and thus smaller intervals. Suppose you decide to test 5000 balls: you’ll get a better estimate of the mean bouncing height.
- As we increase the confidence level, say from 95% to 99%, our range becomes wider. It may sound counter-intuitive, but think about it. To have more confidence that an interval contains the true parameter, the range should be wider. For example, I can be 100% confident that the bouncing height of the ball is between 0 and infinity.
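Both effects are easy to see with numbers. This sketch assumes a bounce-height standard deviation of 25 cm (an invented value) and uses the normal-approximation half-width z * sd / sqrt(n) for the interval around the mean.

```python
import math
import statistics

def ci_half_width(sd, n, level):
    """Half-width of a normal-approximation confidence interval for the mean."""
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / math.sqrt(n)

SD = 25.0  # assumed standard deviation of bounce height in cm

for n in (100, 5000):
    for level in (0.95, 0.99):
        hw = ci_half_width(SD, n, level)
        print(f"n={n:>4}, level={level:.0%}: mean ± {hw:.2f} cm")
```

Under these assumptions, going from 100 to 5000 balls shrinks the 95% half-width from roughly 4.9 cm to roughly 0.7 cm, while moving from 95% to 99% widens it at either sample size.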
I hope you got an idea of confidence intervals. Now let's see what prediction intervals are.
Prediction Intervals
What happens after fitting a regression model? We predict the value of the dependent variable.
The prediction interval is the range that likely contains the value of the dependent variable for a single new observation, given specific values of the independent variables.
Let’s take the earlier example of basketball to understand this concept.
Suppose we train a regression model (often linear regression) on the basketballs already produced, using independent variables like rubber material and stitching type. For specific values of these variables, the model gives a 95% prediction interval of 105–125 cm. We can now be 95% confident that the bounce height of the next basketball produced with the same settings will lie in this range.
Note that we are not predicting the mean here but rather an individual value, so there’s greater uncertainty involved. Thus, at the same confidence level and for the same settings, a prediction interval is always wider than the confidence interval.
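Here is a rough sketch of where the extra width comes from, using a simple linear regression with a single numeric predictor. The predictor (inflation pressure), the data, and the function name are all hypothetical and chosen only for illustration, and the critical value uses a normal approximation instead of the t distribution.

```python
import math
import random
import statistics

def regression_intervals(xs, ys, x_new, level=0.95):
    """Fit y = b0 + b1*x by least squares and return
    (prediction, confidence interval, prediction interval) at x_new."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation with n - 2 degrees of freedom
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # Normal approximation to the t critical value (reasonable for large n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    y_hat = b0 + b1 * x_new
    leverage = 1 / n + (x_new - x_bar) ** 2 / sxx
    ci_half = z * s * math.sqrt(leverage)      # uncertainty in the mean response
    pi_half = z * s * math.sqrt(1 + leverage)  # adds the scatter of one new ball
    return y_hat, (y_hat - ci_half, y_hat + ci_half), (y_hat - pi_half, y_hat + pi_half)

# Made-up data: bounce height (cm) against a hypothetical inflation pressure (psi)
rng = random.Random(1)
pressure = [rng.uniform(7.0, 9.0) for _ in range(200)]
bounce = [35.0 + 10.0 * p + rng.gauss(0, 5) for p in pressure]

y_hat, ci, pi = regression_intervals(pressure, bounce, x_new=8.0)
print(f"predicted bounce: {y_hat:.1f} cm")
print(f"95% confidence interval (mean bounce): {ci[0]:.1f} to {ci[1]:.1f} cm")
print(f"95% prediction interval (next ball):   {pi[0]:.1f} to {pi[1]:.1f} cm")
```

The only difference between the two half-widths is the extra 1 under the square root. That term stands for the ball-to-ball scatter of a single new observation, and it does not shrink as the sample grows.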
Prediction intervals are often confused with confidence intervals. They are related, but the two have different calculations and purposes.
The prediction interval predicts in what range a future individual observation will fall, while a confidence interval shows the likely range of values associated with some statistical parameter of the data, such as the population mean.
I hope you enjoyed reading about CI and PI and learned something from it.