Central Limit Theorem and Machine Learning | Part-2
This blog is a continuation of “Central Limit Theorem and Machine Learning”. Please read Part 1 first for the background on the topic. The link is here.
As we discussed earlier, we can’t simply take a sample’s mean as the mean of the whole population, even though the two values are close. We need to quantify this uncertainty using a confidence interval.
What is Confidence Interval?
In statistics, a confidence interval is a range of values, computed from a sample, that is expected to contain the population parameter a certain proportion of the time when the sampling is repeated. Confidence intervals measure the degree of uncertainty in a sampling method. They can be built for any confidence level, with 95% and 99% being the most common.
The confidence level tells us how confident we are in our results. With any survey or experiment, we’re never 100% sure that our results could be repeated. If we’re 95% sure or 99% sure, that’s usually considered “good enough” in statistics. That percentage of sureness is the confidence level, and the range of values that goes with it is the confidence interval.
For example, we survey a group of pet owners to see how many dog food cans they purchase a year. We test our statistics at the 99 percent confidence level and get a confidence interval of (200, 300). That means we estimate that pet owners buy, on average, between 200 and 300 cans a year. We’re super confident (99% is a very high level!) that our results are sound, statistically.
This is helpful when we don’t know anything about the population parameter. Without an interval, a single sample estimate tells us nothing about its own precision.
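To make the “proportion of times” idea concrete, here is a small simulation sketch. It is not part of the original analysis: the population mean, standard deviation, sample size and number of trials are made-up values, and the population is assumed to be normal with a known standard deviation. We draw many samples, build a 95% interval from each, and count how often the interval contains the true mean.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(42)

# Hypothetical population settings, chosen only for illustration
MU, SIGMA = 250.0, 40.0   # true mean and (known) standard deviation
N, TRIALS = 50, 1000      # sample size and number of repeated samples

z = NormalDist().inv_cdf(0.975)   # z-score for a 95% confidence level
se = SIGMA / math.sqrt(N)         # standard error of the sample mean

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    # Does this sample's interval contain the true mean?
    if x_bar - z * se <= MU <= x_bar + z * se:
        hits += 1

coverage = hits / TRIALS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should land close to 0.95, which is what a 95% confidence level means in practice.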
Z-Score:
A z-score describes a raw score in terms of its distance from the mean when measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if it lies below the mean.
It is useful to standardize the values (raw scores) of a normal distribution by converting them into z-scores because:
- It allows us to calculate the probability of a score occurring within a standard normal distribution
- It also enables us to compare two scores from different samples (which may have different means and standard deviations).
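As a quick sketch of both points, the snippet below converts two raw scores to z-scores and looks up their probabilities. The scores, means and standard deviations are hypothetical numbers, and `z_score` is a helper name introduced just for this example.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

# Two hypothetical scores from samples with different means and sds
z_a = z_score(85, mean=70, sd=10)
z_b = z_score(60, mean=50, sd=5)

# Probability of a score at or below each value under the standard normal
p_a = NormalDist().cdf(z_a)
p_b = NormalDist().cdf(z_b)

print(z_a, round(p_a, 4))  # 1.5 0.9332
print(z_b, round(p_b, 4))  # 2.0 0.9772
```

Even though 85 is the larger raw score, the second score sits further above its own mean once both are put on the same scale.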
Calculation:
Let 0<α<1, and let (100 * (1-α))% denote the confidence level.
Suppose we have a standard normal distribution “Z.”
Let (Z_α/2) denote the z-score with α/2 probability to its right.
Similarly, let (-Z_α/2) denote the z-score with α/2 probability to its left.
Let α = 0.1, then (Z_α/2) = 1.645 has a probability of 0.05 to its right and (-Z_α/2) = -1.645 has a probability of 0.05 to its left.
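Instead of reading these values from a table, we can compute them. The sketch below uses Python’s built-in `statistics.NormalDist`. The helper name `z_critical` is ours, not something from the original code.

```python
from statistics import NormalDist

def z_critical(alpha):
    """z-score with alpha/2 probability to its right under the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    z = z_critical(alpha)
    print(f"alpha = {alpha:.2f}: Z_alpha/2 = {z:.3f}, -Z_alpha/2 = {-z:.3f}")
# alpha = 0.10: Z_alpha/2 = 1.645, -Z_alpha/2 = -1.645
# alpha = 0.05: Z_alpha/2 = 1.960, -Z_alpha/2 = -1.960
# alpha = 0.01: Z_alpha/2 = 2.576, -Z_alpha/2 = -2.576
```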
Let’s get back to our BlackFriday sales data analysis and calculate the 95% confidence interval for the mean when the standard deviation is known.
95% confidence level with a known standard deviation:
To calculate the limits for a 95 percent confidence level, the formula is:
upper 95% limit = x_bar(sample mean) + SE(standard error=σ/sqrt(n)) * ((Z_0.05/2)=1.96)
lower 95% limit = x_bar(sample mean) - SE(standard error=σ/sqrt(n)) * ((Z_0.05/2)=1.96)
For each experiment, we can see the calculated mean falls between the lower and upper limit. We can infer that the purchase mean(μ) parameter for the whole dataset lies between 8277.73 and 10246.79, and we are 95% confident about this.
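The calculation can be sketched in a few lines of Python. This is not the original notebook code. The data below is synthetic and only stands in for the purchase column, the value of `sigma` is an assumption treated as known, and `ci_known_sigma` is a helper name introduced here.

```python
import math
import random
from statistics import NormalDist, mean

def ci_known_sigma(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sd (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for 95%
    se = sigma / math.sqrt(len(sample))       # standard error = sigma / sqrt(n)
    x_bar = mean(sample)
    return x_bar - z * se, x_bar + z * se

# Synthetic stand-in for the purchase amounts (not the real dataset)
random.seed(0)
sigma = 5000.0
population = [random.gauss(9000, sigma) for _ in range(100_000)]

sample = random.sample(population, 100)
lower, upper = ci_known_sigma(sample, sigma)
print(f"sample mean = {mean(sample):.2f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```

With n = 100 and σ = 5000, the interval is about 1960 units wide in total, since the width depends only on σ, n and the z-score.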
95% confidence level without standard deviation:
We often don’t know the standard deviation of a population, which is very common for a huge population. In the case below, suppose we don’t know the standard deviation of the purchase feature. We then have to calculate the standard error from the sample, using sample sd/sqrt(sample size), as shown below.
upper 95% limit = x_bar(sample mean) + SE(standard error=s/sqrt(n)) * ((Z_0.05/2)=1.96)
lower 95% limit = x_bar(sample mean) - SE(standard error=s/sqrt(n)) * ((Z_0.05/2)=1.96)
We can also clearly see that the calculated mean falls between the lower and upper boundary for each experiment. Though the margin is minimal now, we can say that we are 95% confident about this.
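Here is a minimal sketch of the same calculation when only the sample is available. The purchase values are made up for illustration, and `ci_sample_sd` is a hypothetical helper name. It keeps the z-score of 1.96 used in this post. For small samples, a t-distribution is the more usual choice.

```python
import math
from statistics import NormalDist, mean, stdev

def ci_sample_sd(sample, confidence=0.95):
    """Confidence interval for the mean using the sample sd in place of sigma."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    s = stdev(sample)                  # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(len(sample))    # standard error = s / sqrt(n)
    x_bar = mean(sample)
    return x_bar - z * se, x_bar + z * se

# Made-up purchase amounts, only to show the call
purchases = [8370, 15200, 1422, 1057, 7969, 15227, 19215, 15854, 15686, 7871,
             5254, 3957, 6073, 15665, 5378, 2079, 13055, 8851, 11788, 19614]

low, high = ci_sample_sd(purchases)
print(f"sample mean = {mean(purchases):.2f}")
print(f"95% CI = ({low:.2f}, {high:.2f})")
```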
Similarly, we can calculate the 99% confidence interval for both cases.
From the aspect of machine learning:
From the central limit theorem, we know that the sample mean follows a Gaussian distribution. We can use this knowledge to estimate the likelihood of the sample mean based on the sample size, and to calculate an interval of desired confidence around the skill of a machine learning model.
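One common way to do this for a classifier is to treat accuracy as a proportion and apply the same normal approximation. The sketch below assumes that approach. The accuracy of 0.88 and the test set size of 500 are invented numbers, and `accuracy_interval` is a helper name of our own.

```python
import math
from statistics import NormalDist

def accuracy_interval(accuracy, n, confidence=0.95):
    """Normal-approximation interval around a classification accuracy.

    accuracy: fraction of correct predictions on the test set (0 to 1)
    n: number of test examples
    """
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(accuracy * (1 - accuracy) / n)   # standard error of a proportion
    # Clip to [0, 1] because accuracy cannot leave that range
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)

# Hypothetical model: 88% accuracy measured on 500 test examples
acc_low, acc_high = accuracy_interval(0.88, 500)
print(f"95% CI for accuracy = ({acc_low:.4f}, {acc_high:.4f})")
```

A larger test set shrinks the standard error, so the same accuracy measured on more examples gives a tighter interval.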
Note: The z-score of the normal distribution can be taken from the z-table.
Please give it a clap if you like the blog. Please find the full code here.