Central Limit Theorem and Machine Learning | Part-2

Abhishek Barai
Nov 22, 2020 · 5 min read

This blog is a continuation of “Central Limit Theorem and Machine Learning”. Please read Part 1 first for background on the topic. The link is here.

As we discussed earlier, we can’t simply treat a sample mean as the population mean, even when the two values are close. We need to quantify this uncertainty, and a confidence interval is the standard way to do it.

What is a Confidence Interval?

In statistics, a confidence interval is a range of values, computed from a sample, that is expected to contain the true population parameter in a stated proportion of repeated samples. That proportion is the confidence level. Confidence intervals measure the degree of uncertainty in a sampling method. Any confidence level can be chosen, with 95% and 99% being the most common.

The confidence level tells us how confident we are in our results. With any survey or experiment, we’re never 100% sure that our results could be repeated. If we’re 95% sure or 99% sure, that’s usually considered “good enough” in statistics. That percentage of sureness is the confidence level, and the range of values it produces is the confidence interval.

For example, suppose we survey a group of pet owners to see how many dog food cans they purchase a year. We test our statistics at the 99 percent confidence level and get a confidence interval of (200, 300). That means we estimate that the mean number of cans bought per owner lies between 200 and 300 a year. We’re very confident (99% is a very high level!) that our results are sound, statistically.

This is helpful when we know nothing about the population parameter. Without an interval, a sample statistic on its own says nothing about how precise it is.

Z-Score:

A z-score describes a raw score in terms of its distance from the mean when measured in standard deviation units. The z-score is positive if the value lies above the mean and negative if it lies below the mean.

It is useful to standardize the values (raw scores) of a normal distribution by converting them into z-scores because:

  1. It allows us to calculate the probability of a score occurring within a standard normal distribution.
  2. It also enables us to compare two scores from different samples (which may have different means and standard deviations).

z-scores and corresponding centiles
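
The two uses listed above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original analysis: the function name `z_score` and the exam scores are made up for the example, and only the standard library is used.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviation units."""
    return (x - mean) / sd


# Hypothetical scores from two tests with different means and spreads.
z_a = z_score(85, mean=70, sd=10)  # 1.5 standard deviations above the mean
z_b = z_score(60, mean=50, sd=5)   # 2.0 standard deviations above the mean

# Use 1: probability of a score at or below this value under a normal model.
p_a = NormalDist().cdf(z_a)

# Use 2: the z-scores are directly comparable even though the raw scores are not.
print(f"z_a = {z_a}, z_b = {z_b}, P(Z <= z_a) = {p_a:.4f}")
```

Although 85 is the larger raw score, the score of 60 sits further above its own mean once both are standardized.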

Calculation:

Let 0<α<1, and let (100 * (1-α))% denote the confidence level.

Suppose we have a standard normal distribution “Z.”

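Let (Z_α/2) denote the z-score with α/2 probability to its right.
Similarly, let (-Z_α/2) denote the z-score with α/2 probability to its left.

Let α = 0.1. Then (Z_α/2) = 1.645 has probability 0.05 to its right, and (-Z_α/2) = -1.645 has probability 0.05 to its left. The remaining 90% of the probability lies between the two.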
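
Instead of reading these values from a table, they can be computed from the inverse CDF of the standard normal distribution. The sketch below uses Python's standard library; the helper name `z_critical` is an assumption made for this example.

```python
from statistics import NormalDist


def z_critical(alpha):
    """z-score with alpha/2 probability to its right under N(0, 1)."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for alpha in (0.10, 0.05, 0.01):
    level = 100 * (1 - alpha)
    print(f"{level:.0f}% confidence level: z = {z_critical(alpha):.3f}")
```

For α = 0.10, 0.05 and 0.01 this gives roughly 1.645, 1.960 and 2.576, the values used for 90%, 95% and 99% confidence levels.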

some values of z score

Let’s get back to our BlackFriday sales data analysis. Let’s calculate the 95% confidence interval for the mean for known standard deviation.

95% confidence level with a known standard deviation:

To calculate the limit for a 95 percent confidence level the formula will be,

x_bar = sample mean, σ = known sd of the population, n = sample size

upper 95% limit = x_bar + (Z_0.05/2) * SE, where SE (standard error) = σ/sqrt(n) and (Z_0.05/2) = 1.96

lower 95% limit = x_bar - (Z_0.05/2) * SE, where SE (standard error) = σ/sqrt(n) and (Z_0.05/2) = 1.96
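
The two limits translate directly into code. The following is a minimal sketch using only the standard library; the function name `mean_ci_known_sigma` and the input numbers are made up for illustration and are not taken from the BlackFriday data.

```python
import math
from statistics import NormalDist


def mean_ci_known_sigma(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sd is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    se = sigma / math.sqrt(n)                # standard error of the mean
    return x_bar - z * se, x_bar + z * se


# Hypothetical inputs: sample mean 9200, known population sd 5000, n = 100.
lower, upper = mean_ci_known_sigma(x_bar=9200.0, sigma=5000.0, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Passing `confidence=0.99` gives the 99% interval with the same code, since only the z value changes.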

with lower and upper limits

For each experiment, we can see that the calculated mean falls between the lower and upper limits. We can infer, with 95% confidence, that the mean purchase (μ) for the whole dataset lies between 8277.73 and 10246.79.
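
One way to read "95% confident" is through repeated sampling: if the experiment is repeated many times, about 95% of the intervals built this way should contain the true mean. The simulation below is a sketch under assumed conditions. The population (normal, with mean 9000 and sd 5000), the sample size, and the number of experiments are all hypothetical choices, not values from the BlackFriday data.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable

MU, SIGMA = 9000.0, 5000.0   # hypothetical population mean and sd
N, EXPERIMENTS = 50, 2000    # sample size and number of repeated experiments

z = NormalDist().inv_cdf(0.975)  # z-score for a 95% confidence level
se = SIGMA / math.sqrt(N)        # standard error with known sd

covered = 0
for _ in range(EXPERIMENTS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    x_bar = mean(sample)
    # Count the experiment if its interval contains the true mean.
    if x_bar - z * se <= MU <= x_bar + z * se:
        covered += 1

coverage = covered / EXPERIMENTS
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95, with some variation from run to run if the seed is changed.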

95% confidence level with an unknown standard deviation:

Often we don’t know the standard deviation of a population, which is very common for huge populations. Suppose, in the case below, that we don’t know the standard deviation of the purchase feature. We then have to estimate the standard error from the sample, using sample sd/sqrt(number of samples), as shown below.

x_bar = sample mean, s = sample sd, n = sample size

upper 95% limit = x_bar + (Z_0.05/2) * SE, where SE (standard error) = s/sqrt(n) and (Z_0.05/2) = 1.96

lower 95% limit = x_bar - (Z_0.05/2) * SE, where SE (standard error) = s/sqrt(n) and (Z_0.05/2) = 1.96
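
The same calculation can be sketched starting from raw sample values. The function name `mean_ci_from_sample` and the data are invented for this example. It follows the z-based formula above; for small samples a t-distribution critical value is commonly used instead of 1.96, which this sketch does not do.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_ci_from_sample(sample, confidence=0.95):
    """Z-based confidence interval for the mean using the sample sd."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample sd (n - 1 in the denominator)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = s / math.sqrt(n)  # estimated standard error
    return x_bar - z * se, x_bar + z * se


# Hypothetical sample values, not taken from the BlackFriday data.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
lower, upper = mean_ci_from_sample(sample)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```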

We can again see that the calculated mean falls between the lower and upper limits for each experiment. Although the margin between the limits is small now, we can still say that we are 95% confident about this.

Similarly, we can calculate the 99% confidence interval for both cases.

From the aspect of machine learning:

From the central limit theorem, we know that the sample mean follows an approximately Gaussian distribution. We can use the properties of the Gaussian distribution to estimate how likely a given sample mean is for a given sample size, and to calculate an interval of the desired confidence around the skill of a machine learning model.
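
As one possible illustration, suppose the model's skill is classification accuracy measured on a held-out test set. Accuracy is a mean of 0/1 outcomes, so a Gaussian approximation gives an interval around it. This is a sketch under that assumption, not a method described in this post: the function name `accuracy_ci` and the numbers are hypothetical, and the approximation is rough for small test sets or accuracies near 0 or 1.

```python
import math
from statistics import NormalDist


def accuracy_ci(accuracy, n, confidence=0.95):
    """Normal-approximation interval for accuracy measured on n test examples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # sd of a mean of 0/1 outcomes
    # Clip to [0, 1] because accuracy cannot leave that range.
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


# Hypothetical result: 88% accuracy on 500 test examples.
lower, upper = accuracy_ci(0.88, 500)
print(f"95% CI for accuracy: ({lower:.3f}, {upper:.3f})")
```

A larger test set shrinks the standard error, so the same accuracy comes with a narrower interval.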

Note: The z-score of the normal distribution can be taken from the z-table.

Abhishek Barai

Written by

Data Scientist | NLP Engineer | Quantitative Researcher | Blogger

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
