Why Central Limit Theorem

Anirudh Dayma
Published in Analytics Vidhya
4 min read · Apr 11, 2020
Photo by Hannes Richter on Unsplash

I have come across many people who say that the Central Limit Theorem is one of the most useful theorems and that it is widely used by Machine Learning engineers and Data Scientists. In this post I will cite one very good use case of the Central Limit Theorem.

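The Central Limit Theorem (CLT) says that, when the sample size is large enough (greater than 30 is the usual rule of thumb), the sampling distribution of the sample means is approximately normal and its mean is equal to the population mean, irrespective of the distribution of the population (as long as the population has a finite variance).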

Let us try to understand the term "sampling distribution of the sample means". "Sampling distribution" means that the distribution is built from samples, and the latter part, "sample means", implies that it is the distribution of the statistic "mean of the sample". To picture the CLT, we draw a number of samples, each of size greater than 30, calculate the mean of each sample and then plot those means.

Mathematically, it states the following.

Let μ be the population mean and σ be the population standard deviation. If we draw samples of size N from the population, then according to the CLT the mean of the sampling distribution of sample means is given as

$$\mu_{\bar{x}} = \mu$$

and the standard deviation of the sampling distribution of sample means is given as

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$
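To see these two results in action, here is a minimal simulation sketch. The population is my own assumption (an exponential distribution with μ = 1 and σ = 1, chosen because it is strongly skewed), and the sample size and number of samples are arbitrary illustrative values.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: exponential with rate 1, so mu = 1 and sigma = 1 (skewed).
MU, SIGMA = 1.0, 1.0
N = 50              # size of each sample (above the rule of thumb of 30)
NUM_SAMPLES = 5000  # how many samples we draw

# Draw many samples and keep only the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(N))
    for _ in range(NUM_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
predicted_sd = SIGMA / N ** 0.5  # sigma / sqrt(N)

print(f"mean of sample means: {mean_of_means:.3f} (population mean = {MU})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma / sqrt(N) = {predicted_sd:.3f})")
```

Even though the population is far from normal, the mean of the sample means should land close to μ and their standard deviation close to σ/√N.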

While going through hypothesis testing and its terminology (null hypothesis, alternate hypothesis, etc.), I realized that we need a distribution for the null hypothesis; only then can we test whether we have enough evidence to reject it. So we need a distribution to start with, and the Central Limit Theorem can be used to get it.

Hypothesis testing at its core checks whether our statistic plausibly belongs to the null hypothesis distribution or to some other distribution. If the statistic would be very unlikely under the null hypothesis distribution, we say that it comes from some other distribution and reject the null hypothesis.

Let's consider a real-life example to see the Central Limit Theorem in use.

Suppose we are part of a washing machine company and we want to check whether our machine washes clothes faster than the average machine in the market. We run our washing machine 100 times and find that the average time taken is 5.3 mins with a standard deviation of 2.1 mins. Other machines in the market take 6 mins on average. We need to check whether we have significant evidence to say that our machine is faster than the average machine.

So we have a sample with size 100, sample mean = 5.3 mins, sample’s standard deviation = 2.1 mins, population mean = 6 mins.

The null hypothesis would be that our machine is similar to the average machine, i.e. the average time taken by our machine is 6 mins, which implies that the sample comes from a distribution with mean = 6.

The alternate hypothesis would be that our machine is better than the average machine, i.e. the average time taken by our machine is less than 6 mins, which implies that the sample comes from a different distribution with mean less than 6.

Now suppose we draw a number of similar samples, calculate their means and plot them; the result is a sampling distribution of sample means. Under the null hypothesis, the mean of this distribution is equal to the population mean, i.e. 6 (as stated by the CLT), and its standard deviation can also be calculated using the CLT. But we don't know the population standard deviation, so we use the sample standard deviation, i.e. 2.1, as an estimator of it (by this I mean that the population standard deviation is approximately equal to the sample standard deviation). With this estimate in hand, the CLT gives us the standard deviation of the sampling distribution of the sample means.
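The arithmetic for our example is short enough to write out. This is only a sketch of the calculation described above; the variable names are mine, and the sample standard deviation stands in for the unknown σ.

```python
import math

n = 100          # number of runs in our sample
sample_sd = 2.1  # sample standard deviation in mins, used as an estimate of sigma
null_mean = 6.0  # population mean assumed under the null hypothesis, in mins

# Standard deviation of the sampling distribution of the sample means
# (often called the standard error): sigma / sqrt(N).
standard_error = sample_sd / math.sqrt(n)

print(f"null distribution: mean = {null_mean} mins, "
      f"standard deviation = {standard_error:.2f} mins")
```

So the sampling distribution under the null hypothesis has mean 6 mins and standard deviation 2.1/√100 = 0.21 mins.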

Note: the standard deviation of the sample, i.e. 2.1, and the standard deviation of the sampling distribution of the sample means are two different things. The first is the standard deviation of the 100 observations in our sample; the second is the standard deviation of the distribution we created from many similar samples and their means (the sample means). I know it is difficult to understand, but take your time and let it sink in. Going through the article multiple times will make things clear.
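If the difference is still hard to picture, a small simulation may help. Everything here is an assumption made for illustration: I pretend the wash times follow a gamma distribution with mean 6 mins and standard deviation 2.1 mins (which keeps the times positive), and I draw 2000 samples of 100 runs each.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of wash times: gamma with mean 6 and sd 2.1.
POP_MEAN, POP_SD = 6.0, 2.1
SHAPE = (POP_MEAN / POP_SD) ** 2  # gamma shape: mean^2 / variance
SCALE = POP_SD ** 2 / POP_MEAN    # gamma scale: variance / mean
N = 100                           # runs per sample


def draw_sample():
    """Return one sample of N simulated wash times in mins."""
    return [random.gammavariate(SHAPE, SCALE) for _ in range(N)]


# First quantity: spread of the 100 observations inside a single sample.
one_sample = draw_sample()
sd_within_sample = statistics.stdev(one_sample)

# Second quantity: spread of the means of many such samples.
means = [statistics.fmean(draw_sample()) for _ in range(2000)]
sd_of_sample_means = statistics.stdev(means)

print(f"sd of one sample of {N} runs: {sd_within_sample:.2f}")
print(f"sd of the sample means:       {sd_of_sample_means:.2f}")
```

The first number should come out near 2.1 and the second near 0.21, roughly ten times smaller, because averaging 100 runs smooths out most of the run-to-run variation.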

The distribution that we have just created has a mean equal to 6 mins, and it is nothing but the distribution of the null hypothesis. So we can proceed with our test and check whether we find significant evidence to reject the null hypothesis. Remember that we will still be checking whether our sample comes from a distribution with mean = 6; this is what hypothesis testing does at its core. We won't be covering the test itself in this article, but definitely in an upcoming one, as the motive of this article was to cite a use case of the Central Limit Theorem.

I hope you now know a real-life use case of the Central Limit Theorem. Feel free to drop comments or questions below; you can find me on LinkedIn.
