Central Limit Theorem and Machine Learning | Part-1

Abhishek Barai
Nov 22, 2020 · 6 min read

Note: Here I will try to cover the idea of the Central Limit Theorem, its significance in statistical analysis, and how it is useful in Machine Learning. In case you haven’t checked it out, please find the link to the normal distribution blog here.


Suppose we want to study the average age of the whole population of India. As the population of India is very large, collecting everyone’s age would be tedious and the survey would take a lot of time. Instead, we can collect samples from different parts of India and try to make an inference. To work with samples, we need an approximation theory that simplifies the process of calculating the mean age. This is where the Central Limit Theorem (CLT) comes into the picture. It is based on such an approximation and has huge significance in the field of statistics. It uses the sampling distribution to generalize from the samples and to calculate the approximate mean, standard deviation (sd), and other important parameters.

CLT states that if you have a population with mean μ and sd σ, and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal.

This will hold whether the source population is normal or skewed, provided the sample size is sufficiently large (a common rule of thumb is n > 30, though heavily skewed populations may need more). If the population is normally distributed, then the sample means are normally distributed even for samples smaller than 30.

Note: CLT is a valid approximation only when each sample is reasonably large. If we have few data points, the samples are forced to be small, which is not an ideal case for applying CLT.
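
A quick way to see the theorem at work is a simulation. The sketch below is not part of the original analysis: it assumes a synthetic, right-skewed population (exponential with mean 1.0) and uses only the Python standard library. The helper names `skewness` and `sample_means` are made up for illustration. It draws many samples with replacement for several sample sizes and reports the skewness of the sample means, which should shrink toward 0 (the value for a normal distribution) as the sample size grows.

```python
import random
import statistics


def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def sample_means(population, sample_size, n_samples, rng):
    """Means of n_samples random samples, each drawn with replacement."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed synthetic population: exponential, strongly right-skewed.
population = [rng.expovariate(1.0) for _ in range(20_000)]
population_mean = statistics.fmean(population)

results = {}
for n in (2, 30, 200):
    means = sample_means(population, sample_size=n, n_samples=2_000, rng=rng)
    results[n] = (statistics.fmean(means), skewness(means))
    print(f"n={n:>3}  mean of sample means={results[n][0]:.3f}  "
          f"skewness={results[n][1]:.3f}")

print(f"population mean={population_mean:.3f}")
```

The mean of the sample means should stay close to the population mean for every sample size, while the skewness drops as n increases.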

What is the Sampling Distribution?

The distribution of a statistic (here, the sample mean) computed over many independently drawn samples from a large dataset having mean μ and sd σ is called a sampling distribution. Basically, it’s the distribution plot of the sample statistics, with its own associated parameters.

Formulation of CLT:

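For random samples of size n drawn from a population in which “X” has finite mean μ and sd σ, CLT says that, for sufficiently large n,

$$\bar{X} \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$

where the mean and sd of the sample mean are,

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$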

So the average of the sample means will be approximately equal to the population mean (μ), and the sd of the sample means will be the standard error, σ/√n.

What is the standard error?

The standard error (SE) of a statistic is the standard deviation of its sampling distribution, or an estimate of that standard deviation. The sampling distribution of the sample mean is generated by repeatedly sampling from the population and recording the means obtained. This forms a distribution of different means, and this distribution has its own mean and sd.

Mathematically, the variance of the sampling distribution obtained is equal to the population's variance divided by the sample size. As the sample size increases, the sample means cluster more closely around the population mean. Therefore, the relationship between the standard error of the mean and the standard deviation is,

$$\mathrm{SE} = \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
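
The relationship follows in a few steps from the independence of the draws. As a sketch, write X_1, …, X_n for the n independent observations in one sample (notation introduced here for illustration, not taken from the original post):

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{n\,\sigma^2}{n^2}
  = \frac{\sigma^2}{n}
\quad\Longrightarrow\quad
\mathrm{SE} = \sqrt{\operatorname{Var}(\bar{X})} = \frac{\sigma}{\sqrt{n}}
```

For instance, with the purchase sd reported in the data analysis of this post (σ = 5023.07) and samples of 100 data points, SE = 5023.07 / √100 ≈ 502.31 units.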

Data Analysis:

Here I have taken the Black Friday sales dataset for the analysis of CLT. The dataset consists of 550,068 data points.

It contains 12 columns. The “Purchase” column will be our feature to examine CLT.
μ and σ of overall purchases are 9263.97 and 5023.07 units respectively.

Distribution Plot:

[Figure: distribution plot of overall purchases]

The distribution is asymmetric. So we have to draw samples of more than 30 data points each and plot the sampling distribution of their means to check whether it follows a normal distribution or not.

What are the assumptions for sample generation?

  1. Samples should be taken randomly.
  2. They should be independent of each other.
  3. When sampling without replacement, each sample should not exceed 10% of the whole dataset.
  4. The sample size should be sufficiently large (n > 30) when the original dataset is skewed or asymmetric.
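
The assumptions above can be sketched as a small sampling helper. This is an illustration rather than the code behind the analysis: `draw_sample_means` is a hypothetical name, the data is a synthetic skewed stand-in for the “Purchase” column (log-normal values, not the real dataset), and only the standard library is used. Each sample is drawn at random without replacement, so the helper enforces the 10% rule to keep the draws close to independent.

```python
import math
import random
import statistics


def draw_sample_means(data, sample_size, n_samples, rng):
    """Return the mean of each of n_samples random samples of sample_size points."""
    if sample_size <= 30:
        raise ValueError("use more than 30 points per sample for skewed data")
    if sample_size > 0.10 * len(data):
        raise ValueError("each sample should stay within 10% of the dataset")
    # rng.sample draws without replacement inside one sample;
    # separate calls do not depend on each other.
    return [statistics.fmean(rng.sample(data, sample_size)) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed for repeatability

# Hypothetical stand-in for the "Purchase" column: skewed, positive values.
purchases = [rng.lognormvariate(9.0, 0.5) for _ in range(50_000)]

mu = statistics.fmean(purchases)
sigma = statistics.pstdev(purchases)

means = draw_sample_means(purchases, sample_size=100, n_samples=500, rng=rng)
mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)
expected_se = sigma / math.sqrt(100)

print(f"population mean      = {mu:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
print(f"sd of sample means   = {sd_of_means:.2f}")
print(f"sigma / sqrt(n)      = {expected_se:.2f}")
```

Under these assumptions, the mean of the sample means should land near the population mean, and the sd of the sample means near σ/√n.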

Mean Distribution plots:

[Figure: distribution plots of the sample means]

As we can see, the more samples we take, the more closely the sampling distribution of the mean resembles a normal distribution.

Let’s calculate the mean and sd of each sampling distribution and check how close they are to the μ and σ of the overall purchase data.

[Table: all calculated values]

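As the number of samples increases, the mean of the sample means moves closer to the original mean, and their sd moves closer to the standard error σ/√n (so multiplying it by √n gives an estimate of the original sd). So our observations are consistent with CLT.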

Machine Learning Aspect:

How does CLT help in generalizing large datasets?

Machine Learning models generally treat the training data as a mix of deterministic and random parts. Let the dependent variable (Y) consist of these two parts. Models try to express the dependent variable (Y) as some function of several independent variables (X). If that function is a sum (or can be expressed as a sum of other functions) of many roughly independent terms, then by CLT Y should be approximately normally distributed.
Here ML models try to express the deterministic part as a sum of functions of the deterministic independent variables (X):

deterministic + random = func(deterministic(1)) +…+ func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then model_error captures only the random part and, being the combined effect of many small independent influences, should be approximately normally distributed (according to CLT).
So if the error distribution is close to normal, it suggests that the model has captured the deterministic part, and linear algorithms are a reasonable choice for the dataset. Otherwise, either some features that have a large influence on Y are missing from the model, or the model itself is incorrect.
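
One rough way to act on this is to fit a model and look at the shape of its residuals. The sketch below is an illustration under stated assumptions, not the author's method: the data is synthetic, the model is a one-feature least-squares line, and `standardized_moment`, `fit_line`, and `residual_shape` are hypothetical helpers. The random part is built as a sum of many small independent effects, which is the situation where CLT suggests near-normal errors. A second dataset hides an influential binary feature from the model to show how the residuals then drift away from normality.

```python
import random
import statistics


def standardized_moment(values, order):
    """Mean of ((v - mean) / sd) ** order over the values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** order for v in values)


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def residual_shape(x, y):
    """Return (skewness, excess kurtosis) of the residuals of a fitted line."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    skew = standardized_moment(residuals, 3)
    excess_kurtosis = standardized_moment(residuals, 4) - 3.0
    return skew, excess_kurtosis


rng = random.Random(7)  # fixed seed for repeatability


def noise():
    # Random part: the sum of 40 small independent effects, centered on 0.
    return sum(rng.uniform(-0.5, 0.5) for _ in range(40))


x = [rng.uniform(0.0, 10.0) for _ in range(5_000)]

# Case 1: X explains the whole deterministic part of Y.
y_full = [3.0 * xi + 2.0 + noise() for xi in x]

# Case 2: an influential binary feature exists but is left out of the model.
hidden = [rng.choice((0.0, 1.0)) for _ in x]
y_missing = [3.0 * xi + 2.0 + 8.0 * hi + noise() for xi, hi in zip(x, hidden)]

skew_full, kurt_full = residual_shape(x, y_full)
skew_missing, kurt_missing = residual_shape(x, y_missing)

print(f"all features:    skewness={skew_full:.3f}  excess kurtosis={kurt_full:.3f}")
print(f"missing feature: skewness={skew_missing:.3f}  excess kurtosis={kurt_missing:.3f}")
```

For a normal distribution both statistics are 0. In the first case they should be close to that; in the second, the omitted feature splits the residuals into two clusters, which shows up as clearly negative excess kurtosis.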

Statistical Inference:

Making statistical inferences about given data is what a Data Scientist or ML engineer does every day. This theorem gives us the ability to quantify the likelihood that our sample will deviate from the population without taking any new sample to compare it with. We don’t need the whole population's characteristics to understand the likelihood of our sample being representative of it.

So this means that if we don't know the actual population mean (μ), we can use the sample mean as an estimate of it. In the above case, if we take, for example, the 500 samples with 100 data points each, then 9262.26 units can be considered the mean of the original purchases.

No…

Although the sample mean is almost the same as the original mean (μ), a single-number estimate by itself (500 samples with 100 data points each) provides no information about the precision and reliability of the estimate with respect to the larger population.

Q. Then how can we decide the population mean, or in ML terms when we have a final trained model, how can we make an inference about how skillful the model is expected to be in practice?

The presentation of this uncertainty is called a confidence interval.
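
As a small preview, a normal-approximation confidence interval for a mean can be sketched directly from the standard error. The example below is a minimal sketch under stated assumptions: the sample is synthetic, `mean_confidence_interval` is a hypothetical helper, and the interval uses the CLT-based formula x̄ ± z·s/√n, which is reasonable only for fairly large samples.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return mean - z * se, mean + z * se


rng = random.Random(1)  # fixed seed for repeatability

# Assumed synthetic sample of 400 skewed, positive values.
sample = [rng.lognormvariate(9.0, 0.5) for _ in range(400)]

low, high = mean_confidence_interval(sample)
print(f"sample mean = {statistics.fmean(sample):.2f}")
print(f"95% CI      = ({low:.2f}, {high:.2f})")
```

A higher confidence level, such as `confidence=0.99`, gives a wider interval for the same sample.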

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data…

