Central Limit Theorem and Machine Learning | Part-1

Abhishek Barai · Published in Analytics Vidhya · 6 min read · Nov 22, 2020

Note: Here I will try to cover the idea of the Central Limit Theorem, its significance in statistical analysis, and how it is useful in Machine Learning. In case you haven’t checked it yet, please find the link to the normal distribution blog here.


Suppose we want to study the average age of the entire population of India. Since the population of India is very large, collecting everyone’s age data would be a tedious job, and the survey would take a lot of time. Instead, we can collect samples from different parts of India and try to make an inference. To work with samples, we need an approximation theory that simplifies the process of estimating the mean age. This is where the Central Limit Theorem comes into the picture. It is based on exactly such an approximation and has huge significance in the field of statistics. It uses the sampling distribution to generalize from the samples and to estimate the approximate mean, standard deviation, and other important parameters.

What is Central Limit Theorem?

The CLT states that if you have a population with mean μ and standard deviation σ, and you take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed.

This will hold whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population is normally distributed, then the theorem holds even for samples smaller than 30.

Note: The CLT is valid only when the samples are reasonably large. If we have very few data points, the samples have to be small, which is not an ideal case for applying the CLT.
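To see this in action, here is a minimal simulation sketch (assuming NumPy and Matplotlib are available; the exponential population and the sample sizes are illustrative choices, not from this article): it draws many samples from a heavily skewed population and shows that their means still pile up into a roughly bell-shaped curve.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# A heavily skewed "population": exponential with mean 10
population = rng.exponential(scale=10, size=100_000)

# Draw 1,000 samples of size 50 (with replacement) and record each sample's mean
sample_means = [rng.choice(population, size=50, replace=True).mean()
                for _ in range(1_000)]

# The population histogram is skewed; the sample means look bell-shaped
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(population, bins=50)
axes[0].set_title("Skewed population")
axes[1].hist(sample_means, bins=30)
axes[1].set_title("Distribution of sample means")
plt.show()
```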

What is a Sampling Distribution?

The distribution of a statistic (such as the mean) computed from many independently drawn samples of a large dataset with mean μ and standard deviation σ is called a sampling distribution. Basically, it’s the distribution plot of a sample statistic along with its associated parameters.

Formulation of CLT:

For random samples of size n drawn from a population in which “X” has finite mean μ and standard deviation σ, the CLT states that the standardized sample mean tends to a standard normal distribution:

Z = (X̄ − μ) / (σ / √n) → N(0, 1) as n → ∞

where the sample mean and the standard deviation of the sampling distribution are

X̄ = (1/n) · (X₁ + X₂ + … + Xₙ) and σ_X̄ = σ / √n

So the average of the sample means will approximate the population mean (μ), and the standard deviation of the sample means will be the standard error.

What is the standard error?

The standard error (SE) of a statistic is the standard deviation of its sampling distribution, or an estimate of that standard deviation. The sampling distribution of the sample mean is generated by repeatedly drawing samples and recording the mean of each. This forms a distribution of different means, and this distribution has its own mean and standard deviation.

Mathematically, the variance of the sampling distribution obtained is equal to the population's variance divided by the sample size. As the sample size increases, the sample means cluster more closely around the population mean. Therefore, the relationship between the standard error of the mean and the standard deviation is,

SE = σ / √n
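For example, using the Black Friday figures introduced in the next section: with σ ≈ 5023.07 units and a sample size of n = 100, the standard error comes out to SE ≈ 5023.07 / √100 ≈ 502.31 units.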

Data Analysis:

Here I have taken the Black Friday sales dataset for the analysis of CLT. The dataset consists of 550,068 data points.

It contains 12 columns. The “Purchase” column will be our feature for examining the CLT.
The μ and σ of overall purchases are 9263.97 and 5023.07 units, respectively.
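A minimal sketch of this loading step, assuming the dataset has been saved as train.csv (the filename is an assumption) and pandas is installed:

```python
import pandas as pd

# Load the Black Friday sales dataset (filename is an assumption)
df = pd.read_csv("train.csv")

print(df.shape)  # expected: (550068, 12)

# Overall parameters of the "Purchase" column
mu = df["Purchase"].mean()     # ~9263.97
sigma = df["Purchase"].std()   # ~5023.07
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```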

Distribution Plot:

distribution plot of overall purchases

The distribution is asymmetric. Here we have to draw many samples, each containing more than 30 data points, and plot the sampling distribution of their means to check whether it follows a normal distribution or not.
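A plot like the one above can be reproduced with a short sketch (assuming the df DataFrame from the loading step and seaborn installed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of the raw Purchase values: visibly asymmetric
sns.histplot(df["Purchase"], bins=50, kde=True)
plt.title("Distribution of overall purchases")
plt.show()
```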

What are the assumptions for sample generation?

  1. Samples should be taken randomly.
  2. Samples should be independent of each other.
  3. A sample shouldn’t exceed 10% of the whole dataset.
  4. The sample size should be sufficiently large (n > 30) when the original dataset is skewed or asymmetric.
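A minimal sketch that follows these four rules, reusing the df DataFrame from the loading step (the 500 samples × 100 points configuration mirrors the one used later in this article):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
purchase = df["Purchase"].to_numpy()

n_samples, sample_size = 500, 100  # 500 samples of 100 points each

# Each sample is drawn randomly and independently (with replacement),
# and 100 points is far below 10% of the 550,068 rows
sample_means = [rng.choice(purchase, size=sample_size).mean()
                for _ in range(n_samples)]

# The sampling distribution of the mean should look approximately normal
plt.hist(sample_means, bins=30)
plt.title(f"Sampling distribution of the mean ({n_samples} samples, n={sample_size})")
plt.show()
```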

Mean Distribution Plots:

As we can see, the greater the number of samples, the more closely the sampling distribution of the mean follows a normal distribution.

Let’s calculate the mean μ and standard deviation σ of each distribution and check how close they are to the μ and σ of the overall purchase data.

all calculated values
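A sketch of how such a table could be computed, reusing the purchase array from the previous sketch; averaging the per-sample standard deviations is my assumption about how the sd column above was obtained:

```python
import numpy as np

rng = np.random.default_rng(0)

def sampling_stats(data, n_samples, sample_size, rng):
    """Average of the sample means and of the per-sample standard deviations."""
    means, sds = [], []
    for _ in range(n_samples):
        s = rng.choice(data, size=sample_size)
        means.append(s.mean())
        sds.append(s.std(ddof=1))
    return np.mean(means), np.mean(sds)

# Compare against mu ~9263.97 and sigma ~5023.07 for growing numbers of samples
for n_samples in (50, 100, 500):
    m, sd = sampling_stats(purchase, n_samples, 100, rng)
    print(f"{n_samples} samples: mean of means = {m:.2f}, mean of sds = {sd:.2f}")
```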

As the number of samples increases, the average sample mean and standard deviation get closer to the original mean and standard deviation. So our approach and observations using the CLT are valid.

Machine Learning Aspect:

How does the CLT help in generalizing large datasets?

Machine Learning models generally treat training data as a mix of a deterministic part and a random part. Let the dependent variable (Y) consist of these two parts. Models aim to express the dependent variable (Y) as some function of several independent variables (X). If that function is a sum (or can be expressed as a sum of other functions) and the number of X variables is high, then Y should have an approximately normal distribution.
Here, ML models try to express the deterministic part as a sum of functions of the deterministic independent variables (X):

deterministic + random = func(deterministic(1)) +…+ func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then the model_error captures only the random part and should follow a normal distribution (according to the CLT).
So if the error distribution is normal, we may conclude that the model is successful, and we can apply linear algorithms to the dataset for better results. Otherwise, either some features with a large influence on Y are missing from the model, or the model itself is incorrect.
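A minimal sketch of this residual check on synthetic data (the data, model, and normality test are illustrative assumptions, not the article’s setup):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data: Y is a sum over many X's plus random noise
X = rng.normal(size=(2000, 40))
y = X.sum(axis=1) + rng.normal(scale=0.5, size=2000)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# D'Agostino-Pearson test: a large p-value means we cannot
# reject normality of the residuals
stat, p = stats.normaltest(residuals)
print(f"normality test p-value = {p:.3f}")
```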

Statistical Inference:

Making statistical inferences about given data is what a Data Scientist or ML engineer does every day. This theorem gives us the ability to quantify how much our sample is likely to deviate from the population without drawing any new sample to compare it with. We don’t need the whole population’s characteristics to understand how likely our sample is to be representative of it.

So does this mean that, if we don’t know the actual population mean (μ), we can simply take the sample mean as the actual mean (μ)? In the above case, if we take 500 samples with 100 data points each, can 9262.26 units be considered the true mean purchase?

No…

Though the sample mean is almost the same as the original mean (μ), a single-number estimate by itself (from 500 samples with 100 data points each) provides no information about the precision and reliability of the estimate with respect to the larger population.

Q. Then how can we decide on the population mean? Or, in ML terms, when we have a final trained model, how can we make an inference about how skillful it is expected to be in practice?

The presentation of this uncertainty is called a confidence interval.
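As a quick preview (a standard formula, not taken from this article): a 95% confidence interval for the mean is x̄ ± 1.96 · SE. For a single sample of n = 100 from the purchase data, SE ≈ 502.31 units, so the interval around x̄ = 9262.26 would be roughly 9262.26 ± 984.53, i.e. about (8277.73, 10246.79) units.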
