Understanding the Central Limit Theorem
The Central Limit Theorem (CLT) is one of the foundational concepts in statistics. In this post we will work through it with a practical example.
The code for this example is available at https://github.com/chidamnat/practicalDS/blob/master/mastery/central_limit_theorem.ipynb
Let us say that we are interested in finding the mean housing price in the province of Ontario.

Drilling down to understand the population distribution

The population appears extremely right-skewed and quite different from a textbook normal distribution. Suppose we want to learn the mean price across the housing population. We obviously cannot collect data on the entire population, as that would be too costly in most cases. This is where the CLT comes to our rescue.
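Since the raw housing data is not included in this post, a log-normal distribution makes a reasonable stand-in: like real housing prices, it is strictly positive and strongly right-skewed. This is purely an illustrative assumption, not the dataset from the linked notebook:

```python
import numpy as np

# Hypothetical stand-in for the housing-price population.
# A log-normal distribution is strictly positive and right-skewed,
# much like real housing prices. Parameters are chosen arbitrarily.
rng = np.random.default_rng(42)
population = rng.lognormal(mean=13.0, sigma=0.5, size=1_000_000)

# In a right-skewed distribution the mean sits above the median,
# pulled up by the long tail of expensive houses.
print(f"population mean:   {population.mean():,.0f}")
print(f"population median: {np.median(population):,.0f}")
```

The gap between mean and median is a quick numerical check of the skew that the histogram shows visually.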
The Central Limit Theorem states that, regardless of the underlying population distribution, the probability distribution of the sum (or mean) of large samples drawn from the population tends to be normally distributed.
Thus we can estimate the population mean without observing the complete population, simply by constructing a distribution of means from a large number of samples drawn from it. This distribution approximates the population mean very well.
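A minimal sketch of this idea, again using an assumed log-normal population in place of the actual housing data: draw many samples, record each sample's mean, and compare the center of that distribution to the true population mean.

```python
import numpy as np

# Assumed stand-in for the skewed housing-price population.
rng = np.random.default_rng(0)
population = rng.lognormal(mean=13.0, sigma=0.5, size=1_000_000)

n = 50              # size of each sample
num_samples = 2000  # number of samples drawn

# Draw num_samples samples of size n (with replacement) and take
# the mean of each one, giving the sampling distribution of the mean.
sample_means = rng.choice(population, size=(num_samples, n)).mean(axis=1)

print(f"population mean:      {population.mean():,.0f}")
print(f"mean of sample means: {sample_means.mean():,.0f}")
```

Even though the population itself is far from normal, the distribution of `sample_means` is approximately normal and centered very close to the population mean.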
How closely does it represent the population mean? This is where the standard error, i.e. the standard deviation of the sampling distribution, comes into the picture. The larger the sample size, the lower the standard error. However, the improvement in the estimate is not linear; it is given by the formula below:
standard error (SD of the sample means) = population SD / sqrt(n)
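We can verify this formula empirically. The sketch below (still using the assumed log-normal stand-in population) compares the theoretical standard error, population SD divided by sqrt(n), against the standard deviation actually observed across many sample means:

```python
import numpy as np

# Assumed stand-in for the skewed housing-price population.
rng = np.random.default_rng(1)
population = rng.lognormal(mean=13.0, sigma=0.5, size=1_000_000)

n = 100  # size of each sample

# Sampling distribution of the mean: 3,000 samples of size n.
sample_means = rng.choice(population, size=(3000, n)).mean(axis=1)

theoretical_se = population.std() / np.sqrt(n)
empirical_se = sample_means.std()

print(f"theoretical SE: {theoretical_se:,.0f}")
print(f"empirical SE:   {empirical_se:,.0f}")
```

The two values land close to each other, which is exactly what the formula predicts.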
The code below creates a distribution of sample means for increasing sample sizes. It approximates the population mean closely, and increasing the sample size used in each experiment reduces the sampling error (the SD of the distribution of sample means).
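A sketch of that experiment, assuming the same illustrative log-normal population rather than the notebook's actual data: for each sample size n, repeat the draw-and-average experiment 1,000 times and report the spread of the resulting means.

```python
import numpy as np

# Assumed stand-in for the skewed housing-price population.
rng = np.random.default_rng(2)
population = rng.lognormal(mean=13.0, sigma=0.5, size=1_000_000)

ses = {}
for n in (10, 50, 200, 1000):
    # 1,000 experiments, each drawing a sample of size n and
    # recording its mean.
    sample_means = rng.choice(population, size=(1000, n)).mean(axis=1)
    ses[n] = sample_means.std()
    print(f"n={n:5d}  mean of sample means={sample_means.mean():,.0f}"
          f"  SE={ses[n]:,.0f}")
```

The printed SE shrinks as n grows, but by a factor of sqrt(n) rather than n: quadrupling the sample size only halves the standard error, matching the formula above.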

