Practical Sampling Distribution and Central Limit Theorem for Data Engineers

Bharat Dadwaria · Published in TheCyPhy · 8 min read · May 13, 2020

In Statistical Data Analysis or Exploratory Data Analysis, we always have to examine the distribution of the data. When working with very big datasets, we usually run our analysis on a sample rather than the whole population, which introduces some variation between the sample analysis and the population analysis. This approximation is what leads us to the Sampling Distribution.

Statistical inference techniques are based on the sampling distribution of a statistic.

Usually, in data analysis or sampling analysis, we face the same type of problem when handling a big dataset: what do we do, and how do we infer and analyze statistics from such a large dataset? One possible solution is to use a sampling technique such as random sampling: draw a random sample and calculate the statistic for that sample. Based on that one sample, anyone can come along and say, "here is the statistic, folks." That's it!

But wait: what if that sample contains an outlier that distorts the whole sample? Then we would have to redo our analysis. What about taking two random samples and finding their individual statistics? The outlier issue still remains. So the simple approach is to take k random samples of size n and evaluate the statistic for each of them. After that, we have a small dataset containing a different statistic for each sample.

The statistic will vary from sample to sample. To work out the nature of the whole population dataset based on sampling, let's plot that dataset of sample statistics.

Generated 50 samples
10 out of 50 Sample_Means
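The original snippets are embedded images; as a minimal sketch of the idea (the exponential population, the seed, and the sizes k = 50 and n = 100 are assumptions, not taken from the original), the sampling step could look like this:

```python
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # stand-in population

k, n = 50, 100  # 50 random samples of size 100 each
sample_means = [rng.choice(population, size=n).mean() for _ in range(k)]

print(sample_means[:10])  # inspect 10 of the 50 sample means
```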

The distribution formed by this dataset of sample statistics is known as the Sampling Distribution. But wait: the sampling distribution of the sample mean above follows a Normal Distribution. Is that due to the population dataset's distribution, or something else? This is where the Central Limit Theorem comes into the picture. It is one of the most important theorems that every data scientist should know.

Central Limit Theorem (CLT):

Central Limit Theorem: the sampling distribution of the sample mean converges to a Normal Distribution as the sample size increases, and this holds for any underlying distribution.

In simple words: take any dataset following any distribution, draw some samples from it, evaluate the sample mean of each sample, and plot those sample means. You will find that the sampling distribution (the distribution of the sample statistic) starts following a Normal Distribution as the sample size increases. Isn't that the magic of data? Well, I don't believe in magic or any superpower, so why don't you try it yourself and figure out the reason; consider it an exercise!

Central Limit Theorem
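If you want to try the exercise, here is a hedged sketch (the uniform population and the sample sizes are my choices, not the author's):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.uniform(0, 1, size=100_000)  # decidedly non-normal population

# As the sample size n grows, the histogram of the sample means looks
# increasingly normal, regardless of the population's shape.
for n in (2, 10, 50):
    means = rng.choice(population, size=(2_000, n)).mean(axis=1)
    plt.hist(means, bins=40, density=True, alpha=0.5, label=f"n = {n}")

plt.legend()
plt.title("Sampling distribution of the sample mean")
plt.show()
```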

Now a question arises: using the sampling distribution, we try to infer population statistics from sample data, but at what significance level do we recover the exact value? Since the population is effectively infinite, we can never find the exact values of the statistics (the mean and variance) of the population, but based on the sampling distribution we can approximate them with values that lie close to the population mean.

As we have already seen, the sampling distribution of the sample mean follows the Normal Distribution as the sample size increases. But there are several other variations of sampling distributions, such as the Z-Distribution, Chi-Square Distribution, T-Distribution, and F-Distribution. Let's visualize each of them.

Z-Distribution

The Z-distribution, also known as the Standard Normal Distribution, is the normal distribution with mean 0 and variance 1. Here we generate some random samples with loc=0 (the mean) and scale=1 (the standard deviation; with a value of 1 the variance is also 1) and then plot that data.

Code to Visualize the Z distribution
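The snippet itself is an image in the original; an equivalent sketch (sample count and plotting choices assumed) might be:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
z_samples = rng.normal(loc=0, scale=1, size=10_000)  # mean 0, std dev 1

plt.hist(z_samples, bins=60, density=True)
plt.title("Z-distribution (standard normal)")
plt.xlabel("z")
plt.show()
```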

From the following plot, we can see that the Z-distribution is simply the standardized form of the normal distribution, which is why it is also known as the standard normal distribution.

The Z-distribution plot
Z score and probability

For any sample mean in the sampling distribution, the Z-score measures how many standard deviations that point lies from the population mean. The density function of the Z-distribution is defined as follows, where x and z denote the same quantity, the z-score.
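The formulas referenced here appear as images in the original; reconstructed from the surrounding definitions, the z-score and the standard normal density and cumulative distribution function are (for a sample mean, σ is replaced by the standard error σ/√n):

```latex
z = \frac{x - \mu}{\sigma}, \qquad
\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2 / 2}, \qquad
\Phi(z) = \int_{-\infty}^{z} \varphi(t)\, dt = P(Z \le z)
```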

Note that z = x here; the second expression is the cumulative distribution function evaluated at z = x
Code snippet for evaluating P(z<Z)
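A sketch of that snippet (the example z value is an assumption), using scipy's standard normal CDF:

```python
from scipy import stats

z = 1.5
p = stats.norm.cdf(z)           # P(Z < 1.5) under the standard normal
print(f"P(Z < {z}) = {p:.4f}")  # ≈ 0.9332
```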

Chi-Square Distribution:

The Chi-Square Distribution is a continuous probability distribution that is widely used in statistical inference. It is related to the standard normal distribution: if a random variable Z has the standard normal distribution, then Z² has the Chi-Square distribution with 1 degree of freedom.

Degrees of Freedom refers to the maximum number of logically independent values, which are values that have the freedom to vary, in the data sample.

Code snippet for chi-square distribution visualization
Chi-Square Distribution plot for various degrees of freedom
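The visualization snippet is an image in the original; a sketch using scipy.stats.chi2 (the degrees of freedom shown are assumptions) could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.01, 15, 500)  # avoid x = 0, where the df=1 PDF diverges
for df in (1, 2, 3, 5, 9):      # various degrees of freedom
    plt.plot(x, stats.chi2.pdf(x, df), label=f"df = {df}")

plt.legend()
plt.title("Chi-Square distribution for various degrees of freedom")
plt.show()
```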

Suppose we have five z-score values, z₁ through z₅. Then the chi-square score is evaluated as the sum of their squares, (z₁² + z₂² + z₃² + z₄² + z₅²), with 5 degrees of freedom, as follows:

Chi-square score
PDF of Chi-Square Distribution
PDF of the Chi-Square distribution for a given value and degrees of freedom
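A sketch of that computation (the five z-scores are made-up values for illustration):

```python
import numpy as np
from scipy import stats

z_scores = np.array([0.5, -1.2, 0.3, 2.1, -0.7])  # hypothetical z-scores
chi_square = np.sum(z_scores ** 2)                 # chi-square score, df = 5

density = stats.chi2.pdf(chi_square, df=5)         # PDF value at that score
print(chi_square, density)
```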

T-Distribution:

When the sample size is very small, the standard normal distribution cannot faithfully describe the nature of the population's statistics. The T-Distribution is used when the sample size is small and the population standard deviation is not known to us. The T-Distribution is centered at the same point as the standard normal distribution, but it is a little flatter, with heavier tails.

Z distribution vs t-distribution

The t-score is computed just like the z-score from the standard normal distribution, but we evaluate a t-score when the sample size is very small (roughly n < 30) and the population standard deviation is not known to us. In that condition, we can use the sample standard deviation s in place of the population standard deviation, so the standard error in the denominator is built from s: t = (x̄ − μ)/(s/√n).

Code snippet for t-Score
T-Distribution PDF
T-Distribution PDF, where the t-score t and the degrees of freedom are the parameters
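A sketch of the t-score and PDF evaluation (the sample data and the hypothesized mean are illustrative assumptions):

```python
import numpy as np
from scipy import stats

sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])  # hypothetical small sample
mu_0 = 5.0                                          # hypothesized population mean

n = len(sample)
t_score = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

density = stats.t.pdf(t_score, df=n - 1)            # t-distribution PDF at t_score
print(t_score, density)
```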

F-Distribution:

The F-Distribution is related to the Chi-Square Distribution: the ratio of two independent chi-square random variables, each divided by its degrees of freedom, follows the F-Distribution, and this ratio is called the f-score. Whenever we deal with a ratio of variances, the F-distribution may arise.

F-Score
Code snippet of F-Distribution at different degrees of freedom
A plot of F-distribution at different degrees of freedom
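The plotting snippet is an image in the original; a sketch using scipy.stats.f (the degree-of-freedom pairs are assumptions) could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.01, 5, 500)
for dof1, dof2 in [(1, 1), (5, 2), (10, 10), (100, 100)]:
    plt.plot(x, stats.f.pdf(x, dof1, dof2), label=f"({dof1}, {dof2})")

plt.legend(title="(dof1, dof2)")
plt.title("F-distribution at different degrees of freedom")
plt.show()
```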

The mean and median of the F-distribution depend entirely on the degrees of freedom. As the plot above suggests, the mean of the distribution is d₂/(d₂ − 2), where d₂ is the second degree of freedom (defined for d₂ > 2), and when the two degrees of freedom are equal the median is exactly 1.

Code snippet for F-Score
PDF of F-Distribution
F-Distribution PDF code, where f is the f-score and dof1 and dof2 are the degrees of freedom of u1 and u2
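A sketch of the f-score and PDF computation (u1, u2 and their degrees of freedom are hypothetical values):

```python
from scipy import stats

u1, u2 = 8.2, 4.5  # hypothetical independent chi-square variables
dof1, dof2 = 5, 3  # their degrees of freedom

f_score = (u1 / dof1) / (u2 / dof2)         # ratio of scaled chi-squares
density = stats.f.pdf(f_score, dof1, dof2)  # F-distribution PDF at f_score
print(f_score, density)
```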

All of these sampling distributions help a statistical engineer or data scientist infer the statistics of a very large population, with a small margin of error, from the sampling distribution. This is why every data scientist or statistical engineer should know about the sampling distribution and the Central Limit Theorem. The table below describes each sampling distribution and its parameters.

https://www.youtube.com/watch?v=RrkkLKGVyBk&list=PLbMVogVj5nJRt-ZxRG1KRjxNoy7J_IaW2&index=5

Applications:

In machine learning, when we talk about model evaluation, we usually split the whole dataset into train, test, and validation parts. The test part should generalize the whole population well, because based on that sample we evaluate the model and estimate how well its accuracy generalizes.

Another major application of the sampling distribution is Bootstrapping, a resampling technique used in machine learning to estimate the statistics of the population. Bootstrapping also underlies ensemble techniques such as bagging.

Bootstrapping can be used to estimate summary statistics such as the mean or standard deviation. It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.
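As a minimal sketch (the data and the number of resamples are assumptions), bootstrapping the mean looks like:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=200)  # the observed sample (assumed)

# Resample with replacement many times and collect the statistic of interest.
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(5_000)]

# Bootstrap estimate of the mean and its standard error.
print(np.mean(boot_means), np.std(boot_means))
```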

Random Forest Bootstrapping. Source: https://medium.com/greyatom/a-trip-to-random-forest-5c30d8250d6a


I'm a computer vision research engineer, exploring the intersection of computer vision and robotic vision. https://bharatdadwaria.github.io