Why Most Articles on the Central Limit Theorem Are Misleading
Most articles claim that the sample means of random variables drawn from any distribution tend to a Gaussian, but this is incorrect.
Recently I have come across many articles on Medium claiming that the central limit theorem is very important for data scientists to know, and claiming to teach or exemplify the theorem, but doing so incorrectly.
Demonstration
The statements usually go something like this:
the distribution of sample means of random variables - drawn not necessarily from a Gaussian - tends to a Gaussian when the sample size is sufficiently large
This is simply incorrect. Before explaining why, let me show you some code that demonstrates how the statement above is violated:
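import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline

n = 1000           # number of sample means per histogram
n_samples = 10000  # draws averaged into each sample mean

_, axs = plt.subplots(3, 2, figsize=(15, 15))
for i, df in enumerate((0.9, 1.5, 1.9, 2.1, 2.5, 3.0)):
    ax = axs[i // 2, i % 2]
    # Each row holds n_samples draws; averaging over axis 1 gives n sample means.
    data = np.random.standard_t(df=df, size=(n, n_samples)).mean(1)
    # Theoretical moments of the sample mean; NaN where undefined or infinite.
    mean = 0 if df > 1 else np.nan
    std = np.sqrt(df / (df - 2) / n_samples) if df > 2 else np.nan
    sns.histplot(data, ax=ax)
    r = stats.normaltest(data)  # normality test based on skewness and kurtosis
    # Title format: sample value / theoretical value.
    ax.set_title('df={df}, '
                 'mean={sample_mean:.2f}/{mean:.2f}, '
                 'std={sample_std:.2f}/{std:.2f}, '
                 'pvalue={pvalue:.2f}'.format(df=df,
                                              mean=mean,
                                              std=std,
                                              pvalue=r.pvalue,
                                              sample_mean=data.mean(),
                                              sample_std=data.std()))
plt.tight_layout()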
In this code I take the sample mean of 10000 draws from a Student’s t-distribution. This quantity is itself a random variable that, as per the claim above, ought to be normally distributed. I repeat this 1000 times and plot the histogram. This is done for Student’s t-distributions with 0.9, 1.5, 1.9, 2.1, 2.5 and 3.0 degrees of freedom.
The title of each plot shows the degrees of freedom, the sample/theoretical mean, the sample/theoretical standard deviation and the p-value of a normality test. The lower the p-value, the stronger the evidence against the data having been drawn from a Gaussian.
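To get a feel for how this p-value behaves, here is a minimal sketch that runs the same `stats.normaltest` on data that is Gaussian by construction and on heavy-tailed data. The seed, the sample size and the names `gaussian` and `heavy` are my own choices for illustration, not part of the experiment above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

gaussian = rng.standard_normal(1000)        # Gaussian by construction
heavy = rng.standard_t(df=1.0, size=1000)   # Student's t with df=1 (heavy tails)

# normaltest combines skewness and kurtosis into a single test of normality.
p_gaussian = stats.normaltest(gaussian).pvalue
p_heavy = stats.normaltest(heavy).pvalue

print(f"Gaussian data:     p = {p_gaussian:.3g}")
print(f"heavy-tailed data: p = {p_heavy:.3g}")
```

The heavy-tailed p-value should come out vanishingly small, while the Gaussian one can land anywhere between 0 and 1, since under the null hypothesis p-values are roughly uniformly distributed.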
The plot for 0.9 degrees of freedom has a sample standard deviation of 125 but has samples at 30 times that. It's clearly not normally distributed. Similarly, the plots for 1.5 and 1.9 degrees of freedom show samples at about 10 times the standard deviation and are clearly not Gaussian. The p-values lead us to the same conclusion (and are based on tests using more rigorous versions of the same argument). At 2.1 degrees of freedom, eyeballing does not rule out the data being drawn from a Gaussian, but the p-value is still too low.
Thus, we see that for degrees of freedom less than 2 the data is extremely unlikely to have been drawn from a Gaussian. This demonstrates that the way most articles present the CLT is incorrect.
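The eyeballing argument (samples sitting many standard deviations away from the mean) can be turned into a number. The sketch below is my own rough check, not the test used above: the helper `max_abs_z` is a hypothetical name, and I use a sample size of 1000 instead of 10000 to keep memory use low.

```python
import numpy as np

def max_abs_z(x):
    """Largest distance from the sample mean, in sample standard deviations."""
    x = np.asarray(x, dtype=float)
    return np.max(np.abs(x - x.mean())) / x.std()

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n, n_samples = 1000, 1000       # 1000 sample means, each over 1000 draws

# Reference value for data that really is Gaussian.
baseline = max_abs_z(rng.standard_normal(n))
print(f"Gaussian baseline: {baseline:.1f}")

for df in (0.9, 1.5, 1.9, 2.1, 2.5, 3.0):
    means = rng.standard_t(df=df, size=(n, n_samples)).mean(axis=1)
    print(f"df={df}: {max_abs_z(means):.1f}")
```

For 1000 truly Gaussian values the largest one typically sits a bit over 3 standard deviations from the mean. With 1000 values the ratio cannot exceed roughly the square root of 1000, about 32, so a value near 30 means that a single sample mean dominates the whole histogram.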
Theory
The Student’s t-distribution does not have a mean for degrees of freedom less than or equal to 1 and does not have a finite variance for degrees of freedom less than or equal to 2.
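Since the sample mean of n IID variables, each with mean $\mu$ and variance $\sigma^2$, has mean and variance

$$\mathbb{E}[\bar{X}_n] = \mu, \qquad \operatorname{Var}[\bar{X}_n] = \frac{\sigma^2}{n},$$

it follows that these quantities are not defined when the degrees of freedom are less than or equal to 1 and 2 respectively. The classical CLT needs a finite variance to scale by, so it gives no Gaussian limit for the sample mean when the degrees of freedom are less than or equal to 2.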
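As a small sketch of how this maps onto the numbers in the plot titles: a Student’s t-distribution with more than 2 degrees of freedom has variance df/(df-2), so the mean of n IID draws has standard deviation sqrt(df/(df-2)/n). The function name `theoretical_std_of_mean` is mine, and returning NaN whenever the variance is not finite is a convention I chose to match the demonstration code.

```python
import math

def theoretical_std_of_mean(df, sample_size):
    """Std of the mean of `sample_size` IID Student's t draws.

    Returns NaN when df <= 2, where the variance is not finite.
    """
    if df <= 2:
        return math.nan
    return math.sqrt(df / (df - 2) / sample_size)

for df in (0.9, 1.5, 1.9, 2.1, 2.5, 3.0):
    print(df, theoretical_std_of_mean(df, 10000))
```

Note how quickly the value grows as the degrees of freedom approach 2 from above: the variance df/(df-2) blows up, which is the first hint that the Gaussian approximation is about to break down.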
It turns out that what most articles claim to be the Central Limit Theorem is in fact true only for a restricted class of distributions (those with finite first and second moments). If you want to learn more about it, read the Wikipedia page on stable distributions and the generalized central limit theorem.
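To make the idea of a stable distribution a little more concrete, here is a sketch of my own. With 1 degree of freedom the Student’s t-distribution is the standard Cauchy distribution, and the mean of n IID standard Cauchy variables is again standard Cauchy. Averaging therefore does not narrow the distribution at all. I measure spread with the interquartile range (IQR), because it exists even when the variance does not; the helper `iqr`, the seed and the sample sizes are arbitrary choices.

```python
import numpy as np

def iqr(x):
    """Interquartile range: a spread measure that exists even without a variance."""
    q25, q75 = np.percentile(x, [25, 75])
    return q75 - q25

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_means = 2000                  # number of sample means per setting

for sample_size in (1, 10, 100, 1000):
    # Student's t with df=1 is the standard Cauchy distribution.
    cauchy_means = rng.standard_t(df=1.0, size=(n_means, sample_size)).mean(axis=1)
    normal_means = rng.standard_normal((n_means, sample_size)).mean(axis=1)
    print(f"sample size {sample_size:>4}: "
          f"IQR Cauchy = {iqr(cauchy_means):.2f}, "
          f"IQR normal = {iqr(normal_means):.3f}")
```

The Cauchy IQR should stay close to 2 (the quartiles of a standard Cauchy are -1 and 1) for every sample size, while the normal IQR shrinks like one over the square root of the sample size.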