What is the Central Limit Theorem?
According to Google, the Central Limit Theorem (CLT) states that “under many conditions, independent random variables summed together will converge to a normal distribution as the number of variables increases.”
Essentially, the CLT says that the distribution of sample means will approximate a normal distribution as the sample size gets larger, regardless of the shape of the original distribution (provided that distribution has a finite variance).
Why do we need it?
Abbreviated answer: It allows us to use sample statistics to estimate population parameters, and it lets us treat the means of samples drawn from non-normal data as approximately normal. Given the properties of the normal distribution, we can build confidence intervals around the estimated population mean, and use these confidence intervals to make inferences about a sample in relation to the rest of the population.
In the real world, it’s very rare that we have access to the entire population we want to survey. Therefore, we do our best by taking samples and using these to estimate the corresponding measurements for the population. These estimates of population parameters are called point estimates and, if you’ll remember, according to the Central Limit Theorem they behave predictably: when each sample is large enough, the means of repeated samples form an approximately normal distribution.
It’s extremely important to note that even if our real population data is not normally distributed, the Central Limit Theorem tells us that our mean distribution (the distribution of mean values taken from samples) will be approximately normal once the samples are large enough.
This is important for two reasons. First, the sample means cluster around the population mean, so they give us a good estimate of it. Second, we can now start to construct confidence intervals surrounding our point estimates.
These confidence intervals give us a range with which we can describe various levels of certainty for our estimates. Ideally, these ranges will be narrow, indicating that the parameter is likely to be very close to our estimate.
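To make the idea of a confidence interval concrete, here is a minimal sketch of a normal-approximation interval for a mean. The helper name mean_confidence_interval and the made-up exponential data are assumptions for illustration only; they are not part of the walkthrough's notebook.

```python
import numpy as np
from scipy import stats as st


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean.

    Hypothetical helper, written only to illustrate the idea.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    sem = sample.std(ddof=1) / np.sqrt(n)
    # Two-sided critical value from the standard normal (about 1.96 for 95%)
    z = st.norm.ppf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


# Made-up, right-skewed data standing in for a real sample
rng = np.random.default_rng(0)
sample = rng.exponential(scale=10.0, size=500)

low, high = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```

A larger sample shrinks the standard error, which is what makes the interval narrower.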
We can also use this information to estimate the probability of samples taking on extreme values that deviate from the population mean. Essentially, we can find out whether a sample is statistically significantly different from the rest of the population, and we can know to start asking some more questions about that specific sample.
Let’s say we already know the mean and standard deviation of asthma rates in the US. If we take a sample from a specific city, let’s say Philadelphia, and find that the mean asthma rate of this sample is substantially higher than that of the overall population, we may want to ask some questions like “what’s the probability this was just caused by chance during sampling?” If this probability is sufficiently low, then we have further reason to believe that Philadelphia has higher rates of asthma and that its population is statistically different from the overall population.
The computation would look something like this…
We already have the population mean, and according to the CLT the averages of our sample values will take on an approximately normal distribution. After sampling Philadelphia, we would take the mean of that sample and compare it to the distribution of other sample means. It’s quite rare for a sample mean to fall farther than 2 or 3 standard deviations from the center of the distribution of means (roughly 5% and 0.3% of the time respectively, counting both tails). As such, samples that have means falling outside of this scope are worth further investigation.
Fig. 2 represents a normal distribution and the standard deviations from its mean. Roughly 95% of our data will lie within 2 standard deviations of the mean. Any data outside of this range is a small minority (less than 5% of the population falls outside of this range) and is usually worth a deeper look.
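As a rough sketch of that comparison, we can turn the sample mean into a z-score and ask how likely a value at least that large would be under the normal distribution of sample means. Every number below (pop_mean, pop_std, n, sample_mean) is invented purely to show the arithmetic; none of them are real asthma statistics.

```python
import numpy as np
from scipy import stats as st

# Hypothetical values, chosen only for illustration
pop_mean = 8.0      # assumed population mean
pop_std = 2.5       # assumed population standard deviation
n = 100             # assumed size of the Philadelphia sample
sample_mean = 8.7   # assumed mean of that sample

# Standard deviation of the distribution of sample means (the standard error)
standard_error = pop_std / np.sqrt(n)

# How many standard errors the sample mean sits above the population mean
z = (sample_mean - pop_mean) / standard_error

# One-sided probability of a sample mean at least this large arising by chance
p_value = st.norm.sf(z)

print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

With these made-up inputs the sample mean lands between 2 and 3 standard errors above the population mean, which is the kind of result that would prompt a closer look.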
Let’s try to understand the Central Limit Theorem even further by playing with some data. If you want to use the notebook I created for this walkthrough, you can find it here.
We’re going to need the following libraries:
The data we will be working with is a single column of 10,000 non-normally distributed integers. You can find the .csv on my GitHub here. Click “Raw” to view the raw contents of the file; this should just be a column of numbers. Then, navigate to the “File” tab in your browser and select “Save Page As.” You will be prompted to choose a file type, and “comma-separated-values” should already be selected. Select a location and click “Save.”
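Now that we have our data, let’s load it in and check that the distribution of our values is non-normal. We pass squeeze=True to load our .csv file as a Pandas Series object as opposed to a DataFrame. This only has an effect because our file has a single column; a file with multiple columns would still come back as a DataFrame. Note that the squeeze argument was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer versions you would call .squeeze("columns") on the loaded DataFrame instead.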
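Here is a minimal sketch of that loading step, written in the style that works on recent pandas versions. It reads a small in-memory stand-in instead of the real file so that it runs anywhere; the five values and the header=None setting (that is, the assumption that the file has no header row) are mine, not taken from the dataset.

```python
import io

import pandas as pd

# Stand-in for the downloaded file: one headerless column of integers.
# With the real file you would pass its path, e.g. "non_normal_dataset.csv".
csv_text = "3\n7\n7\n12\n48\n"

# header=None is an assumption that the file has no header row
df = pd.read_csv(io.StringIO(csv_text), header=None)

# squeeze("columns") turns a one-column DataFrame into a Series
data = df.squeeze("columns")

print(type(data).__name__, len(data))
```

If the real file does have a header row, dropping header=None lets pandas use that first row as the column name.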
An easy way to check the distribution of your data is to use Seaborn’s distplot function to return a histogram of your values and their counts. (In Seaborn 0.11 and later, distplot is deprecated in favor of histplot and displot.)
Data looks pretty non-normal to me.
Another way we can check for normality is by using scipy.stats.normaltest (remember we loaded in scipy.stats as st). The output of this function may not be entirely intuitive if you don’t understand hypothesis tests and p-values. In a hypothesis test, you have a null and an alternative hypothesis that you will be testing for. The result of your hypothesis test will be a p-value that indicates how likely it is to see results at least as extreme as the ones you are seeing, given that the null hypothesis is true. If the p-value is low enough (typically p < 0.05), then you can have some level of confidence in rejecting the null hypothesis. If p >= 0.05, we typically stick with the null hypothesis.
For the normaltest function, the null hypothesis is that our data is normally distributed. Given our returned p-value of 0.0, we reject the null hypothesis: our data is almost certainly not normally distributed.
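To see what the test returns, here is a small sketch that runs st.normaltest on made-up data. The skewed array is only a stand-in for the real dataset (it is generated here, not loaded from the .csv), and the normal array is included purely for comparison.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)

# Stand-in for the real data: 10,000 right-skewed integers (an assumption)
skewed = rng.exponential(scale=50.0, size=10_000).astype(int)

# Normally distributed values for comparison
normal = rng.normal(loc=50.0, scale=10.0, size=10_000)

# normaltest returns a test statistic and a p-value;
# the null hypothesis is that the data comes from a normal distribution
stat_skewed, p_skewed = st.normaltest(skewed)
stat_normal, p_normal = st.normaltest(normal)

print(f"skewed: statistic={stat_skewed:.1f}, p={p_skewed:.3g}")
print(f"normal: statistic={stat_normal:.1f}, p={p_normal:.3g}")
```

The statistic combines skewness and kurtosis, so strongly skewed data produces a very large statistic and a p-value that rounds to zero.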
Next, we’re going to write a very simple function that will take in a Pandas object called data (the syntax will be Pandas-specific) and an integer n, and return a sample of data with length n. We’ll call this
Let’s test our function and print the top 5 rows of our sample.
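A minimal sketch of such a function might look like the following. The name get_sample, the optional random_state argument, and the generated population are assumptions made for illustration; the original notebook may differ.

```python
import numpy as np
import pandas as pd


def get_sample(data, n, random_state=None):
    """Return n values drawn at random (without replacement) from a pandas object.

    Hypothetical name and signature, used here only for illustration.
    """
    return data.sample(n=n, random_state=random_state)


# Stand-in population: 10,000 right-skewed integers (not the real dataset)
population = pd.Series(
    np.random.default_rng(0).exponential(scale=50.0, size=10_000).astype(int)
)

test_sample = get_sample(population, 30, random_state=1)
print(test_sample.head())  # top 5 rows of the sample
```

Passing random_state makes the draw repeatable, which is handy when checking results by hand.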
Next, we’re going to write another extremely simple function called get_sample_mean() that will return the mean of a sample.
Let’s test this function, too, by taking a new sample and getting its mean.
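As a sketch, get_sample_mean() can be a thin wrapper around the Pandas mean method. The exact signature is an assumption (the text only says it returns the mean of a sample), and the tiny population below is made up so the result is easy to verify.

```python
import pandas as pd


def get_sample_mean(sample):
    """Return the arithmetic mean of a pandas sample (assumed signature)."""
    return sample.mean()


# Tiny made-up population so the result is easy to check by hand
population = pd.Series([2, 4, 4, 6, 9, 15, 30, 30, 41, 59])

# Draw a fresh sample of 4 values and take its mean
new_sample = population.sample(n=4, random_state=0)
sample_mean = get_sample_mean(new_sample)

print(new_sample.tolist(), sample_mean)
```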
Now that we have these two functions, let’s write a third function that will take in a Pandas object data, a number of samples to take dist_size, and a sample size n, and return a dist_size-length array containing the means of the samples taken.
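One way to sketch this third function is shown below. The name create_sample_distribution and the default values for dist_size and n are assumptions, and the sampling and averaging are written inline so the block runs on its own.

```python
import numpy as np
import pandas as pd


def create_sample_distribution(data, dist_size=100, n=30):
    """Return an array of dist_size sample means, each from a sample of size n.

    Hypothetical name and defaults, used here only for illustration.
    """
    # Draw a sample of n values and record its mean, dist_size times over
    means = [data.sample(n=n).mean() for _ in range(dist_size)]
    return np.array(means)


# Stand-in population: 10,000 right-skewed integers (not the real dataset)
population = pd.Series(
    np.random.default_rng(0).exponential(scale=50.0, size=10_000).astype(int)
)

sample_means = create_sample_distribution(population, dist_size=200, n=30)
print(sample_means[:5])
```

Each call to data.sample draws without replacement within a single sample, but the same value can appear in different samples.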
According to the Central Limit Theorem, values generated by this function should more and more resemble a normal distribution as n increases. Let’s check this empirically.
First, we’re going to draw 10 samples of 3 data-points each and visualize the results. Then we’ll do the same for larger sample sizes and numbers of samples, and compare the results.
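If you would rather check the trend numerically than by eye, here is a sketch that measures the skewness and spread of the sample means for a few sample sizes. It departs from the walkthrough in a few ways that are my assumptions: it uses a generated stand-in population, samples with replacement through NumPy for speed, and keeps the number of samples fixed at 5,000 so the statistics are stable.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)

# Stand-in population: 10,000 right-skewed integers (not the real dataset)
population = rng.exponential(scale=50.0, size=10_000).astype(int)


def sample_means(values, dist_size, n, rng):
    """Means of dist_size samples of n values each, drawn with replacement."""
    draws = rng.choice(values, size=(dist_size, n), replace=True)
    return draws.mean(axis=1)


results = {}
for n in (3, 30, 300):
    means = sample_means(population, dist_size=5_000, n=n, rng=rng)
    # Skewness near 0 suggests symmetry; spread should shrink like 1 / sqrt(n)
    results[n] = {"skew": st.skew(means), "spread": means.std(ddof=1)}
    print(f"n={n:>3}: skew={results[n]['skew']:.3f}, spread={results[n]['spread']:.2f}")
```

As n grows, the skewness of the sample means should drift toward zero and their spread should shrink, which is the behavior the Central Limit Theorem describes.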
Sure enough, it’s not pretty, but we’ve essentially just demonstrated the Central Limit Theorem! By randomly sampling our population (non_normal_dataset.csv) and increasing the size and number of samples, our results started to converge on a normal distribution.
Today we learned how to apply the Central Limit Theorem in practice. We learned how to determine whether a dataset is normally distributed, and from there, we wrote functions to generate samples and sample means. Finally, we used these functions to generate an approximately normal distribution of sample means from non-normal data, illustrate the Central Limit Theorem, and learn more about non-normally distributed datasets.