Testing for Normality — Applications with Python

So you have a dataset and you’re about to run some test on it but first, you need to check for normality.

Think about this question, “Given my data … if there is a deviation from normality, will there be a material impact my results?”

In doing so are three tests you might want to consider:

  1. The Shapiro-Wilk test;
  2. The Anderson-Darling test, and;
  3. The Kolmogorov-Smirnov test.

As well, there are some visual measures to be implemented:

  1. Box Plots
  2. QQ Plots

So why do all these tests in the first place? I’m going to quote directly from this post Cross Validated, which breaks down how to think about normality tests.

It is a (a bit strongly stated) fact that formal normality tests always reject on the huge sample sizes we work with today. It’s even easy to prove that when n gets large, even the smallest deviation from perfect normality will lead to a significant result. And as every dataset has some degree of randomness, no single dataset will be a perfectly normally distributed sample. But in applied statistics the question is not whether the data/residuals … are perfectly normal, but normal enough for the assumptions to hold.


The Shapiro-Wilk tests if a random sample came from a normal distribution. The null hypothesis of the test is the data is normally distributed. If the p value returned is less than.05, then the null hypothesis is rejected and there is evidence that the data is not from a normally distributed population.

However, It is completely possible that for p > 0.05 and the data does not come from a normal population. Failure to reject could be from the sample size being too small to detect the non-normality. So, keep this in mind when interpreting the results.

As well the stats.shapiro method will throw a warning on samples for size > 5000. So, for data sets larger than that you might want to choose another test for normality.

As we can see from the examples below, we have random samples from a normal random variable where n = [10, 50, 100, 1000] and the Shapiro-Wilk test has rejected normality for x_50. Therefore, we might have to use some additional measure to see if the null hypothesis for x_50 should indeed be rejected.

# Create the random variables with mean 5, and sd 3
x_10 = stats.norm.rvs(loc=5, scale=3, size=10)
x_50 = stats.norm.rvs(loc=5, scale=3, size=50)
x_100 = stats.norm.rvs(loc=5, scale=3, size=100)
x_1000 = stats.norm.rvs(loc=5, scale=3, size=1000)
# Print the p values
print stats.shapiro(x_10)
print stats.shapiro(x_50)
print stats.shapiro(x_100)
print stats.shapiro(x_1000)
# (The test statistic, the p-value) for x_N above
(0.9122101664543152, 0.2965205907821655) # Null Accepted
(0.9418673515319824, 0.015981076285243034) # Null Rejected
(0.9918511509895325, 0.810341477394104) # Null Accepted
(0.9987027645111084, 0.6903988718986511) Null Accepted

Let’s look at the first visualization method, the box-plot, to see if we can back up the results from this test. These plots should not be used solely to detect normality, but instead, to look for symmetry, skewness, and outliers to give an indications of non-normality. Generally, box-plot graphs are not very informative to understand normality, and have been included for completeness of the article.

While we do see more outliers x_50 and x_1000 on all the samples, there are no clear indications of non-normality for each graph.

What about looking at the QQ Plot?

The formal name is the quantile-quantile (QQ) plot, and it determines if two different data sets — the one you provide and a normally distributed set — come from a population with common distributions.

A quantile is the percent of points below the given value. If the two distributions being compared are from a common distribution, the points in from QQ plot the points in the plot will approximately lie on a line, but not necessarily on the line y = x. Because the samples have a mean of 4 and a sd of 3, the parameter line='s' was used to for our plot. The s is a standardized line for the expected order of the statistics, scaled by the standard deviation of our samples with the mean added. Basically, it puts the line where it needs to be to compare our data with what would be expected from a normal distribution.

import statsmodels.api as sm
import pylab
sm.qqplot(x_10, loc = 4, scale = 3, line='s')
sm.qqplot(x_50, loc = 4, scale = 3, line='s')
sm.qqplot(x_100, loc = 4, scale = 3, line='s')
sm.qqplot(x_1000, loc = 4, scale = 3, line='s')
QQ Plot for X_10 (Left); QQ Plot for X_50 (Right)

The QQ plot is a much better visualization of our data, providing us with more certainty about the normality. From QQ plot for x_50 we can be more assured our data is normal, rather than just relying on the p value from the Shapiro-Wilk test or the box-plot.

Just to see how a QQ plot would look on data which is not normally distributed, let’s generate a plot for a bimodal distribution.

# Create bimodal dataset
N = 10000
x_bimodal = np.concatenate((np.random.normal(-1, 1, int(0.1 * N)),
np.random.normal(5, 1, int(0.4 * N))))[:, np.newaxis]
# Now Graph
res = plt.hist(x_bimodal, bins=100)
sm.qqplot(x_bimodal[:,0], line='s')
Bimodal Histogram (Left); QQ-Plot for the same data (Right)

We can clearly see from the QQ plot, we can reject any notion x_bimodal was collected from a normal sample. As a reference, here is a visualization of QQ plots with different types of data

QQ plots drawn from different types of data which are not quite normal

The Shapiro-Wilk test is popular to determine normality, and usually performs very well, but it’s not universally best. You must be aware of the limitations spelled about above, and use additional methods to verify your results.


The Kolmogorov–Smirnov tests if a sample distribution fits a cumulative distribution function (CDF) of are referenced distribution. Or, if the CDF between of two different samples fit each other. These are not necessarily just for normal distributions, but we’ll use it in our example. Essentially, we are testing the sample data against another sample, to compare their distributions for similarities.

Similar to Shapiro-Wilk, our null hypothesis for our sample is the distribution is identical to the other distribution we’re testing it against. If p < .05 we can reject the null, and conclude our sample distribution is not identical to a normal distribution.

# Perform test KS test against a normal distribution with
# mean = 5 and sd = 3
print stats.kstest(x_10, 'norm', args=(5, 3))
print stats.kstest(x_50, 'norm', args=(5, 3))
print stats.kstest(x_100, 'norm', args=(5, 3))
print stats.kstest(x_1000, 'norm', args=(5, 3))
KstestResult(statistic=0.3799548575756928, pvalue=0.08314016853183959) # Null Accepted
KstestResult(statistic=0.08747802305302399, pvalue=0.8387908580519111) # Null Accepted
KstestResult(statistic=0.061047174276107175, pvalue=0.8501205705664947) # Null Accepted
KstestResult(statistic=0.0185742755283389, pvalue=0.8805510946984335) # Null Accepted

Looks like our results are telling not to reject the null hypothesis for all x_N, however x_50 is right on the fence. It’s probably because x_50 does not have enough data to accurately judge if it from the same distribution. Let’s see how it would look graphically, when comparing the CDFs of the sample with a sampled normal variable with the same mean and standard deviation.

Now it makes more sense that the p value was so close to the rejection region., given the large deviation from the two CDF lines. Let’s look at the same CDF graph, except for x_100.

def ks_plot_norm(data):
length = len(data)
plt.figure(figsize=(12, 7))
plt.plot(np.sort(x_100), np.linspace(0, 1, len(x_100), endpoint=False))
plt.plot(np.sort(stats.norm.rvs(loc=5, scale=3, size=100)), np.linspace(0, 1, len(x_100), endpoint=False))
plt.legend('top right')
plt.legend(['Data', 'Theoretical Values'])
plt.title('Comparing CDFs for KS-Test')


Looks much better now that the KS test has more data to compare to, which makes sense as x_50 has fewer samples than x_100.

Let’s perform the same test, this time with uniform variables tested against a uniform distribution. We should not see the null hypothesis rejected.

# Try with a uniform distubtion
x_uni = np.random.rand(1000)
stats.kstest(x_uni, lambda x: x)
KstestResult(statistic=0.0271651556549517, pvalue=0.44898299308719736) # Null Accepted

How would our KS test CDF graph look if we compared our x_100 to the bimodal data we generated earlier. We should expect large deviations from each CDF and a KS test which rejects the null hypothesis.

# Generate binomial data
N = 100
x_bimodal_100 = np.concatenate((np.random.normal(-1, 1, int(0.1 * N)),
np.random.normal(5, 1, int(0.4 * N))))[:, np.newaxis]
# Plot the CDFs
def ks_plot_comp(data_1, data_2):
Data entereted must be the same size.
length = len(data_1)
plt.figure(figsize=(12, 7))
plt.plot(np.sort(data_1), np.linspace(0, 1, len(data_1), endpoint=False))
plt.plot(np.sort(data_2), np.linspace(0, 1, len(data_2), endpoint=False))
plt.legend('top right')
plt.legend(['Data_1', 'Data_2'])
plt.title('Comparing 2 CDFs for KS-Test')

ks_plot_comp(x_100, x_bimodal_100[:,0])
x_100 is in red while the binomial data is in blue

We can see x_100 and x_bimodal_100 diverge considerably, so the Kolmogorov-Smirnov test provides a p value rejecting the null hypothesis.

# Check if the distributions are equal
stats.ks_2samp(x_100, x_bimodal_100[:,0])
Ks_2sampResult(statistic=0.35, pvalue=0.0003814088466521276) # Null rejected


The Anderson-Darling tests if data comes from a particular distribution. The null hypothesis — similar to the previous two tests — is the sample comes from a population that follows a particular distribution. In this case, we will be testing if our data comes from a normal distribution.

The advantage here is, this is a more sensitive test to check different distributions. However, that critical values must be calculated for each distribution so it might be more work determining those for different distributions. The python package allows for these different distributions — normal, exponential, logistic, etc. — to be tested with the dist parameter.

# Perform the AD test
anderson_results_10 = stats.anderson(x_10, dist='norm')
anderson_results_50 = stats.anderson(x_50, dist='norm')
anderson_results_100 = stats.anderson(x_100, dist='norm')
anderson_results_1000 = stats.anderson(x_1000, dist='norm')
# Results for x_50
AndersonResult(statistic=0.5486337673481927, critical_values=array([0.538, 0.613, 0.736, 0.858, 1.021]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
# Results for x_1000
AndersonResult(statistic=0.127392173654016, critical_values=array([0.574, 0.653, 0.784, 0.914, 1.088]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))
# Null accepted at .05 ... and at all other significance_level in
# this example

The results give us the p values for various significance levels [15. , 10. , 5. , 2.5, 1. ] so if you’re working with boundaries outside of .05 you can see those results as well.

Here our p value for .05 is outside the rejection region of 0.784, meaning we cannot reject the null hypothesis our data comes from a normal distribution. Again what we would have expected

And there you have it, three different tests and a few graphical measures to see if your data is distributed normally. As well, some methods to see if your data is distributed in different ways.

The main take away from this article is to understand what your normality tests mean, why to implement them, and when to be sceptical of them. There is no one best test for normality and it is good to be suspicious of your results until you can verify them through additional measures.

The notebook used for this project can be found here.


Additional Reading