Try normality tests using all available Python libraries

MCMC Addict
5 min read · Dec 23, 2023


Introduction

Statistical inference often assumes that a population follows a normal distribution, based on the central limit theorem, which states that the standardized sample mean approaches a normal distribution as the sample size grows, even when the underlying distribution is not normal. However, if a distribution is highly skewed, the mean may not accurately represent the central tendency of the population, even with a large sample size. In such cases, a more robust statistic, such as the median, is preferred as a measure of central tendency. When drawing a small number of samples from a population, it is common practice to test for normality before applying statistical methods that rely on the normal distribution.
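
As a quick illustration of this claim, the following sketch (the sample sizes and seed are my own choices) draws 2000 sample means from a uniform distribution and checks them for normality:

# A quick sketch of the CLT claim: means of non-normal (uniform) samples
# look approximately normal.
import numpy as np
from scipy.stats import uniform, shapiro

# 2000 sample means, each taken over n = 50 uniform draws
means = uniform.rvs(size=(2000, 50), random_state=1).mean(axis=1)
print(f'Shapiro-Wilk p-value on the sample means: {shapiro(means).pvalue:.3f}')
# typically well above 0.05, so normality of the means is not rejected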

There are several ways to test normality: graphical, frequentist, and Bayesian. The most quantitative graphical method uses a quantile-quantile plot (QQ plot) of the sample quantiles against the quantiles of the standard normal distribution. For normally distributed data, the points on the QQ plot should fall approximately on a straight line, indicating a strong correlation, and a goodness-of-fit measure such as the correlation coefficient can quantify the degree of normality.
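
For readers who want to try the graphical method, here is a minimal sketch using scipy.stats.probplot; it assumes matplotlib is installed, and the correlation coefficient r it prints is the goodness-of-fit measure mentioned above:

# A minimal sketch of the QQ-plot approach
import matplotlib.pyplot as plt
from scipy.stats import norm, probplot

samples = norm.rvs(size=500, random_state=1)
# probplot draws the QQ plot and returns the fitted line and correlation r
(osm, osr), (slope, intercept, r) = probplot(samples, dist='norm', plot=plt.gca())
print(f'QQ-plot correlation r = {r:.4f}')  # close to 1 for normal data
plt.show()
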
The frequentist approach tests the data against the null hypothesis that they follow a normal distribution. The methods commonly associated with frequentist statistics are listed on Wikipedia as follows:
- D’Agostino’s K-squared test,
- Jarque-Bera test,
- Anderson-Darling test,
- Cramér-von Mises criterion,
- Kolmogorov-Smirnov test,
- Lilliefors test,
- Shapiro-Wilk test,
- Pearson’s chi-squared test.

According to Wikipedia, the Shapiro-Wilk test has the best power for a given significance level, followed closely by the Anderson-Darling test, when compared with the Kolmogorov-Smirnov and Lilliefors tests.
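
As a rough check of this power ranking, the following sketch (my own choice of a gamma alternative, sample size, and seed) compares the rejection rates of the Shapiro-Wilk and Lilliefors tests on skewed data:

# A rough, self-contained power check: 1000 small samples (n = 30) from a
# skewed gamma distribution; the rejection rate at a 5 % level approximates
# each test's power against this particular alternative.
import numpy as np
from scipy.stats import gamma, shapiro
from statsmodels.stats.diagnostic import lilliefors

data = gamma.rvs(a=2, size=(1000, 30), random_state=0)
sw_power = np.mean([shapiro(s).pvalue < 0.05 for s in data])
lf_power = np.mean([lilliefors(s)[1] < 0.05 for s in data])
print(f'Shapiro-Wilk power ~ {sw_power:.2f}, Lilliefors power ~ {lf_power:.2f}')
# Shapiro-Wilk typically rejects this skewed alternative more often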

In this article, I use Python libraries to encapsulate all the test methods listed on Wikipedia into a single function. The techniques are demonstrated on random samples drawn from a normal and a uniform distribution.

Encapsulating all the test methods listed on Wikipedia

I wrote a function that encapsulates all the normality tests, where the 'lilliefors' and 'chisquare' results are wrapped in a 'namedtuple' to harmonize their field names with those of the other libraries. Extra preprocessing code is needed to generate the observed and expected frequencies so that the chi-square test can be applied to normality testing. The 'number_of_bins' argument of the 'normality_tests' function controls the binning of the frequency arrays and affects only the chi-square test. Several blog posts have made mistakes when applying the chi-square test to normality testing.

from scipy.stats import (kstest, shapiro, normaltest, jarque_bera,
                         cramervonmises, anderson, chisquare)
from statsmodels.stats.diagnostic import lilliefors
from scipy.stats import norm, uniform
from collections import namedtuple
from tabulate import tabulate
import numpy as np

def normality_tests(samples, number_of_bins=15):
    ''' normality tests using all available python libraries
    number_of_bins: effective only for the chi-square test,
    1.Kolmogorov-Smirnov test,
    2.Shapiro-Wilk test,
    3.D'Agostino's K-squared test,
    4.Jarque–Bera test,
    5.Lilliefors test,
    6.Cramér–von Mises criterion,
    7.Chisquare test,
    8.Anderson-Darling test
    '''

    # Lilliefors result wrapped in a namedtuple to harmonize its field
    # names with those of the other libraries
    LillieforsResult = namedtuple('LillieforsResult', ['statistic', 'pvalue'])
    lil = lilliefors(samples)

    # For the chi-square test: expected frequencies under the standard
    # normal distribution (note: this assumes the samples are standardized)
    f_obs, bins = np.histogram(samples, bins=number_of_bins)
    tempo = [norm.cdf(bins[i+1]) - norm.cdf(bins[i])
             for i in range(len(bins)-1)]
    f_exp = np.array(tempo)*len(samples)/np.sum(tempo)  # normalized
    ChisquareResult = namedtuple('ChisquareResult', ['statistic', 'pvalue'])
    chi = chisquare(f_obs=f_obs, f_exp=f_exp)

    return [kstest(samples, 'norm'),
            shapiro(samples),
            normaltest(samples),
            jarque_bera(samples),
            LillieforsResult(lil[0], lil[1]),
            cramervonmises(samples, 'norm'),
            ChisquareResult(chi[0], chi[1]),
            anderson(samples)]

def pass_fail(test):
    '''decide whether each test passes or fails at a 5 % significance level'''

    result = ['pass' if test[i].pvalue > 0.05 else 'fail' for i in range(7)]
    # Anderson-Darling: pass if the statistic is below the 5 % critical value
    result.append('pass' if test[7].critical_values[2] > test[7].statistic
                  else 'fail')
    return result

def print_out(out, result):
    '''print out the test results using the tabulate library
    out: from the normality_tests function
    result: from the pass_fail function'''

    # method names parsed from normality_tests.__doc__
    method = [normality_tests.__doc__.translate(
              str.maketrans('', '', '\n')).split(",")[i].strip()
              for i in range(1, 9)]
    array = []
    for i in range(len(out)-1):  # methods 1 to 7
        sub_array = [method[i], f'{out[i].statistic:.4f}',
                     f'{out[i].pvalue:.4f}', result[i]]
        array.append(sub_array)
    # method 8: report the 5 % critical value instead of a p-value
    array.append([method[-1], f'{out[-1].statistic:.4f}',
                  f'{out[-1].critical_values[2]:.4f}*', result[-1]])
    print(tabulate(array, headers=["test methods", "statistic",
                                   "p-value*", "result"]))
    print('*Not a p-value but a critical-value for a signif.level of 5%')

Testing the normally distributed random number samples

First, we test randomly sampled data from the standard normal distribution using the code above. Note that the Shapiro-Wilk p-value becomes unreliable for more than 5000 samples. The pass or fail decisions are made at a 5% significance level, and all the p-values are greater than that level. For the Anderson-Darling test, the p-value column reports a critical value rather than a p-value; the test passes if the test statistic is smaller than the critical value. As we would expect, all tests pass, that is, we cannot reject the null hypothesis of normality.
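
Since the Anderson-Darling result is read differently from the p-value-based tests, here is a standalone sketch of the 5% decision rule; the variable names are my own and not part of the functions above:

# A standalone sketch of the Anderson-Darling decision rule at the 5 % level
from scipy.stats import anderson, norm

res = anderson(norm.rvs(size=1000, random_state=0), dist='norm')
# significance_level is array([15., 10., 5., 2.5, 1.]); pick the 5 % entry
crit_5 = res.critical_values[list(res.significance_level).index(5.0)]
print('pass' if res.statistic < crit_5 else 'fail')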

# number of samples; the Shapiro-Wilk p-value may be unreliable for n > 5000
n = 5000
np.random.seed(1)

# sampled data from the standard normal distribution
samples = norm.rvs(size=n)

out = normality_tests(samples)
result = pass_fail(out)

print_out(out, result)
test methods                statistic    p-value*  result
------------------------  -----------  ----------  --------
1.Kolmogorov-Smirnov           0.0148      0.2196  pass
2.Shapiro-Wilk                 0.9996      0.4603  pass
3.D'Agostino's K-squared       0.1206      0.9415  pass
4.Jarque–Bera                  0.1515      0.9271  pass
5.Lilliefors                   0.0064      0.9021  pass
6.Cramér–von Mises             0.3518      0.0973  pass
7.Chisquare                   18.424       0.1881  pass
8.Anderson-Darling             0.3053      0.7860* pass
*Not a p-value but a critical-value for a signif.level of 5%

Testing the uniformly distributed random number samples

Next, we run the same set of tests on uniformly distributed random number samples. All the p-values are far below the 5 % significance level, and the Anderson-Darling test is rejected because its critical value is smaller than the test statistic. As expected, all tests fail, which means we can reject the null hypothesis of normality.

n = 5000
np.random.seed(1)

# sampled data from a uniform distribution
samples = uniform.rvs(size=n)

out = normality_tests(samples)
result = pass_fail(out)
print_out(out, result)
test methods                statistic    p-value*  result
------------------------  -----------  ----------  --------
1.Kolmogorov-Smirnov           0.5         0.0000  fail
2.Shapiro-Wilk                 0.9554      0.0000  fail
3.D'Agostino's K-squared    4209.99        0.0000  fail
4.Jarque–Bera                299.368       0.0000  fail
5.Lilliefors                   0.0602      0.0010  fail
6.Cramér–von Mises           349.38        0.0000  fail
7.Chisquare                  121.909       0.0000  fail
8.Anderson-Darling            55.5424      0.7860* fail
*Not a p-value but a critical-value for a signif.level of 5%

Summary

I presented a Python function that encapsulates all the normality test methods listed on Wikipedia, based on the available Python libraries. The test kit was demonstrated on two sets of random samples, drawn from a normal and a uniform distribution, respectively.

References

  1. https://en.wikipedia.org/wiki/Normality_test

Appendix

For the reader's reference, the printout of the list 'out' from the uniform-distribution run is appended below.

out

[KstestResult(statistic=0.5000386777283669, pvalue=0.0),
ShapiroResult(statistic=0.9553597569465637, pvalue=1.3175929207036562e-36),
NormaltestResult(statistic=4209.988241733745, pvalue=0.0),
Jarque_beraResult(statistic=299.3682423936702, pvalue=0.0),
LillieforsResult(statistic=0.0602284694055234, pvalue=0.0009999999999998899),
CramerVonMisesResult(statistic=349.3798394484765, pvalue=1.06223759344104e-07),
ChisquareResult(statistic=121.90902712608003, pvalue=2.6586860380724203e-19),
AndersonResult(statistic=55.54241504767742, critical_values=array([0.576, 0.655, 0.786, 0.917, 1.091]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))]
