Try normality tests using all available Python libraries

MCMC Addict
5 min read · Dec 23, 2023


Introduction

Statistical inference often assumes that a population follows a normal distribution, based on the central limit theorem, which states that the standardized sample mean approaches a normal distribution as the sample size grows, even when the underlying distribution is not normal. However, if a distribution is highly skewed, the mean may not accurately represent the central tendency of the population, even with a large sample size. In such cases, a more robust statistic, such as the median, is preferred as a measure of central tendency. When drawing a small number of samples from a population, it is common practice to test for normality before applying statistical methods that rely on the normal distribution.
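
As a quick illustration of this claim, the following sketch (the sample sizes and seed are my own choices) draws 2000 sample means from a uniform distribution and checks them for normality:

# A quick sketch of the CLT claim: means of non-normal (uniform) samples
# look approximately normal.
import numpy as np
from scipy.stats import uniform, shapiro

# 2000 sample means, each taken over n = 50 uniform draws
means = uniform.rvs(size=(2000, 50), random_state=1).mean(axis=1)
print(f'Shapiro-Wilk p-value on the sample means: {shapiro(means).pvalue:.3f}')
# typically well above 0.05, so normality of the means is not rejected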

There are several ways to test normality: graphical, frequentist, and Bayesian. The most quantitative graphical method uses a quantile-quantile plot (QQ plot) of the sample quantiles against the quantiles of the standard normal distribution. For normally distributed data, the points on the QQ plot should fall approximately on a straight line, indicating a strong correlation, and a goodness-of-fit measure such as the correlation coefficient can quantify the degree of normality.
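
For readers who want to try the graphical method, here is a minimal sketch using scipy.stats.probplot; it assumes matplotlib is installed, and the correlation coefficient r it prints is the goodness-of-fit measure mentioned above:

# A minimal sketch of the QQ-plot approach
import matplotlib.pyplot as plt
from scipy.stats import norm, probplot

samples = norm.rvs(size=500, random_state=1)
# probplot draws the QQ plot and returns the fitted line and correlation r
(osm, osr), (slope, intercept, r) = probplot(samples, dist='norm', plot=plt.gca())
print(f'QQ-plot correlation r = {r:.4f}')  # close to 1 for normal data
plt.show()
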
The frequentist approach tests the data against the null hypothesis that they follow a normal distribution. The methods commonly associated with frequentist statistics are listed on Wikipedia as follows:
- D’Agostino’s K-squared test,
- Jarque-Bera test,
- Anderson-Darling test,
- Cramér-von Mises criterion,
- Kolmogorov-Smirnov test,
- Lilliefors test,
- Shapiro-Wilk test,
- Pearson’s chi-squared test.

According to Wikipedia, the Shapiro-Wilk test has the best power for a given significance level, followed closely by the Anderson-Darling test, when compared with the Kolmogorov-Smirnov and Lilliefors tests.
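
As a rough check of this power ranking, the following sketch (my own choice of a gamma alternative, sample size, and seed) compares the rejection rates of the Shapiro-Wilk and Lilliefors tests on skewed data:

# A rough, self-contained power check: 1000 small samples (n = 30) from a
# skewed gamma distribution; the rejection rate at a 5 % level approximates
# each test's power against this particular alternative.
import numpy as np
from scipy.stats import gamma, shapiro
from statsmodels.stats.diagnostic import lilliefors

data = gamma.rvs(a=2, size=(1000, 30), random_state=0)
sw_power = np.mean([shapiro(s).pvalue < 0.05 for s in data])
lf_power = np.mean([lilliefors(s)[1] < 0.05 for s in data])
print(f'Shapiro-Wilk power ~ {sw_power:.2f}, Lilliefors power ~ {lf_power:.2f}')
# Shapiro-Wilk typically rejects this skewed alternative more often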

In this article, I use Python libraries to encapsulate all the test methods listed on Wikipedia into a single function. The techniques are demonstrated on random samples drawn from a normal and a uniform distribution.

Encapsulating all the test methods listed on Wikipedia

I wrote a function that encapsulates all the normality tests, where the 'lilliefors' and 'chisquare' results are wrapped in a 'namedtuple' to harmonize their field names with those of the other libraries. Extra preprocessing code is needed to generate the observed and expected frequencies so that the chi-square test can be applied to normality testing. The 'number_of_bins' argument of the 'normality_tests' function controls the binning of the frequency arrays and affects only the chi-square test. Several blog posts have made mistakes when applying the chi-square test to normality testing.

from scipy.stats import (kstest, shapiro, normaltest, jarque_bera,
                         cramervonmises, anderson, chisquare)
from statsmodels.stats.diagnostic import lilliefors
from scipy.stats import norm, uniform
from collections import namedtuple
from tabulate import tabulate
import numpy as np

def normality_tests(samples, number_of_bins=15):
    ''' normality tests using all available python libraries
    number_of_bins: effective only for the chi-square test,
    1.Kolmogorov-Smirnov test,
    2.Shapiro-Wilk test,
    3.D'Agostino's K-squared test,
    4.Jarque–Bera test,
    5.Lilliefors test,
    6.Cramér–von Mises criterion,
    7.Chisquare test,
    8.Anderson-Darling test
    '''

    # Lilliefors result wrapped in a namedtuple to harmonize its field
    # names with those of the other libraries
    LillieforsResult = namedtuple('LillieforsResult', ['statistic', 'pvalue'])
    lil = lilliefors(samples)

    # For the chi-square test: expected frequencies under the standard
    # normal distribution (note: this assumes the samples are standardized)
    f_obs, bins = np.histogram(samples, bins=number_of_bins)
    tempo = [norm.cdf(bins[i+1]) - norm.cdf(bins[i])
             for i in range(len(bins)-1)]
    f_exp = np.array(tempo)*len(samples)/np.sum(tempo)  # normalized
    ChisquareResult = namedtuple('ChisquareResult', ['statistic', 'pvalue'])
    chi = chisquare(f_obs=f_obs, f_exp=f_exp)

    return [kstest(samples, 'norm'),
            shapiro(samples),
            normaltest(samples),
            jarque_bera(samples),
            LillieforsResult(lil[0], lil[1]),
            cramervonmises(samples, 'norm'),
            ChisquareResult(chi[0], chi[1]),
            anderson(samples)]

def pass_fail(test):
    '''decide whether each test passes or fails at a 5 % significance level'''

    result = ['pass' if test[i].pvalue > 0.05 else 'fail' for i in range(7)]
    # Anderson-Darling: pass if the statistic is below the 5 % critical value
    result.append('pass' if test[7].critical_values[2] > test[7].statistic
                  else 'fail')
    return result

def print_out(out, result):
    '''print out the test results using the tabulate library
    out: from the normality_tests function
    result: from the pass_fail function'''

    # method names parsed from normality_tests.__doc__
    method = [normality_tests.__doc__.translate(
              str.maketrans('', '', '\n')).split(",")[i].strip()
              for i in range(1, 9)]
    array = []
    for i in range(len(out)-1):  # methods 1 to 7
        sub_array = [method[i], f'{out[i].statistic:.4f}',
                     f'{out[i].pvalue:.4f}', result[i]]
        array.append(sub_array)
    # method 8: report the 5 % critical value instead of a p-value
    array.append([method[-1], f'{out[-1].statistic:.4f}',
                  f'{out[-1].critical_values[2]:.4f}*', result[-1]])
    print(tabulate(array, headers=["test methods", "statistic",
                                   "p-value*", "result"]))
    print('*Not a p-value but a critical-value for a signif.level of 5%')

Testing the normally distributed random number samples

First, we test randomly sampled data from the standard normal distribution using the code above. Note that the Shapiro-Wilk p-value becomes unreliable for more than 5000 samples. The pass or fail decisions are made at a 5% significance level, and all the p-values are greater than that level. For the Anderson-Darling test, the p-value column reports a critical value rather than a p-value; the test passes if the test statistic is smaller than the critical value. As we would expect, all tests pass, that is, we cannot reject the null hypothesis of normality.
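
Since the Anderson-Darling result is read differently from the p-value-based tests, here is a standalone sketch of the 5% decision rule; the variable names are my own and not part of the functions above:

# A standalone sketch of the Anderson-Darling decision rule at the 5 % level
from scipy.stats import anderson, norm

res = anderson(norm.rvs(size=1000, random_state=0), dist='norm')
# significance_level is array([15., 10., 5., 2.5, 1.]); pick the 5 % entry
crit_5 = res.critical_values[list(res.significance_level).index(5.0)]
print('pass' if res.statistic < crit_5 else 'fail')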

# number of samples; the Shapiro-Wilk p-value may be unreliable for n > 5000
n = 5000
np.random.seed(1)

# sampled data from the standard normal distribution
samples = norm.rvs(size=n)

out = normality_tests(samples)
result = pass_fail(out)

print_out(out, result)
test methods                statistic    p-value*  result
------------------------  -----------  ----------  --------
1.Kolmogorov-Smirnov           0.0148      0.2196  pass
2.Shapiro-Wilk                 0.9996      0.4603  pass
3.D'Agostino's K-squared       0.1206      0.9415  pass
4.Jarque–Bera                  0.1515      0.9271  pass
5.Lilliefors                   0.0064      0.9021  pass
6.Cramér–von Mises             0.3518      0.0973  pass
7.Chisquare                   18.424       0.1881  pass
8.Anderson-Darling             0.3053      0.7860* pass
*Not a p-value but a critical-value for a signif.level of 5%

Testing the uniformly distributed random number samples

Next, we run the same set of tests on uniformly distributed random number samples. All the p-values are far below the 5 % significance level, and the Anderson-Darling test is rejected because its critical value is smaller than the test statistic. As expected, all tests fail, which means we can reject the null hypothesis of normality.

n = 5000
np.random.seed(1)

# sampled data from a uniform distribution
samples = uniform.rvs(size=n)

out = normality_tests(samples)
result = pass_fail(out)
print_out(out, result)
test methods                statistic    p-value*  result
------------------------  -----------  ----------  --------
1.Kolmogorov-Smirnov           0.5         0.0000  fail
2.Shapiro-Wilk                 0.9554      0.0000  fail
3.D'Agostino's K-squared    4209.99        0.0000  fail
4.Jarque–Bera                299.368       0.0000  fail
5.Lilliefors                   0.0602      0.0010  fail
6.Cramér–von Mises           349.38        0.0000  fail
7.Chisquare                  121.909       0.0000  fail
8.Anderson-Darling            55.5424      0.7860* fail
*Not a p-value but a critical-value for a signif.level of 5%

Summary

I presented a Python function that encapsulates all the normality test methods listed on Wikipedia, based on the available Python libraries. The test kit was demonstrated on two sets of random samples, drawn from a normal and a uniform distribution, respectively.

References

  1. https://en.wikipedia.org/wiki/Normality_test

Appendix

For the reader's reference, the printout of the list 'out' from the uniform-distribution run is appended below.

out

[KstestResult(statistic=0.5000386777283669, pvalue=0.0),
ShapiroResult(statistic=0.9553597569465637, pvalue=1.3175929207036562e-36),
NormaltestResult(statistic=4209.988241733745, pvalue=0.0),
Jarque_beraResult(statistic=299.3682423936702, pvalue=0.0),
LillieforsResult(statistic=0.0602284694055234, pvalue=0.0009999999999998899),
CramerVonMisesResult(statistic=349.3798394484765, pvalue=1.06223759344104e-07),
ChisquareResult(statistic=121.90902712608003, pvalue=2.6586860380724203e-19),
AndersonResult(statistic=55.54241504767742, critical_values=array([0.576, 0.655, 0.786, 0.917, 1.091]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))]
