Normality Test with Python in Data Science

Shivam Mishra
Published in Analytics Vidhya
3 min read · Aug 10, 2020

Shapiro-Wilk Test, Anderson-Darling Test, D’Agostino’s K-squared Test

Data Science

Table of contents:

  1. Shapiro-Wilk Test
  2. Anderson-Darling Test
  3. D’Agostino’s K-squared Test

1. Shapiro-Wilk Test

The Shapiro–Wilk test is a test of normality in frequentist statistics. It was published in 1965 by Samuel Sanford Shapiro and Martin Wilk.

The Shapiro-Wilk test calculates a W statistic that tests whether a random sample x1, x2, …, xn comes from a normal distribution. Small values of W are evidence of departure from normality. Percentage points (critical values) for the W statistic were obtained via Monte Carlo simulations.

In comparison studies, this test has performed very well against other goodness-of-fit tests.

Assumptions:

  • Observations in each sample are independent and identically distributed (iid).

Hypothesis:

H0: Data follows Normal Distribution.

H1: Data does not follow Normal Distribution.

#Python code
#Example of Shapiro-Wilk Test

from scipy.stats import shapiro

data = [1, 1.2, 0.2, 0.3, -1, -0.2, -0.6, -0.8, 0.8, 0.1]
# shapiro returns the W statistic and the p-value
stat, p = shapiro(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# Fail to reject H0 when p exceeds the 0.05 significance level
if p > 0.05:
    print("Data follows Normal Distribution")
else:
    print("Data does not follow Normal Distribution")
OUTPUT:
Data follows Normal Distribution
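
To see how W behaves, the following sketch compares a sample drawn from a normal distribution with one drawn from an exponential distribution. The seed, the sample size of 200, and the variable names are assumptions chosen for illustration and are not part of the example above.

```python
import numpy as np
from scipy.stats import shapiro

# Illustrative assumptions: fixed seed and 200 observations per sample.
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)
skewed_sample = rng.exponential(scale=1.0, size=200)

w_normal, p_normal = shapiro(normal_sample)
w_skewed, p_skewed = shapiro(skewed_sample)

# W close to 1 is consistent with normality; a smaller W signals departure.
print('normal sample: W=%.3f, p=%.3f' % (w_normal, p_normal))
print('skewed sample: W=%.3f, p=%.3g' % (w_skewed, p_skewed))
```

The skewed sample should produce the smaller W and a p-value far below 0.05, which is the pattern the test uses to reject H0.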

2. Anderson-Darling Test

The Anderson–Darling test is a statistical test of whether a given sample of data is drawn from a given probability distribution. In its basic form, the test assumes that there are no parameters to be estimated in the distribution being tested, in which case the test and its set of critical values are distribution-free.

It can be used to check whether a data sample is normal. The test is a modification of the Kolmogorov-Smirnov test, a nonparametric goodness-of-fit test, that gives more weight to the tails of the distribution.
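
For the normal case, the statistic can be reproduced by hand from the usual formula A² = -n - (1/n) Σ (2i - 1) [ln F(z(i)) + ln(1 - F(z(n+1-i)))], where z(i) are the sorted, standardized observations and F is the standard normal CDF. The sketch below assumes the mean and standard deviation are estimated from the sample (with ddof=1), which is my understanding of what scipy.stats.anderson does for dist='norm'. The helper name anderson_darling_normal is hypothetical.

```python
import numpy as np
from scipy.stats import anderson, norm

def anderson_darling_normal(sample):
    """Hypothetical helper: A^2 statistic against a normal distribution
    whose mean and standard deviation are estimated from the sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)  # standardize with sample estimates
    i = np.arange(1, n + 1)
    # log CDF at the i-th smallest point, log survival at the i-th largest
    log_cdf = norm.logcdf(z)
    log_sf = norm.logsf(z)[::-1]
    return -n - np.sum((2 * i - 1) * (log_cdf + log_sf)) / n

data = [1, 1.2, 0.2, 0.3, -1, -0.2, -0.6, -0.8, 0.8, 0.1]
manual = anderson_darling_normal(data)
scipy_stat = anderson(data).statistic
print('manual A^2=%.4f, scipy A^2=%.4f' % (manual, scipy_stat))
```

If the assumption about the estimated parameters holds, the two values should agree up to floating-point error.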

Assumptions:

  • Observations in each sample are independent and identically distributed (iid).

Hypothesis:

H0: Data follows Normal Distribution.

H1: Data does not follow Normal Distribution.

#Python code
#Example of Anderson-Darling Test

from scipy.stats import anderson

data = [1, 1.2, 0.2, 0.3, -1, -0.2, -0.6, -0.8, 0.8, 0.1]
# dist defaults to 'norm', so this tests against a normal distribution
result = anderson(data)
print(result)
OUTPUT:
AndersonResult(statistic=0.19788206806788722, critical_values=array([0.501, 0.57 , 0.684, 0.798, 0.95 ]), significance_level=array([15. , 10. , 5. , 2.5, 1. ]))

The test statistic is 0.1979. We compare this value to the critical value at each significance level: if the statistic is below the critical value, we fail to reject H0 at that level.

#Python code
print('stat=%.3f' % (result.statistic))
# Compare the statistic with the critical value at each significance level
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print('Data follows Normal at the %.1f%% level' % (sl))
    else:
        print('Data does not follow Normal at the %.1f%% level' % (sl))
OUTPUT:
stat=0.198
Data follows Normal at the 15.0% level
Data follows Normal at the 10.0% level
Data follows Normal at the 5.0% level
Data follows Normal at the 2.5% level
Data follows Normal at the 1.0% level
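
When only one significance level matters, the loop above can be reduced to a lookup. This is a minimal sketch: the helper name is_normal_at is hypothetical, and it assumes the requested level is one of those that anderson reports for the normal case (15, 10, 5, 2.5 and 1 percent).

```python
import numpy as np
from scipy.stats import anderson

def is_normal_at(sample, level=5.0):
    """Hypothetical helper: True if H0 (normality) is not rejected
    at the given significance level, expressed in percent."""
    result = anderson(sample, dist='norm')
    levels = np.asarray(result.significance_level, dtype=float)
    matches = np.flatnonzero(np.isclose(levels, level))
    if matches.size == 0:
        raise ValueError('level must be one of %s' % levels.tolist())
    critical_value = result.critical_values[matches[0]]
    # Fail to reject H0 when the statistic is below the critical value.
    return bool(result.statistic < critical_value)

data = [1, 1.2, 0.2, 0.3, -1, -0.2, -0.6, -0.8, 0.8, 0.1]
print(is_normal_at(data, level=5.0))
```

Raising an error for an unsupported level, rather than interpolating between critical values, keeps the helper from suggesting more precision than the tabulated values provide.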

3. D’Agostino’s K-squared Test

D’Agostino’s K² test, named for Ralph D’Agostino, calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the data distribution departs from the normal distribution.

  • Skew is a quantification of how much a distribution is pushed left or right, a measure of asymmetry in the distribution.
  • Kurtosis quantifies how much of the distribution is in the tail.

It is a simple and commonly used statistical test for normality.
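
The way the two summary statistics combine can be sketched with SciPy's skewtest and kurtosistest: each returns a z-score, and K² is the sum of their squares, compared against a chi-squared distribution with 2 degrees of freedom. The seed and the sample size of 100 are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import chi2, kurtosis, kurtosistest, normaltest, skew, skewtest

# Illustrative assumptions: fixed seed and 100 observations.
rng = np.random.default_rng(0)
sample = rng.normal(size=100)

print('skewness=%.3f, excess kurtosis=%.3f' % (skew(sample), kurtosis(sample)))

# z-scores for the skewness and kurtosis tests
z_skew = skewtest(sample).statistic
z_kurt = kurtosistest(sample).statistic

# K^2 is the sum of the squared z-scores; under H0 it is approximately
# chi-squared distributed with 2 degrees of freedom.
k2 = z_skew ** 2 + z_kurt ** 2
p_value = chi2.sf(k2, df=2)

stat, p = normaltest(sample)
print('manual: K2=%.3f, p=%.3f' % (k2, p_value))
print('scipy:  K2=%.3f, p=%.3f' % (stat, p))
```

The manual and SciPy lines should match, since normaltest combines the same two z-scores.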

Assumptions:

  • Observations in each sample are independent and identically distributed (iid).

Hypothesis:

H0: Data follows Normal Distribution.

H1: Data does not follow Normal Distribution.

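#Python code
#Example of D’Agostino’s K-squared Test
from scipy.stats import normaltest

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
# normaltest returns the K² statistic and the p-value;
# SciPy may warn that the kurtosis test is only valid for n >= 20
stat, p = normaltest(data)
print('stat=%.3f, p=%.3f' % (stat, p))
# Fail to reject H0 when p exceeds the 0.05 significance level
if p > 0.05:
    print('Data follows normal')
else:
    print('Data does not follow normal')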
OUTPUT:
stat=3.392, p=0.183
Data follows normal

Contact me through:

LinkedIn: https://www.linkedin.com/in/shivam-mishra17/

Email: shivammishra2186@yahoo.com

Twitter: https://twitter.com/ishivammishra17


I am a master's student. I like to support our data science community.