Data Scientist Must Know — How to do the Shapiro Wilk Normality Test

Dec 17, 2022

When it comes to the theory behind statistical tests and models, the normality assumption is almost always present, not only because of the central limit theorem but also because of the convenient properties of the normal distribution. In statistics courses, you are usually allowed to assume that your data is normal. In the real world, however, things don’t work like that: you need a statistical test to check whether your data is plausibly normally distributed. This post introduces the Shapiro Wilk test, arguably the most widely used normality test out there.

The Test

This test measures how far the sample departs from what we would expect of normally distributed data. It assumes that the data is a random sample. Suppose we have a random sample of n observations. The hypotheses for this test are:

Null hypothesis : the sample comes from a normal distribution

Alternative hypothesis : the sample does not come from a normal distribution

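The steps for this test are as follows:

1. Calculate the statistic ‘D’, the sum of squared deviations of the observations from the sample mean.

2. Order the sample in ascending order, writing the ordered values as x(1) ≤ x(2) ≤ ... ≤ x(n).

3. Calculate the test statistic ‘W’ by squaring a weighted sum of the differences x(n-i+1) - x(i) between ordered values paired from the top and bottom of the sample, then dividing by D.

The weights are coefficients taken from the Shapiro Wilk coefficient table, which you can look up on the internet. Below is an example of the Shapiro Wilk table.

4. Compare the value of the test statistic with the Shapiro Wilk test table

This table is filled with the quantiles of W. Reject the null hypothesis if W is less than the 0.05 quantile (or another quantile for a different significance level). As with the coefficient table above, you can find this table on the internet. An example of the quantile table is shown below.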

The Shapiro Wilk Table to compare with W
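For reference, the quantities in steps 1 to 3 can be written out as follows. This is a sketch of the standard textbook form of the statistic, not a copy of the original figures. The notation is assumed here: x_i are the observations, \bar{x} is the sample mean, x_{(i)} is the i-th smallest observation, a_i are the tabulated coefficients, and k is the integer part of n/2.

```latex
% Step 1: sum of squared deviations from the sample mean
D = \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2

% Step 2: the ordered sample
x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}

% Step 3: the test statistic, using the tabulated coefficients a_i
W = \frac{1}{D} \left[ \sum_{i=1}^{k} a_i \left( x_{(n-i+1)} - x_{(i)} \right) \right]^2
```

W lies between 0 and 1, and values close to 1 are consistent with normality, which is why the test rejects for small values of W.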

An Example

Let’s say you have a sample of size 10 and, after doing the calculations, you get W = 0.651. Using a significance level of 5%, the quantile table gives 0.842 as the 0.05 quantile of W for n = 10. Since your W is less than 0.842, we reject the null hypothesis.
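The decision rule in this example is a lower-tail comparison, which can be sketched in a few lines of Python. The function name `reject_normality` is made up for illustration; the two numbers are the ones quoted in the example.

```python
def reject_normality(w_statistic: float, critical_value: float) -> bool:
    """Return True when W falls below the tabulated quantile (lower-tail test)."""
    return w_statistic < critical_value


w_observed = 0.651  # W computed from the sample of size 10
w_critical = 0.842  # 0.05 quantile of W for n = 10, read from the quantile table

decision = reject_normality(w_observed, w_critical)
print("Reject H0" if decision else "Fail to reject H0")
```

Note that the comparison goes the opposite way from many familiar tests: small values of W, not large ones, count as evidence against normality.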

The Shapiro Wilk Test on Software

Obviously, you wouldn’t do this test by hand. Luckily, software can do it for you. In this post, I’m going to show you the Shapiro Wilk test in RStudio and Python. The example uses the house prices data, stored in a data frame named ‘df’, and checks whether the ‘LotFrontage’ column is normally distributed. For this example, let’s use a significance level of 5% (0.05).

Shapiro Wilk Test in RStudio

The Shapiro Wilk test is provided by the stats package, which ships with R and is loaded by default, so there is nothing extra to install. The code for the test is:

shapiro.test(df$LotFrontage)

The result from the test is as follows

From the result, we see that W=0.8804 and the p-value is less than 0.05. Therefore, we reject the null hypothesis that the data comes from a normal distribution.

Shapiro Wilk Test in Python

For this test, we use the stats module from SciPy. If you’re using a Jupyter Notebook, SciPy is often already installed; if it’s not, use the following code to install the package:

!pip install scipy

The code used to do the test is:

from scipy.stats import shapiro
shapiro(df['LotFrontage'])

The result from the test is as follows

The Shapiro Wilk Test on Python

Same as before, we see that the statistic is W=0.8804 and that the p-value is less than 0.05, so we again reject the null hypothesis of normality.
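If you want the decision in code rather than reading it off the output, the whole workflow could look something like the sketch below. It does not use the real house prices file: `df_demo` is a synthetic, right-skewed stand-in created here so the snippet runs on its own. It also drops missing values before testing, since missing values in a column can turn SciPy’s result into NaN.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic stand-in for the 'LotFrontage' column (NOT the real data):
# right-skewed values with a few missing entries.
rng = np.random.default_rng(0)
values = rng.lognormal(mean=4.2, sigma=0.5, size=300)
values[::25] = np.nan  # every 25th value is missing
df_demo = pd.DataFrame({"LotFrontage": values})

alpha = 0.05  # significance level

# Drop missing values first, then run the test.
sample = df_demo["LotFrontage"].dropna().to_numpy()
w_stat, p_value = shapiro(sample)

reject_h0 = p_value < alpha
print(f"n = {len(sample)}, W = {w_stat:.4f}, p-value = {p_value:.3g}")
print("Reject H0 (not normal)" if reject_h0 else "Fail to reject H0")
```

To run this on your own data, replace `df_demo` with your data frame and keep the `dropna()` step if the column has gaps.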

Some comments on this test

The Shapiro Wilk test is my go-to normality test. However, it has some drawbacks. In particular, it only tells us whether the data is consistent with some normal distribution; it says nothing about the mean and variance of the population. If you are only testing for normality, this isn’t a problem. However, the test is no help if you want to know whether the sample comes from a normal distribution with a specific mean and variance. To test whether a sample comes from a fully specified distribution, you can use the Kolmogorov Smirnov test.
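As a rough illustration of that last point, here is a minimal sketch using `kstest` from SciPy. The sample is simulated, and the means and standard deviations are arbitrary values chosen for the demonstration. Note that `kstest` takes the standard deviation, not the variance, and that the parameters have to be fixed in advance rather than estimated from the same sample.

```python
import numpy as np
from scipy.stats import kstest

# Simulated sample (an assumption for illustration): 200 draws from N(mean=70, sd=20).
rng = np.random.default_rng(1)
sample = rng.normal(loc=70.0, scale=20.0, size=200)

alpha = 0.05

# H0: the sample comes from a normal distribution with mean 70 and sd 20.
stat_true, p_true = kstest(sample, "norm", args=(70.0, 20.0))

# H0: the sample comes from a normal distribution with mean 40 and sd 20.
# The shape is still normal, but the mean is wrong, so this should be rejected.
stat_wrong, p_wrong = kstest(sample, "norm", args=(40.0, 20.0))

print(f"N(70, 20): D = {stat_true:.4f}, p-value = {p_true:.3g}")
print(f"N(40, 20): D = {stat_wrong:.4f}, p-value = {p_wrong:.3g}")
print("Reject N(40, 20)" if p_wrong < alpha else "Fail to reject N(40, 20)")
```

The second test is the kind of question the Shapiro Wilk test cannot answer: the data is normal, but not with the mean we hypothesized.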