Normality Testing

Ananthakrishnan Harikumar · AI Skunks · Mar 14, 2023

Normality testing is a statistical procedure used to determine whether a sample of data comes from a normally distributed (Gaussian) population. This matters because many statistical methods assume that the data are normally distributed. Before going into detail about normality testing, let's take a look at what the normal distribution is.

Normal or Gaussian Distribution

Normal distribution, also known as Gaussian distribution, is a continuous probability distribution that is widely used in statistics to model real-world phenomena. It is characterized by its bell-shaped curve, which is symmetrical around the mean.

The probability density function (PDF) of a normal distribution is given by:

f(x) = (1/(σ√(2π))) * exp(-(x-μ)²/(2σ²))

where:

μ is the mean of the distribution
σ is the standard deviation of the distribution
x is any possible value of the variable being measured
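
To make the formula concrete, here is a minimal sketch (mu, sigma, and the grid of x values are arbitrary choices) that evaluates the PDF directly from the formula and checks it against scipy.stats.norm.pdf:

import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0          # mean and standard deviation
x = np.linspace(-4, 4, 9)     # a few sample points

# Evaluate the PDF directly from the formula above
pdf_manual = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Compare against SciPy's implementation
print(np.allclose(pdf_manual, norm.pdf(x, mu, sigma)))  # True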

The normal distribution has several important properties, including:

1) The mean, median, and mode of the distribution are all equal.

2) The total area under the curve is equal to 1.

3) The standard deviation determines the spread of the distribution: the larger the standard deviation, the more spread out the distribution is.

4) The Empirical Rule or 68–95–99.7 rule (checked by the short simulation below) states that:

a) About 68% of the values fall within one standard deviation of the mean.

b) About 95% of the values fall within two standard deviations of the mean.

c) About 99.7% of the values fall within three standard deviations of the mean.
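
As a quick sanity check on the Empirical Rule, here is a minimal sketch that draws a large standard-normal sample and measures the fraction of values within one, two, and three standard deviations of the mean (the sample size is an arbitrary choice):

import numpy as np

np.random.seed(0)
sample = np.random.normal(loc=0, scale=1, size=1_000_000)

# Fraction of values within k standard deviations of the mean
for k in (1, 2, 3):
    frac = np.mean(np.abs(sample) <= k)
    print(f"within {k} standard deviation(s): {frac:.4f}")
# Expected values are close to 0.6827, 0.9545, and 0.9973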

Normal distribution is widely used in statistical analysis, hypothesis testing, and modelling real-world data in various fields, including finance, engineering, social sciences, and more.

Examples of Normal Distribution

Normal distribution can be observed in various real-world phenomena. Here are some examples:

1) Height of individuals: The height of individuals within a population generally follows a normal distribution.

2) IQ scores: Intelligence Quotient (IQ) scores of a population also follow a normal distribution.

3) Body weight: The body weight of individuals within a population also follows a normal distribution.

4) Errors in measurements: Errors in measurements, such as in laboratory experiments or manufacturing processes, can also be modeled using a normal distribution.

5) Exam scores: The scores of a large group of students in an exam often follow a normal distribution, assuming that the exam is well-designed and fairly administered.

6) Financial returns: The daily returns of stocks and other financial assets are often modeled as approximately normal, although real return distributions tend to have heavier tails than a true normal distribution.

7) Reaction time: The time it takes for an individual to respond to a stimulus is often modeled as approximately normal, although reaction-time data are typically right-skewed.

8) Blood pressure: The systolic and diastolic blood pressures of a population are also approximately normally distributed.

These are just a few examples of where normal distribution can be observed in real-world phenomena. The distribution is so commonly used that it is often referred to as the “normal” or “bell-shaped” distribution.

Some common methods for normality testing include the Shapiro-Wilk test, the Anderson-Darling test, and the Kolmogorov-Smirnov test. The result of each test includes a p-value, which quantifies how consistent the sample is with a normally distributed population. If the p-value is below a chosen significance threshold (e.g., 0.05), the null hypothesis is rejected and the sample is concluded not to come from a normally distributed population.
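
As a preview, here is a minimal sketch of a helper (the name check_normality is illustrative) that runs all three tests on the same sample. Note that scipy.stats.anderson reports critical values rather than a p-value, and that passing parameters estimated from the sample to kstest makes its p-value approximate:

import numpy as np
import scipy.stats as stats

def check_normality(data, alpha=0.05):
    """Run three common normality tests on a 1-D sample."""
    w, p_sw = stats.shapiro(data)
    print(f"Shapiro-Wilk: W={w:.4f}, p={p_sw:.4f}, normal={p_sw > alpha}")

    # KS test against a normal with parameters estimated from the sample
    # (estimating parameters makes this p-value approximate)
    d, p_ks = stats.kstest(data, 'norm', args=(np.mean(data), np.std(data, ddof=1)))
    print(f"Kolmogorov-Smirnov: D={d:.4f}, p={p_ks:.4f}, normal={p_ks > alpha}")

    # Anderson-Darling reports critical values instead of a p-value;
    # index 2 corresponds to the 5% significance level for dist='norm'
    ad = stats.anderson(data, dist='norm')
    print(f"Anderson-Darling: A2={ad.statistic:.4f}, "
          f"5% critical value={ad.critical_values[2]:.4f}")

np.random.seed(0)
check_normality(np.random.normal(0, 1, 100))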

Shapiro-Wilk test

The Shapiro-Wilk test is a statistical test used to determine if a sample of data is normally distributed. It tests the null hypothesis that the sample comes from a normal population. The test is based on the idea that if a sample is drawn from a normal distribution, an optimally weighted sum of the ordered sample values will be close to the value expected under normality. The test statistic is calculated and compared to a critical value to determine whether the null hypothesis should be rejected.

The Shapiro-Wilk test can be visually explained by using a Q-Q plot, also known as a quantile-quantile plot. This plot shows the observed values on the y-axis and the expected values based on a normal distribution on the x-axis. If the sample comes from a normal distribution, the points on the plot should follow a straight line. If the sample deviates significantly from normality, the points on the plot will deviate from the line. In this way, the Q-Q plot provides a graphical representation of the results of the Shapiro-Wilk test, allowing you to quickly assess the normality of your sample.
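
To see what a deviation looks like, here is a minimal sketch that draws a clearly non-normal (exponential) sample and plots it with stats.probplot; the points bend away from the reference line, especially in the tails:

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

np.random.seed(0)
skewed = np.random.exponential(scale=1.0, size=100)  # right-skewed sample

# Points should curve away from the straight line for non-normal data
stats.probplot(skewed, plot=plt)
plt.title("Q-Q plot of an exponential sample")
plt.show()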

The Shapiro-Wilk test statistic (W) is calculated as:

W = (sum(a_i * x_(i)))² / sum((x_i - x̄)²)

where x_(i) is the ith ordered sample value, x̄ is the sample mean, n is the sample size, and the weights a_i are derived from the expected values and covariance matrix of the order statistics of a standard normal sample; they are chosen to maximize the correlation between the ordered sample and a normal distribution. The test statistic W is compared to a critical value from a table or distribution to determine whether the null hypothesis of normality should be rejected. A low value of W indicates that the sample deviates significantly from normality, while a value of W close to 1 supports the null hypothesis.
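
To illustrate how W behaves, here is a minimal sketch comparing the statistic on a normal sample and a heavily skewed (exponential) one; the sample sizes are arbitrary:

import numpy as np
import scipy.stats as stats

np.random.seed(0)
normal_sample = np.random.normal(0, 1, 100)
skewed_sample = np.random.exponential(1.0, 100)

w_norm, p_norm = stats.shapiro(normal_sample)
w_skew, p_skew = stats.shapiro(skewed_sample)
print(f"normal sample: W={w_norm:.4f}, p={p_norm:.4f}")  # W close to 1
print(f"skewed sample: W={w_skew:.4f}, p={p_skew:.4f}")  # noticeably lower W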

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate a random sample from a normal distribution
np.random.seed(0)
data = np.random.normal(0, 1, 100)

# Perform the Shapiro-Wilk test
w, p_value = stats.shapiro(data)

# Check the p-value of the test
alpha = 0.05
if p_value > alpha:
    print("Sample looks normal (fail to reject H0)")
else:
    print("Sample does not look normal (reject H0)")

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()

Output

This code generates a random sample of 100 data points from a normal distribution and performs the Shapiro-Wilk test to check if the sample comes from a normal distribution. The p-value of the test is compared against the significance level (alpha) to determine whether to reject or fail to reject the null hypothesis (that the sample comes from a normal distribution). Finally, a Q-Q plot is generated to visually inspect the normality of the sample.

Anderson-Darling test

The Anderson-Darling test is a statistical test used to determine if a sample of data is normally distributed. It is a modified version of the Kolmogorov-Smirnov test, with a greater emphasis on the tails of the distribution. The test statistic measures the difference between the cumulative distribution function of the sample and the cumulative distribution function of a normal distribution. The larger the difference between the two, the stronger the evidence against normality. The test statistic is compared to critical values from a table or distribution to determine whether the null hypothesis of normality should be rejected. The Anderson-Darling test is considered to be one of the most powerful tests for detecting deviations from normality, especially in the tails of the distribution.

As with the Shapiro-Wilk test, a Q-Q plot provides a graphical companion to the Anderson-Darling test: points from a normal sample should follow a straight line, and deviations from the line indicate departures from normality. Because the Anderson-Darling test gives more weight to the tails of the distribution, deviations from normality in the tails will result in a greater deviation from the line on the Q-Q plot.

The Anderson-Darling test statistic (A²) is calculated as:

A² = -n - (1/n) * sum[(2i-1) * log(F(X_i)) + (2n-2i+1) * log(1-F(X_i))]

where n is the sample size, X_i is the ith ordered sample value, F(X_i) is the cumulative distribution function of a normal distribution evaluated at X_i, and the sum is taken over all i = 1 to n. The test statistic A² is compared to critical values from a table or distribution to determine whether the null hypothesis of normality should be rejected. A high value of A² indicates that the sample deviates significantly from normality, while a low value of A² supports the null hypothesis.
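
As a rough check on the formula, here is a minimal sketch that computes A² by hand, standardizing with the sample mean and standard deviation as scipy.stats.anderson does internally, and compares it to SciPy's statistic:

import numpy as np
import scipy.stats as stats

np.random.seed(0)
data = np.sort(np.random.normal(0, 1, 100))
n = len(data)

# Standardize with estimated parameters, then evaluate the normal CDF
z = stats.norm.cdf((data - data.mean()) / data.std(ddof=1))

# Apply the A^2 formula given above
i = np.arange(1, n + 1)
A2 = -n - np.sum((2*i - 1) * np.log(z) + (2*n - 2*i + 1) * np.log(1 - z)) / n

print(f"manual A^2: {A2:.4f}")
print(f"scipy A^2:  {stats.anderson(data).statistic:.4f}")  # should match closely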

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate a random sample from a normal distribution
np.random.seed(0)
data = np.random.normal(0, 1, 100)

# Perform the Anderson-Darling test
result = stats.anderson(data)

# Compare the test statistic against the critical values
critical_values = result.critical_values
significance_level = result.significance_level
for i in range(len(critical_values)):
    if result.statistic < critical_values[i]:
        print("Sample looks normal (fail to reject H0) at significance level: ", significance_level[i])
    else:
        print("Sample does not look normal (reject H0) at significance level: ", significance_level[i])

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()

This code generates a random sample of 100 data points from a normal distribution and performs the Anderson-Darling test to check if the sample comes from a normal distribution. The test statistic is compared against the critical values at different significance levels to determine whether to reject or fail to reject the null hypothesis (that the sample comes from a normal distribution). Finally, a Q-Q plot is generated to visually inspect the normality of the sample.

Kolmogorov-Smirnov test

The Kolmogorov-Smirnov (KS) test is a statistical test used to determine if a sample of data is normally distributed. The test statistic measures the maximum difference between the cumulative distribution function of the sample and the cumulative distribution function of a theoretical normal distribution. The larger the difference, the stronger the evidence against normality. The test statistic is compared to critical values from a table or distribution to determine whether the null hypothesis of normality should be rejected. The KS test is simple and widely used, but it can be less powerful than other tests such as the Anderson-Darling test, which places more emphasis on the tails of the distribution. Note also that the standard KS test assumes the parameters of the reference normal distribution are specified in advance; estimating them from the sample makes the usual p-values only approximate.

Here too, a Q-Q plot provides a graphical companion to the test: points from a normal sample should follow a straight line, and deviations from the line indicate departures from normality. The Kolmogorov-Smirnov test is a two-sided test, so deviations from normality in either tail will result in a deviation from the line on the Q-Q plot.

The Kolmogorov-Smirnov test statistic (D) is calculated as:

D = max |F(X_i) - i/n|

where X_i is the ith ordered sample value, F(X_i) is the cumulative distribution function of a normal distribution evaluated at X_i, and n is the sample size. The maximum difference between the sample cumulative distribution and the theoretical cumulative distribution is taken over all i = 1 to n. The test statistic D is compared to critical values from a table or distribution to determine whether the null hypothesis of normality should be rejected. A large value of D indicates that the sample deviates significantly from normality, while a small value of D supports the null hypothesis.
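
As a quick check, here is a minimal sketch that computes D from the empirical CDF, using the two one-sided differences that scipy.stats.kstest also uses, and compares it to SciPy's statistic:

import numpy as np
import scipy.stats as stats

np.random.seed(0)
data = np.sort(np.random.normal(0, 1, 100))
n = len(data)

# Theoretical standard-normal CDF at each ordered sample value
F = stats.norm.cdf(data)
i = np.arange(1, n + 1)

# D is the largest gap between the empirical and theoretical CDFs
D_manual = max(np.max(i / n - F), np.max(F - (i - 1) / n))

D_scipy, p_value = stats.kstest(data, 'norm')
print(f"manual D: {D_manual:.4f}")
print(f"scipy D:  {D_scipy:.4f}")  # should match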

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate a random sample from a normal distribution
np.random.seed(0)
data = np.random.normal(0, 1, 100)

# Perform the Kolmogorov-Smirnov test against the standard normal N(0, 1),
# which matches how this sample was generated
D, p_value = stats.kstest(data, 'norm')

# Check the p-value of the test
alpha = 0.05
if p_value > alpha:
    print("Sample looks normal (fail to reject H0)")
else:
    print("Sample does not look normal (reject H0)")

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()

This code generates a random sample of 100 data points from a normal distribution and performs the Kolmogorov-Smirnov test to check if the sample comes from a normal distribution. The p-value of the test is compared against the significance level (alpha) to determine whether to reject or fail to reject the null hypothesis (that the sample comes from a normal distribution). Finally, a Q-Q plot is generated to visually inspect the normality of the sample.

Example — Car Purchasing Dataset

Importing Libraries

import pandas as pd
import numpy as np
import math
from scipy.stats import norm
import matplotlib.pyplot as plt
%matplotlib inline

Reading the dataset

cars = pd.read_csv("car_purchasing.csv", encoding='latin-1', na_values='NA')

Data Exploration

cars.car_purchase_amount  # inspect the values of the target column

Exploring the data

Plotting a histogram of car_purchase_amount suggests that a normal distribution is a good candidate for this variable.

# histogram of purchase amounts, normalized to a density
cars.car_purchase_amount.hist(density=True, bins=20)

Using maximum likelihood estimation (MLE) to find the best-fit parameters. For a normal distribution, the MLE of the mean is the sample mean; note that pandas' .var() uses the unbiased (n-1) denominator, which differs from the MLE variance only slightly for large samples.

mean = cars.car_purchase_amount.mean()
variance = cars.car_purchase_amount.var()
stddev = math.sqrt(variance)
print("Mean from maximum likelihood",mean)
print("Variance from maximum likelihood",variance)
print("Standard deviation from maximum likelihood",stddev)

Normal PDF

We calculate the mean and variance of the data, then plot the normal PDF on top of the histogram.

cars.car_purchase_amount.hist(density = True)
x_min,x_max = plt.xlim()
plt.plot(np.linspace(x_min,x_max), norm.pdf(np.linspace(x_min,x_max),mean,stddev))
mu, std = norm.fit(cars.car_purchase_amount)
print("Mean after fitting", mu)
print("Standard deviation after fitting", std)

Since the mean and standard deviation before and after fitting are so close, the normal PDFs plotted before and after fitting overlap each other.

# plotting before and after fitting
cars.car_purchase_amount.hist(density = True)
x_min,x_max = plt.xlim()
plt.plot(np.linspace(x_min,x_max), norm.pdf(np.linspace(x_min,x_max),mean,stddev),linestyle='dashed', color='green')
plt.plot(np.linspace(x_min,x_max), norm.pdf(np.linspace(x_min,x_max),mu,std),linestyle='dashed', color='yellow')

Shapiro Wilk Test

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Use the car purchase amounts from the dataset
data = cars.car_purchase_amount
# Perform the Shapiro-Wilk test
w, p_value = stats.shapiro(data)

# Check the p-value of the test
alpha = 0.05
if p_value > alpha:
    print("Sample looks normal (fail to reject H0)")
else:
    print("Sample does not look normal (reject H0)")

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()

Anderson-Darling test

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt


data = cars.car_purchase_amount
# Perform the Anderson-Darling test
result = stats.anderson(data)

# Compare the test statistic against the critical values
critical_values = result.critical_values
significance_level = result.significance_level
for i in range(len(critical_values)):
    if result.statistic < critical_values[i]:
        print("Sample looks normal (fail to reject H0) at significance level: ", significance_level[i])
    else:
        print("Sample does not look normal (reject H0) at significance level: ", significance_level[i])

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()

Kolmogorov-Smirnov test

import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = cars.car_purchase_amount
# Perform the Kolmogorov-Smirnov test. kstest compares against a standard
# normal N(0, 1) by default, so pass the sample's estimated mean and standard
# deviation; estimating parameters from the data makes the p-value approximate.
D, p_value = stats.kstest(data, 'norm', args=(data.mean(), data.std()))

# Check the p-value of the test
alpha = 0.05
if p_value > alpha:
    print("Sample looks normal (fail to reject H0)")
else:
    print("Sample does not look normal (reject H0)")

# Generate a Q-Q plot
stats.probplot(data, plot=plt)
plt.show()
