Kolmogorov-Smirnov test may not be doing what you think when parameters are estimated from the data sample

Pablo Alonso
7 min read · Sep 12, 2021


As a data scientist, you might be interested in checking whether your data follows a particular distribution. A common way to check this is the Kolmogorov-Smirnov goodness of fit test, which is conveniently implemented in SciPy. So, if we want to check whether our data follows, for example, a normal distribution, we can just do the following:

import numpy as np
from scipy.stats import kstest

mu = np.mean(my_data)  # parameters estimated from the data itself
s = np.std(my_data)
_, pvalue = kstest(my_data, "norm", args=(mu, s))

If the p-value is small enough (e.g. below the commonly used 5% significance level), we reject the hypothesis that the data follows the specified distribution, right?

Not quite!

If we read carefully, the p-values are no longer reliable when the distribution parameters are estimated from the data itself. If we do not pay attention, this can be a pitfall in our analysis. In this post, I will show you examples of this problem and possible solutions. You can find a Jupyter notebook with all the code of this post here [1].

This post is based on these Stack Exchange replies [2, 3, 4, 5]. Also, since p-values are involved, I encourage you to check these articles [6, 7] about p-values.

Let’s jump in!

Kolmogorov-Smirnov test visualized

First, let's get an insight into what the KS test does. As in any hypothesis test, we assume a null hypothesis: the sample, which comes from an unknown continuous distribution (P), was drawn from the reference continuous distribution (P0), i.e. H0: P = P0.

Given the sample, the test computes its empirical Cumulative Distribution Function (eCDF) and compares it to the Cumulative Distribution Function (CDF) of the distribution under study.

The comparison consists of subtracting the CDF from the eCDF and keeping the maximum difference, in absolute value, as the statistic Dn. For a large enough sample size, √n·Dn follows the Kolmogorov distribution under the null hypothesis. If Dn is larger than the critical value associated with the chosen significance level α, we reject the null hypothesis that the sample was drawn from the reference distribution. The probability of observing a value of Dn at least as extreme as the one obtained, under the null hypothesis, is given by the p-value of the test.

(left) PDF of a Normal(0, 1) and a sample of size 60 drawn from it; (right) the empirical Cumulative Distribution Function (eCDF) of the sample in blue, the Cumulative Distribution Function (CDF) of the Normal(0, 1) in black, and the Dn statistic of the Kolmogorov-Smirnov test.
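As an illustration (not part of the original notebook), here is a minimal sketch of how Dn can be computed by hand for a fully specified N(0, 1) and compared against SciPy's kstest; the variable names are my own:

import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(0)
sample = np.sort(rng.normal(0, 1, size=60))  # a size-60 sample, as in the figure
n = len(sample)
cdf = norm.cdf(sample, loc=0, scale=1)
# the eCDF jumps at each sample point, so check the gap on both sides of each jump
gap_above = np.arange(1, n + 1) / n - cdf
gap_below = cdf - np.arange(0, n) / n
d_n = max(gap_above.max(), gap_below.max())
print(d_n, kstest(sample, "norm", args=(0, 1)).statistic)  # the two should match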

If you are interested in going into the details of the test you can have a look at [8] and [9].

KS-test problems

As we have already discussed in the introduction of this post, the p-values of the KS test are not correct when the reference distribution parameters are obtained directly from the sample under study.

This paper [10] shows that when location and scale are the parameters estimated from the sample, tables of significance values for the KS statistic can be constructed for those distributions. This is precisely what the Lilliefors test does for the normal and exponential distributions.

The Lilliefors test is just a KS test with a different way of computing the p-values, one that accounts for the estimation of location and scale by running a Monte Carlo simulation once and tabulating the results. I won't dig into the details here, but if you are interested you can take a look at Lilliefors' original papers [11, 12] or check the statsmodels implementation [13] of the significance value simulation.

Let's see the Lilliefors test in action with an example. Think of a process believed to be normally distributed, N(0, 1), which in reality is a combination of one main mode N(0, 1) that appears 75% of the time plus two noise modes, N(1.2, 0.7) and N(-1.5, 0.55). So the real process is:

1600 samples drawn from a mixture of Gaussians: 0.75 N(0, 1) + 0.125 N(1.2, 0.7) + 0.125 N(-1.5, 0.55). The idea is to simulate a process we believe is N(0, 1) but that in reality has two noisy modes, at 1.2 and -1.5, with 12.5% probability each.
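A minimal sketch for drawing such a sample (the weights and parameters are the ones stated above; the variable names are my own):

import numpy as np

rng = np.random.default_rng(42)
# pick one of the three modes for each of the 1600 draws
components = rng.choice(3, size=1600, p=[0.75, 0.125, 0.125])
means = np.array([0.0, 1.2, -1.5])
stds = np.array([1.0, 0.7, 0.55])
my_data = rng.normal(means[components], stds[components])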

If we apply the KS test with the fully specified Normal(0, 1), as the test is meant to be used, we get:

>>> kstest(my_data, "norm", args=(0, 1))
KstestResult(statistic=0.06213, pvalue=8.198e-06)

That is a very small p-value, thus we’d reject the null hypothesis and we’d conclude that the sample does NOT come from a N(0, 1).

However, let's now run the test assuming only that the process is Gaussian, estimating its parameters from the sample:

>>> from scipy.stats import norm
>>> kstest(my_data, "norm", args=norm.fit(my_data))
KstestResult(statistic=0.0299, pvalue=0.111869)

Now the p-value is 0.11, not that small. We would not be able to reject the null hypothesis. Let’s compare with a Lilliefors test:

>>> from statsmodels.stats.diagnostic import lilliefors
>>> lilliefors(my_data, dist="norm")
(0.029837, 0.002003)  # (statistic, p-value)

The p-value of the Lilliefors test is 0.002, small enough to safely reject the null hypothesis. With this example, we can see the effect of using the right test when the parameters are estimated from the sample.

Beyond Lilliefors: other distributions

In this section, we explore how to extend the KS test with estimated parameters to distributions other than the normal and exponential, for which the Lilliefors test cannot be used.

As the Stack Exchange posts [2, 3, 4, 5] already suggest, we can use Monte Carlo simulations to compute the p-values. The idea is as follows:

1) Estimate the parameters P of distribution D from the sample S of size n_sample
2) D0 <- KStest(S, D(P))  # KS statistic with the estimated parameters
3) statistics <- []
4) For 1...number_of_simulations Do:
       nx <- draw n_sample samples from D(P)
       P2 <- parameters of D estimated from nx
       Dn <- KStest(nx, D(P2))
       statistics.append(Dn)
5) p_value <- proportion(statistics >= D0)

So, in plain English: estimate the parameters P of the distribution from the data and compute the KS test statistic D0. Then, draw as many samples as your data size from the fitted distribution, fit the parameters of the distribution again with the samples you have just drawn, and compute their KS test statistic. Repeat the process. Finally, check how many runs were at least as extreme as D0 and divide by the number of runs. That's your p-value.
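Below is a minimal Python sketch of this procedure. The function name ks_test_monte_carlo and its interface are my own choices for illustration, not a SciPy or statsmodels API; it assumes dist is a SciPy continuous distribution object (e.g. scipy.stats.lognorm) exposing fit, rvs and name.

import numpy as np
from scipy.stats import kstest

def ks_test_monte_carlo(sample, dist, n_sims=1000, random_state=None):
    # KS test with parameters estimated from `sample`; p-value by simulation
    rng = np.random.default_rng(random_state)
    n = len(sample)
    # 1-2) fit the distribution to the observed sample and compute D0
    params = dist.fit(sample)
    d0 = kstest(sample, dist.name, args=params).statistic
    # 3-4) repeat the whole procedure (sampling + re-fitting) under the null
    stats = np.empty(n_sims)
    for i in range(n_sims):
        simulated = dist.rvs(*params, size=n, random_state=rng)
        sim_params = dist.fit(simulated)
        stats[i] = kstest(simulated, dist.name, args=sim_params).statistic
    # 5) p-value: fraction of simulated statistics at least as extreme as D0
    return d0, np.mean(stats >= d0)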

A similar approach, although with a twist that uses the probability integral transform, is used in statsmodels for computing the p-values of the Lilliefors test for N(0, 1) and Exp(1).

Let's jump into an example: we sample 500 points from a Beta(2, 5) and check whether they follow a lognormal distribution.

If we run a kstest with a fitted lognorm we get:

>>> kstest(my_data, "lognorm", args=lognorm.fit(my_data))
KstestResult(statistic=0.0400, pvalue=0.38786)

But with a 1000-run Monte Carlo simulation we get a p-value of 0.017, a huge difference.
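For reference, the experiment above can be reproduced along these lines with the ks_test_monte_carlo sketch from the previous section (the exact p-value will vary from run to run because of the simulation randomness):

from scipy.stats import beta, lognorm

my_data = beta.rvs(2, 5, size=500, random_state=0)
d0, p_value = ks_test_monte_carlo(my_data, lognorm, n_sims=1000, random_state=0)
print(d0, p_value)  # expect a p-value much smaller than the plain kstest one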

Real-world data example

Enough synthetic data; let's see all the concepts we have discussed so far in a real-world example. We will use the well-known Iris flower dataset, more specifically the sepal length attribute, as shown below.

Histogram of the sepal length attribute of the Iris flower dataset, along with a normal distribution fitted from the data (black) and a fitted lognormal (red).

We will test whether the data follows a normal or a lognormal distribution with the KS test, the Lilliefors test and our Monte Carlo implementation.
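A minimal sketch of how these tests could be run on the sepal length column, assuming scikit-learn is used to load the dataset and reusing the ks_test_monte_carlo sketch from above:

from sklearn.datasets import load_iris
from scipy.stats import kstest, norm, lognorm
from statsmodels.stats.diagnostic import lilliefors

sepal_length = load_iris().data[:, 0]  # first column is sepal length (cm)
# normal fit: plain KS test, Lilliefors and Monte Carlo p-values
print(kstest(sepal_length, "norm", args=norm.fit(sepal_length)))
print(lilliefors(sepal_length, dist="norm"))
print(ks_test_monte_carlo(sepal_length, norm, n_sims=1000))
# lognormal fit: Lilliefors does not apply, so only KS and Monte Carlo
print(kstest(sepal_length, "lognorm", args=lognorm.fit(sepal_length)))
print(ks_test_monte_carlo(sepal_length, lognorm, n_sims=1000))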

(left) Test results for the normal fit and (right) test results for the lognormal fit.

As we can see, the p-values of the KS test are roughly an order of magnitude larger, which means the test is conservative: the actual Type I error is smaller than what the p-value suggests.

For the normal fit we can use the Lilliefors test, but not for the lognormal one. We can see a discrepancy between the Lilliefors results and our Monte Carlo results, which comes from two main sources:

  • Statsmodels uses orders of magnitude more simulation runs for its Monte Carlo tables than this experiment.
  • Statsmodels uses an approximation for sample sizes that were not explicitly simulated.

Resources

Remember that you can find the code and reproduce all the results of this post in the Jupyter notebook of this repository [1].

Conclusion (TL;DR)

In this post, we have looked at the Kolmogorov-Smirnov goodness of fit test, the idea behind it, and the issue that arises when we fit the distribution parameters from the sample data. If the distribution under study is normal or exponential, we can overcome this limitation by using the Lilliefors test. For any other distribution, we must run a Monte Carlo simulation to obtain the p-values.

Other goodness of fit tests, like the Anderson-Darling test, which is more sensitive in the distribution tails, suffer from the same problem, and a Monte Carlo simulation must be carried out when fitting parameters from the data.

References

[1] Repository with code and Jupyter notebook of this article
[2] A naive question about the Kolmogorov Smirnov test
[3] Simulation of KS-test with estimated parameters
[4] How can one compute Lilliefors’ test for arbitrary distributions?
[5] Why do two implementations of the Anderson-Darling test produce such different p-values?
[6] Cohen J. Things I have learned (so far). Am Psychol. 1990;45:1304–1312
[7] Sullivan GM, Feinn R. Using Effect Size-or Why the P Value Is Not Enough. J Grad Med Educ. 2012;4(3):279–282.
[8] Kolmogorov-Smirnov and Mann-Whitney-Wilcoxon tests
[9] Kolmogorov-Smirnov test
[10] F. N. David and N. L. Johnson. The Probability Integral Transformation When Parameters are Estimated from the Sample. Biometrika. 1948; Vol. 35, pp. 182–190
[11] Lilliefors, Hubert W. On the Kolmogorov-Smirnov Test for Normality with Mean and Variance Unknown. Journal of the American Statistical Association. 1967; 62:318, 399–402
[12] Lilliefors, Hubert W. On the Kolmogorov-Smirnov Test for the Exponential Distribution with Mean Unknown, Journal of the American Statistical Association. 1969; 64:325, 387–389
[13] Statsmodels lilliefors_critical_value_simulation.py
