Large amount of observations: Statistical Test not so Statistical

Laurae
Data Science & Design
4 min read · Nov 13, 2016

Laurae: This post is about why a statistical test is not very statistical when the number of observations you feed it is “large”, with a full example on the Allstate competition and the normal/gaussian transformations of its “loss” target. The post was originally published at Kaggle.

junwang writes:

boxcox(loss+200) seems more normal than log(loss+200), but its MAE is larger. If normality has no impact, what drives log(loss+200) to have better performance?

Diego writes:

good question, I just want to add a graph to help visualize it:

It is also strange that the normality test (scipy.stats mstats.normaltest) returns a p-value of zero for both data sets:

log: NormaltestResult(statistic=3797.686866828305, pvalue=0.0)
boxcox: NormaltestResult(statistic=2755.7509220854868, pvalue=0.0)

According to the docs (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.mstats.normaltest.html), “This function tests the null hypothesis that a sample comes from a normal distribution”. If the p-val is very small, it means it is unlikely that the data came from a normal distribution.

Anyone would know why?

There are several ways to assess the normality of a distribution: visually and/or statistically.

For a visual assessment, you draw a Q-Q plot of the sample quantiles versus the theoretical quantiles. If your data are exactly gaussian (normal distribution), the points should align on a single straight line. In practice, with real data sets, this is rarely the case, and it is up to you how much deviation from the diagonal line you tolerate before declaring non-normality.
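
A minimal sketch of such a Q-Q plot in Python (scipy + matplotlib), using synthetic lognormal data as a stand-in for the actual Allstate “loss” column:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Assumption: synthetic lognormal data stands in for the Allstate "loss" target.
rng = np.random.default_rng(1)
loss = rng.lognormal(mean=7.7, sigma=0.8, size=10000)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Raw target: points should bend away from the straight line (non-normal).
stats.probplot(loss, dist="norm", plot=axes[0])
axes[0].set_title("raw loss")

# After a log transformation: points should hug the line much more closely.
stats.probplot(np.log(loss + 200), dist="norm", plot=axes[1])
axes[1].set_title("log(loss + 200)")

plt.tight_layout()
plt.show()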

For instance, this is a normal distribution:

This is clearly not a normal distribution and requires prior transformation (ignore axis names):

As for statistical tests, there are many normality tests. The four best-known ones, ordered by strength from empirical testing (a Python sketch of where each lives follows the list):

  • Shapiro-Wilk
  • Anderson-Darling (note: the scipy normaltest in the link above actually implements D’Agostino-Pearson’s test, another omnibus test)
  • Lilliefors
  • Jarque-Bera
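
As a rough sketch of where these tests live in Python (scipy and statsmodels), assuming x is a 1-D array holding your (possibly transformed) target:

import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# Assumption: x stands in for the transformed target, e.g. log(loss + 200).
x = np.random.default_rng(2).normal(size=5000)

# Shapiro-Wilk (scipy warns that the p-value may be inaccurate above 5000 samples)
print("Shapiro-Wilk      :", stats.shapiro(x))

# Anderson-Darling (returns a statistic and critical values instead of a p-value)
print("Anderson-Darling  :", stats.anderson(x, dist="norm"))

# Lilliefors (a Kolmogorov-Smirnov variant with estimated mean and variance)
print("Lilliefors        :", lilliefors(x, dist="norm"))

# Jarque-Bera (based on skewness and kurtosis)
print("Jarque-Bera       :", stats.jarque_bera(x))

# D'Agostino-Pearson, the test behind scipy.stats.normaltest from the question
print("D'Agostino-Pearson:", stats.normaltest(x))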

All existing normality tests “fail” (stop providing a reliable answer) when the number of samples is large enough. On Allstate, for instance, all of these tests are unreliable. Taking the Shapiro-Wilk test as an example (a toy illustration follows these points):

  • When the number of samples increases (say, beyond 1,000), the null hypothesis has higher odds of being rejected even when the data come from the same distribution, because any small deviation becomes detectable; this effect is even stronger with 5+ digit sample sizes
  • When the number of samples decreases (say, around 100), the null hypothesis has lower odds of being rejected because the test is less sensitive to outliers, but this forces the researcher to look more carefully at visuals and to use several tests (since one test could reject the null hypothesis while another does not)
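
A toy illustration of this effect, using a t-distribution with 20 degrees of freedom as a hypothetical stand-in for “nearly normal but not exactly normal” data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Nearly-normal data: a t-distribution with 20 degrees of freedom has slightly
# heavier tails than a normal, which is hard to spot on a Q-Q plot.
population = rng.standard_t(df=20, size=200_000)

# Small sample: Shapiro-Wilk will usually NOT reject normality.
print("n=100 :", stats.shapiro(rng.choice(population, size=100, replace=False)))

# Larger sample: the same tiny deviation is now much more likely to be flagged.
print("n=5000:", stats.shapiro(rng.choice(population, size=5000, replace=False)))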

This is true for all the normality tests. Any small deviation from normality leads to rejection of the null hypothesis once the sample size is large enough, even if your Q-Q plot shows a near-perfect alignment: normality tests become heavily sensitive to kurtosis in the “large number of observations” regime. A very slight difference can flip the test result as your sample size increases.

This is also true for most omnibus tests (tests that try to detect any kind of departure, such as a variance difference). Most normality tests are omnibus tests, which is why normality tests should be avoided on the Allstate data, for instance.

Extra: in contrast to omnibus tests, which are sensitive to kurtosis, tests relying on means are sensitive to skewness.

As a reminder, the test hypotheses are:

  • Null hypothesis: the data is following a normal distribution
  • Alternative hypothesis: the data is not following a normal distribution
  • p-value < alpha (significance threshold, typically alpha=0.05): the null hypothesis is rejected (one rejects the statement that “the data is following a normal distribution”, but this does not prove that “the data is not following a normal distribution”)

Depending on how many observations you sample before running the same test, the results you get can be entirely different. Ideally, you would:

  • Create many samples of different sizes from the label data (and, if possible, with a normal distribution), or just create synthetic normal sets
  • Compute the p-value from a normality test on each sample
  • Compute, for each sample size, the ratio of null hypothesis rejections (p-value < alpha)
  • Draw the curve of sample size vs ratio of null hypothesis rejections

And you should end up with an increasing rejection-ratio curve, even though every sample comes from the same distribution, as in the sketch below.
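
A minimal sketch of that experiment, again with nearly-normal synthetic data (swap in subsamples of log(loss + 200) to reproduce it on the Allstate target itself):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_repeats = 200

# Nearly-normal data: a t-distribution with 20 degrees of freedom, so the
# deviation from normality is tiny but real.
for n in [100, 500, 1000, 5000, 20000, 100000]:
    rejections = 0
    for _ in range(n_repeats):
        sample = rng.standard_t(df=20, size=n)
        _, p = stats.normaltest(sample)  # D'Agostino-Pearson, as in the question
        rejections += p < alpha
    print(f"n = {n:>6}: rejection ratio = {rejections / n_repeats:.2f}")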
