The Permutation Test: A Statistical Test that has (More than) Survived the Test of Time

Moshe Sipper, Ph.D.
AI-generated image (craiyon)

Statistical analysis investigates trends, patterns, and relationships in quantitative data, and is a basic tool in data science. In fact, beyond being a tool, many machine learning and deep learning methods are actually based, to a greater or lesser extent, on preexisting statistical methods.

A statistical (hypothesis) test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Here I’d like to talk about something called permutation testing.

The permutation test is little known — in my experience — beyond statistical circles (indeed, I learned of it a few years ago from my biomedical informatician colleague Jason H. Moore, who wears, among others, a statistician’s hat). For example, you will not find it in “17 Statistical Hypothesis Tests in Python”, nor in most other such posts, summaries, and cheat sheets.

Well, just plain… darn.

Conceived in the 1930s by Fisher and Pitman, the permutation test is conceptually simple, barely makes any assumptions, and in my opinion beautifully demonstrates the core of a statistical test.

It has likely not gained traction because for many years it was too costly to compute: the permutation test really only works well once you have access to a modern computer (on that I completely agree with statistician Lloyd Rieber in this YouTube video).

The permutation test is simple to understand and code. It is a nonparametric test to assess the null hypothesis that two different groups come from the same distribution.

How does permutation testing work?

Let’s say you have two different samples, X and Y. For simplicity we’ll assume they’re of equal size (in general this need not be the case). Now let’s go through the following steps:

  1. Compute the difference between the median of sample X and the median of sample Y; this is the observed value of the test statistic — call it T. You’re welcome to replace median with other measures, such as mean.
  2. Combine all samples in X and Y into a single pool — let’s call it Z.
  3. Randomly divide Z into X’ and Y’ (of equal size). Yup, randomly.
  4. Compute the difference between the median of X’ and the median of Y’ — and record it.
  5. Repeat steps 3 and 4 until all permutations have been evaluated. If there are too many permutations to enumerate, stop once you’ve done enough random ones (my go-to value is 10,000).
  6. The p-value of the test equals the proportion of sampled permutations whose absolute difference was at least as large as the absolute value of T. (This is the two-sided p-value; for a one-sided p-value, drop the absolute values.) A minimal code sketch of these steps appears below.
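To make the recipe concrete, here is a minimal sketch in Python (the function name and defaults are my own, for illustration; it uses NumPy and follows the median-based statistic from step 1):

```python
import numpy as np

def median_permutation_test(x, y, n_permutations=10_000, seed=None):
    """Two-sided permutation test for a difference in medians.

    Returns the observed statistic T and the approximate p-value.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    # Step 1: observed test statistic T (difference of medians).
    t_obs = np.median(x) - np.median(y)

    # Step 2: pool all samples into Z.
    z = np.concatenate([x, y])

    # Steps 3-5: repeatedly shuffle the pool, split it into X' and Y',
    # and compare each random split's statistic to the observed one.
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(z)
        x_perm, y_perm = z[:len(x)], z[len(x):]
        # Step 6: count permutations at least as extreme as T.
        if abs(np.median(x_perm) - np.median(y_perm)) >= abs(t_obs):
            count += 1

    return t_obs, count / n_permutations
```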

Essentially, what we’re saying is: if X and Y are really nothing special, and you could easily obtain such sample pairs by randomly dividing your pool of samples, then you’ll end up with a high p-value, meaning we cannot reject the null hypothesis that X and Y come from the same distribution.

On the other hand, if the p-value is low, meaning only a small proportion of our random permutations produced a difference at least as extreme as T, then X and Y are “really” different, that is, we reject the null hypothesis that they come from the same distribution.

Here’s a cool visualization. For those interested in a thorough paper on the topic, there’s “Permutation Methods: A Basis for Exact Inference”.

I’ve used this test in several papers to happily prove (or, less happily, disprove) that a new method I’m proposing is better than others (e.g., to name just two: “Conservation Machine Learning: A Case Study of Random Forests”, and “Neural Networks with À La Carte Selection of Activation Functions”).

For example, say I’ve designed a new learning algorithm and I run it 100 times, recording the best model performance of each run. So now I have 100 scores (for example, 100 accuracy values). Now, for comparison, I’ll also run a known model 100 times, say LightGBM (or ResNet50, or…), obtaining 100 score values.

Finally, I compare the median value of the 100 scores of my new algorithm with the median value of the 100 scores of LightGBM. Hurray! My median value is higher. But, wait — is my higher value statistically significant? Enter permutation testing. Low p-value? I’ve indeed designed a sweet algorithm. High p-value? Back to the drawing board…
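In code, that comparison might look as follows (the scores here are simulated purely for illustration, reusing the median_permutation_test sketch from above):

```python
import numpy as np

# Hypothetical scores for illustration only: 100 runs of "my" algorithm
# vs. 100 runs of a baseline, drawn from two nearby normal distributions.
rng = np.random.default_rng(42)
my_scores = rng.normal(loc=0.87, scale=0.02, size=100)
baseline_scores = rng.normal(loc=0.85, scale=0.02, size=100)

t_obs, p_value = median_permutation_test(my_scores, baseline_scores, seed=0)
print(f"observed median difference: {t_obs:.4f}, p-value: {p_value:.4f}")
# A low p-value (say, below 0.05) suggests the higher median is not
# merely the luck of the random splits.
```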

A Python implementation is available as part of the mlxtend package (beware: the default method is ‘exact’, meaning compute all permutations, which is nigh impossible for large datasets; best set this parameter to ‘approximate’. I learned this the hard way, when I had too many permutations and found myself waiting… and waiting… and waiting…).
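For instance, a call might look like this (continuing with the hypothetical scores from above; the argument names are as I recall them from the mlxtend documentation, so double-check against your installed version):

```python
import numpy as np
from mlxtend.evaluate import permutation_test

# Two-sided test on medians via a custom statistic; 'approximate'
# samples num_rounds random permutations rather than enumerating all.
p_value = permutation_test(
    my_scores,
    baseline_scores,
    func=lambda a, b: np.abs(np.median(a) - np.median(b)),
    method="approximate",
    num_rounds=10_000,
    seed=42,
)
print(f"mlxtend p-value: {p_value:.4f}")
```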

I’ll end this post with a quote from “The Introductory Statistics Course: A Ptolemaic Curriculum” by G. W. Cobb: “In this context, we need to remember that Pitman’s seminal work on the permutation test was published in 1937, at a time when today’s computing power was not even imagined. We’ve seen what happened to Bayesian methods once statisticians were able to compute posterior distributions. We should think hard about the permutation test now that it is so easy to implement. I’m inclined to take things a step beyond that, and suggest that we may even be resisting the permutation test in part because the theory is so analytically shallow. The computer is the only possibility, which may make things entirely too simple for some people!”
