The Permutation Test — Evaluating machine learning model predictions

Rohan Saha
Published in Samur.AI
May 7, 2022

Every machine learning project involves various steps, ranging from data cleaning to model evaluation. This article focuses on a crucial and valuable step in the model evaluation process, particularly for classification problems: the permutation test, also called the resampling test.


NOTE: The article assumes that you have some familiarity with machine learning methods.

Let’s understand the idea behind the permutation test. The main premise of the test is to find out how our trained model compares to a chance model. In other words, we want to know how likely it is that the (better) accuracy/performance we obtained from our trained model is not due to chance. This is an important question and worth thinking about, especially when we use procedures such as nested cross-validation, where the training set has an effect on the final performance of the model. If you think about it carefully, you may start to see that we will end up calculating a p-value that tells us how confident we are in the performance of our model.

The concept is best understood with an example. I will be using a simple classification problem in Python to illustrate the test. The dataset that we will be using is the iris dataset.

Let’s pause for a second and think… If we had only two classes in the target variable y, what would be the accuracy of a binary classification model that predicts the target variable randomly?

That’s right! It would be 50% (assuming the model guesses either class with equal probability). If that’s not clear, take a moment to think about why this is true.
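If you want to convince yourself empirically, a quick sketch like the one below scores purely random guesses against random binary labels; the exact number varies from run to run, but the accuracy hovers around 50%.

import numpy as np
from sklearn import metrics

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)   # random binary "ground truth"
y_guess = rng.integers(0, 2, size=10_000)  # a "model" that guesses at random

print(metrics.accuracy_score(y_true, y_guess))  # prints something close to 0.5

As you will see, most of the code for the permutation test itself is standard machine learning code. Okay, now let’s write it.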

First, let’s import the necessary libraries.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split

Let’s load the iris dataset and extract the features and the targets. One thing to note here is that the target has three classes instead of two.

iris = datasets.load_iris()
X = iris.data
y = iris.target

As the next step, let’s divide the dataset into train and test splits.

# Obtain train test splits.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)

And now the exciting part… training a model.

Since we have a classification problem, we will use Logistic Regression as our model.

NOTE: scikit-learn’s LogisticRegression automatically switches to a multinomial (softmax) formulation when the number of classes is more than two (with the default solver).

# Fit a model
model = LogisticRegression()
model.fit(X_train, y_train)

Finally, let’s obtain the accuracy by comparing the predictions to the ground truth values.

preds = model.predict(X_test)
print("Accuracy: {0}/100".format(metrics.accuracy_score(preds, y_test)*100))
Accuracy: 96.66666666666667/100

Woah! ~96.67% accuracy!

Now, let’s ask the question, “Is this result different from the performance that a chance model might provide?”

To simulate a chance model, we shuffle (permute) the assignments of the samples to the target values. In other words, we randomly assign the target labels to the samples. Why does this work? It works because it creates a scenario where there is no association between the input samples (X) and the output labels (y). Since the labels are assigned to the samples at random, the best any model can do is effectively guess the labels.

Ideally, we run the permutation test many times, because evaluating many different random assignments gives us a better estimate of chance performance. If we evaluated only one random shuffle, it is possible that this particular assignment happens to produce a reasonably good model. This is unlikely, but it can happen.

Okay, so let’s run our model for 100 permutations. Since our dataset is small, the code will run fairly quickly.

The code shown below is the same as the code above, with one small change: on each ‘permutation’ iteration, we shuffle the labels/targets to simulate the random assignment.

# Permutation test or the resampling test.
permuted_accuracies = []
permutation_iters = 100  # We run the procedure many times to have a 'good enough' estimate of the accuracy of a chance model.

for i in range(permutation_iters):
    # Shuffle the targets so that the species labels are randomly assigned to the samples.
    np.random.shuffle(y)

    # Obtain train and test splits; the label assignments are now random.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    permuted_accuracies.append(metrics.accuracy_score(y_test, preds) * 100)

Let’s look at how good our chance model is.

print("Average permuted accuracy is: ", np.mean(permuted_accuracies))Average permuted accuracy is:  33.06666666666667

The chance accuracy is close to 33% because we have three equally frequent classes in the target ‘y’. This means that the model is not able to learn any association between the input samples and the target labels, and so it effectively predicts the labels at random.

And that’s it! That’s the permutation test to evaluate how good our model is compared to a chance model.

Although we can directly see that an accuracy of ~96.67% is greater than a chance accuracy of ~33%, we still need a measure that quantifies this effect. In other words, we want a measure that tells us whether our non-permuted result is significantly above chance. To do this, we can calculate a p-value, which we’ll cover in more detail in a different post!
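As a small preview, one common way to get that p-value is to count how often the permuted accuracies are at least as good as the observed accuracy. The sketch below assumes the permuted_accuracies list and the ~96.67% test accuracy from the code above, and uses the usual +1 correction so the p-value is never exactly zero.

observed_accuracy = 96.67  # accuracy (in percent) of the real, non-permuted model from above
permuted = np.array(permuted_accuracies)

# Fraction of permuted runs that match or beat the real model.
p_value = (np.sum(permuted >= observed_accuracy) + 1) / (len(permuted) + 1)
print("Permutation p-value:", p_value)

If you prefer not to roll this by hand, scikit-learn also provides sklearn.model_selection.permutation_test_score, which wraps the whole procedure (cross-validated scoring, label permutation, and the p-value) in a single call.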

Some helpful resources and papers:

  1. A visual explanation of the permutation test.
  2. Permutation test for studying classifier performance.

If you like this article, consider buying me a coffee :)
