How to leverage permutation tests and bootstrap tests for baselining your Machine Learning models
By Wu Xue and Phillip Adkins
Across the dynamic landscape of Machine Learning, practitioners often dive into the deep end, crafting complex models to solve intricate problems. While these models are vital, it’s equally important to remember that the foundation of any good model is a solid baseline. Baselining provides us with a reality check on our model’s performance, supplies a defense against overfitting, and serves as a reference point for simpler models.
Imagine for a moment that we’re data scientists at a fledgling e-commerce company. We’ve developed a Machine Learning model that predicts whether a customer will make a purchase within the next month based on their activity on our site. The model’s performance looks promising, but we need to establish a robust baseline. How can we be sure that our model is doing something meaningful and not merely making lucky guesses?
Baselining is a crucial step that helps us set realistic expectations and guides us toward model improvements. But how can we effectively perform baselining? The answer lies in robust statistical methods: permutation tests and bootstrap tests.
Understanding permutation tests
Permutation tests, or randomization tests, operate under the null hypothesis that the labels and predictions are independent. In the context of Machine Learning, the null hypothesis assumes that the model is no better than random guessing.
Suppose we have a dataset with n instances and a binary classification model. After training, the model produces a set of predictions. To perform a permutation test:
- Compute a test statistic (such as accuracy or AUC-ROC), T, from the original predictions and labels.
- Randomly permute (shuffle) the labels and compute the test statistic, T_perm, for each permutation. Repeat this process N times to generate a distribution of permuted test statistics.
- Calculate the p-value, the proportion of T_perm that is greater than or equal to T. If the p-value is less than a predetermined significance level (typically 0.05 or 0.01), reject the null hypothesis.
It is pretty straightforward to implement a permutation test in Python, as shown in the following code.
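One possible implementation might look like the following minimal sketch. It assumes you already have arrays of true labels and model predictions (the names `y_true` and `y_pred` are illustrative) and uses accuracy as the test statistic; any metric with the same `(labels, predictions)` signature would work.

```python
import numpy as np
from sklearn.metrics import accuracy_score


def permutation_test(y_true, y_pred, metric=accuracy_score, n_permutations=10_000, seed=42):
    """Estimate the p-value for the null hypothesis that labels and predictions are independent."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    observed = metric(y_true, y_pred)  # test statistic T on the original labels
    count = 0
    for _ in range(n_permutations):
        permuted_labels = rng.permutation(y_true)        # shuffle the labels
        if metric(permuted_labels, y_pred) >= observed:  # T_perm >= T
            count += 1
    return observed, count / n_permutations  # (T, estimated p-value)
```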
In the permutation test, a p-value of 0 implies that none of the permuted sets of predictions performed as well as or better than the model. By increasing the number of permutations, we create more opportunities for a permuted dataset to achieve a similar or higher score than our model would achieve purely by chance.
For many decent or good models, it's unlikely that a random permutation of the test set predictions would score better, so the estimated p-value often comes out as exactly zero. In reality, the p-value is just very small, but a Monte Carlo–based process with a limited number of samples can't approximate it with enough precision to yield a non-zero result. This isn't a huge problem in itself, but it can cause trouble when computing confidence intervals. One way around this issue is a hack in which we make sure that one of the Monte Carlo permutation samples is the original set of test predictions, which guarantees a non-zero p-value. This makes the test slightly more conservative, but in practice it's a good compromise that keeps the confidence intervals usable.
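In the sketch above, this adjustment amounts to replacing the final p-value calculation with a version that counts the original, unpermuted predictions as one of the permutation samples:

```python
# Treat the original predictions as one extra "permutation" that trivially
# satisfies T_perm >= T, so the estimated p-value can never be exactly zero
# (its floor becomes 1 / (n_permutations + 1)).
p_value = (count + 1) / (n_permutations + 1)
```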
Bootstrap tests in a nutshell
Bootstrap tests use resampling with replacement to estimate the distribution of a statistic. In model evaluation, a bootstrap test compares two models to check whether one only appears better than the other because of chance or a small test set.
Here’s how we can use them in model evaluation:
- Train your model and a baseline model, and compute the performance difference, D.
- Draw a bootstrap sample from the data (a sample of the same size as the original data, but drawn with replacement), and recompute the performance difference, D_bootstrap, for each sample. Repeat this N times to generate a distribution of bootstrapped differences.
- Calculate the p-value, the proportion of D_bootstrap that is less than or equal to 0. If the p-value is less than a predetermined significance level, we conclude that our model significantly outperforms the baseline.
We can also write our own code to implement a bootstrap test in Python.
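The sketch below is one way such code might look. The `bootstrap_test` helper assumes arrays of true labels and of predictions from the two models being compared, with accuracy as the metric; the usage example at the bottom trains a Random Forest and a majority-class dummy baseline on synthetic data that stands in for the e-commerce dataset described earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def bootstrap_test(y_true, pred_model, pred_baseline, metric=accuracy_score,
                   n_bootstrap=10_000, seed=42):
    """Estimate the p-value for the null hypothesis that the model is no better than the baseline."""
    rng = np.random.default_rng(seed)
    y_true, pred_model, pred_baseline = map(np.asarray, (y_true, pred_model, pred_baseline))
    observed_diff = metric(y_true, pred_model) - metric(y_true, pred_baseline)  # D
    count = 0
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(y_true), size=len(y_true))  # resample indices with replacement
        diff = metric(y_true[idx], pred_model[idx]) - metric(y_true[idx], pred_baseline[idx])
        if diff <= 0:  # D_bootstrap <= 0
            count += 1
    return observed_diff, count / n_bootstrap  # (D, estimated p-value)


# Illustrative usage: synthetic data standing in for the e-commerce dataset
X, y = make_classification(n_samples=2_000, weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

diff, p_value = bootstrap_test(y_test, model.predict(X_test), majority.predict(X_test))
print(f"Accuracy difference vs. majority predictor: {diff:.3f}, p-value: {p_value:.4f}")
```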
In the bootstrap test example above, we compared our model against a dummy model that always predicts the majority class. The test shows that our model outperforms the majority predictor in all bootstrapped samples and can therefore be considered significantly better than the baseline model.
In a bootstrap test, you’re not limited to any specific baseline model. The choice of the baseline model is flexible and largely depends on the context of your Machine Learning problem.
For instance, a baseline model can be as simple as a model that always predicts the majority class in a classification problem (as in the previous example) or a model that always predicts the mean (or median) target value in a regression problem. Alternatively, you may choose a more sophisticated model that you consider a “minimum viable model” for your problem as a baseline. The key is that the baseline model should be simple and serve as a point of comparison for the more complex models you want to evaluate. In the next example, we fit a logistic regression model as the baseline and execute a bootstrap test to evaluate our Random Forest model against the logistic regression baseline.
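A sketch of that comparison is below. It reuses the `bootstrap_test` helper, the fitted Random Forest (`model`), and the illustrative train/test split from the previous snippet; the exact numbers you get will depend on your own data.

```python
from sklearn.linear_model import LogisticRegression

# Fit a logistic regression baseline on the same (illustrative) training split
log_reg = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Compare the Random Forest from the previous snippet against the new baseline
diff, p_value = bootstrap_test(y_test, model.predict(X_test), log_reg.predict(X_test))
print(f"Accuracy difference (Random Forest - logistic regression): {diff:.3f}, p-value: {p_value:.4f}")
```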
When models fail the test
In the example above, when comparing our model (a Random Forest classifier) with the baseline (logistic regression), the p-value is 0.33, which is greater than the pre-specified threshold of 0.05. This means we cannot conclude that the model significantly outperforms the simpler baseline. In this case, the more complex Random Forest model may be overfitting, may not generalize well to new data, or the problem might simply not benefit from tree-based modeling techniques.
Baselining, as shown above, can save us from falling into the trap of complexity. If a simpler model can do the job, or if our complex models don’t seem to be doing anything more than random guessing, it’s a signal to revisit our model and data.
Don’t want to write your own code? No problem!
In the examples above, we provide our own code to demonstrate how to implement these two statistical tests in Python. If you want to quickly run a permutation test or bootstrap test to evaluate your model against a random model or a baseline model, you can also leverage existing functions in standard Python libraries. For example, scikit-learn provides a built-in function for permutation tests: permutation_test_score.
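A call along the following lines (again on illustrative synthetic data) returns the original score, the permutation scores, and the estimated p-value:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=2_000, weights=[0.7, 0.3], random_state=42)
clf = RandomForestClassifier(random_state=42)

# Fits the classifier on the original labels and on 1,000 label-permuted copies, using 5-fold CV
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, scoring="accuracy", cv=5, n_permutations=1000, n_jobs=-1, random_state=42
)
print(f"Original score: {score:.3f}, p-value: {p_value:.4f}")
```

Note that permutation_test_score computes the p-value as (C + 1) / (n_permutations + 1), which is exactly the non-zero-p-value adjustment discussed earlier.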
Below we plot a histogram of the permutation scores (the null distribution). The red line indicates the score obtained by the classifier on the original data. The score is much better than those obtained by using permuted data and the p-value is thus very low. This indicates that the model is statistically significantly better than a random model.
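A plot along these lines can be produced from the outputs of the previous snippet, for example with matplotlib:

```python
import matplotlib.pyplot as plt

plt.hist(perm_scores, bins=30, density=True, label="Permutation scores (null distribution)")
plt.axvline(score, color="red", linestyle="--", label=f"Original score = {score:.3f}")
plt.xlabel("Accuracy")
plt.ylabel("Density")
plt.legend()
plt.title("Permutation test null distribution")
plt.show()
```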
To our knowledge, there is no off-the-shelf Python function for running bootstrap tests that compare the performance of two or more models. As shown earlier, though, it only takes a few lines of code to implement one yourself. Writing it yourself also lets you customize the inputs, such as the task type and the evaluation metric, to match exactly how you want to compare model performance.
Wrapping up
Permutation tests and bootstrap tests offer complementary ways to assess whether your model is significantly better than random chance or a specific baseline. Incorporating them into your evaluation toolkit can lead to more robust and reliable models, and ultimately, better decision-making in your data science projects. As always, ensure you understand these methods’ underlying assumptions and limitations to use them effectively.