Why You Need to Spend More Time Evaluating your Machine Learning Models

Turns out a lot of AI Research has been getting this wrong

Aug 28 · 5 min read

Validation. Such a sweet thing. People following me will know that I always recommend adding sources of randomness to your Machine Learning model training. Want the receipts? My article, “Why and How to integrate Randomness into your Machine Learning Projects”, makes the case for adding randomness to your model training process and gives you several ways to do it. In other articles and videos, I have talked about how I prefer exploring configurations and adding randomness before reaching for complex Deep Learning networks. “Accounting for Variance in Machine Learning Benchmarks” is a paper that backs up what I have been arguing. Here’s a happy quote from the paper:

“We show a counter-intuitive result that adding more sources of variation to an imperfect estimator approaches better the ideal estimator at a 51× reduction in compute cost.”

In this article, I will (try to) keep the gloating to a minimum. I will also be going over some interesting takeaways from the paper, including the recommendations by the authors to build fantastic benchmarks for Machine Learning Pipelines. Let me know what you think, and which of the takeaways you found most insightful. If you like such content, make sure you’re following me here so that you can keep up with Machine Learning Research and ideas. Reach out to me if there are any topics in the Machine Learning/AI research space you would like broken down.

Why Evaluating Models Matters

The simple answer: evaluation lets us compare Machine Learning models and pick the best one. We have multiple metrics and multiple ways to compare models because different problems make different metrics relevant.

What most model comparisons get wrong.

More complexity -> too costly to accurately evaluate the model over diverse sets

Spend some time reading that passage. It hints at something really crucial. Having extremely complex models will directly stop you from going over multiple configurations (and data splits) and comparing the models over a diverse set of tests. When we don’t test over these sources of variation, we might actually get the objectively wrong answer.

Remember these are relative to the variance from bootstrapping

Above is a pretty concise illustration of the various ways we could induce variance into our learning agents. The numbers can’t be ignored. The variance can literally change the results of your comparison. Below is a passage that sums up the main point of this section.

This picture is the TL;DR of the whole section

Naturally, there are factors aside from pure metrics that we care about. If the model is too expensive, it’s not worth anything. You might be wondering how we can compute model complexity. Here is a video that introduces one of the best metrics for computing efficiency (performance with respect to complexity), the Bayesian Information Criterion, in less than a minute.
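For reference, the Bayesian Information Criterion is commonly written as BIC = k·ln(n) − 2·ln(L̂), where k is the number of fitted parameters, n the number of samples, and L̂ the maximized likelihood; lower is better. Here is a minimal sketch, assuming a least-squares fit with Gaussian errors (so the likelihood term reduces to n·ln(RSS/n) up to a constant). The function name `bic_gaussian` and all the numbers are made up for illustration.

```python
import math

def bic_gaussian(rss: float, n_samples: int, n_params: int) -> float:
    """BIC for a least-squares fit with Gaussian errors, up to an additive constant.

    rss: residual sum of squares of the fitted model.
    """
    return n_samples * math.log(rss / n_samples) + n_params * math.log(n_samples)

# Hypothetical numbers: the larger model fits slightly better but pays for its size.
small = bic_gaussian(rss=52.0, n_samples=200, n_params=3)
large = bic_gaussian(rss=50.0, n_samples=200, n_params=30)
print(f"small model BIC: {small:.1f}, large model BIC: {large:.1f}")
```

Note that BIC penalizes the number of parameters, not compute time, so treat it as one proxy for complexity rather than a full cost measure.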

How to Design Better Benchmarks for ML Pipelines

If you’ve read this far, the next question on your mind is going to be about how we can get to building better benchmarks for our models. Fear not. As promised, here are the aspects you want to focus on for great comparison benchmarks.

Randomize as many sources of variations as possible

Good model comparisons will have a lot of randomized choices. Think back to the arbitrary choices we make during the machine learning process: the random seed for initialization, the data order, how we initialize the learners, and so on. Randomizing these gives you a more reliable estimate of how the pipeline actually performs. To quote,

“a benchmark that varies these arbitrary choices will not only evaluate the associated variance (section 2), but also reduce the error on the expected performance as they enable measures of performance on the test set that are less correlated (3). This counter-intuitive phenomenon is related to the variance reduction of bagging (Breiman, 1996a; Buhlmann et al., 2002), and helps characterizing better the expected behavior of a machine-learning pipeline, as opposed to a specific fit.”

I found the comparison to bagging particularly interesting. This is why I recommend taking some time to study a wide range of ML concepts. It helps you spot connections between ideas, understand things better, and be more innovative.
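To make the idea of randomizing arbitrary choices concrete, here is a small sketch in plain Python. It is a toy setup of my own, not the paper's benchmark: the dataset, the perceptron, and the names `make_data` and `run_pipeline` are all hypothetical. The point is only that the data split, the weight initialization, and the data order each get their own seed, and every run draws all three fresh.

```python
import random
import statistics

def make_data(n=200, seed=0):
    """Toy 2-class dataset: label is 1 when x0 + x1 > 0, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = (rng.gauss(0, 1), rng.gauss(0, 1))
        y = 1 if x[0] + x[1] > 0 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

def run_pipeline(data, split_seed, init_seed, order_seed, epochs=5, lr=0.1):
    """Train a perceptron and return test accuracy; each arbitrary choice has its own seed."""
    # Source 1: which samples land in train vs. test.
    shuffled = data[:]
    random.Random(split_seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    train, test = shuffled[:cut], shuffled[cut:]

    # Source 2: weight initialization.
    init_rng = random.Random(init_seed)
    w = [init_rng.uniform(-1, 1) for _ in range(2)]
    b = init_rng.uniform(-1, 1)

    # Source 3: the order in which training samples are visited.
    order_rng = random.Random(order_seed)
    for _ in range(epochs):
        order_rng.shuffle(train)
        for x, y in train:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err

    correct = sum((1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) == y for x, y in test)
    return correct / len(test)

data = make_data()
seed_rng = random.Random(42)
# Draw fresh seeds for all three sources on every run instead of fixing any of them.
scores = [
    run_pipeline(data, *(seed_rng.randrange(10**6) for _ in range(3)))
    for _ in range(20)
]
print(f"accuracy: mean={statistics.mean(scores):.3f}, std={statistics.stdev(scores):.3f}")
```

Reporting the mean and spread over such runs describes the pipeline, while a single run only describes one specific fit.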

Use Multiple Data Splits

Most people use a single train-test-validation split. They will split their data once and be done with it. More industrious people might also run some cross-validation. I would recommend also playing around with the ratios used for building the sets. In the words of the team, “For pipeline comparisons with more statistical power, it is useful to draw multiple tests, for instance generating random splits with a out-of-bootstrap scheme (detailed in appendix B).”
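Here is a rough sketch of what drawing several random splits could look like. It follows my reading of “out-of-bootstrap” (train on a sample drawn with replacement, test on whatever was never drawn); the authors' exact procedure is in their appendix B, and the helper name `out_of_bootstrap_split` is my own.

```python
import random

def out_of_bootstrap_split(n_samples, rng):
    """Return (train_idx, test_idx): train is drawn with replacement, test is what was never drawn."""
    train_idx = rng.choices(range(n_samples), k=n_samples)
    drawn = set(train_idx)
    test_idx = [i for i in range(n_samples) if i not in drawn]
    return train_idx, test_idx

rng = random.Random(0)
for split in range(3):
    train_idx, test_idx = out_of_bootstrap_split(1000, rng)
    print(f"split {split}: {len(set(train_idx))} unique train samples, {len(test_idx)} test samples")
```

On average about 37% of the samples (roughly 1/e) are left out of each bootstrap draw, so every split gets a test set of decent size without you having to pick a ratio by hand.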

Account for variance to detect meaningful improvements

It’s important that you always remember that there is a degree of randomness in your results. Running multiple tests is one way to reduce it. But it will never go away unless you go through every possible permutation (this might be impossible, and definitely needlessly expensive). Minor improvements might just be a result of random chance. When dealing with models, always keep a few close-performing ones on hand.
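One simple way to act on this is sketched below. It is my own simplification, not the paper's exact procedure: estimate how often pipeline A beats pipeline B across paired randomized runs, then bootstrap a confidence interval around that estimate. The scores are made up, and the function names and the "interval must sit above 0.5" rule are assumptions for illustration.

```python
import random

def prob_a_beats_b(scores_a, scores_b):
    """Fraction of paired runs in which A scores strictly higher than B."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)

def bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for P(A > B), resampling paired runs."""
    rng = random.Random(seed)
    pairs = list(zip(scores_a, scores_b))
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(pairs, k=len(pairs))
        stats.append(sum(a > b for a, b in sample) / len(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Made-up accuracies from 10 randomized runs of two hypothetical pipelines.
scores_a = [0.912, 0.905, 0.921, 0.899, 0.915, 0.908, 0.917, 0.903, 0.910, 0.919]
scores_b = [0.909, 0.907, 0.915, 0.901, 0.911, 0.910, 0.913, 0.900, 0.912, 0.914]

p = prob_a_beats_b(scores_a, scores_b)
lo, hi = bootstrap_ci(scores_a, scores_b)
print(f"P(A > B) = {p:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# Only call A an improvement if the whole interval sits above 0.5 (a coin flip).
print("meaningful improvement" if lo > 0.5 else "difference may be noise")
```

With only a handful of runs the interval will be wide, which is exactly the signal that a small gap in the averages should not be trusted yet.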


This was an interesting paper. The authors did a great job showing how many arbitrary choices in the Machine Learning process can skew the results. It speaks to the need for comprehensive testing that accounts for randomness. The fact that this paper validates so much of what I have been saying was the icing on the cake. While nothing the paper claimed was controversial, the extent to which it showed how variance can change results was certainly eye-opening.

To get the most out of the paper, I would recommend going through the Appendix. It contains some really interesting details, such as their bootstrap procedure and a comprehensive list of the parameters that can be randomized to get thorough testing.


Here is the annotated version of the paper. Feel free to download it and go through it to see what I found interesting/come across my insights.

Reach out to me

This section contains the details of all my links/work.

If this article got you interested in reaching out to me, then this section is for you. You can reach out to me on any of the platforms, or check out any of my other content. If you’d like to discuss tutoring, text me on LinkedIn, IG, or Twitter. If you’d like to support my work, use my free Robinhood referral link. We both get a free stock, and there is no risk to you. So not using it is just losing free money. If you’re preparing for coding/software interviews, follow my Substack. I post high-quality questions and explanations there to help you nail your FAANG/software interviews.

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

My Substack: https://devanshacc.substack.com/

Live conversations on Twitch: https://rb.gy/zlhk9y

Get a free stock on Robinhood: https://join.robinhood.com/fnud75

