The hidden validation set you aren’t using …

Phillip Chilton Adkins
Data Science at Microsoft
11 min read · May 9, 2023
Image generated by DALL·E 2: a tree outside of, but merged with, a numeric bag.

I love random forests. Really, I love all decision tree–based models. The concept of learning by partitioning is not only simple and intuitive, it’s also easy to use to reason about what’s happening when you’re trying to understand how a model might be “seeing” your data.

For me, unless it's clearly a poor match for the data, a random forest is how I like to start playing around with a new dataset. One of the reasons is that some random forest implementations have a built-in validation set of sorts. If you're not familiar with the type known as "out-of-bag" (OOB) validation, it's an excellent addition to your Machine Learning (ML) development toolkit that I find is often underutilized — in fact, I almost never run into anybody else who regularly uses it!

In this article, I dive into how out-of-bag validation takes less code and mental overhead, how it’s more stable than cross-validation, and how it’s much faster than cross-validation. As helpful as it can be, I also cover when not to use it.

If you’d like to run some code, check out the notebook version of this article, in which I explore how out-of-bag validation can speed up your workflow and potentially even give you a more accurate estimate of model performance than k-fold cross-validation at a fraction of the computational cost!

Why should you care about what OOB validation can do for you?

Two scenarios provide examples of how OOB validation can help: turning around a new dataset quickly and easily, and calibrating model outputs.

Scenario 1: Quick and easy turnaround on a new dataset

Imagine you’re given a large dataset with mostly unknown characteristics, and you’re tasked with exploring whether training a given model on it will be feasible. It’s unknown whether the dataset can support a model, whether the model needs to be nonlinear, and how much data would be required to ensure that the model is extracting the right signal.

In a situation like this, you don’t want to get stuck in the exploration phase trying to decide whether a model may even be feasible on the dataset.

Speed is essential here. You don’t want to take two weeks to decide whether a model will work. You could probably train a linear model rather quickly, but if it doesn’t work you won’t know whether you should have spent more time processing the features and scaling them, among other tasks. You also won’t know whether it didn’t work because the relationship with the target is highly nonlinear.

I recommend opting for something nonlinear just in case. Something tree-based will make it less likely that your model underperforms based on preprocessing choices. Because there is a lot of data, however, it would also be nice if you could quickly iterate through a set of models — in other words, build a learning curve — without having to spend time and mental cycles writing code to set up a validation set or set up k-fold cross validation.

OOB validation can get you up and running with accurate estimates of out-of-sample performance with less code, and with a much lower run-time than k-fold cross validation. It can transform this type of experiment from taking half a day to something you can do over your lunch break.

Scenario 2: Calibrating model outputs

Let’s say you’re training a model on a highly skewed target — say 12-month revenue for a set of customers whose size varies widely. The model has trouble fitting to this distribution, so you log transform it, and the model now appears to be working quite well.

But you don't want to predict the log of the revenue — you want the actual revenue! Contrary to popular assumption, you can't naively invert the log transform to get what you want: by Jensen's inequality, exp(E[log y]) ≤ E[y], so exponentiating the predictions will systematically underpredict the actual revenue.

Instead, you can use a post-calibration step to fit your model's predictions to the actual target distribution. A common way to do this is to gather k-fold predictions and then fit those to the target.

However, to get good results, you may be forced to use a rather large value for k. If k = 20, you’re now waiting 20 times as long to get the cross-validated predictions. Here’s where OOB validation comes to the rescue: Train your model once, and voilà — you’ve already got validation predictions for each training sample, with no waiting, and no unstable calibration results!
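To make that concrete, here's a rough sketch of what an OOB-based calibration step could look like. It leans on the oob_prediction_ attribute covered later in this article, and the names X and revenue are placeholders for your own data:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.isotonic import IsotonicRegression

    # X, revenue: your features and (skewed) target, assumed already loaded.
    rf = RandomForestRegressor(n_estimators=500, oob_score=True, n_jobs=-1, random_state=0)
    rf.fit(X, np.log1p(revenue))                 # train on the log-transformed target

    # Every training row already has a validation-style prediction.
    naive = np.expm1(rf.oob_prediction_)         # naive inverse transform (biased low)

    # Post-calibration: map the naive predictions onto the actual revenue scale.
    calibrator = IsotonicRegression(out_of_bounds="clip").fit(naive, revenue)

    def predict_revenue(X_new):
        return calibrator.predict(np.expm1(rf.predict(X_new)))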

Bootstrapping and the OOB sample

By default in scikit-learn, the Random Forest Regressor uses bootstrap sampling. That is, if you feed in n samples, every decision tree in the forest is trained on n samples chosen with replacement from the original n.

Because the sampling happens with replacement, some samples are almost always left out of each tree's training set: on average, about 36.8 percent of the rows (a fraction of roughly 1/e) are never drawn for a given tree. Those rows are described as being "out of bag" (or OOB, as we've been referring to it).
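If you want to convince yourself of that number, a few lines of NumPy (independent of the article's notebook) will do it:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    sample = rng.integers(0, n, size=n)            # one bootstrap draw: n rows with replacement
    oob_fraction = 1 - np.unique(sample).size / n
    print(f"OOB fraction: {oob_fraction:.3f}")     # about 0.368, i.e., roughly 1/e of the rows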

The OOB sample is interesting because it has the properties of a randomly selected validation set. For each estimator added to the forest we have a “free” validation set that can be used in a manner similar to other forms of validation like “leave one out” or k-fold cross validation.

OOB prediction

We can form a prediction for the OOB samples simply by feeding them into the decision tree that was trained during the round in which they were left out of training. These predictions are generated for each estimator, so in general each sample will have many OOB predictions.

The standard way of producing a prediction from a random forest is to feed a sample into each decision tree, get a prediction from each, and then average all the predictions. In a similar way, we can form an “OOB prediction” for each sample in the training set by averaging all the individual OOB predictions for that sample.
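Here's a sketch of that averaging done by hand. It uses BaggingRegressor (which bags decision trees by default) rather than the random forest, because its estimators_samples_ attribute exposes each estimator's bootstrap indices; treat it as an illustration of the mechanism rather than code from the notebook:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import BaggingRegressor

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    bag = BaggingRegressor(n_estimators=200, random_state=0).fit(X, y)

    n = len(y)
    pred_sum = np.zeros(n)
    pred_count = np.zeros(n)
    for tree, sampled in zip(bag.estimators_, bag.estimators_samples_):
        oob = np.ones(n, dtype=bool)
        oob[sampled] = False                       # rows this tree never saw during training
        pred_sum[oob] += tree.predict(X[oob])
        pred_count[oob] += 1

    oob_pred = pred_sum / np.maximum(pred_count, 1)  # average over the trees that left each row out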

Because a given sample was never involved in the training of the tree that produces its OOB prediction, the OOB prediction can provide an estimate of generalization to unseen samples. In fact, it can be mathematically proven that the OOB prediction is in general a slightly pessimistic evaluation of generalization performance.

The OOB prediction is like having a “free” set of cross-validation predictions

The best thing about the OOB prediction is that it comes "free" (i.e., at a very small computational cost) as long as you ask for it, which you can do simply by specifying "oob_score=True". Do that, train the model, and you'll be given access to the "oob_prediction_" attribute, which you can use to compute OOB statistics that are very similar to cross-validation statistics.

In scikit-learn, getting OOB predictions and results is as easy as specifying a single Boolean parameter.

OOB versus cross-validation

My argument is that OOB validation is simply better, in general, than k-fold cross validation. Here are my reasons:

  • It’s easier and simpler to code.
  • It’s more precise (i.e., performance metric estimates have less variation).
  • Because you train only one model instead of k models, it’s k times faster!

In addition, some downstream tasks such as stacking and calibration — and potentially more — are easier to do with OOB predictions than with a k-fold–based approach. I explore each of these reasons as follows.

The code is simpler than k-fold

I find that the code is simply easier to write — I don’t know about you, but for me there’s always a little bit of mental overhead or boilerplate in setting up a k-fold loop or using a cross-validation convenience wrapper. OOB eliminates all that: No loops, no cross-validation wrappers — just a single extra parameter and your OOB predictions are ready to go!

Train a model, obtain OOB predictions, and evaluate model quality in three lines of code.
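A minimal sketch, assuming your features X and target y are already loaded:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    rf = RandomForestRegressor(n_estimators=300, oob_score=True, n_jobs=-1).fit(X, y)
    oob_preds = rf.oob_prediction_                 # one validation-style prediction per row
    oob_mse = mean_squared_error(y, oob_preds)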

Comparison to k-fold. Compare what I’ve just described to using k-fold and manually indexing folds:
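A hand-rolled sketch might look something like this (X and y are assumed to be NumPy arrays, and the details differ from the notebook's version):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    # X, y: your feature matrix and target, assumed already loaded as NumPy arrays.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    fold_preds = np.zeros(len(y))
    fold_mses = []
    for train_idx, val_idx in kf.split(X):
        rf = RandomForestRegressor(n_estimators=300, n_jobs=-1)
        rf.fit(X[train_idx], y[train_idx])
        fold_preds[val_idx] = rf.predict(X[val_idx])
        fold_mses.append(mean_squared_error(y[val_idx], fold_preds[val_idx]))

    cv_mse = np.mean(fold_mses)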

This code is substantially more complex, leaves room for error, and requires more mental overhead.

There’s a lot to track here in terms of indexing and other elements. (To be fair, there are things you can do with this approach that you can’t do with the OOB technique, such as stratified sampling.)

Comparison to X-val convenience wrapper. You can also compare to scikit-learn’s cross validation convenience functions, which can be less elegant and less flexible than the OOB approach:
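Something like the following sketch, again with X and y assumed:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import make_scorer, mean_squared_error
    from sklearn.model_selection import cross_validate

    mse_scorer = make_scorer(mean_squared_error)   # roll our own so the reported score isn't negated
    cv_results = cross_validate(
        RandomForestRegressor(n_estimators=300, n_jobs=-1),
        X, y, cv=5, scoring=mse_scorer,
    )
    cv_mse = cv_results["test_score"].mean()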

This is more concise than coding up x-val from scratch, but arguably it has just as much mental overhead, including having to remember a special function and make a scorer, among other things.

This example shows that you must remember to make your own “mean_squared_error” scorer using “make_scorer” or you must feed in “neg_mean_squared_error”, which results in the output score being negative — it’s a bit awkward!

I went with making my own scorer to avoid any downstream effects from potentially forgetting to negate the output of cv_results[‘test_score’] (for example, accidentally selecting the worst model instead of the best model from a hyperparameter search).

Even the somewhat streamlined “cross_validate” convenience function isn’t as clean, simple, and terse as the OOB validation method.

Each of these may not make a huge difference on its own, but efficiencies in code add up!

OOB validation is more precise than k-fold cross-validation

Let’s measure the precision of OOB validation compared to k-fold cross validation. We’ll do this by repeatedly running an experiment in which we compute a validation statistic (mean squared error) and then analyze the variation of that statistic across repeated runs.
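A condensed sketch of such an experiment is below; the dataset and repeat count are placeholders, and the linked notebook has the full version:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import make_scorer, mean_squared_error
    from sklearn.model_selection import KFold, cross_validate

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
    mse_scorer = make_scorer(mean_squared_error)

    oob_mses, kfold_mses = [], []
    for seed in range(20):                         # repeat to measure the variance of each estimate
        rf = RandomForestRegressor(n_estimators=100, oob_score=True, n_jobs=-1, random_state=seed)
        oob_mses.append(mean_squared_error(y, rf.fit(X, y).oob_prediction_))

        cv = cross_validate(
            RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=seed),
            X, y, cv=KFold(n_splits=5, shuffle=True, random_state=seed), scoring=mse_scorer,
        )
        kfold_mses.append(cv["test_score"].mean())

    print("OOB    MSE: mean %.1f, std %.2f" % (np.mean(oob_mses), np.std(oob_mses)))
    print("5-fold MSE: mean %.1f, std %.2f" % (np.mean(kfold_mses), np.std(kfold_mses)))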

Code for running the OOB vs k-fold experiment. Check the linked notebook for the complete code needed to reproduce and modify the experiment.
Table of mean squared errors comparing OOB validation to k-fold for various values of k. The OOB MSEs are distributed more tightly than the 20-fold estimate and were computed almost 20 times faster.

You can see from these stats that each of these methods converges on a very similar estimate of the mean squared error on average. Note, however, that the OOB estimate exhibits lower variance than each of the k-fold methods.

When tuning a model, these kinds of swings in the estimation of model performance can have an influence on hyperparameter decisions.

OOB validation is much faster than k-fold cross-validation

At 20 folds, our k-fold estimate of the MSE has comparable variation to our OOB estimate — while taking 20 times as long to compute!

In general, you can think of OOB validation as requiring only “1x” computation, while k-fold cross-validation requires “kx”. In other words, it takes k times as long or k times as much compute. That can mean a huge difference in compute time.

If you want to squeeze all you can out of your model while tuning, OOB is often the superior option: it's much more stable than the fastest k-fold estimates (while still being more than twice as fast as three-fold cross-validation), and it's an order of magnitude faster than k-fold estimates of comparable stability.

The impact of this can’t be overestimated — having your experiments run two to 20 times faster can be transformative in terms of your ability to iterate on a model.

Bonus 1: One more time!

In addition, you’ll probably want to retrain the model one more time after performing k-fold cross-validation. This is because your k-fold models will each have been trained on only a subset of the data.

You could simply use one of the models from one of the folds, but depending on how many folds you’ve chosen you might be missing out on a significant portion — e.g., 33 percent, 25 percent, or 20 percent, among others — of the dataset. That’s usually enough to make a difference.

This isn't necessarily the case with OOB. Once you've settled on your hyperparameters, the model you used for OOB validation has already been trained on all the data, so there's no need to train it again.

Bonus 2: It’s even more precise as you increase the number of estimators

As you add estimators, you get more "OOB folds," meaning that your OOB performance estimates become more precise. Adding estimators also tends to go hand in hand with better model performance.

But because adding estimators also linearly increases the training time, the added speed of OOB methods makes them a welcome alternative when you’re tuning a model with a large number of trees.

The standard deviation of the OOB estimate of the MSE decreases roughly in proportion to the inverse square root of the number of estimators.

So then, why wouldn’t you want to use OOB validation?

As is always the case in Machine Learning, it’s “horses for courses!” You need to choose the right algorithm or approach for the task at hand.

Do you have very large data?

If you have a large enough dataset, you might be considering using a simple validation set (e.g., with an 80/20 split), in which case you wouldn’t be using k-fold and you wouldn’t necessarily see an increased speed benefit from using OOB validation.

However, I’d suggest that OOB validation will still provide a more accurate and precise validation score in this context (because you’re scoring on all the data, not just 20 percent of it) and at the end you’ll have a model that’s been trained on 100 percent of the data and not 80 percent. And if you do decide to go back and train a model on the full dataset after doing an 80/20 validation, you’re back in a case in which the OOB method is twice as fast! It still looks like a clear winner.

Do you have very small datasets? (You might want to stratify)

It may be the case that on small datasets (of about 100 samples), OOB’s sampling doesn’t provide a good estimate of the generalization error. (But neither may cross-validation, for that matter.) However, better sampling — such as stratified sampling — may solve that issue for you. The easiest way to perform stratified sampling using scikit-learn is to use a cross-validation procedure. So, if you think you might need to do stratified sampling, consider skipping the OOB.
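For reference, a stratified k-fold sketch for a regression target could look like the following, stratifying on a binned copy of y (the quartile binning here is purely illustrative, and X and y are assumed):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import StratifiedKFold

    # Stratify on a binned version of the continuous target.
    y_bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))

    mses = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y_bins):
        rf = RandomForestRegressor(n_estimators=300).fit(X[train_idx], y[train_idx])
        mses.append(mean_squared_error(y[val_idx], rf.predict(X[val_idx])))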

Does your data have a temporal component?

If you need to do something like a back test, then an OOB validation score might not be what you want.

However, I will say that I’d still personally want to see the OOB validation stats even if I was also getting stats from a back test! They tell you something different about what kind of signal the model is capturing. As long as you don’t interpret the OOB results as representative of the quality of a forecast into the future, you’ll be able to leverage these as additional information about what your model is learning (or is failing to learn).

But what if you’re not using a bag-based model like random forest?

If a random forest or related model isn’t the right choice for your problem, then of course OOB validation may not be a good choice either, as the best model for the problem may not even support OOB validation.

Although… you can technically turn any model into a bag-based model!
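For instance, scikit-learn's BaggingRegressor will wrap just about any estimator, and with oob_score=True it gives you the same OOB machinery. A sketch, with X and y assumed:

    from sklearn.ensemble import BaggingRegressor
    from sklearn.linear_model import Ridge

    # Bag any base model to get OOB validation "for free".
    bag = BaggingRegressor(estimator=Ridge(alpha=1.0),   # "base_estimator=" on older scikit-learn versions
                           n_estimators=100, oob_score=True, n_jobs=-1, random_state=0)
    bag.fit(X, y)

    print(bag.oob_score_)          # R^2 computed from the OOB predictions
    oob_preds = bag.oob_prediction_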

Conclusion

OOB validation is an underutilized tool that deserves to have a spot in the data science workflow, especially in cases in which it’s necessary to get a model on tabular data out the door quickly. Make sure to add it to your toolbox, to speed up and simplify your model development workflow!

Phillip Adkins is on LinkedIn.
