Nested Cross-Validation Against Overfitting

Anil Ozturk
3 min read · Dec 4, 2022


In machine learning tasks, we check our models against a validation set so that they do not overfit. We use cross-validation to keep this validation set changing, so that we do not pick the best model by indirectly overfitting to a single fixed split. But even with cross-validation, our model can still surprise us on unseen, additional test data, because some of the popular techniques we rely on let the model make decisions by overfitting to the validation set.

The most popular ones are:

  • Hyper-Parameter Optimization
  • Feature Selection
  • Early-Stopping

Hyper-Parameter Optimization and Feature Selection Case

When you try to find an optimal hyper-parameter or feature set for your model using K-Fold or any other cross-validation scheme, you are indirectly feeding the characteristics of the whole dataset into your model. As a result, your model converges to a hyper-parameter or feature set that performs well across the entire dataset, and it would likely perform poorly in an unseen-data scenario.
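For example, here is a minimal sketch of that bias with scikit-learn's GridSearchCV (the synthetic data, model, and parameter grid are placeholder choices, not taken from the article): the same folds that choose the winning configuration also produce the reported score.

```python
# The same K-Fold splits are used both to pick the hyper-parameters and to
# report the final score, so `best_score_` is an optimistic estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=1000, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [3, 6, None]}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring="roc_auc")
search.fit(X, y)

# This score was used to choose the winning configuration, so reporting it
# as the model's performance on unseen data is biased upwards.
print(search.best_params_, search.best_score_)
```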

Early-Stopping Case

While training your model with cross-validation, you terminate training early by monitoring the validation split. Your model therefore indirectly overfits that validation split, and this scheme also does not let you simulate the unseen-data scenario.
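A minimal sketch of that leak on a single split (LightGBM is used here only as an example library; the data and model are placeholders): the same validation fold both decides when training stops and produces the reported score.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = lgb.LGBMClassifier(n_estimators=1000, random_state=42)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],                            # the validation fold picks the stopping round...
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)

preds = model.predict_proba(X_val)[:, 1]
print(roc_auc_score(y_val, preds))                        # ...and the same fold produces the reported score
```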

Nested Cross-Validation

With nested cross-validation, you can still perform the two operations I mentioned above within a cross-validation scheme, while also measuring your model's performance on data it has never seen.

Standard Cross-Validation

Normally, when you try to train your model using a cross-validation scheme, you would use a standard loop like the one below:
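A minimal sketch of such a loop, with synthetic data and a placeholder model standing in for your own:

```python
# Standard K-Fold loop: every fold's score feeds directly into the decisions
# you make (tuning, feature selection, early stopping) and into the reported score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1000, random_state=42)
scores = []

for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = RandomForestClassifier(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[val_idx])[:, 1]
    scores.append(roc_auc_score(y[val_idx], preds))

print(np.mean(scores))
```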

But this scheme has the weaknesses I mentioned above.

Nested Variant

Image Source: https://www.baeldung.com/cs/k-fold-cross-validation

Let’s go over the implementation:

  • First of all, we implement the first CV, which we call the outer CV. This is the same CV approach we used before.
  • Then we split the training part of this outer CV again with another CV, which we can call the inner CV. In doing so, we have isolated the validation split of the outer CV from everything that happens inside the inner CV.
  • From now on, the validation data of the outer CV counts as unseen data for every operation we perform in the inner CV, because there is no way for that data to appear in any inner-CV fold.

An example model loop written with this approach could look like this:
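Again, the data, model, and parameter grid below are placeholder choices; the structure is what matters: the inner GridSearchCV only ever sees the outer training part.

```python
# Nested CV: the inner CV tunes hyper-parameters on the outer training part
# only; the outer validation fold stays unseen until it is scored.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=1000, random_state=42)
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 6, None]}

outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

oof_preds = np.zeros(len(y))   # out-of-fold prediction vector over all data
outer_scores = []

for train_idx, val_idx in outer_cv.split(X):
    # Inner CV: all tuning decisions see only the outer training part.
    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, cv=inner_cv, scoring="roc_auc")
    search.fit(X[train_idx], y[train_idx])

    # Outer validation fold: untouched by every inner-CV decision.
    preds = search.best_estimator_.predict_proba(X[val_idx])[:, 1]
    oof_preds[val_idx] = preds
    outer_scores.append(roc_auc_score(y[val_idx], preds))

print("mean outer score:", np.mean(outer_scores))
print("OOF score:", roc_auc_score(y, oof_preds))
```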

It’s exactly the approach I described above. The models are tuned and trained with the inner CV, and their predictions for the validation splits of the outer CV are collected. You can get a score by averaging the scores of these outer-CV validation predictions, or you can combine them in an out-of-fold (OOF) fashion to build a prediction vector covering all the data you have.

Conclusion

It is in your best interest to use this cross-validation approach for any problem where you want to simulate performance on unseen data as closely as possible, because every naive approach that includes tuning or early stopping biases your model's performance estimate toward the data you already have.
