Cross validation the right and wrong way

William
Published in Analytics Vidhya · Mar 1, 2018

Photo by Mika Baumeister on Unsplash

Cross validation is a core machine learning technique, and it is often taught right away in introductory online data science courses. This is with good reason: we have to test our models on real data, and cross validation allows for more precise testing than, say, a single validation set alone. However, there is a simple mistake one can make, which practitioners sometimes forget about, and I hope to cover it quickly here so that you may avoid it in the future.

The wrong way

When applying a model there are often various tuning options available, and one of course wishes to select the parameters that minimize some error metric. One way to do this is to apply cross validation to a set of different tuning parameters and see which setting gives the lowest error rate.
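As a concrete illustration (my own, not from the original post), here is a minimal sketch of that setup using scikit-learn, with a made-up dataset and a small hypothetical parameter grid. The cross validated score of the winning parameters is exactly the number people are tempted to report:

```python
# A minimal sketch of parameter selection with cross validation.
# The dataset, the SVC model and the C grid are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Try a small grid of tuning parameters and score each with 5-fold cross validation.
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)  # tempting, but reporting this as the test error is the mistake below
```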

You’ve guessed it. This is the wrong way, but only slightly wrong. The problem with the approach is that you train lots of different models, and some are bound to do better than others, so by selecting the best-performing model and reporting the corresponding error rate you are essentially taking the model that fit best on the test data. Now, there is nothing wrong with this approach in itself, as this is generally how people fit models in order to select the best performer. The missing piece is the actual testing part, which should be done with either a hold-out validation set or nested cross validation, to measure how well the selected model really performs.
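For reference, nested cross validation simply wraps the whole selection procedure in an outer loop: the inner loop picks the parameters, the outer loop estimates how well that entire procedure generalizes. A minimal sketch, continuing the made-up SVC example above (estimator and grid are again assumptions for illustration):

```python
# A minimal sketch of nested cross validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)  # inner loop: model selection
outer_scores = cross_val_score(inner, X, y, cv=5)                  # outer loop: honest error estimate

print(outer_scores.mean())
```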

The problem becomes especially apparent with data containing a high number of features and a low number of samples. Selecting features based on their correlation with the labels does not guarantee a real correlation, since when selecting among many different features the selection process is bound to find some of them correlated with the labels by chance.

Say we have 5,000,000 features and 50 samples. If we look hard enough, there are probably some of these features which have a really high correlation with the labels, and therefore look like good predictors.

So if we apply some screening procedure and select the promising features based on the labels, and then go on to estimate error rates using cross validation, we will be able to obtain a very low error rate. The problem is that the screening process should be thought of as training, since we are selecting features based on the labels, so a separate validation should be performed.
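To make the leak concrete, here is a minimal sketch (my own illustration, scaled down from the 5,000,000 × 50 example so it runs quickly) where purely random features, screened against the labels before cross validation, still score far above chance:

```python
# A minimal sketch of the leakage described above: noise features screened on
# ALL labels before cross validation. Sizes and models are illustrative assumptions.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 50_000))   # 50 samples, 50,000 pure-noise features
y = rng.integers(0, 2, size=50)     # labels with no real relation to X

# Wrong: the screening sees the labels of every sample, including future "test" folds.
X_screened = SelectKBest(f_classif, k=20).fit_transform(X, y)

scores = cross_val_score(LogisticRegression(), X_screened, y, cv=5)
print(scores.mean())  # well above the ~0.5 you should get on pure noise
```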

Another variant of this problem is model tuning using cross validation. If we do a lot of tuning to improve some evaluation metric, then we are essentially using the labels again in the evaluation process.

The right way

Select the features using cross validation, but keep a hold-out validation set. First tune the model and the feature selection with cross validation, then perform the final evaluation on the validation set. This way the actual testing is done on unseen data, and the feature selection bias only affects the cross validated part of the data set.
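A minimal sketch of what that looks like in code, assuming the same noise data as above and using scikit-learn's Pipeline so the screening is re-fit on the training folds only:

```python
# A minimal sketch of the right way: hold out a validation set, keep feature
# selection inside the cross validated pipeline, score once on the hold-out.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 50_000))
y = rng.integers(0, 2, size=50)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())

print(cross_val_score(model, X_train, y_train, cv=5).mean())  # tuning/selection estimate
print(model.fit(X_train, y_train).score(X_val, y_val))        # honest hold-out score, ~0.5 on noise
```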

Take-away point: keep some data for final validation, and do whatever you want with the rest.

From comments: Suggested procedure

For the first question, it’s usually recommended to go with something around an 80/20 split, so 80% for training.

As for the second question, when you train your model this way you lose whatever information the 20% holds, and that is the ‘price’ for being able to actually test your model. BUT when all model training and selection is done, you retrain the model, with whatever parameters you have found, on the whole dataset. Doing this you use all of your data, and no information is lost in the final model. This additional data hopefully also improves your model, since it is now based on more data.

Concretely, the workflow looks something like this (a code sketch follows the list):

  1. Remove 20% for validation.
  2. Run cross validation on the remaining 80% to select optimal parameters or similar.
  3. Train model with all 80% of the data with optimal parameters.
  4. Test the model on the 20% validation set; you may see a drop in performance.
  5. If satisfied with the results, retrain the model on 100% of the data and deploy it for use.
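A minimal sketch of those five steps, assuming a generic scikit-learn classifier and a made-up parameter grid (both are illustrative choices, not part of the original procedure):

```python
# A minimal sketch of the suggested workflow.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 1. Remove 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Cross validate on the remaining 80% to select parameters.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid={"max_depth": [3, 5, None]}, cv=5)

# 3. GridSearchCV refits the best parameters on all 80% by default.
grid.fit(X_train, y_train)

# 4. Test on the 20% validation set; expect this to be a bit lower than the CV score.
print(grid.score(X_val, y_val))

# 5. If satisfied, retrain with the chosen parameters on 100% of the data and deploy.
final_model = RandomForestClassifier(random_state=0, **grid.best_params_).fit(X, y)
```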
