What is Hyperparameter Tuning (Cross-Validation and Holdout Validation)?

Suppose you are hesitating between two types of models (say, a linear model and a polynomial model): how can you decide between them?

  • One option is to train both and compare how well they generalize using the test set.

Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting.

How do you choose the value of the regularization hyperparameter?

  • One option is to train 100 different models using 100 different values for this hyperparameter.

Suppose you find the hyperparameter value that produces the model with the lowest generalization error on the test set, say just 5% error.

You launch this model into production, but unfortunately it does not perform as well as expected and produces a 15% error rate. What just happened?

The problem is that you measured the generalization error multiple times on the same test set, and you adapted the model and hyperparameters to produce the best model for that particular test set. This means that the model is unlikely to perform as well on new data.
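To make this effect concrete, here is a toy simulation (not part of the original explanation). It assumes 100 hypothetical "models" that guess labels at random, so none of them has any real skill. Picking the one with the best score on a small test set still yields an optimistic number, while the same kind of model scores about 50% on fresh data. The function names and sizes are illustrative choices.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def selection_bias_demo(n_models=100, n_test=50, n_fresh=10_000, seed=0):
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]

    # Each hypothetical "model" is a coin flipper with no real skill.
    test_scores = []
    for _ in range(n_models):
        preds = [rng.randint(0, 1) for _ in range(n_test)]
        test_scores.append(accuracy(preds, test_labels))

    # Selecting on the test set keeps the luckiest score, not the best model.
    best_on_test = max(test_scores)
    mean_on_test = sum(test_scores) / n_models

    # On data that played no part in the selection, a coin flipper stays a coin flipper.
    fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]
    on_fresh = accuracy(fresh_preds, fresh_labels)
    return best_on_test, mean_on_test, on_fresh

best, mean, fresh = selection_bias_demo()
print(f"best on test: {best:.2f}, mean on test: {mean:.2f}, on fresh data: {fresh:.2f}")
```

The gap between the best test score and the score on fresh data is purely an artifact of choosing among many candidates on the same test set.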

A common solution to this problem is called holdout validation:

Holdout validation

In holdout validation, the dataset is split into three parts: a training set, a validation set, and a test set (also called the holdout set).

What is a Training Set?

  • A training set is the subsection of a dataset from which the machine learning algorithm uncovers, or “learns,” relationships between the features and the target variable.
  • The sample of data used to fit the model.
  • A training dataset is a dataset of examples used for learning, that is, to fit the parameters (e.g., weights) of a model.

What is a Validation Set?

  • A validation dataset is a dataset of examples used to tune the hyperparameters of a model. It is also known as the development set, or dev set.

What is a Test Set (or Holdout Set)?

  • A test dataset is a dataset that is independent of the training dataset. If a model fit to the training dataset also fits the test dataset well, minimal overfitting has taken place. The test set is used to estimate the generalization error.
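As a minimal sketch of such a three-way split, the helper below shuffles the examples and cuts them into training, validation, and test lists. The function name `three_way_split` and the 60/20/20 proportions are assumptions made for illustration, not values prescribed here.

```python
import random

def three_way_split(examples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and cut it into train / validation / test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)

    test = shuffled[:n_test]                 # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]    # used to compare candidate models
    train = shuffled[n_test + n_val:]        # used to fit model parameters
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Shuffling before cutting matters when the data is stored in some order (by date or by class, for example); for time series a chronological split is usually more appropriate.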

The validation set is used during training to track the performance of your model on “unseen” data. “Unseen” is in quotes because, although the model never sees the validation data directly, you tune the hyperparameters to decrease the loss on the validation set (a rising validation loss signals overfitting).

However, by doing so, you may overfit the hyperparameters to the validation set: the loss will be low on that specific validation set but worse on any other unseen set. That is why you usually keep a third set, called the test set (or held-out set), as your truly unseen data, and you evaluate your model on it only once, after training your final model.

In practice, you simply hold out part of the training set (not the test set) to evaluate several candidate models and select the best one. The new held-out set is called the validation set (or sometimes the development set, or dev set).

You train multiple models with various hyperparameters on the reduced training set (full training set minus the validation set), and you select the model that performs best on the validation set.

After this holdout validation process, you train the best model on the full training set (including the validation set), and this gives you the final model.

Lastly, you evaluate this final model on the test set to get an estimate of the generalization error.
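The whole procedure can be sketched end to end. The example below is a toy under stated assumptions: synthetic data, a one-dimensional ridge regression through the origin with a closed-form solution, and a hand-picked list of candidate values for the regularization hyperparameter (called `alpha` here). None of these choices come from the text above; they only keep the sketch short and runnable.

```python
import random

def fit_ridge_1d(points, alpha):
    """Closed-form 1-D ridge through the origin: w = sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, y in points)
    return sxy / (sxx + alpha)

def mse(w, points):
    """Mean squared error of the model y = w * x on `points`."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)

# Synthetic data (an assumption for the demo): y = 3x + noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-2, 2)
    data.append((x, 3.0 * x + rng.gauss(0, 1)))

test_set = data[:40]             # set aside, used exactly once at the end
full_train = data[40:]
val_set = full_train[:40]        # held out from the training set, not the test set
reduced_train = full_train[40:]

# 1. Train one candidate per hyperparameter value on the reduced training set
#    and score each on the validation set.
candidate_alphas = [0.0, 0.1, 1.0, 10.0, 100.0]
val_errors = {a: mse(fit_ridge_1d(reduced_train, a), val_set) for a in candidate_alphas}
best_alpha = min(val_errors, key=val_errors.get)

# 2. Retrain the winner on the full training set (validation set included).
final_w = fit_ridge_1d(full_train, best_alpha)

# 3. Evaluate once on the test set to estimate the generalization error.
test_error = mse(final_w, test_set)
print(f"best alpha: {best_alpha}, test MSE: {test_error:.3f}")
```

Note that the test set appears only in the last step; every decision before that point relies on the validation set alone.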

This solution usually works quite well. However,

  • if the validation set is too small, then model evaluations will be imprecise: you may end up selecting a suboptimal model by mistake.
  • if the validation set is too large, then the remaining training set will be much smaller than the full training set. Why is this bad?

Well, since the final model will be trained on the full training set, it is not ideal to compare candidate models trained on a much smaller training set. It would be like selecting the fastest sprinter to participate in a marathon.
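To see why a small validation set makes evaluations imprecise, it helps to look at the standard error of a measured error rate. The sketch below assumes a classification setting in which each validation example is an independent right-or-wrong trial, so the binomial formula sqrt(p(1 - p) / n) applies. The 10% error rate and the validation set sizes are illustrative numbers only.

```python
import math

def error_rate_standard_error(error_rate, n_examples):
    """Standard error of an error rate measured on n independent examples (binomial)."""
    return math.sqrt(error_rate * (1 - error_rate) / n_examples)

for n in (50, 500, 5000):
    se = error_rate_standard_error(0.10, n)
    # Roughly 95% of measurements fall within two standard errors of the true rate.
    print(f"n={n}: a true 10% error rate is measured to within about +/- {2 * se:.3f}")
```

With only 50 validation examples, the uncertainty is several percentage points, so two candidate models whose true error rates differ by a point or two cannot be ranked reliably.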

One way to solve this problem is to perform repeated cross-validation, using many small validation sets.


Each model is evaluated once per validation set after it is trained on the rest of the data. By averaging out all the evaluations of a model, you get a much more accurate measure of its performance.

There is a drawback, however: the training time is multiplied by the number of validation sets.
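One common form of this idea is k-fold cross-validation, sketched below under a few assumptions: the helper names `k_fold_indices` and `cross_val_score` are hypothetical (written here in plain Python rather than taken from a library), and the toy "model" simply predicts the mean of its training targets. The loop calls `fit` once per fold, which is where the extra training time comes from.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each example lands in exactly one validation fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(examples, fit, score, k=5):
    """Average validation score over k folds; `fit` is called k times."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(examples), k):
        model = fit([examples[j] for j in train_idx])
        scores.append(score(model, [examples[j] for j in val_idx]))
    return sum(scores) / k

# Toy model (an assumption for the demo): predict the mean of the training targets.
def fit_mean(ys):
    return sum(ys) / len(ys)

def mse(mean, ys):
    return sum((y - mean) ** 2 for y in ys) / len(ys)

targets = [float(v) for v in range(20)]
print(f"5-fold average MSE: {cross_val_score(targets, fit_mean, mse, k=5):.3f}")
```

To compare hyperparameter values, you would call `cross_val_score` once per candidate and keep the one with the best average, so the total number of training runs is the number of candidates times k.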

