Resampling Methods — ISLR Series: Chapter 5

Now that we have learned some basic models, it is time to determine how well a model performs on new data that it has not seen before (model assessment) and how to choose the right model (model selection). The tool used in statistics to assess and select models is resampling.

Resampling (wait for it…) is the process of repeatedly drawing samples from the training data and refitting a model on each sample. Each refit gives us additional information about the fitted model, such as how much its error varies from sample to sample. Using that information, we can tweak the model to improve its performance. The two resampling methods discussed in the chapter are cross-validation and the bootstrap.

Validation Set Approach

The usual process when training a model is to split the data into two sets, a training set and a test set. When we run our model on the training and test sets, we get (drumroll!!!!) a training error and a test error respectively. The reason we do this is to train on one set and test on another to assess how well our model performs. If we train on all the data and test on the same data, then of course the model will appear to perform well because it has seen the data before. The value of a model is determined by how accurate it is on unseen data.

The problem arises when there is not enough data to train your model AND save some on the side for a test set.

The validation set approach is the most basic one: the observations are randomly divided into a training set and a validation set (sometimes referred to as the hold-out set). The model is fitted using the training set and tested on the validation set. The error on the validation set can be used to estimate the test error.

Very simple, right? Unfortunately, it has some drawbacks. The validation set error can be very volatile depending on how the data happens to be split. That means every time we implement the validation set approach, chances are that we will get a different validation set error. The second drawback is that if we split the data in half, then the model is deprived of half of the data. Models tend to perform worse when trained on fewer observations, so the validation set error tends to overestimate the test error of a model fitted on the entire data set.
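To make the first drawback concrete, here is a minimal sketch of the validation set approach in plain Python. The toy data (y = 2x + 1 plus noise), the simple one-feature linear regression and the helper names (fit_line, mse, validation_mse) are all assumptions made up for illustration; they are not from the book. The only thing that changes between runs is the random split, yet the estimated error changes too.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(a, b, xs, ys):
    """Mean squared error of the line y = a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def validation_mse(xs, ys, seed):
    """Split the data in half at random, fit on one half, score on the other."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, val = idx[:half], idx[half:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return mse(a, b, [xs[i] for i in val], [ys[i] for i in val])

# Toy data (made up for this sketch): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

# Same data, same model, five different random splits
errors = [validation_mse(xs, ys, seed) for seed in range(5)]
print(errors)
```

Each entry in errors is an estimate of the same test error, so the spread between them is purely due to which observations landed in the validation half.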

That is when cross-validation can be implemented. The basic idea of cross-validation is to hold out a subset of the training set, train the model on the remaining data, then test the model on the held-out subset. We repeat this several times with different held-out subsets and average the results to get an estimate of the test error. We will go over two different types of cross-validation: leave-one-out cross-validation (LOOCV) and k-fold cross-validation.

Cross Validation

A possible way to address the drawbacks of the validation set approach is to implement leave-one-out cross-validation (LOOCV). Instead of splitting the data in half, with one half as the training set and the other half as the validation set, LOOCV holds out only one observation for validation and trains on the rest. But that is not all. LOOCV repeats this process until every single observation has had a chance to be held out by itself. If there are n observations in the dataset, then the model is fitted n times, each time recording the error on that single held-out observation. The test error estimate is the average of all n errors.
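The LOOCV loop can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the book's code: the model is a simple one-feature linear regression, the error is squared error, and the names fit_line and loocv_mse and the toy data are made up for this example.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loocv_mse(xs, ys):
    """Hold out each observation once, fit on the other n - 1, average the errors."""
    n = len(xs)
    squared_errors = []
    for i in range(n):
        # Everything except observation i is the training set
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        # Squared error on the single held-out observation
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return sum(squared_errors) / n

# Toy data (made up for this sketch): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

print(loocv_mse(xs, ys))
```

Note that the function takes no random seed: the n splits are fully determined by the data, so calling it twice on the same data returns the same number.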

LOOCV addresses both drawbacks of the validation set approach. Because every observation is held out exactly once, there is no randomness in how the data is split, so running LOOCV again on the same data always gives the same result. And since each model is trained on almost all the data (except for that lonely observation held out), the estimate does not overestimate the test error as much as the validation set approach does.

The drawback of LOOCV is that we have to fit the model n times. If there are 1 million data points then we would have to fit it one million times! Performing LOOCV can be computationally expensive and time consuming. But don’t worry. K-fold cross-validation comes to the rescue.

K-fold cross-validation randomly splits the data into k subsets, or folds, of approximately equal size. The first fold is treated as the validation set and the model is fitted on the remaining k - 1 folds. The error is calculated on that validation fold. This process is repeated k times, so that each fold gets a turn as the validation fold. The k-fold CV estimate is the average of the k validation fold errors.

The benefit of k-fold over LOOCV is that k-fold is less computationally expensive. In practice, k is usually 5 or 10. That means the model is fitted only 5 or 10 times, as opposed to n times with LOOCV. (LOOCV is the special case of k-fold where k = n.)
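Here is a rough sketch of k-fold cross-validation in plain Python. As before, the one-feature linear regression, the toy data and the helper names (fit_line, kfold_indices, kfold_mse) are assumptions for illustration only. The folds are built by shuffling the indices and dealing them out like cards, which keeps the fold sizes within one observation of each other.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def kfold_mse(xs, ys, k=5, seed=0):
    """Return (average validation MSE, list of per-fold validation MSEs)."""
    folds = kfold_indices(len(xs), k, seed)
    fold_errors = []
    for val in folds:
        held_out = set(val)
        # Train on every observation that is not in the current validation fold
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in val) / len(val)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / k, fold_errors

# Toy data (made up for this sketch): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

cv_estimate, fold_errors = kfold_mse(xs, ys, k=5)
print(cv_estimate, fold_errors)
```

With k = 5 the model is fitted five times here, compared with 100 times for LOOCV on the same data.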

We have been talking about how to use cross-validation to estimate the test error, which falls under model assessment. But we can also use these approaches for model selection. If we created multiple models, for instance linear regression, a support vector machine, a random forest and a neural network, how can we determine which one is the best? Since cross-validation gives us an estimate of the test error for each model, we can compare those estimates to choose among them. The model with the lowest average validation error AND the least variance between its validation errors is the one to prefer.
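As a small illustration of model selection, the sketch below compares two deliberately simple candidates with 5-fold cross-validation: a baseline that always predicts the mean, and a one-feature linear regression. These two candidates, the toy data and the helper names (fit_mean, fit_line, cv_fold_errors) are stand-ins chosen to keep the example in plain Python; the same loop would apply to any model that can be fitted and then used to predict.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training responses."""
    m = statistics.mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Simple least-squares line; returns a prediction function."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def cv_fold_errors(fit, xs, ys, k=5, seed=0):
    """Validation MSE on each of the k folds for the given fitting function."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = []
    for f in range(k):
        val = idx[f::k]
        held_out = set(val)
        train = [i for i in idx if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.append(statistics.mean([(ys[i] - predict(xs[i])) ** 2 for i in val]))
    return errors

# Toy data (made up for this sketch): y = 2x + 1 plus Gaussian noise
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

candidates = {"mean only": fit_mean, "linear": fit_line}
results = {}
for name, fit in candidates.items():
    errs = cv_fold_errors(fit, xs, ys)
    # Average error across folds, and how much the folds disagree
    results[name] = (statistics.mean(errs), statistics.stdev(errs))

best = min(results, key=lambda name: results[name][0])
print(results)
print("selected:", best)
```

Both candidates are scored on the same folds (same seed), so the comparison is not affected by one model getting a luckier split than the other.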

The graph above demonstrates how to use cross-validation to select the best model. The blue line is the actual test error, the black dashed line is the LOOCV estimate and the orange line is the k-fold cross-validation estimate. The ‘Flexibility’ (x-axis) can be thought of as indexing different models of increasing complexity. In a polynomial regression, for example, a flexibility of 2 would mean the feature is raised to the 2nd power. We can try a variety of models and look at the cross-validation scores to find the best one, which is the one at the minimum of the curve. On the left, the best model is at a flexibility of about 9. In the center, the best model is at about 6, and on the right the best model is at 10. As we can see, the cross-validation errors are very close to the actual test errors. In a classification setting, instead of using MSE, we use the error rate (how often we misclassify a label).
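For the classification case, the only change to the cross-validation loop is the score computed on each held-out fold. A minimal sketch of that score, with a made-up helper name (error_rate) and made-up labels:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up labels: 1 of the 4 predictions is wrong
rate = error_rate(["spam", "ham", "ham", "spam"], ["spam", "ham", "spam", "spam"])
print(rate)  # 0.25
```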


Bootstrap

The bootstrap is used to quantify the uncertainty associated with a given estimator or statistical learning method. It works by generating many new datasets from the original data set: each bootstrap dataset is built by randomly selecting observations from the original data with replacement until it is the same size as the original. Because the sampling is done with replacement, the same observation can appear more than once in a bootstrap dataset while others are left out.

The idea behind bootstrapping is that we compute the estimate of interest on each of these randomly generated datasets. The standard deviation of those estimates approximates the standard error of the estimate computed from the original data set.

The standard error tells us how much, on average, our estimate will differ from the true value. Repeating this statement with an example: if the standard error is 0.12, then we would expect our estimate to differ from the true value by about 0.12 on average.
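A minimal sketch of the bootstrap in plain Python: resample the data with replacement, recompute the estimate on each resampled dataset, and take the standard deviation of those estimates as the standard error. The helper name bootstrap_se, the choice of the sample mean as the estimator, the 1,000 resamples and the toy data are all assumptions for illustration. The sample mean is used because its standard error also has a textbook formula (s / sqrt(n)) that the bootstrap result can be checked against.

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations from the original data WITH replacement
        sample = rng.choices(data, k=n)
        estimates.append(estimator(sample))
    # The spread of the bootstrap estimates approximates the standard error
    return statistics.stdev(estimates)

# Toy data (made up for this sketch): 200 draws from a normal distribution
rng = random.Random(1)
data = [rng.gauss(5, 2) for _ in range(200)]

se_boot = bootstrap_se(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(se_boot, se_formula)
```

The two numbers should be close. The appeal of the bootstrap is that the same function works for estimators that have no simple formula: pass statistics.median instead of statistics.mean, for example.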

Collaborators: Michael Mellinger


This is a learning journey for us so if there is something that is incorrect or unclear let us know in the comments and we can clear it up.


