Deriving a Final Predictive Model Using Cross-validation and Bootstrap Aggregation
This post deals with a very common question in Machine Learning: "How do I get the final model from my training data so that it performs/generalizes well on the test set?", or equivalently, "How do I get the predictive model using k-fold cross-validation?". The usual answers are cross-validation and/or ensemble methods. However, I have rarely found a post that combines all the details into a step-by-step procedure. Here, I have attempted to bring together the details available across various posts/blogs in the simplest possible way to save you the trouble.
Before we proceed to the discussion, let's have a look at a short question-and-answer story between two friends, Santa and Banta:
Santa: I have training data. I don't have labels for the test data. I want to know which model will perform well on the test data: a Support Vector Machine (SVM) or a Feedforward Neural Network (FNN)?
Banta: I think you should run k-fold cross-validation with both models. Whichever model gives you the better average cross-validation accuracy will be the better choice for your test data.
Santa: Yeah, sure. I did it, and I found that the FNN works better, so I will use the FNN for the test data. But wait, I have one more problem: how do I know which learning rate (hyper-parameter) will be good for my model? For example, which choice is better, 0.1 or 0.01?
Banta: Again, you can use k-fold cross-validation to pick the best hyper-parameters, such as the learning rate. Estimate the average k-fold cross-validation accuracy with the learning rate set to 0.1 and to 0.01, respectively. Whichever gives you better performance is the value best suited for the test data.
Santa: I did as you suggested and found that 0.01 is better than 0.1. Thanks. But I have one more doubt: how do I choose my final model for evaluation on the test data? Cross-validation does not say anything about the final model, right?
Banta: Right, cross-validation does not tell you how to get the final model. For that, you need to learn some ensemble techniques like Bootstrap Aggregation (also known as bagging). This will give you your final predictive model for the test data.
Santa: Thank you very much for the help. Now, after bootstrap aggregation, I have my final predictive model ready for the test data.
— — — — — — — — — — — — End of Story— —— — — — — — — —
So, from the above story, one can conclude as follows:
Step 1: Perform k-fold cross-validation on the training data to estimate which method (model) is better.
Step 2: Perform k-fold cross-validation on the training data to estimate which value of each hyper-parameter is better.
Step 3: Apply ensemble methods to the entire training data, using the method (model) and hyper-parameters selected by cross-validation, to get the final predictive model.
Let’s discuss each step in detail, one by one:
Step 1:
Let’s start with a definition of Cross-validation:
“Cross-validation is a technique for evaluating ML models by training several ML models on subsets of the available input data and evaluating them on the complementary subset of the data. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.”[1]
Now, let me point out a common mistake people make when reading this definition. In the definition, “ML models” refers to a particular method (Linear Discriminant Analysis (LDA), Support Vector Machine, CNN, or LSTM), not to particular trained instances of that method. So we might say “we have an SVM model”, but we should not call two different sets of trained coefficients different models, at least not in the context of model selection[2].
Now, let us assume that we have two models, say an SVM and an FNN, and our task is to decide which model is better. We can run k-fold cross-validation and see which one is overall better at predicting the validation-set points. But once we have used cross-validation to select the better-performing model, we train that model (whether it is the SVM or the FNN) on the entire training data (how we train it will be discussed later). We do not use the actual model instances trained during cross-validation as our final predictive model.
So, when we do k-fold cross-validation, we are essentially measuring the capacity of our model: how well it can be trained on some data and then make predictions on data it has not seen before. To know more about the various types of cross-validation and their advantages, please read this article.
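To make Step 1 concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, fold count, and model settings are illustrative assumptions and not part of the original discussion:

```python
# A minimal sketch of Step 1: compare two methods with k-fold cross-validation.
# X, y here are a synthetic stand-in for the real training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "SVM": SVC(kernel="rbf"),
    "FNN": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

# 5-fold cross-validation: each model is trained on 4 folds and evaluated on
# the held-out fold; we compare the average validation accuracy.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Whichever method wins here is then retrained from scratch on the full training data, as discussed above.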
Step 2:
Similar to Step 1, k-fold cross-validation can be used to find the hyper-parameter values that generalize well to unseen test data. Unlike in Step 1, here we search for better hyper-parameters rather than better methods (models).
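A minimal sketch of Step 2, assuming the FNN from the story and scikit-learn's GridSearchCV; the learning-rate values 0.1 and 0.01 follow the story above, while the dataset and other settings are illustrative assumptions:

```python
# A minimal sketch of Step 2: use the same k-fold split to compare two
# candidate learning rates for the FNN (learning_rate_init in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"learning_rate_init": [0.1, 0.01]}
search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best learning rate:", search.best_params_["learning_rate_init"])
print("Best mean CV accuracy:", search.best_score_)
```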
Summary of cross-validation from Step 1 and Step 2:
- The purpose of cross-validation is not to come up with our final predictive model.
- The purpose of cross-validation is model checking, not model building[2].
- Another purpose of cross-validation is hyper-parameter tuning.
- But if you want, you “may” use cross-validation to get your final model, as follows: (a) You may use the best-performing model from cross-validation as the final model. You may need to do so because neural network models are computationally very expensive to train multiple times. (b) The models trained during cross-validation can be combined into a cross-validation ensemble, which is supposed to perform better on average than any single model (see the sketch after this list). For example, in classification you can take a majority vote for the final decision, and in regression you can take the average.
- Ideally, we are not supposed to use these k trained instances of our model for any real prediction. For that, we want to use all the training data and come up with the best model possible. This can be done with ensemble methods like Bootstrap Aggregation (bagging).
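Here is a rough sketch of option (b), a cross-validation ensemble combined by majority vote, assuming scikit-learn and a synthetic dataset; the fold count, base model, and voting code are illustrative assumptions only:

```python
# A rough sketch of a cross-validation ensemble: keep the k models trained
# during cross-validation and combine their predictions by majority vote.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_test = X[:50]  # stand-in for unseen test data

base_model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
fold_models = []

# Train one model instance per fold, using the k-1 training folds.
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = clone(base_model)
    model.fit(X[train_idx], y[train_idx])
    fold_models.append(model)

# Majority vote across the k fold models for each test point.
all_preds = np.stack([m.predict(X_test) for m in fold_models])  # shape (k, n_test)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print(majority[:10])
```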
Step 3:
There are various techniques for building an ensemble of models. To learn about the various ensemble methods and their advantages (for example, reducing the variance of the final predictive model), please read this article. In this post, we discuss Bootstrap Aggregation (usually shortened to 'bagging'), one of the ensemble techniques for obtaining our final predictive model from the entire training data, a model that can generalize well to unseen data (test data).
The bootstrap method works as follows. Let there be a sample X of size N. We can make a new sample from the original sample by drawing N elements with replacement, randomly and uniformly. In other words, we select a random element from the original sample of size N and do this N times[3]. All elements are equally likely to be selected, so each element is drawn with probability 1/N. The following figure shows a dummy example of the bootstrap aggregation procedure.
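A tiny sketch of drawing one bootstrap sample with NumPy; the sample size and random seed are arbitrary assumptions for illustration:

```python
# Draw one bootstrap sample: N draws with replacement from the original
# N indices, each index having probability 1/N on every draw.
import numpy as np

rng = np.random.default_rng(0)
N = 100                      # size of the original sample (e.g. 100 balls)
original = np.arange(N)      # stand-in for the original training data

indices = rng.integers(0, N, size=N)   # N uniform draws with replacement
bootstrap_sample = original[indices]

# Because of replacement, some elements repeat and roughly 37% of the
# original elements are left out; the left-out ones can serve as a validation set.
out_of_bag = np.setdiff1d(original, bootstrap_sample)
print(len(np.unique(bootstrap_sample)), "unique elements;", len(out_of_bag), "out-of-bag")
```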
For example, let’s say we have a bag with 100 balls (the training data), and we want to create 5 new training samples from the original training data. The size of each new sample can be the same as the original training data or smaller; in general, we take it to be the same as the original. To create each new sample, we draw balls from the bag one at a time. At each step, the selected ball is put back into the bag, so that the next selection is again made from the same number of balls (100), each equally likely. Note that, because we put the balls back, there may be duplicates in the new sample. Let’s call this new sample X1. By repeating this procedure 5 times, we create 5 bootstrap samples X1, X2, X3, X4, X5, each containing 100 balls (of course, some points within any sample will be duplicates). These new training sets are known as bootstrap samples. Five models are fitted on these 5 bootstrap samples, and their outputs are combined by majority voting for the final classification. Examples not selected in a given bootstrap sample are used as a validation set to estimate the performance of the corresponding model; for example, when training a neural network, the learning criterion could be early stopping on the validation loss.
The combination of these 5 models (majority vote for classification, average for regression) is the final predictive model for the test data.
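Putting Step 3 together, here is a compact sketch using scikit-learn's BaggingClassifier as a stand-in for the manual procedure described above; the base model, dataset, and all settings are assumptions for illustration, not the original author's exact setup:

```python
# A compact sketch of bagging: 5 bootstrap samples, 5 fitted models,
# majority vote at prediction time, and out-of-bag examples used to
# estimate performance.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagged = BaggingClassifier(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                  learning_rate_init=0.01, random_state=0),
    n_estimators=5,     # 5 bootstrap samples -> 5 fitted models
    max_samples=1.0,    # each bootstrap sample is as large as the original data
    bootstrap=True,
    oob_score=True,     # score each model on the examples it never saw
    random_state=0,
)
bagged.fit(X, y)

# With only 5 models the OOB estimate is rough, but it illustrates the idea.
print("Out-of-bag accuracy estimate:", bagged.oob_score_)
# bagged.predict(X_test) would give the majority-vote prediction on test data.
```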
[1] https://docs.aws.amazon.com/machine-learning/latest/dg/machinelearning-dg.pdf#cross-validation
[3] https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-1-bagging