Validation strategies
As mentioned in the previous article, overfitting hurts the performance of your model on unseen data, and validation is how you detect it and keep it under control. Let’s look in more detail at the strategies you can adopt for this important step, and at the reasons to choose each of them.
Hold-out recap
In the previous story we introduced a first way to split a dataset for validation: the hold-out strategy.
It consists of splitting the data into two parts: the training part is used to fit the model, and the validation part is used to evaluate its performance. Using the scores from this evaluation step, we can choose the best model and the best hyperparameters for it.
If we have enough data, hold-out is usually a good choice. In particular, this scheme is safe to use when the same model gets similar scores across different splits.
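One way to check whether scores are similar across splits is to repeat the hold-out split with different random seeds and look at the spread. The snippet below is only a sketch: the synthetic dataset, the choice of GaussianNB and the number of seeds are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only for illustration (not the article's dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

holdout_scores = []
for seed in range(5):
    # Each random_state produces a different 80/20 split of the same data
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = GaussianNB().fit(X_tr, y_tr)
    holdout_scores.append(model.score(X_val, y_val))

print("scores:", np.round(holdout_scores, 3))
print("spread (max - min): %0.3f" % (max(holdout_scores) - min(holdout_scores)))
```

If the spread is small compared with the differences between the models you are comparing, a single hold-out split is probably enough.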
But what is the optimal size of our sets?
A common split uses 80% of the data for training and the remaining 20% for testing.
Note that this is not the default of train_test_split, the function we used for hold-out in our notebook: when neither test_size nor train_size is given, it holds out 25% of the data, so pass test_size=0.2 to get an 80/20 split. Another useful tool is ShuffleSplit, which generates the indices of one or more random train/test splits.
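As a small sketch of the two tools named above, here they are on a toy array of 10 samples (the toy data and the variable names are assumptions for illustration). The test size is passed explicitly, so the split does not depend on any default.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.arange(10)

# Explicit 80/20 hold-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))

# ShuffleSplit yields index arrays for several independent random splits
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
splits = list(ss.split(X))
for train_idx, test_idx in splits:
    print("TRAIN:", train_idx, "TEST:", test_idx)
```

train_test_split returns the split arrays directly, while ShuffleSplit only returns indices, which is convenient when you want to loop over several splits.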
Leave-One-Out cross validation
If your dataset is small, it makes sense to keep the validation set small too, so that enough examples are left for training.
But this means that the validation error will be a poor estimate of the test error, because a validation set that is too small doesn’t capture the behaviour of the test set. In the extreme case, measuring the performance of your model on a single example tells you almost nothing.
The solution is the Leave-One-Out strategy: we iterate over every sample in the data, each time using that one example for validation and all the remaining examples for training.
You need to retrain the model N times, where N is the number of samples in the dataset. In the end you have a prediction for every sample in the dataset, and you can compute the global loss by averaging the individual losses (this averaging process is called cross-validation).
This method can be helpful if we have very little data and a model that is fast enough to retrain N times.
In scikit-learn, this strategy is provided by the LeaveOneOut class.
As an example, let’s consider a very small dataset, with only 10 examples:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut
# Tiny synthetic dataset: 10 samples, 4 features
X_clf, y_clf = make_classification(n_samples=10, n_features=4, random_state=3)
# Plot the first two features, colored by class
plt.scatter(X_clf[:, 0], X_clf[:, 1], marker='o', c=y_clf, s=25, edgecolor='k')
plt.title("Data")
Now, let’s find the best classifier for a binary classification task on these points. We start with a Naive Bayes classifier:
from numpy import mean, std
from sklearn.naive_bayes import GaussianNB
loo = LeaveOneOut()
NB2 = GaussianNB()
score = []
# Generate indices to split the data into training and validation sets
for train_index, valid_index in loo.split(X_clf):
    print("TRAIN:", train_index, "VALID:", valid_index)
    X_train, X_valid = X_clf[train_index], X_clf[valid_index]
    y_train, y_valid = y_clf[train_index], y_clf[valid_index]
    NB2.fit(X_train, y_train)
    fold_score = NB2.score(X_valid, y_valid)  # accuracy on the single held-out sample
    print(fold_score)
    score.append(fold_score)
# Average accuracy with cross-validation
print("Accuracy: %0.2f (+/- %0.2f)" % (mean(score), std(score) * 2))
>>>
TRAIN: [1 2 3 4 5 6 7 8 9] VALID: [0] 1.0
TRAIN: [0 2 3 4 5 6 7 8 9] VALID: [1] 0.0
TRAIN: [0 1 3 4 5 6 7 8 9] VALID: [2] 1.0
TRAIN: [0 1 2 4 5 6 7 8 9] VALID: [3] 1.0
TRAIN: [0 1 2 3 5 6 7 8 9] VALID: [4] 1.0
TRAIN: [0 1 2 3 4 6 7 8 9] VALID: [5] 1.0
TRAIN: [0 1 2 3 4 5 7 8 9] VALID: [6] 0.0
TRAIN: [0 1 2 3 4 5 6 8 9] VALID: [7] 1.0
TRAIN: [0 1 2 3 4 5 6 7 9] VALID: [8] 1.0
TRAIN: [0 1 2 3 4 5 6 7 8] VALID: [9] 1.0
Accuracy: 0.80 (+/- 0.80)
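The explicit loop above can also be written more compactly by passing a LeaveOneOut object as the cv argument of cross_val_score. This is a sketch of an equivalent formulation, not the code from the original notebook; the exact scores may depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Same tiny dataset as in the example above
X_clf, y_clf = make_classification(n_samples=10, n_features=4, random_state=3)

# One score per left-out sample: 1.0 if it is classified correctly, 0.0 otherwise
loo_scores = cross_val_score(GaussianNB(), X_clf, y_clf, cv=LeaveOneOut())
print(loo_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (loo_scores.mean(), loo_scores.std() * 2))
```

Since each validation set contains a single sample, every individual score is either 0 or 1, which is why the spread around the mean is so large.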
K-fold cross validation
K-fold divides the data into K groups of samples of (approximately) equal size, called folds. If K equals the number of samples in the dataset, this is equivalent to the Leave-One-Out strategy.
The model is trained on K-1 folds and validated on the fold left out; this is repeated K times, so that every fold is used as the validation set exactly once. Afterwards, we average the scores over the K folds (cross-validation).
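To make the fold structure concrete, the sketch below (toy data, chosen only for illustration) prints the fold sizes for K=3 on 10 samples and checks that K-fold with K equal to the number of samples produces the same validation sets as LeaveOneOut.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)  # 10 toy samples

# K = 3: 10 samples cannot be split evenly, so fold sizes differ by at most one
fold_sizes = [len(valid_idx) for _, valid_idx in KFold(n_splits=3).split(X)]
print(fold_sizes)  # [4, 3, 3]

# K = N: each fold holds a single sample, exactly as in Leave-One-Out
kfold_n = [tuple(v) for _, v in KFold(n_splits=len(X)).split(X)]
loo = [tuple(v) for _, v in LeaveOneOut().split(X)]
print(kfold_n == loo)  # True
```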
This method is a good choice when we have enough data but get different scores and different optimal hyperparameters for different splits.
You can also estimate the mean and variance of the loss across folds. This is very helpful for judging whether an improvement is significant.
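As a rough illustration of this idea, the snippet below compares two classifiers by the mean and standard deviation of their fold scores. The synthetic dataset and the two models are assumptions chosen for the example, and comparing a gap with the fold-to-fold spread is only a rule of thumb, not a formal significance test.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

results = {}
for name, clf in [("GaussianNB", GaussianNB()),
                  ("DecisionTree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print("%s: mean=%0.3f std=%0.3f" % (name, scores.mean(), scores.std()))

# A gap between means that is smaller than the fold-to-fold std is weak evidence
gap = abs(results["GaussianNB"][0] - results["DecisionTree"][0])
print("gap between means: %0.3f" % gap)
```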
Here it is important to understand the difference between K-fold and a hold-out repeated K times. With K-fold, every sample is used for validation exactly once, so the average of the scores is a meaningful summary of the quality of the model. With repeated hold-out this is not guaranteed: some samples may never be used as validation examples, while others may appear several times, so the average of the scores is much less informative.
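The difference can be seen by counting how many times each sample lands in a validation set. In this sketch the repeated hold-out is simulated with ShuffleSplit (an assumption about how one would implement it), and validation_counts is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

n_samples, k = 20, 5
X = np.zeros((n_samples, 1))  # only the number of rows matters here

def validation_counts(splitter):
    """Count how often each sample index appears in a validation set."""
    counts = np.zeros(n_samples, dtype=int)
    for _, valid_idx in splitter.split(X):
        counts[valid_idx] += 1
    return counts

kfold_counts = validation_counts(KFold(n_splits=k, shuffle=True, random_state=0))
repeated_counts = validation_counts(
    ShuffleSplit(n_splits=k, test_size=1 / k, random_state=0)
)

print("K-fold:           ", kfold_counts)     # every sample appears exactly once
print("repeated hold-out:", repeated_counts)  # typically a mix of 0s, 1s and more
```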
Let’s apply this strategy to our breast cancer classification problem and select the best classifier with the K-fold strategy (check out the previous article for more details about this example).
Useful tools are KFold, which provides train/test indices to split the data into train/test sets, and cross_val_score, which evaluates a score by cross-validation. Let’s try the second one:
from sklearn.model_selection import cross_val_score
# The classifiers, Xtrain and ytrain are assumed to be already imported/defined (see the previous article)
for clf in [SVC(random_state=0), RandomForestClassifier(random_state=0), GaussianNB(),
            DecisionTreeClassifier(random_state=0), LogisticRegression(random_state=0),
            MLPClassifier(random_state=0)]:
    scores = cross_val_score(clf, Xtrain, ytrain, cv=5)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
>>>
Accuracy: 0.63 (+/- 0.01) --> SVC
Accuracy: 0.95 (+/- 0.05) --> RandomForestClassifier
Accuracy: 0.93 (+/- 0.06) --> GaussianNB
Accuracy: 0.92 (+/- 0.06) --> DecisionTreeClassifier
Accuracy: 0.94 (+/- 0.10) --> LogisticRegression
Accuracy: 0.94 (+/- 0.06) --> MLPClassifier
This time the winner is the Random Forest classifier, and it is a better-informed and more reliable choice.
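For comparison, here is a sketch of the same kind of evaluation written by hand with KFold. It assumes the breast cancer data is the one shipped with scikit-learn (load_breast_cancer) and uses GaussianNB as a stand-in model; the original notebook may load and split the data differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# Assumption: scikit-learn's built-in breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, valid_idx in kf.split(X):
    # Fit on K-1 folds, score on the fold left out
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[valid_idx], y[valid_idx]))

print("Accuracy: %0.2f" % (sum(fold_scores) / len(fold_scores)))
```

Note that cross_val_score with an integer cv uses stratified folds for classifiers, so its numbers can differ slightly from a plain KFold loop like this one.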
But what happens if you are dealing with time series? Or with unbalanced data sets? Is a random split still appropriate?
Take a look at the next story to find out how to act in these particular cases and to discover some useful splitting strategies suitable for ML competitions!
You can find my notebook with the entire code here.