Cross-Validation: Estimator Evaluator

Salil Kumar
The Startup
4 min read · Nov 9, 2020


Cross-Validation Example for 10 folds [1]

Introduction

Often, we split our dataset into a training set and a test set using the train_test_split method of scikit-learn. The reason we split our data this way is that we are interested in measuring how well our model generalizes to new, unseen data. In this post, I’ll try to familiarize you with other ways of splitting your dataset and evaluating your model’s performance.
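As a quick refresher, here is a minimal sketch of such a split. The iris dataset and logistic regression model are illustrative choices, not part of any particular workflow:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing (scikit-learn's default).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```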

What is Cross-Validation?

Showing different Cross-Validation Strategies [2]

It’s basically a statistical method for evaluating generalization performance. In cross-validation, the dataset is split repeatedly and multiple models are trained, rather than relying on a single train/test split.

Some common versions of cross-validation are:

k-fold Cross-Validation

k-fold with k = 5 [3]

When performing k-fold cross-validation, the dataset is partitioned into k equal parts called folds. One fold is kept for testing and the other (k−1) folds are used for training the model. So the first model is trained on folds 2 to k and tested on the first fold. In the second iteration, another model is trained on all folds except the second one, which acts as the test set for the second model. The process is repeated k times. The value of k is specified by the user and usually ranges between 5 and 10.

The benefit of k-fold is that we get a range of accuracies, which gives us a sense of how the model might perform in the worst-case and best-case scenarios.

The main disadvantage of k-fold validation is the increased computational cost, as we are training k models instead of a single model.
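Here is a minimal sketch of k-fold cross-validation with scikit-learn; as before, the dataset and model are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# shuffle=True guards against data that is ordered by class
# (see the stratified k-fold discussion below).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)

print("Per-fold accuracies:", scores)  # one score per fold
print("Worst fold:", scores.min(), "Best fold:", scores.max())
```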

Stratified k-fold Cross-Validation

Stratified k-fold Cross-Validation [4]

Using plain k-fold cross-validation might not always be a good approach. It can fail when the samples in the dataset are ordered by class label: it is quite possible for a fold to contain only samples of one class while all the other classes end up in the training set. In that case, the model would score an accuracy of 0% on that fold.

In stratified k-fold, we split the data such that the proportions between the classes in each fold are the same as in the whole dataset.
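A minimal sketch with StratifiedKFold, using the same illustrative dataset and model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold keeps (roughly) the same class proportions as the full dataset.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=skfold)
print("Stratified per-fold accuracies:", scores)
```

Note that when you pass a classifier and an integer cv to cross_val_score, scikit-learn already uses stratified folds by default.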

Leave-one-out Cross-Validation

LOOCV Example [5]

In LOOCV, each fold is a single sample, i.e. we train the model on the whole dataset except for a single data point, which is held out as the test set. LOOCV can be very time consuming for a large dataset, but it sometimes provides better estimates on a small one.
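A minimal sketch with LeaveOneOut; note that this trains as many models as there are samples:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)  # one model per sample

print("Number of folds:", len(scores))  # equals the number of samples
print("Mean accuracy:", scores.mean())
```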

Shuffle-split Cross-Validation

Shuffle-split Cross-Validation [6]

Shuffle-split validation lets us control the number of iterations independently of the training and test set sizes. Also, by providing the train and test sizes as parameters (they do not necessarily need to sum to 1), we can test our model on a part of the dataset rather than the whole of it. This subsampling can be a useful way of experimenting with larger datasets.
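A minimal sketch with ShuffleSplit; the sizes here (50% training, 25% testing, 10 iterations) are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 10 random splits; train and test sizes deliberately sum to less than 1,
# so 25% of the data is left out of each iteration entirely.
shuffle_split = ShuffleSplit(n_splits=10, train_size=0.5,
                             test_size=0.25, random_state=0)
scores = cross_val_score(model, X, y, cv=shuffle_split)
print("Per-split accuracies:", scores)
```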

GroupKFold Cross-Validation

GroupKFold Cross-Validation Example for 4 groups [7]

In GroupKFold, the k folds are made according to the groups present in the dataset. The data belonging to one group acts as the test set while all the other groups act as the training set, and the process is repeated just like in k-fold. This is useful when related samples (for example, several records from the same person) must not be split between the training and test sets.

In the image above, GroupKFold is applied to a dataset containing 4 groups. As you can see, for each split each group lands entirely in either the training set or the test set.
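A minimal sketch with GroupKFold; the toy data and group labels are assumptions for illustration (in practice a group might identify, say, a patient or a speaker):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# 12 samples belonging to 4 groups (0-3), three samples per group.
X = np.random.RandomState(0).randn(12, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

model = LogisticRegression(max_iter=1000)
gkf = GroupKFold(n_splits=4)  # with 4 groups, each fold is one whole group
scores = cross_val_score(model, X, y, cv=gkf, groups=groups)
print("Per-fold accuracies:", scores)
```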

References

[1] Introduction to Support Vector Machines and Kernel Methods by Johar M. Ashfaque

[2] Introduction to Machine Learning with Python by Andreas C. Müller

[3] Medium post: K-Fold Cross Validation by Krishni

[4] Analyzing E-mail Text Authorship for Forensic Purposes by Malcolm Corney

[5] DataCamp: LOOCV module, Model Validation in Python

[6] Entry 18: Cross-Validation by Julie Fisher

[7] Introduction to Machine Learning with scikit-learn: Cross-Validation and Grid Search by Andreas C. Müller

P.S.: All images were taken from Google Images.

Conclusion

Congratulations! You now have theoretical knowledge of cross-validation and the various cross-validation strategies used in the data science field.

“What we learn with pleasure, we never forget.”

~Alfred Mercier
