A Complete Guide to Choosing the Correct Cross Validation Technique

Anant Kumar · Published in Analytics Vidhya · Nov 7, 2020 · 4 min read

This article is a complete guide to the various cross validation (CV) techniques that are often used during the development cycle of a predictive analytics solution, in either the regression or the classification setting. It also provides sample Python code for each CV technique mentioned, using a real dataset.

What is Cross Validation?

Cross validation is the process of validating a model to evaluate how accurately it fits the data while also ensuring that it does not overfit.

Overfitting: the model fits the training data (nearly) perfectly but does not generalize, i.e. it performs poorly on the test data.

From my experience, I'd say CV is an essential step in building production-ready ML models. (I consider data cleansing, preprocessing, and feature engineering to be part of the feature enrichment process rather than model building.)

Note: the appropriate cross validation technique depends on the underlying data distribution, so a good understanding of the available CV techniques is required.

Code: Read Data
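The original gist isn't reproduced in this text, and the dataset isn't named, so here is a minimal sketch assuming a hypothetical train.csv with numeric feature columns and a continuous target column:

```python
import pandas as pd

# Hypothetical file and column names; substitute your own dataset
df = pd.read_csv("train.csv")

print(df.shape)
print(df["target"].describe())  # continuous target for the regression setting
```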

Different Types of Cross Validation Techniques

K-Fold Cross Validation: This is the simplest of the CV techniques discussed here; the other techniques can be seen as variants of it.
The diagrams below give a clear picture of the different CV techniques.
The data is randomly shuffled and split into K unique folds with (mostly) equal numbers of samples. The model is trained successively on each set of K-1 folds while validating on the left-out fold, keeping track of the evaluation metric of every fitted model.

Fig (1): K-Fold Cross Validation
Code: K-Fold Cross Validation
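A minimal sketch using scikit-learn's KFold; the synthetic data (make_regression) and LinearRegression model are stand-ins, not necessarily what the original gist used:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data standing in for the article's dataset
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle, then split into 5 folds
fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Train on the K-1 folds, validate on the left-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print(f"MSE per fold: {np.round(fold_mse, 2)}")
print(f"Mean MSE: {np.mean(fold_mse):.2f}")
```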

Stratified K-Fold Cross Validation: A CV technique that can be considered a derived version of K-Fold CV. It keeps the ratio of the labels constant across folds, so every fold (and every K-1 training set) has essentially the same label distribution.
As you might have figured out, Stratified K-Fold CV is used primarily with skewed (imbalanced) datasets.

Fig (2): Stratified K-Fold Cross Validation
Code: Stratified K-Fold Cross Validation
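A minimal sketch with StratifiedKFold on a synthetic imbalanced classification dataset (the regression variant, via binning, appears at the end of the article); the data and model here are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic skewed dataset: roughly 90/10 class imbalance
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
# split() takes y so it can keep the label ratio constant in every fold
for train_idx, val_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print(f"Mean F1 across folds: {np.mean(scores):.3f}")
```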

Leave-One-Out Cross Validation: This CV technique trains on all samples except one. It is K-Fold CV with K = N, where N is the number of samples in the data.
Since training and validating a model on all N possible splits is costly in compute, this CV technique is preferable only when working with small datasets.

Fig (3): Leave-One-Out Cross Validation
Code: Leave-One-Out Cross Validation
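A minimal sketch with LeaveOneOut on a deliberately small synthetic dataset (N = 50), since the technique fits one model per sample; the dataset and model are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut

# Small synthetic dataset: LOO trains N models, so keep N modest
X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=42)

loo = LeaveOneOut()  # equivalent to KFold(n_splits=len(X))
errors = []
for train_idx, val_idx in loo.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append(mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

print(f"Mean absolute error over {len(errors)} fits: {np.mean(errors):.3f}")
```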

Group K-Fold Cross Validation: A CV technique that creates train and validation splits such that the same group never appears in two different folds, while the number of distinct groups is roughly the same in each fold. It is useful when samples are not independent, e.g. when one user or patient contributes several rows.
Group K-Fold CV is commonly used with large datasets, often in conjunction with stratified sampling.

Fig (4): Group K-Fold Cross Validation
Code: Group K-Fold Cross Validation
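A minimal sketch with GroupKFold; the group ids here are hypothetical (imagine 20 users, each contributing several samples):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
# Hypothetical group ids, e.g. 20 users with multiple samples each
groups = np.random.RandomState(42).randint(0, 20, size=len(X))

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    # No group appears in both the train and validation sets of a split
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    print(f"Fold {fold}: {len(set(groups[val_idx]))} distinct validation groups")
```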

Hold-Out Based Cross Validation: A CV technique that differs from the others in that the data is split into two sets rather than K folds:
a. Train set: the model is trained on this dataset.
b. Hold-out set: the model is validated on this dataset. It must maintain the same ratio of labels as the train set.
This CV technique is commonly used with large datasets, where the aforementioned CV techniques would be computationally intensive.

Fig (5): Hold-Out Cross Validation
Code: Hold-Out Cross Validation
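A minimal sketch using train_test_split on an assumed synthetic classification dataset; stratify=y is what keeps the label ratio constant in the hold-out set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for a large real one
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=42)

# stratify=y keeps the label ratio identical in the train and hold-out sets
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_hold.shape)
print(y_train.mean(), y_hold.mean())  # roughly equal positive-class ratios
```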

For simplicity, the code in this article focuses on the regression setting, since the scikit-learn library can be used directly in the classification setting.

Note: in the regression setting, K-Fold CV, Group K-Fold CV, and Leave-One-Out CV can be used as is. To use Hold-Out Based CV, Group K-Fold CV with stratified sampling, or Stratified K-Fold CV, the continuous target variable should first be divided into bins and then handled the same way as class labels in the classification setting.

The code shared below includes this binning step.
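The article doesn't prescribe a binning rule, so this sketch uses quantile bins with the bin count from Sturges' rule (one common heuristic, assumed here), then stratifies on the binned target:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=42)

# Bin count from Sturges' rule -- an assumption, not prescribed by the article
n_bins = int(np.floor(1 + np.log2(len(y))))
# Quantile bins keep every bin populated, so stratification never fails
y_binned = pd.qcut(y, q=n_bins, labels=False)

# Stratify on the bins; the model is still trained on the continuous target y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_binned)):
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")
```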

Cheers!

Happy Learning!


Anant Kumar | Machine Learning & Deep Learning Practitioner | Learning is Continuous | GitHub: https://github.com/anant-kumar-0308