Cross Validation Techniques

Faizan Ansari
Analytics Vidhya
Jun 27, 2021

This write-up explains different cross-validation techniques and their implementations in Python.


Cross-validation is a method of estimating the expected prediction error of a model.

When choosing a machine learning model, we need to compare candidate models to see how each performs on our dataset. This goes beyond picking the best model: it also helps you explain your choice to the people you work with.

You need to know how reliable the models you are working with actually are. The need for cross-validation arises because data is usually limited, and training and testing on the same portion of the data does not give an accurate view of how a model performs.

If I have trained a model to predict something, I don't want to test it on the same data I used for training; I want to see how it performs on a separate portion of the dataset that was not used in training. Otherwise I am just fooling myself by testing the model on data it has already seen.

You need to see how well your model fits your training data versus how well it fits new, incoming data. A model trained and evaluated on the same data eventually learns that data too well and fails on new data; this is called overfitting. The opposite problem is underfitting: the model does not fit the training data properly, fails to find patterns in it, and therefore cannot find patterns in new data either, so it under-performs on both known and unseen data.

What is cross-validation?

“Cross-validation in machine learning is a technique that is used to train and evaluate our model on a portion of our dataset, before re-portioning our dataset and evaluating it on the new portions.” In simpler words: instead of splitting the dataset into two parts (one for training and one for testing), split it into multiple portions, use some of them for training and the rest for testing, and rotate. This ensures that the model is trained and tested on different data at every step.

Steps in cross-validation:

Step 1: Split the data into train and test sets and evaluate the model’s performance; call this ‘Measure 1’. (The pink blocks represent training data and the yellow block represents testing data.)

Step 1 (source : https://www.simplilearn.com/ice9/free_resources_article_thumb/7-step1.JPG)

Step 2: Split the data again into new train and test sets and re-evaluate the model’s performance. Just as we obtained ‘Measure 1’, we obtain ‘Measure 2’, ‘Measure 3’, and so on.

step 2 (source : https://www.simplilearn.com/ice9/free_resources_article_thumb/8-step-2ml.JPG)
Repeating step 2 of cross-validation (source : https://www.simplilearn.com/ice9/free_resources_article_thumb/9-repeating-ml.JPG)

Step 3: To get the actual performance metric, take the average of all the measures.

Step 3 (source : https://www.simplilearn.com/ice9/free_resources_article_thumb/10-step3.JPG)
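
These three steps map directly onto scikit-learn’s cross_val_score helper. Below is a minimal sketch; the library, the logistic-regression model, and the built-in iris dataset are my assumptions, since the article has not named any of them at this point.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Steps 1 and 2: train and evaluate on five different train/test splits.
scores = cross_val_score(model, X, y, cv=5)

# Step 3: average the measures to get the final performance estimate.
print("Measures:", scores)
print("Average :", scores.mean())
```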

Types of Cross-Validation:

1. Hold Out Validation Approach : The simplest split. We divide the dataset into a train set and a test set, train our model on the training set, and validate it on the test set (this is the technique we generally use). A minimal sketch follows the figure below.
Hold Out Validation Approach (source : https://editor.analyticsvidhya.com/uploads/62390hold%20out_datavedas.jpg)
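
A minimal sketch of the hold-out approach, assuming scikit-learn and the iris dataset (the article does not specify either):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# A single hold-out split: 70% of the data for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```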

2. K-Fold Cross Validation : Depending on the random state, the hold-out approach generates different random train-test splits, so we may get a different accuracy each time, and a single lucky or unlucky split can make the model look better or worse than it really is. To overcome this we use K-fold cross-validation, in which we make k splits: the dataset is divided into k consecutive folds (without shuffling, by default). Each fold is then used once as a validation set while the remaining k-1 folds form the training set. A sketch follows the figure below.

example K-Fold Cross Validation when k =5 (source : http://ethen8181.github.io/machine-learning/model_selection/img/kfolds.png)
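
A sketch of K-fold cross-validation with k = 5, again assuming scikit-learn and the iris dataset. Note that KFold does not shuffle by default, so shuffling is enabled here explicitly:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five folds; each fold is used once as the validation set while the
# remaining four folds form the training set.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```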

3. Stratified K-fold Cross Validation : This method is useful when there are minority classes in our data. While partitioning the data, some test sets may include instances of a minority class while others include none; when this happens, our accuracy will not properly reflect how well minority classes are being predicted. To overcome this, the data is split so that each portion has the same percentage of every class present in the full dataset. A sketch follows the figure below.

Stratified K-fold Cross Validation(Source : https://raw.githubusercontent.com/satishgunjal/images/master/Stratified_KFold_Cross_Validation.png)
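
A sketch of stratified K-fold cross-validation under the same assumptions (scikit-learn, iris):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold keeps (approximately) the same class proportions as the
# full dataset, so minority classes appear in every test set.
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skfold)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```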

4. Leave One Out Cross Validation (LOOCV) : Leave-one-out cross-validation is a special case of cross-validation where the number of folds equals the number of instances in the dataset. The learning algorithm is applied once for each instance, using all other instances as the training set and the selected instance as a single-item test set; this process is closely related to the statistical method of jack-knife estimation. (We select one data point as the test data, train on the rest of the dataset, and use the trained model to predict that single point.) A generalized version of LOOCV is LPOCV (Leave P Out Cross Validation), which leaves p points out for testing instead of one. A sketch follows the figure below.

LOOCV (source : https://slideplayer.com/slide/13519285/82/images/26/Leave-one-out+Cross+Validation.jpg)
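
A sketch of LOOCV under the same assumptions (scikit-learn, iris):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fold per sample: train on n-1 points, test on the point left out.
# LeavePOut(p=...) from the same module generalizes this to LPOCV.
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print("Number of fits:", len(scores))   # equals the number of samples
print("Mean accuracy :", scores.mean())
```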

5. Repeated Random Test-Train : This technique is a hybrid of the traditional train-test split and the k-fold cross-validation method. We create a random train-test split of the data, evaluate the algorithm, and then repeat the splitting and evaluation multiple times, just as in cross-validation. A sketch follows below.
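
A sketch of repeated random test-train splitting, which I am assuming maps onto scikit-learn’s ShuffleSplit (the article does not name the class):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Ten independent random 70/30 train-test splits, evaluated like k-fold.
shuffle = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(model, X, y, cv=shuffle)
print("Split accuracies:", scores)
print("Mean accuracy   :", scores.mean())
```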

Python code for the cross-validation techniques discussed above:
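
To pull the techniques together, here is a minimal sketch that runs the splitting strategies above side by side on the same model and data (again assuming scikit-learn, a logistic-regression model, and the iris dataset, none of which the article specifies):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedKFold,
    cross_val_score,
)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One splitting strategy per technique discussed above.
strategies = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "Leave One Out": LeaveOneOut(),
    "Repeated Random Test-Train": ShuffleSplit(n_splits=10, test_size=0.3, random_state=42),
}

for name, cv in strategies.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:28s} mean accuracy = {scores.mean():.3f}")
```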


Faizan Ansari
Junior Research Fellow, Indian Statistical Institute, Kolkata