Deeply Explained Cross-Validation in ML/AI

Shachi Kaul · Published in Analytics Vidhya · May 1, 2020 · 7 min read

Blog Milestones

  • Background info
  • Why cross-validation?
  • What is cross-validation?
    - Types
    - Applications
  • Code Implementation
    - Manually
    - Scikit-learn Library
  • Recommendation and Inferences

Presenting an interesting lesson on how to handle the training and evaluation of an ML/AI model. This blog will help answer which model to select, and which tuned parameters let your model perform best. So..
Ready! Get! Set! Go!

Background Info

In building any ML model, certain steps are commonly followed, such as data preprocessing, partitioning data into train/test, training, and evaluating. Technically, training is performed on the train set, the model is tuned on the validation set, and it is evaluated on the test set. It is often seen that different subsets of the same dataset yield different metric scores, which creates uncertainty about model performance. Hence, cross-validation comes into the picture for an accurate estimate of the model.

Why Cross-Validation?

To create a model, training is performed on train data and testing on test data, which requires the whole dataset to be divided. This can be done in the following ways.

1. Use whole data as train/test
The model uses the whole data to train and also to test. Makes sense? Oh c'mon, not at all! Think of it this way: you studied 10 questions for your exam, and your knowledge is assessed on those very same 10 questions. Lol, you will get 100%. A tiny sketch of this trap is shown below.
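A minimal sketch of why this fails (the Iris dataset and a decision tree are assumptions purely for illustration; neither appears in the original post):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X, y)           # train on the whole data...
    print(model.score(X, y))  # ...and test on the same data: a perfect 1.0

The perfect score says nothing about how the model performs on unseen data.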

2. Split data into train and test
Data is divided into train and test sets in some ratio. This can be achieved using the train_test_split function of sklearn.

Using random_state yields the same set of random numbers every time you run the code, hence your results won't change unless you change the random_state value. Without random_state, you will get a different score on each run of the code, so you aren't sure how exactly your model will perform on unseen data. A sketch of both cases follows.
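A minimal sketch of the two cases (the Iris dataset and logistic regression are illustrative assumptions, not from the original post):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Fixed random_state: the same split, hence the same score, on every run
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))

    # No random_state: a different split, and possibly a different score, each run
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(model.score(X_test, y_test))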
This uncertainty in results brings cross-validation into the picture.

3. Cross-Validation
Data is still divided into train/test just like with train_test_split, except that training and evaluation are repeated on different subsets of the data.
It's like this: 5 times, your data gets split into train/test, and the model is trained and evaluated each time. Each split is done differently from the earlier ones, and the average of all 5 evaluation results is taken. Still not clear? Don't worry, an in-depth explanation awaits.

Let's dig in deeper.

What is Cross-Validation?

Cross-Validation is basically a resampling technique to make our model confident about its efficiency and accuracy on unseen data. In short, it is a model validation technique, and it is up for other applications too.

A bunch of train/test splits, a testing accuracy for each split, then the average of them all.

Quick steps:
1: Divide the data into K partitions (folds) of equal size.
2: Treat fold 1 as the test fold and the remaining K-1 folds as train folds.
3: Compute the score on the test fold.
4: Repeat steps 2-3 for every fold, each time taking another fold as test and the rest as train.
5: Take the average of the scores of all the folds.

It partitions our data into train/test in such a way that no previous test set repeats, and each fold is used for testing exactly once. Each partition (aka fold) is of equal size. You could say every datapoint plays both train and test in its life journey. This is the basic functionality of cross-validation, which is useful while tuning hyper-parameters or selecting an ML model, such as logistic regression versus decision trees for a classification problem.

It helps prevent over-fitting and under-fitting when an optimal value of K is chosen, and it gives a more accurate model estimate than the train_test_split method.

Types of Cross-Validation

Figure-1: Commonly used types of cross-validation

Illustrated above are the commonly used types. Let's get to know them.

  • Leave-one-out Cross-Validation (LOOCV):
    This is a very old technique, largely replaced by k-fold and stratified k-fold, but still useful in certain scenarios. Data is partitioned into blocks of 1 record each; in each iteration one block serves as the test set while the remaining records serve as train. Each and every record is treated as test once, with that many iterations to evaluate them all.
    Let's pitch an example.
    Data of 20 records means 20 partitions, each having one record. This leads to 20 iterations of training and evaluating.
Figure-2: LOOCV on 20 records, one test record per iteration

In the 1st iteration, the model is tested on the 1st block (fold) and trained on the remaining 19 folds, giving out an accuracy. In the next iteration, another block holding the next record is the test set while the rest are train, giving another accuracy. LOOCV performs testing and training on all 20 records across 20 iterations, hence the expensive computational cost. The average of all the scores is then computed and returned as the final accuracy, aka the LOOCV score.

Merits:
- It gives a good degree of certainty about model performance when tested on unseen data.

Limitations:
- Compared to train_test_split, it takes much more computational power.
- Since every single record is tested in turn, results can show higher variance; for instance, when an outlier datapoint is the test record, its score swings the estimate, which carries over to poor estimates of performance on unseen production data. A runnable sketch follows.
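A minimal LOOCV sketch using sklearn's LeaveOneOut (the Iris dataset and logistic regression are illustrative assumptions):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)   # 150 records -> 150 iterations
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=LeaveOneOut())
    print(len(scores))    # one score (0 or 1) per record
    print(scores.mean())  # the final LOOCV accuracy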

  • K-Fold Cross-Validation:

A variant of cross-validation where data is divided into partitions for train/test based on "K". Here, K refers to an integer, while a fold is a partition (one per iteration). The model performs training on K-1 partitions and testing on the Kth partition of the data.
Example for 4-fold cross-validation:
Data of 20 records, given 4 folds, is divided into 4 partitions. Each partition has (20/4 =) 5 records.

Figure-3: 4-fold cross-validation on 20 records

In the 1st iteration, the model is tested on the 1st block and trained on the remaining 3 blocks, resulting in an accuracy. In the next iteration, another block becomes the test set while the rest are train, resulting in another accuracy. This process of dividing and evaluating is done for all 4 folds. The average of all the scores is computed and returned as the final accuracy, which is treated as the model accuracy.

Merits:
- Overcomes the computational-cost problem of LOOCV to some extent.
- Since not every single record is treated as a test set as in LOOCV, the model may not be affected much if an outlier is present in the data. This overcomes the problem of variability.

Limitations:
- In any iteration, there is a possibility that the test set holds records of just one class. This creates class imbalance and impacts our model, as the sketch below shows.
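A small sketch of that limitation (assuming the Iris dataset, whose 150 rows happen to be sorted by class, which makes the failure easy to see):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold

    X, y = load_iris(return_X_y=True)   # rows sorted by class: 50 + 50 + 50
    kf = KFold(n_splits=3)              # no shuffling
    for train_idx, test_idx in kf.split(X):
        # class counts in the test fold: each fold holds a single class!
        print(np.bincount(y[test_idx], minlength=3))

Each test fold contains only one class, so the model is evaluated on a class it never trained on.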

  • Stratified K-Fold Cross-Validation:
    This is an improved version of K-Fold where each fold now has the same percentage of samples of each target class. Say we have binary classification with dependent classes 1/0. Things go wrong when only records of class 1 fall into the test set and the model is trained and evaluated on them, resulting in a data-imbalance situation. Thus, stratification comes into the picture (see the sketch below).
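A minimal sketch of the stratified splits (same illustrative Iris setup as above):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedKFold

    X, y = load_iris(return_X_y=True)
    skf = StratifiedKFold(n_splits=5)
    for train_idx, test_idx in skf.split(X, y):   # stratification needs y
        # every test fold holds 10 samples of each of the 3 classes
        print(np.bincount(y[test_idx]))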

Applications of Cross-Validation

  • Improving a model via hyper-parameter tuning
  • Comparing models to help in choosing one
  • Selecting the best features for a model

A sketch of the first two applications follows; for the implementation, refer to the code here.
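A hedged sketch of hyper-parameter tuning and model comparison with cross-validation (GridSearchCV, the Iris dataset, and the candidate models are illustrative choices, not from the original post):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Application 1: tune a hyper-parameter, scoring each candidate by 5-fold CV
    grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        param_grid={"max_depth": [2, 3, 4, 5]}, cv=5)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)

    # Application 2: compare two models on the same 5-fold splits
    for model in (LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(random_state=0)):
        print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())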

Code Implementation

The crux of cross-validation is to evaluate each fold acting as the test set and average the results. The whole process can be done either manually via a for loop or using sklearn's cross_val_score function.

Manually

As demonstrated below:
- The KFold class is instantiated and the number of folds is specified via n_splits, hence 5-fold CV.
- A loop iterates over the train/test indices from KFold.split().
In each iteration:
- data is split into one train/test set (just like train_test_split),
- Logistic Regression is trained on the train folds,
- the accuracy score of the test fold is calculated.
The score for the test fold in each iteration is appended, and the cross-validation score is computed by taking the mean of all the scores.
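A minimal reconstruction of that loop (the original embedded code did not survive; the Iris dataset and the shuffle/random_state settings are assumptions):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import KFold

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=5, shuffle=True, random_state=42)   # 5-fold CV

    scores = []
    for train_idx, test_idx in kf.split(X):
        # one train/test split per fold, just like train_test_split
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))

    print(scores)            # accuracy of each test fold
    print(np.mean(scores))   # the cross-validation score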

Scikit-learn Library

Python's scikit-learn library provides cross-validation utilities to partition the data and compute the average score.
cross_val_score is a scikit-learn function that returns the score for each test fold, i.e., a list of accuracy scores, one per iteration. The k-fold score is their average, deduced by taking the mean of the cross_val_score result.
cross_val_predict is another scikit-learn function; it returns the predicted value of each record from its turn in the test fold.

By default, an integer cv parameter assumes Stratified K-Fold for classification; as above, cv=5 means Stratified 5-fold. To use plain K-Fold, pass cv=KFold(n_splits=5).
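A minimal sketch of both functions (same illustrative Iris/logistic-regression setup as before):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    scores = cross_val_score(model, X, y, cv=5)    # integer cv -> Stratified 5-fold
    print(scores)          # one accuracy per test fold
    print(scores.mean())   # the k-fold score

    preds = cross_val_predict(model, X, y, cv=5)   # one out-of-fold prediction per record
    print(preds[:10])

    # Plain (non-stratified) K-Fold instead of the stratified default:
    print(cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean())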

Recommendation and Inferences

  • Use Stratified K-Fold when a classification problem comes into the picture.
  • Choose the K value wisely to maintain the balance of bias and variance required for any model to perform well:
    a lower K leads to more bias, less variance (underfit-like estimates);
    a higher K leads to less bias, more variance (overfit-like estimates).
  • If the data itself is very small, k-fold won't make much sense; LOOCV can be brought in for this kind of scenario.

Please find my GitHub code to consolidate the concepts covered in this blog.


Feel free to follow this author if you liked the blog; I assure you I'll be back with more interesting ML/AI tech. Also, feel free to let me know if there are any mistakes in understanding or in the concepts.

Thanks,

Happy Reading! :)

You can get in touch via LinkedIn.
