What is Cross Validation and When to use Which Cross Validation

Anurag Lahon

Which cross-validation technique to use depends on the dataset.

Cross-validation is a step in the process of building machine learning models that ensures we do not overfit and that our model fits the data accurately.

We will use the Heart Disease UCI dataset as an example.

The training accuracy is 92.61% and the testing accuracy is 73%. The model fits the training dataset almost perfectly but performs poorly on the testing dataset. This is called overfitting: the model learns the training data well but does not generalize to new data. When the test loss increases while we keep improving the training loss, that is also a sign of overfitting.
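
Here is a minimal sketch of how such a train/test gap shows up in practice. The file name heart.csv, the target column name, and the choice of a decision tree classifier are assumptions for illustration; the article does not specify which model produced the numbers above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Heart Disease UCI dataset; file name and "target" column are assumed
df = pd.read_csv("heart.csv")
X, y = df.drop(columns="target"), df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
# A large gap between the two scores is the overfitting symptom described above.
```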

Some popular types of Cross Validation techniques:

  1. K-fold cross-validation
  2. Stratified K-fold cross-validation
  3. Hold-out based cross-validation
  4. Leave-one-out cross-validation
  5. Group K-fold cross-validation

Cross-validation means dividing the training data into a few parts. We train the model on some parts and test it on the remaining parts.

K-fold cross-validation: We divide the data into k different sets that are mutually exclusive. We can use this process with almost all kinds of datasets.
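
A minimal sketch of k-fold cross-validation with scikit-learn. The synthetic data and logistic regression model are stand-ins for whatever dataset and model you are using.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic data as a stand-in for any classification dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))
```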

Stratified k-fold cross-validation: Suppose we have a skewed dataset for binary classification with 90% positive samples and 10% negative samples. If we use plain k-fold cross-validation, we can end up with folds that contain almost only positive samples. Hence, we should use stratified k-fold cross-validation, which keeps the ratio of targets in each fold constant: every fold will have 90% positive and 10% negative samples, so results will be similar across folds. To use stratified k-fold for a regression problem, we first divide the target into bins and then apply stratified k-fold as if it were a classification problem. If the dataset is large (>10k samples), we do not need to worry much about the number of bins. If we do not have a lot of data, we can use a simple rule such as Sturge's rule to choose the number of bins.
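
A minimal sketch of stratified k-fold on an imbalanced classification dataset; the roughly 90/10 class balance is generated synthetically to mirror the example above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced data: roughly 90% positive / 10% negative samples
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=42)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    pos_ratio = y[valid_idx].mean()
    print(f"fold {fold}: positive ratio in validation fold = {pos_ratio:.2f}")
# Every fold keeps roughly the same 90/10 class ratio as the full dataset.
```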

Sturge’s rule:

Number of Bins = 1 + log2(N)

where N is the number of samples.
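
Here is a sketch of the regression case: bin a continuous target using Sturge's rule, then stratify on the bin labels. The uniform target is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y = rng.uniform(0.0, 100.0, size=500)          # any continuous regression target
n_bins = int(np.floor(1 + np.log2(len(y))))    # Sturge's rule
bins = pd.cut(y, bins=n_bins, labels=False)    # bin index for each sample

# Features are irrelevant to the split itself, so a placeholder array works here
X_placeholder = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(skf.split(X_placeholder, bins)):
    print(f"fold {fold}: {len(valid_idx)} validation samples")
```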

Hold-out based cross-validation: If we have a large amount of data, we can use hold-out based cross-validation. Depending on the algorithm we choose, training and validating on many folds can be very expensive for a large dataset, so we hold out a single validation set instead. It is also frequently used for time-series datasets.
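
A minimal sketch of a hold-out split, plus a positional split for time-series data so the validation set always comes after the training set in time. The dataset size and split fractions are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# A simple hold-out split: one training set, one validation set
X, y = make_classification(n_samples=50_000, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# For a time series (rows ordered by time), split by position instead of at random
cutoff = int(0.9 * len(X))
X_train_ts, X_valid_ts = X[:cutoff], X[cutoff:]
y_train_ts, y_valid_ts = y[:cutoff], y[cutoff:]
```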

Group k-fold cross-validation: We use group k-fold when we have groups that we do not want split across the training and test sets. For example, if our data includes multiple rows for each customer (but it still makes sense to train on individual transactions/rows), and our production use case involves making predictions for new customers, then testing on rows from customers that also have rows in the training set may give optimistically biased results.
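
A minimal sketch of group k-fold where the grouping column is a hypothetical customer_id; the synthetic transactions are only for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_rows = 200
X = rng.normal(size=(n_rows, 5))                 # one row per transaction
y = rng.integers(0, 2, size=n_rows)
customer_id = rng.integers(0, 40, size=n_rows)   # several rows per customer

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(gkf.split(X, y, groups=customer_id)):
    overlap = set(customer_id[train_idx]) & set(customer_id[valid_idx])
    print(f"fold {fold}: customers shared between train and validation = {len(overlap)}")
# Each customer's rows land entirely in train or entirely in validation, never both.
```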

Leave-one-out cross-validation: Rather than choosing one model, the usual approach is to fit the model to all of the data and use LOO-CV to provide a slightly conservative estimate of the performance of that model.

For every i = 1, . . . , n, we train the model on every point except i and compute the test error on the held-out point. We then average the n test errors:

CV error = (1/n) * (e1 + e2 + … + en)

where ei is the test error on held-out point i.

It can be computationally expensive, since it involves fitting the model n times.
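
A minimal sketch of LOO-CV with scikit-learn; the small synthetic dataset and logistic regression model are assumptions that keep the n model fits cheap.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

X, y = make_classification(n_samples=60, n_features=5, random_state=42)

loo = LeaveOneOut()
errors = []
for train_idx, test_idx in loo.split(X):
    # Fit on all points except one, then score the single held-out point
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    errors.append(int(model.predict(X[test_idx])[0] != y[test_idx][0]))

print("LOO-CV error estimate:", np.mean(errors))  # average of the n held-out errors
```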

Cross-validation is one of the most essential steps when building machine learning models. The right cross-validation technique depends on the data, and we may need to adopt new techniques depending on the data and the problem.

