Cross-Validation Techniques

This article aims to explain different Cross-Validation techniques and how they work.

Abhigyan
Geek Culture
13 min read · Aug 30, 2021


Contents:

→ Introduction
→ What is Cross-Validation?
→ Different Types of Cross-Validation

1. Hold-Out Method
2. K-Folds Method
3. Repeated K-Folds Method
4. Stratified K-Folds Method
5. Group K-Folds Method
6. Shuffle Split Method
7. Stratified Shuffle Split Method
8. Group Shuffle Split Method
9. Leave-One-Out Method
10. Leave-P-Out Method
11. Leave-One-Group-Out Method
12. Leave-P-Groups-Out Method
13. Time Series Cross-Validation Method
14. Blocked Cross-Validation Method
15. Nested Cross-Validation Method

→ Conclusion
→ Reference

Introduction

Imagine building a model on a dataset only to watch it fail on unseen data.
We cannot simply fit the model on our training data and sit back hoping it will perform brilliantly on real, unseen data.
This is a case of over-fitting: the model has learned not only the patterns but also the noise of the training data. To avoid this, we need some way to verify that our model has captured most of the patterns without picking up every bit of noise in the data (low bias and low variance). One of the many techniques to handle this is Cross-Validation.

What is Cross-Validation?

  • In machine learning, Cross-Validation is a technique that evaluates a model by training it several times on different subsets of the input data and evaluating it each time on the complementary subset.
  • It is mainly used to estimate any quantitative measure of fit that is appropriate for both the data and the model.
  • Because the data used for training and testing do not overlap within any split, the test results from cross-validation are usually not biased.
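
To make this concrete, here is a minimal sketch that scores a model with 5-fold cross-validation. The toy dataset and classifier (make_classification, LogisticRegression) are assumptions added for illustration; they are not part of the example data used in the rest of this article:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical numeric toy data, used only for the model-fitting sketches
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)

# Each of the 5 iterations trains on 4/5 of the data and scores on the held-out 1/5
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=5)
print(scores, scores.mean())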

Let’s first create the variables I will be using to demonstrate further, along with the imports used throughout:

import numpy as np
import pandas as pd
from sklearn import model_selection

data = ['Subset1', 'Subset2', 'Subset3', 'Subset4', 'Subset5',
        'Subset6', 'Subset7', 'Subset8', 'Subset9', 'Subset10']
Y = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
df = {"data": data, "Y": Y}
df = pd.DataFrame(df)
df

Different methods of Cross-Validation are:

→ Hold-Out Method:

  • It is a simple train-test split method.
  • Once the train-test split is done, we can further split the test data into validation data and test data.
    For example:
    1. Suppose there are 1,000 data points; we split them into 80% train and 20% test.
    2. The train set then consists of 800 data points and the test set of 200 data points.
    3. Then we split our test data into 50% validation data and 50% test data (a sketch of this second split follows the output below).
x_train, x_test, y_train, y_test = model_selection.train_test_split(df.data, df.Y, test_size=0.2)

for i, n in zip(x_train, y_train):
    print(i, "::", n)

for i, n in zip(x_test, y_test):
    print(i, "::", n)
Train Set
Test Set
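
The code above performs only the first split. A minimal sketch of step 3, splitting the held-out test portion in half into validation and final test sets (the names x_val, y_val, x_test_final, y_test_final are introduced here for illustration):

# Split the 20% test portion 50/50 into validation and final test sets
x_val, x_test_final, y_val, y_test_final = model_selection.train_test_split(
    x_test, y_test, test_size=0.5)
print(len(x_val), len(x_test_final))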

Notice that the test set contains only one class; this can lead to biased results.

Using Stratify Parameter

Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole.
For example, in a binary classification problem where each class makes up 50% of the data, it is best to arrange the data such that in every fold, each class comprises about half the instances.

x_train, x_test, y_train, y_test = model_selection.train_test_split(df.data, df.Y, test_size=0.2, stratify=df.Y)

for i, n in zip(x_train, y_train):
    print(i, "::", n)

for i, n in zip(x_test, y_test):
    print(i, "::", n)
Train Set
Test Set

→ K-Folds Method:

  • In this method, we split the dataset into k subsets (known as folds), train on k-1 of them, and leave one subset out for evaluating the trained model.
  • We iterate k times, with a different subset reserved for testing each time.
  • This ensures that every observation from the original dataset has the chance to appear in both the training and the test set.
  • The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate; a sketch of this averaging follows the split demos below. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
kfold = model_selection.KFold(n_splits=5)

print("Train", "||", "Test")
for train, test in kfold.split(df.data, df.Y):
    print(train, "||", test)

tn = []
tt = []
for train, test in kfold.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
kfold_df = pd.DataFrame({"train": tn, "test": tt})
kfold_df

Now, if we pass in the shuffle parameter:

kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=1)

print("Train", "||", "Test")
for train, test in kfold.split(data):
    print(train, "||", test)

We can see that the splits are no longer in an ordered manner; this is because the data is shuffled before being separated.

tn = []
tt = []
for train, test in kfold.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
kfold_df = pd.DataFrame({"train": tn, "test": tt})
kfold_df
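
As mentioned in the bullet list above, the k per-fold results are usually averaged into a single estimate. A minimal sketch of that averaging, reusing the hypothetical toy dataset (X_toy, y_toy) and classifier from the earlier sketch, since the string subsets above cannot be fed to a model directly:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=1)

fold_scores = []
for train_idx, test_idx in kf.split(X_toy):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LogisticRegression().fit(X_toy[train_idx], y_toy[train_idx])
    fold_scores.append(accuracy_score(y_toy[test_idx], model.predict(X_toy[test_idx])))

print(fold_scores, np.mean(fold_scores))  # single averaged estimate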

→ Repeated K-Folds Method:

  • The Repeated K-Fold method runs K-Fold Cross-Validation and repeats it n times, where n is chosen by the user.
  • A single run of the k-fold cross-validation procedure may produce a noisy estimate of model performance:
    different splits of the data can bring about totally different outcomes.
  • A noisy estimate of model performance makes it harder to compare models and select a final one to address the problem.
  • One way to reduce the noise in the estimated model performance is to increase the k-value; however, this increases the variance of the estimate.
  • A better solution is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats (a sketch of this follows the split demo below).
Rkfold = model_selection.RepeatedKFold(n_splits=5, n_repeats=5, random_state=2)

print("Train", "||", "Test")
for train, test in Rkfold.split(data):
    print(train, "||", test)

We can see that the data was separated into 25 train/test splits (5 n_splits × 5 n_repeats), and the splits vary from one repeat to the next.

tn = []
tt = []
for train, test in Rkfold.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
Rkfold_df = pd.DataFrame({"train": tn, "test": tt})
Rkfold_df
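
To report the mean performance across all folds and all repeats, a minimal sketch, again reusing the hypothetical toy dataset (X_toy, y_toy) from the earlier sketches:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rkf = model_selection.RepeatedKFold(n_splits=5, n_repeats=5, random_state=2)

# 25 scores in total (5 folds x 5 repeats); report their mean and spread
scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=rkf)
print(scores.mean(), scores.std())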

→ Stratified K-Fold Method:

  • The difference between K-Fold Cross-Validation and Stratified K-Fold is that K-Fold splits the data into k “random” folds, meaning each subset consists of data points picked and placed at random.
    Stratified K-Fold, on the other hand, splits the data into k folds while making sure each fold is an appropriate representative of the original data (class distribution, mean, variance, etc.).
  • The biggest issue in classification problems is class imbalance.
    If we use the K-Fold CV method on imbalanced data, training may become biased towards one class: because K-Fold draws the k subsets at random, there is a high chance of getting folds that consist mostly of the majority class. Stratified K-Fold handles this with the help of the stratification process described earlier (a quick check of the per-fold class counts follows the split demo below).
strkfold = model_selection.StratifiedKFold(n_splits=5)

tn_x = []
tn_y = []
tt_x = []
tt_y = []
for train, test in strkfold.split(data, Y):
    tn_x.append(np.take(data, train))
    tn_y.append(np.take(Y, train))
    tt_x.append(np.take(data, test))
    tt_y.append(np.take(Y, test))
strkfold_train = pd.DataFrame({"train_x": tn_x, "train_y": tn_y})
strkfold_test = pd.DataFrame({"test_x": tt_x, "test_y": tt_y})
Train Set
Test Set
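
To verify the stratification on the article’s own labels, a short check that prints the class counts in each test fold; with 5 samples of each class and 5 folds, every test fold ends up with one sample per class:

# Class counts (class 0, class 1) in each test fold
for train, test in strkfold.split(data, Y):
    print(np.bincount(np.take(Y, test)))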

→ Group K-Folds Method:

  • Group K-Folds is a method that takes into account the groups passed as a parameter.
  • The difference between the other K-Fold variants and Group K-Folds lies in how the data is split into the two sets: whole groups are held out together, so no group appears in both the training and the test set of the same split (the groups variable used below is defined right after this list).
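
The group-based examples below use a groups variable that the article does not show being created. A hypothetical assignment (one group label per data point, five groups of two) that is consistent with the splits described below:

# Hypothetical group labels, one per data point
groups = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]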
grpkfold = model_selection.GroupKFold(n_splits=5)

print("Train", "||", "Test")
for train, test in grpkfold.split(data, Y, groups=groups):
    print(train, "||", test)

→ Comparing the splits with the original data, we can see that the group at indices 8 and 9 was taken out first,
→ followed by the group at indices 6 and 7, and so on.

The splits are made on the basis of individual groups

tn = []
tt = []
for train, test in grpkfold.split(data, Y, groups=groups):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
grpkfold_df = pd.DataFrame({"train": tn, "test": tt})
grpkfold_df

→ Shuffle Split Method:

  • Repeated random subsampling validation, also referred to as Monte Carlo cross-validation, splits the dataset randomly into training and validation sets. Unlike k-fold cross-validation, the dataset is not divided into fixed groups or folds; each split is drawn at random.
  • The number of iterations is not fixed and is decided by analysis. The results are then averaged over the splits.
  • Random subsampling (e.g., bootstrap sampling) is preferable when the data is under-sampled, or when you do not want each observation to appear in exactly k-1 training folds.
  • The proportion of the train and validation splits does not depend on the number of iterations or partitions.
  • Some samples may never be selected for either training or validation.
  • It is not suitable for an imbalanced dataset.
shsplit = model_selection.ShuffleSplit(n_splits=5, test_size=0.2, random_state=3)

print("Train", "||", "Test")
for train, test in shsplit.split(data):
    print(train, "||", test)

Random Subsets are taken and separated

tn = []
tt = []
for train, test in shsplit.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
shsplit_df = pd.DataFrame({"train": tn, "test": tt})
shsplit_df

→ Stratified Shuffle Split Method:

Stratified Shuffle Split combines the Shuffle Split method with stratification: each random split preserves the class proportions of Y.

strshsplit = model_selection.StratifiedShuffleSplit(n_splits=5, test_size=0.2)

print("Train", "||", "Test")
for train, test in strshsplit.split(data, Y):
    print(train, "||", test)
tn_x = []
tn_y = []
tt_x = []
tt_y = []
for train, test in strshsplit.split(data, Y):
    tn_x.append(np.take(data, train))
    tn_y.append(np.take(Y, train))
    tt_x.append(np.take(data, test))
    tt_y.append(np.take(Y, test))
strshsplit_train = pd.DataFrame({"train_x": tn_x, "train_y": tn_y})
strshsplit_test = pd.DataFrame({"test_x": tt_x, "test_y": tt_y})
Train Set
Test Set

→ Group Shuffle Split Method:

grpshsplit = model_selection.GroupShuffleSplit(n_splits=5)

print("Train", "||", "Test")
for train, test in grpshsplit.split(data, groups=groups):
    print(train, "||", test)

→ Group Shuffle Split tends to repeat the same splits, because the held-out groups are chosen at random each time after shuffling the data.
→ Because of this, some groups may appear in the validation set several times while others never do.

Group shuffle split produces repetitive validation sets

tn = []
tt = []
for train, test in grpshsplit.split(data, groups=groups):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
grpshsplit_df = pd.DataFrame({"train": tn, "test": tt})
grpshsplit_df

→ Leave-One-Out Method:

  • Leave-One-Out Cross-Validation, or LOOCV, is a procedure used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
  • This approach leaves one data point out of the training data: if there are n data points in the original sample, n-1 points are used to train the model and the single remaining point is used as the validation set.
  • This is repeated for every way the original sample can be separated like this, and the error is then averaged over all n trials to give the overall effectiveness.
loo = model_selection.LeaveOneOut()

print("Train", "||", "Test")
for train, test in loo.split(data):
    print(train, "||", test)

Each separation takes a particular subset as a validation set, so if data is divided into n subsets, n splits will be made.

tn = []
tt = []
for train, test in loo.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
loo_df = pd.DataFrame({"train": tn, "test": tt})
loo_df

→ Leave-P-Out Method:

  • This is similar to the Leave-One-Out method; however, here we take p points out of the n data points in the dataset.
  • The model is trained on the remaining (n-p) data points and tested on the p held-out points, and this is repeated for every possible choice of the p points (C(n, p) splits in total).
lpo = model_selection.LeavePOut(2)

tn = []
tt = []
for train, test in lpo.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
lpo_df = pd.DataFrame({"train": tn, "test": tt})
lpo_df

→ Leave-One-Group-Out Method:

This works essentially like the Group K-Folds method: each group is left out once as the validation set. The visible difference in this example is the order of the splits; Group K-Folds happened to start from the last group, while Leave-One-Group-Out starts from the first.

logo = model_selection.LeaveOneGroupOut()

print("Train", "||", "Test")
for train, test in logo.split(data, groups=groups):
    print(train, "||", test)

→ Leave-P-Groups-Out Method:

  • It leaves p groups out: the model is trained on the remaining groups and tested on the p held-out groups, and this is repeated for every combination of p groups.
lpgo = model_selection.LeavePGroupsOut(2)

print("Train", "||", "Test")
for train, test in lpgo.split(data, groups=groups):
    print(train, "||", test)

→ Time Series Cross-Validation Method:

  • The order of the data is very important for time-series-related problems. For time-related datasets, a random or k-fold split of the data into train and validation sets may not yield good results.
  • For a time-series dataset, the split into train and validation sets is made according to time; this is also referred to as the forward chaining method or rolling cross-validation. At each iteration, the observations that follow the training window are treated as validation data.
tsplit = model_selection.TimeSeriesSplit(n_splits=9, max_train_size=10)

print("Train", "||", "Test")
for train, test in tsplit.split(data):
    print(train, "||", test)

The model is trained on sequential subsets and not on randomly selected subsets.

tn = []
tt = []
for train, test in tsplit.split(data):
    tn.append(np.take(data, train))
    tt.append(np.take(data, test))
tsplit_df = pd.DataFrame({"train": tn, "test": tt})
tsplit_df

→ Blocked Cross-Validation Method:

  • The Blocked Cross-Validation procedure is similar to the standard form described above, except that there is no initial random shuffling of observations.
  • In a time series, this renders K blocks of contiguous observations. The natural order of observations is kept within each block but broken across them.
  • Even so, splitting a time series this way may introduce leakage from future data into the model: the model can observe future patterns and try to memorize them.
  • That is why blocked cross-validation adds margins at two positions. The first margin sits between the training and validation folds, to prevent the model from observing lag values that are used twice, once as a regressor and once as a response.
  • The second margin sits between the folds used at each iteration, to prevent the model from memorizing patterns from one iteration to the next.
blocks = 2
n = len(data) // blocks

# Split the data into contiguous blocks, then split each block into train and test
for block in [data[i:i + n] for i in range(0, len(data), n)]:
    train, test = model_selection.train_test_split(block)
    print(train, "||", test)
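
Note that the snippet above only splits each contiguous block; it does not add the margins described in the bullets. A minimal sketch of a blocked split with a gap between the training and validation parts of each block (the margin size and the 80/20 layout are illustrative assumptions):

margin = 1                      # hypothetical gap, in observations, between train and validation
blocks = 2
n = len(data) // blocks

for start in range(0, len(data), n):
    block = data[start:start + n]
    split = int(len(block) * 0.8)       # first 80% of the block is the training region
    train = block[:split - margin]      # drop `margin` observations just before validation
    test = block[split:]
    print(train, "||", test)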

→ Nested Cross-Validation Method:

  • Nested Cross-Validation (Nested-CV) nests cross-validation inside cross-validation, combining hyperparameter tuning (inner loop) with model evaluation (outer loop); see the sketch after the example below.
  • It is used to evaluate the performance of a machine learning algorithm while also estimating the generalization error of the underlying model together with its hyperparameter search.
outer_cv = model_selection.KFold(n_splits=5)
inner_cv = model_selection.KFold(n_splits=4)

for train, test in outer_cv.split(data):
    tr = np.take(data, train)
    te = np.take(data, test)
    print(tr, "||", te)
    print("-----------------------------------")
    for in_train, in_test in inner_cv.split(tr):
        in_tr = np.take(tr, in_train)
        in_te = np.take(tr, in_test)
        print(in_tr, "||", in_te)
    print("<<<<<<<<<<----------------------------------->>>>>>>>>>>")

Conclusion

  • No single Cross-Validation approach is a sure-shot technique that works for every dataset; the right choice depends on the structure of your data.
  • For time-dependent data, Time Series and Blocked CV are the best ways to prevent over-fitting.

Reference

https://scikit-learn.org/stable/modules/cross_validation.html
