Cross Validation

Sahil Khandelwal
Published in The Startup
7 min read · Nov 13, 2020

In this article we will discuss cross-validation. I assume you are familiar with the basic terminology of machine learning.

Before going into cross validation, let me give you a scenario: once you are done training your model, can you assume that it is going to work well on data that it has not seen before? In other words, can you be sure that the model will have the desired accuracy and variance in the production environment?

Well, if you answered "not until I have tested it", then GREAT, you are on the right track, because cross validation (CV) is one of the techniques used to test the effectiveness of machine learning models.

To evaluate the performance of any machine learning model, we need to test it on some unseen data. Based on the model's performance on unseen data, we can say whether it is under-fitting, over-fitting, or well generalized.

So, does cross validation sound great to you?

If yes, then let's briefly look at the definition of cross validation.

Illustration of k-fold cross-validation with n = 12 observations and k = 3. After the data is shuffled, a total of 3 models will be trained and tested.

According to Wikipedia, cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation.

It is a popular method because it is simple to understand and because it generally results in a less biased or less optimistic estimate of the model skill than other methods, such as a simple train/test split.

k-fold cross validation

The general procedure is as follows:

  1. Randomly split your entire dataset into k "folds"
  2. For each fold in your dataset, build your model on the other k - 1 folds. Then, test the model to check its effectiveness on the kth fold
  3. Record the error you see on each of the predictions
  4. Repeat this until each of the k folds has served as the test set
  5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model (see the sketch after this list)
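To make steps 1 through 5 concrete, here is a minimal from-scratch sketch in plain NumPy. This is illustrative only: the function name k_fold_cv_error and the use of mean squared error as the per-fold metric are my own assumptions, and in practice scikit-learn's KFold (shown below) handles the splitting for you.

import numpy as np

def k_fold_cv_error(model, X, y, k=5, seed=0):
    """Average held-out error over k folds (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))           # step 1: shuffle, then...
    folds = np.array_split(indices, k)          # ...split into k folds
    fold_errors = []
    for i in range(k):                          # step 4: each fold is the test set once
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])   # step 2: train on the other k - 1 folds
        preds = model.predict(X[test_idx])      # step 2: test on the held-out fold
        fold_errors.append(np.mean((preds - y[test_idx]) ** 2))  # step 3: record error
    return np.mean(fold_errors)                 # step 5: the cross-validation error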

Now the question arises: how do we choose the right value of k?

There are some important points to keep in mind:

  1. A poorly chosen value for k may result in a misleading idea of the skill of the model, such as a score with a high variance (that may change a lot based on the data used to fit the model) or a high bias (such as an overestimate of the skill of the model).
  2. A lower value of k is more biased, and hence undesirable. On the other hand, a higher value of k is less biased, but can suffer from large variability (a quick comparison follows this list).
  3. A value of k=10 is very common in the field of applied machine learning, and is recommended if you are struggling to choose a value for your dataset.
  4. If a value for k is chosen that does not evenly split the data sample, then one group will contain a remainder of the examples. It is preferable to split the data sample into k groups with the same number of samples, so that the model skill scores are all equivalent.
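As a quick, hedged illustration of the bias/variance trade-off in point 2, you can compare the mean and spread of scores for a few values of k. The iris dataset and the linear SVM here are just convenient stand-ins; the exact numbers depend on your data and model.

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)

# Larger k: more training data per fold (less bias), but scores vary more
for k in (2, 5, 10):
    scores = cross_val_score(clf, X, y, cv=k)
    print("k=%2d  mean=%.3f  std=%.3f" % (k, scores.mean(), scores.std()))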

Code example in Python:

from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays of features and targets
kf = KFold(n_splits=5, shuffle=True, random_state=None)

for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Types of Cross-Validation:

Several commonly used variations, based mainly on the value of k, are as follows:

  • Train/Test Split: Taken to one extreme, such that a single train/test split is created to evaluate the model.
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Iris is assumed here so the snippet is self-contained
X, y = datasets.load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

print(X_train.shape, y_train.shape)  # (90, 4) (90,)
print(X_test.shape, y_test.shape)    # (60, 4) (60,)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
  • Here the value of k is effectively set to 2. With this approach there is a possibility of high bias if we have limited data, because we would miss information in the data we did not use for training. If our data is huge and our test sample and train sample have the same distribution, then this approach is acceptable.
  • Leave-one-out cross-validation:
from sklearn.model_selection import LeaveOneOut

X = [1, 2, 3, 4]
loo = LeaveOneOut()
for train, test in loo.split(X):
print("%s %s" % (train, test))
  • Here k may be set to the total number of observations in the dataset, such that each observation is given a chance to be held out of the dataset.
  • Stratified: The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
# X is the feature set and y is the target
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
  • Repeated: This is where the k-fold cross-validation procedure is repeated n times, where, importantly, the data sample is shuffled prior to each repetition, which results in a different split of the sample.
  • Nested: This is where k-fold cross-validation is performed within each fold of cross-validation, often to perform hyperparameter tuning during model evaluation. This is called nested cross-validation or double cross-validation. A minimal sketch of both variations follows.
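Neither of these last two variations is shown in code above, so here is a minimal, hedged sketch of both with scikit-learn; the iris dataset, the linear SVM, and the grid of C values are illustrative choices, not part of the original examples.

from sklearn import datasets, svm
from sklearn.model_selection import RepeatedKFold, GridSearchCV, cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Repeated: 5-fold CV reshuffled and rerun 3 times, giving 15 scores in total
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
repeated_scores = cross_val_score(svm.SVC(kernel='linear', C=1), X, y, cv=rkf)

# Nested: the inner GridSearchCV tunes C on each outer training fold,
# while the outer 5-fold loop scores the tuned model on the held-out fold
inner = GridSearchCV(svm.SVC(kernel='linear'), {'C': [0.1, 1, 10]}, cv=3)
nested_scores = cross_val_score(inner, X, y, cv=5)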

Implementation Using Sklearn

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)
scores

The mean score and the 95% confidence interval of the score estimate are hence given by:

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

By default, the score computed at each CV iteration is the score method of the estimator. It is possible to change this by using the scoring parameter:

scores = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
scores

Cross Validation for time series:

Splitting a time-series dataset randomly does not work, because it destroys the temporal order of your data. For a time series forecasting problem, we perform cross validation in the following manner:

  1. Folds for time series cross validation are created in a forward chaining fashion
  2. Suppose we have a time series for yearly consumer demand for a product during a period of n years. The folds would be created like:
fold 1: training [1], test [2]
fold 2: training [1 2], test [3]
fold 3: training [1 2 3], test [4]
fold 4: training [1 2 3 4], test [5]
fold 5: training [1 2 3 4 5], test [6]
.
.
.
fold n: training [1 2 3 ….. n-1], test [n]

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]

Summary

In this article we went from why we should use cross validation all the way to implementing it.

We learned:

  • Cross validation (CV) is one of the techniques used to test the effectiveness of machine learning models.
  • The general procedure for implementing k-fold cross validation
  • Some points you must keep in mind while trying cross validation
  • Types of cross validation based on k values

Did you find this article helpful? Please share your opinions/thoughts in the comments section below.

And in the end, I would like to say: JUST REMEMBER to always cross-validate your models.
