Machine Learning: Cross-Validation

chetan kumar b k
4 min read · Dec 12, 2022


Part 1: Basic understanding of Cross-Validation

Cross-validation is a resampling technique used to estimate how well a machine learning model will perform on a test dataset it has not seen before. For any machine learning method, the dataset is split into training and testing data.

Splitting the dataset
  • Training data is used to build the ML model, whereas testing data is used to validate the built model, as sketched below.
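
As a minimal sketch of this split, here is how it is commonly done with scikit-learn (the iris dataset and the 80/20 ratio are illustrative assumptions, not choices prescribed by this article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small example dataset (150 samples, 4 features).
X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows as the testing (validation) set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```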

This article is divided into two parts:

  • Part 1: A basic understanding of cross-validation
  • Part 2: An in-depth understanding of the different cross-validation techniques and their advantages and disadvantages

What will happen if we validate the model using the training data itself?

If you validate the model only on its own training data, you cannot tell whether the model is underfitting, overfitting, or well generalized for new data.

Simply put, a machine learning model's goal is low error; that is, the model should exhibit low error (high accuracy) on new data.

Without validation, we cannot be sure that the model will work well on data it has never seen before; in other words, we cannot be sure that the model will achieve the desired accuracy and variance in a production environment. That is why we need to validate our model.

So, from the above, we can conclude that a model's performance on unseen data tells us whether it is underfitting, overfitting, or well generalized. Cross-validation is one of the techniques used to evaluate, and thereby improve, the effectiveness of machine learning models.

What is Cross-Validation?

“Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models.”

NOTE: Cross-validation helps protect against overfitting and underfitting in predictive models. To perform CV, we keep aside a portion of the data that is not used to train the model; later, this portion is used for testing and validation.

Steps involved in cross-validation

  1. Divide the entire dataset into training and testing portions.
  2. Reserve the testing portion for validation.
  3. Train the model using the training dataset.
  4. Validate the built model using the testing data; this helps us gauge the model's performance (a runnable sketch follows this list).
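
Here is a runnable sketch of these four steps; the logistic-regression model and the iris dataset are illustrative assumptions, not requirements:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Steps 1-2: divide the dataset and reserve the testing portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: train the model on the training dataset.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 4: validate the built model on the testing data.
print("Test accuracy:", model.score(X_test, y_test))
```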

NOTE: If the model delivers a good result on the testing data, we go ahead with the current model; if not, we perform cross-validation to determine the effectiveness of the ML model.

Types of Cross-Validation

Holdout Method

  • As in the usual procedure, we divide the dataset into training and testing sets. We train the model on the training data and then validate it on the test data (exactly the workflow sketched in the steps above).

K-fold Cross Validation

  • It is an improved version of the holdout method. In the holdout method, the model's score depends on how the data happened to be split into training and testing sets; in K-fold CV, it does not depend on a single split.
  • The dataset is divided into K subsets (folds), and the holdout method is repeated K times, with each fold serving once as the test set (see the sketch below).
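
A minimal K-fold sketch, assuming scikit-learn with K = 5 and a logistic-regression model (all illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Shuffle before splitting so the folds are not ordered by class,
# then let each of the 5 folds serve once as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kf)

print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```

Because every fold is used for testing exactly once, the averaged score no longer hinges on one lucky (or unlucky) split.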

Stratified K-fold Cross Validation

  • It is similar to K-fold CV, just an update of K-fold with respect to the class of the dependent (target) variable.
  • In K-fold CV, we randomly shuffle the data and then divide it into K folds. There is a chance that some folds end up class-imbalanced, which may bias our training. To overcome this issue, we use stratified K-fold CV (sketched below).
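
The stratified variant is a one-line change in the previous sketch: StratifiedKFold keeps the class proportions of y roughly equal across folds (again, the model and K = 5 are assumed choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each fold preserves the class proportions of y, so no fold
# ends up dominated by a single class.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=skf)

print("Mean accuracy:", scores.mean())
```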

Leave One Out Cross Validation (LOOCV)

  • In this exhaustive approach, we consider a single data point from the dataset as test data and train the model on all the remaining data. This process is repeated for each data point (see the sketch below).
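
A minimal LOOCV sketch under the same assumptions; with n samples this trains the model n times, so it gets expensive on large datasets:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples -> 150 iterations
model = LogisticRegression(max_iter=1000)

# Each iteration trains on n-1 points and tests on the single
# point that was left out, so each score is either 0 or 1.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("Mean accuracy:", scores.mean())
```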

Leave P Out Cross Validation (LPOCV)

  • LOOCV leaves one data point out; when we instead leave out P training examples to obtain a P-sized validation set for each iteration, the method is called LPOCV.
  • Suppose we take P points out of the total number of data points in the dataset (say n). We train the model on the remaining (n - P) points and test it on the P held-out points, repeating this for every possible combination of P points in the original dataset. The accuracies from all of these iterations are then averaged to get the final accuracy (see the sketch below).
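
A minimal LPOCV sketch with P = 2 (an assumed value). Because LPOCV enumerates all C(n, P) train/test splits, the number of iterations explodes quickly, so this sketch uses a small 22-row subset of iris to stay runnable:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

X, y = load_iris(return_X_y=True)
X, y = X[::7], y[::7]  # 22-row subset -> C(22, 2) = 231 iterations

model = LogisticRegression(max_iter=1000)

# Train on (n - p) points and test on the p held-out points,
# for every possible combination of p points.
scores = cross_val_score(model, X, y, cv=LeavePOut(p=2))

print("Iterations:", len(scores))
print("Mean accuracy:", scores.mean())
```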

In Part 2, we'll take a closer look at each of these cross-validation methods, including the code and their benefits and drawbacks.

Note: This article is for educational purposes only and is not intended for business purposes.
