Cross validation in Data Science

Introduction

If you know anything about machine learning, you know that when building a machine learning model we need data to train it. How do we do that? We split the whole data set into two unequal parts at random: say, 80% of the data as the training set and the remaining 20% as the test set. We then train the selected ML model on the training set, and to evaluate its performance we test it on the unseen test data we kept aside. Cross validation is a technique for evaluating a machine learning model and testing its performance. It also helps compare different ML models for a specific predictive modelling task. It is simple and easy to implement.

Different CV techniques

There are many CV techniques floating around, but only some of them are commonly used in the practical world; the others mostly just exist. We won't discuss all of them in detail, but you will get the idea behind every method at least. They are:

A- Hold-out

B- K-fold cross validation

C- Leave-p-out

D- Stratified k-folds

E- Repeated k-folds

F- Nested k-folds

G- Time series CV

H- Validation set approach.

K-fold CV is the topic of interest in this article, so it gets the most attention; for the others we give just a few words so that you get a brief idea of each technique. So let's go…

A. Hold-out

We already talked about this method in the introduction. You've probably used it too; you just didn't know the name of the technique. It proceeds as follows:

1. Split the whole data set into two parts. Generally 80% of the data goes into the training set and the rest (20%) goes into the test set. The split ratio isn't universal, so choose yours accordingly.

2. Train the model on the training set.

3. Validate the model on the test set.

4. Calculate the accuracy or performance of your model.

Done.

It’s so simple that it seems trivial. It is useful when you have a large amount of data and you can afford to train the model only once.

Hold-out isn't the preferred method for cross validation, since a single unlucky or imbalanced split might result in a misleading estimate of model performance.
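Here is a minimal hold-out sketch using scikit-learn. The iris dataset, the logistic regression model, and the 80/20 ratio are illustrative assumptions, not part of the method itself.

```python
# Hold-out: one random 80/20 split, one training run, one score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: split the data 80% train / 20% test at random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 2-4: train on the training set, score on the held-out test set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```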

B. K-Fold cross validation

K-fold cross validation is a commonly used CV method.

In this approach we divide the whole data set into k groups of equal size, called folds.

The algorithm works as follows:

1. Choose the value of k on your own. Usually it is taken as 5 or 10.

2. Split the dataset into k equal parts.

3. Choose one fold as the test set and the rest of the k-1 folds as the training set.

4. Train the model on the k-1 training folds.

5. Validate the model on the test fold.

6. Save the result.

7. Repeat steps 3 to 6 with a new fold as the test set and the rest as the training set. Train a fresh model each time, not the one from the previous iteration.

8. Take the arithmetic mean of the k results.

Since we train and test the models on different parts of the data set, the resulting performance estimate is more stable. If we want to make the estimate more robust, we can increase the value of k.

There are different ways to choose the value of k, such as:

A. Choose k in such a way that both the train and test sets are large enough to be statistically representative of the main data set.

B. k = 10. Many experiments have shown that k = 10 is a good choice, as it usually results in an estimate with low bias and variance.

C. k = n. We will talk about this more later in this article.

One disadvantage of k-fold CV is that for a larger value of k the computational cost grows, since we are training k separate models.
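To make the loop concrete, here is a minimal k-fold sketch with scikit-learn's KFold; the dataset, the model, and k = 5 are illustrative assumptions.

```python
# K-fold CV: train a fresh model on each of the k train/test splits,
# then average the k scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)  # a fresh model per fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# Step 8: the arithmetic mean of the k fold scores.
print("mean accuracy:", np.mean(scores))
```

The same loop can be written in one line with scikit-learn's cross_val_score, which handles the splitting, refitting, and scoring internally.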

C. Leave-p-out

In this approach we take p samples as the test set and the remaining (n - p) samples as the training set. We can choose p samples out of n in C(n, p) ways, so unlike k-fold CV the test sets can overlap.

1. Choose p samples; this will be our test set.

2. Take the remaining (n - p) samples as the training set.

3. Train the model on the training set. Remember, a new model should be trained on each iteration.

4. Validate the model on the test set.

5. Save the result.

6. Repeat steps 1 to 5 for each of the C(n, p) possible test sets.

7. Average the results.

The disadvantage of this technique is that the number of iterations, C(n, p), grows very quickly with n and p, so it can be computationally expensive.

One special case is p = 1. Many people consider this a whole different technique: leave-one-out CV (LOOCV).

As you can see, C(n, 1) = n, which means this method iterates n times. You take out just one sample, train the model on the remaining n - 1 samples, and validate it on that single data point. LOOCV, too, is computationally expensive and time consuming.
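Below is a minimal leave-one-out sketch with scikit-learn; the tiny toy dataset is an illustrative assumption, and swapping LeaveOneOut for LeavePOut(p=2) would enumerate all C(n, 2) test pairs in the same way.

```python
# LOOCV: n iterations, each holding out exactly one sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

# Toy data (illustrative only): n = 6 samples, so the loop runs 6 times.
X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LogisticRegression()  # a new model on each iteration
    model.fit(X[train_idx], y[train_idx])
    # Validate on the single held-out data point.
    scores.append(model.score(X[test_idx], y[test_idx]))

print("LOOCV mean accuracy:", np.mean(scores))
```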

D. Stratified k-fold

Sometimes we deal with imbalanced data. In a regression problem, for example, the target might be the price of some product, where a few items are expensive and the many remaining ones are considerably cheaper. Or, in binary classification, one class may have far more samples than the other: say, in a dataset of 1000 samples of cats and dogs, there are 700 cats and 300 dogs. On this kind of imbalanced dataset the standard k-fold CV technique can run into issues, because a random fold may contain very few samples of the minority class. To counter this problem, a variant of k-fold CV was introduced: stratified k-fold CV.

It works almost the same way as k-fold CV. The difference is that stratified k-fold splits the data set in such a way that each fold contains the same percentage of samples of each target class as the main data set. In the case of regression, stratified k-fold makes sure that the mean target value is approximately equal in all the folds.

Stratified k-fold deals with bias and variance well. You can guess that it is better than standard k-fold CV on imbalanced data, and yes, it is.
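Here is a minimal sketch with scikit-learn's StratifiedKFold, reusing the 700-cats / 300-dogs example from above; the random placeholder features are an assumption made only for illustration.

```python
# Stratified k-fold: every fold preserves the 70/30 class ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))       # placeholder features
y = np.array([0] * 700 + [1] * 300)  # 700 cats, 300 dogs

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Each test fold holds about 140 cats and 60 dogs.
    print("test fold class counts:", np.bincount(y[test_idx]))
```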

E. Repeated k-folds

In this method k is not the number of folds as in standard k-fold; k is the number of times we will train the model.

Let's say we choose 20% of the data as our test set and the rest as the training set. The technique then randomly selects 20% of the data from the original dataset, trains the model on the remaining samples, and validates it on the 20% it set aside. It does this k times and averages the results.

The advantage of this method is that it is more robust, since it chooses the train and test sets at random on every repeat. The disadvantage is that there is no guarantee that every sample will be selected at least once for testing.
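The procedure described here matches what scikit-learn calls ShuffleSplit (repeated random train/test splits); note that scikit-learn's own RepeatedKFold class is a slightly different scheme that repeats full k-fold runs. A minimal sketch, with the dataset and model as illustrative assumptions:

```python
# Repeated random 80/20 splits: k = n_splits independent train/score runs.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
ss = ShuffleSplit(n_splits=10, test_size=0.2, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print("mean accuracy over 10 random splits:", scores.mean())
```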

F. Nested k-fold

This is where a second k-fold cross-validation is performed within each fold of an outer cross-validation, usually to perform hyperparameter tuning in the inner loop while the outer loop gives an unbiased estimate of the tuned model's performance.
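A minimal nested CV sketch with scikit-learn: an inner 3-fold GridSearchCV tunes a hyperparameter, and an outer 5-fold loop scores the whole tuning procedure. The SVC model and its C grid are illustrative assumptions.

```python
# Nested k-fold: hyperparameter tuning inside each outer fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: 3-fold CV over a small grid of C values.
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)

# Outer loop: 5-fold CV around the entire tuning procedure, so the
# reported score is not biased by the tuning itself.
scores = cross_val_score(inner, X, y, cv=5)
print("nested CV mean accuracy:", scores.mean())
```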

G. Time series CV

This is an entirely different topic. When we have sequential data, like a time series, we cannot assign data points to the train or test set at random, because that would let the model train on the future and predict the past. We are not going to discuss time series CV in depth in this article; we have yet to write an article on time series, and when we do, a link will be added here. Please wait until then.
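For a quick taste until then: scikit-learn ships a TimeSeriesSplit that always trains on the past and tests on the future; the toy data below is only for illustration.

```python
# Time series CV: test indices always come after the train indices.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered samples
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
```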

H. Validation set approach.

In this approach we divide the original data set into two equal parts: a training set and a test (validation) set.

The major disadvantage of this approach is that the model is trained on only 50% of the data. What happens then is that the model fails to capture many patterns that are in the original data but not in the training half.

That was it. I hope you find it helpful. If you do, please share this story with a data science aspirant you know and follow me on Medium.

Or find me on these platforms:

LinkedIn , Github, Twitter , medium
