K-fold cross validation explained:

Zack Stern
6 min read · Dec 16, 2018
See how many folds this guy has? We are going to fold our data over on itself too!

One idea that I found to be a bit abstract when I first learned about it is something called “k-fold cross-validation”. Here I am going to explain this concept in simple terms so that, hopefully, you will understand what is going on, why we would use k-fold cross-validation (KFCV) and, finally, how to implement it on your own in Python in a Jupyter notebook.

Borrowing from a scene in “Pulp Fiction”, let’s start by just breaking down the title itself:

We have “K”, as in there are 1, 2, 3, 4, 5 … k of them.

“Fold” as in we are folding something over itself.

“Cross” as in a crisscross pattern, like going back and forth over and over again.

“Validation” as in we are checking the accuracy of something (our model in this case).

If we were to just do a quick Wikipedia search (a quickipedia) we would find the following definition of k-fold cross-validation:

In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results can then be averaged to produce a single estimation.


The wiki is not bad but what did you get out of that? K? Que? Lots of K’s, right? So many “K”s mentioned… Ain’t no party like a k-fold party ’cause a k-fold party don’t stop… well, actually, it does stop… at k number of times so… yeah. We’re talkin’ ’bout K.

K is just the number of pieces we have chosen to split our data into. If we have 1,000 data points and we choose a k-value of five, we just divide 1,000 by five, and we get five sets of 200 points each (what Wikipedia calls subsamples). Then we hold onto four of those five sets for training (Wikipedia says ‘k − 1’) and set the remaining one aside for validation. So, four sets of 200 each (800 total) and one set of 200.
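To make that arithmetic concrete, here is a minimal sketch in plain Python. The variable names (`n_points`, `folds`, `held_out` and so on) are just made up for illustration; the only thing taken from the example above is the numbers.

```python
import random

n_points = 1000   # size of the example data set
k = 5             # number of folds

# Shuffle the row indices once so each fold is a random sample.
indices = list(range(n_points))
random.Random(0).shuffle(indices)

# Slice the shuffled indices into k equal folds.
fold_size = n_points // k
folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]

# Hold out one fold; the other k - 1 folds form the training data.
held_out = folds[0]
training = [idx for fold in folds[1:] for idx in fold]

print(len(folds), len(held_out), len(training))  # prints: 5 200 800
```

Note that no data point lands in both groups: the 200 held-out rows and the 800 training rows never overlap.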

First though, let us back up a bit and just look at the idea of train and test splits of data altogether. If you are brand new to statistical models you might be unfamiliar with the idea of splitting your data into two (or more) parts. That is, we will randomly assign each data point to either a test set or a training set. One part, the training data, we will use to train our model on. The second part, the test data, we will use to test our trained model and see how well it does at, say, predicting the value of a house, or whatever it is we are trying to achieve. We might evaluate our model with some type of metric. If you are using a multiple linear regression model you may be using R² or root mean squared error (RMSE) to evaluate your model’s performance: first on the training set, then on the test set.
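Here is a rough sketch of a single train/test split, using only the Python standard library. Everything in it is an assumption for illustration: the house sizes and prices are randomly generated, the 80/20 split ratio is just a common choice, and `fit_line` and `rmse` are small hypothetical helpers standing in for whatever model and metric you actually use.

```python
import math
import random

rng = random.Random(42)

# Made-up toy data: house size (x) and price (y) with random noise.
sizes = [rng.uniform(50, 250) for _ in range(100)]
prices = [1000 * s + 20000 + rng.gauss(0, 15000) for s in sizes]

# Randomly assign 80% of the points to training and 20% to testing.
order = list(range(len(sizes)))
rng.shuffle(order)
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(x, y, slope, intercept):
    """Root mean squared error of the line's predictions."""
    squared = [(b - (slope * a + intercept)) ** 2 for a, b in zip(x, y)]
    return math.sqrt(sum(squared) / len(squared))


x_train = [sizes[i] for i in train_idx]
y_train = [prices[i] for i in train_idx]
x_test = [sizes[i] for i in test_idx]
y_test = [prices[i] for i in test_idx]

# Train on the training set only, then score on both sets.
slope, intercept = fit_line(x_train, y_train)
train_rmse = rmse(x_train, y_train, slope, intercept)
test_rmse = rmse(x_test, y_test, slope, intercept)
print(f"train RMSE: {train_rmse:.0f}, test RMSE: {test_rmse:.0f}")
```

The important part is that the model never sees the test rows while it is being fitted, so the test score is an honest look at data the model has not met before.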

One issue that may arise with this practice is that you may already be starting out with a small data set. Now you have to split that already small set into two sets, and you have even less data to train on or to test your model on. Another issue with a simple train/test approach is that, even though you may have divided your data up randomly, outliers or other anomalies may be present in only one set and not be equally represented in the other. This may make your model seem to perform particularly poorly on the test set, whereas had that one data point been in the training data, the model would not have seemed to suffer as much. If we ran our model several times and moved those outliers around randomly, we would have a better idea of how our model would perform out in the ‘real world’.

Now you might be seeing how we get to KFCV. The solution to our problem of splitting the data and getting extreme values in one set and not the other is to not just split your data into train and test once, but to do it multiple times… that is, we will do it ‘K-times’ :)

So K-fold works like this: Say you selected a K value of 5. That means we will split and build a model FIVE times, score it FIVE times and then average the results of each of those five models.

For example, say we started with 1,000 data points in our set. We choose a k-value of 5 (k can be any whole number from 2 up to the number of data points, but 5 and 10 are commonly chosen values), so now we have 5 sets of 200 data points. Now, we train a model using 800 of those data points. Next we evaluate how well our model is doing by testing it on the 200 data points we held out and scoring the result. Now do it over and over again, holding out a different set of 200 each time!

Check out the simple animation below:

Again, using the example of a thousand data points and a k-value of 5, each card represents 200 data points. First we shuffle the data and randomly assign 200 data points to each card. We hold out the red card and train a model on the data represented by all the blue cards. Then we test it on the data represented by the red card. And we do it over and over again ‘k’ times, with a different card taking its turn as the red card each time, so every data point gets used for testing exactly once. (You can also repeat the whole process with a fresh shuffle, which is usually called repeated k-fold.) So we are completely mixing up and ‘crossing’ our data.

Then we ‘validate’ our models by looking at the scores from all our models, and taking the average of those scores.
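Putting the whole loop together, a bare-bones version might look something like the sketch below. It is a hand-rolled illustration, not a library implementation: the data is randomly generated, the model is a plain straight-line fit, and `fit_line` and `k_fold_rmse` are hypothetical helper names. In practice a library such as scikit-learn provides ready-made tools for this.

```python
import math
import random

rng = random.Random(1)

# Made-up data: 1,000 points scattered around the line y = 3x + 5.
xs = [rng.uniform(0, 100) for _ in range(1000)]
ys = [3 * x + 5 + rng.gauss(0, 2) for x in xs]


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return one held-out RMSE per fold."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)        # shuffle once, up front
    folds = [order[i::k] for i in range(k)]   # deal the rows onto k "cards"

    scores = []
    for held_out in folds:                    # each card is held out exactly once
        held = set(held_out)
        train = [i for i in order if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


scores = k_fold_rmse(xs, ys, k=5)
average = sum(scores) / len(scores)
print([round(s, 2) for s in scores], round(average, 2))
```

You get five scores, one per held-out fold, and the average of those five is the single number you would report for the model.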

You might be thinking to yourself now, “if a k of 5 or 10 is so great, why don’t we just use k of 100 or k of 1,000?” And you could do that. Taken to the extreme you could have k equal to the number of data points in your entire set and run the process, for example, 1,000 times for all 1,000 data points. This is called “Leave One Out Cross Validation” or LOOCV. But it would be computationally expensive to do that. So we use a smaller value of K and assume that the variation in our data will be evident and accounted for without having to hold out each and every point, which could take a very long time depending on the size of our data and what type of model we are testing.
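The trade-off is easy to tabulate. This small sketch (with a hypothetical helper, `cost_of_k`, and assuming the data divides evenly into folds) counts how many models you have to fit and how many rows each one trains on for a few values of k.

```python
n_points = 1000


def cost_of_k(n, k):
    """Return (model fits, training rows per fit, held-out rows per fit)."""
    held_out = n // k
    return k, n - held_out, held_out


for k in (5, 10, n_points):   # k == n_points is leave-one-out
    fits, train_rows, test_rows = cost_of_k(n_points, k)
    print(f"k={k}: {fits} model fits, {train_rows} training rows, {test_rows} held out")
```

With k equal to 1,000 every model trains on 999 rows, but you pay for it by fitting 1,000 separate models instead of 5 or 10.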

I hope that you now have a better idea of how k-fold cross-validation works, why you need it and why it helps. Below you’ll find the code and the walkthrough for implementing this in Python (in a Jupyter notebook).

This guy is a Scottish fold (their ears are folded over). The biggest K-value he could have is Two.