Understanding Cross Validation’s purpose

I recently did a presentation on K-Folds Cross Validation. K-folds cross validation is a method that makes the most of the available data for both training and testing a model. It is particularly useful for assessing model performance, as it provides a range of accuracy scores across (somewhat) different subsets of the data.

K-folds cross validation is pretty straightforward. It's an extension of the train-test split, where data is divided into a training set, used to fit a model, and a testing set, used to determine how well the model performs against a relevant performance metric. Example metrics are accuracy, mean squared error (MSE), or the number of misclassified observations.
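
As a minimal sketch of that train-test split, assuming scikit-learn and the Iris dataset (neither is specified in the presentation itself, so treat them as stand-ins):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative dataset; the original presentation's data is not specified
X, y = load_iris(return_X_y=True)

# Hold out 20% of the observations as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # fit on the training set only

# Evaluate on data the model has never seen
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```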

In K-folds cross validation, the data is divided into k equal parts, as shown in the picture below. Then k iterations of model building and testing are performed: each of the k parts serves as the test data in exactly one iteration and as part of the training set in the other k-1 iterations. At the end, the performance metrics from across the iterations can be summarized with an average, range, standard deviation, or other useful statistic.
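
Here is a short sketch of that procedure, again assuming scikit-learn and the same illustrative dataset; `KFold` and `cross_val_score` handle the k iterations, and the resulting scores can then be summarized:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k = 5: each fold is the test set in one iteration and part of the
# training set in the other four
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print("Accuracy per fold:", scores)
print("Mean:", scores.mean())
print("Std dev:", scores.std())
print("Range:", np.ptp(scores))  # max minus min across the folds
```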

As I developed the presentation, my understanding of the purpose of cross validation evolved. Cross validation is not a model-fitting tool in itself; it's coupled with modeling tools like linear regression, logistic regression, or random forests. Cross validation provides a measure of how good the model fit is, in terms of both bias (accuracy) and variance.
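
To illustrate that coupling, here's a hedged sketch: the same cross-validation routine wraps two different modeling tools (the specific models and dataset are my choices, not from the presentation), and the mean and spread of the fold scores hint at bias and variance, respectively:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross validation wraps each modeling tool; it does not fit anything itself
for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    # A low mean suggests bias; a wide spread across folds suggests variance
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```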