I recently wrote about hold-out and cross-validation in my post about building a k-Nearest Neighbors (k-NN) model to predict diabetes. Last week in my Machine Learning module, many students had questions about hold-out and cross-validation methods for testing, so I thought it deserved its own post.
Hold-out is when you split up your dataset into a ‘train’ set and a ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is 80% of the data for training and the remaining 20% for testing.
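As a rough illustration of the 80/20 idea, here is a minimal sketch in plain Python. The helper name `hold_out_split`, the seed, and the stand-in list of 100 numbers are all assumptions made for this example; in practice a library function would usually do this for you.

```python
import random

def hold_out_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)                      # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]                   # first 20% becomes the test set
    train = shuffled[n_test:]                  # remaining 80% becomes the training set
    return train, test

rows = list(range(100))  # stand-in for 100 data points (hypothetical data)
train, test = hold_out_split(rows)
print(len(train), len(test))  # 80 20
```

Every data point lands in exactly one of the two sets, so the test set really is unseen by the model during training.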
Cross-validation, or more specifically ‘k-fold cross-validation’, is when the dataset is randomly split up into ‘k’ groups. One of the groups is used as the test set and the rest are used as the training set. The model is trained on the training set and scored on the test set. Then the process is repeated until each unique group has been used as the test set.
For example, for 5-fold cross-validation, the dataset would be split into 5 groups, and the model would be trained and tested 5 separate times so each group would get a chance to be the test set. This can be seen in the graph below.
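The rotation of the test group can also be sketched in code. This is a plain-Python illustration of the mechanics only, not how any particular library implements it; the function name `k_fold_splits`, the seed, and the 10-row dataset size are assumptions for the example.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each of the k groups."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment to groups
    # Deal the shuffled indices into k groups of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]                    # group i takes its turn as the test set
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx

splits = list(k_fold_splits(n_samples=10, k=5))
for round_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"round {round_number}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```

With 10 rows and 5 groups, each of the 5 rounds trains on 8 rows and tests on 2, and every row is used for testing exactly once.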
Hold-out vs. Cross-validation
Cross-validation is usually the preferred method because it trains and scores your model on multiple train-test splits. This gives you a better indication of how well your model will perform on unseen data. Hold-out, on the other hand, is dependent on just one train-test split. That makes the hold-out score dependent on how the data happens to be split into train and test sets: a different random split can give a noticeably different score.
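To make the split dependence concrete, here is a small sketch using scikit-learn. It assumes scikit-learn is installed, and it uses synthetic data from `make_classification` as a stand-in rather than the diabetes dataset from my earlier post; the choice of 5 neighbors and the seeds are also just assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption, not a real dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold-out: one 80/20 split per seed, so the score can move with the seed.
hold_out_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)
    hold_out_scores.append(float(model.score(X_test, y_test)))

# 5-fold cross-validation: five train-test rounds on randomly assigned groups.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=folds)

print("hold-out accuracy per seed:", [round(s, 3) for s in hold_out_scores])
print("cross-validation accuracy per fold:", [round(float(s), 3) for s in cv_scores])
print("cross-validation mean:", round(float(cv_scores.mean()), 3))
```

If you run this, the hold-out accuracies will typically differ from seed to seed, while the cross-validation mean summarizes five splits in a single number.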
The hold-out method is good to use when you have a very large dataset, you’re in a time crunch, or you are starting to build an initial model in your data science project. Keep in mind that because cross-validation trains and tests the model once per fold, it takes more computational power and time to run than the hold-out method.
To see an example of comparing hold-out and cross-validation while testing a machine learning model, check out my post here. The article shows how testing a model varies between the two methods and uses the Python scikit-learn library to implement both.
Thanks for reading! To keep up to date with my machine learning content, follow me :)