Evaluating Machine Learning Models: A Guide to Selecting the Right Performance Assessment Method for Your Dataset and Task

Sahel Eskandar
5 min read · Mar 26, 2023


In the previous article, “Comparing K-Fold Cross-Validation Methods: Strategies for Effective Model Evaluation in Diverse Data Scenarios,” we learned about k-fold cross-validation and its variants. Dive into that article to gain a better understanding of those evaluation methods and how they can improve your machine learning model’s performance.


In addition to k-fold cross-validation, there are several other methods for assessing the performance of machine learning models. This article covers other popular techniques: the holdout method, leave-one-out cross-validation, leave-p-out cross-validation, repeated random subsampling, and bootstrapping. By delving into each method’s strengths, weaknesses, and ideal use cases, the article aims to help you choose the most suitable evaluation method for your specific task and dataset. Here is each method in turn, with guidance on when to use it:

  1. Holdout Method: The dataset is split into two distinct subsets, one for training and one for testing (usually with a ratio like 70:30 or 80:20). The model is trained on the training set and evaluated on the testing set. This method is simple and fast but may lead to high variance in the performance estimates, especially when the dataset is small. Use the holdout method when you have a large dataset and the evaluation speed is a concern.
  2. Leave-One-Out Cross-Validation (LOOCV): This is a special case of k-fold cross-validation where k is equal to the number of data points. In each iteration, one data point is used as the test set, and the remaining data points are used as the training set. This method can provide low-bias performance estimates but is computationally expensive. Use LOOCV when you have a small dataset, and the computational cost is not a concern.
  3. Leave-P-Out Cross-Validation (LPOCV): In this method, a specific number of data points (p) are left out as the test set in each iteration, while the remaining data points are used for training. This method is more computationally expensive than LOOCV but provides more reliable performance estimates. Use LPOCV when you have a small dataset and you want to balance computational cost with performance estimation accuracy.
  4. Repeated Random Subsampling (Monte Carlo Cross-Validation): This method involves randomly splitting the dataset into training and testing sets multiple times, training the model on the training set, and evaluating it on the testing set. The performance estimates from each split are averaged to provide a final performance estimate. This method can provide more accurate performance estimates for small datasets, but the test sets can overlap across splits. Use repeated random subsampling when you have a small dataset and want to balance evaluation speed with performance estimation accuracy.
  5. Bootstrapping: This method involves sampling with replacement from the original dataset to create multiple new datasets, each of the same size as (or a fixed fraction of) the original dataset. The model is trained on each bootstrap sample and evaluated on the out-of-bag points, i.e., those not drawn into that sample. Bootstrapping can provide more accurate performance estimates for small datasets, but because sampling is done with replacement, the same points appear repeatedly in the training sets, which can bias the estimate. Use bootstrapping when you have a small dataset and want to estimate the performance with higher confidence.

When choosing a method for assessing the performance of a machine learning model, consider the size and characteristics of your dataset, the computational cost, and the specific requirements of your task. Each method has its advantages and disadvantages, so it’s essential to select the one that best fits your problem domain and dataset characteristics.
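To make that choice a little more concrete, here is a small, hypothetical helper (not from the discussion above) that maps rough dataset-size thresholds to one of the scikit-learn splitters discussed in this article. The helper name and the cut-off values are illustrative assumptions, not rules; adjust them to your own computational budget.

# Hypothetical helper (illustrative only): pick an evaluation strategy from the
# dataset size. The thresholds below are assumptions, not fixed rules.
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

def choose_cv(n_samples):
    if n_samples < 200:
        # Small dataset: exhaustive LOOCV is affordable and gives low-bias estimates
        return LeaveOneOut()
    elif n_samples < 10_000:
        # Medium dataset: 5-fold CV balances cost and reliability
        return KFold(n_splits=5, shuffle=True, random_state=42)
    else:
        # Large dataset: a few random splits (or a single holdout) are usually enough
        return ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)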

Here is a simple example of applying the five methods above (plus standard k-fold cross-validation from the previous article) to assess the performance of a machine learning model using Python and scikit-learn. We will use the iris dataset and a logistic regression classifier:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, LeavePOut, ShuffleSplit, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Load the iris dataset
data = load_iris()
X, y = data.data, data.target

# Create a logistic regression classifier
clf = LogisticRegression(max_iter=1000)

# Holdout Method
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Holdout Method Accuracy: ", accuracy_score(y_test, y_pred))

# K-Fold Cross-Validation
k_fold = KFold(n_splits=5)
cv_scores = cross_val_score(clf, X, y, cv=k_fold)
print("K-Fold Cross-Validation Mean Accuracy: ", np.mean(cv_scores))

# Leave-One-Out Cross-Validation
loocv = LeaveOneOut()
cv_scores = cross_val_score(clf, X, y, cv=loocv)
print("Leave-One-Out Cross-Validation Mean Accuracy: ", np.mean(cv_scores))

# Leave-P-Out Cross-Validation
lpocv = LeavePOut(p=2)
cv_scores = cross_val_score(clf, X, y, cv=lpocv)
print("Leave-P-Out Cross-Validation Mean Accuracy: ", np.mean(cv_scores))

# Repeated Random Subsampling (Monte Carlo Cross-Validation)
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
cv_scores = cross_val_score(clf, X, y, cv=ss)
print("Repeated Random Subsampling Mean Accuracy: ", np.mean(cv_scores))

# Bootstrapping
n_iterations = 100
n_size = int(len(X) * 0.7)
bootstrap_scores = []

for i in range(n_iterations):
    # Sample with replacement to build a bootstrap training set
    X_resample, y_resample = resample(X, y, n_samples=n_size)

    # Represent each row as a tuple so rows can be compared for membership
    xy = [tuple(row) for row in np.column_stack((X, y))]
    xy_resampled = [tuple(row) for row in np.column_stack((X_resample, y_resample))]

    # Use the out-of-bag points (those not drawn into the resample) as the test set
    test_indices = [j for j, pair in enumerate(xy) if pair not in xy_resampled]
    X_test, y_test = X[test_indices], y[test_indices]

    clf.fit(X_resample, y_resample)
    y_pred = clf.predict(X_test)
    bootstrap_scores.append(accuracy_score(y_test, y_pred))

print("Bootstrapping Mean Accuracy: ", np.mean(bootstrap_scores))
Output:

Holdout Method Accuracy: 1.0
K-Fold Cross-Validation Mean Accuracy: 0.9266666666666665
Leave-One-Out Cross-Validation Mean Accuracy: 0.9666666666666667
Leave-P-Out Cross-Validation Mean Accuracy: 0.965413870246085
Repeated Random Subsampling Mean Accuracy: 0.96
Bootstrapping Mean Accuracy: 0.9554679336669741

This code snippet demonstrates how to apply each of these performance assessment methods in Python using scikit-learn. Keep in mind that these examples use the iris dataset and a logistic regression classifier; you may need to adjust the code for different datasets and classifiers to suit your specific use case.
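Beyond the mean accuracy, it can also help to look at the spread of the per-split scores, since several of the trade-offs above come down to how much the estimate varies. Below is a minimal sketch, assuming the same iris data and classifier as above, that reports the mean plus or minus the standard deviation for two of the methods:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

strategies = {
    "K-Fold": KFold(n_splits=5),
    "Repeated Random Subsampling": ShuffleSplit(n_splits=10, test_size=0.3, random_state=42),
}

for name, cv in strategies.items():
    scores = cross_val_score(clf, X, y, cv=cv)
    # The mean is the point estimate; the standard deviation shows how much it varies across splits
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")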

👏 Don’t forget to give this article some claps and share it with your network to support my work! Feel free to follow my Medium profile for more insightful content on machine learning and data science. Thank you for being so supportive! 🚀

