Model Performance Evaluation

Shashank Shekhar
Apr 6, 2019


Before starting, a reference read: Evaluating Learning Algorithms.

You need to know how well your algorithms perform on unseen data. The best way to evaluate the performance of an algorithm would be to make predictions for new data to which you already know the answers. The second best way is to use clever techniques from statistics called re-sampling methods that allow you to make accurate estimates for how well your algorithm will perform on new data.

In general, every machine-learning algorithm optimizes some objective, which often amounts to minimizing some kind of error. Traditionally, your machine-learning algorithm will be performing classification (supervised), clustering (unsupervised) or regression (learning a function or predicting specific parameters of a function).

Classification evaluation often just counts errors, but it may also need to weigh errors differently across classes and account for class-specific costs and biases. There are many pitfalls in using raw error counts, percentage errors or the complementary concept of accuracy, let alone single-class measures like Recall, Precision and the F-measure.
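
As a quick, illustrative sketch, scikit-learn exposes these single-class measures directly; the labels below are made up purely for demonstration.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# hypothetical ground-truth and predicted labels for a binary problem
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F-measure:", f1_score(y_true, y_pred))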

Clustering algorithm selection or development aside, once a set of clusters has been produced there remains the question of how good the membership assignments are relative to the original purpose of the clustering. Evaluation techniques rely on one of three kinds of criteria:

  • Internal criteria are quantities computed from the data set itself (e.g. the proximity matrix). They are used to assess either the clustering or the algorithm that produced it by measuring characteristics like cohesion, separation, distortion and likelihood. Because these quantities are strongly affected by parameters fixed a priori, such as the required number of clusters or a minimum density, internal criteria are sensitive both to the quality of the clustering and to the a priori choices used to evaluate it.
  • External criteria simply measure how similar a clustering is to another clustering, a gold standard or a desirable-feature template. They therefore produce measures that are independent of the producing algorithm, of a priori clustering-evaluation choices and of data-set- or problem-specific criteria.
  • Relative criteria rate a clustering by comparing it to other clusterings produced by the same algorithm with different input parameter values. Here, predefined criteria are selected to suit the algorithm and the data set.

The relative and internal criteria approaches use Monte Carlo methods to evaluate whether a clustering is significantly different from chance, whereas external criteria are used to compare the memberships and structures of two clusterings.
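
As a rough sketch of the internal versus external flavours: scikit-learn's silhouette_score is an internal measure of cohesion and separation computed from the data alone, while adjusted_rand_score is an external measure of agreement with a reference labelling. The synthetic blobs and k-means model below are purely illustrative.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# synthetic data with a known reference ("gold standard") labelling
X_demo, y_gold = make_blobs(n_samples=300, centers=3, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X_demo)

# internal criterion: cohesion/separation computed from the data itself
print("Silhouette:", silhouette_score(X_demo, labels))
# external criterion: agreement with the reference labelling
print("Adjusted Rand:", adjusted_rand_score(y_gold, labels))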

Regression evaluation may involve looking at absolute error versus relative error (error relative to magnitude), or at squared error as opposed to absolute error. Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of how wrong the predictions were: the measure conveys the magnitude of the error but not its direction (e.g. over- or under-predicting). The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of error. Taking the square root of the mean squared error converts the units back to the original units of the output variable and can be meaningful for description and presentation; this is called the Root Mean Squared Error (or RMSE). The R² (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In the statistical literature this measure is called the coefficient of determination. It is a value between 0 and 1, for no fit and a perfect fit respectively.
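
For concreteness, here is a minimal sketch of these four regression metrics using scikit-learn, with made-up actual and predicted values.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical actual and predicted target values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)              # back in the units of the target variable
r2 = r2_score(y_true, y_pred)
print(mae, mse, rmse, r2)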

Classification Performance Measurement

Why can’t you train your machine learning algorithm on your training dataset and use predictions from this same dataset to evaluate performance? The simple answer is over-fitting. Imagine an algorithm that remembers every observation it is shown during training. If you evaluated such an algorithm on the same dataset used to train it, it would score perfectly on the training dataset, but its predictions on new data would be terrible. We must evaluate our machine learning algorithms on data that was not used to train them.
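
A small illustration of the point, using synthetic data and an unpruned decision tree, which can effectively memorise its training set (the exact numbers will vary):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic classification data, purely for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=7)

tree = DecisionTreeClassifier(random_state=7).fit(X_tr, y_tr)
print("Score on training data:", tree.score(X_tr, y_tr))  # typically a perfect 1.0
print("Score on unseen data  :", tree.score(X_te, y_te))  # noticeably lower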

The evaluation is an estimate that we can use to talk about how well we think the algorithm may actually do in practice. It is not a guarantee of performance. Once we estimate the performance of our algorithm, we can then re-train the final algorithm on the entire training dataset and get it ready for operational use. Next up we are going to look at four different techniques that we can use to split up our training dataset and create useful estimates of performance for our machine learning algorithms:

  1. Train and Test Sets
  2. k-fold Cross-Validation
  3. Leave One Out Cross-Validation
  4. Repeated Random Test-Train Splits

Train and Test Sets

The simplest method that we can use to evaluate the performance of a machine learning algorithm is to use different training and testing datasets. We can take our original dataset and split it into two parts. Train the algorithm on the first part, make predictions on the second part and evaluate the predictions against the expected results. The size of the split can depend on the size and specifics of your dataset, although it is common to use 67% of the data for training and the remaining 33% for testing.

This algorithm evaluation technique is very fast. It is ideal for large datasets (millions of records) where there is strong evidence that both splits of the data are representative of the underlying problem. Because of its speed, this approach is also useful when the algorithm you are investigating is slow to train. A downside of this technique is that it can have a high variance: differences between the training and test datasets can result in meaningful differences in the estimate of accuracy. In the example below we split the Pima Indians dataset into a 67%/33% train/test split and evaluate the accuracy of a Logistic Regression model. Partial code:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# X holds the input features and Y the class labels of the
# Pima Indians dataset, assumed to have been loaded beforehand
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=test_size, random_state=seed)
model = LogisticRegression(max_iter=1000)  # higher iteration cap helps the solver converge
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)  # accuracy on the held-out 33%
print("Accuracy: %.3f" % result)

K-fold Cross-Validation

Cross-validation is an approach that you can use to estimate the performance of a machine learning algorithm with less variance than a single train-test split. It works by splitting the dataset into k parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The algorithm is trained on k - 1 folds with one held back, and tested on the held-back fold. This is repeated so that each fold of the dataset gets a chance to be the held-back test set. After running cross-validation you end up with k different performance scores that you can summarize using a mean and a standard deviation.

The result is a more reliable estimate of the performance of the algorithm on new data. It is more accurate because the algorithm is trained and evaluated multiple times on different data.

The choice of k must allow the size of each test partition to be large enough to be a reasonable sample of the problem, whilst allowing enough repetitions of the train-test evaluation of the algorithm to provide a fair estimate of the algorithm's performance on unseen data. For modest-sized datasets in the thousands or tens of thousands of records, k values of 3, 5 and 10 are common. In the example below we use 10-fold cross-validation.

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
num_folds = 10
seed = 7
# shuffle=True is required when a random_state is supplied
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression(max_iter=1000)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

Leave One Out Cross-Validation

You can configure cross-validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross-validation is called leave-one-out cross-validation. The result is a large number of performance measures that can be summarized in an effort to give a more reasonable estimate of the accuracy of your model on unseen data. A downside is that it can be a computationally more expensive procedure than k-fold cross-validation. In the example below we use leave-one-out cross-validation.

from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# one fold per observation, so the model is fit len(X) times
leaveoneout = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
results = cross_val_score(model, X, Y, cv=leaveoneout)
print("Accuracy: %.3f (%.3f)" % (results.mean(), results.std()))

Repeated Random Test-Train Splits

Another variation on k-fold cross-validation is to create a random split of the data like the train/test split described above, but to repeat the process of splitting and evaluating the algorithm multiple times, like cross-validation. This has the speed of a train/test split and some of the reduction in variance of k-fold cross-validation. You can also repeat the process as many more times as needed to improve the accuracy of the estimate. A downside is that repetitions may include much of the same data in the train or test split from run to run, introducing redundancy into the evaluation. The example below splits the data into a 67%/33% train/test split and repeats the process 10 times.

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
# 10 random 67%/33% splits, as described above
kfold = ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)
model = LogisticRegression(max_iter=1000)
results = cross_val_score(model, X, Y, cv=kfold)

What Techniques to Use When

This section lists some tips to consider what re-sampling technique to use in different circumstances.

  • Generally k-fold cross-validation is the gold standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.
  • Using a train/test split is good for speed when using a slow algorithm and produces performance estimates with lower bias when using large datasets.
  • Techniques like leave-one-out cross-validation and repeated random splits can be useful intermediates when trying to balance variance in the estimated performance, model training speed and dataset size.

The best advice is to experiment and find a technique for your problem that is fast and produces reasonable estimates of performance that you can use to make decisions. If in doubt, use 10-fold cross-validation.
