Judgement Day: Evaluating Classification Models

Indrani Banerjee
Published in CodeX
7 min read · Nov 27, 2022

I remember an episode of Banana Data where the hosts were discussing AI-written jokes and what was then the beginning of Gmail’s predictive text features. It got me wondering about model evaluation: given the broad applications of machine learning models, what exactly makes a model good or bad?

Before getting in too deep, let’s first clear up a very important word in model evaluation which is often just thrown around incorrectly: accuracy.
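
Accuracy is simply the fraction of all predictions that the model gets right: Accuracy = (number of correct predictions) / (total number of predictions).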

Seems simple enough, right? While accuracy may seem like a great metric to evaluate our models with, there are plenty of situations where it is more problematic than helpful. This is particularly true for unbalanced datasets: datasets where there is far more data for one class than for the rest. A model trained on an unbalanced dataset can achieve a high accuracy score simply by favouring the over-represented class, yet its accuracy on the test set, or on unseen data, can be drastically lower. So, what to do?

In this article, I’ll focus on various model metrics that can be used in combination to judge our models. However, it’s important to note that there are ways of handling unbalanced datasets to balance them out before we even create a model. Check out this article, which has a great summary on handling unbalanced data. There are also instances where accuracy may not be an ideal score because of the project’s criteria for success; machine learning is applied across a multitude of disciplines, so domain knowledge is required to identify the most suitable metrics. So what are the various metrics, and how do we work them out?

Confusion Matrix

One great way to work out various model metrics is to first look at a confusion matrix. Think of it as a table that breaks down the number of correct and incorrect predictions a model makes, compared with the true values of the situation being modelled.

An example of a confusion matrix, by Packt

Let’s just quickly go through what TN, FP, FN, and TP mean, and then we’ll get back to how we can use these for model evaluation.

TN: True Negative — the number of predictions that were correctly predicted as class 0.
FP: False Positive — the number of incorrect predictions classifying a class 0 object as class 1.
FN: False Negative — the number of incorrect predictions classifying a class 1 object as class 0.
TP: True Positive — the number of predictions that were correctly predicted as class 1.
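
As a quick sketch of how this looks in code, here is scikit-learn's confusion_matrix on a toy set of labels (the labels below are made up purely for illustration):

from sklearn.metrics import confusion_matrix

# Toy ground-truth labels and model predictions, purely illustrative
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))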

How are these useful? Let’s go through a few of the most common metrics we use for classification models.

Recall: True Positive Rate

Recall is sometimes referred to as the sensitivity of the model. It tells us how good the model is at identifying objects in the positive class.
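
In terms of the confusion matrix, Recall = TP / (TP + FN): of all the objects that truly belong to the positive class, the fraction the model actually caught.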

Precision: Positive Predictive Value

Let’s use my breast cancer classification project as an example to understand what the positive predictive value measures: of all the samples the model classifies as malignant, it is the proportion that are actually malignant.
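
In terms of the confusion matrix, Precision = TP / (TP + FP): of all the samples the model labels as positive, the fraction that really are positive.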

So, which one should we use? Of course, in an ideal world, or with a dataset whose classes are perfectly distinguishable, we could have both of them at a score of 1.0. This rarely happens. We find that as we tune our models and improve one of these metrics, the other will inevitably fall. That’s not necessarily a bad thing, though, and it doesn’t mean our models are doomed. There are plenty of situations where it is favourable to prioritise one over the other. For example, in medical testing, recall is often favoured even though it comes with a higher false positive rate: missing a genuine case (a false negative) is far more costly than a false alarm, and diagnosis usually relies on multiple tests rather than just one, so false positives can be caught later.

How do we prioritise one of these over the other? We set a threshold value, determined by the situation, that establishes at what point one should be favoured. If we have an unbalanced dataset, for example, we can plot our precision scores against our recall scores for varying threshold values to identify the optimal threshold. Take a look at this article, which provides a great insight into precision-recall curves.
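
As a minimal sketch of this idea, assuming a fitted binary classifier clf with a predict_proba method and a held-out test split X_test, y_test (these names are illustrative, not from the original project), scikit-learn's precision_recall_curve gives the precision and recall at every candidate threshold:

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Probability of the positive class for each test sample
# (clf, X_test and y_test are assumed to exist already)
probs = clf.predict_proba(X_test)[:, 1]

# Precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, probs)

plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()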

F-1 Score
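
The F-1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).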

This is a value ranging from 0 to 1, where 1 indicates perfect precision and recall and 0 indicates poor scores for both.

As the F-1 score depends on both the recall and the precision score, it can be a much better single metric to judge our model by. The higher the F-1 score, and so the closer it is to 1, the higher the other two metrics must be.

Fall-out: False Positive Rate

As I’ve mentioned earlier, there are instances where the false positive rate matters as much as, or more than, other metrics when evaluating a model. It can be worked out as:
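
Fall-out = FP / (FP + TN): of all the objects that are truly negative, the fraction the model wrongly flags as positive.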

Specificity: True Negative Rate

I like to think of this as the recall score for the negative class.
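
Specificity = TN / (TN + FP): of all the objects that are truly negative, the fraction the model correctly identifies as negative.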

There can be advantages to using specificity over sensitivity sometimes, but again, this is dependent on the nature of the situation.

ROC and the Area under the ROC

The receiver operating characteristic (ROC) curve is a plot of the true positive rate against the false positive rate (recall vs fall-out) for all threshold values. Below are two ROC curves for two different K-Nearest Neighbours models I created for my binary classification project.

The figure on the left shows a model whose curve sits along the diagonal, meaning it has only a 50% chance of ranking the classes correctly, no better than guessing; the further the curve moves away from that straight line, the better the model is at predictions. The figure on the right shows a much better model: its area under the curve is about 0.9999, meaning that 99.99% of the time it ranks a randomly chosen positive sample above a randomly chosen negative one. This ‘chance’ is represented by the area under the curve, so the higher the area, the better the model.
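
As a small sketch, again assuming a fitted classifier clf with predict_proba and a test split X_test, y_test (illustrative names), the area under the ROC curve is a one-liner in scikit-learn:

from sklearn.metrics import roc_auc_score

probs = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
print(roc_auc_score(y_test, probs))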

Matthews Correlation Coefficient (MCC)

The MCC is another great solution to imbalanced datasets for classification problems.
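
For binary classification, MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).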

Looks complicated, right? I wouldn’t worry about memorising the formula so much as understanding what it means. At the end of the day, Scikit Learn or Statsmodels can give us the MCC in just a line of code. So what is the MCC? You can think of it as something like a chi-squared statistic that works for binary and multiclass classification problems. From the equation above, we can see that the MCC will only have a high score if a high number of correct predictions are made for both the positive and negative classes. For imbalanced datasets, it can be a better judge of the model than the accuracy score or the F1-score.
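
Here is that one line in scikit-learn, reusing the toy labels from the confusion matrix example above:

from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
print(matthews_corrcoef(y_true, y_pred))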

Evaluating models using Scikit Learn

If you are interested in Pythonic methods of model evaluation, I’d recommend using Scikit Learn. It spits out pretty much all of these metrics with minimal effort. Here’s a quick example of the necessary imports and typical outputs in Python. The code and the figures below are from my project, so feel free to dig into it if you’re interested.

The typical format is:

from sklearn.metrics import <metrics you want>

metrics(actual_values, model_predictions)

So, for one of my K-Nearest Neighbors models, here’s an example of what Scikit Learn gives us.
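
The original notebook code isn't reproduced here, but a minimal sketch of the kind of calls involved might look like this, assuming a fitted K-Nearest Neighbors model knn and a test split X_test, y_test (the names are illustrative):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

# knn, X_test and y_test are assumed to exist already
y_pred = knn.predict(X_test)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))
print('Recall:   ', recall_score(y_test, y_pred))
print('F1:       ', f1_score(y_test, y_pred))
print(classification_report(y_test, y_pred))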

The output:

As you can see, most of the metrics are also calculated for us in the classification report, which is a nice one-liner and saves us a little hassle.

I always like to visualise the ROC curve, so here’s what I did for this model:
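
The exact plotting code from the project isn't shown here, but a sketch along these lines reproduces the idea, again assuming the knn model and the X_test, y_test split from above:

from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

# Probability of the positive class for each test sample
probs = knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.4f}')
plt.plot([0, 1], [0, 1], linestyle='--', label='Chance')
plt.xlabel('False Positive Rate (Fall-out)')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()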

Overall, these evaluation metrics are yet more invaluable tools in the data science discipline.
