Taking a look at Accuracy, Precision, and Recall for Classification tasks

Photo of a measuring tape
Measuring your classifiers. Photo by patricia serna on Unsplash

At GumGum, we train machine learning models on unstructured data, like text and images. An integral part of model development is model evaluation. For classification tasks in particular, there are several metrics to choose from. The best known are accuracy, precision, recall, the F-beta score, and the ROC AUC. Each of those gives a slightly different picture of the model's performance. With this series, I want to give those metrics a fresh look and dive into what each of them can do for you, and, even more importantly, what they cannot do for you.

Blogs in this Series:

  1. Taking a fresh look at metrics for Classification Tasks at GumGum — Introduction
  2. Taking a look at Accuracy, Precision, and Recall for Classification tasks
  3. Taking a close look at Precision for classification tasks
  4. ….

Overview

In this blog post, I want to focus specifically on accuracy, precision, and recall as metrics to measure classification tasks. I will be explaining why accuracy is not a good metric for imbalanced datasets, and how precision & recall can be used in that case.

Please look at the first blog in this series where I introduce Brand-Safety classification at GumGum, which I will be referring to in this blog post.

Let’s dive right into our metrics and start out with accuracy.

What is accuracy and what do we need to watch out for?

Accuracy, also sometimes referred to as top-1-accuracy (see this StackOverflow for more information on Top-1 accuracy vs Top-5 accuracy), is the number of correctly predicted examples divided by the total number of samples in the dataset. Accuracy does not differentiate between labels: a correct prediction counts the same no matter which class it belongs to.

accuracy = #correctly predicted examples / #total samples

Let's use the binary classification task for Brand Safety and say we have 100 examples in our dataset. 50 of the examples are “safe” and 50 are “unsafe”. This is a balanced dataset.

Diagram illustrating the next paragraph: Scenario 1 with 50 true positives + 25 true negatives = 75 correct predictions / 100 = .75 and Scenario 2 with 25 true positives + 50 true negatives = 75 correct predictions / 100 = .75
Created by Sanja Widmer, using Excalidraw

The visualization above will help showcase the two following scenarios.

Scenario 1: Let’s say we correctly identify all our “unsafe” examples and miss 50% of the “safe” examples, then our accuracy will be

(50 + 0.5*50)/100 = 75/100 = 75%

Scenario 2: Let’s reverse our case and say we miss 50% of the “unsafe” examples and correctly predict all “safe” examples as “safe”. Then our accuracy would be

(50*0.5 + 50)/100 = 75/100 = 75%.

We get the same accuracy for both cases, even though the model performs very differently.

Drawback 1) Accuracy does not tell us anything about which class is performing poorly/well. Given the same accuracy number for two different models, we cannot conclude that the two models perform the same.
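To make the two balanced scenarios concrete, here is a minimal Python sketch that rebuilds them as label lists and computes accuracy by hand. The encoding (1 for "unsafe", 0 for "safe") and the helper name `accuracy` are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Assumed encoding: 1 = "unsafe", 0 = "safe"; 50 examples of each class.
y_true = [1] * 50 + [0] * 50

# Scenario 1: every "unsafe" example is caught, half of the "safe" ones are missed.
scenario_1 = [1] * 50 + [0] * 25 + [1] * 25
# Scenario 2: half of the "unsafe" examples are missed, every "safe" one is correct.
scenario_2 = [1] * 25 + [0] * 25 + [0] * 50

acc_1 = accuracy(y_true, scenario_1)
acc_2 = accuracy(y_true, scenario_2)
print(acc_1, acc_2)  # 0.75 0.75
```

Both prediction lists score 0.75, even though they make their mistakes on opposite classes.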

Let's look at an imbalanced dataset instead. Let's say out of our 100 examples, 10 are “unsafe” examples and 90 are “safe” examples. Let's walk through our examples again.

Diagram showing the content of the next paragraph: Scenario 1 with 10 true positives + 45 true negatives = 55 correct predictions / 100 = .55 and Scenario 2 with 5 true positives + 90 true negatives = 95 correct predictions / 100 = .95
Created by Sanja Widmer, using Excalidraw

Scenario 1: Let’s say we correctly identify all our “unsafe” examples and miss 50% of the “safe” examples, then our accuracy will be

(10 + 0.5*90)/100 = 55/100 = 55%

Scenario 2: Let's reverse our case and say we miss 50% of the “unsafe” examples and correctly predict all “safe” examples as “safe”. Then our accuracy would be

(10*0.5 + 90)/100 = 95/100 = 95%.

Now, we can observe another problem. In the second scenario, we get an accuracy of 95%. This sounds amazing, when in fact, our classifier is missing half of all “unsafe” examples. This is a problem, especially when the minority class is important, like for GumGum’s Brand Safety classifier. With each miss of an “unsafe” example, we expose our customers to brand-unsafe content. Conversely, the first scenario has a relatively bad accuracy number, but we are successful at protecting our clients from unsafe content.
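The same comparison can be sketched for the imbalanced case. The snippet below is only an illustration: the 1/0 encoding and the helper names `accuracy` and `unsafe_caught` are assumptions, and `unsafe_caught` simply reports the share of ground-truth "unsafe" examples that were predicted as "unsafe".

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unsafe_caught(y_true, y_pred):
    """Share of ground-truth "unsafe" examples that were predicted "unsafe"."""
    hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    return hits / sum(y_true)


# Assumed encoding: 1 = "unsafe", 0 = "safe"; 10 "unsafe" and 90 "safe" examples.
y_true = [1] * 10 + [0] * 90

# Scenario 1: all "unsafe" caught, half of the "safe" examples missed.
scenario_1 = [1] * 10 + [0] * 45 + [1] * 45
# Scenario 2: half of the "unsafe" examples missed, all "safe" examples correct.
scenario_2 = [1] * 5 + [0] * 5 + [0] * 90

for name, y_pred in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    print(name, accuracy(y_true, y_pred), unsafe_caught(y_true, y_pred))
# scenario 1 0.55 1.0
# scenario 2 0.95 0.5
```

The scenario with the higher accuracy is the one that lets half of the "unsafe" content through.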

Another example where the minority class is extremely important is fraud detection. There, datasets can be even more imbalanced: for instance, only 1 in 10,000 emails might be fraudulent.

A major risk when training and testing on imbalanced datasets is that a classifier might learn to always predict the majority class. In the fraud example, such a classifier would still reach 99.99% accuracy. This is a useless classifier with amazing-looking accuracy metrics.

Drawback 2) In imbalanced datasets, accuracy can be a misleading metric, disguising a potentially useless model behind an amazing-looking performance number.
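As a quick sanity check of this drawback, the following sketch simulates a "classifier" that always predicts the majority class on a 1-in-10,000 dataset. The dataset is synthetic and the variable names are hypothetical; it is not meant to represent real fraud data.

```python
# Synthetic, assumed data: 1 = minority class (e.g. fraud), 0 = majority class.
n_total = 10_000
y_true = [1] + [0] * (n_total - 1)

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
majority_accuracy = correct / n_total
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(majority_accuracy)  # 0.9999
print(minority_found)     # 0
```

The accuracy is 99.99%, yet not a single minority-class example is detected.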

For GumGum, Brand-Safety classification is an imbalanced dataset problem, where around 16% of URLs are unsafe. The minority class is the “brand-unsafe” class. Therefore accuracy is not a good metric for us, as we do not get enough information about the important class. (If you are not sure what Brand-Safety classification is, please check out the previous blog post in this series for a quick reminder.)

What do precision and recall bring to the table?

Oftentimes, when evaluating imbalanced classification results, precision and recall are the top-recommended metrics. Let’s see why that is.

Let's start off with the definitions. Precision and Recall are calculated based on the results from the confusion matrix. From the confusion matrix, we collect the true positives (TP), false positives (FP), and false negatives (FN) for each label.

Image source: Joydwip Mohajon

A true positive (TP) is the case where the classifier predicted a positive and the ground-truth is also a positive.
A false positive (FP) is the case where the classifier predicted a positive and the ground-truth is negative.
A false negative (FN) is the case where the classifier predicted a negative and the ground-truth is a positive.

In all examples, I will treat the “unsafe” brand-safety class as the positive class, since this is the class of interest for GumGum.

recall
= #correctly predicted positives / #of ground truth positives
= TP / (TP + FN)

precision
= #correctly predicted positives / #of predicted positives
= TP / (TP + FP)

Recall tells us what percentage of the actual “unsafe” examples our classifier identified correctly. This metric is very important for brand safety, as ideally, we want to capture all the brand-unsafe content. Recall is independent of the class distribution in the dataset because it only looks at the “unsafe” (positive) examples. For example, catching 60 out of 100 “unsafe” examples gives the same recall as catching 600 out of 1000. To be clear, the dataset can still have an effect on recall. For instance, if a dataset contains almost exclusively ‘easy-to-classify’ examples, the recall on that dataset will be higher. Conversely, if a dataset contains almost exclusively ‘hard-to-classify’ examples, the recall on that dataset will be lower.

Precision tells us what percentage of our predicted “unsafe” examples are actually unsafe. It adds something very important to the equation: it measures how well the model can tell the “unsafe” class apart from the “safe” class. If precision is below 50%, the classifier produces more false positives than true positives, meaning it mislabels more “safe” examples as “unsafe” than it correctly identifies actual “unsafe” examples. The dataset does have an effect on precision as well. I will go into depth on this in my next blog post to give it the proper space.
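The two formulas translate directly into code. The sketch below applies them to the counts from the imbalanced example earlier in this post (10 “unsafe” and 90 “safe” examples, with “unsafe” as the positive class). The function name `precision_recall` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the formulas themselves define.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    Assumed convention: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Scenario 1: all 10 "unsafe" caught, 45 "safe" examples wrongly flagged.
p1, r1 = precision_recall(tp=10, fp=45, fn=0)
# Scenario 2: 5 of 10 "unsafe" caught, no "safe" example wrongly flagged.
p2, r2 = precision_recall(tp=5, fp=0, fn=5)

print(f"scenario 1: precision={p1:.2f} recall={r1:.2f}")  # precision=0.18 recall=1.00
print(f"scenario 2: precision={p2:.2f} recall={r2:.2f}")  # precision=1.00 recall=0.50
```

Unlike the accuracy numbers (55% vs 95%), these values show where each scenario fails: the first catches everything but raises many false alarms, while the second raises no false alarms but misses half of the “unsafe” examples.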

Let's look at what precision and recall can do for us:

  1. Recall and Precision both work on the individual class results. So we will calculate precision and recall for the “safe” class as well as the “unsafe” class. This means:
    In an imbalanced dataset, precision and recall for the minority class will give us information about the classifier’s performance on the minority class specifically. This protects us from letting the majority class predictions drown out the minority class predictions.
  2. If a classifier always predicts “safe”, both precision and recall for the “unsafe” class will be 0, no matter how well we are performing on the “safe” (majority) class.
  3. Recall and Precision show antagonistic behavior when the model is evaluated at different decision boundaries.
    With a low decision boundary, models tend to “jump the gun” and predict more positives. This then causes potentially more TP, but also more FP, increasing recall and decreasing precision. With a high decision boundary, models are more conservative, predicting fewer positives. This then leads to potentially fewer TP and FP, decreasing recall and increasing precision. This antagonistic behavior is very dependent on the quality of the model and the difficulty of the task. This antagonistic relationship between precision and recall for different decision thresholds can be visualized in a precision-recall curve.
Example for a precision-recall curve of a good classifier
source: Machine Learning Mastery
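The antagonistic behavior described above can be reproduced on a toy example. In the sketch below, the labels and confidence scores are made up purely for illustration, the helper `precision_recall_at` is hypothetical, and treating a score greater than or equal to the threshold as a positive prediction is an assumption.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the positive class at a given threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0     # assumed convention
    return precision, recall


# Made-up ground truth (1 = "unsafe") and made-up model confidence scores.
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.60, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]

results = {thr: precision_recall_at(y_true, scores, thr) for thr in (0.25, 0.5, 0.7)}
for thr, (precision, recall) in results.items():
    print(f"threshold={thr}: precision={precision:.3f} recall={recall:.3f}")
# threshold=0.25: precision=0.625 recall=1.000
# threshold=0.5: precision=0.800 recall=0.800
# threshold=0.7: precision=1.000 recall=0.600
```

Raising the threshold trades recall for precision on this toy data. A precision-recall curve is the same computation repeated over many thresholds.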

Now that we have established what recall and precision are, let’s look at some of their properties and potential drawbacks.

  1. Due to the antagonistic behavior of recall and precision, it is difficult to decide on a model or a threshold for that model. It also makes it difficult to compare previous models to current models. Is a slight uptick in recall worth the downtick in precision?
  2. Due to the per-class evaluation, calculating precision and recall per class can be too fine-grained if we look at multiclass problems. In addition, we have the same issue as above, where we have to wonder: is an uptick in recall for class A worth the downtick in recall for class B?

Drawback 1) Selecting a model only using precision and recall is difficult, as we always have to balance at least two metrics against each other.

Finally, for all metrics discussed today, we can only determine the FP, TP, and FN on discrete labels. Most models return a continuous confidence score between 0 and 1. In order to get discrete labels, we need to set a threshold to separate the positive predictions from the negative predictions.

A big drawback of accuracy, precision, and recall: They all only give us the model performance at a specific threshold. When comparing two models against each other using precision, recall, or accuracy, we only know how the model is performing at a specific threshold. We do not know how the model performs overall. ROC AUC, as well as the Precision-Recall curve, are metrics that do provide this information, and I will go into depth on those in a later blog post.
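To illustrate why a single threshold can be misleading, the sketch below compares two hypothetical models on the same made-up labels. All scores are invented for illustration, and the helper `accuracy_at` and the two thresholds are assumptions. The same effect can be shown with precision or recall.

```python
def accuracy_at(y_true, scores, threshold):
    """Accuracy after turning confidence scores into labels at a threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ground truth (1 = positive) and scores from two hypothetical models.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [0.90, 0.80, 0.45, 0.40, 0.30, 0.20, 0.10, 0.05]
model_b = [0.90, 0.80, 0.70, 0.30, 0.45, 0.20, 0.10, 0.05]

for thr in (0.5, 0.35):
    acc_a = accuracy_at(y_true, model_a, thr)
    acc_b = accuracy_at(y_true, model_b, thr)
    print(f"threshold={thr}: model A={acc_a}, model B={acc_b}")
# threshold=0.5: model A=0.75, model B=0.875
# threshold=0.35: model A=1.0, model B=0.75
```

At a threshold of 0.5, model B looks better. At 0.35, model A separates the classes perfectly. Which model "wins" depends entirely on the threshold chosen for the comparison.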

Finally, precision has one major pitfall that, if the data scientist is unaware of it, can lead to wrong conclusions during training, model selection, and model evaluation. I will talk about this pitfall in my next blog post to give it the appropriate space, so stay tuned for that!

Summary

To summarize, we looked at accuracy and found that for imbalanced datasets, accuracy is not the right metric, as it obscures the model’s performance on the minority class. It also does not distinguish between classes and therefore provides no insight into the performance on individual classes.

We then looked at precision and recall. Their main contribution is that they evaluate individual classes and can therefore alert us to bad performance on minority classes. Because of this, they are superior to accuracy and are the metrics of choice when evaluating classifiers on an imbalanced dataset.

Also, accuracy, precision, and recall are all evaluated at a specific threshold. While this is useful when evaluating a model as a black box, it does not give us the whole picture of the model’s performance.

In the next blog post, I want to do a deep dive into precision and what we need to be on the lookout for when using it for model training, evaluation, and model selection. I will also explain how to fully utilize precision as a metric.

In later blog posts in this series, I want to discuss which metrics can be used to complement precision and recall in model selection and to compare different models against each other on the same dataset or even on consecutive datasets. We will discuss the harmonic mean of precision and recall, the F1 score, as well as weighted harmonic means of precision and recall, such as F0.5 and F2. Those especially will help solve the dilemma of how to select a model using only precision and recall. We will also look at the ROC AUC as a way to measure the performance of the model as a whole, independent of the threshold.
