This article is the start of a series that will provide some background for the problems NuronLabs provides solutions to.

Precision-Recall is an important metric for measuring performance on a few tough visual perception problems, such as Object Detection and Instance/Semantic Segmentation. Models trained to solve these tasks are often sensitive and perform very differently on different visual distributions.

Thus, when we compare models, it is useful to have a more fine-grained, yet holistic view of their performance characteristics. To this end, precision-recall turns out to be a fairly useful metric to track when considering where the model might fail.

In my experience, I’ve noticed that many smart people understand what precision-recall is but find it hard to apply in real-time to interpret results. The purpose of this post is to introduce how Precision-Recall is typically computed in computer vision and to provide a few conceptual tools for applying it in real-time during a discussion.

These shortcuts are probably best explained through an analogy everyone can relate to, e.g. ordering food at a restaurant. In the next section, we’ll go over a few commonly used terms and apply that analogy to them.

The analogy

Suppose you’re in a restaurant and want to order fries and a coke. There are a few things that could happen. The waiter could:

1. Miss items you ordered (false negatives)
2. Bring you items you didn’t order (false positives)
3. Bring you items you ordered (true positives)
4. Not bring you items you didn’t order (true negatives)

An order composed of only 1 and 2 is clearly bad, while an order consisting of only 3 and 4 is clearly good. However, things often aren’t as straightforward as that. Let's take a look at the other cases while introducing the terms precision and recall. These help in identifying patterns in the behavior of models.
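The four outcomes in the list above can be written as plain set operations. The snippet below is only an illustrative sketch: the menu, the order and the delivered items are hypothetical, chosen so that every outcome appears at least once.

```python
# Hypothetical menu and order, used only to illustrate the four outcomes.
menu = {"fries", "coke", "burger", "salad"}
ordered = {"fries", "coke"}
delivered = {"fries", "burger"}


def outcome_sets(ordered, delivered, menu):
    """Split the menu items into the four outcomes of an order."""
    return {
        "true_positives": ordered & delivered,         # ordered and brought
        "false_positives": delivered - ordered,        # brought but not ordered
        "false_negatives": ordered - delivered,        # ordered but missed
        "true_negatives": menu - ordered - delivered,  # not ordered, not brought
    }


outcomes = outcome_sets(ordered, delivered, menu)
for name, items in outcomes.items():
    print(name, sorted(items))
```

Here the waiter brought the fries (true positive), added a burger (false positive), forgot the coke (false negative) and correctly left the salad in the kitchen (true negative).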

False positives but no false negatives (high recall, but low precision)

Customer: I’d like an order of fries and a coke
Waiter: Sure

<hands over an order of fries, burger and a coke>
Customer: …Yes, but you gave me a bunch of extra stuff

In this case, the waiter recalled what you ordered (no false negatives), but was not very precise (or maybe too generous) as he included additional items (false positives).

False negatives but no false positives (low recall, but high precision)

Customer: I’d like an order of fries and a coke
Waiter: Sure

<hands over an order of fries>
Customer: …Thanks, but you forgot something

In this case, the waiter didn’t bring you anything additional (no false positives) but failed to recall something you had actually ordered (false negatives).

Interpreting the Precision-Recall curve

A typical Precision-Recall curve of a trained model.

The image above is a Precision-Recall (PR) curve. It initially seems cryptic, but it's rather useful. Understanding the commonalities observed among the curves is a good place to begin.

The first thing to notice in most precision-recall curves is that they start off at the top-left, at recall=0.0 and precision=1.0. This seems confusing at first, but things become clearer when you consider how the curve is constructed.

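Each point on the curve is computed at a different threshold on the model's confidence score.
Each point on the curve is a tuple of (precision, recall), which are computed as follows:

precision = TP / (TP + FP)
recall = TP / (TP + FN)

where TP, FP and FN are the numbers of true positives, false positives and false negatives.

How to compute precision and recall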
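As a rough sketch, the two ratios can be applied directly to the restaurant scenarios from earlier. The function names are my own, and returning 1.0 when nothing was predicted is just the convention this post uses for the otherwise undefined case.

```python
def precision(tp, fp):
    """precision = TP / (TP + FP)"""
    if tp + fp == 0:
        # Nothing was predicted, so precision is undefined; by convention
        # we treat it as 1.0.
        return 1.0
    return tp / (tp + fp)


def recall(tp, fn):
    """recall = TP / (TP + FN); assumes at least one item was actually ordered."""
    return tp / (tp + fn)


# Fries and a coke ordered; fries, a burger and a coke delivered.
generous = (precision(tp=2, fp=1), recall(tp=2, fn=0))

# Fries and a coke ordered; only fries delivered.
forgetful = (precision(tp=1, fp=0), recall(tp=1, fn=1))

print("generous waiter  (precision, recall):", generous)
print("forgetful waiter (precision, recall):", forgetful)
```

The generous waiter gets perfect recall but a precision of 2/3, while the forgetful waiter gets perfect precision but a recall of 1/2.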

To get a better idea of what these thresholds represent, let's stick with the restaurant analogy.

Customer: I’d like an order of fries and a coke
Waiter: Sure

<hands over an order of potato wedges and a diet coke>
Customer: …I guess, technically they’re similar

In this case, the individual may or may not accept the order based on their inner “threshold”. An impossibly picky customer might insist that every order of fries contains an average of 86 fries and that each drink is composed of 13% ice.

On the curve, this would be the point on the top-left, where the threshold is set so impossibly high that the model predicts nothing. Note that the precision is undefined here (see equations above), but we set it to 1 as a matter of convenience.

As the threshold is lowered, the model accepts more predictions, and the natural trade-off between precision and recall is observed: recall rises while precision tends to fall, resulting in an average downward trend. A perfect model would just be a horizontal line at precision=1.0.
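To make the construction concrete, here is a minimal sketch of how the points on a PR curve could be generated by sweeping the threshold from high to low. The scores and labels are made up, and the sketch assumes each prediction has already been marked as correct (1) or incorrect (0); in a detection task that matching step would typically be based on box overlap, which is not shown here.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, from the highest threshold to the lowest.

    Assumes distinct scores and at least one positive label.
    """
    total_positives = sum(labels)
    # Sort by descending confidence: lowering the threshold past each score
    # admits one more prediction.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Threshold above every score: nothing is predicted, so recall is 0 and
    # precision is set to 1.0 by convention.
    points = [(0.0, 1.0)]
    tp = fp = 0
    for _score, is_positive in ranked:
        if is_positive:
            tp += 1
        else:
            fp += 1
        points.append((tp / total_positives, tp / (tp + fp)))
    return points


# Hypothetical confidence scores and correctness labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]

points = pr_curve(scores, labels)
for r, p in points:
    print(f"recall={r:.2f} precision={p:.2f}")
```

The first point sits at the top-left, and precision drifts downward as recall climbs, with small "wiggles" whenever a correct prediction follows an incorrect one.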

Sometimes there may be a sharp drop-off on the right. This has to do with the minimum threshold we’re willing to accept.

The curve, in general, is useful in assessing the characteristics of the model and what kind of threshold is appropriate for the precision and recall an application needs. Some applications, such as YouTube recommendations (and ordering at a restaurant), might need high precision but can tolerate lower recall. Others, like self-driving cars or medical applications, might need high recall but can tolerate lower precision.

A good way to determine this tradeoff is to consider whether the opportunity cost of missing something is higher or whether the cost of including something additional is higher.
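One way to turn this into a concrete choice is to fix the requirement that matters most and then take the best available value of the other. The sketch below assumes a precision-sensitive application; the helper name and the operating points are hypothetical, and the same idea works in reverse for a recall-sensitive one.

```python
def pick_threshold(operating_points, min_precision):
    """Pick the operating point with the highest recall among those whose
    precision is at least min_precision. Returns None if none qualifies.

    Each operating point is a (threshold, precision, recall) tuple.
    """
    eligible = [point for point in operating_points if point[1] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda point: point[2])


# Hypothetical (threshold, precision, recall) values read off a PR curve.
operating_points = [
    (0.9, 1.00, 0.33),
    (0.7, 0.67, 0.67),
    (0.6, 0.75, 1.00),
    (0.5, 0.60, 1.00),
]

strict = pick_threshold(operating_points, min_precision=0.99)
relaxed = pick_threshold(operating_points, min_precision=0.70)
print("strict :", strict)
print("relaxed:", relaxed)
```

Demanding near-perfect precision forces a high threshold and low recall, while relaxing the requirement lets the threshold drop and recall rise.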

Average Precision (AP)

Intersecting curves with similar AUC

This curve is useful in assessing quality. However, two models may have intersecting curves, which makes them hard to compare visually. One way to compare them would be to use the area under the curve (AUC), since a perfect model would have the maximal area under the curve. However, since the mass can be distributed across different parts of the curve, the total area doesn’t summarize the performance at all recall values well. Thus, to provide better visibility into performance at low recall, the following scheme was introduced in the 2012 Pascal VOC challenge (Section 4.2 here).

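Here, the average precision is computed by taking the mean of the precision at 11 equally spaced recall levels in [0, 0.1, …, 1].
The precision at each recall level is interpolated by taking the maximum precision measured at any recall greater than or equal to that level. This max is used in order to account for “wiggles” in the curve.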
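A minimal sketch of the 11-point scheme is shown below. It assumes the precision at each recall level is taken as the maximum precision found at any recall at or beyond that level, and that a level no prediction reaches contributes 0. The recall and precision values are hypothetical points from a small PR curve.

```python
def eleven_point_ap(recalls, precisions):
    """Mean of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    total = 0.0
    for level in levels:
        # Interpolated precision: the best precision at any recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / len(levels)


# Hypothetical PR points, ordered from the highest threshold to the lowest.
recalls = [1 / 3, 2 / 3, 2 / 3, 1.0, 1.0]
precisions = [1.0, 1.0, 2 / 3, 0.75, 0.6]

ap = eleven_point_ap(recalls, precisions)
print(f"AP = {ap:.3f}")
```

In this example the interpolated precision is 1.0 for the seven levels up to 0.6 and 0.75 for the four levels from 0.7 to 1.0, so the dip to 2/3 in the raw curve never shows up in the final number.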

Mean Average Precision (mAP)

While AP lets us summarize two curves and compare them, a model trained for a typical detection task usually outputs scores for multiple classes. We compute the PR curve and AP for each class independently to assess the class-wise characteristics of our model so that we can either add more data or calibrate the model.

However, the model’s holistic performance on a dataset is usually reported via the Mean Average Precision, which is the mean of the class-wise AP values. This is usually the final number reported and compared across multiple models.
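As a small illustration, mAP is just the average of the per-class AP values. The class names and AP numbers below are made up for the example.

```python
def mean_average_precision(ap_per_class):
    """Mean of the class-wise AP values (each class weighted equally)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical per-class AP values for a three-class detector.
ap_per_class = {"car": 0.80, "person": 0.70, "bicycle": 0.45}

map_score = mean_average_precision(ap_per_class)
print(f"mAP = {map_score:.2f}")

# The class-wise values remain useful for spotting the weakest class.
weakest_class = min(ap_per_class, key=ap_per_class.get)
print("weakest class:", weakest_class)
```

Because every class is weighted equally, a single weak class (here the hypothetical "bicycle") pulls the mAP down regardless of how often that class appears in the data.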
