Intuitive explanation of precision and recall

One of the first things you learn (or have learned) when getting into machine learning is the model evaluation concept of precision and recall. If your experience was like mine, you were shown the formulas for computing both, and you understood that if your model has high precision and high recall, that's a good thing. Conversely, if only one of them is high and the other is low, your model may not be that good. However, an intuitive understanding of what these metrics actually measure may not have been a takeaway.

In this post, I will attempt to illustrate what these metrics actually mean in terms of the performance of your model.

Precision and Recall

At the risk of being redundant, I will do what everyone else has done and start by listing the formulas used to compute these metrics:

def precision(tp, fp):
    # Of everything the model labeled positive, what fraction was actually positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything that was actually positive, what fraction did the model find?
    return tp / (tp + fn)
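
As a quick sanity check, here is what those functions return for some made-up counts (the numbers below are just an illustration, not from any real model):

tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 8 / 10 = 0.8
print(recall(tp, fn))     # 8 / 12 ≈ 0.67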

So here’s my example that illustrates the difference between the two, and what they mean in terms of your model performance:

Hot dog not hot dog

Suppose your model is a classifier that detects whether a given input image has a hot dog in it or not. Let's say the test set has 10 images, and only 3 of them have a hot dog in them.

Recall in a classification example

If the model is able to find all the photos with a hot dog in them, it is said to have 100% recall. The catch is that recall says nothing about the mistakes the model makes on the negative cases. In other words, if there are three images with a hot dog and the model predicts that 5 of the 10 images have hot dogs, it still has 100% recall as long as those 5 include the 3 images that actually have a hot dog; the 2 false positives don't hurt recall at all. That is, recall measures whether you were able to find all instances of the thing you're looking for (in this case, images with a hot dog in them).
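
To put numbers on that scenario (using the counts described above and the functions from earlier):

# 3 real hot dog images; the model flags 5 images, and the 3 real ones are among them.
tp, fp, fn = 3, 2, 0
print(recall(tp, fn))     # 1.0: every hot dog was found
print(precision(tp, fp))  # 0.6: but 2 of the 5 positive calls were wrong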

Precision in a classification example

In contrast, precision is about how often the model is right when it does make a positive prediction. If the dataset has only 3 images with a hot dog in them, and the model says that only 3 of the images have a hot dog, it has 100% precision as long as all three of those predictions are correct.
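
The same kind of sketch for the precision case (again, hypothetical counts matching the scenario above):

# The model flags exactly 3 images and all 3 really contain a hot dog.
tp, fp, fn = 3, 0, 0
print(precision(tp, fp))  # 1.0: no false positives
print(recall(tp, fn))     # 1.0 here too, since no hot dogs were missed
# Had it flagged only 1 image (and been right), precision would still be 1.0
# but recall would drop to 1/3.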

Liar card game

Here is another illustration in case the above didn't clarify precision/recall for you. There's a card game that I've seen go by different names, but I first learned it as liar. The point of the game is to lie without being caught, while at the same time calling other people out when they're lying.

For example, say there are three players. Each player gets a pile of cards that the other players cannot see. Each player takes a turn putting a card face down in the middle, and the cards are supposed to follow the sequence A, 2, 3, 4, 5, …, K. When you put your card down, you announce what that card supposedly is. You could put down the actual card in the sequence (in which case you'd be telling the truth), or you could put down any other card and lie that it is in fact the card that follows the sequence.

Your goal is to detect when someone puts down a card that's not in the sequence, and then call that person out as the liar. When someone gets called out as the liar, the card that person played gets revealed. If the person did in fact lie, then the liar takes back his/her card, as well as all the cards in the pile in the middle. If the person was actually telling the truth, then whoever called out "liar" takes all the cards instead. (Here's the Wikipedia explanation of this same game.)

In the context of this game, if you call liar every single time someone puts down a card, you will end the game with 100% recall. That is, every time someone actually lies, you will catch them. The problem with this approach is that you will also lose the game, since you will be racking up every possible false positive: over and over, you will be claiming that a person was lying when they really were not.

On the other hand, if you only called someone out, say, two times, you would have 100% precision as long as you were correct both times. While in that case you made no mistakes when you did call out a liar, you would probably still lose the game because your recall was too low: you did not catch all the lies that were actually told.

For completeness, your liar detection would be perfect if you called people out only when they were in fact lying, and caught every lie. That is, if, say, only 3 lies were told over 10 rounds, your precision and recall would both be 100% if you called someone out exactly 3 times and were correct all 3 times.
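
Using the same precision/recall functions from earlier, the three liar strategies above score roughly like this (the counts assume the hypothetical 10-round game with 3 actual lies):

# 10 rounds, 3 actual lies.
# Strategy 1: call "liar" on every single card.
print(recall(tp=3, fn=0), precision(tp=3, fp=7))   # 1.0, 0.3
# Strategy 2: call "liar" only twice, correct both times.
print(recall(tp=2, fn=1), precision(tp=2, fp=0))   # ~0.67, 1.0
# Strategy 3: call "liar" exactly on the 3 actual lies.
print(recall(tp=3, fn=0), precision(tp=3, fp=0))   # 1.0, 1.0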

To conclude, it is in fact possible to have 100% precision or 100% recall while having a completely useless model. As the examples above illustrate, it is easy to game either metric on its own by deliberately making far too many positive predictions, or by making very few of them (say, only when you're absolutely sure that you're correct). In other words, really good precision or really good recall by itself, without the other, is not necessarily a good indication of your model's overall performance and usefulness.