Accuracy, Confusion Matrix, Precision/Recall, Oh My!
An intuitive explanation
In machine learning, evaluating the performance of classifiers is trickier to understand than evaluating regression-based models. Here, I will attempt to provide an intuitive example that sheds light on some important evaluation concepts. Please note that this discussion does not include an explanation of F1 score, specificity, ROC, or AUROC. If there is enough interest, I would be happy to write an additional article on those. Now, let’s continue to the important stuff.
Imagine you’re a security guard for a store. You work half the time and another security guard, your coworker, works the other half. Of course, what a security guard is trying to do is prevent theft. For simplicity, let’s say that 90% of people who visit the store are there to shop and not steal anything. The other 10% of visitors, however, are looking for a five finger discount. (Apparently this store makes people want to steal.)
Let’s first imagine this scenario. Your coworker is terrible at his job: despite the fact that 10% of people who enter the store try to steal something, he almost never stops anyone! The threshold he uses to act is very, very high. Let’s say, in this case, that threshold values range from 0 to 100 and that your coworker is at 99, meaning that he has to be very, very certain that someone is stealing for him to act. Even a very high suspicion will not lead him to act. He must be certain.
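If it helps to see that decision rule written out, here is a minimal sketch in Python. The 0-to-100 suspicion scale and the threshold of 99 come straight from the story; the function name and the example scores are made up purely for illustration:

```python
def guard_acts(suspicion_score, threshold):
    """Return True if the guard stops the visitor.

    suspicion_score: how certain the guard is that this visitor is stealing (0 to 100).
    threshold: the minimum certainty the guard requires before acting.
    """
    return suspicion_score >= threshold

# Your coworker's threshold is 99, so even strong suspicion is not enough.
print(guard_acts(suspicion_score=85, threshold=99))    # False -> the thief walks out
print(guard_acts(suspicion_score=99.5, threshold=99))  # True  -> caught in the act
```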
As a result, when he is on watch, almost the entire 10% of people who come to steal get away with it. The manager calls him in and tells him that there is a lot of theft under his watch. What does your coworker say? “Well, I’m still about 90% accurate!”
Here lies the problem with using accuracy as a performance measure with imbalanced classes in a classification task. In this case, the two classes of visitor are ‘thief’ and ‘not thief’, where ‘thief’ accounts for only 10% of the people who visit the store. So, yes, your coworker is right. Even when he does nothing, he is still 90% right when it comes to classifying who is stealing and who is not, simply because 90% of visitors do not steal. A ferret dressed up in a security guard outfit would be about as accurate.
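To make the “do nothing, still 90% accurate” claim concrete, here is a small sketch. The 90/10 split comes from the example; the 1,000-visitor sample size is an assumption for illustration:

```python
# 1,000 visitors: 900 honest shoppers (0) and 100 thieves (1).
actual = [0] * 900 + [1] * 100

# The coworker's strategy in the extreme: never flag anyone as a thief.
predicted = [0] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
print(f"Accuracy: {correct / len(actual):.0%}")  # 90% -- without catching a single thief
```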
Your boss gets angry and says he prefers different ways to evaluate security guard performance. He says one way he measures performance is through a Confusion Matrix. In fact, through video footage of shoppers, he has created a confusion matrix of your coworker’s performance with a sample of 1,000 people who entered the store during his watch.
Below is a confusion matrix. The rows represent the actual values for each class, while the columns represent the predicted values for each class.

|             | Predicted: NO       | Predicted: YES      |
|-------------|---------------------|---------------------|
| Actual: NO  | True Negative (TN)  | False Positive (FP) |
| Actual: YES | False Negative (FN) | True Positive (TP)  |
In our example, the NO label corresponds to ‘not thief’ visitors and the YES label corresponds to ‘thief’ visitors. Let’s go cell by cell in the matrix and clarify some terminology, starting with the top left cell.
- When the security guard predicts that a person belongs to the ‘not thief’ class (Predicted: NO) and the prediction is correct (the visitor was not a thief), it is called a **True Negative (TN)**.
- When the security guard predicts that a visitor belongs to the ‘thief’ class (Predicted: YES), but the person belongs to the ‘not thief’ class (incorrect prediction), it is called a **False Positive (FP)**.
- When the security guard predicts that a visitor is a ‘not thief’, but the person is a ‘thief’ visitor (incorrect prediction), it is called a **False Negative (FN)**.
- Finally, when the security guard predicts that a visitor is a ‘thief’, and the prediction is correct, it is called a **True Positive (TP)**.
An easy way to remember what these bolded terms mean is that the positive/negative part refers to what the prediction was, regardless of whether it was right or wrong, while the true/false part tells you whether that prediction was correct. Thus, a false negative means the prediction was ‘not a thief’ (negative), but that prediction turned out to be wrong (false).
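As a small sketch of how these four counts fall out of paired actual/predicted labels (the six labels below are made up purely so that every cell shows up at least once):

```python
# 1 = thief (the positive class), 0 = not thief (the negative class)
actual    = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 0, 1, 0, 1]

pairs = list(zip(actual, predicted))
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # predicted NO, and correct
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # predicted YES, but wrong
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # predicted NO, but wrong
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # predicted YES, and correct

print(tn, fp, fn, tp)  # 2 1 1 2
```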
Your coworker’s confusion matrix below shows that, out of 1,000 people, he correctly predicts that 894 belong to the ‘not thief’ class. But the matrix also shows that, of those 1,000 visitors, he allowed 105 people to steal something from the store, since he will not act unless he’s absolutely sure the person is trying to steal. In one case, he actually saw someone steal right in front of him and acted for once. This is the true positive value of 1 in the matrix. (Note that since this is a sample, there will be some randomness in the values, which is why 894/1000 is less than 90%.)

|             | Predicted: NO | Predicted: YES |
|-------------|---------------|----------------|
| Actual: NO  | 894 (TN)      | 0 (FP)         |
| Actual: YES | 105 (FN)      | 1 (TP)         |
The confusion matrix thus paints a fuller picture of security guard performance.
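If you want to reproduce that matrix in code, here is one way to do it, assuming scikit-learn is available; the label lists simply re-create the counts from the sample above:

```python
from sklearn.metrics import confusion_matrix

# Re-create the 1,000-visitor sample: 0 = not thief, 1 = thief.
actual    = [0] * 894 + [1] * 106             # 106 visitors actually came to steal
predicted = [0] * 894 + [0] * 105 + [1] * 1   # the coworker flagged only one of them

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(actual, predicted))
# [[894   0]
#  [105   1]]
```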
Furthermore, your boss says that he likes to use Precision and Recall as measures of security guard performance, but that he puts a lot more focus on recall. He explains that precision is the number of true positives divided by the number of true positives plus the number of false positives. More formally:

Precision = TP / (TP + FP)
This is simply how accurate the security guard is when he predicts that visitors are thieves (Predicted: YES). In your coworker’s case, this is 1/1, or 100%. Is this value a shock? Not really. Since your coworker has to be absolutely positive that a person is stealing and basically has to catch the person in the act, it is very likely that he will be correct when he does act. In the confusion matrix, precision is computed from the Predicted: YES column (the blue rectangle in the figure below).
Then your boss explains that recall is the number of true positive predictions divided by the number of true positives plus the number of false negatives. More formally:

Recall = TP / (TP + FN)
Recall is computed from the Actual: YES row (the red rectangle in the figure above). What the equation asks is: of all the individuals who actually were there to steal, how many did the security guard correctly predict were there to steal? In this regard, your coworker does horribly. His recall is 1/106, or about 0.94%.
Is this number a shock? Again, no. We know that your coworker very rarely approaches anyone and that he will only catch people that he is absolutely certain are stealing; basically catching them in the act. Unfortunately, most thefts are more discreet and even if your coworker has a fairly strong suspicion that someone might be stealing, he still does not approach them because he is not absolutely certain and the person gets away. This is why his recall is so low.
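Plugging your coworker’s counts into the two formulas confirms the numbers above:

```python
# Your coworker's counts from the confusion matrix above.
tp, fp, fn = 1, 0, 105

precision = tp / (tp + fp)   # 1 / 1
recall    = tp / (tp + fn)   # 1 / 106

print(f"Precision: {precision:.0%}")  # 100%
print(f"Recall: {recall:.2%}")        # 0.94%
```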
This is what is meant by the precision/recall trade-off. Because your coworker only acts when he is absolutely certain someone is stealing, he is very likely to be right whenever he does act. However, because he has to be so certain, a lot of thefts will occur.
The boss’s message to your coworker is to be more like you. So, what are you like? Unlike your coworker, your threshold for approaching someone is a lot more balanced: something like 50 out of 100. In other words, if you have a mild suspicion that someone is stealing, you will act (Predicted: YES). How will this threshold affect your confusion matrix and your precision and recall scores?
|             | Predicted: NO | Predicted: YES |
|-------------|---------------|----------------|
| Actual: NO  | 870 (TN)      | 25 (FP)        |
| Actual: YES | 24 (FN)       | 81 (TP)        |

This is quite different from your coworker’s confusion matrix. The numbers show that while approximately 10% of visitors are still trying to steal something, you predict that people are stealing far more often than your colleague does.
Your precision score is 81/106, or about 76%. When you have a suspicion that someone is stealing, you are right about 76% of the time. This is lower than your coworker’s score, but that is to be expected. You approach people even when you are not completely certain, so you end up bugging about 25 people out of 1,000 (false positives) who were not there to steal anything.
What about your recall? It is 81/105, or about 77%, meaning that of the 105 people who came to the store to steal something, your threshold to act let only 24 of them leave the store with stolen goods. This is far better than your coworker’s performance; he let 105 people leave the store with stolen items.
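And the same calculation with your counts from the matrix above:

```python
# Your counts from the confusion matrix above.
tp, fp, fn = 81, 25, 24

precision = tp / (tp + fp)   # 81 / 106
recall    = tp / (tp + fn)   # 81 / 105

print(f"Precision: {precision:.0%}")  # 76%
print(f"Recall: {recall:.0%}")        # 77%
```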
By now, there is hopefully some intuition about what precision and recall are and how the threshold value will affect them. And hopefully, there is also some understanding as to why there is a trade-off between precision and recall. The higher the precision, the lower the recall, and vice-versa. Below is a graphical representation of this trade-off.
In the example of you and your coworker, your coworker’s threshold sits far to the right on the graph above. And, as the numbers we calculated from his confusion matrix showed, the graph depicts that he is expected to have high precision but very low recall. Conversely, your threshold is closer to the middle of the graph, near where the precision and recall curves intersect. And, as we calculated from your confusion matrix, the graph shows that while you are bound to bug some innocent people because you thought they were there to steal, you also correctly predict and catch far more of the people who actually are there to steal.
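If you want to see the trade-off emerge numerically rather than from a graph, here is a rough simulation. The suspicion scores and their distributions are entirely made up, but sweeping the threshold over them shows precision rising as recall falls:

```python
import random

random.seed(0)

# Simulated suspicion scores on the 0-100 scale: thieves tend to look more
# suspicious than honest shoppers, but the two groups overlap, and that
# overlap is what creates the trade-off.
honest_scores = [random.gauss(30, 15) for _ in range(900)]
thief_scores  = [random.gauss(70, 15) for _ in range(100)]
scores = honest_scores + thief_scores
actual = [0] * 900 + [1] * 100

def precision_recall(threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # nobody flagged: no wrong accusations
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (30, 50, 70, 90, 99):
    p, r = precision_recall(threshold)
    print(f"threshold={threshold:>2}: precision={p:.0%}, recall={r:.0%}")
# As the threshold climbs, precision rises while recall collapses.
```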
When considering the precision/recall trade-off, the question is “What is more important to the task?” Higher recall, higher precision, or roughly the same amount of both? As we have seen, the threshold you set is what determines the balance you strike between precision and recall.
At airport security, as in our security guard example, recall has higher importance, which is why your boss said he puts more weight on it. When something at the airport is a genuine security threat, we would hope that a very high proportion of those threats are caught. On the other hand, if a classifier is trying to detect safe websites for a child (where a safe site is considered the positive class), then we would want high precision: when a site is predicted to be safe, and therefore allowed through, it should really be safe. We don’t want any unsafe sites slipping past the filter.
Happy learning!