Precision and Recall: Understanding the Trade-Off
by Samuel Hillis & Sara Hoormann
Recent advances in computing and availability of data have driven an explosion of enthusiasm around predictive analytics. As these algorithms and technology spread through companies and industries, it has become increasingly important to have knowledge of key concepts that extend beyond buzzwords.
There are two standard types of predictive analytics. The first type, which you are likely familiar with, is regression. In regression, you’re predicting a continuous numeric value, like when you forecast sales for the next month or year. The other type is classification. You’re likely to start to encounter this more as your organization further embraces analytics. Classification is the act of predicting discrete outcomes, like whether an event will or won’t happen, or whether an email is spam or legitimate.
In the supply chain, for example, classification methods are likely to help with the following types of problems:
- Predicting whether a machine or part will fail
- Predicting whether a product will stock out or not
- Predicting whether a new product is likely to be a hit or a dud after the first few weeks of its release
- Predicting whether items are likely to become obsolete
- Predicting whether a product meets your quality standards (for example, potato chips)
- Predicting whether an order is likely to be late
- Predicting if a truck driver is likely to get into an accident
With regression, we know to ask about the forecast errors and statistical significance of the variables to understand more about the model and its performance. But when your predictions are “True” and “False” instead of numbers, how do you figure out if the model is as good as it can be?
With classification algorithms, we have new metrics to ask about: precision and recall.
To explain precision and recall, let’s employ a fishing example.
Say there’s a pond that you like to fish in, and somehow you know the total number of fish that live there. Our goal is to build a model that catches red fish (we may say that we want to “predict” that we catch red fish).
In our first test, we have a model that consists of two fishing poles, each with bait made from a scientific recipe with all the food that red fish like. Precision measures how often the positive predictions you make are correct. With our fish example, that means asking whether the fish caught with the special bait are, in fact, red.
The following test shows great precision — the two fish caught were both red. We were trying to (or predicted we would) catch red fish, and all of the fish that we caught were red.
There is one small problem here though. We knew there were a lot more fish in the pond, and you might notice that when we looked closer we also found a lot more red fish that we didn’t catch. How can we do a better job of catching more of the red fish?
Here is where our other measure, recall, comes into play. Recall measures how many of the red fish in the pond we actually catch, so it increases as we catch more of them. In our first model, we didn’t do a very good job of catching a high volume of red fish; although our precision was excellent, our recall was not so good.
Knowing this, we decide to develop a new model using a fishing net and a brand new bait recipe. The picture below shows the result of our first test with this new model. We caught more of the red fish! Our recall has definitely improved.
Unfortunately, with this imprecise approach, we caught a lot of blue fish in our net as well. We weren’t trying to catch them, but they ended up in our net anyway. As our net got bigger and our bait became less specialized, our precision suffered as our recall improved.
This is the fundamental trade-off between precision and recall. Our model with high precision (most or all of the fish we caught were red) had low recall (we missed a lot of red fish). In our model with high recall (we caught most of the red fish), we had low precision (we also caught a lot of blue fish).
When building a classification model, you will need to consider both of these measures. Trade-off curves similar to the following graph are typical when reviewing metrics related to classification models. The thing to keep in mind is that you can tune the model to be anywhere along the frontier.
For a given model, it is generally possible to increase either statistic at the expense of the other. Choosing the preferred combination of precision and recall can be considered equivalent to turning a dial between more conservative predictions (precision-focused) and less conservative predictions (recall-focused). It is important to note that this is for a given model; a better model may, in fact, increase both precision and recall.
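One way to picture the dial is as a score threshold. The sketch below is only an illustration, not a method from this article: it assumes a model that gives each fish a score between 0 and 1 for how likely it is to be red, and the labels, scores, and the function name `precision_recall` are all made up for the example.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are predicted positive."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if s >= threshold and y == 1)
    fp = sum(1 for y, s in pairs if s >= threshold and y == 0)
    fn = sum(1 for y, s in pairs if s < threshold and y == 1)
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 1 = red fish, 0 = blue fish, with made-up model scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]

# Lowering the threshold casts a wider net: recall rises, precision falls.
for threshold in (0.85, 0.50, 0.15):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With this toy data, the strictest threshold behaves like the fishing poles (everything caught is red, but most red fish are missed), and the loosest behaves like the net.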
In choosing the correct balance of precision and recall, you should carefully consider the problem you want to solve.
Let’s relate this back to a supply chain problem: if we’re predicting truck driver accidents, we may want high recall (and be okay with low precision), because the cost of failing to prevent an accident is high. That is, we want a list that captures all the high-risk drivers, even if some low-risk drivers find their way into our net. We can then do extra training and monitoring. Our money spent on preventive measures is worth it if we prevent just one severe accident.
On the other hand, if we’re predicting stockouts, we may aim for higher precision. Let’s say that 200 of my 5000 SKUs will stock out next month. I would be very happy to have a high-precision list of even just the 60 SKUs most likely to stock out; I’ll expedite and take extra measures with these items. I’ll still miss at least 140, but that’s better than the model giving me a list of 600 SKUs, because the cost of taking these extra measures is high, and on balance it’s not worth it to be overly cautious about stockouts (unlike the truck driver example). Even if it contained all 200 true stockouts, a low-precision list of 600 SKUs would force me to spend time and money on 400 items where there wasn’t going to be a problem.
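The arithmetic behind the two stockout lists can be checked with a few lines of Python. This is a rough sketch under two stated assumptions that go slightly beyond the text: every SKU on the 60-item list really does stock out, and the 600-item list contains all 200 true stockouts. The function name `list_metrics` is hypothetical.

```python
def list_metrics(list_size, true_stockouts_on_list, total_stockouts):
    """Return (precision, recall, wasted_items) for a list of flagged SKUs."""
    precision = true_stockouts_on_list / list_size
    recall = true_stockouts_on_list / total_stockouts
    wasted_items = list_size - true_stockouts_on_list  # flagged, but no stockout
    return precision, recall, wasted_items


TOTAL_STOCKOUTS = 200  # SKUs that will actually stock out, from the example

# Assumption: all 60 flagged SKUs are true stockouts.
short_list = list_metrics(60, 60, TOTAL_STOCKOUTS)
# Assumption: the 600 flagged SKUs include all 200 true stockouts.
long_list = list_metrics(600, 200, TOTAL_STOCKOUTS)

for name, (p, r, wasted) in (("60-SKU list", short_list), ("600-SKU list", long_list)):
    print(f"{name}: precision={p:.0%}, recall={r:.0%}, wasted effort on {wasted} SKUs")
```

Under these assumptions the short list trades recall (30%) for perfect precision, while the long list reaches full recall at the cost of extra measures on 400 SKUs that did not need them.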
Final Thoughts
Knowing about precision and recall will help you build the best models possible. Think about the problem you want to solve to help you strike the right balance between them.
Details on the Calculations
To truly understand the calculations, we need to understand the following conditions:
- True negative: we predicted that we wouldn’t catch a certain kind of fish, and we didn’t (i.e., our prediction about something not occurring was correct; hence, a true statement about a negative)
- False negative: we predicted that we wouldn’t catch a certain kind of fish, but we actually did (i.e., our prediction about something not occurring was incorrect; hence, a false statement about a negative)
- True positive: we predicted that we would catch a certain kind of fish, and we did (i.e., our prediction about something occurring was correct; hence, a true statement about a positive)
- False positive: we predicted that we would catch a certain kind of fish and we didn’t (i.e., our prediction about something occurring was incorrect; hence, a false statement about a positive)
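To make the four conditions concrete, here is a minimal sketch that counts them for a small, made-up pond. It assumes that "predicting positive" means a fish ended up on our line or in our net, and that the actual positive class is "red"; the data and the function name `confusion_counts` are illustrative only.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, FP, FN, and TN from paired actual/predicted booleans."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1  # predicted positive, actually positive
        elif p and not a:
            counts["FP"] += 1  # predicted positive, actually negative
        elif not p and a:
            counts["FN"] += 1  # predicted negative, actually positive
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


# Hypothetical pond of eight fish.
is_red = [True, True, True, False, False, False, True, False]  # actual
caught = [True, True, False, True, False, False, False, False]  # predicted

counts = confusion_counts(is_red, caught)
print(dict(counts))
```

Here the two red fish we caught are true positives, the one blue fish we caught is a false positive, the two red fish left in the pond are false negatives, and the three blue fish left in the pond are true negatives.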
Precision is the fraction of predicted positives that are, in fact, positives. For example, let’s say our goal is to catch red fish, and we manage to catch seventeen fish throughout the day. If only six of those caught fish are red, we would have a precision of 6/17, or around 35%.
Precision = true positives / (true positives + false positives)
Recall is the fraction of all existing positives that we predict correctly. For example, say there are only eight total red fish in the pond. If we catch six of them, our recall is 6/8, or 75%.
Recall = true positives / (true positives + false negatives)
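The two formulas can be applied directly to the numbers in the examples above. The sketch below combines the two examples into one scenario (seventeen fish caught, six of them red, eight red fish in the pond), which is an assumption made for illustration; the function names are hypothetical.

```python
def precision(true_positives, false_positives):
    """Fraction of predicted positives that are actually positive."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of actual positives that were predicted positive."""
    return true_positives / (true_positives + false_negatives)


fish_caught = 17       # everything we predicted (caught) as red
red_fish_caught = 6    # true positives
red_fish_in_pond = 8   # all actual positives

false_positives = fish_caught - red_fish_caught       # non-red fish we caught
false_negatives = red_fish_in_pond - red_fish_caught  # red fish we missed

p = precision(red_fish_caught, false_positives)
r = recall(red_fish_caught, false_negatives)
print(f"precision = {p:.0%}, recall = {r:.0%}")
```

Note that true negatives (non-red fish we did not catch) do not appear in either formula.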
A modified version of this article first appeared in SupplyChainDigest.