Precision vs Recall

Shruti Saxena · May 11, 2018


In this blog, I will focus on the model-evaluation challenges I came across while implementing a machine log analytics classification algorithm. Specifically, I will demonstrate the meaning of the model evaluation metrics precision and recall through real-life examples, and explain the trade-offs involved. Though my learnings come from the log analytics project, I will use generic examples to explain the concepts. For the curious ones scratching their heads right now, there is a great reference paper that explains what log analytics is all about; for more details, please check out the references cited at the end of this blog.

Before diving into the concept of precision and recall, let me recap for you what Type I and Type II errors signify.

Type I and Type II Errors

One fine morning, Jack got a phone call. It was a stranger on the line. Jack, still sipping his freshly brewed morning coffee, was barely in a position to understand what was coming for him. The stranger said, “Congratulations Jack! You have won a lottery of $10 Million! I just need you to provide me your bank account details, and the money will be deposited in your bank account right away…”

What are the odds of that happening? What should Jack do? What would you have done?

Tricky, right? Let me try to explain the complexity here. Assuming Jack is a normal guy, he would think of this as a prank, or maybe a scam to fetch his bank details, and hence would decline to provide any information. However, this decision is based on his assumption that the call was a hoax. If he is right, he will save the money already in his bank account. But if he is wrong, this decision would cost him ten million dollars!

Let’s talk in statistical terms for a bit. The null hypothesis in this case is that the call is a hoax. If Jack had believed the stranger and provided his bank details, and the call was in fact a hoax, he would have committed a Type I error, also known as a false positive. On the other hand, had he ignored the stranger’s request, but later found out that he actually had won the lottery and the call was not a hoax, he would have committed a Type II error, or a false negative.
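To make that mapping concrete, here is a minimal sketch using a scikit-learn confusion matrix. The labels are entirely hypothetical: the hoax (null hypothesis) is encoded as class 0, and a genuine lottery win as class 1.

```python
# Hypothetical labels: 0 = hoax (null hypothesis), 1 = genuine lottery win.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 0, 1, 0, 0]  # what each call actually was
y_pred = [0, 1, 0, 0, 0, 1, 0, 0]  # what Jack decided each call was

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"Type I errors (false positives): {fp}")   # believed a hoax
print(f"Type II errors (false negatives): {fn}")  # ignored a genuine win
```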

Now that we are clear with the concept of Type I and Type II errors, let us dive into the concept of precision and recall.

Precision and Recall

Often, we think that precision and recall both indicate accuracy of the model. While that is somewhat true, there is a deeper, distinct meaning of each of these terms. Precision means the percentage of your results which are relevant. On the other hand, recall refers to the percentage of total relevant results correctly classified by your algorithm. Undoubtedly, this is a hard concept to grasp in the first go. So, let me try to explain it with Jack’s example.
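In terms of the usual confusion-matrix counts, these definitions are commonly written as precision = TP / (TP + FP) and recall = TP / (TP + FN). Here is a rough sketch with purely illustrative numbers:

```python
# Illustrative counts only:
#   precision = TP / (TP + FP) -> of everything returned, how much was relevant
#   recall    = TP / (TP + FN) -> of everything relevant, how much was returned
tp, fp, fn = 8, 2, 4  # hypothetical true positives, false positives, false negatives

precision = tp / (tp + fp)  # 0.80 -> 80% of the returned results were relevant
recall = tp / (tp + fn)     # ~0.67 -> two thirds of the relevant items were found

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```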

…Feeling a bit panicky, Jack called up his bank to ensure his existing accounts were safe and all his credits were secure. After listening to Jack’s story, the bank executive informed Jack that all his accounts were safe. However, in order to rule out any future risk, the executive asked Jack to recall all instances in the last six months wherein he might have shared his account details with another person for any kind of transaction, or might have accessed his online account from a public system, and so on…

What are the chances that Jack will be able to recall all such instances precisely? If you understood what I asked in the previous sentence with cent per cent confidence, you have probably understood what recall and precision actually mean. But just to double check, here is my analysis: if Jack, let’s say, had ten such instances in reality, and he narrated twenty instances in order to finally spell out the ten correct ones, then his recall would be 100%, but his precision would only be 50%.
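The arithmetic is easy to verify; a quick, purely illustrative check:

```python
# Jack's example: 10 instances actually happened, he narrates 20,
# and the 10 real ones are all contained in his narration.
relevant = 10            # instances that actually happened
narrated = 20            # instances Jack narrated
correctly_recalled = 10  # real instances contained in his narration

recall = correctly_recalled / relevant     # 10/10 = 1.0 -> 100%
precision = correctly_recalled / narrated  # 10/20 = 0.5 -> 50%

print(f"recall = {recall:.0%}, precision = {precision:.0%}")
```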

Barring the time Jack spent on the phone call with the bank executive spelling out extra information, there was actually not much at stake here due to low precision. But imagine if the same thing happens the next time you search for a product on, let’s say, Amazon. The moment you start getting irrelevant results, you would switch to another platform, or maybe even drop the idea of buying. This is why both precision and recall are so important for your model. And as you might have already guessed, one often comes at the cost of the other.

Trade-off

This is pretty intuitive. If you have to recall everything, you will have to keep generating results which are not accurate, hence lowering your precision. To exemplify this, imagine the digital world (again, amazon.com?), wherein there is limited space on each webpage and an extremely limited attention span on the customer’s part. If the customer is shown a lot of irrelevant results and very few relevant ones (in order to achieve a high recall), he or she will not keep browsing every product forever to finally find the one they intend to buy, and will probably switch to Facebook, Twitter, or maybe Airbnb to plan their next vacation. This is a huge loss, and hence the underlying model or algorithm would need a fix to balance recall and precision.

A similar thing happens when a model tries to maximize precision: it returns only the results it is most confident about, and in doing so misses many relevant results, which lowers recall.
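In practice, this trade-off is often explored by sweeping the decision threshold of a classifier. The sketch below uses a synthetic dataset and scikit-learn purely for illustration; the data, model, and numbers are all assumptions, not anything from the log analytics project.

```python
# Sweeping the decision threshold moves precision and recall in opposite
# directions; dataset and model here are purely illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

precision, recall, thresholds = precision_recall_curve(y_test, scores)
for p, r, t in list(zip(precision, recall, thresholds))[::10]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```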

Does a simpler metric exist?

In most problems, you could give a higher priority to maximizing either precision or recall, depending upon the problem you are trying to solve. But in general, there is a simpler metric which takes both precision and recall into account, and you can therefore aim to maximize this single number to make your model better. This metric is known as the F1-score, which is simply the harmonic mean of precision and recall.
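Written out, F1 = 2 · precision · recall / (precision + recall). A minimal sketch, with purely illustrative labels, showing that the hand-computed harmonic mean matches scikit-learn’s f1_score:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # illustrative ground truth
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # illustrative predictions

p = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.60
r = recall_score(y_true, y_pred)     # 3 / (3 + 1) = 0.75
f1 = 2 * p * r / (p + r)             # harmonic mean ≈ 0.667

assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.3f}")
```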

To me, this metric seems much easier and more convenient to work with, as you only have to maximize one score rather than balancing two separate scores. In fact, there are other ways to combine precision and recall into one score, such as the geometric mean of the two, and it might be worth exploring the different variants and their respective trade-offs.
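As a rough, purely illustrative comparison of the two means (the numbers below are hypothetical): the harmonic mean (F1) penalises an imbalanced precision-recall pair more heavily than the geometric mean does.

```python
import math

precision, recall = 0.9, 0.3  # hypothetical, deliberately imbalanced pair

harmonic = 2 * precision * recall / (precision + recall)  # F1 = 0.45
geometric = math.sqrt(precision * recall)                 # ≈ 0.52

print(f"harmonic (F1) = {harmonic:.2f}, geometric = {geometric:.2f}")
```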

So, what are the key takeaways?

Precision and recall are two extremely important model evaluation metrics. While precision refers to the percentage of your results which are relevant, recall refers to the percentage of total relevant results correctly classified by your algorithm. Unfortunately, it is usually not possible to maximize both of these metrics at the same time, as one comes at the cost of the other. For simplicity, there is another metric available, called the F1-score, which is the harmonic mean of precision and recall. For problems where both precision and recall are important, one can select the model which maximizes this F1-score. For other problems, a trade-off is needed, and a decision has to be made whether to maximize precision or recall.

I hope that this blog was engaging and insightful. I look forward to your feedback in the comments section. And don’t forget to read the reference articles, they are truly a wealth of knowledge. Happy reading!

References

Precision-Recall (scikit-learn)
The Relationship Between Precision-Recall and ROC Curves
A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation
