Precision vs Recall: A Trade-Off

Debmalya Mondal
8 min read · Feb 11, 2023


Photo by Alex Gorin on Unsplash

Introduction

There are certain concepts which, for some unforeseen reason, I can never quite remember. You could be a seasoned Machine Learning expert, a beginner, or somebody moving towards the intermediate stage like me. There have been instances where I was working on a project with a few really talented individuals, and while measuring the performance of a model, someone asked me about precision and recall and bang! All of a sudden, I was in a numbed state. Hence, this article is for all the people like me who struggle to remember this very concept.

In this article, I will explain the different metrics and evaluation measures used to assess a classification model in Machine Learning. I will specifically emphasize precision and recall, while highlighting the challenges of a classification problem and the factors that determine choosing one metric over the other. I will try to describe the concepts with simple yet realistic scenarios, along with visual snaps for better understanding.

Connecting The Dots

Before moving into the actual topic, I would like to explain a little about the concepts of hypothesis testing and type I & type II errors. The main reason for doing this is to link the ideas of inferential statistics to machine learning. Now, imagine a hypothetical situation.

You have registered for the Kaggle Titanic competition and submitted your predictions. But for some reason you are unable to view the results, so you ask your friend to check the leaderboard on your behalf and let you know if you have made it into the top 10.

Type I, Type II error

You may think that your friend is a noble guy who will convey the truth to you, no matter what. But, without you knowing it, your friend may see a chance to have some fun at your expense and may not be willing to miss the opportunity, or vice versa. So you see, the situation is not as simple as it seems. Basically, we are caught in a dilemma between reality and possibility.

(fig-1) Decision table: image by author

Let's talk from a statistical standpoint and gauge the situation again. I will consider my friend a noble individual, so for me the null hypothesis is "friend is noble" (fig-1). Hence, here are the takes:
Type I error: the error made when the null hypothesis is rejected even though it is actually true.
Type II error: the error made when the null hypothesis is accepted even though it is actually false.
I think we now have a good grasp of the fundamental idea behind type I and type II errors, so it is a good time to make the transition to the next point of our discussion.

Confusion Matrix

As you may have realized, we need a decision table to really understand type I and type II errors. This decision table is called a confusion matrix. If we refer to (fig-1), the null hypothesis and the corresponding explanation again, we have essentially discovered a few measures, which are stated as follows in Statistics or Machine Learning:

Actual is True, Predicted is True: True Positives
Actual is True, Predicted is False: False Negatives
Actual is False, Predicted is True: False Positives
Actual is False, Predicted is False: True Negatives

Let me now place the above measures into the confusion matrix itself, and hopefully the thread connecting them to the inferential statistical concepts I mentioned earlier will become much clearer.

(fig-2) Confusion matrix: image by author
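As a quick hands-on check, here is a minimal sketch (assuming scikit-learn is available; the label vectors are made up purely for illustration) showing how the four cells of the confusion matrix line up with these measures:

```python
# Minimal sketch: map the four outcomes onto scikit-learn's confusion matrix.
from sklearn.metrics import confusion_matrix

y_actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = positive class, 0 = negative class
y_predicted = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")   # TP=3, FP=1, FN=1, TN=3
```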

The confusion matrix provides the three most important metrics for a classification problem, namely accuracy, precision and recall. Let me give their definitions as well as the corresponding calculations.

There are instances where people consider both precision and recall as mere measures of accuracy. While that cannot be dismissed entirely, I have to admit that the concept of precision and recall is a bit tricky, and I have heard the same from some seasoned Machine Learning professionals as well. So, I would like to take another example and try to explain it in the simplest way possible.

You are preparing for a software engineering role at a giant tech company where one of your old colleagues, Rick, is already working. While having a conversation with Rick, you ask him how many Data Structures & Algorithms (DSA) questions he faced during his interview. Rick takes a little time and states that 40 questions were from DSA. Later you discover that actually 50 of the 100 technical questions in the interview process were from DSA.

Here are the outcomes:
(i) Rick predicted 40 DSA and 60 non-DSA questions, only to realize later that he had misclassified 10 of them: 90 out of 100 questions were correctly classified, i.e., Accuracy is 90%
(ii) All 40 questions predicted as DSA were indeed DSA questions; not a single irrelevant instance was retrieved, i.e., Precision is 100%
(iii) 40 questions were correctly identified out of the 50 actual DSA questions, i.e., Recall is 80%

(fig-3) Confusion matrix for the above example: image by author
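To make the numbers concrete, here is a quick sanity check in plain Python using the counts behind this example (TP = 40, FP = 0, FN = 10, TN = 50); the formulas are spelled out in the comments:

```python
# Counts read off the DSA example: TP=40, FP=0, FN=10, TN=50
tp, fp, fn, tn = 40, 0, 10, 50

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 90 / 100 = 0.90
precision = tp / (tp + fp)                    # 40 / 40  = 1.00
recall    = tp / (tp + fn)                    # 40 / 50  = 0.80

print(accuracy, precision, recall)            # 0.9 1.0 0.8
```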

From now on, I will use these abbreviations:
True Positives = TP; False Positives = FP
False Negatives = FN ; True Negatives = TN

It is probably about time that I shift to more technical terminology and demonstrate with an industry-specific case study. Let me try a popular example here.

You have developed a Machine Learning model for a bank to detect fraudulent online transactions. Obviously, before deploying it into production you want to monitor its prediction stats. 1 denotes a fraudulent transaction whereas 0 denotes a valid one.

(fig-4) Confusion matrix and actual vs predicted hypothetical sample: image by author

Precision

When we speak about precision, essentially what we ask is: how precisely is our model working? I know it sounds circular and probably confusing as well, so let me ask the question in terms of the fraud detection example above: how many of the detected instances are actually relevant or (in our case) fraudulent?

Precision is defined as the number of relevant instances out of the total instances the model has retrieved.

Let's fit the definition into our illustration and see what we get. The model has predicted 5 fraudulent transactions, of which only 2 are actually fraudulent (TP) and 3 are not (FP). That means the precision here is 2/5, i.e., 40%, and the calculation is as follows:

(fig-5) Precision calculation: image by author

Recall

In our fraud detection example, if we are to evaluate recall, we try to answer another very pertinent question: how many of the relevant items were retrieved by the model?

Recall is defined as the number of relevant instances that the model retrieved out of the total relevant instances.

Following the same path, let us again fit the definition into our example. The model has correctly detected 2 transactions as fraud (TP) but has missed 1 (FN). The formula for recall is as follows:

(fig-6) Recall calculation: image by author
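For a hands-on check, here is a minimal sketch with scikit-learn. The label vectors are hypothetical and chosen only to reproduce the counts above (TP = 2, FP = 3, FN = 1, with the remaining transactions as true negatives):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # 1 = fraudulent, 0 = valid
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]   # the model flags 5 transactions as fraud

print(precision_score(y_true, y_pred))    # 2 / (2 + 3) = 0.40
print(recall_score(y_true, y_pred))       # 2 / (2 + 1) ≈ 0.67
```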

The Trade-Off

In a perfect world I would like to have everything perfect, i.e., both precision and recall at 100%. In reality, however, things are a little different. We have covered enough ground to understand that one cannot have it both ways and maximize precision and recall simultaneously. Rather, it is a trade-off; when one increases, the other decreases. Let me explain this with one more example.

Titli has a pastry shop, and she decides to provide a free cupcake to customers who are more than 30% likely to return. After a month she recalibrates based on her sales and decides that, from now on, the offer will be valid only for customers who are more than 50% likely to return to the shop.

(fig-7) Explaining trade-off: image by author

In this example we see that as the threshold increases, the precision increases and the recall decreases. By the same logic, each time you lower the threshold, the recall increases and the precision decreases. This is called the precision-recall trade-off.
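To see the trade-off in action, here is a small sketch with made-up return probabilities for Titli's customers; moving the threshold from 0.3 to 0.5 raises precision and lowers recall:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                           # 1 = customer actually returned
scores = [0.35, 0.40, 0.55, 0.80, 0.30, 0.65, 0.45, 0.20]   # model's predicted return probability

for threshold in (0.3, 0.5):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.5: precision=0.67, recall=0.50
```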

How to combine both?

Most of the time it depends on the business scenario you are approaching, and accordingly you decide which metric to prioritize. However, when you want to balance precision and recall, a convenient single metric is available: the F1 score.

(fig-8) F1 score: image by author

The F1 score is the harmonic mean of precision and recall. Say we have two situations: (i) precision is 75% and recall is 90%, and (ii) precision is 90% and recall is 75%. In both cases the F1 score will be the same.
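A tiny sketch makes the symmetry obvious; the helper below is simply the harmonic-mean formula written out:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 0.90))   # ≈ 0.818
print(f1(0.90, 0.75))   # ≈ 0.818 (same score when the values are swapped)
```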

In certain business scenarios, you may need to give more weight to either precision or recall, and the plain F1 score, which treats both equally, might not be a good fit. One possible workaround is to use a weighted harmonic mean with a hyper-parameter beta: in the standard F-beta score, a beta greater than 1 gives more weight to recall, whereas a beta smaller than 1 gives more weight to precision. Here is the computation formula:

(fig-9) weighted harmonic mean: image by author
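As a sketch, scikit-learn's fbeta_score implements this weighted version. Reusing the hypothetical fraud labels from earlier (where recall is higher than precision), beta = 2 leans towards recall and beta = 0.5 towards precision:

```python
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # same hypothetical fraud labels as before
y_pred = [1, 1, 0, 1, 1, 1, 0, 0, 0, 0]   # precision = 0.40, recall ≈ 0.67

print(fbeta_score(y_true, y_pred, beta=2))    # recall-leaning, ≈ 0.59
print(fbeta_score(y_true, y_pred, beta=0.5))  # precision-leaning, ≈ 0.43
```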

Prioritizing Precision or Prioritizing Recall

As I explained before, it depends on the problem at hand which one to maximize. Here are a few examples to give you an idea:

  1. Covid positive detection — Health authorities cannot afford to let anyone roam around freely with the virus in their body. In other words, the model cannot afford to falsely declare a person healthy (a false negative) while they are carrying the virus. The target here is to minimize FN, which means we need to maximize Recall.
  2. Spam Filtering — While tagging emails as spam, we need to make sure that no valid emails are tagged as spam, because the user may lose some important communication. Here, our goal is to minimize FP, i.e., maximize Precision.
  3. Fraudulent Transactions — By similar logic, a bank cannot afford to let any fraudulent transaction slip through its security gateway, i.e., it must minimize FN. For the bank it is still alright if a few valid transactions are flagged as fraudulent. Again, higher Recall.

Final Words

Precision and recall are two of the most important metrics for measuring the performance of a classification model in Machine Learning. It is not realistic to achieve maximum values for both: higher precision generally means lower recall and vice versa. Often it is better to combine the two via the F1 score, which is nothing but their harmonic mean. If you decide to give more weight to precision or recall for the problem at hand, a weighted harmonic mean (the F-beta score) is a better measure than the plain F1 score.

Thanks for reading!

If this article was helpful and you liked it, then hit a “clap” and follow Debmalya Mondal for more such content.
