Why Precision-Recall? Why not Accuracy?

Himanshu Singh
9 min read · Oct 13, 2019


If you search for explanations of the concepts of Precision and Recall, you will find numerous articles on the topic, and you can count this one among them. I'll try to make the concepts as simple as possible, and maybe that will make this one different from the others.

When we talk about Machine Learning, the first thing that comes to mind is data. Without it there would be no models, and the type and quality of the data we provide decides the quality of the model: garbage in, garbage out. Therefore, it becomes imperative to judge how well our model is trained before sending it out to production.

As most of us know, supervised learning problems can be divided into two domains: Regression and Classification. For regression, we judge model performance through metrics like RMSE, AIC, BIC, or adjusted R². For classification, we have the accuracy score, the model's precision and recall, the F1 score, or the area under the curve. As the title of this article says, we'll be talking about the metrics of classification models.

Suppose we want a model that tells us whether a person has cancer or not by looking at their physical attributes. Let's say we had information on around 1,000 people and trained a Logistic Regression model on it. Now we have 100 new individuals and their physical traits, and we want to see how the model performs on this new data. To understand this, the first thing we need to build is the Confusion Matrix.

Confusion Matrix

The Confusion Matrix is a matrix of counts that helps us decide whether the model is good or bad. Most of the important metrics of a classification model are derived from this matrix. Given below is the confusion matrix for the 100 new individuals for whom our model made predictions.

Confusion Matrix

In the above diagram, the left-hand side shows the actual information, while the top shows the model's predictions. Since we made predictions for 100 new individuals, the matrix compares each prediction with the person's actual health status. Let us understand what each cell of the matrix means.

At a glance, the green cells are the good outcomes and the red cells the bad ones: green cells represent correct predictions, red cells wrong predictions. Now let's dive deeper. The first green cell, containing the number 40, tells us that 40 individuals actually had cancer and our model predicted them correctly. Similarly, the second green cell states that 35 individuals didn't have cancer and, again, our model classified them correctly. The red cells are the ones giving our model trouble.

The first red cell states that 15 individuals had cancer but our model said they didn't. Similarly, 10 individuals didn't have cancer but the model classified them as having it. This can be a real issue with our model: if it persists in production, a person without cancer could be predicted as having it and sent for chemotherapy, while a person with cancer could be predicted as healthy and given no medication.

Before jumping into the different metrics for judging model performance, we must know what each cell is called. Suppose we represent having cancer with 1 and not having cancer with 0. Whenever the prediction is 1 we call it Positive, and whenever it is 0 we call it Negative. Next, whenever the prediction is correct we call it True, otherwise False. In the first green cell the prediction was correct and the predicted value was 1 (cancer), so that cell is termed the True Positives. Similarly, in the second green cell the predictions were correct but the predicted value was 0, so it is termed the True Negatives.

Coming to the first red cell, the predictions were wrong and the model predicted 0 (no cancer), so these are termed the False Negatives; similarly, the last cell holds the False Positives. Given below is the revised diagram where each cell is named.

Named Confusion Matrix
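
To make this concrete, here is a minimal sketch in Python (assuming scikit-learn is installed) that rebuilds the matrix above. The per-person label vectors are hypothetical, constructed only to reproduce the counts 40, 35, 15 and 10 from the diagram.

```python
from sklearn.metrics import confusion_matrix

# 1 = has cancer, 0 = no cancer (hypothetical labels matching the article's counts)
y_true = [1] * 40 + [0] * 35 + [1] * 15 + [0] * 10   # actual status of the 100 people
y_pred = [1] * 40 + [0] * 35 + [0] * 15 + [1] * 10   # the model's predictions

# In scikit-learn, rows are actual classes and columns are predicted classes,
# in the label order we pass (here 0 first, then 1).
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)                      # [[35 10]
                               #  [15 40]]

tn, fp, fn, tp = cm.ravel()
print(tp, tn, fp, fn)          # 40 35 10 15
```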

Now that we are clear about the confusion matrix, let's proceed to the different metrics for judging our model, starting with the Accuracy Score.

Accuracy Score

This metric tells us how many of our model's predictions were correct in total. In the confusion matrix, the green cells are the correct predictions, so 75 out of 100 individuals were classified correctly. This makes the accuracy 75/100, i.e. 75%. Therefore, the formula for the accuracy score is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
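
As a quick sketch (again assuming scikit-learn, and the same hypothetical label vectors as above), the formula and the library function give the same value:

```python
from sklearn.metrics import accuracy_score

# Counts from the confusion matrix above
tp, tn, fn, fp = 40, 35, 15, 10
print((tp + tn) / (tp + tn + fp + fn))       # 0.75 -> 75%

# The same value from per-person labels (1 = cancer, 0 = no cancer)
y_true = [1] * 40 + [0] * 35 + [1] * 15 + [0] * 10
y_pred = [1] * 40 + [0] * 35 + [0] * 15 + [1] * 10
print(accuracy_score(y_true, y_pred))        # 0.75
```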

You might think: if the above formula gives us the model's accuracy, why move on and learn about precision and recall at all? The reason is that the accuracy score can be misleading if our dataset is imbalanced.

An imbalanced dataset means something like this: suppose we have 1,000 individuals of whom 999 don't have cancer and only 1 does. If we train our model on this data, it becomes biased towards people not having cancer. Since it has seen only one individual with cancer, it tends to predict that any new person doesn't have cancer. Now suppose we apply this trained model to 5 individuals, 3 of whom don't have cancer and 2 of whom do. Since the model is biased, it may predict that none of the 5 have cancer, so the accuracy will be 3/5, i.e. 60%. But suppose all 5 individuals had cancer and the model predicted 4 of them as not having it; this time the accuracy will be only 20% (1/5). As the dataset changes, the accuracy changes as well, so we cannot depend on it alone. Let us see how precision and recall try to address this issue.
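
Before doing so, here is a small sketch of the two 5-person scenarios above (assuming scikit-learn), showing how the accuracy swings with the class mix:

```python
from sklearn.metrics import accuracy_score

# Scenario 1: 3 healthy, 2 with cancer; the biased model predicts "no cancer" for everyone
y_true_1 = [0, 0, 0, 1, 1]
y_pred_1 = [0, 0, 0, 0, 0]
print(accuracy_score(y_true_1, y_pred_1))    # 0.6 -> 60%

# Scenario 2: all 5 have cancer; the model predicts "no cancer" for 4 of them
y_true_2 = [1, 1, 1, 1, 1]
y_pred_2 = [0, 0, 0, 0, 1]
print(accuracy_score(y_true_2, y_pred_2))    # 0.2 -> 20%
```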

Precision and Recall

Precision is the fraction of the model's positive predictions that are correct. Recall is the fraction of the actual positive individuals that the model classifies correctly. Confused? Is this the same definition you find in other blogs as well? Don't worry, let's try to understand these two concepts one step at a time, starting with precision.

Precision

Let's consider only 5 individuals. Suppose 3 of them had cancer, while the remaining two didn't.

Suppose our model gave the following predictions for them:

As you remember, we represented Cancer with 1 (Positive) and Not Cancer with 0 (Negative). The total number of positive predictions made by our model is 3 (the three reds), and only 2 of these three are correct (the first two reds). Therefore, the precision of our model is 2/3, i.e. 66.67%. If you go back to the previous section and look at the definition of precision, it should make more sense now. If our model's precision is very low, most of its positive predictions are wrong; in this problem, that means many healthy patients would be wrongly diagnosed with cancer. Given below is the formula for precision, based on the confusion matrix:

Precision = TP / (TP + FP)
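
Here is a minimal sketch of this 5-person example (assuming scikit-learn). The exact ordering of the labels is hypothetical; only the counts from the text matter: 3 people actually have cancer, the model makes 3 positive predictions, and 2 of those are correct.

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 0, 0]   # hypothetical actual labels: 3 with cancer, 2 without
y_pred = [1, 1, 0, 0, 1]   # hypothetical predictions: 3 positives, 2 of them right

print(precision_score(y_true, y_pred))   # 0.666... -> 66.67%
```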

Recall

Taking the same example, you can see that in the actual data there were three individuals with cancer. When we made predictions for them, were they recognised correctly? Out of the 3 cancer patients, our model correctly identified 2 as having cancer, while 1 was not recognised correctly. This means the recall of our model is 2/3, which is again 66.67%. If recall is very low, most of the patients who actually have cancer will be predicted as not having it. The formula for recall is:

Recall = TP / (TP + FN)
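
The same hypothetical 5-person labels, this time for recall (assuming scikit-learn): of the 3 people who actually have cancer, the model finds 2.

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0]   # 3 actual cancer cases
y_pred = [1, 1, 0, 0, 1]   # the model catches 2 of them

print(recall_score(y_true, y_pred))      # 0.666... -> 66.67%
```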

Let us understand this with one more example. Suppose we have built a model which predicts whether it is going to rain or be sunny. We looked at 10 days where the actual weather was 6 days of rain and 4 days of sun, as shown in the image below.

For the same 10 days, our model made the following predictions:

You can see that 7 days were predicted as sunny and 3 as rainy. Let us look at the accuracy first: there were 7 correct predictions out of 10, so the accuracy is 70%. For precision, taking sunny weather as the positive event, 4 out of the 7 positive predictions were right, so the precision is 4/7, around 57%. For recall, there were 4 positive events in the original data (4 sunny days) and all 4 of them were classified correctly, so the recall is 4/4 = 100%. You can see that even though our model gives 70% accuracy, its precision is quite low, only 57%, so we cannot proceed with the model.
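
A sketch of this 10-day weather example (assuming scikit-learn), with sunny = 1 as the positive class and rain = 0. The day-by-day ordering is hypothetical; only the counts from the text matter.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # actual weather: 4 sunny days, 6 rainy days
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]   # predictions: 7 sunny, 3 rainy

print(accuracy_score(y_true, y_pred))     # 0.7   -> 70%
print(precision_score(y_true, y_pred))    # 0.571 -> ~57%
print(recall_score(y_true, y_pred))       # 1.0   -> 100%
```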

The next question is which score to consider: precision or recall, given that the recall of the above model is 100%? The answer is that we should consider both of them. Both values should be on the higher side, as we cannot compromise on either metric. Therefore, even though the recall above is 100%, the precision is only 57%, and hence we should reject the model.

F1 Score

Should we always look at both values? Can't there be a single value with which to judge the model? Yes: the F1 Score. By looking at this score alone we can decide on the model, without inspecting precision and recall separately. The F1 Score is derived from precision and recall, and hence takes both metrics into consideration. It is the harmonic mean of precision and recall, as given below:

F1 = (2 × Precision × Recall) / (Precision + Recall)

The higher the F1 Score, the better the model. If we want to know what is driving the F1 Score, we can always look at the precision and recall and judge which metric plays the major role in affecting it. Taking the weather example we just discussed, the F1 Score is (2 × 0.571 × 1) / (0.571 + 1) ≈ 0.73, i.e. about 73%.
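
A final sketch (assuming scikit-learn) computing the F1 Score for the weather example, both by the harmonic-mean formula and with f1_score on the hypothetical day-by-day labels used earlier:

```python
from sklearn.metrics import f1_score

# F1 via the harmonic-mean formula, with precision = 4/7 and recall = 1
precision, recall = 4 / 7, 1.0
print(2 * precision * recall / (precision + recall))   # 0.727... -> about 73%

# The same value from the labels (1 = sunny, 0 = rain)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(f1_score(y_true, y_pred))                        # 0.727...
```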

I hope this article successfully explained the concepts of precision and recall. If it helped you understand them, please do applaud :)

In the next article I will talk about the classification threshold and the role it plays in deciding precision and recall. We'll also look at the ROC curve and the area under it, and see how to apply all of these concepts in Python. Till then, happy reading!


Himanshu Singh

ML Consultant, Researcher, Founder, Author, Trainer, Speaker, Story-teller. Connect with me on LinkedIn: https://www.linkedin.com/in/himanshu-singh-2264a350/