What nobody tells you about binary classification metrics

Arnaldo Gualberto · Published in Analytics Vidhya · 7 min read · Aug 22, 2020

You most likely know (or have at least heard about) metrics for evaluating classifiers such as accuracy, precision, recall, and f1-score. But did you know there are certain types of problems they are not suited for? Also, have you already learned about MCC, NPV, or specificity? Well, I didn't know these metrics for a long time either, and I have mis-evaluated some classifiers in the past because of it. So, in this article, we'll see:

  • a brief and clear explanation of the most popular metrics — so you never have doubts about them again;
  • when to use and, mainly, when NOT to use each metric;
  • the most recommended metric for evaluating classifiers.

Ready? Come on…

Accuracy

Accuracy is by far the most popular metric. Basically, it represents how many predictions were correct across all classes in a given dataset. Although accuracy is widely adopted, it is not well suited when your dataset is unbalanced.

To give an example, let's suppose we have a dataset in which the labels are man/woman. Also, let's consider that 80% of the samples are women and only 20% are men. If your model only predicts women, its accuracy will easily be 80%. The same is valid for multiclass classification. For instance, imagine an animal dataset with 40% dogs, 50% cats, and 10% birds. Again, if your model learns to distinguish only dogs and cats, the accuracy will be 90%.

Therefore, accuracy can only be considered a valid metric when your dataset is balanced. In other words, when the class distributions (whether there are 2 classes or more) are roughly uniform. Personally, I don't like accuracy because this assumption rarely holds in real-world data.
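To make this concrete, here is a minimal sketch with scikit-learn (the data is made up for illustration) showing how a majority-class predictor reaches high accuracy on an unbalanced dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical unbalanced dataset: 80% women (label 1), 20% men (label 0)
y_true = np.array([1] * 80 + [0] * 20)

# A "model" that always predicts the majority class (women)
y_pred = np.ones_like(y_true)

# Accuracy looks great even though the model never identifies a single man
print(accuracy_score(y_true, y_pred))  # 0.8
```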

Precision, Recall, and F1-score (or F-measure)

  • Recall: how accurate is your model in relation to the positive class? I.e.: “Given the positive samples, how many predictions were correct?”
  • Precision: how much do you trust your model when it predicts the positive class? I.e.: “Given the positive predictions, how many were correct?” (See the sketch right after these definitions.)
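In confusion-matrix terms, recall is TP / (TP + FN) and precision is TP / (TP + FP). A minimal sketch with scikit-learn (the labels are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score

# Toy binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Recall = TP / (TP + FN): 2 of the 4 positive samples were found
print(recall_score(y_true, y_pred))     # 0.5

# Precision = TP / (TP + FP): 2 of the 3 positive predictions were correct
print(precision_score(y_true, y_pred))  # 0.666...
```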

Beyond classification problems, precision and recall are very useful for detection problems. For instance, let's suppose we have a model to detect faces. Adapting the previous definitions, precision and recall would read as follows:

  • Recall: "given the labeled faces, how many did the model detect?"
  • Precision: "given the predicted faces, how many are real faces?"

To give a practical example, imagine we have an image with 10 faces. Now, let's suppose one model returned 5 detections, where 3 were really faces (TP) and 2 were not. In this case, we have a recall of 30% (3 of the 10 faces were detected) and a precision of 60% (3 of the 5 detections were correct).

So, what's the purpose of the f-measure (or f1-score)? Let's say that in this same image, another model returns 2 detections, and they are real faces. In this case, the recall is 20% (2/10) and the precision is 100% (all detections are real faces). Which model is better: the one with a recall of 30% and a precision of 60%, or the other with a recall of 20% and a precision of 100%?

To help with this decision, we use the f-measure. It is the harmonic mean of precision and recall. Why harmonic? Because, in contrast with the arithmetic mean, the f1-score is only high when both precision and recall are high.
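Here is a tiny sketch in plain Python computing the harmonic mean, F1 = 2 · Precision · Recall / (Precision + Recall), for the two hypothetical detectors above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Model A: recall 30%, precision 60%
print(f1_score(0.6, 0.3))  # 0.4

# Model B: recall 20%, precision 100%
print(f1_score(1.0, 0.2))  # 0.333...
```

By this criterion, the first model edges out the second.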

These metrics are appropriate when your positive class is more important than the negative class. If you take a closer look at the formulas, precision and recall do not consider the True Negatives (TN). They are only concerned with the positive samples. And this is not always good.

In detection problems, like in the example, this makes total sense. Think about it: how would you define True Negatives (TN) in a detection problem? Beyond being hard to define, a True Negative does not make much sense here. An algorithm that is good at not detecting whatever is not the object of interest? Confusing, to say the least…

On the other hand, in some classification problems, the negative class is much more relevant than the positive class. For such cases, we have the NPV and Specificity.

Negative Predictive Value (NPV), Specificity

The metrics Specificity and NPV are equivalent to Recall and Precision, but for the negative class:

  • Specificity: how accurate is your model in relation to the negative class? I.e.: “Given the negative samples, how many predictions were correct?”
  • NPV: how much do you trust your model when it predicts the negative class? I.e.: “Given the negative predictions, how many were correct?”

As I said before, these metrics are helpful when the negative class matters more in practice than the positive class. For example, let's say we have a classification problem where we want to know if a patient is healthy (positive) or sick (negative), or whether a credit card transaction is genuine (positive) or fraudulent (negative). In these cases, you should also compute the NPV and Specificity. We can also adapt the F-measure to these metrics quite easily, as we'll see next.
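In confusion-matrix terms, Specificity is TN / (TN + FP) and NPV is TN / (TN + FN). A minimal sketch with scikit-learn (made-up labels, with genuine = 1 and fraudulent = 0):

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = genuine transaction (positive), 0 = fraudulent (negative)
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)  # of the real frauds, how many were caught?
npv = tn / (tn + fn)          # of the "fraud" predictions, how many were right?

print(specificity)  # 0.75
print(npv)          # 0.6
```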

F-Beta

F-beta is a generalization of the f1-score: F-beta = (1 + Beta²) · Precision · Recall / (Beta² · Precision + Recall). The Beta term weights how much more important recall is than precision. As we can see, when Beta = 1, we get back the original f1-score formula.

Another typical value is Beta = 2, which gives the f2-score. In this case, we're giving recall a higher weight than precision, i.e., a good recall matters more to us than a good precision. This is useful when we want to make sure the classifier is better at identifying the positive class, even though it might generate more false positives. In the previous detection example, our detector would try to detect as many real faces as possible, probably returning more false positives (detections that are not real faces) along the way.

Notice that we can easily modify f-beta to use NPV and Specificity in place of Precision and Recall, respectively.
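A minimal sketch of this in plain Python (the numbers reuse the earlier examples; the negative-class variant at the end simply feeds NPV and Specificity into the same formula):

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.3  # the first face detector from the earlier example

print(f_beta(p, r, beta=1))  # 0.4  -> identical to the f1-score
print(f_beta(p, r, beta=2))  # ~0.33, pulled down by the low recall

# Negative-class variant: plug in NPV and Specificity instead
npv, specificity = 0.6, 0.75
print(f_beta(npv, specificity, beta=1))  # ~0.667
```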

Matthews Correlation Coefficient (MCC)

I saved MCC for last because it's my favorite metric. I always compute MCC in the classification problems I'm dealing with. Do you wanna know why?

First, because MCC is a metric that considers all possibilities of binary classification (TP, TN, FP, and FN). Secondly, MCC is robust to unbalanced datasets. Finally, the result of MCC is a normalized coefficient between -1 and 1 that is straightforward to interpret. In other words:

  • -1: the closer your MCC is to -1, the worse your classifier, i.e., its predictions systematically disagree with the true labels;
  • +1: on the other hand, the closer to +1, the better your classifier;
  • 0: when the coefficient is close to zero, the classifier is performing no better than chance (for instance, just predicting the most frequent class).

Regardless of whether your problem is unbalanced, and of how important your positive/negative classes are, the MCC is a good and fair metric to evaluate your classifier.
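A minimal sketch with scikit-learn, reusing the made-up unbalanced dataset from the accuracy example, shows how MCC exposes a majority-class predictor that accuracy flatters:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Unbalanced made-up dataset: 80 positives, 20 negatives
y_true = np.array([1] * 80 + [0] * 20)

# Always predict the majority class
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))     # 0.8 -> looks good
print(matthews_corrcoef(y_true, y_pred))  # 0.0 -> tells the real story
```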

To sum up

  • Only use accuracy when the distribution of your classes is uniform. Even then, also compute (or give preference to) other metrics.
  • Compute Precision, Recall, and F-measure when your positive class is more critical for you than the negative class. Use F-measure to decide between 2 or more classifiers.
  • When the negative class is more relevant, use NPV and Specificity.
  • With F-beta, you can balance Precision and Recall (or even NPV and Specificity).
  • Always compute the MCC. It is robust to unbalanced datasets, its coefficient is interpretable, and it gives a better feel for how your classifier is really doing.

That's it. I hope you liked it. If this article was helpful for you too, please give some applause 👏👏. Follow me on Medium for more posts like this. You can also check out my work at:
