Evaluating Probabilistic Classifiers: ROC and PR(G) Curves

Jan Lukány
knowledge-engineering-seminar
12 min read · May 20, 2020

Evaluation of classification performance commonly shrinks to reporting accuracy, precision, recall or F1 scores. However, these metrics do not take into account that many classifiers predict probabilities of the classes instead of the classes themselves. In such cases it might be more valuable to analyze the model’s performance at various decision thresholds, where the threshold marks the predicted probability a sample needs in order to be considered positive. The receiver operating characteristic (ROC) curve is a standard tool for such analysis thanks to its properties, such as a universal baseline and a clear interpretation of its area under the curve. Perhaps inspired by ROC, many researchers have started reporting precision-recall (PR) curves, which seem more suitable for the evaluation of imbalanced data sets. However, PR curves have major pitfalls that prevent them from being used the same way as ROC curves. In this blogpost we provide an introduction to binary classification and to predicting probabilities instead of classes, we describe ROC and PR curves and explain how they should be used and, finally, we explain the concept of precision-recall-gain (PRG) curves introduced in [1], a way to modify PR curves so that they have similar desirable properties to ROC curves.

Introduction

Binary Classification

Binary classification is the task of dividing samples into two groups. The two groups are commonly represented as a negative class and a positive class (for example “does not have coronavirus” and “has coronavirus”). This task can be solved by building a machine learning classification model, a classifier, from training data; the classifier is then capable of predicting classes for new, previously unseen data. Common classification algorithms include decision trees, support vector machines and naive Bayes.

Evaluation Metrics

Once a classifier is built, its performance needs to be evaluated to estimate how it will perform when deployed in production. The evaluation is typically done by calculating various performance metrics from the confusion matrix of the predictions:

Confusion matrix (also called contingency table) of the predictions

The common metrics include accuracy, precision, recall (also called the true positive rate, TPR), F1 score and fall-out (also called the false positive rate, FPR), defined as follows (a small code sketch computing them follows the list):

  • accuracy is the ratio of correctly predicted samples to all samples;
  • precision is the probability that a positive prediction is actually a positive sample;
  • recall is the probability of detecting a positive sample;
  • F1 score is the harmonic mean of precision and recall;
  • fall-out is the ratio of false alarms to all negative samples; unlike the metrics above, we want to minimize this score, i.e. to have as few false alarms as possible.
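As a quick reference, here is a minimal sketch of how these metrics are computed from confusion-matrix counts (the counts below are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts, for illustration only.
tp, fp, tn, fn = 30, 10, 50, 10

accuracy  = (tp + tn) / (tp + fp + tn + fn)   # ratio of correct predictions
precision = tp / (tp + fp)                    # how often a positive prediction is correct
recall    = tp / (tp + fn)                    # true positive rate (TPR)
f1        = 2 * precision * recall / (precision + recall)
fall_out  = fp / (fp + tn)                    # false positive rate (FPR)

print(accuracy, precision, recall, f1, fall_out)
```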

Probabilistic Classifiers

Many classifiers are capable of predicting not only the binary class itself but also the probability of a sample belonging to the positive class. The classification itself is then done by selecting a threshold: all samples with a predicted probability greater than or equal to the threshold are marked as positive and the rest as negative. Such classifiers are commonly called probabilistic classifiers. The figure below illustrates the prediction process of a “standard” classifier and of a probabilistic classifier.

Classification via scores and threshold
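As a minimal sketch (using scikit-learn and made-up toy data, purely for illustration), this is what predicting probabilities and then thresholding them looks like:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: one feature, six samples.
X = np.array([[0.2], [0.4], [0.5], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]       # predicted probability of the positive class

threshold = 0.5                          # the usual default decision threshold
y_pred = (proba >= threshold).astype(int)
print(proba.round(2), y_pred)
```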

Selecting a lower threshold tends to increase recall, as more samples are marked as positive, while selecting a higher threshold tends to increase precision and decrease fall-out (false alarms), as only samples with a high predicted probability are marked as positive.

The choice of the correct threshold depends on the domain needs, i.e. the operating conditions of the model. In some cases there is a need for high recall, e.g. in medical screening tests, where as many sick patients as possible must be detected even though the precision might be low. On the other hand, in certain domains there is a need to minimize the number of false positives, i.e. to increase precision as much as possible, for example in spam detection, where classifying a normal mail as spam (a false positive) might have serious consequences, as an important mail might be lost.

However, in many cases the specific domain needs are not known a priori when building the model. Therefore, one might want to select a model that performs well regardless of a specific threshold, i.e. to somehow aggregate how the model performs across all thresholds, and select the threshold later when the operating conditions of the model are known. This is exactly what receiver operating characteristic (ROC) and precision-recall (PR) curves do.

Evaluation at Different Thresholds

As explained above, in the case of a probabilistic classifier there might be a need to analyze the performance of the classifier at different decision thresholds. The figure below illustrates a set of actual labels, their corresponding predicted probabilities (sorted in ascending order) and a plot of the values of the evaluation metrics described above at different thresholds.

Set of actual labels, predicted probabilities (sorted by the probabilities from lowest to highest) and corresponding values of evaluation metrics when varying a decision threshold.

We see that every time the threshold is set above a negative sample the FPR decreases and precision increases, since an FP is converted into a TN. On the other hand, every time a positive sample is marked as negative the TPR and precision decrease, as a TP is converted into an FN. Moreover, note that precision is undefined when there are no TPs and FPs (as the denominator of precision is then zero).
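A minimal sketch of such a threshold sweep (with made-up labels and scores, not the ones from the figure) might look as follows:

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

for threshold in [0.2, 0.5, 0.8]:
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)                                      # recall
    fpr = fp / (fp + tn)                                      # fall-out
    prec = tp / (tp + fp) if (tp + fp) > 0 else float("nan")  # undefined without positive predictions
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}, precision={prec:.2f}")
```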

The plot above might give valuable insight into the performance of the classifier. However, there is no straightforward way to aggregate the performance into a single value, and the visualized plots depend on the values of the decision thresholds (or the predicted probabilities). Some classifiers may, for example, predict uncalibrated probabilities, i.e. the predicted probability is not an actual probability but rather some kind of a score. Moreover, some classifiers might not predict probabilities/scores in the range [0, 1] but rather as, e.g., an arbitrary positive real number. A great example are anomaly detection algorithms, where the range of the predicted anomaly scores is highly dependent on the specific algorithm. Therefore, ROC and PR curves are commonly used, as they do not use the actual values of the decision thresholds.

ROC Curves

The receiver operating characteristic (ROC) curve is a plot of the TPR against the FPR across multiple decision thresholds:

ROC curves of a perfect classifier, a random classifier (baseline) and the classifier corresponding to the predictions from the figure in the previous section.

Note that the highest decision threshold corresponds to the bottom left corner (nothing is predicted as positive) and the lowest threshold to the upper right corner (everything is predicted as positive). The reason for using the ROC curve instead of plotting TPR and FPR against the decision thresholds themselves (as illustrated in the previous section) lies in the ROC curve’s properties:

  1. Universal baselines: A random classifier (which randomly predicts the positive or the negative class) is represented by the major diagonal, while a perfect classifier is one whose curve passes through the point [0, 1]: no false positives and TPR = 1. One can think of the perfect classifier as a classifier that predicts probabilities such that a single decision threshold perfectly separates the negative samples from the positive ones. Therefore, the closer the curve is to the top left corner, the better the performance of the model.
  2. Linear interpolation: Any two points on an ROC curve can be linearly interpolated, and every point on the resulting line segment can be achieved by randomly choosing between the predictions of the two respective classifiers.
  3. Area: The area under the ROC curve (AUROC) has a clear interpretation: it is the probability of a randomly chosen positive sample being ranked higher (having a higher predicted probability) than a randomly chosen negative sample (illustrated in the code sketch below). Moreover, AUROC can be used to calculate an expected accuracy of the model:
Expected accuracy [4].
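The ranking interpretation of AUROC is easy to verify numerically. The sketch below (made-up labels and scores, scikit-learn assumed to be available) compares the fraction of correctly ranked positive-negative pairs with roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs where the positive is ranked higher
# (ties count as one half).
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
print(np.mean(pairs))                   # ranking probability
print(roc_auc_score(y_true, y_score))   # identical value from scikit-learn
```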

Due to the above mentioned properties the ROC curve (and AUROC) has become a standard for comparison of the performance of classification models. However, as we’ll show in the following section, it might be unsuitable for evaluating imbalanced data sets.

PR Curves

Often the data set at hand is imbalanced, i.e. one of the classes is less frequent than the other. An example is spam detection, where the amount of spam mails is significantly lower than the amount of normal mails (at least it should be). The minority class is commonly the positive one, and the amount of imbalance is thus often expressed as prevalence, the ratio of positive samples to all samples, denoted as π (an imbalanced data set thus has π << 50 %). On imbalanced data sets we are typically interested in the probability of detecting a positive sample (detecting a spam mail as spam), i.e. recall, and the probability that a positive prediction is an actual positive sample (the probability that a mail detected as spam is actually spam), i.e. precision. These metrics are suitable for the evaluation of imbalanced data sets because they are insensitive to the number of TNs. This is desirable, as the number of TNs will typically be high when there are lots of negative samples, so the FPR (the ratio of FPs to all negative samples, i.e. FP / (FP + TN)) will then naturally always be low. The precision-recall (PR) curve is then a plot of recall (x-axis) against precision (y-axis) at various thresholds.

By definition, the PR curve is insensitive to TNs. Let’s demonstrate this with an example. Assume a data set with prevalence 15 % and a trained binary classifier. The figure below shows an ROC curve of its predictions and the corresponding PR curve:

We see that the AUROC is 0.93, which seems like a very good result. However, in the PR curve we see that the performance in terms of precision and recall isn’t that good: we are able to achieve, for example, recall 0.8 at precision 0.6 or recall 0.3 at precision 0.8. The reason the AUROC is high is that there are lots of TNs (the negative class is the majority class), so the FPR is always low. Let’s now decrease the prevalence even further, to 5 %, by adding more negative samples, all with a predicted probability equal to 0 (so that they are always TNs, except when the threshold is set to 0):

We can see that the AUROC is even higher (0.98) but the PR curve remains exactly the same as the one above. This nicely demonstrates the PR curve’s insensitivity to the amount of TNs and its benefit in the evaluation of imbalanced data sets.
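Here is a minimal sketch of this kind of experiment with synthetic data (the numbers below are made up and will not reproduce the exact curves from the figures; scikit-learn’s average_precision_score is used as a step-wise summary of the PR curve):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic imbalanced data set: positives tend to receive higher scores.
n_pos, n_neg = 150, 850                   # roughly 15 % prevalence
y_true  = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_score = np.concatenate([rng.beta(4, 2, n_pos), rng.beta(2, 4, n_neg)])
print(roc_auc_score(y_true, y_score), average_precision_score(y_true, y_score))

# Add extra negatives with predicted probability 0 (guaranteed TNs for any
# threshold > 0): AUROC increases while the PR summary stays the same.
extra = 2000
y_true2  = np.concatenate([y_true, np.zeros(extra)])
y_score2 = np.concatenate([y_score, np.zeros(extra)])
print(roc_auc_score(y_true2, y_score2), average_precision_score(y_true2, y_score2))
```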

The shape of the PR curve might, however, be very unintuitive. It is important to realize that precision and recall at the lowest threshold correspond to the point farthest to the right, at the highest recall value. Let’s now assume we continuously increase the threshold. Whenever a TP is converted into an FN, both precision and recall decrease. On the other hand, when an FP is converted into a TN, precision increases but recall remains the same (as recall uses neither FPs nor TNs). This causes the step-wise shape of the PR curve.

An extreme case of a PR curve’s shape occurs when there are low precision values at low recalls, as shown in the figure below. This can be caused by a negative sample having a high predicted probability (or score): at a high decision threshold the precision can then be very low, as there can be a relatively large ratio of FPs compared to TPs.

The PR curve thus has big pitfalls compared to the ROC curve in terms of its properties: no universal baseline, no linear interpolation and no interpretation of its area under the curve:

  1. No universal baseline: PR curves have no universal baseline, as the performance of a random classifier depends on the prevalence (the ratio between the classes) in the data set. The random classifier (baseline) is represented by a horizontal line at precision = π.
  2. No linear interpolation: The PR curve cannot be linearly interpolated. In [2] the authors explain that the proper interpolation is hyperbolic.
  3. Uninterpretable area: The area under the PR curve (AUPR) has no clear interpretation and, more importantly, when it is calculated by the trapezoidal rule, which performs linear interpolation, it might be an overly optimistic measure of the classifier’s performance (a sketch contrasting two ways of computing the area follows the list).
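To make the last point concrete, the sketch below (made-up scores, scikit-learn functions) computes the area under the PR curve with the trapezoidal rule, i.e. with linear interpolation, and compares it with average precision, a step-wise summary that avoids interpolation. The two generally give different values, which illustrates how sensitive the “area under the PR curve” is to the choice of interpolation:

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Made-up labels and scores; note the negative samples with the highest scores.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.15, 0.2, 0.25, 0.3, 0.45, 0.6, 0.65, 0.7, 0.8, 0.9])

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("trapezoidal AUPR :", auc(recall, precision))               # linear interpolation
print("average precision:", average_precision_score(y_true, y_score))
```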

Therefore, selecting a model based on AUPR might lead to a worse performing model. Precision-recall-gain curves can help to overcome the above mentioned issues of PR curves.

Precision-Recall-Gain Curve

The precision-recall-gain (PRG) curve is a modification of the PR curve which gives it properties similar to those of the ROC curve. The PRG curve was introduced by P. Flach and M. Kull in 2015 [1]. The main idea behind the PRG curve is to establish a baseline, an always-positive classifier, and express precision and recall in terms of gain over this baseline.

The always-positive baseline has precision = π and recall = 1. It can be shown that any model with precision or recall lower than π loses against this baseline in terms of F1 score. Therefore, the idea is to rescale both precision and recall with respect to this baseline. Using harmonic scaling with min = π and max = 1, we arrive at the following definitions:

Definitions of precision-gain and recall-gain [1]
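Written out explicitly (reconstructed here from [1], with π denoting the prevalence of the positive class and prec, rec the ordinary precision and recall):

```latex
\mathrm{precG} = \frac{\mathrm{prec} - \pi}{(1 - \pi)\,\mathrm{prec}},
\qquad
\mathrm{recG} = \frac{\mathrm{rec} - \pi}{(1 - \pi)\,\mathrm{rec}}
```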

where precG stands for precision-gain and recG for recall-gain. The figure below shows a PR curve and an equivalent PRG curve:

Equivalent PRG curve (right) to PR curve (left) [1].

The dotted lines in the figure above are F1 isometrics, i.e. the sets of points that have the same F1 score. We can see that in the PR curve these isometrics are hyperbolic (their exact shape depends on π), while in the PRG curve the isometrics are parallel straight lines.

The authors prove that the PRG curve inherits the key properties of the ROC curve:

  1. Universal baseline is the minor diagonal (from top left to bottom right), which corresponds to the baseline F1 isometric.
  2. Linear interpolation is possible, as any classifier on the line connecting two points can be obtained by randomly choosing between the predictions of the corresponding classifiers (Theorems 1 and 2 in [1]). Thus a convex hull exists.
  3. The area under the curve (AUPRG) can be interpreted as an expected FG1 score (Theorem 3 in [1]), where FG1 is the F1 score rescaled (linearized) in the same way as precG and recG.
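To make the transformation concrete, here is a minimal sketch (an illustrative reimplementation based on the definitions above, not the authors’ reference code) that maps a PR curve into precision-gain/recall-gain space and computes a rough trapezoidal estimate of AUPRG; the paper’s exact construction handles interpolation and negative gains more carefully:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def to_gain(value, pi):
    """Harmonic rescaling against the always-positive baseline (min = pi, max = 1)."""
    return (value - pi) / ((1 - pi) * value)

# Made-up labels and scores, for illustration only.
y_true  = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.15, 0.2, 0.25, 0.3, 0.45, 0.6, 0.65, 0.7, 0.8, 0.9])
pi = y_true.mean()                                  # prevalence

precision, recall, _ = precision_recall_curve(y_true, y_score)
keep = recall > 0                                   # gains are undefined at recall = 0
prec_gain = to_gain(precision[keep], pi)
rec_gain  = to_gain(recall[keep], pi)

# Rough AUPRG estimate: trapezoidal area over the part with non-negative recall-gain.
order = np.argsort(rec_gain)
x, y = rec_gain[order], prec_gain[order]
mask = x >= 0
print(round(float(np.trapz(y[mask], x[mask])), 3))
```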

In experiments performed on 426 data sets from the OpenML platform [3], the authors showed that AUPR, AUPRG and AUROC often disagree on which model they consider the best one for a particular problem. The figure below shows a comparison of how AUPR and AUPRG rank the trained models.

The disagreement between the metrics was as follows:

  1. AUPR vs AUPRG — disagree on the best method in 24% of the tasks and on the top three methods in 58% of the tasks (i.e., they agree on top, second and third method in 42% of the tasks);
  2. AUPR vs AUROC — 29% and 65% disagreement for top 1 and top 3, respectively;
  3. AUPRG vs AUROC — 22% and 57% disagreement for top 1 and top 3, respectively.

PRG curves, however, lack the PR curve’s easy visual interpretability. We can see that in the figure below, which shows a PR curve and two corresponding PRG curves for data sets with prevalence 16 % and 2 %.

The PRG curve thus probably isn’t the right tool for visual inspection of the performance and for selecting the final decision threshold, as it is hard to interpret. However, the area under the PRG curve, AUPRG, can be considered a more appropriate metric for model selection on imbalanced data sets than AUPR.

Conclusion

In this article we’ve provided an introduction to the evaluation of probabilistic classifiers and described when and how to use ROC, PR and PRG curves. ROC curves and AUROC are suitable for evaluation on a balanced data set. In the case of an imbalanced data set, PR and PRG curves might be more appropriate, with AUPRG being a correct metric for model selection, as it corresponds to an expected FG1 score, while the classical PR curve remains suitable for evaluation by visual inspection.

[1] Flach, Peter, and Meelis Kull. “Precision-recall-gain curves: PR analysis done right.” Advances in neural information processing systems. 2015.
[2] Davis, Jesse, and Mark Goadrich. “The relationship between Precision-Recall and ROC curves.” Proceedings of the 23rd international conference on Machine learning. 2006.
[3] Vanschoren, J., J. N. van Rijn, B. Bischl, and L. Torgo. “OpenML: networked science in machine learning.” SIGKDD Explorations 15.2 (2013): 49–60.
[4] Hernández-Orallo, José, Peter Flach, and Cèsar Ferri. “A unified view of performance metrics: translating threshold choice into expected classification loss.” Journal of Machine Learning Research 13.Oct (2012): 2813–2869.
