A bird’s-eye view of the viability of AUC ROC in unbalanced binary classification problems

Dmitry Petukhov · IT’s Tinkoff · Oct 2, 2023

The area under the receiver operating characteristic curve (AUC ROC) [1–3] is often considered the only reliable model quality metric for binary classification problems where model scores are used as a degree of confidence in something. However, recent literature raises concerns about the use of AUC ROC as a model quality metric in unbalanced binary classification problems, where negative examples significantly outnumber positive ones. Such concerns are common in machine learning sources [4–12] as well as in biological, ecological and bioinformatics literature [13–16]. All of them recommend the area under precision-recall curve (AUC PR) as an alternative to AUC ROC.

This article evaluates the viability of AUC ROC as a model quality metric in binary classification problems. It is based on several experiments that explore the dependence of AUC ROC on sample size and class balance, with a focus on its relationship with AUC PR. The article assumes that you have a basic understanding of binary classification theory, but if you don’t, you can quickly acquire it through any of the following references [1–5, 17–18].

Introduction: a general overview of AUC ROC and AUC PR

The receiver operating characteristic (ROC) curve shows the trade-off between the true positive rate and the false positive rate [1, 2]. The true positive rate (TPR), also known as recall or hit rate, is computed from true positives (TP) and false negatives (FN): TPR = TP / (TP + FN). It measures how often the classifier predicts the positive class when the target is positive. The false positive rate (FPR) is computed from false positives (FP) and true negatives (TN): FPR = FP / (FP + TN). It measures how often the classifier predicts the positive class when the target is negative.

The AUC value derived from the ROC curve can be used to compare the performance of different classification models. A classifier performing no better than random guessing has an AUC of 0.5 [3, 4].

Similar to the ROC curve, the precision-recall (PR) curve plots precision against recall for different thresholds. Precision is computed from true positives and false positives (FP): Precision = TP / (TP + FP). It measures how often the classifier's positive predictions are actually correct. Recall is the same metric as TPR.

Unlike the ROC curve, the baseline of the PR curve changes depending on the class balance, determined by the ratio of positive examples in the dataset: y = P / (P + N) [5]. For a balanced dataset this ratio equals 0.5, so a random classifier yields an AUC PR of about 0.5.

The differences between these curves are visually represented using confusion matrices in Figure 1.

Figure 1 — Confusion matrix for the ROC curve, highlighting TPR and FPR (left), and for the PR curve, indicating precision and recall (right) [3–5]

It is worth noting that the confusion matrix calculations for the PR curve do not involve true negatives. The PR curve focuses solely on predictions of the positive class, unlike the ROC curve, which accounts for both classes.

Switching from FPR to precision is considered more beneficial in unbalanced problems. In such problems, the large number of true negatives overshadows any changes in false positives, keeping the FPR low even when the absolute number of false positives is substantial. As a result, the classifier can show a high AUC ROC value [4, 6–9], and two classifiers may have nearly identical AUC ROC values despite producing very different numbers of false positives.
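As a purely illustrative example (numbers chosen for convenience, not taken from the experiments below): suppose a test set contains 1,000,000 negatives and 1,000 positives, and a classifier recovers 900 true positives at the cost of 9,000 false positives. Its FPR is 9,000 / 1,000,000 = 0.9%, which looks excellent on a ROC curve, yet its precision is 900 / (900 + 9,000) ≈ 9%, which exposes the flood of false alarms.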

This article proposes four experiments to illustrate how variations in class balance and sample size impact both AUC ROC and AUC PR. The code to reproduce the plots and simulations is readily available on GitHub [19].

Methods

The experiments use samples based on two normal distributions. One of these distributions relates to the negative class and the other to the positive one. The resulting distribution is referred to as classifier scores. An example of this distribution is shown in Figure 2.

Figure 2 — Synthetic distribution of classifier scores represented by two normal distributions

The code to generate the sample from two normal distributions is as follows:
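A minimal sketch of such a sampling function, assuming NumPy and the parameter names n_objects, class_weight, loc and seed used throughout the article, with class_weight expressed as a fraction (0.4 for 40%); the default values and the scale of the distributions are assumptions, and the exact implementation is in the repository [19]:

```python
import numpy as np

def get_sample(n_objects=100_000, class_weight=0.4, loc=6.0, scale=2.0, seed=42):
    """Draw synthetic classifier scores from two normal distributions.

    The negative class is centred at 0, the positive class at `loc`;
    `class_weight` is the share of positive objects in the sample.
    """
    rng = np.random.default_rng(seed)
    n_positive = int(n_objects * class_weight)
    n_negative = n_objects - n_positive

    # Scores of the negative and positive classes.
    negative_scores = rng.normal(loc=0.0, scale=scale, size=n_negative)
    positive_scores = rng.normal(loc=loc, scale=scale, size=n_positive)

    y_score = np.concatenate([negative_scores, positive_scores])
    y_true = np.concatenate([np.zeros(n_negative), np.ones(n_positive)])
    return y_true, y_score
```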

The reproducibility of the sample depends on a seed value. Using different seed values in get_sample() results in slightly different samples and, consequently, different AUC values. This variability is what makes it possible to obtain stable metric estimates by averaging over several seeds.

This study uses twenty seed values generated by random.randint() [20] as follows:
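A minimal sketch of the seed generation; the seeding of the random module and the value range below are assumptions, and the original values live in the repository [19]:

```python
import random

random.seed(2023)  # assumed; the original seeding is not shown here
SEEDS = [random.randint(0, 10_000) for _ in range(20)]  # twenty seed values
```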

The first three experiments vary three parameters of get_sample(). The first experiment investigates the influence of sample size on the AUCs by varying the n_objects parameter in nearly balanced and unbalanced cases. The second experiment examines the relationship between the AUCs and class balance by varying the class_weight parameter across different sample sizes. The third experiment explores the influence of classification task complexity on the AUCs under different conditions by varying the loc parameter, which sets the center of the right normal distribution. The effect of these variations on the ROC and PR curves, as well as on the AUCs, is shown and discussed.

Lastly, the fourth experiment examines how model performance is affected when the AUCs are used as evaluation metrics for early stopping in well-known gradient-boosted decision tree libraries, namely CatBoost and XGBoost.

Experiment #1 — how does AUC depend on sample size?

This experiment investigates the relationship between AUCs and sample size in two cases: one where classes are nearly balanced with the positive class weight at 40%, and another where classes are unbalanced with the positive class weight set to 0.05%.

Let’s look at the changes in the AUCs when class_weight is set to 40%, shown in Figure 3.

Figure 3 — Dependence of AUC ROC and AUC PR on the number of objects in the sample when class_weight is set to 40%

Each number of objects corresponds to a specific distribution of AUCs. The circular marker indicates the mean value of the distribution, while the gray area indicates the range between the minimum and maximum values. The distribution of AUCs for each n_objects is obtained by varying the seed parameter as follows:
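A sketch of how such a distribution could be collected, building on the get_sample() and SEEDS sketches above; average_precision_score is used here as a common stand-in for AUC PR and may differ from the calculation in the repository [19]:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def auc_distribution(n_objects, class_weight, loc, seeds=SEEDS):
    """Compute AUC ROC and AUC PR for every seed at fixed sample settings."""
    roc_aucs, pr_aucs = [], []
    for seed in seeds:
        y_true, y_score = get_sample(n_objects=n_objects,
                                     class_weight=class_weight,
                                     loc=loc, seed=seed)
        roc_aucs.append(roc_auc_score(y_true, y_score))
        # average_precision_score serves as a proxy for the area under the PR curve.
        pr_aucs.append(average_precision_score(y_true, y_score))
    return roc_aucs, pr_aucs
```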

Figure 3 shows only a slight fluctuation in AUCs when the sample size is relatively small. To assess the impact of this subtle fluctuation on the ROC or PR curves, we can make the following comparison:
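A sketch of such a comparison under the same assumptions: it picks the seeds giving the minimum and maximum AUC ROC, rebuilds the curves at those seeds, and reports one way of expressing the percentage range of the AUC values (helper names are illustrative, not the repository's [19]):

```python
import numpy as np
from sklearn.metrics import (auc, precision_recall_curve, roc_auc_score,
                             roc_curve)

def curves_at_extreme_seeds(n_objects, class_weight, loc, seeds=SEEDS):
    """ROC/PR curve data at the seeds giving the min and max AUC ROC,
    plus the percentage ranges of AUC ROC and AUC PR across all seeds."""
    roc_aucs, pr_aucs = auc_distribution(n_objects, class_weight, loc, seeds)
    extreme_seeds = {"min": seeds[int(np.argmin(roc_aucs))],
                     "max": seeds[int(np.argmax(roc_aucs))]}

    curves = {}
    for label, seed in extreme_seeds.items():
        y_true, y_score = get_sample(n_objects=n_objects,
                                     class_weight=class_weight,
                                     loc=loc, seed=seed)
        fpr, tpr, _ = roc_curve(y_true, y_score)
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        curves[label] = {"roc": (fpr, tpr),
                         "pr": (recall, precision),
                         "auc_roc": roc_auc_score(y_true, y_score),
                         "auc_pr": auc(recall, precision)}

    # Percentage range relative to the maximum value (one possible convention).
    roc_range = 100 * (max(roc_aucs) - min(roc_aucs)) / max(roc_aucs)
    pr_range = 100 * (max(pr_aucs) - min(pr_aucs)) / max(pr_aucs)
    return curves, roc_range, pr_range
```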

This piece of code produces data for plotting the curves at seeds where the AUC ROC reaches its minimum and maximum values during iteration over the seeds. Additionally, it calculates the corresponding AUC values and the percentage range between the minimum and maximum AUC values.

When class_weight is set to 40%, there is no visible difference between the curves, and the range of AUCs is also small, as depicted in Figure 4.

Figure 4 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is set to 1E+5 and class_weight is 40%

A more interesting case is presented in Figure 5. With class_weight set to 0.05%, the fluctuations in AUCs at smaller sample sizes become more pronounced compared to the previous case.

Figure 5 — Dependence of AUC ROC and AUC PR on the number of objects in the sample when class_weight is set to 0.05%

Figures 6 and 7 provide a detailed comparison of the ROC and PR curves for sample sizes of 1E+5 and 5E+7 objects, respectively.

Figure 6 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is set to 1E+5 and class_weight is 0.05%
Figure 7 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is set to 5E+7 and class_weight is 0.05%

Figure 6 shows a noticeable difference between the curves when class_weight is set to 0.05%. The difference between the PR curves is much larger compared to the difference between the ROC curves at the seeds corresponding to the minimum and maximum AUC ROC values. Accordingly, the AUC PR range is also greater than the AUC ROC range. From this, we may conclude that even minor changes in a relatively small and unbalanced sample can affect the AUC PR value more than the AUC ROC value. As we increase the sample size to 5E+7, the difference between the curves becomes less pronounced; however, the AUC PR range remains slightly larger than the AUC ROC range.

The code used in this experiment is available on GitHub [19].

Experiment #2 — how does AUC depend on class balance?

This experiment explores the dependence of AUCs on the class_weight parameter in two cases: one when the sample consists of 5E+7 objects, and another when the sample consists of 1E+5 objects.

Figures 8 and 9 show the results. As expected, the mean value of the AUC ROC remains almost constant as class_weight varies, given that ROC curves are considered insensitive to changes in class distribution [1]. On the other hand, the AUC PR value decreases significantly, as it is highly dependent on class balance.

Figure 8 — Dependence of AUC ROC and AUC PR on the positive class weight when number_of_objects is set to 5E+7

When the sample size is set to 5E+7 objects, there are only slight fluctuations in AUCs, especially when the class weight is small. In this case, the differences between the ROC or PR curves at the seeds corresponding to the minimum and maximum AUC ROC values are barely visible, as depicted in Figure 7.

However, the picture changes when the sample size is reduced to 1E+5. Here, minor changes in the sample due to seed parameter variations trigger significant fluctuations in the AUCs. These fluctuations become higher as class imbalance increases.

Figure 9 — Dependence of AUC ROC and AUC PR on the positive class weight when number_of_objects is set to 1E+5

In this case, the differences between the curves become apparent and the AUC range at the seeds corresponding to the minimum and maximum AUC ROC values reaches 7.5% for AUC ROC and 65% for AUC PR, as depicted in Figure 6.

The code used in this experiment is available on GitHub [19].

Experiment #3 — how does the complexity of the classification task impact AUC?

Unlike the previous two experiments, this experiment investigates how AUCs depend on the distance between the two normal distributions that make up the sample, while class weights and sample sizes remain constant. This is achieved by varying the loc parameter in the get_sample() function.

Figure 10 depicts the case where class_weight is set to 40% and the number of objects is 1E+5. There is almost no difference between the AUC ROC and AUC PR dependencies. A comparison of the corresponding curves for the loc parameter set to 6 is provided in Figure 11.

Figure 10 — Dependence of AUC ROC and AUC PR on the center of the right normal distribution when class_weight is 40% and number_of_objects is 1E+5
Figure 11 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is 1E+5, class_weight is 40%, and loc is 6

With the loc parameter set to 6, the difference between the ROC curves at the seeds corresponding to the minimum and maximum AUC ROC values is barely visible, and the same is true for the PR curves. When the loc parameter increases to 14, the difference between the curves shrinks further, and the range of both AUC ROC and AUC PR narrows to a mere 0.12%.

Figure 12 depicts more interesting changes in AUCs when class_weight is set to 0.05% with the sample size of 1E+5 objects. Given the significant fluctuation in AUCs, it is interesting to compare the ROC and PR curves at the seeds corresponding to the minimum and maximum AUC ROC values.

Figure 12 — Dependence of AUC ROC and AUC PR on the center of the right normal distribution when class_weight is 0.05% and number_of_objects is 1E+5

These comparisons, made with loc set to 6 and 14, are presented in Figures 13 and 14. It is evident that as the loc parameter increases (i.e., the complexity of the classification task decreases), the AUCs increase in absolute value. At the same time, the difference between the curves becomes less pronounced, and the AUC range decreases in percentage terms.

It is also worth noting that the AUC PR range is much larger than that of AUC ROC, regardless of the loc value. This reflects the high sensitivity of AUC PR to changes in relatively small samples.

Figure 13 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is 1E+5, class_weight is 0.05% and loc is 6
Figure 14 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is 1E+5, class_weight is 0.05% and loc is 14

When dealing with a larger sample size of 5E+7 objects, the difference between the curves and AUCs vanishes. This is demonstrated in Figure 15, where the loc parameter is set to 14. Similar to the case with the sample size of 1E+5 objects, the difference between the curves becomes smaller as the loc parameter increases (i.e., the complexity of the classification task decreases); this is evident when comparing Figure 7 and Figure 15.

Figure 15 — Differences between ROC curves (left) and PR curves (right) at the seeds corresponding to the minimum and maximum AUC ROC values; number_of_objects is 5E+7, class_weight is 0.05% and loc is 14

The code used in this experiment is available on GitHub [19].

Experiment #4 — Is AUC ROC a suitable metric for early stopping in gradient-boosted decision tree libraries?

This final experiment compares AUC ROC and AUC PR as evaluation metrics for early stopping in gradient-boosted machine learning libraries.

The experiment relies on datasets created with the make_classification() function of scikit-learn as follows:
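A sketch of the dataset generation; the specific make_classification() arguments shown here (sample size, feature counts, class weights, label noise) and the reuse of the SEEDS list as random_state values are assumptions for illustration, with the actual settings in the repository [19]:

```python
from sklearn.datasets import make_classification

def make_dataset(random_state, n_samples=100_000, positive_share=0.005):
    """Generate one unbalanced synthetic dataset (parameter values are assumed)."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=20,
        n_informative=10,
        weights=[1.0 - positive_share, positive_share],  # strong class imbalance
        flip_y=0.01,                                     # a little label noise
        random_state=random_state,
    )
    return X, y

datasets = [make_dataset(rs) for rs in SEEDS]
```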

Each dataset, corresponding to a specific random_state value, is split into training, validation and testing parts in a 50/25/25% proportion. The validation part is used for early stopping, while the testing part serves as a holdout for the independent evaluation of AUCs.
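A sketch of one training run with CatBoost under this splitting scheme; the stratified splitting, the early-stopping patience and the remaining hyperparameters are assumptions for illustration rather than the repository settings [19]:

```python
from catboost import CatBoostClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_with_early_stopping(X, y, eval_metric="AUC", random_state=0):
    """Train CatBoost with the given early-stopping metric and score the holdout."""
    # 50/25/25% split into training, validation and testing parts.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=random_state)
    X_valid, X_test, y_valid, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=random_state)

    model = CatBoostClassifier(eval_metric=eval_metric,  # "AUC" or "PRAUC"
                               random_seed=random_state, verbose=False)
    model.fit(X_train, y_train,
              eval_set=(X_valid, y_valid),
              early_stopping_rounds=100)  # assumed patience

    scores = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores), average_precision_score(y_test, scores)
```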

Figure 16 displays the results using the CatBoost library, where AUC denotes AUC ROC as the evaluation metric for early stopping and PRAUC denotes AUC PR. Each boxplot represents the distribution of AUCs on the testing part of the datasets for the specified evaluation metric.

Figure 16 — Distribution of AUC ROC and AUC PR on the testing part of datasets using different evaluation metrics for early stopping in the CatBoost library

The one-way ANOVA test [21] indicates no significant difference between the mean AUC ROC values across the three groups trained with different evaluation metrics, since the p-value is greater than 0.3. The same test indicates no significant difference in the AUC PR of the three groups, with a p-value of 0.18. However, as shown in Figure 16, the median AUC PR for the PRAUC metric is markedly higher than for AUC.
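For reference, such a test can be run with scipy.stats.f_oneway [21]; the lists below are placeholders with illustrative numbers only, not values from the experiment:

```python
from scipy.stats import f_oneway

# Holdout AUC values for three groups of models, one group per early-stopping
# metric (illustrative placeholder numbers, not experiment results).
aucs_group_a = [0.91, 0.92, 0.90, 0.93]
aucs_group_b = [0.92, 0.91, 0.93, 0.92]
aucs_group_c = [0.90, 0.92, 0.91, 0.93]

statistic, p_value = f_oneway(aucs_group_a, aucs_group_b, aucs_group_c)
print(f"one-way ANOVA p-value: {p_value:.3f}")
```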

A figure corresponding to the XGBoost library is quite similar to Figure 16. It can be found on GitHub, along with the code used in this experiment [19].

Discussion and conclusion

Four synthetic experiments have been analyzed here to assess the viability of AUC ROC in unbalanced binary classification problems by exploring how the metrics change under different conditions. Even though the experiments described here echo the findings in references [5, 9–10, 13, 22], a few points should be highlighted.

For example, when classes are nearly balanced, using AUC PR over AUC ROC offers little to no advantage: the difference between the ROC or PR curves at the seeds corresponding to the minimum and maximum values of the AUC ROC distribution (formed by seed variation) is negligible. Even with relatively small samples this difference remains insignificant.

However, the situation changes when the classes are unbalanced. In such cases, there is a clear difference between both the curves and the AUCs obtained at seeds that correspond to the minimum and maximum values in the AUC ROC distribution. The difference between the PR curves is much larger compared to the ROC curves in terms of AUC percentage range.

As the number of objects in the sample decreases while class_weight remains constant, both the difference between the curves and the AUC range increase. In these cases, the AUC PR percentage range is much larger compared to the AUC ROC range. However, as the sample size increases, the difference between the curves diminishes and the resulting small AUC PR range is only slightly higher than the AUC ROC range at equivalent seeds. Hence, when dealing with highly unbalanced, relatively small datasets, it is preferable to use AUC PR as an evaluation metric, given its higher sensitivity to minor changes in small samples. This requirement is not strict for larger datasets due to the less pronounced difference between AUC ROC and AUC PR ranges in larger samples.

The results of the third experiment confirm these conclusions, with a few additions. When the sample size is large and the classes are nearly balanced, there is no clearly visible difference between the curves or the AUCs. In this case, decreasing the complexity of the classification task simply eliminates the difference between the curves and increases the AUCs in absolute terms. However, when the sample size is small and the classes are unbalanced, the difference between the curves becomes obvious, with the AUC PR range much larger than that of the AUC ROC. Once again, decreasing the complexity of the classification task reduces the difference between the curves and increases the AUCs.

A specific note to highlight is that high complexity of the classification task combined with severe class imbalance may result in low precision and, consequently, quite low AUC PR. In such cases the metric can be misleading, which is also supported by the conclusions in reference [23]. In practice, training models with gradient-boosted libraries using AUC PR as the evaluation metric for early stopping on datasets with weak features may lead to under-training or a high number of false positives at inference time, while using AUC ROC as the evaluation metric yields reasonably good results.

Nevertheless, in some cases of class imbalance where the positive class is quite important, using AUC PR as the evaluation metric during model training may be reasonable. This is due to the fact that algorithms optimizing the area under the ROC curve do not necessarily optimize the area under the PR curve [6]. Even though the difference in the fourth experiment was not statistically significant, there is clear evidence that the median value of AUC PR is higher for the corresponding evaluation metric compared to other evaluation metrics. This observation is consistent regardless of the library used.

In other cases of class imbalance, especially with relatively large datasets, using AUC ROC is feasible. This is due to the negligible difference between AUC ROC and AUC PR, as illustrated in experiments with samples of 5E+7 objects. In these cases, high complexity of the classification task is another reason to favor the use of AUC ROC.

References

  1. Fawcett, T., 2006. An introduction to ROC analysis. Pattern recognition letters, 27(8), pp. 861–874. https://people.inf.elte.hu/kiss/13dwhdm/roc.pdf
  2. Fawcett, T., 2004. ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1), pp. 1–38. https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.9777&rep=rep1&type=pdf
  3. https://developers.google.com/machine-learning/crash-course/classification/video-lecture
  4. https://towardsdatascience.com/on-roc-and-precision-recall-curves-c23e9b63820c
  5. Cook, J. and Ramadas, V., 2020. When to consult precision-recall curves. The Stata Journal, 20(1), pp. 131–148. https://journals.sagepub.com/doi/pdf/10.1177/1536867X20909693
  6. Davis, J. and Goadrich, M., 2006. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning (pp. 233–240). https://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf
  7. https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
  8. https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/
  9. https://sinyi-chou.github.io/classification-pr-curve/
  10. https://classeval.wordpress.com/simulation-analysis/roc-and-precision-recall-with-imbalanced-datasets/
  11. https://cosmiccoding.com.au/tutorials/pr_vs_roc_curves
  12. https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc
  13. Saito, T. and Rehmsmeier, M., 2015. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432
  14. Sofaer, H.R., Hoeting, J.A. and Jarnevich, C.S., 2019. The area under the precision‐recall curve as a performance metric for rare binary events. Methods in Ecology and Evolution, 10(4), pp. 565–577. https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13140
  15. Lobo, J.M., Jiménez‐Valverde, A. and Real, R., 2008. AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography, 17(2), pp. 145–151. https://onlinelibrary.wiley.com/doi/10.1111/j.1466-8238.2007.00358.x
  16. Li, W. and Guo, Q., 2021. Plotting receiver operating characteristic and precision–recall curves from presence and background data. Ecology and Evolution, 11(15), pp. 192–206. https://onlinelibrary.wiley.com/doi/full/10.1002/ece3.7826
  17. https://www.youtube.com/watch?v=4jRBRDbJemM
  18. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
  19. https://github.com/dspetukhov/auc-roc-unbalanced
  20. https://docs.python.org/3/library/random.html
  21. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
  22. https://paulvanderlaken.com/2019/08/16/roc-auc-precision-and-recall-visually-explained/
  23. https://towardsdatascience.com/demystifying-roc-and-precision-recall-curves-d30f3fad2cbf
