Fallacy of using AUC as a model selection criterion

Rajneesh Tiwari · Published in Analytics Vidhya · Sep 14, 2019

In the context of model selection, most data scientists rely on a wide variety of goodness-of-fit criteria to decide which model performs best on the holdout/validation dataset. These goodness-of-fit criteria are often referred to as ‘metrics’.

Confusion Matrix: Most metrics for tabular ML models are derived from the confusion matrix

Note that a “metric” can be quite different from the “loss function”, which, in most cases, is a function that maps any given prediction to a cost penalty, such that the closer the prediction is to the target, the lower the cost, and vice versa.

Loss function in 3D Euclidean space

Loss Functions are constrained:

One constraint on loss functions is that in most cases they are required to be at least first-order differentiable, and for some advanced algorithms (LightGBM, XGBoost, etc.) second-order differentiable as well.
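As a concrete illustration of that requirement, boosted-tree libraries with custom objectives expect both the gradient and the Hessian of the loss with respect to the raw prediction. The function below is a minimal sketch of the binary log-loss derivatives; the name and the plain-NumPy interface are illustrative, not the exact XGBoost/LightGBM API.

```python
import numpy as np

def logloss_objective(raw_score, y_true):
    """Gradient and Hessian of binary log loss w.r.t. the raw (pre-sigmoid) score.

    Boosted-tree libraries such as XGBoost/LightGBM expect exactly these two
    vectors from a custom objective; a metric like accuracy cannot supply them.
    """
    p = 1.0 / (1.0 + np.exp(-raw_score))   # predicted probability
    grad = p - y_true                      # first derivative of log loss
    hess = p * (1.0 - p)                   # second derivative of log loss
    return grad, hess

# Example: gradients shrink as predictions move toward the targets
raw = np.array([-2.0, 0.0, 3.0])
y = np.array([0.0, 1.0, 1.0])
print(logloss_objective(raw, y))
```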

Examples of Loss Function and Metrics:

Loss Functions: LogLoss, Likelihood Loss, Mean Squared Error etc.

Metrics: Accuracy, Precision, Recall, f-Beta Score, RMSE, MSE, MAPE, MAD etc.

You will notice that a loss function can always be used as a metric, but the reverse is not necessarily true. For example, one can use logloss both as a loss function and as a metric. However, metrics such as Accuracy, Precision, Recall, etc. can’t be used as loss functions.

The constraint that at least a valid gradient (and, for boosted-tree models, even a Hessian matrix) be defined is violated if we use the aforementioned metrics: they are piecewise-constant in the predictions, so no useful gradient exists.
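To see the violation concretely, consider a finite-difference “gradient” of accuracy with respect to a predicted probability: because accuracy only changes when a prediction crosses the cutoff, the derivative is zero almost everywhere and undefined at the jumps. A small illustration (my own, not from the article):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0])
p = np.array([0.70, 0.30, 0.55, 0.45])   # predicted probabilities
cutoff, eps = 0.5, 1e-4

def accuracy(probs):
    return np.mean((probs > cutoff).astype(int) == y_true)

# Nudging a probability slightly does not change the predicted labels,
# so the finite-difference "gradient" of accuracy is exactly zero.
p_shifted = p.copy()
p_shifted[0] += eps
print((accuracy(p_shifted) - accuracy(p)) / eps)   # 0.0 -> no useful gradient
```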

Analytical Gradient Formula (first-order expansion): f(x₀ + ∆x) = f(x₀) + ∇f(x₀)·∆x + o(‖∆x‖), where ∇f(x₀) is the gradient of f at x₀.

Note: Try to convince yourself that the gradient can be regarded as a vector pointing in the direction of steepest ascent on the surface of the function.

The Hessian is defined as the matrix of second-order partial derivatives, and can be written as:

H(f)ᵢⱼ = ∂²f / ∂xᵢ ∂xⱼ

AUC-ROC metric

Let us focus on one particular metric that is very common for classification problems: AUC, the area under the ROC (receiver operating characteristic) curve.

Consider a case of binary classification, where we train a model (call it Model A) and retrieve the prediction probabilities on the holdout set. We also assume a cutoff threshold of 0.4 for the positive class (prob > 0.4 = +ve class).

Refer to the table below of Y actual (Target), Predicted Probability (+ve class) and threshold-based predicted class label.

Actuals vs. Predicted probs

We will not go into the details of how to calculate AUC, but at a high level, computing it corresponds to:

1. First calculating the TPR (true positive rate) and FPR (false positive rate) at multiple cutoff thresholds

2. Plotting these on a graph and calculating the area under the curve (a short code sketch follows).
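In code, those two steps look roughly like this; the helper below is my own sketch using scikit-learn’s roc_curve and auc, not the article’s implementation:

```python
from sklearn.metrics import roc_curve, auc

def auc_from_roc(y_true, y_prob):
    # Step 1: TPR and FPR at every distinct cutoff threshold
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    # Step 2: area under the (FPR, TPR) curve via the trapezoidal rule
    return auc(fpr, tpr)

# e.g. auc_from_roc([0, 1, 1, 0], [0.2, 0.8, 0.6, 0.4]) returns 1.0
```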

AUC — ROC Curve (Source)

For more details, feel free to check here if you need to take a look.

For now, we will start by creating a random dataset. Specifically, we generate three things (sketched in code after this list):

  1. Y actual vector: 10 × 1
  2. Y predicted probability vector: 10 × 1
  3. Predicted class based on a threshold: 10 × 1
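The article’s original code is shown as an image and is not reproduced here; below is a minimal equivalent sketch. The specific probability values are my own illustration, chosen so that the AUC works out to the 0.875 reported later in the post.

```python
import numpy as np
import pandas as pd

# 1. Actual targets (10 x 1)
y_actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# 2. Predicted probabilities for the positive class (10 x 1) -- illustrative
#    values only, picked to reproduce an AUC of 0.875.
y_prob = np.array([0.10, 0.20, 0.30, 0.35, 0.45, 0.60,
                   0.40, 0.50, 0.70, 0.80])

# 3. Predicted class label at the 0.4 cutoff (10 x 1)
y_class = (y_prob > 0.4).astype(int)

df = pd.DataFrame({"y_actual": y_actual, "y_prob": y_prob, "y_class": y_class})
print(df)
```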

Next, we will calculate the AUC from the actual target labels and the predicted probability values.
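A minimal equivalent of this step, assuming scikit-learn’s roc_auc_score in place of the article’s own code:

```python
from sklearn.metrics import roc_auc_score

# AUC is computed from the actual labels and the predicted probabilities only
auc_a = roc_auc_score(y_actual, y_prob)
print(f"AUC Model A: {auc_a:.3f}")   # 0.875 with the illustrative data above
```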

At first glance, it seems that AUC depends directly on the probabilities: in the code above, we use only the actual targets and the predicted probabilities to calculate the AUC. While it is true that the probabilities are used, one might be surprised to find that their magnitudes are not used at all; only their ordering matters. Let me explain with code.

Instead of using the raw probabilities, we will rank-average them and feed the resulting ranks into the AUC calculation. Let us see what happens when we do that:
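A sketch of that experiment, using scipy.stats.rankdata (with the “average” method, tied values receive their average rank) in place of the article’s code:

```python
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

# Replace each probability with its (average) rank, discarding the magnitudes
y_prob_ranked = rankdata(y_prob, method="average")

print(roc_auc_score(y_actual, y_prob))          # AUC on raw probabilities
print(roc_auc_score(y_actual, y_prob_ranked))   # identical AUC on ranks
```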

Voila! The AUC value does not change when we use the rank averages of the probability vector.

What does this mean for our analysis?

In short, it means that the raw probability magnitudes are not used in the calculation of AUC, and hence AUC alone tells us nothing about how confident our model is in its predictions.

To take the analysis further, we will prepare another set of predicted probabilities (Model B), obtained by rank-averaging Model A’s probabilities and scaling the result down by a factor of 100.

Note that the rank ordering is still maintained in the newly generated probabilities, but their magnitudes shrink dramatically compared to the original model’s (Model A) output.
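Following that description, Model B’s probabilities can be constructed from Model A’s output roughly as below (my own construction; the variable names are illustrative):

```python
from scipy.stats import rankdata

# Model A: the original predicted probabilities
prob_model_a = y_prob

# Model B: rank-average of Model A's probabilities, scaled down by 100.
# The ordering is preserved, but the magnitudes shrink dramatically.
prob_model_b = rankdata(prob_model_a, method="average") / 100.0

print(prob_model_b.max())   # at most 0.10 for ten samples
```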

Let’s check the probability distribution curves of the predicted outputs from the two models, as shown below:

KDE plot for predicted probs from two different models (with same AUC)
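The plot and the AUC comparison can be reproduced along these lines (a sketch using seaborn and scikit-learn; the article’s own plotting code is not shown):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score

# Compare the distributions of the two models' predicted probabilities
sns.kdeplot(prob_model_a, label="Model A")
sns.kdeplot(prob_model_b, label="Model B")
plt.xlabel("Predicted probability (+ve class)")
plt.legend()
plt.show()

print(f"AUC Model A: {roc_auc_score(y_actual, prob_model_a):.3f}")
print(f"AUC Model B: {roc_auc_score(y_actual, prob_model_b):.3f}")
```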

AUC Model A: 0.875
AUC Model B: 0.875

So we have the same AUC score for the two classifiers, but very different predicted probability distributions. What does this mean?

It means that our models behave quite differently when it comes to confidence in their predictions. We want neither an overconfident model nor an under-confident one.

We will have to look beyond AUC, at the predicted probabilities themselves, to understand which of the two models is overconfident, under-confident, or just right.

There will be a follow-up to this post next week, in which I will share further analysis and introduce the concept of probability calibration.

Thanks for reading.

References:

1. https://blog.algorithmia.com/introduction-to-loss-functions/

2. https://courses.cs.ut.ee/MTAT.03.227/2016_spring/uploads/Main/lecture-4-extra-materials.pdf
