Metrics for Measuring Calibration in Deep Learning

Sieun Park · Published in CodeX · Aug 3, 2021

In a previous post, we reviewed the confidence calibration problem in deep neural networks. Informally, confidence calibration means that if a model predicts a class with 90% confidence, that prediction should be correct 90% of the time. We reviewed methods to measure and improve calibration, and discussed why deep neural networks typically show poor calibration.

The calibration problem is typically visualized with reliability diagrams, and the calibration error is evaluated with the expected calibration error (ECE). Recent research found this measure to be problematic for four reasons and proposed alternative metrics for measuring calibration error.

This post is based on the paper Measuring Calibration in Deep Learning.

Expected Calibration Error (ECE)

Because calibration can’t be computed directly, we use alternative metrics such as ECE to evaluate the calibration of a network. We first divide the probability interval [0, 1] into multiple bins. ECE is then the weighted average of the gap between accuracy and confidence in each bin, weighted by the relative number of samples that fall into each bin.
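To make the computation concrete, here is a minimal NumPy sketch of ECE, assuming `probs` is an (N, K) array of softmax probabilities and `labels` holds the true class indices; the 15-bin default is just a common illustrative choice, not something prescribed by the paper.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE sketch: bin predictions by confidence (max probability), then take
    the weighted average of |accuracy - confidence| over the bins."""
    confidences = probs.max(axis=1)           # confidence of the predicted class
    predictions = probs.argmax(axis=1)        # predicted class
    accuracies = (predictions == labels).astype(float)

    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.mean()             # fraction of samples in this bin
            acc = accuracies[in_bin].mean()    # accuracy within the bin
            conf = confidences[in_bin].mean()  # mean confidence within the bin
            ece += weight * abs(acc - conf)
    return ece
```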

ECE is commonly used as a scalar metric to evaluate calibration in experiments. However, this paper finds that ECE is fundamentally flawed as a calibration metric, and that some calibration methods that successfully minimized ECE were not properly evaluated. The issues with ECE include:

Not Computing Calibration Across All Predictions: ECE is designed for binary classification; when extended to multi-class classification, only the probability of the predicted class is used to measure calibration. The metric inherently ignores how the model assigns the other K − 1 probabilities. ECE becomes a worse metric for calibration as the predictions beyond the top class matter more.

Fixed Calibration Ranges: The ECE metric weights each bin by its relative number of samples. Because network predictions are typically very confident, a few bins on the right side of the range contribute most of the ECE, so the metric mostly checks that the most confident predictions are calibrated.

Bias-Variance Tradeoff: The number of bins is a hyperparameter that trades off the bias and variance of the calibration measurement. More bins subdivide the range more finely and give lower-bias estimates, but since fewer samples fall into each bin, those estimates have higher variance. Because some bins are much sparser than others, this problem compounds with the fixed calibration ranges issue above.

Pathologies in Static Binning Schemes: Networks can effectively “hack” the ECE metric. Near-zero calibration error can occur when overconfident and underconfident predictions land in the same bin and cancel out. This may sound extreme and perhaps impractical, given that the calibration problem in deep learning is typically overconfidence.

For example, if the dataset is 45% positive, we could simply output predictions in the range (0.41, 0.43) for the negative examples and (0.47, 0.49) for the positive examples to create a set of predictions that has 1.0 AUC and 0 ECE, yet is uncalibrated.
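Here is a quick numeric sketch of that pathology, using the binary-calibration reading of ECE (positive-class score vs. empirical positive rate) and synthetic scores that simply mirror the numbers above; the bin count and dataset size are arbitrary.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000
labels = (rng.random(n) < 0.45).astype(int)        # ~45% positive
scores = np.where(labels == 1,
                  rng.uniform(0.47, 0.49, n),       # positives
                  rng.uniform(0.41, 0.43, n))       # negatives

# All scores land in the single (0.4, 0.5] bin: its mean score (~0.45)
# matches the positive rate (~0.45), so binned ECE is ~0 while AUC is 1.0.
bins = np.linspace(0.0, 1.0, 11)
ece = 0.0
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (scores > lo) & (scores <= hi)
    if in_bin.any():
        ece += in_bin.mean() * abs(labels[in_bin].mean() - scores[in_bin].mean())

print(roc_auc_score(labels, scores))   # 1.0 — perfect ranking
print(ece)                             # close to 0, yet the scores are uncalibrated
```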

To address these problems, the paper proposes several modifications of ECE.

Class Conditionality

We can compute the calibration error for each class separately and then average the results to obtain the final calibration error. This lets us evaluate the calibration of each class independently of how frequently the model predicts it. It evens out the error, especially when there is class imbalance (far more of one class than another) or when the model’s predictions are skewed toward certain classes.
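A minimal sketch of one reading of this idea, reusing `expected_calibration_error` from the ECE snippet above: compute the error restricted to the examples of each true class, then average (the paper’s exact per-class definition may differ in details).

```python
import numpy as np

def class_conditional_ece(probs, labels, n_bins=15):
    """One reading of the class-conditional variant: compute the ECE
    separately on the examples of each true class, then average."""
    classes = np.unique(labels)
    per_class = [expected_calibration_error(probs[labels == k],
                                            labels[labels == k],
                                            n_bins)
                 for k in classes]
    return float(np.mean(per_class))
```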

Maximum Probability (SCE)

Static calibration error (SCE) extends ECE by considering every class probability instead of only the maximum one for multi-class outputs. It computes a weighted average across classes as well as across bins. This addresses the problem of not computing calibration across all predictions, and it stops the score from being dominated by the single most confident output. The equation is straightforward (K: number of classes).
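A sketch of what SCE computes, under the usual reading of the formula: bin each class’s probabilities separately, compare the mean probability for class k in a bin against the fraction of that bin whose true label is k, and average over bins and classes (the function name and bin count are mine).

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=15):
    """SCE sketch: bin every class probability (not just the max) and
    average the per-bin gaps over bins and classes."""
    n_samples, n_classes = probs.shape
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    sce = 0.0
    for k in range(n_classes):
        class_probs = probs[:, k]                    # probability assigned to class k
        class_hits = (labels == k).astype(float)     # 1 if the true label is k
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            in_bin = (class_probs > lo) & (class_probs <= hi)
            if in_bin.any():
                weight = in_bin.sum() / n_samples
                sce += weight * abs(class_hits[in_bin].mean()
                                    - class_probs[in_bin].mean())
    return sce / n_classes
```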

Adaptivity (ACE)

Adaptive calibration ranges modify the bin intervals so that each bin contains an equal number of samples. Adaptivity is implemented by taking consecutive batches of size ⌈N/R⌉ from an array of (prediction, accuracy) pairs sorted by confidence (N: number of data points, R: number of ranges). This adaptive scheme addresses the bias-variance tradeoff problem.
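A sketch of the adaptive binning idea, assuming equal-size bins are built by sorting confidences and splitting the sorted order into R groups; for brevity this version scores only the predicted class, whereas the paper’s ACE also ranges over all classes.

```python
import numpy as np

def adaptive_bins(confidences, n_ranges=15):
    """Sort the confidences and cut them into n_ranges groups of roughly
    equal size, so every bin holds the same number of samples instead of
    covering the same width."""
    order = np.argsort(confidences)
    return np.array_split(order, n_ranges)   # list of index arrays, one per bin

def adaptive_calibration_error(probs, labels, n_ranges=15):
    """ACE-style sketch on the predicted class only."""
    confidences = probs.max(axis=1)
    accuracies = (probs.argmax(axis=1) == labels).astype(float)
    bins = adaptive_bins(confidences, n_ranges)
    return float(np.mean([abs(accuracies[b].mean() - confidences[b].mean())
                          for b in bins if len(b)]))
```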

Norm

The gap between accuracy and confidence can be aggregated with either the L1 norm, |acc − conf|, or the L2 norm, which squares the per-bin gaps before averaging and taking the square root. The L2 norm penalizes large deviations more heavily, so it is more sensitive to outliers than the L1 norm. This choice also affects how effective the calibration metric is.
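A small sketch of how the choice of norm changes the aggregation over bins (the helper and its weighting are illustrative, not the paper’s exact formulation):

```python
import numpy as np

def binned_error(accs, confs, weights, norm="l1"):
    """Aggregate per-bin accuracy/confidence gaps with either norm.
    accs, confs and weights are per-bin accuracy, mean confidence and
    sample fraction."""
    gaps = np.asarray(accs) - np.asarray(confs)
    w = np.asarray(weights)
    if norm == "l1":
        return float(np.sum(w * np.abs(gaps)))
    # L2: square the gaps before the weighted average, then take the root,
    # which penalizes a few badly calibrated bins more heavily.
    return float(np.sqrt(np.sum(w * gaps ** 2)))
```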

Thresholding

Because of the softmax, many class probabilities are infinitesimal (greater than 0, but very small), and these tiny values can wash out the calibration score. This is especially a problem for SCE, where the majority of per-class predictions have minute values. It is similar to the fixed calibration range problem, where the confidences were skewed. We can solve it by only considering predictions above a threshold ε. While discarding small class probabilities may sound similar to focusing on the maximum probability only, choosing a small ε guarantees that significant secondary predictions are not thrown away.
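A sketch of the thresholding idea on top of static per-class binning: probabilities below ε are simply dropped before binning (the ε value and the simplified weighting here are illustrative).

```python
import numpy as np

def thresholded_calibration_error(probs, labels, n_bins=15, eps=1e-3):
    """Thresholding sketch: ignore per-class probabilities below eps so that
    near-zero softmax outputs cannot wash out the score."""
    n_classes = probs.shape[1]
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps, weights = [], []
    for k in range(n_classes):
        keep = probs[:, k] >= eps                      # the thresholding step
        class_probs = probs[keep, k]
        class_hits = (labels[keep] == k).astype(float)
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            in_bin = (class_probs > lo) & (class_probs <= hi)
            if in_bin.any():
                gaps.append(abs(class_hits[in_bin].mean()
                                - class_probs[in_bin].mean()))
                weights.append(in_bin.sum())           # weight by kept samples
    return float(np.average(gaps, weights=weights))
```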

Experiments (to show the flaws in ECE)

Calibration approaches such as histogram binning and Platt scaling were described in the previous post.

Because histogram binning directly optimizes a notion of ECE, the method is inherently favored by the ECE metric. The table above shows that histogram binning scores better on ECE: its ECE is one-third to one-fifth of the class-conditional calibration error. The left figure below, which plots the per-class calibration error, shows that the error is highly non-uniform across classes. All of this suggests that ECE gives histogram binning an unfair advantage.

The table above compares the effect of multiple modifications to the objective used to fit the temperature scaling parameter, as measured by ECE. The temperature parameter is optimized against each “property” objective, and the resulting ECE is measured. ECE appears to react to these small changes in the objective rather than evaluating general calibration performance; e.g., ECE suggests that optimizing the L2 norm instead of the L1 norm can halve the calibration error on ImageNet.

ECE is also very sensitive to the number of bins. In one experiment, the rank correlation of the same calibration metric under varying binning hyperparameters was measured. Adaptive calibration metrics dramatically outperformed the classic evenly-binned metrics in mean rank correlation, indicating that the adaptive binning scheme is more robust to the choice of bin count. Other modifications, such as max-probability, class conditionality, thresholding, and the L2 norm, worsened the mean rank correlation, which was disappointing.

The table above shows the rankings of 8 calibration techniques across 32 metrics. In general, the rankings were inconsistent across metrics. While no technique was significantly better or worse across all metrics, each technique had some metrics that favored it. Varying the calibration error metric leads to different conclusions about which method achieves the best calibration, so it is easy to draw the wrong conclusion about a post-processing calibration method.

Conclusion

The paper studies the uncertainties and limitations of metrics for evaluating calibration. It identifies and categorizes the problems with the ECE metric. We also reviewed modifications to ECE that resolve the limitations it pointed out. Finally, we discussed the surprising inconsistencies among metrics for evaluating calibration performance.

Future metrics for measuring and assessing calibration must address the challenges raised in this paper. To make real progress on calibration, we first need to define calibration clearly. This paper suggests that doing so is not as easy as we thought.

I believe calibration in deep learning is a strange and under-researched area. Miscalibration is a significant flaw in the learning algorithm, yet its behavior is inconsistent and the reasons behind it are still unclear.
