Predicting the true probability in Neural Networks: Confidence Calibration

Sieun Park · Published in CodeX · Jul 30, 2021

Wisdom is knowing what you don’t know — Confucius

Suppose a deep learning-based binary cancer diagnosis system, well known for its superior accuracy, predicts 0.996 on my data. Does that mean I actually have a 99.6% chance of having the disease? Ideally it should, but maybe not. We will review why these predictions don’t necessarily reflect true probabilities in deep learning models, and look at solutions to this phenomenon, which is prevalent in current deep learning systems.

Deep neural networks have achieved undeniable success, benefiting from modern accelerators and design principles. In deep-learning-based classification, the output value is designed to reflect the probability of being correct. But while deep learning definitely improves accuracy, these models don’t seem able to accurately report how confident they are in their outputs. As the paper puts it, “modern neural networks are no longer well-calibrated”.

This post reviews the paper On Calibration of Modern Neural Networks (Guo et al., 2017).

Confidence Calibration

Confidence calibration — the problem of predicting probability estimates representative of the true correctness likelihood

When we say that we want the output to be calibrated, we want the output to represent the true probability. For example, given 100 predictions each with a confidence of 0.8, we expect approximately 80 of those samples to be correctly classified. Perfect calibration means that exactly 80 of them are correct or, more formally, that the following condition holds.
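In the paper’s notation, with Yˆ the predicted class and Pˆ its associated confidence, perfect calibration is defined as P(Yˆ = Y | Pˆ = p) = p, for all p in [0, 1].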

Perfect calibration is, of course, impossible to achieve in practice. In fact, it is impossible to even measure the exact calibration of a model: since Pˆ is a continuous random variable, the probability above cannot be computed from finitely many samples, because we never observe the event Pˆ = p exactly.

Importance of Confidence Calibration

A neural network with softmax activation trained with cross-entropy loss is in principle designed to be calibrated, but often does not behave that way in practice.

Can’t we just live without calibration? While deep learning models achieve great performance, they are sometimes wrong. If they are always 99% confident, the consequences of being wrong can be critical, and we have to place less trust in these systems. This inability to express uncertainty can limit the application of DL in safety-critical real-world systems.

As an example, consider a self-driving car that uses a neural network to detect pedestrians and other obstructions. If the detection network is not able to confidently predict the presence or absence of immediate obstructions, the car should rely more on the output of other sensors for braking.

Alternatively, in automated health care, control should be passed on to human doctors when the confidence of a disease diagnosis network is low.

Measuring Calibration

As mentioned earlier, calibration is impossible to compute exactly. We instead estimate it by grouping predictions into multiple bins. The following are some of the ways calibration is estimated.

Reliability Diagrams: The bottom plots in the figure below are examples of reliability diagrams. By plotting the empirical accuracy as a function of confidence (the model’s own estimate of its accuracy), we can visualize how well the confidence reflects the true accuracy. For example, in the bottom-right plot, the true accuracy of the model on samples where it was 0.6–0.7 confident was only about 0.3. A larger gap from the diagonal identity line indicates worse calibration.
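As a rough sketch of how such a diagram can be produced from a model’s predictions (not the paper’s code; the bin count and variable names are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(confidences, correct, n_bins=15):
    """Plot per-bin empirical accuracy against per-bin average confidence.

    confidences: array of predicted confidences in [0, 1]
    correct:     boolean array, True where the prediction was right
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])  # bin index 0..n_bins-1

    accs, confs = [], []
    for m in range(n_bins):
        mask = bin_ids == m
        if mask.any():
            accs.append(correct[mask].mean())        # accuracy inside bin m
            confs.append(confidences[mask].mean())   # average confidence in bin m

    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(confs, accs, "o-", label="model")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```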

Expected Calibration Error (ECE): Because confidence takes continuous values, the predictions are grouped into M interval bins. ECE is a scalar summary of calibration, computed as a weighted average of the per-bin gaps between accuracy and confidence: ECE = Σ_m (|B_m|/n) · |acc(B_m) − conf(B_m)|, where the |B_m|/n term is the fraction of samples falling into bin m.

Maximum Calibration Error (MCE): Similar to ECE, but instead of a weighted average of the per-bin gaps, MCE takes the worst-case deviation between confidence and accuracy across bins.
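Using the same binning as the reliability diagram above, both metrics can be estimated in a few lines (again an illustrative sketch, with M = 15 equal-width bins as reported in the paper):

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=15):
    """Return (ECE, MCE) estimated with equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1])

    n = len(confidences)
    ece, mce = 0.0, 0.0
    for m in range(n_bins):
        mask = bin_ids == m
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += (mask.sum() / n) * gap   # weighted by |B_m| / n
            mce = max(mce, gap)             # worst-case bin deviation
    return ece, mce
```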

Negative log-likelihood (NLL): NLL is the familiar cross-entropy loss evaluated on held-out data, used here as a measure of probabilistic quality. NLL is minimized if and only if the predicted distribution πˆ(Y | X) recovers the ground-truth conditional distribution π(Y | X).
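Written out in the paper’s notation, over n held-out samples it is NLL = −Σ_i log πˆ(y_i | x_i).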

Effects of Modern Techniques on Calibration

The left plot above illustrates the calibration of a 5-layer LeNet, while the right plot shows a 110-layer ResNet. Deeper neural networks seem to have significantly worse calibration. The paper studies how model calibration changes under techniques commonly used in modern deep learning, in order to narrow down the exact cause of miscalibration.

When model capacity (depth and width) is increased, the classification error is reduced. However, calibration (measured by ECE) gets worse beyond a certain point (figures 1, 2). Next, it is interesting that batch normalization significantly harms the calibration of even a 6-layer network; this negative effect was observed regardless of hyperparameters. Finally, regularization in the form of weight decay was found to improve classification error up to a point while also improving model calibration. Interestingly, calibration keeps improving even in what would normally be considered the over-regularized regime (decay factor around 10ˆ-2.5).

From these experiments, we can observe that calibration is not optimized together with accuracy. I am curious whether model calibration is directly affected by the amount of regularization, including other forms such as label smoothing, data augmentation, dropout, or stochastic depth.

Achieving Calibration in Deep Learning

In this section, we will review some existing solutions for calibration. For simplicity, we assume a binary classification problem where the model predicts a single confidence pˆ_i for the positive class. We want to map this output to a recalibrated confidence qˆ_i. Let z_i denote the model output before the sigmoid activation (the logit).

Histogram binning: The uncalibrated prediction is assigned to one of M bins. A score θ_m that minimizes the bin-wise squared loss is assigned to each bin. The scores map the prediction range into the true calibrated confidence. At test time, if the prediction falls into bin m, the calibrated prediction is θ_m.

The scores are chosen to minimize the bin-wise squared loss: min over θ_1, …, θ_M of Σ_m Σ_i 1(a_m ≤ pˆ_i < a_{m+1}) · (θ_m − y_i)², where a_1 ≤ … ≤ a_{M+1} are the bin boundaries and y_i are the true labels. For fixed boundaries, the solution simply sets θ_m to the fraction of positive samples in bin B_m.
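A minimal sketch of histogram binning as a post-hoc calibrator (my own illustrative implementation, using equal-width bins and made-up names):

```python
import numpy as np

class HistogramBinning:
    """Post-hoc calibrator: replace each prediction with its bin's score theta_m."""

    def __init__(self, n_bins=15):
        self.edges = np.linspace(0.0, 1.0, n_bins + 1)
        self.theta = np.zeros(n_bins)

    def fit(self, p_val, y_val):
        """p_val: uncalibrated confidences, y_val: binary labels (validation set)."""
        bin_ids = np.digitize(p_val, self.edges[1:-1])
        for m in range(len(self.theta)):
            mask = bin_ids == m
            if mask.any():
                # The squared-loss minimizer is the positive rate in bin m.
                self.theta[m] = y_val[mask].mean()
            else:
                # Empty bin: fall back to the bin midpoint.
                self.theta[m] = 0.5 * (self.edges[m] + self.edges[m + 1])
        return self

    def predict(self, p_test):
        return self.theta[np.digitize(p_test, self.edges[1:-1])]
```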

Isotonic regression: Learns a piecewise constant, non-decreasing function f to calibrate the raw outputs, so that qˆ_i = f(pˆ_i). It is similar to histogram binning, but the bin boundaries are jointly optimized with the bin values by minimizing the squared error Σ_i (f(pˆ_i) − y_i)² subject to the monotonicity constraint on f.
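scikit-learn ships an isotonic regression implementation, so a calibrator can be sketched as follows (toy data and variable names are my own):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy validation data: uncalibrated confidences and their binary labels.
p_val = np.array([0.20, 0.40, 0.60, 0.80, 0.90, 0.95])
y_val = np.array([0, 0, 1, 0, 1, 1])

# Fit a non-decreasing map f on the validation set (pool-adjacent-violators).
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_val, y_val)

# Apply f to new, uncalibrated predictions.
q_test = iso.predict(np.array([0.30, 0.70, 0.99]))
print(q_test)
```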

Bayesian Binning into Quantiles (BBQ): Essentially histogram binning with Bayesian model averaging. A binning scheme is a pair (M, I), where M is the number of bins and I is a partition of [0, 1] into intervals. BBQ considers a whole space of binning schemes and averages the calibrated probability produced by each scheme, weighted by the posterior probability of that scheme given the validation data D. Because the validation data D is finite, these posterior weights can actually be computed in closed form.

Platt scaling: The logit z_i is fed into a logistic regression model trained on the validation set to predict calibrated probabilities. In the simplified binary setting, it outputs qˆ_i = σ(a·z_i + b) as the calibrated probability, learning the scalar parameters a and b. Simple enough.
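In other words, it is a one-feature logistic regression fit on the validation logits. A possible sketch with scikit-learn (toy data; note that sklearn adds L2 regularization by default, so a large C approximates plain maximum likelihood):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy validation data: pre-sigmoid logits and binary labels.
z_val = np.array([-2.1, -0.3, 0.4, 1.7, 2.5, 3.0])
y_val = np.array([0, 0, 1, 1, 0, 1])

# Learn a and b in q = sigmoid(a * z + b); the base network stays frozen.
platt = LogisticRegression(C=1e6)
platt.fit(z_val.reshape(-1, 1), y_val)

z_test = np.array([0.9, 2.2])
q_test = platt.predict_proba(z_test.reshape(-1, 1))[:, 1]  # calibrated probabilities
```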

For classification with K > 2 classes, extensions of the methods above are used. However, since the core ideas are consistent with their binary counterparts, we will not describe most of them in this post. Check out the original paper for more details.

Temperature scaling: The simplest extension of Platt scaling, it uses a single parameter T > 0, called the temperature, to scale the logits of all classes: the calibrated confidence is qˆ_i = max_k softmax(z_i / T)_k. T softens the confidence: with larger T values, the calibrated probability qˆ_i approaches 1/K, representing maximum uncertainty. However, dividing every logit by the same T does not change their ordering, so the predicted class, and therefore the accuracy of the model, is unaffected. The paper introduces this technique in the context of confidence calibration, fitting T by minimizing NLL on the validation set.
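A minimal PyTorch sketch of this procedure: a single scalar is tuned to minimize NLL on held-out logits while the network itself stays frozen (illustrative code, not the authors' implementation; I optimize log T so that T stays positive):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits_val, labels_val, max_iter=50):
    """logits_val: [N, K] validation logits, labels_val: [N] class indices."""
    log_T = torch.zeros(1, requires_grad=True)  # T = exp(log_T) > 0
    optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        # NLL of the temperature-scaled logits on the validation set.
        loss = F.cross_entropy(logits_val / log_T.exp(), labels_val)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_T.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
#        probs = F.softmax(test_logits / T, dim=1)   # accuracy is unchanged
```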

Instead of guiding the neural network itself to predict calibrated probabilities, these methods achieve confidence calibration by fitting a separate post-processing step on top of the trained network. I believe this trend rests on the assumption that calibration is unrelated to model accuracy, and thus should be learned through a separate procedure. But is this really true? The cross-entropy loss is fundamentally designed to produce calibrated outputs, so why don’t trained networks end up calibrated? I believe the problem lies in overfitting and the training procedure, and that training with a calibration-aware objective could provide better feedback to the model.

Experiments

ECE(M=15) of calibration strategies on various models

Among the various methods for confidence calibration, temperature scaling achieved surprisingly strong performance, outperforming the other methods by a large margin regardless of dataset and network architecture. Given its simplicity, this suggests that the miscalibration problem is really low-dimensional, or perhaps even linear.

As can be seen in the reliability diagrams, uncalibrated networks (left) are overconfident. While temperature scaling is the best among these methods, the others also achieve reasonable calibration.

Conclusion

Miscalibration of deeper neural networks is a strange problem because it does not seem to be tied to model accuracy. We discussed how many modern modifications can harm a model’s calibration while improving its accuracy. We also looked at several existing methods for recalibration, and concluded that temperature scaling is a simple yet effective technique for confidence calibration.

Our understanding of miscalibration in deep learning is still incomplete, and research on the theoretical background of this phenomenon is needed. Because the cross-entropy objective in principle favors calibrated outputs, miscalibration must stem from some poorly understood behavior of deep neural networks during training. Ideally, the network parameters themselves should learn to produce properly calibrated probabilities; however, current calibration methods typically do not touch the model’s parameters or predictions at all. I am looking forward to more research in this field, since there is not much work on calibration yet.
