Expedia Group Technology — Data

Calibrating BERT-based Intent Classification Models: Part-2

Using temperature scaling and label smoothing to calibrate classification models

Yao Zhang
Expedia Group Technology

--

In Part-1 of this series, my colleague Ramji Chandrasekaran described the problem of unreliable confidence scores in BERT-based intent classifiers. In short, the confidence scores of an intent classification model saturated at values close to 1, even when the predictions were incorrect.

Schematic input and output from an imagined conversation with a virtual agent, with low confidence
Before Calibration (This Photo by Unknown Author is licensed under CC BY-SA and edited by the author)

As shown in the image above, without calibration the confidence score is high even when a prediction is incorrect. An unreliable model can produce false accepts, leading to unreasonable responses from our virtual agents and, in turn, to dissatisfied customers, so we need to address this problem.

There is an existing body of literature on the confidence score reliability issues of BERT-based classification models, which stem from the characteristics of the softmax output layer ¹, and relatively simple methods exist to mitigate them.

In Part-2 of this two-part blog post, we elaborate on two such calibration methods: Temperature Scaling ¹ and Label Smoothing ². We explain each method in mathematical terms and then present the results of our experiments comparing model reliability before and after calibration.

Temperature Scaling

Temperature scaling (TS) post-processes model probabilities by rescaling the logits with a scalar temperature hyperparameter T: the non-normalized logits are divided by T before the softmax operation.

Temperature Scaling Formula
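Written out explicitly, with z_i denoting the logit for class i, K the number of classes, and q_i the calibrated probability for class i, the temperature-scaled softmax ¹ shown in the image above is:

q_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)}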

If T is equal to 1, the original model probabilities are unchanged. As T increases above 1, the probabilities become more uniform, thereby reducing the potential for over-confidence; as T decreases below 1, the maximum probability becomes even larger. Because dividing by a positive scalar T does not alter the “max” of the “softmax”, the predicted output classes do not change.

Let’s use a simple example to build some intuition for TS. Assume we have the logits [1.0, 4.0, 0.1]. As the list below (and the short code sketch after it) shows, TS with T = 2 lowers the maximum probability from 0.9346 to 0.7324, which is exactly the direction we need to resolve the confidence score saturation problem.

  • the softmax output without TS (T = 1) would be [0.0465, 0.9346, 0.0189];
  • the softmax output with T = 2 would be [0.1634, 0.7324, 0.1042];
  • the softmax output with T = 0.5 would be [0.0025, 0.9971, 0.0004].
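To make this concrete, here is a minimal NumPy sketch that reproduces the numbers above (the function name softmax_with_temperature is ours, purely for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Divide the logits by the temperature T, then apply softmax."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()                      # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [1.0, 4.0, 0.1]
print(softmax_with_temperature(logits, T=1.0))  # ~[0.0465, 0.9346, 0.0189]
print(softmax_with_temperature(logits, T=2.0))  # ~[0.1634, 0.7324, 0.1042]
print(softmax_with_temperature(logits, T=0.5))  # ~[0.0025, 0.9971, 0.0004]
```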

To use TS effectively for calibrating our model output, we need to find the optimal T. We determined the best value by minimizing the Expected Calibration Error (ECE) ¹ with respect to T on a separate validation dataset.
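Our exact tuning code is beyond the scope of this post, but the idea can be sketched roughly as follows, assuming val_logits and val_labels come from a held-out validation set and using 10 equal-width confidence bins; the function names and the search grid here are illustrative, not our production values:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE: the bin-weighted average gap between confidence and accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = (predictions[in_bin] == labels[in_bin]).mean()
            confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece

def find_temperature(val_logits, val_labels, grid=np.arange(0.5, 5.01, 0.05)):
    """Grid-search the T that minimizes ECE on the validation set."""
    predictions = val_logits.argmax(axis=1)     # TS never changes the argmax
    best_t, best_ece = 1.0, float("inf")
    for t in grid:
        scaled = val_logits / t
        probs = np.exp(scaled - scaled.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        ece = expected_calibration_error(probs.max(axis=1), predictions, val_labels)
        if ece < best_ece:
            best_t, best_ece = t, ece
    return best_t
```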

Label Smoothing

Label Smoothing (LS) smooths the one-hot target label with a scalar parameter α. The idea is to take some probability mass away from the correct class and distribute it over the other classes.

For example, in a classification problem with 3 classes, an original target of [0, 0, 1] becomes [0.05, 0.05, 0.9] after label smoothing with α = 0.1. As a result, the model is discouraged from producing a very large probability for the correct class, and the probabilities it assigns to the incorrect classes tend to become more uniform.
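Here is a minimal NumPy sketch of that transformation (the function name smooth_labels is illustrative; during training, these smoothed targets replace the one-hot targets in the cross-entropy loss):

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Keep 1 - alpha on the correct class and spread alpha evenly over the
    other classes, matching the example above. (Some formulations instead mix
    the one-hot target with a uniform distribution over all classes.)"""
    one_hot = np.asarray(one_hot, dtype=np.float64)
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - alpha) + (1.0 - one_hot) * alpha / (n_classes - 1)

print(smooth_labels([0, 0, 1], alpha=0.1))   # [0.05, 0.05, 0.9]
```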

Results

We applied TS alone, LS alone, and TS combined with LS to our BERT-based intent classification model, and we used reliability diagrams and ECE (both described in Part-1) to evaluate the quality of the confidence scores.

The calibration plots below compare the results before and after calibration (TS alone). The reliability diagrams show the accuracy within each bin of the confidence score range, and the dots on the curves show the average confidence score within each bin. The diagonal line represents a perfectly calibrated model that is neither under-confident nor over-confident. After calibration, the reliability diagram and curve move closer to the diagonal, which means the confidence scores become more meaningful. The histograms show that the confidence scores also become more evenly distributed after calibration. These results are based on a test dataset in which ~22% of the utterances are irrelevant.

Before and after calibration histograms, showing improved confidence
Calibration Plots showing the impact of Temperature Scaling (TS) alone
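For readers who want to draw a similar plot, a minimal matplotlib sketch of a reliability diagram might look like the following, reusing the same equal-width binning as in the ECE sketch above (variable and function names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_reliability_diagram(confidences, predictions, labels, n_bins=10):
    """Bar-plot per-bin accuracy against the perfect-calibration diagonal."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    accuracies = np.zeros(n_bins)
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracies[i] = (predictions[in_bin] == labels[in_bin]).mean()
    plt.bar(centers, accuracies, width=1.0 / n_bins, edgecolor="black",
            alpha=0.7, label="accuracy per bin")
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray",
             label="perfect calibration")
    plt.xlabel("confidence")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()
```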

All of these calibration methods improve model reliability, as reflected in a lower ECE. With TS alone, ECE dropped to ~10% from >20% before calibration; TS combined with LS reduced ECE slightly further, but the improvement is not significant; LS alone is not as effective as TS alone.

Besides calibration, my colleague attempted to address the confidence score reliability issue by experimenting with alternative output layer architectures, such as using multiple logistic regression functions instead of softmax. I will share those results in a future post.

Summary

Temperature Scaling (TS), despite being a simple technique, is surprisingly effective at calibrating predictions. Our experience confirms that applying TS to BERT-based classifiers improves the reliability of the model output and yields meaningful confidence scores. Label Smoothing (LS) also helps, but TS contributes more, and the two can be combined to further improve model reliability.

As shown in the following image, after calibration, when a prediction has a low confidence score, we know the model does not clearly understand the input. The virtual agent can then ask the customer for clarification or flag the message as irrelevant. This helps our virtual agents provide more reasonable responses and leads to more satisfied customers.

Schematic input and output from an imagined conversation with a virtual agent, with high confidence
After Calibration (This Photo by Unknown Author is licensed under CC BY-SA and edited by the author)

Acknowledgments

This work was done collaboratively by Conversational AI and Conversation Platform teams. I would like to thank and acknowledge my colleagues for their help in dataset curation and for useful feedback on this work. Special thanks to mani najmabadi, Ramji Chandrasekaran, Kevin Womack, Maria Bell, Sunyinge, and Zoe Yang for providing helpful suggestions for this article!

References:
1. C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On Calibration of Modern Neural Networks,” arXiv:1706.04599 [cs], Aug. 2017, Accessed: Mar. 03, 2021. [Online]. Available: http://arxiv.org/abs/1706.04599.
2. R. Müller, S. Kornblith, and G. Hinton, “When Does Label Smoothing Help?,” arXiv:1906.02629 [cs, stat], Jun. 2020, Accessed: Mar. 03, 2021. [Online]. Available: http://arxiv.org/abs/1906.02629.
