Expedia Group Technology — Data

Calibrating BERT-based Intent Classification Models: Part-1

Why model reliability is as important as its predictivity

Ramji Chandrasekaran
Expedia Group Technology


Photo by Laura Ockel on Unsplash

At Expedia Group™, we strive to improve customer satisfaction by providing a frictionless shopping and support experience. A core part of this is the set of machine learning models that power Expedia Group’s virtual agents, which are available 24x7 to guide customers through changes in their travel plans, pull up information about their upcoming trips, book new trips and much more. Each capability has an associated ‘intent’, which is identified by an intent classification model.

However, natural language is messy, and not all customer utterances are actionable, or even relevant. Therefore, the predicted probabilities of a classification model (an intent classifier in this case) should be reliable: a misclassification should not come with a high probability. Without usable confidence scores, our virtual agent cannot disambiguate between intents and ask our customers clarifying questions. An unreliable model therefore leads to dissatisfied customers, lost revenue and a damaged reputation. How do we ensure that a model is reliable without hurting its predictivity? Calibration.

Calibration is a method for disincentivizing a model from being over-confident, i.e. tempering its tendency to produce high probability scores for all inputs. Before applying calibration to a model, we must first know whether the model is already well calibrated, and if not, what its calibration error is. In Part 1 of this two-part blog post, we explore how to visualize model reliability and compute calibration error. In Part 2, we discuss a few calibration methods that we applied to our classification models.

An unreliable model leads to dissatisfied customers, loss of revenue and bad PR.

Background

A telling sign of an uncalibrated model is a lack of variance in its predicted probabilities: roughly 0.99 for all inputs. This is especially true of classifiers with high accuracy. To understand why this happens, we need to look at how classifiers are typically trained. A classifier takes inputs in the form <X, y>, where X is the input represented as a vector of numbers, and y is the target variable, or label, denoting the class to which X belongs. For a simple classification model, y is represented using one-hot encoding, and the model outputs a vector of probabilities that sum to 1.
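As a minimal illustration (toy numbers, not taken from our models), here is what a one-hot target and the corresponding model output look like for a 3-class problem:

```python
# Toy illustration of a one-hot target and a softmax output for 3 classes.
import numpy as np

y = np.array([0.0, 1.0, 0.0])                   # one-hot target: the true class is class 1
logits = np.array([1.2, 3.5, 0.3])              # raw model scores for some input X
probs = np.exp(logits) / np.exp(logits).sum()   # softmax: a probability vector summing to 1

print(probs.round(2), probs.sum().round(2))     # [0.09 0.88 0.04] 1.0
```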

During training, the model tries to minimize the difference between the one-hot targets and its own predictions. It does so by maximizing the value corresponding to the true class label and minimizing all other values. This produces a probability distribution that looks like the one below:

The distribution of predicted probabilities, grouped into 10 bins of size 0.1. For an uncalibrated model, the bulk of the predicted values falls into the last bin (0.9 to 1.0).
Figure 1: Distribution of Predicted Probabilities of True Class
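A plot like Figure 1 can be produced with a few lines of NumPy and Matplotlib. This is only a sketch: `probs` and `labels` below are random placeholders standing in for real model predictions and true class indices.

```python
# Sketch of a Figure 1-style histogram; `probs` and `labels` are placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(5), size=800)       # placeholder: 800 examples, 5 classes
labels = rng.integers(0, 5, size=800)             # placeholder true class indices

p_true = probs[np.arange(len(labels)), labels]    # probability assigned to the true class
plt.hist(p_true, bins=np.linspace(0.0, 1.0, 11))  # 10 equal-width bins of size 0.1
plt.xlabel("Predicted probability of true class")
plt.ylabel("Count")
plt.show()
```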

Although the intent classifier that produced the probability distribution in Figure 1 has an F1-score of 92%, its predictions cannot be fully relied upon: it simply predicts a high probability for the majority of inputs, irrespective of whether the prediction is correct. To verify that this skewed probability distribution is symptomatic of poor calibration, we should:

  • Plot a Reliability Diagram [1]
  • Compute Expected Calibration Error [1]

Note: Predicted probabilities are also called confidence scores, as they represent a model’s confidence in its prediction. We use the two terms interchangeably in this post.

Reliability diagrams

Reliability diagrams plot the accuracy of a model as a function of its confidence score. The x-axis shows the average confidence score of equal-width bins, and the y-axis shows the average accuracy in each bin. For a well calibrated model, the diagram traces the identity function; any deviation is evidence of mis-calibration. The intuition behind reliability diagrams is that, over a set of inputs, the average accuracy of a model should equal the average confidence score of its predictions.
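As a rough sketch (not our production code), a reliability diagram can be built from two arrays: the model’s top predicted probability per example and whether that prediction was correct. Both arrays below are simulated placeholders for illustration.

```python
# Sketch of a reliability diagram; `confidences` and `correct` are placeholders.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
confidences = rng.uniform(0.5, 1.0, size=800)          # placeholder confidence scores
correct = rng.uniform(size=800) < confidences - 0.15   # placeholder: an over-confident model

bins = np.linspace(0.0, 1.0, 11)                       # 10 equal-width bins
bin_ids = np.digitize(confidences, bins[1:-1])         # bin index per example

avg_conf, avg_acc = [], []
for b in range(10):
    mask = bin_ids == b
    if mask.any():                                     # skip empty bins
        avg_conf.append(confidences[mask].mean())
        avg_acc.append(correct[mask].mean())

plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
plt.plot(avg_conf, avg_acc, "o-", label="model")
plt.xlabel("Average confidence (per bin)")
plt.ylabel("Accuracy (per bin)")
plt.legend()
plt.show()
```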

Shown below are two reliability diagrams generated from the predictions of two intent classification models, one uncalibrated and one calibrated, on a dataset of ~800 examples.

Reliability diagram for an uncalibrated model, with bucketed confidence scores (probabilities) on the x-axis and average accuracy on the y-axis. The confidence scores in all buckets are significantly lower than the corresponding accuracy values.
Figure 2: Uncalibrated Model

The uncalibrated model clearly does not trace the identity function; in fact, only 4 confidence bins are populated. The calibrated version has a much better spread and nearly traces the identity function.

Reliability diagram for a calibrated model, with bucketed confidence scores (probabilities) on the x-axis and average accuracy on the y-axis. The confidence scores in most buckets are close to their corresponding accuracy values, with only a few buckets breaking the trend.
Figure 3: Calibrated Model

Although reliability diagrams provide useful visual information for assessing a model’s calibration, we need a quantitative measure of calibration to compare different models. This is provided by the Expected Calibration Error.

Expected calibration error

ECE, or Expected Calibration Error, is the weighted average of the absolute difference between average confidence and accuracy in each bin, where each bin is weighted by the fraction of examples it contains. Unlike reliability diagrams, ECE therefore takes the number of examples in each bin into account. ECE ranges from 0 to 1, and lower is naturally better. Notice the ECE values in Figures 2 and 3: the lower ECE corresponds to the better reliability diagram. It should be noted that both models have similar predictivity; however, only the calibrated version produces meaningful confidence scores. This underscores the importance of using a scalar measure such as ECE to compare the calibration of different classifiers and to factor it into model selection.
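A hedged sketch of how ECE can be computed, following the same binning scheme described above; it reuses the placeholder `confidences` and `correct` arrays from the reliability-diagram snippet:

```python
# Sketch of Expected Calibration Error: the per-bin |accuracy - confidence|
# gap, weighted by the fraction of examples falling into each bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, bins[1:-1])   # bin index per example
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            weight = mask.mean()                     # fraction of examples in bin b
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += weight * gap
    return ece

# Example, reusing the placeholder arrays from the previous sketch:
# print(expected_calibration_error(confidences, correct))
```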

A poorly calibrated model’s accuracy can be just as good as a well calibrated model’s, except its confidence scores are meaningless and unreliable.

How to calibrate classification models

Using reliability diagrams and ECE, we can identify mis-calibration and assess its severity. But how do we actually calibrate a model without affecting its predictive performance? There are several calibration methods that can be applied to classification models, two of which we use in our models:

  • Temperature Scaling [1]
  • Label Smoothing [2]

In the upcoming Part 2 of this series, my colleague Yao Zhang will elaborate on these methods and how she applied them to our BERT-based intent classifiers.

Acknowledgements

This work was done collaboratively by the Conversation Platform and Conversational AI teams. I would like to thank my colleagues for their help with dataset curation and obtaining model predictions, and for their useful feedback on this work.

Learn more about technology at Expedia Group

References

  1. Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321–1330). PMLR.
  2. Müller, R., Kornblith, S. and Hinton, G., 2019. When does label smoothing help?. arXiv preprint arXiv:1906.02629.
