Model Calibration in Machine Learning

Heinrich Peters
Jul 9, 2023 · 6 min read

Machine learning models are not only expected to make accurate predictions but also to estimate their confidence in these predictions reliably. The ability of a classification model to provide accurate probability estimates is known as calibration. In this post, I will delve into the concept of calibration in machine learning, discuss its importance, explore methods for achieving it, and address common challenges that come with it.

Beyond Model Accuracy

The predicted probabilities produced by a well-calibrated model truly reflect the likelihood of a particular outcome. For instance, among all instances where the model predicts an outcome with 80% probability, that outcome should indeed occur approximately 80% of the time. Calibration goes beyond traditional accuracy measures to ensure that the confidence of a model in its predictions aligns with its actual performance.
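
To make this concrete, here is a minimal sketch of such a frequency check in Python; the predicted probabilities and outcomes below are synthetic placeholders, generated so that the hypothetical model is calibrated by construction.

```python
import numpy as np

# Synthetic placeholders: predicted probabilities and outcomes drawn so that
# the hypothetical "model" is calibrated by construction.
rng = np.random.default_rng(0)
probs = rng.uniform(0.0, 1.0, size=10_000)
labels = rng.binomial(1, probs)

# Bin the predictions and compare the mean predicted probability in each bin
# with the observed frequency of the positive outcome.
bin_edges = np.linspace(0.0, 1.0, 11)
bin_ids = np.digitize(probs, bin_edges[1:-1])
for b in range(10):
    mask = bin_ids == b
    if mask.any():
        print(f"bin {b}: mean predicted = {probs[mask].mean():.2f}, "
              f"observed frequency = {labels[mask].mean():.2f}")
```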

Calibration is crucial in many areas where decision-making relies not only on the model’s predictions but also on its estimated uncertainties. In healthcare, for instance, a well-calibrated model can help doctors assess the risk of disease more accurately. In finance, calibrated models can provide more reliable estimates of investment risk.

Many machine learning models, especially complex ones like neural networks, are not inherently calibrated. However, several techniques can improve model calibration. These include Platt scaling and isotonic regression, which adjust the model’s probability outputs to align better with true outcome frequencies.

Platt Scaling

In Platt scaling, the raw output scores of the model are transformed into probabilities by fitting a logistic regression model to those scores. The logistic regression model used in Platt scaling is mathematically expressed as P(y=1|x) = 1 / (1 + exp(A·f(x) + B)), where f(x) represents the output score of the classification model, y=1 stands for the positive class, and A and B are parameters learned by fitting the logistic regression model to a held-out validation (calibration) dataset.
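
As a rough sketch of what this looks like in practice (using scikit-learn; the linear SVM, synthetic data, and splits are illustrative choices rather than part of the method itself), A and B can be estimated by fitting a one-dimensional logistic regression on the raw scores of a held-out calibration set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Illustrative setup: a linear SVM whose decision scores f(x) are not probabilities.
X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

svm = LinearSVC().fit(X_train, y_train)

# Platt scaling: fit a one-dimensional logistic regression on the raw scores of
# a held-out calibration set. Its coefficient and intercept correspond to -A and
# -B in the formula above (scikit-learn uses the opposite sign convention).
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression().fit(scores_cal, y_cal)

# Calibrated probabilities for new data.
scores_test = svm.decision_function(X_test).reshape(-1, 1)
calibrated_probs = platt.predict_proba(scores_test)[:, 1]
```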

While Platt scaling proves to be a robust and widely applicable method, it is not without its assumptions and limitations. The principal assumption in Platt scaling is that the raw output scores from the classification model are linearly related to the log odds of the positive class. Should this assumption fail to hold, Platt scaling might not provide the expected results. In such cases, alternative methods like isotonic regression can be applied, as they do not impose a specific parametric form on the relationship between the output scores and probabilities.

Isotonic Regression

Calibration can also be performed by fitting an isotonic (monotonically non-decreasing) function to the output scores. Isotonic regression does not assume a specific parametric form for the mapping from raw model scores to probabilities. Instead, it learns a piecewise-constant, non-decreasing function that best fits the calibration data, typically under a squared-error criterion.
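
A minimal sketch using scikit-learn's IsotonicRegression, with made-up scores and labels standing in for a real calibration set:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Illustrative raw classifier scores on a calibration set, with true labels.
raw_scores = np.array([-2.1, -1.3, -0.4, 0.2, 0.9, 1.5, 2.4, 3.0])
labels = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Learn a non-decreasing, piecewise-constant mapping from scores to probabilities.
# Scores outside the observed range are clipped to the boundary values.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, labels)

print(iso.predict(np.array([-1.0, 0.5, 2.0])))
```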

Isotonic regression is advantageous when the relationship between the raw output scores and the true probabilities is complex and non-linear, since the learned non-decreasing function can capture such patterns. This flexibility cuts both ways, however: isotonic regression may overfit the calibration data, especially when the amount of calibration data is limited. Additionally, isotonic regression only learns the mapping over the range of scores observed during calibration. If test data produce scores outside that range, it cannot provide genuinely calibrated probabilities for them; at best, it clips them to the boundary values.

Models that Inherently Provide Probability Scores

While many machine learning models produce a raw output score that must be transformed or calibrated to provide a meaningful probability estimate, there are several models that can inherently output pseudo-probabilities.

Logistic regression, for example, is a common model for binary classification that generates probability scores. It models the log odds of the probability of the positive class as a linear combination of the input features. A sigmoid function is then applied to these log-odds to output a probability between 0 and 1. Naive Bayes classifiers calculate probabilities directly from the distribution of the training data. Decision trees can estimate the probability of an instance belonging to a particular class based on the proportion of instances of that class in the leaf node where the instance falls. Random forests, by extension, can also output probabilities by averaging the probability estimates of the individual trees. Similarly, gradient boosting models, especially implementations like LightGBM and XGBoost, are capable of producing probabilities by using a logistic link function to transform the weighted sum of the predictions of all the trees.
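
For illustration, the short scikit-learn sketch below shows that such models expose probability estimates directly via predict_proba; the synthetic dataset and default model settings are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data; each of these models exposes predict_proba out of the box.
X, y = make_classification(n_samples=1_000, random_state=0)

for model in (LogisticRegression(max_iter=1_000),
              GaussianNB(),
              RandomForestClassifier(random_state=0)):
    probs = model.fit(X, y).predict_proba(X)[:, 1]  # probability of the positive class
    print(type(model).__name__, probs[:3].round(3))
```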

However, it is important to note that the calibration of these out-of-the-box probability scores can depend on various factors, such as model configuration, the representativeness of the training data, and the complexity of the task. It is, therefore, always good practice to evaluate the calibration of a model’s output probabilities and, if necessary, apply an additional calibration method like Platt scaling or isotonic regression to further improve them.
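
One way to carry out such a check, sketched here with scikit-learn on synthetic data, is to compare binned predicted probabilities against observed outcome frequencies (a reliability curve) and to compute a proper scoring rule such as the Brier score:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Illustrative setup: inspect a random forest's out-of-the-box probabilities.
X, y = make_classification(n_samples=5_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
probs = rf.predict_proba(X_test)[:, 1]

# Observed frequency vs. mean predicted probability per bin; points near the
# diagonal indicate good calibration. The Brier score summarizes calibration
# and sharpness in a single number (lower is better).
frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=10)
print(list(zip(mean_predicted.round(2), frac_positive.round(2))))
print("Brier score:", brier_score_loss(y_test, probs))
```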

Common Challenges in Model Calibration

Model calibration is an integral part of machine learning workflows, particularly when interpretability and confidence in predictions are important. Yet, the process is not without its challenges, ranging from the assumptions underlying calibration techniques to issues of overfitting and the need for separate calibration sets.

First, each calibration technique relies on certain assumptions. For example, Platt scaling presumes a logistic relationship between the raw model outputs and the log odds of the positive class. This might not always hold true depending on the nature of the model or data. On the other hand, isotonic regression assumes a monotonic relationship between the raw output and probabilities but makes fewer assumptions about the form of the calibration function. These assumptions, if violated, can result in inaccurate calibration.

Second, calibration methods can overfit, especially when the calibration set is small. Overfitting in calibration means that the method captures noise or specific peculiarities in the calibration data instead of the general pattern, so the calibrated probabilities may not generalize well to unseen data. This issue is particularly prevalent with isotonic regression, given its greater flexibility compared to Platt scaling.

Third, model calibration demands a separate dataset that hasn’t been used during the training phase. When data availability is limited, sourcing a separate calibration set can be a challenge. Using the same dataset for both training and calibration can lead to overly optimistic probability estimates and a potential overfitting scenario.
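
When no dedicated holdout can be spared, one common workaround, sketched below with scikit-learn's CalibratedClassifierCV on synthetic data, is cross-validated calibration: the base model is trained on k-1 folds and the calibrator is fitted on the remaining fold, repeated across all folds.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2_000, random_state=0)

# Cross-validated calibration: the base model is trained on k-1 folds and the
# calibrator (here Platt scaling, method="sigmoid") is fitted on the held-out
# fold, repeated across all folds, so no dedicated calibration set is needed.
model = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
probs = model.predict_proba(X)[:, 1]
```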

Fourth, while the focus so far has been on binary classification, calibrating probabilities for multi-class settings adds a layer of complexity. Both Platt scaling and isotonic regression can be extended to multi-class scenarios, but these extensions often increase computational expense and complexity.
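
For example, scikit-learn's CalibratedClassifierCV handles multi-class problems by calibrating each class in a one-vs-rest fashion and normalizing the resulting probabilities, as in the brief sketch below (the iris data and linear SVM are merely illustrative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Multi-class case: each class is calibrated one-vs-rest and the resulting
# probabilities are normalized to sum to one.
X, y = load_iris(return_X_y=True)
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5).fit(X, y)
print(clf.predict_proba(X[:3]))  # each row sums to 1 across the three classes
```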

Finally, calibration, while beneficial for the interpretability and reliability of probability estimates, does not enhance a model’s discriminative power. A model with subpar performance will not improve through calibration — it will merely provide more reliable probability estimates. Additionally, calibration does not address inherent biases or shortcomings in the original model.

Let’s Get Philosophical

Model calibration transcends mere technicalities — it entails philosophical considerations about the inherent complexities and uncertainties of interpreting models and decision-making. Fundamental questions arise regarding the interpretation of probability scores produced by models — for example, whether we see them as long-term frequencies in repeated trials (frequentist interpretation) or as degrees of belief or confidence (Bayesian interpretation).

Trust in model predictions is another interesting point, as models outputting well-calibrated probabilities may be perceived as more reliable. However, it is important to note that a well-calibrated model does not necessarily signify “correctness” but rather reflects the alignment between the model’s confidence and its actual accuracy.

Interesting trade-offs also surface between predictive accuracy and calibration. For instance, a highly complex model may yield excellent accuracy but poor calibration, whereas a simpler model might provide less accuracy but well-calibrated probabilities, raising questions about which qualities to prioritize in a given context. Finally, ethical considerations come into play, particularly in high-stakes applications like medicine or finance, where miscalibrated probabilities can lead to impactful consequences.

Conclusion

In conclusion, calibration is an essential aspect of machine learning that ensures not only the accuracy of predictions but also the reliability of the estimated confidence. As machine learning and AI increasingly inform critical decision-making, the demand for well-calibrated models will continue to grow. Thus, understanding and improving model calibration is an important direction for ongoing machine learning research and practice.
