Are your models well calibrated?

A model can have very good accuracy but still be badly calibrated!

Olivier Caelen
The Modern Scientist
4 min read · Nov 25, 2022


Introduction

Suppose we have a supervised training dataset (e.g., images labeled as trees or houses) and that this dataset is used by a machine learning algorithm to train a predictive model (more specifically, a probabilistic classifier).

This predictive model can now return a “score” indicating whether a (new) image is a tree or a house.

For example, the predictive model returns a score of 0.8 indicating that an image is a house. Question: can we consider this score of 0.8 as a probability? Before answering this question, let us first define what we mean by interpreting the output of our model as a probability.

Interpreting the output of a model as a probability

Suppose a validation set is available to test the model on new, unseen examples. The predictive model assigns a score to every image in the validation set, and suppose we select all the samples with the same score, e.g. 0.8. If we interpret the output score of our model as a probability, then we should expect that 80% of the observations with a score equal to 0.8 are images of houses.

Let h() be the predictive model. More formally, what we want is that if the model h() returns a score of 0.8 that an image x is a house, then the probability that this image x really is a house should be exactly 0.8.

And more generally, we have a calibrated model if this property holds for all probabilities p:

P( Y = house | h(X) = p ) = p, for all p in [0, 1]
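To make this concrete, here is a minimal simulated sketch of the check described above (the data, the variable names and the 0.02 tolerance are purely illustrative): we generate scores and labels that are perfectly calibrated by construction, select the samples whose score is close to 0.8, and look at the fraction of “houses” among them.

import numpy as np

rng = np.random.default_rng(0)

# Simulated validation set: scores in [0, 1] and labels drawn so that
# P(house | score = p) = p, i.e. a perfectly calibrated model by construction
scores = rng.uniform(0, 1, size=100_000)             # scores h(x)
is_house = rng.uniform(0, 1, size=100_000) < scores  # True = house

# Select the samples whose predicted score is (approximately) 0.8
mask = np.abs(scores - 0.8) < 0.02
print(is_house[mask].mean())  # close to 0.8 for a calibrated model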

Now… to check if my model is properly calibrated, I need a way to measure it → we use the Calibration Plot for that.

Calibration Plot

It is a visual tool to evaluate the agreement between the predictions made by the model and the observations on a validation set. The easiest way to understand the Calibration Plot is perhaps to start with an example.

The x-axis represents the predicted score returned by the model, and the y-axis is the corresponding fraction of true positives. The curve of an ideally calibrated predictive model is the straight blue dotted line going from (0, 0) to (1, 1). We can see in this example that the model is rather badly calibrated. It overestimates small scores (e.g., when the predicted value is about 0.3, the fraction of true positives is about 0.1) and underestimates high scores (e.g., when the predicted value is about 0.8, the fraction of true positives is about 0.9).

Scikit-learn provides an easy way to obtain a calibration plot from a fitted predictive model and a validation set.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibrationDisplay
import matplotlib.pyplot as plt

# Load scikit-learn's 8x8 handwritten-digits dataset
digits = load_digits()
X = digits.data
y = digits.target
y = y % 2  # Transform into a binary problem: odd vs. even digit

# Hold out half of the data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=42)
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Calibration plot of the fitted model on the validation set
CalibrationDisplay.from_estimator(rf, X_val, y_val, n_bins=10)
plt.show()
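If you also want the numbers behind the plot, one option (a minimal sketch, continuing with the rf, X_val and y_val objects defined above) is scikit-learn's calibration_curve, which returns the per-bin values that the plot is drawn from:

from sklearn.calibration import calibration_curve

# Scores of the positive class on the validation set
y_scores = rf.predict_proba(X_val)[:, 1]

# prob_true[i]: fraction of actual positives in bin i (y-axis of the plot)
# prob_pred[i]: mean predicted score in bin i (x-axis of the plot)
prob_true, prob_pred = calibration_curve(y_val, y_scores, n_bins=10)
print(prob_pred)
print(prob_true)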

Main causes of calibration problems

The calibration problem of a predictive model can have several causes. We give here a short list of the main ones.

The loss function that is minimized during learning plays an important role. Models that minimize a cross-entropy (log) loss tend to be well calibrated (e.g., logistic regression, simple neural networks). Models that do not directly minimize cross-entropy often have calibration issues (e.g., tree-based models such as decision trees, random forests and XGBoost, SVMs, KNN, …).
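As a rough illustration of this point, here is a sketch (reusing X_train, y_train, X_val, y_val and rf from the snippet above; the figure handling is just one way to overlay the two curves) that compares the calibration plot of the random forest with that of a logistic regression trained on the same data:

from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibrationDisplay
import matplotlib.pyplot as plt

# Logistic regression directly minimizes a cross-entropy (log) loss
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)

# Overlay both calibration curves on the same axes
fig, ax = plt.subplots()
CalibrationDisplay.from_estimator(rf, X_val, y_val, n_bins=10, ax=ax, name="Random forest")
CalibrationDisplay.from_estimator(logreg, X_val, y_val, n_bins=10, ax=ax, name="Logistic regression")
plt.show()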

Modern deep learning models tend not to be well calibrated (see “On Calibration of Modern Neural Networks”, Chuan Guo et al., 2017).

If the dataset is imbalanced and we use rebalancing techniques (e.g., undersampling or oversampling), we change the prior class probabilities seen during training, which leads to calibration problems.
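A minimal sketch of this effect (synthetic data and illustrative names, not from the article): we train the same model on the original imbalanced training set and on a 50/50 undersampled version, then compare the average predicted score on an untouched validation set with the true positive rate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced binary problem with roughly 10% positives
X, y = make_classification(n_samples=20_000, weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=0)

# Rebalance the training set by undersampling the majority class to a 50/50 ratio
pos = np.where(y_tr == 1)[0]
neg = np.where(y_tr == 0)[0]
keep_neg = np.random.default_rng(0).choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, keep_neg])

model_orig = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
model_rebal = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])

print("True positive rate (validation):", y_va.mean())
print("Mean score, original training:  ", model_orig.predict_proba(X_va)[:, 1].mean())
print("Mean score, rebalanced training:", model_rebal.predict_proba(X_va)[:, 1].mean())
# The rebalanced model's mean score is far above the true positive rate:
# its scores can no longer be read directly as probabilities.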

Conclusion

The importance of calibration issues tends to be underestimated by the data science community. I hope this blog post makes you aware of the risk of simply treating the output of a model as a probability without checking…
