Probability Calibration Essentials (with code)

Rajneesh Tiwari
Published in Analytics Vidhya · Oct 17, 2019

In the last post, I described how the AUC metric works. If you haven’t read through the post, here is the link (link). To do a quick recap, I discussed in detail the usage of AUC as a metric for model selection and took a quick look at some fallacies in doing so.

In this post, I will expand on the same topic and highlight the concept of probability calibration.

One thing to keep in mind is that probability calibration is only useful when we are interested in the probabilities themselves. One such case is when the metric used for model selection is log loss. With that in mind, let's move forward.

First, let's talk about what probability is and what it represents.

Probability is a measure of confidence in an event. A model can be under-confident, over-confident, or just right in the probabilities it reports. Let me explain what I mean by these terms.

Consider a typical data science workflow. We build a model pipeline as follows:

Typical ML Workflow Pipeline (source)

We start with the input data, followed by cleaning and transforming the variables. Then we divide the dataset into two parts, namely train and test. We import the relevant model class from sklearn (yes, sklearn, because it’s still the most popular option out there for data scientists) to learn the patterns in the train dataset; this yields the model object.

We can save a serialized version of the model object as a .pkl file if we expect the same model to be used multiple times over different test datasets.

Lastly, we call the predict/predict_proba method to get the output probabilities.
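As a rough sketch of these last two steps (the toy data and model below are purely illustrative and are not from the post):

import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression().fit(X, y)

# Serialize the fitted model object to disk as a .pkl file
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later: load it back and get class probabilities for new data
with open("model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

probs = loaded_model.predict_proba(X[:5])  # column 1 = P(class == 1)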

Model Characteristics

Now let's go back to defining over-confident and under-confident models in the context of probability. Consider a case where we are predicting the chance of rain on each of the next 10 days.

Hypothetical Model for Predicting Rain over next 10 days

Note that if the model's average predicted probability is higher than the actual fraction of positive outcomes, the model is over-confident; if it is lower, the model is under-confident. If the average prediction matches the actual fraction of positives, we can say the model is just right. For example, if the model predicts an 80% chance of rain on average but it actually rains on only 6 of the 10 days (60%), the model is over-confident.

Code: All hands on deck!

First, we will generate some random classification data using sklearn's make_classification; we will also create an 80% train / 20% validation split.

Code: Make dataset and train-test split
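A minimal sketch of this step could look like the following (the make_classification parameters and random seeds are assumptions, not necessarily the author's):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Random binary classification data (parameter values are assumed)
X, y = make_classification(n_samples=10000, n_features=20,
                           n_informative=5, random_state=42)

# 80% train / 20% validation split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)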

Next, we will train 3 models as listed below:

  1. Logistic Regression
  2. Naive Bayes
  3. Support Vector Classifier
Code: Model train and inference
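A minimal sketch of this step, reusing X_train/X_test from the previous sketch (again an assumption, not the author's exact code):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Default hyperparameters throughout; probability=True lets SVC expose
# predict_proba (sklearn fits an internal Platt-scaling step for this).
models = {
    "LogReg": LogisticRegression(),
    "NB": GaussianNB(),
    "SVC": SVC(probability=True),
}

test_probs = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    # Keep only the probability of the positive class (column 1)
    test_probs[name] = clf.predict_proba(X_test)[:, 1]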

We have used the default parameters for model training.

We will plot the probability distribution of test probabilities to see if there are any significant differences in trends.

Code: Probability Distribution
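One way to sketch this plot (assuming the test_probs dictionary from the earlier sketch, and using seaborn for the density estimate):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 5))
for name, probs in test_probs.items():
    # Kernel density estimate of the predicted positive-class probabilities
    sns.kdeplot(probs, label=name)
plt.xlabel("Predicted probability of positive class")
plt.ylabel("Density")
plt.title("Probability distribution per classifier")
plt.legend()
plt.show()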

Here’s the output of the probability distribution plot for the three classifiers on the test dataset.

Probability Distribution via Density plot

We can already see some stark differences in the way the three models have computed output probabilities.

  1. Logistic Regression has a fairly even distribution that is not concentrated anywhere in particular. There is a slight accumulation of probabilities around 0 and 1, but it isn’t very significant.
  2. Naive Bayes has a somewhat higher concentration of probabilities around 0 and 1, but still has some values in the mid range.
  3. SVC has almost all of its values concentrated around 0 and 1, with negligible values in the mid probability range.

Let's also look at the AUC-ROC plots for the classifiers so we can better visualize the entire analysis.

Code: AUC ROC Plot
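A sketch of the ROC plot, again assuming y_test and test_probs from the earlier sketches:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

plt.figure(figsize=(8, 5))
for name, probs in test_probs.items():
    fpr, tpr, _ = roc_curve(y_test, probs)
    auc = roc_auc_score(y_test, probs)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], "k--", label="Chance")  # random-guess baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("AUC-ROC plot")
plt.legend()
plt.show()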

Here’s the output with the AUC-ROC plots for the three classifiers.

AUC — ROC plot for the three classifiers

So, SVC has the highest AUC of 0.962, followed by NB (0.892) and LogReg (0.859).

The next logical question is: can we conclude that SVC is the best model based on the AUC score?

Well, not really. We also need to look at how confident the model is, based on the predicted probabilities and the actual values. This calibration analysis will allow us to apply informed post-processing techniques to the probabilities so that the final output results in a better score (log loss, Brier score, etc.).

We will take two approaches:

  1. Visualization driven approach to understand model calibration
  2. Metric driven approach

Model calibration Viz via Reliability plots

In principle, to measure the extent of reliability (and, by extension, how much calibration still needs to be done), we do the following (see the sketch after this list):

  1. Group the predictions into bins
  2. Compute the fraction of actual positives and the average predicted probability per bin
  3. Compare the two to gauge the confidence per bin
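In code, the binning step could look roughly like this (a hypothetical helper, not from the post; it assumes numpy arrays such as y_test and the per-model probabilities from the earlier sketches):

import numpy as np

def reliability_table(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width bins and summarise each bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # bin index for each prediction
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue  # skip empty bins
        rows.append({
            "bin": b,
            "count": int(mask.sum()),
            "avg_predicted_prob": y_prob[mask].mean(),     # confidence per bin
            "fraction_of_positives": y_true[mask].mean(),  # observed accuracy
        })
    return rows

# e.g. reliability_table(y_test, test_probs["SVC"])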

So let's go ahead and start with the visualization approach to understand what calibration means.

  1. Visual approach to calibration

Let's start with the question: what exactly are we looking for in this viz? There are three things we are going to do:

a) Plot the actual fraction of positives per bin on the y-axis and the average predicted probability per bin on the x-axis. This tells us how our model behaved vis-a-vis the actual outcomes.

b) Plot the diagonal line representing a perfectly calibrated classifier. This effectively means that the actual fraction of positives = predicted probability average. This is our benchmark.

c) Observe the trend of our classifier's curve vis-a-vis the perfectly calibrated one. If the points lie below the diagonal/45-degree line (predicted probability higher than the actual fraction of positives), our classifier is over-confident; if the points lie above the benchmark line, our classifier is under-confident.

So let's write the code for the reliability curve.

Code: Calibration plot
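A sketch of the reliability plot using sklearn's calibration_curve, with the Brier score loss shown in the legend (names reused from the earlier sketches):

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

plt.figure(figsize=(8, 6))
# Benchmark: a perfectly calibrated classifier lies on the diagonal
plt.plot([0, 1], [0, 1], "k--", label="Perfectly calibrated")

for name, probs in test_probs.items():
    frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    brier = brier_score_loss(y_test, probs)
    plt.plot(mean_pred, frac_pos, marker="o",
             label=f"{name} (Brier = {brier:.3f})")

plt.xlabel("Average predicted probability per bin")
plt.ylabel("Fraction of positives per bin")
plt.title("Calibration (reliability) plot")
plt.legend()
plt.show()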

Below are the output calibration (reliability) plots for the three classifiers in question.

Calibration plots for the three classifiers

Notice the different ways in which the actual predictions try to hug the perfect-calibration line. For example, LogReg exhibits a logistic-shaped curve while doing so, which makes sense (think about why and post in the comments).

SVC, on the other hand, hugs the benchmark calibration line very closely, which means its average predicted probabilities are very close to the actual fraction of positives in each bin; this implies stable probabilities and higher reliability.

So far, then, the visual approach suggests that SVC is the most appropriate in terms of providing well-calibrated probabilities.

2. Metric based approach to calibration

If you refer to the graph outputs above, you will see there is a number attached to each graph legend. For SVC this number is 0.031, for NB it is 0.153, and for LogReg it is 0.149. This is the Brier score loss, which tells us how well calibrated the model outputs are.

Brier score loss: across all items in a set of N predictions, the Brier score measures the mean squared difference between (1) the predicted probability assigned to the possible outcomes for item i and (2) the actual outcome.

Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated. Note that the Brier score always takes on a value between zero and one, since this is the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 and 1).

Brier Score (source)
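Written out for the binary case (where p_i is the predicted probability for item i, o_i is the actual outcome taking the value 0 or 1, and N is the number of predictions):

BS = (1/N) * Σ_i (p_i - o_i)^2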

So, to conclude the model selection: SVC not only has the highest AUC-ROC but also the lowest Brier score loss; hence, it is the most appropriate for the modeling task in this case.

Next up: what do we do if we want to calibrate the other two classifiers (NB and LogReg)? We can use either parametric methods such as Platt scaling or non-parametric ones such as isotonic regression. Since isotonic regression does not make any parametric assumptions, we will use it here.

Code: Isotonic Regression Calibration
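A sketch of isotonic calibration using sklearn's CalibratedClassifierCV, applied to all three models to mirror the plots below (the cv value is an assumption):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import brier_score_loss

base_models = {
    "LogReg": LogisticRegression(),
    "NB": GaussianNB(),
    "SVC": SVC(probability=True),
}

calibrated_probs = {}
for name, clf in base_models.items():
    # Wrap each base model in an isotonic-regression calibrator; cv=3 refits
    # the base model and learns the calibration map on internal CV folds.
    calib = CalibratedClassifierCV(clf, method="isotonic", cv=3)
    calib.fit(X_train, y_train)
    calibrated_probs[name] = calib.predict_proba(X_test)[:, 1]
    print(name, "Brier score after isotonic calibration:",
          round(brier_score_loss(y_test, calibrated_probs[name]), 3))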
Plots: Isotonic Calibration Plots for the 3 classifiers

Conclusion: One can see that after calibration the Brier score loss for NB and LogReg has gone down (hence improved); however, for SVC the score has increased (become worse). This is perfectly justified, as SVC was already well calibrated and we did not really need to run an isotonic-regression-based calibration on it. In an actual analysis, one would run this calibration only on NB and LogReg, since they are the ones with the poorer (higher) Brier score loss.

Caveat: Note that calibration might negatively affect the metric if you are looking at non-probabilistic metrics such as accuracy, F1, etc.

With this, we conclude our analysis of probability calibration. I hope you have enjoyed this post.

References:

  1. https://scikit-learn.org/stable/modules/calibration.html
  2. https://youtu.be/FkfDlOnQVvQ
  3. https://youtu.be/RXMu96RJj_s
