Model Calibration for Regression

Anuj Shah (Exploring Neurons)
10 min read · Mar 17, 2024


Requirements

  1. To get the most out of this blog, you should understand the concept behind model calibration and how it is done for classification problems.
  2. I have covered that in detail in a separate blog — Model Calibration: A Step Towards Trustworthy and Reliable AI.

Let’s get Started

Besides a model's accuracy, it is equally important to ensure that the model is reliable enough to be used in the real world. Model calibration ensures that models are reliable and that their predictions can be trusted.

In classification problems, the predicted confidence can be used as a measure of uncertainty, provided the model is well-calibrated. Well-calibrated means that if the confidence is 90%, the prediction should be correct 90% of the time.

However, in regression tasks we don't have such confidence scores; the real values are predicted directly. We can use the notion of a confidence interval, but for that we need a distribution rather than a single real-valued output.

Most implementations just predict a single output, not an entire distribution. One way to measure uncertainty in regression is to make the network predict the output distribution rather than just a single value; the predicted variance can then be used as an uncertainty measure. The easiest distribution to imagine is the Gaussian, defined by two parameters: the mean and the variance.

Further, just as in classification we ensure the model is calibrated, and if not, calibrate the predicted confidence before using it as an uncertainty/reliability measure, in regression we also need to ensure the model is calibrated. Here, calibrated means that the predicted variance reflects the actual residual error; if it doesn't, we need to calibrate this variance before using it as an uncertainty/reliability measure.

With the above arguments, these are the steps we need to follow for model calibration in a regression problem:

  1. Modify the regression network to give a probabilistic output, i.e. predict the output distribution rather than a single output value. The variance of the predicted output distribution can then be used as an uncertainty measure (just as confidence is used as an uncertainty measure in classification).
  2. Modify the loss to learn the variance. Please note that we don't need separate ground truth to learn the variance; a slight modification of the loss is enough.
  3. Before we use the variance as our uncertainty measure, ensure the model is calibrated, i.e. measure the calibration error (as we did in classification with ECE, ACE, etc.).
  4. Calibrate the model with methods like Platt scaling or isotonic regression.

Modifying the regression network to output a distribution rather than a single output (going from deterministic to probabilistic)

One of the most common distributions to model our output would be the Gaussian distribution.

A Gaussian distribution is defined by two parameters — the mean and variance

So to emulate this, we need an additional output from our network. The good part is that we do not need ground truth for the variance to learn from. Below is a diagram showing how we can convert a deterministic network that predicts a point estimate into a probabilistic network that predicts the output distribution.

In practice, the model is made to learn log(variance) for numerical stability
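As a minimal sketch of this change (assuming a simple fully connected PyTorch regressor; the layer sizes and names are only illustrative), we add a second head that predicts log(variance) alongside the mean:

import torch
import torch.nn as nn

class ProbabilisticRegressor(nn.Module):
    """Predicts a Gaussian output distribution: a mean and a log-variance."""
    def __init__(self, in_features, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean_head = nn.Linear(hidden, 1)     # point estimate (mu)
        self.log_var_head = nn.Linear(hidden, 1)  # log(sigma^2), for numerical stability

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h), self.log_var_head(h)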

Modifying the loss to learn the variance

Usually, for regression problems, the loss function we use is MSE or MAE.

MSE can actually be derived from the Gaussian distribution, and MAE from the Laplace distribution.

Let's see how we define the Negative Log Likelihood (NLL) assuming a Gaussian distribution, and then how it boils down to Mean Squared Error when we learn a point estimate.
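For reference, the Gaussian density with mean \mu and variance \sigma^2 is

p(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)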

Assuming the above Gaussian density, the output likelihood takes the same form.

Let's say the ground truth is represented by y_i and the prediction from the model is represented by f(x_i)=μ
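Assuming independent samples, each with its own predicted variance \sigma_i^2, the likelihood of the data is

L = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(y_i - f(x_i))^2}{2\sigma_i^2}\right)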

Now, taking the log of the likelihood:
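\log L = \sum_{i=1}^{N} \left[ -\frac{1}{2}\log\left(2\pi\sigma_i^2\right) - \frac{(y_i - f(x_i))^2}{2\sigma_i^2} \right]

The first term depends only on the variance; the second is the squared error scaled by the variance.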

For optimization, we can either maximize the likelihood or, equivalently, minimize the negative log-likelihood. We usually frame learning as minimization, so we take the negative log-likelihood (NLL) as our loss function.

For a network that learns a point estimate, \sigma is fixed (either 1 or some constant), so the only term that gets optimized is the squared-error term on the right-hand side. For the point estimate, the NLL loss therefore becomes the mean squared error between the ground truth and the predicted value.
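In other words, up to additive and multiplicative constants,

\text{NLL} \;\propto\; \frac{1}{N}\sum_{i=1}^{N} \left(y_i - f(x_i)\right)^2 = \text{MSE}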

Whereas for the model that predicts the entire distribution, both terms are optimized, so the NLL loss for the probabilistic output will be:
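\text{NLL} = \frac{1}{N}\sum_{i=1}^{N} \left[ \frac{1}{2}\log\sigma_i^2 + \frac{(y_i - f(x_i))^2}{2\sigma_i^2} \right]

(the constant \tfrac{1}{2}\log 2\pi is dropped since it does not affect optimization). A minimal sketch of this loss, assuming the network outputs a mean and a log-variance as in the earlier sketch (recent PyTorch versions also ship torch.nn.GaussianNLLLoss, which takes the variance directly):

import torch

def gaussian_nll(mean, log_var, target):
    # 0.5 * [ log(sigma^2) + (y - mu)^2 / sigma^2 ], averaged over the batch
    return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).mean()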

Similarly, if we are using MAE, we can modify the network to output a Laplace distribution and modify the loss accordingly. We will take up the Laplace distribution at the end. Now let's continue, assuming we have modified the network to predict a probabilistic output.

Measuring Calibration

For regression tasks, calibration can be defined such that the predicted confidence intervals should match the confidence intervals empirically computed from the dataset (Kuleshov et al., 2018), i.e.
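Formally, following Kuleshov et al. (2018), for every confidence level p the fraction of ground-truth values falling below the predicted p-th quantile should approach p:

\frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\left\{ y_t \le F_t^{-1}(p) \right\} \rightarrow p \quad \text{for all } p \in [0, 1]

where F_t is the CDF of the distribution predicted for sample t.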

To measure calibration in classification, we compared the predicted confidence with the observed confidence (accuracy) in different bins, where each bin covered a range of confidence values. Also, since confidence is used as an uncertainty measure, we calibrated the confidence.

The table below shows the ECE computation for the classification problem.

In regression problems we don't have a confidence score; the prediction is a real number. Also, since the predicted variance is our measure of uncertainty, we need to make sure that the predicted variance is calibrated. This can be done by making sure that the predicted error (variance/std) matches the actual error (predicted mean minus ground truth, i.e. the residuals).

Confidence Interval Revision

Before moving ahead, let's review one more concept, the confidence interval, since we can plot the confidence interval derived from the predicted variance against the residuals.

Confidence Interval

In the context of a regression problem, a confidence interval provides a range of values that is likely to contain the true value with a certain confidence level

For example, based on various environmental data sensor measurements the ML model estimated a rainfall of 45 inches with a 95% confidence interval of [43, 47] inches. This means you’re 95% confident that the amount of rainfall you will receive falls within the interval [43, 47].

Interpretation: Here’s how you typically interpret a confidence interval:

If you were to observe the amount of rainfall many times, say over 100 days, it would not always be 45; it might be 46, 44.5, or even 41. But 95% of the time it would be in the range 43 to 47, i.e. out of 100 days, about 95 days would have rainfall in the range 43 to 47.

Thus, confidence intervals give a notion of the uncertainty in your predictions.

Significance Level: The complement of the confidence level is referred to as the significance level or alpha (α); for example, a 95% confidence level corresponds to α = 0.05. Common confidence levels include 90%, 95%, and 99%, but you can choose a level that fits the requirements of your analysis.

Now the question is how to identify the values for this 95% CI. In the above example, given our predicted distribution with a certain mean and std, how do we find the range of values that will contain 95% of the predictions, i.e. how do we get the numbers 43 and 47?

To understand that, let's start with a simple Gaussian with mean 0 and std 1. We can compute the confidence interval using scipy.stats.norm.

So we can compute the confidence interval by computing the percentile. I have discussed percentiles in detail here — Quantile, Percentile (one tail and two tail distribution), Confidence Interval, Box Plot.

So, the 50th percentile is simply the value below which 50% of the data lies.

For a two-tailed interval (which is our case), we take the 50% of the data centered about the mean, i.e. 25% to the left of the mean and 25% to the right of the mean.

Below is the figure for the one-tailed and two-tailed cases of the 50th percentile, i.e. computing the values that bound 50% of the data. The distribution is a standard normal Gaussian with mean 0 and std 1.

from scipy import stats
norm = stats.norm(loc=0, scale=1)
data_percent = 0.5
# One-tailed: the probability mass below the critical value
alpha = data_percent  # 0.5
# Calculate the critical value
critical_value = norm.ppf(alpha)  # norm.ppf(0.5) = 0.0

For a one-tailed distribution, the 50th percentile, i.e. the value below which 50% of the data lies, is 0.0.
from scipy import stats
norm = stats.norm(loc=0, scale=1)
data_percent = 0.5
# Significance level (alpha) for a two-tailed interval
alpha = 1 - data_percent  # 0.5
# Calculate the critical values for the two tails
critical_value_left = norm.ppf(alpha / 2)       # norm.ppf(0.25) ≈ -0.67
critical_value_right = norm.ppf(1 - alpha / 2)  # norm.ppf(0.75) ≈ 0.67

For a two-tailed distribution, the lower tail probability is 0.25 and the upper tail probability is 0.75; computing their ppf gives -0.67 and 0.67 respectively, the values that bound 50% of the data about the mean.

Now coming back to the rainfall example where the predicted value was 45. Let's say the predicted distribution was a Gaussian with a mean of 45 and a standard deviation of 1. Let's compute the 95% confidence interval for this.

from scipy import stats
# Define the Gaussian predicted by the model
norm = stats.norm(loc=45, scale=1)
data_percent = 0.95
# Significance level (alpha) for a two-tailed interval
alpha = 1 - data_percent  # 0.05
# Calculate the critical values for the tails
critical_value_left = norm.ppf(alpha / 2)       # norm.ppf(0.025) ≈ 43.04
critical_value_right = norm.ppf(1 - alpha / 2)  # norm.ppf(0.975) ≈ 46.96

So the 95% confidence interval is roughly [43, 47].

One more trick before we go back to calibration: suppose you want to compute the ppf for a normal distribution whose mean is not 0 and whose std is not 1. We can do it in two ways.


Example-1

Gaussian Distribution ==> N(0,5)
Mean = 0 ; std = 5

Method-1
norm1 = stats.norm(loc=0, scale=1)
norm1.ppf(0.8) ==> 0.84162
final value = norm1.ppf(0.8)*std + Mean
= 0.84162*5 ==> 4.2081

Method-2
norm2 = stats.norm(loc=0, scale=5)
final value = norm2.ppf(0.8) ==> 4.2081

Example-2

Gaussian Distribution ==> N(10,5)
Mean = 10 ; std = 5

Method-1
norm1 = stats.norm(loc=0, scale=1)
norm1.ppf(0.8) ==> 0.84162
final value = norm1.ppf(0.8)*std + Mean
= 0.84162*5 + 10
= 4.2081 + 10 ==> 14.2081

Method-2
norm2 = stats.norm(loc=10, scale=5)
final value = norm2.ppf(0.8) ==> 14.2081

Please note that we are going to use Method-1, i.e. compute the std multiplier for N(0, 1) and multiply it by the std of the predicted distribution (and add the mean).
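In other words, if z_p = \Phi^{-1}(p) is the ppf of the standard normal N(0, 1), then for a predicted distribution N(\mu, \sigma) the corresponding value is simply

x_p = \mu + z_p \cdot \sigma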

Coming back to measuring calibration

For the classification task, calibration can be defined such that the predicted confidence should match the observed confidence (accuracy).

For regression tasks the calibration can be defined such that predicted confidence intervals should match the confidence intervals empirically computed from the data set (observed confidence interval)

Let's bring back the reliability plot from classification.

For the regression problem, let's take the following example.

Now I will show how to compute the 20% and 95% confidence intervals for the third sample X3, with mean 50 and std 3.

from scipy import stats
# Define the Gaussian predicted for sample X3
norm = stats.norm(loc=50, scale=3)

### For the 20% confidence interval
data_percent = 0.2
# Significance level (alpha) for a two-tailed interval
alpha = 1 - data_percent  # 0.8
# Calculate the critical values for the tails
critical_value_left = norm.ppf(alpha / 2)       # norm.ppf(0.4)
critical_value_right = norm.ppf(1 - alpha / 2)  # norm.ppf(0.6)
print(critical_value_left, critical_value_right)

49.2399586905926 50.7600413094074

### For the 95% confidence interval
data_percent = 0.95
# Significance level (alpha) for a two-tailed interval
alpha = 1 - data_percent  # 0.05
# Calculate the critical values for the tails
critical_value_left = norm.ppf(alpha / 2)       # norm.ppf(0.025)
critical_value_right = norm.ppf(1 - alpha / 2)  # norm.ppf(0.975)
print(critical_value_left, critical_value_right)

44.12010804637984 55.87989195362016

Similarly for all other samples

Let's take the earlier example where our model predicted the amount of rainfall:

predicted mean = 45, predicted_std =1

From the above distribution, we computed the different confidence interval values.

For a 95% confidence interval

The predicted range is between 43 and 47, so the allowed error is within -2 to +2; taking the absolute value, the allowed error is 2 as per the prediction.

Predicted Error = ±2

Now let's compute the actual error.

Case 1: if GT value = 48

Actual Error = abs(45 - 48) = 3, which is higher than the allowed error of 2, so the model is miscalibrated at the 95% confidence interval.

Case 2: if GT value = 44

Actual Error = abs(45 - 44) = 1, which is less than or equal to the allowed error of 2, so the model is calibrated (for this sample) at the 95% confidence interval.

Now let's plot a reliability plot for the regression problem, just as we did for the classification problem.
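A minimal sketch of how the observed confidence could be computed against the expected confidence for such a plot, assuming arrays of per-sample predicted means, predicted standard deviations, and ground-truth values (the variable and function names are illustrative):

import numpy as np
from scipy import stats

def observed_confidence(mu, sigma, y_true, levels):
    # For each expected confidence level, compute the fraction of ground-truth
    # values falling inside the predicted two-tailed interval.
    obs = []
    for p in levels:
        alpha = 1 - p
        lower = stats.norm.ppf(alpha / 2, loc=mu, scale=sigma)
        upper = stats.norm.ppf(1 - alpha / 2, loc=mu, scale=sigma)
        obs.append(np.mean((y_true >= lower) & (y_true <= upper)))
    return np.array(obs)

# levels = np.linspace(0.05, 0.95, 10)
# obs = observed_confidence(mu, sigma, y_true, levels)
# Plotting levels (expected) against obs (observed) gives the regression
# reliability plot; a perfectly calibrated model lies on the diagonal.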

References

  1. https://stackoverflow.com/questions/60699836/how-to-use-norm-ppf
