How to add confidence to your Machine Learning models

Uncertainty Quantification using Metamodeling with IBM’s UQ360 toolkit

Islem Benmahammed
TotalEnergies Digital Factory
10 min read · Apr 6, 2022


Photo by Javier Allegue Barros on Unsplash

What do we mean by uncertainty quantification in Machine learning?

Today, machine learning is used in critical applications where the cost of an error can be extremely high. Consider machine learning in a self-driving car or a medical diagnostic system: the fundamental question is whether we can trust the predictions these tools give us. You can think of uncertainty quantification as a wrapper around a prediction engine that returns a trustworthy set of plausible values rather than a single guess.

Uncertainty intervals in a regression task

Suppose you want to build a gas price prediction system. A regression model gives you a point estimate of the price at a particular moment in time, but you may then be interested in two things: how much that point estimate would change given different feature values, and how accurate the trajectory of predicted points is over time. Quantifying the uncertainty of the model answers both questions.

For a classification task, consider a system that tries to detect the presence of cancer. We don't want a system that only gives us its best estimate; we also need some measure of uncertainty or confidence in the model to make the predictions actionable.

Suppose the model predicts that there is no cancer, but its confidence level is low. A user, in this case a doctor, will likely reject the model's prediction because the model itself is unsure, and can instead seek advice from another expert, try another model, or collect more features.

Having a good-quality uncertainty estimate adds a layer of safety and transparency and enables human-AI collaboration. It is especially critical in domains such as medicine or justice, or wherever an end user is affected by the model's decision. Uncertainties can also be used to improve the model itself: if the model provides suitable confidence measures, they give us guidance on how to improve it, for example by looking at the regions of the input space where the model is least confident and collecting more data there. Uncertainties are therefore also a helpful guide for improving the model.

What are the different types of uncertainties?

Uncertainty emerges from two sources of missing knowledge. The first is epistemic uncertainty, also called model uncertainty. It arises when there are too few training data points in some regions of the input space, so we cannot be sure how the model should behave there. The defining property of this type of uncertainty is that showing the model more data instances can reduce it.

Sources of uncertainty

The other kind of uncertainty is aleatoric uncertainty, also called data uncertainty. It arises from the inherent observational noise in the training data: for a given feature value, the observed outcome varies slightly each time. Even if we knew the ground-truth model that generated the data and captured the underlying pattern, we could never be certain of an individual prediction.

Metamodeling for Uncertainty Quantification

Most approaches to quantifying uncertainty can be classified as either intrinsic or extrinsic.

Intrinsic approaches are based on models that produce uncertainty estimates natively, such as Bayesian models, Gaussian processes, and ensembles. Extrinsic methods start from a pre-existing model that may or may not be able to produce uncertainty and add this capability on top of it.

In this article, we are interested in an approach called Meta-modeling.

The idea is that you have a base model performing a task, classification or regression, on one side, and on the other side a secondary model that acts as an observer. The observer has access to the base model's inputs and outputs and to the ground-truth data, and it is trained to predict the base model's success or failure. Meta-modeling offers a solution to many problems developers and researchers face when deploying AI models. Consider a model trained by someone else: say you have a client's model that you cannot access or retrain, that may have been trained on large amounts of proprietary data, and that cannot produce its own uncertainty estimates. With meta-modeling, you can equip this model with uncertainty without ever touching it.

Several variants of meta-modeling exist. In the simplest variant, the observer sees only the inputs and outputs of the base model; this is called the black-box method. In another variant, the observer also has access to the inner workings of the base model, which can lead to better uncertainty predictions; this is called the white-box method.
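To make the black-box idea concrete, here is a minimal sketch (not the UQ360 implementation) in which the observer is simply a regressor trained to predict the absolute error of a pre-fitted base model; the data and model choices below are illustrative placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Illustrative data: any regression dataset would do.
X, y = make_regression(n_samples=2000, n_features=5, noise=10.0, random_state=0)
X_base, X_meta, y_base, y_meta = train_test_split(X, y, test_size=0.5, random_state=0)

# Base model: solves the original task.
base = Ridge().fit(X_base, y_base)

# Observer (meta-model): sees the base model's inputs and outputs plus the
# ground truth, and learns how wrong the base model tends to be.
meta_features = np.column_stack([X_meta, base.predict(X_meta)])
meta_targets = np.abs(y_meta - base.predict(X_meta))
observer = GradientBoostingRegressor().fit(meta_features, meta_targets)

# At inference time, the observer's output serves as an uncertainty score.
X_new = X_meta[:5]
point_estimates = base.predict(X_new)
uncertainty = observer.predict(np.column_stack([X_new, point_estimates]))
print(point_estimates, uncertainty)
```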

Are ML Model Uncertainties reliable?

When a model gives us uncertainties, we can ask whether they are trustworthy and reliable, so we need metrics to measure the quality of these uncertainty estimates. In general, we do not have access to ground-truth uncertainty scores, so the most common approach is to use surrogate metrics, which differ slightly between classification and regression tasks.

For a classification task, the reliability plot, also called a calibration curve, is one of the common approaches to assessing how well an uncertainty estimate is calibrated.

A reliability plot shows the observed relative frequency of the positive outcome against the predicted probability. The position of the points or curve relative to the diagonal helps interpret the probabilities. For example, if we look at the blue line and take all the instances whose confidence level was 0.95 and find that their average accuracy is only about 0.8, the model is overconfident and not well calibrated. Conversely, if the model's accuracy is much higher than its confidence level, it is underconfident. Both can be detrimental in some situations.
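As an illustration, scikit-learn's calibration_curve gives you the points of such a plot for any binary classifier; the classifier and data below are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
confidence = clf.predict_proba(X_test)[:, 1]

# Observed frequency of the positive class vs. mean predicted probability per bin.
frac_positives, mean_confidence = calibration_curve(y_test, confidence, n_bins=10)

plt.plot(mean_confidence, frac_positives, marker="o", label="model")
plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```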

In the regression case, we measure calibration through two quantities. The first is the Prediction Interval Coverage Probability (PICP), whose complement is the miss rate, and the second is the Mean Prediction Interval Width (MPIW), which is related to the band excess.

Let us look at these two examples:

We have two models with one input feature. In both cases, the blue line is the ground truth and the green dots are the model's mean estimates, and these are quite similar. The prediction intervals, however, differ: for model A the uncertainty is large for most instances, whereas for model B it is smaller for most samples.

We compute the PICP score as the fraction of instances whose prediction interval covers the ground truth. The MPIW, the average width of those prediction intervals, tells you how uncertain the model is on average.
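Both quantities reduce to a few lines of NumPy; the sketch below assumes arrays y_true, y_lower, and y_upper holding the observations and interval bounds.

```python
import numpy as np

def picp(y_true, y_lower, y_upper):
    """Fraction of observations that fall inside their prediction interval."""
    return np.mean((y_true >= y_lower) & (y_true <= y_upper))

def mpiw(y_lower, y_upper):
    """Average width of the prediction intervals."""
    return np.mean(y_upper - y_lower)

# Toy example: three observations and their predicted intervals.
y_true = np.array([3.0, 5.0, 7.0])
y_lower = np.array([2.5, 5.5, 6.0])
y_upper = np.array([3.5, 6.5, 8.0])
print(picp(y_true, y_lower, y_upper))  # 0.67: the second interval misses
print(mpiw(y_lower, y_upper))          # 1.33: average interval width
```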

Comparing uncertainties in a regression task:

Many tools report calibration in terms of coverage or likelihood, but they provide little additional information, such as the average bandwidth or why the intervals are wide or narrow. Another problem is that different authors pick different confidence levels for the coverage probability, 0.95 for some, 0.9 or 0.99 for others, which makes comparing uncertainty results challenging. When dealing with prediction intervals in regression, we can encounter two types of cost:

  • Band excess: the interval overshoots the observation. We measure it as the smaller of the distances from the ground truth to the upper and lower boundaries.
  • Miss rate: the interval undershoots and misses the ground truth; the size of the miss is also called the band deficit.

An optimal boundary captures the entire ground truth while being the least excessive in average bandwidth.

UQ360 introduced the notion of the Uncertainty Characteristics Curve.

This diagnostic tool helps assess prediction-interval quality on a two-dimensional graph: one axis is the type 1 cost (average band excess) and the other is the type 2 cost (miss rate). The UCC implementation in UQ360 lets you choose among several cost variants (excess, failure rate, absolute bandwidth, band deficit). The idea is that the axes expose a trade-off: we scale the prediction intervals up or down, starting from a very thin band around the regression prediction and inflating it until it envelops all the observations, tracing out the costs as we go. These curves give deeper insight into the behavior of each model that generates prediction intervals. In practice, we often observe crossovers between models: model A may perform better in the low-failure-rate operating region, while model B performs best in the higher-failure-rate region.

Another insight is the area under the curve, which gives an operating-point-agnostic summary of a prediction-interval model's performance. Imagine you do not need to commit to a specific operating region, such as a coverage probability of 0.99 or 0.8: such a metric gives you an agnostic view and can be used for a more general comparison.
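UQ360 ships its own UncertaintyCharacteristicsCurve implementation; as a rough, library-free sketch of the mechanics, the snippet below scales the prediction bands, records the miss rate and average band excess at each scale, and summarizes the curve by its area.

```python
import numpy as np

def ucc_points(y_true, y_mean, y_lower, y_upper, scales=np.linspace(0.1, 5.0, 50)):
    """Trace a simplified uncertainty characteristics curve.

    For each scaling factor, the original bands are inflated or deflated around
    the mean prediction, and the resulting miss rate (type 2 cost) and average
    band excess (type 1 cost) are recorded.
    """
    half_low = y_mean - y_lower
    half_up = y_upper - y_mean
    miss_rates, excesses = [], []
    for s in scales:
        lo, up = y_mean - s * half_low, y_mean + s * half_up
        inside = (y_true >= lo) & (y_true <= up)
        miss_rates.append(np.mean(~inside))
        # Excess: distance from the observation to the nearest boundary, counted
        # only for covered points.
        excess = np.where(inside, np.minimum(up - y_true, y_true - lo), 0.0)
        excesses.append(np.mean(excess))
    return np.array(miss_rates), np.array(excesses)

def ucc_auc(miss_rates, excesses):
    """Area under the (miss rate, excess) curve as an operating-point-agnostic summary."""
    order = np.argsort(miss_rates)
    return np.trapz(excesses[order], miss_rates[order])
```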

UQ360 in practice

IBM has created the Uncertainty Quantification 360 (UQ360) toolkit and released it to the open-source community. UQ360 gives data scientists and developers algorithms to quantify, evaluate, improve, and communicate the uncertainty of machine learning models. For more information, visit the project's GitHub repo.

Next, we will see how to use meta-models in practice with UQ360.

Let’s start by importing the necessary packages from uq360:
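A minimal set of imports for this walkthrough might look like the following; exact module paths and signatures can vary between UQ360 versions, so treat these snippets as a sketch and check the toolkit's documentation.

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

# Black-box meta-model for regression from IBM's UQ360 toolkit.
from uq360.algorithms.blackbox_metamodel import BlackboxMetamodelRegression
```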

In this use case, we are interested in a regression problem: predicting the price of a house from information such as the number of rooms, the area, and other features. To simplify the demonstration, we will use a single feature, the number of rooms, as input.
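The exact dataset is not essential here, so as a stand-in assume we have a feature matrix holding the room counts and a target vector holding the prices, split into train and test sets (the synthetic data below is purely illustrative):

```python
from sklearn.model_selection import train_test_split

# Placeholder housing data: replace with your own dataset.
rng = np.random.default_rng(0)
rooms = rng.integers(1, 10, size=500).astype(float).reshape(-1, 1)
price = 50_000 + 30_000 * rooms[:, 0] + rng.normal(0, 20_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(rooms, price, random_state=0)
```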

Now, consider that we already have a base model, a ridge regression, that was previously trained on these data.
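A stand-in for that pre-fitted base model, here a scikit-learn Ridge regressor trained on the placeholder data:

```python
# In the real scenario, this model would already exist (possibly trained by
# someone else) and would not be retrained here.
base_model = Ridge()
base_model.fit(X_train, y_train)
```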

Then, we need to instantiate a LinearRegression as the observer model with some basic parameters.
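For instance, a plain scikit-learn LinearRegression with default parameters:

```python
# Observer (meta) model that will learn to predict the base model's error.
meta_model = LinearRegression()
```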

Base and meta models can be any type of model, not necessarily a scikit-learn model. The only requirement placed on the instance is that it implements fit and predict methods.

After that, using the base and meta model instances, we can create a BlackboxMetamodelRegression instance and train it with the option base_is_prefitted=True.
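Something along these lines (the exact fit signature may differ slightly depending on the UQ360 version):

```python
# Wrap the pre-fitted base model and the observer in a black-box meta-model.
uq_model = BlackboxMetamodelRegression(base_model=base_model, meta_model=meta_model)

# base_is_prefitted=True tells UQ360 not to retrain the base model; the
# training data is used to fit the observer only.
uq_model.fit(X_train, y_train, base_is_prefitted=True)
```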

Once the model is trained, we can quickly compute uncertainties using the predict function.
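Assuming the namedtuple-style result returned by UQ360's regression algorithms, with the mean prediction and the interval bounds:

```python
# Point estimates plus lower/upper uncertainty bounds for the test set.
# (Field names assume the result object used by recent UQ360 versions.)
result = uq_model.predict(X_test)
y_mean, y_lower, y_upper = result.y_mean, result.y_lower, result.y_upper
```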

In addition to the base model's predictions, we now have uncertainty bounds around them.

Now, we can assess the quality of the bounds by computing regression metrics.
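For example, computing the coverage and the average interval width directly (UQ360 also exposes equivalent regression metrics in its uq360.metrics module):

```python
# PICP and MPIW of the predicted intervals on the test set, computed by hand
# exactly as the picp/mpiw helpers sketched earlier.
coverage = np.mean((y_test >= y_lower) & (y_test <= y_upper))
avg_width = np.mean(y_upper - y_lower)
print(f"PICP: {coverage:.2f}  MPIW: {avg_width:.2f}")
```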

And its associated Uncertainty Characteristics Curve.

The Operating Point (OP) is the point on the curve with the lowest cost, which we can adopt once we have chosen between different uncertainty models.

In this case, the system operates at a 32% Miss Rate with an average bandwidth of 0.48.

We can go further and try to obtain a better operating point by training a BlackboxMetamodelRegression from scratch, using other algorithms such as GradientBoosting for both the base and observer models.

base_config and meta_config allow specifying parameters for both the base model and the meta model.
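A possible configuration, with hypothetical hyperparameters (depending on the UQ360 version, base_model and meta_model may expect estimator classes together with these config dictionaries rather than pre-built instances):

```python
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical hyperparameters; tune them for your own data.
gbr_params = {"n_estimators": 300, "learning_rate": 0.05, "max_depth": 3}

uq_model_gb = BlackboxMetamodelRegression(
    base_model=GradientBoostingRegressor,
    meta_model=GradientBoostingRegressor,
    base_config=gbr_params,
    meta_config=gbr_params,
)

# Both models are trained from scratch this time.
uq_model_gb.fit(X_train, y_train)
result_gb = uq_model_gb.predict(X_test)
```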

We see an improvement in both metrics: the Prediction Interval Coverage Probability and the Mean Prediction Interval Width.

The uncertainty characteristics curve also looks better, with a lower associated cost.

Conclusion

The ideal uncertainty quantification technique depends on the underlying model, the machine learning task (regression or classification), the data, and ultimately the user's objective. A UQ technique that does not generate high-quality uncertainty estimates may hoodwink end users, so before deploying an ML model, developers should routinely review the quality of the UQ with metrics such as the Uncertainty Characteristics Curve. In this article, we have walked through the essential steps of a very basic black-box metamodeling workflow. UQ360 also includes a comprehensive collection of uncertainty quantification methods, such as white-box metamodeling and other exciting techniques, that let you speed up development and add trust more cleanly.
