DATA SCIENCE THEORY | EXPLAINABLE AI | MACHINE LEARNING

Conformal prediction theory explained

Learn how to assess the uncertainty of predictions produced by your machine learning model

Artem Ryasik
Low Code for Data Science



In this post we are going to take a look at the basic theory of conformal prediction, a powerful algorithm that allows data scientists to quantify the uncertainty of predictive models and set a desired error threshold. For a more exhaustive explanation, please check the scientific paper that I co-authored with Tuwe Löfström and Ulf Johansson.

Machine learning solutions surround us and affect many aspects of our lives, so a new requirement has recently emerged, which can be briefly summarized as: Explainable AI is a must to ensure a fair and unbiased world. This term covers many requirements: how a decision was made, whether it was ethical, whether the prediction contains any biases, how it can be interpreted, how accurate the prediction is, and so on. Today, we are going to look at the model's certainty regarding its predictions and see whether it is possible to control the level of confidence of the model. The short answer is: yes!

Simple examples

Let’s take a look at two simple examples for classification and regression:

  • We have models A and B predicting customer churn. Both models predict that the customer is not going to churn, and we eventually find out that this prediction is correct. However, which model should we pick? Can they be trusted equally? To answer this question, we need to look at the probabilities: model A predicted that the customer would not churn with a probability of 55%, whereas model B predicted it with a probability of 75%. Now the answer becomes clear: model B did better for this particular prediction, hence model B should be (carefully) trusted more than model A.
  • Again we have two models, C and D, predicting the weather for the next day. Model C predicts 15 degrees with a variance of 2 degrees, and model D predicts 15 degrees with a variance of 4 degrees. On the next day, we find out the prediction is correct again. Following the same logic as in the previous case, we can assume that model C is more accurate than model D, since the range of possible temperatures is narrower for model C, and hence the possible error is lower.

Since models give us more than just predictions, we can use class probabilities for classification and variance for regression to define a measure of conformity or nonconformity. These measures are first computed on a special calibration data set that contains ground-truth labels, and later compared with the measures of new samples from the test set or from an unlabeled data set. This way, we can estimate how similar or different the new samples are relative to the calibration table, based on their class probability or variance values. Moreover, we can set a desired threshold for the tolerable error. That is where conformal prediction comes into play!

The concept of conformal prediction was first introduced quite a long time ago in this book (Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: New York, 2005). Fortunately, it has been well accepted among data scientists, and among cheminformaticians in particular; at least this is the area where I first found practical use for conformal prediction. So let's now have a look at the conformal prediction algorithms for both classification and regression problems. Although the algorithms are quite simple and straightforward, it may take some time to fully grasp the concept, understand how useful it is and start applying it in your daily data science routine.

Conformal prediction for classification

Following the standard procedure for supervised machine learning problems, in order to train a model we first split the data into training and test sets (for simplicity we are not going to consider the case with cross-validation). To apply conformal prediction, we need to introduce a third data set for calibration. The proportions could be the following: 70–80% for training and 10–15% each for calibration and test. Please also do not forget to perform stratified sampling on the target variable, and consider having balanced data sets if possible (quite standard requirements for classification anyway). In general, the test and calibration sets can be of the same size. Mind, however, that this always depends on your data set: how big it is, how many classes you have, how balanced they are, etc. For simplicity, we are going to look at a binary classification problem. It is worth noting that conformal prediction does not impose any restrictions on the number of classes.
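To make this more concrete, here is a minimal sketch of such a three-way split using scikit-learn. The synthetic data, variable names and exact proportions are illustrative assumptions, not part of the original workflow.

# Minimal sketch: split the data into training, calibration and test sets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, slightly imbalanced binary classification data (illustrative).
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.7, 0.3],
                           random_state=42)

# 80% for training, stratified on the target variable.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=42)

# Split the remaining 20% evenly into calibration (10%) and test (10%) sets.
X_calib, X_test, y_calib, y_test = train_test_split(
    X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=42)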

As stated before, we just need to follow the standard fit-predict pattern (see figure 1), where the training data set is used to train the model.

Figure 1. The algorithm of conformal prediction combined with fit-predict pattern for classification.
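Continuing the sketch above, fitting the model on the training set could look like this; the random forest is a purely illustrative choice, and any classifier that outputs class probabilities would do.

# Fit a probabilistic classifier on the training set (illustrative choice).
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)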

However, now we have two hold-out data sets instead of one (the test set), so first we need to apply our trained model to the calibration data set. Afterwards, we extract the class probabilities for the calibration samples and, for each true class, rank them by predicted probability in descending order (see figure 2, left table). Class probabilities are the simplest and most straightforward values that can be used as a measure of conformity/nonconformity for a classification problem.

Figure 2. Conformal prediction for classification problem. On the left, we have a ranked calibration table. On the right, we have new samples with class probabilities. These probabilities are used to get the rank of the samples, then the rank is used for p-values calculation. Notation: P(n) — class n probability, Rank(n) — class n rank.
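A minimal sketch of building such a ranked calibration table, assuming the fitted classifier clf and the calibration split from the previous sketches:

import numpy as np

# Class probabilities predicted for the calibration samples.
calib_proba = clf.predict_proba(X_calib)   # shape: (n_calibration, n_classes)

# For each class, take the predicted probability of that class for the
# calibration samples whose true label is that class, sorted in descending
# order. This corresponds to the ranked calibration table in figure 2 (left).
calibration_table = {
    cls: np.sort(calib_proba[y_calib == cls, idx])[::-1]
    for idx, cls in enumerate(clf.classes_)
}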

After that, we need to get model predictions for the test data set. Now we are ready for conformal prediction: we just need to compare the test set probabilities with the calibration set probabilities and identify their rank within the existing calibration table. In other words, we are going to determine how conformal (or non-conformal) the test samples are compared to the calibration data set, based on their class probabilities. Once the rank of a new sample is defined, we can calculate p-values (not to be confused with the p-values of statistical hypothesis testing) with a simple formula:

p_i = (c_i - r + 1) / (c_i + 1)

where r is the rank of the sample in the descending-ordered calibration table for class i (rank 1 corresponds to the highest probability), and c_i is the number of records in the calibration set for class i. This way we get a p-value for each class for every sample (see figure 2, right tables).
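A minimal sketch of this step, consistent with the formula above and reusing calibration_table from the previous sketch (note that some implementations add one more count in the numerator for the test sample itself):

def conformal_p_values(test_proba, calibration_table, classes):
    """Compute a p-value per class for every test sample (sketch)."""
    pvals = np.zeros((test_proba.shape[0], len(classes)))
    for idx, cls in enumerate(classes):
        cal = calibration_table[cls]
        c_i = len(cal)
        for row in range(test_proba.shape[0]):
            # Rank of the test sample among the class-i calibration samples:
            # 1 + the number of calibration samples with a higher probability.
            rank = np.sum(cal > test_proba[row, idx]) + 1
            pvals[row, idx] = (c_i - rank + 1) / (c_i + 1)
    return pvals

test_proba = clf.predict_proba(X_test)
pvals = conformal_p_values(test_proba, calibration_table, clf.classes_)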

So what is next? Why did we compute these p-values and how can they help us with the predictions? This is the most exciting part. We can set the desired error rate (the significance level of the predictions), and then we will always get valid predictions in terms of the selected error rate. If you want to be extremely precise, you can select a 1% error rate (a significance level of 0.01); or maybe you would like to avoid a lot of false positives, so you can loosen your requirements to a 15% tolerable error (a significance level of 0.15).

How does this work? We simply compare the p-values, which are essentially a measure of how conformal the new samples are to the samples in our calibration data set, with the chosen significance level. If the p-value for a certain class is equal to or greater than the significance level, that class is included in the prediction, and the prediction is considered valid for that class. This means that some of the predictions will actually be sets rather than single values (a small sketch of building prediction sets from p-values follows the list below). Such a prediction can be interpreted in the following way: the sample is conformal to more than one class, so it lies within the domain of possible classes and belongs to several classes simultaneously. The model is not very certain which of the classes it is, but it is still a valid sample. There might be another situation: the p-values for all classes are smaller than the chosen significance level, in which case the prediction is an empty set and is considered to be an invalid prediction. The interpretation for this case could be that the sample is out of the domain of possible classes. This situation might indicate the following:

  1. You might have another class that you have not considered before. This can happen if your data drifts and is no longer well represented by the training data set that was used to build the model, or by the calibration data set.
  2. The sample is very hard for the model to classify, so perhaps you need to include more samples of a similar kind in your training and calibration sets.
  3. Your data set is extremely imbalanced, so you need to introduce more samples of this class in order to have a decent representation in the calibration table.
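Here is the sketch referred to above: forming prediction sets from the p-values, with a purely illustrative significance level of 0.15.

significance = 0.15   # tolerable error rate (illustrative)

prediction_sets = []
for row in range(pvals.shape[0]):
    # Include every class whose p-value is at least the significance level;
    # the result may be a single class, several classes, or an empty set.
    included = [cls for idx, cls in enumerate(clf.classes_)
                if pvals[row, idx] >= significance]
    prediction_sets.append(included)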

Conformal prediction for regression

The regression case is, in general, the same as classification and follows the same pattern displayed in figure 1. We need three data sets, we need to train a model, and we need to rank the calibration data set. The only difference is how we define Alpha, the measure of conformity/non-conformity, and here we actually have multiple options. The most straightforward measures are the prediction variance or the absolute error, which come naturally from the most well-known regression models, but one can also introduce a more specific function. Moreover, we can normalize the prediction interval in order to potentially minimize the prediction regions, obtaining individual prediction bounds for each sample. You can read more about different conformity/non-conformity measures and normalization for the regression case in this paper (link).

The algorithm for ranking and picking the prediction bound is shown in figure 3. As Alpha we are going to use the absolute error: the absolute difference between the actual and predicted values on the calibration set. We then sort the Alpha values in descending order and assign a rank to the sorted values; now we have our calibration table. Once we have the model predictions, we also need to define the error rate (significance level) to make conformal predictions. The error rate determines which Alpha we take from the calibration table: we pick the value whose rank corresponds to the chosen fraction of the table. As one can see in figure 3, different error rate (significance level) values give different Alphas, and consequently different lower bounds (LB) and upper bounds (UB). In the figure, we have two significance level values, 0.1 (red) and 0.3 (blue), and they define which Alpha values we take to build the prediction interval. It is quite easy to see that if we demand fewer errors, the intervals become wider, since we take bigger Alpha values; and vice versa, if we are more tolerant to errors, the intervals become narrower. Eventually, we will always get the desired share of errors, i.e. the fraction of true values that fall outside the prediction bounds.

Figure 3. Conformal prediction for regression problem. On the left, we have a table with true values, prediction and prediction intervals for error rate 0.1 (red) and 0.3 (blue). On the right, we have a calibration table that contains Alpha.
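A minimal sketch of this procedure, assuming a regression data set split in the same way as before (X_train, y_train, X_calib, y_calib, X_test); the model choice and the error rate are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative model; any regressor would do.
reg = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Alpha on the calibration set: absolute errors, sorted in descending order.
alphas = np.sort(np.abs(y_calib - reg.predict(X_calib)))[::-1]

# Pick the Alpha whose rank corresponds to the chosen error rate,
# i.e. roughly the floor(error_rate * (n + 1))-th largest value.
error_rate = 0.1
index = max(int(np.floor(error_rate * (len(alphas) + 1))) - 1, 0)
alpha = alphas[index]

# Prediction interval for new samples: point prediction +/- Alpha.
test_pred = reg.predict(X_test)
lower_bound = test_pred - alpha
upper_bound = test_pred + alpha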

Closing remarks

Of course, it is worth optimizing the error rate (significance level) in order to get the best predictions. The quality of conformal predictions can be estimated using special metrics (different for classification and regression) that will be described in upcoming blog posts. The good thing is that optimizing the error rate costs nothing once you already have a trained model and a calibration table.

Another simple technique that helps make conformal prediction more robust is training multiple models along with multiple calibration sets. You then obtain as many predictions for a test or unlabeled data set as you have model/calibration-set pairs. You follow the same procedure to get conformal predictions for each pair and, finally, aggregate the predictions using the median for every sample. This approach can easily be combined with cross-validation, and it is then called Mondrian cross-conformal prediction.
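As a rough illustration of this idea, here is a sketch of aggregating p-values over several model/calibration pairs built with cross-validation; p_values_for_pair is a hypothetical helper that wraps the classification steps sketched earlier, not a function from any library.

import numpy as np
from sklearn.model_selection import KFold

all_pvals = []
for train_idx, calib_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    # Hypothetical helper: fit a model on the fold's training part, build a
    # calibration table from its calibration part, and return p-values for
    # the test samples, exactly as in the earlier classification sketches.
    all_pvals.append(p_values_for_pair(X[train_idx], y[train_idx],
                                       X[calib_idx], y[calib_idx], X_test))

# Aggregate the p-values per sample and per class using the median.
aggregated_pvals = np.median(np.stack(all_pvals), axis=0)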

At the same time, conformal prediction has one strong requirement: the samples must be IID (independent and identically distributed) or, more precisely, exchangeable, which is a slightly weaker condition (you can learn more about this in the original manuscript of Vovk et al.). This is a quite standard type of assumption for machine learning problems in general. Time series data, however, is clearly neither IID nor exchangeable, so applying conformal prediction to it requires special care. Moreover, conformal prediction is agnostic to both the data set and the model: you can work with any model, you just need to define what your conformity/non-conformity measure is going to be.

As a final note, conformal prediction sets (for classification) and intervals (for regression) can adjust to the difficulty of the prediction, which is related to the notion of conditional coverage. This basically means that easy samples will get a smaller prediction set or interval, while difficult samples will get a bigger one. This is also a very useful insight into the data and helps you understand in which cases the model is more prone to error. If you would like to learn more about this, I recommend watching these videos, which provide a thorough explanation of this feature.

Conclusion

This was a very gentle introduction to conformal prediction without the math. In this article, I just wanted to describe the algorithm that one needs to apply in order to use conformal prediction for classification and regression problems. In case you are interested in a deep dive into the theory and math behind it, I encourage you to look at a few scientific papers. Personally, I can recommend the following:

  1. Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic learning in a random world; Springer: New York, 2005 — the original work where conformal prediction was introduced;
  2. Norinder, U., Carlsson, L., Boyer, S., & Eklund, M. (2014). Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. Journal of chemical information and modeling, 54(6), 1596–1603. — the paper I mostly relied on while preparing this blog post;
  3. Boström, H., Linusson, H., Löfström, T., & Johansson, U. (2017). Accelerating difficulty estimation for conformal regression forests. Annals of Mathematics and Artificial Intelligence, 81(1), 125–144. — the manuscript I used for describing the algorithm for conformal prediction for regression;
  4. Löfström, T., Ryasik, A., & Johansson, U. (2022, August). Tutorial for using conformal prediction in KNIME. In Conformal and Probabilistic Prediction with Applications (pp. 4–23). PMLR. — the paper that I wrote with my co-authors for presenting conformal prediction for regression as an extension for KNIME.

In the next parts, I am going to demonstrate how to use conformal prediction for classification and regression using an open-source no-code tool, KNIME. We are also going to cover the metrics for estimating predictive quality for each case.

Repositories

You can find implementations of conformal prediction in these repositories:

  1. https://hub.knime.com/redfield/extensions/se.redfield.cp.feature/latest — KNIME extension;
  2. https://github.com/RedfieldAB/conformal_prediction — Java implementation;
  3. https://github.com/henrikbostrom/crepes — Python implementation;
  4. https://github.com/aangelopoulos/conformal-prediction — another Python implementation;
  5. https://github.com/valeman/awesome-conformal-prediction — a repository with a huge collection of materials about conformal prediction.

This blog post was written as a private initiative and is not in any way related to my current employer.

UPDATED ON 09/02/23. Fixed the error on figure 2 and provided more details on calibration process for regression case. Thank you everyone for your reviews here and on LinkedIn.
