DATA SCIENCE THEORY | EXPLAINABLE AI | MACHINE LEARNING

Conformal prediction for classification

A hands-on codeless example with KNIME

Artem Ryasik
Low Code for Data Science

--

A fancy image produced by Midjourney to draw attention, attempting to visualize conformal prediction.

After introducing the theory behind conformal prediction, it is time for practice. In this article, we are going to jump into a KNIME workflow to solve a multi-class classification problem using the nodes of the Conformal Prediction extension. We are also going to discuss how to estimate the uncertainty of predictions. Finally, we will wrap up the pipeline with the nodes of the Integrated Deployment extension, so the workflow will be ready for production with a couple of clicks.

Read about Conformal prediction for regression

The data

We are going to work with a data set describing different physical parameters of beans (available on Kaggle). It contains about 13k records with 16 numerical columns and 7 classes, which are not well balanced; this actually makes it a good test for conformal prediction.

The workflow

We start with the description of the “advanced” method, which trains multiple models and produces a calibration table for each of them.

Those who are impatient can directly jump to the KNIME Community Hub and download the Conformal prediction classification workflow. The implementation (“advanced”) described hereafter corresponds to the upper branch of the workflow.

Data preprocessing

As with any ML routine, we first need to read the data, do a preliminary analysis, and clean it; only then can we move on to training the models and interpreting the results. I have already converted the data into the KNIME native table format and included it inside the workflow. After reading the data with the Table Reader node, let’s have a look at the class distribution (see figure 1), produced with the Bar Chart node. We can see that the classes are not balanced. However, the situation is not that dramatic, so we do not need to resort to class-imbalance handling techniques (e.g., under-sampling, over-sampling, etc.).

Figure 1. Bean types distribution.
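
Although the workflow itself is codeless, it may help to see what this step corresponds to in code. Below is a minimal Python sketch of the same check; the file name and the label column name "Class" are assumptions, not part of the KNIME workflow.

# Illustration only: the KNIME workflow does this with the Table Reader
# and Bar Chart nodes. File name and column name are assumptions.
import pandas as pd

beans = pd.read_csv("Dry_Bean_Dataset.csv")        # hypothetical file name
print(beans["Class"].value_counts(normalize=True)) # relative class frequencies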

Now, let’s split the data set into two parts: the control data (10%) and the rest (90%), which is going to be used for training, calibration, and intermediate testing. The first data set is only going to be used for prediction and estimation once we finish with the whole conformal classification pipeline. This is done to simulate production data.

The next, quite common, step is data normalization, which is definitely needed here since the features are on very different scales. We apply simple min-max normalization to the range [0, 1] (we do not have any negative values), which is done with the Normalizer node.
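
For readers who want to follow along outside of KNIME, here is a rough equivalent of the two steps above, assuming scikit-learn and the beans frame from the earlier sketch. It is a sketch of the idea, not a re-implementation of the Partitioning and Normalizer nodes.

# Illustration only: 10% control hold-out plus min-max scaling to [0, 1].
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stratified split: 90% for training/calibration/testing, 10% kept aside
# to simulate production data.
rest, control = train_test_split(
    beans, test_size=0.10, stratify=beans["Class"], random_state=42
)

feature_cols = beans.columns.drop("Class")   # assumes the label column is "Class"
scaler = MinMaxScaler(feature_range=(0, 1)).fit(rest[feature_cols])

rest_norm, control_norm = rest.copy(), control.copy()
rest_norm[feature_cols] = scaler.transform(rest[feature_cols])
control_norm[feature_cols] = scaler.transform(control[feature_cols])
# The fitted scaler plays the role of the normalization model that is
# captured and later reused on production data.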

Now, it is time to start capturing parts of the workflow using the nodes of the KNIME Integrated Deployment extension. These nodes will help us naturally build the final pipeline for production (you can learn more about Integrated Deployment here). The first step is to capture the preprocessing operations, such as the application of the normalization model that we created before capturing and that we will later provide as one of the inputs of the Workflow Executor node.

We also include a correlation filter that removes highly correlated features, and indeed we have a lot of them (see figure 2!). I set the correlation coefficient threshold to 0.8, which reduced the number of features from 16 to 6; this should benefit both the robustness of future predictions and the training time. Here we close the first captured part: this is the only preprocessing we are going to use.

Figure 2. Heatmap of features correlation matrix.
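
The correlation filter can likewise be sketched in a few lines of pandas: for every pair of features with absolute correlation above 0.8, one of the two is dropped. This mimics the idea of the node; the exact set of features it keeps may differ.

# Illustration only: drop one feature from every highly correlated pair.
import numpy as np

corr = rest_norm[feature_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
selected_features = [c for c in feature_cols if c not in to_drop]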

Training and calibration

We are now ready to proceed with model training and calibration. In this example, we are going to use the Random Forest algorithm but readers are free to use any other classification algorithm. We first split the data into training + calibration and test data sets. Then, we start the calibration loop using the Conformal Calibration Loop Start node that helps us split the first data set into training and calibration data sets. The settings are almost identical to the Partitioning node, with the addition that we need to specify how many model and calibration table pairs we would like to get by tweaking the parameter “Number of iterations”. In both cases, we need to use stratified sampling over the class column.

The construct within the loop block is the standard Learner + Predictor pair, but now we add a calibration step that creates calibration tables based on the class probabilities and the known class values. The Conformal Calibrator node takes the model predictions as input and produces an extra column, Rank, that will be used for calculating p-values at the next step. Within this loop, we also need to serialize the model as a cell object and put it into a KNIME table with the Model to Cell node. This is needed to propagate the model to the next step, but it also keeps the pipeline agnostic about the model we use: with the same procedure you can use any other model available in KNIME, or even serialize a model built in Python, for example with the scikit-learn or TensorFlow libraries.

Finally, the Conformal Calibration Loop End node synchronizes the models and calibration tables by adding an iteration column to each table, so that every model shares an iteration number with its calibration table. The described conformal calibration loop block can be seen in figure 3.

Figure 3. The conformal calibration loop block — creating the pairs of models and calibration tables.
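
Conceptually, each iteration of this loop boils down to: split off a calibration set, train a model, and record how confident the model was about the true class of every calibration record. The sketch below, assuming scikit-learn and the variables from the earlier sketches, uses the probability assigned to the true class as the score; this is a common textbook choice and not necessarily the exact score computed by the Conformal Calibrator node (for brevity, the intermediate test split is omitted here).

# Illustration only: one model/calibration-table pair per call.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def calibrate_once(data, features, label="Class", seed=0):
    train, calib = train_test_split(
        data, test_size=0.2, stratify=data[label], random_state=seed
    )
    model = RandomForestClassifier(random_state=seed)
    model.fit(train[features], train[label])

    # "Calibration table": for each class, the sorted probabilities the model
    # assigned to the true class of its calibration records (the role of Rank).
    proba = model.predict_proba(calib[features])
    col = {c: i for i, c in enumerate(model.classes_)}
    calib_scores = {
        c: sorted(proba[j, col[c]]
                  for j in range(len(calib))
                  if calib[label].iloc[j] == c)
        for c in model.classes_
    }
    return model, calib_scores

# The conformal calibration loop corresponds to repeating this with different seeds.
pairs = [calibrate_once(rest_norm, selected_features, seed=s) for s in range(5)]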

Conformal prediction

The next step is conformal prediction, and here we need another type of loop. This loop is intended to be used in production as well, so we are going to nest it between Capture nodes too (see figure 4).

Figure 4. The conformal prediction loop block nested between Capture nodes. Here we use model predictions on the test set, then these results are fed to the Conformal Predictor node to calculate p-values. At the end of the loop, the p-values are aggregated by computing the median. Finally, the user provides the error rate in the Conformal Classifier node to get the prediction sets.

The Conformal Prediction Loop Start node starts iterating over the consecutive pairs of models and calibration tables. We feed the trained model and the test set (or unlabeled data set) to a Predictor node, just as we usually do in our ML routines. Then, as an additional step, we feed the model predictions to the Conformal Predictor node to obtain the p-values. This process is repeated for all the model + calibration table pairs we gathered previously. Eventually, the p-values are aggregated by computing their median in the Conformal Prediction Loop End node. This operation helps make the predictions more accurate and robust, because we are not relying on a single model and calibration table but on many. After the loop, we can finally get conformal predictions with the Conformal Classifier node. In the settings of this node, the user provides the desired error rate and selects the preferred format of the prediction representation (array or string); the output of the node then contains a column with the prediction sets.
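
Again purely as an illustration of the mechanics, here is how p-values, median aggregation, and prediction sets can be computed for the pairs produced above. This mirrors the textbook class-conditional (Mondrian) inductive conformal predictor and is shown on the control hold-out for brevity; the KNIME nodes may differ in details.

# Illustration only: class-conditional p-values and prediction sets.
import numpy as np
from bisect import bisect_right

def p_values(model, calib_scores, X):
    proba = model.predict_proba(X)
    pv = np.zeros_like(proba)
    for i, c in enumerate(model.classes_):
        scores = calib_scores[c]              # sorted calibration scores for class c
        for j in range(len(proba)):
            # Share of calibration records of class c that looked no more
            # "class-c-like" than this point, with the usual +1 correction.
            rank = bisect_right(scores, proba[j, i])
            pv[j, i] = (rank + 1) / (len(scores) + 1)
    return pv

def prediction_sets(pv, classes, error_rate):
    return [[c for i, c in enumerate(classes) if pv[j, i] > error_rate]
            for j in range(len(pv))]

classes = pairs[0][0].classes_
# Median aggregation over all model/calibration pairs, as done at the loop end.
pv_median = np.median(
    [p_values(m, cs, control_norm[selected_features]) for m, cs in pairs], axis=0
)
sets = prediction_sets(pv_median, classes, error_rate=0.05)  # user-chosen error rate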

Conformal prediction quality metrics

The main metrics one can use to estimate the quality of conformal predictions are listed below (a minimal sketch of how they can be computed follows the list):

  • Efficiency — the ratio of single-label predictions (right or wrong). It is calculated as Single class predictions / Total number of predictions. In plain English: the share of records whose prediction set contains exactly one class. Efficiency shows how easy it was for the model to classify a particular class or, taking the total, all the classes.
  • Validity — the fraction of correct predictions. A record whose prediction set contains several classes is still counted as correct as long as the set includes the true class. It is calculated as Total match / Total number of predictions. Validity shows whether the model was good at capturing the right class even when the prediction set contains more than one class: in other words, the ratio of correct, if not always unambiguous, predictions.
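
Both metrics are straightforward to compute from the prediction sets. A minimal sketch, reusing the sets and control labels from the earlier sketches:

# Illustration only: efficiency and validity of a list of prediction sets.
def efficiency(sets):
    return sum(len(s) == 1 for s in sets) / len(sets)

def validity(sets, y_true):
    return sum(t in s for s, t in zip(sets, y_true)) / len(sets)

print(efficiency(sets), validity(sets, control_norm["Class"]))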

There are some more metrics available and you can read about them in the description of the Conformal Scorer node or in this paper. If you tick the box “Additional efficiency metrics” in the settings of the Conformal Scorer node, you can get them as well. In this article, we are going to operate with efficiency and validity.

At this point, the selection of the error rate reduces to a fairly simple optimization problem. It is computationally cheap, since we do not need to retrain any models: for a given error rate we simply bin the samples into prediction sets according to their p-values and recompute the metrics.

Finally, we need to combine all the workflow parts and save the production workflow with the Workflow Combiner and Workflow Writer nodes. The workflow can then be executed with the Workflow Executor node.

Optimizing error rate

Once the workflow is ready and automatically deployed, let’s take a look at how it can be used in KNIME. First of all, we need to decide which error rate value gives the best predictions. To do this, we read the freshly deployed workflow with the Workflow Reader node and then provide the Workflow Executor node with all the inputs we have been gathering and developing:

  • The unlabeled or test data set;
  • Normalization model;
  • The table with trained models;
  • The calibration tables;
  • The error rate value as a flow variable.

Since we are going to optimize the error rate, we will put the Workflow Executor node inside an optimization loop, which is set to iterate over the error rate values in the interval [0.01; 0.2] with a step of 0.01 (see figure 5).

Figure 5. Running the deployed workflow to find the best error rate value: read the workflow, then provide a new data set, the normalization model, the tables with the trained models and the calibration tables, and the error rate value.
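
Outside of KNIME, the same sweep can be sketched in a few lines: the p-values are computed once, and only the thresholding changes for each candidate error rate. It is shown here on the labelled hold-out from the earlier sketches for brevity; in the workflow, the loop runs on the deployed pipeline.

# Illustration only: sweep the error rate from 0.01 to 0.2 in steps of 0.01.
import numpy as np
import pandas as pd

rows = []
for eps in np.arange(0.01, 0.201, 0.01):
    s = prediction_sets(pv_median, classes, error_rate=eps)
    rows.append({"error_rate": round(float(eps), 2),
                 "efficiency": efficiency(s),
                 "validity": validity(s, control_norm["Class"])})
metrics = pd.DataFrame(rows)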

After the loop is executed, we are interested in the overall results, so we filter out the metrics for the individual classes and plot the efficiency and validity values (see figure 6).

This plot helps us select the best error rate, but users should ask themselves: “What kind of predictions are we looking for?” If one does not want to miss any correct prediction, even when it is part of a set with other classes, then smaller error rate values are preferable, since validity is highest there; in our case this corresponds to an error rate of 0.01 (shown on the plot in orange). On the other hand, if one is after the largest number of single predictions, it is better to pick the value where the efficiency curve peaks; in our case, an error rate of 0.07 (shown on the plot in blue). Another criterion could be a balance between validity and efficiency, so that the predictions are reasonable in terms of both metrics; this corresponds to an error rate of 0.06 (shown on the plot in green).

Figure 6. The line plot shows the efficiency (blue) and validity (orange) values as a function of the error rate. The highlighted points mark the error rate values proposed as the best choices for the different cases.
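
The three selection strategies discussed above can be expressed directly on the metrics table from the previous sketch; “balanced” is defined here, somewhat arbitrarily, as the error rate where the two curves are closest, and other definitions are equally valid.

# Illustration only: three ways to pick an error rate from the sweep results.
best_validity   = metrics.loc[metrics["validity"].idxmax(),   "error_rate"]
best_efficiency = metrics.loc[metrics["efficiency"].idxmax(), "error_rate"]
balanced        = metrics.loc[(metrics["validity"] - metrics["efficiency"])
                              .abs().idxmin(), "error_rate"]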

Note. Please consider that when you run the workflow, you might get slightly different results for predictions, optimal values of error rate, etc. In general, however, results should be reasonably close to those presented in this article.

Conformal predictions analysis

I prefer to select the error rate value that balances both metrics (error rate = 0.06). Once the error rate is selected, we can take the control data set we saved at the beginning and feed it into the Workflow Executor node. The upper output port of the Workflow Executor node returns the output of the Conformal Scorer node, and the lower output port returns the table with the predictions. Let’s take a look at the Conformal Scorer output (see figure 7). In this table, we see the validity and efficiency values for each class. Other useful metrics are the match counts (a minimal sketch of how they can be computed follows figure 7):

  • Exact match — the number of single-class predictions that are correct;
  • Soft match — the number of predictions that contain the correct class mixed with other classes;
  • Error — the number of prediction sets that do not contain the correct class;
  • Null predictions — the number of empty prediction sets; nulls are always considered errors.

Figure 7. The output of the Conformal Scorer node. The table shows the metrics for conformal predictions per class and in total.
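
The four counts can be reproduced from the prediction sets of the earlier sketches as follows (illustration only; here Null predictions are also counted as errors, following the note above).

# Illustration only: exact/soft/error/null counts for a list of prediction sets.
def match_counts(sets, y_true):
    counts = {"exact": 0, "soft": 0, "error": 0, "null": 0}
    for s, t in zip(sets, y_true):
        if len(s) == 0:
            counts["null"] += 1
            counts["error"] += 1      # an empty set is always an error
        elif t not in s:
            counts["error"] += 1      # set does not contain the true class
        elif len(s) == 1:
            counts["exact"] += 1      # single, correct prediction
        else:
            counts["soft"] += 1       # true class inside a multi-class set
    return counts

print(match_counts(sets, control_norm["Class"]))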

Now let’s draw some insights from this table:

  • Class Bombay has 3 errors, and all of them are Null predictions (empty prediction sets), despite Bombay being a minority class. It seems that this type of bean is quite distinct, so it does not get mixed with other classes; the Null predictions may simply mean that these 3 samples had anomalous feature values for this type of bean;
  • The largest number of Soft match predictions was obtained for Dermason (even if we take the ratio of Soft match to Total). This indicates that this class was most often mixed with others. Probably, since it is the majority class, its beans are more likely to have “average” parameter values, so it can easily be confused with other classes.
  • The Cali bean seems to be the easiest to classify since it has the highest exact match rate and the smallest error ratio.

These three insights can be useful in many ways: for example, to understand how the predictions can be interpreted, how feature importance can be estimated, which bean classes usually get mixed together, and so on. Elaborating further on how these insights can be used is, however, out of the scope of this article.

“Simple” case

The workflow also contains a second branch that does basically the same thing and also includes the nodes for Integrated Deployment, so once you run both branches you can compare them. The reason the second branch is called “simple” is that it does not contain any loops: only one model is trained and only one calibration table is produced. In this case, the Conformal Classification node combines the functionality of the Conformal Calibrator and Conformal Predictor nodes. Its inputs are a calibration table without ranks (just plain predictions for the calibration data set) and the unlabeled or test data set with model predictions; its output is a table with the prediction sets.
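
In code terms, the “simple” branch roughly corresponds to a single calibration pass followed directly by prediction, reusing the helper functions from the earlier sketches (again, a sketch of the idea rather than the node’s implementation).

# Illustration only: single model, single calibration table, direct prediction sets.
model, calib_scores = calibrate_once(rest_norm, selected_features, seed=0)
pv_simple = p_values(model, calib_scores, control_norm[selected_features])
sets_simple = prediction_sets(pv_simple, model.classes_, error_rate=0.06)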

This case is perhaps better suited for beginners, for those who would like to implement a solution quickly, or for situations where you are comparing multiple model types with conformal prediction and do not want to spend time training several models of each type. The results should still be quite satisfactory, but having multiple rounds of calibration and prediction (as in the “advanced” case) usually gives more robust results. Hence, I encourage you to give both a try and run your own comparison between the “advanced” and “simple” implementations.

Conclusion

In this article, I showed how to use the conformal prediction nodes in KNIME for a multi-class classification problem. I described how to build the “advanced” and the “simple” case, using multiple models and calibration tables, and a single pair, respectively. Both approaches were combined with the nodes for Integrated Deployment, so the workflow produces and automatically deploys both the “advanced” and the “simple” pipeline. We also discussed which metrics can be used to assess conformal predictions and how the error rate can be optimized.

Disclaimer

This blog post was written as a private initiative and is not in any way related to my current employer.


Artem Ryasik
Low Code for Data Science

PhD in Biophysics, data scientist. Interested in graphs, NLP, data anonymization. KNIME enthusiast. www.linkedin.com/in/artem-ryasik