Experimenting with the Confusion Matrix for Regression — A powerful model analysis tool

Dave Cote, M.Sc.
8 min read · May 24, 2022


Table of contents:

  • 1- Confusion Matrix (CM)
  • 2- Multiclass Confusion Matrix (MCM)
  • 3- Regression Confusion Matrix
  • 4- RCM Accuracy AUC Score
  • 5- RCM F1 AUC Score
  • 6- Conclusion

1-Confusion Matrix (CM)

The confusion matrix is a popular tool for summarizing the performance of a classification algorithm (a model that predicts a discrete variable): it gives you a clear view of what is correctly predicted and what types of errors your algorithm is making. The confusion matrix is an N x N matrix, where N is the number of classes or outputs. For 2 classes, we obtain a 2 x 2 confusion matrix.

The name "confusion" stems from the fact that it makes it easy to see whether the system is confusing (mislabeling) two classes. It has two dimensions, "actual" vs "predicted", and for a 2 x 2 matrix (binary problems) it produces 4 count metrics:

  • True Positive (TP): predicted(1), actual(1)
  • False Positive (FP): predicted(1), actual(0)
  • True Negative (TN): predicted(0), actual(0)
  • False Negative (FN): predicted(0), actual(1)

Those 4 count metrics are also very useful for calculating a whole family of classification performance metrics like precision, recall, F1 score and so on.
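As an illustration, here is how the usual metrics fall out of those four counts. This is a minimal sketch; the counts themselves are made-up numbers, not taken from any model in this article:

```python
# Deriving the classic metrics from the four counts (illustrative numbers)
tp, fp, tn, fn = 40, 10, 45, 5

accuracy = (tp + tn) / (tp + fp + tn + fn)   # overall fraction correct
precision = tp / (tp + fp)                   # how trustworthy the "1" predictions are
recall = tp / (tp + fn)                      # how many actual "1"s were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```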

2-Multiclass Confusion Matrix (MCM)

The binary confusion matrix can be extended for multiclass classification problems. For example, for 3 classes we obtain a 3 X 3 confusion matrix.

How do we calculate TP, FN, FP, TN here? The logic is much the same, but we have to sum the values for each class. For example, using the 3 x 3 confusion matrix above we obtain:
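The per-class sums can be computed directly from the matrix. A minimal sketch, assuming the usual convention of rows = actual and columns = predicted (the numbers in the matrix are illustrative):

```python
# Per-class TP/FN/FP/TN from a 3 x 3 confusion matrix
# (rows = actual, columns = predicted; values are illustrative)
import numpy as np

cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 7]])

tp = np.diag(cm)                  # correct predictions, per class
fn = cm.sum(axis=1) - tp          # actual members of the class that were missed
fp = cm.sum(axis=0) - tp          # samples wrongly assigned to the class
tn = cm.sum() - (tp + fn + fp)    # everything else

print(tp, fn, fp, tn)
```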

Once we have our TP, FN, FP, TN for each class, there are two main approaches to aggregating the performance metrics, called "micro" and "macro" averaging. The difference between macro and micro averaging is that macro weighs each class equally whereas micro weighs each sample equally. Macro and micro will give the same score if we have an equal number of samples in each class.
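In scikit-learn, the two averaging schemes are exposed through the `average` parameter of the metric functions. A small sketch with made-up labels:

```python
# Micro vs macro averaging with scikit-learn (illustrative labels)
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # each class counts equally
micro_f1 = f1_score(y_true, y_pred, average="micro")  # each sample counts equally

print(macro_f1, micro_f1)
```

Note that for single-label multiclass problems, micro-averaged F1 equals plain accuracy.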

3-Regression Confusion Matrix

Continuous data is measured while discrete data is counted… Is it possible to extend the use of the confusion matrix from a discrete variable problem (classification) to a continuous variable problem (regression)?

To make that possible, we first have to convert our continuous variable to a discrete form!

How can we do that? Simply by discretizing the continuous values into regions (bins). Discretization is the process of transforming continuous variables into a discrete form. We do this by creating a set of contiguous intervals:

Let’s do it!

Suppose we have this simple regression problem, X vs Y:

  1. The first step is to fit our regression model:

Code example :
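The original code was shown as an image; here is a minimal sketch of what this step might look like, assuming a synthetic X vs Y dataset and scikit-learn's LinearRegression (the data generation and variable names are illustrative, not the article's actual data):

```python
# Step 1 sketch: fit a linear regression on a synthetic X vs Y problem
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 1))
y = 50 * X[:, 0] + rng.normal(0, 30, size=500)  # noisy linear target

# Hold out a test set: the confusion matrix must be built on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
```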

2. Second, we discretize the target Y into N bins. We will start with 4 bins. We can use the pandas "qcut" function to do that:

Code example :
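A minimal sketch of the qcut call, with illustrative target values (column names are assumptions, not the article's):

```python
# Step 2 sketch: quantile-based discretization of the continuous target
import pandas as pd

df = pd.DataFrame({"y_true": [150, 220, 280, 310, 360, 420, 460, 510]})

# qcut builds quantile-based bins, so each bin gets roughly the same count
df["y_true_bin"] = pd.qcut(df["y_true"], q=4)

print(df["y_true_bin"].value_counts())
```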

Result:

We can see that we have almost the same sample size in each discretized bin. Here is the new dataframe:

3. Next, we have to merge the Y true continuous intervals with the Y pred results, like this:
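One way to sketch this step: label the predictions with the same interval edges that were learned on the true target, so both columns share one set of classes. The values and names below are illustrative:

```python
# Step 3 sketch: bin the predictions with the SAME edges as the true target
import numpy as np
import pandas as pd

y_true = pd.Series([150, 220, 280, 310, 360, 420, 460, 510])
y_pred = pd.Series([180, 200, 300, 290, 340, 430, 400, 520])

# retbins=True returns the quantile edges; labels=False gives integer codes
true_bin, edges = pd.qcut(y_true, q=4, retbins=True, labels=False)

# Widen the outer edges so out-of-range predictions still land in a bin
edges[0], edges[-1] = -np.inf, np.inf
pred_bin = pd.cut(y_pred, bins=edges, labels=False)

binned = pd.DataFrame({"y_true_bin": true_bin, "y_pred_bin": pred_bin})
print(binned)
```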

Result :

We are now ready to calculate our multiclass confusion matrix for regression! (Don't forget to use the test set or validation set, as we don't want to evaluate performance on the data the model was trained on.)
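Once both columns carry bin labels, this is just an ordinary multiclass confusion matrix. A sketch with illustrative bin codes:

```python
# With both targets binned, standard classification tooling applies directly
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_bin = [0, 0, 1, 1, 2, 2, 3, 3]   # illustrative bin codes from step 3
y_pred_bin = [0, 0, 1, 1, 2, 2, 2, 3]

cm = confusion_matrix(y_true_bin, y_pred_bin)   # rows = actual, cols = predicted
acc = accuracy_score(y_true_bin, y_pred_bin)

print(cm)
print(acc)
```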

LINEAR REGRESSION — CONFUSION MATRIX (4 BINS)

With this simple 4 x 4 confusion matrix, we can see that our model is not breathtaking. A perfect model would produce a perfectly centered dark diagonal, which is far from the case here:

Perfect model example:

With the linear regression, we have more of a reverse "L" shape than a diagonal. We have a ROC AUC of 0.5250 and an accuracy of 0.2877. Looking at the regression metrics, we have an R2 of 0.0695 and an RMSE of 129.303.

This is nice because we can now use pretty much all the classification metrics to benchmark the performance of our regression model.

Now, if we do the same with a default CatBoostRegressor:

CATBOOST REGRESSION — CONFUSION MATRIX (4 BINS)

That looks a lot better! We begin to see a centered diagonal. We now have a ROC AUC of 0.64873 and an accuracy of 0.47308. Looking at the regression metrics, we have an R2 of 0.6323 and an RMSE of 81.2865.

What happens if we split our continuous target into more bins?

In this context, more bins means a finer-grained confusion matrix for our regression model.

However, more bins also means fewer samples per class, which makes the evaluation less reliable. In general, the more bins we add, the more the model's classification scores should drop, since we are looking at the data much more precisely.

Let's now try 10 bins with the Linear Regression model:

LINEAR REGRESSION — CONFUSION MATRIX (10 BINS)

Wow, now we can see much more precisely how the model fits the continuous data and where the errors are. Our linear regression model is not capable of predicting the first bins (< 279) or the last bins (> 447), as there are 0 counts in those predicted columns. As expected, the accuracy has now dropped from 0.2877 to 0.09055. Note that the RMSE and R2 remain the same, as those metrics are calculated on the overall continuous results.

If we do the same with the CatBoost model:

CATBOOST REGRESSION — CONFUSION MATRIX (10 BINS)

Again, we see a more centered diagonal compared to the linear regression. Accuracy decreased from 0.47308 to 0.14438, a relative drop comparable to the linear regression's (both roughly 69%), but CatBoost still has the better overall accuracy by a wide margin.

The nice thing about using the confusion matrix for regression problems is that it gives you a powerful visual tool for interpreting the overall fit of the model and for spotting problems in specific regions of the continuous data, even for multivariate problems where it is not possible to plot X vs Y (you can also see my previous article, Visualization trick for multivariate regression problems, for more tools).

We can add one last model, a polynomial regression, and compare the 3 models at 4, 10 and 24 bins:

POLYNOMIAL REGRESSION
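The polynomial regression code was not shown; a minimal sketch of how it could be built as a scikit-learn pipeline follows. The degree and the synthetic data are assumptions, not values from the article:

```python
# Polynomial regression sketch: polynomial features feeding a linear model
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 1.0, size=200)   # curved target

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
```

The fitted pipeline can then be binned and scored exactly like the other two models.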

CONFUSION MATRICES BENCHMARK

4-RCM Accuracy AUC Score

Can we calculate a classification metric that takes multiple bin counts into account, to get a better performance evaluation and benchmark models against each other?

We will experiment with something here… Let's create a new regression-classification metric that we can call the RCM Accuracy AUC Score (Regression Confusion Matrix Accuracy Area Under Curve Score).

First, let's calculate accuracy (it can be any classification metric) at multiple bin counts, for each model.

Then we can do a simple interpolation and draw a curve where Y = performance score and X = number of bins, like this:

Now we can calculate the area under that curve for each model, using the sklearn "auc" function, which computes the area under a curve with the trapezoidal rule:
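A sketch of the computation; the per-bin-count accuracies below are hypothetical placeholders, not the article's measured values:

```python
# RCM Accuracy AUC sketch: area under the (bins, accuracy) curve
from sklearn.metrics import auc

n_bins = [4, 10, 24]                 # x axis: number of bins tried
accuracy = [0.47, 0.14, 0.06]        # y axis: accuracy at each bin count (hypothetical)

rcm_accuracy_auc = auc(n_bins, accuracy)   # trapezoidal rule
print(rcm_accuracy_auc)
```

Because every model is scored over the same range of bin counts, the resulting areas are directly comparable between models.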

Based on this new RCM Accuracy AUC Score, CatBoost has the largest area!

Is it the same story with the F1 score?

5-RCM F1 AUC Score

Based on this new RCM F1 AUC Score, CatBoost again has the largest area under the curve!

6-Conclusion

Final model benchmark:

With the traditional regression performance metrics, the polynomial regression model seems a little better than the CatBoost model on all metrics (RMSE, MAE, R2). However, with the classification metrics applied to the discretized continuous range, CatBoost seems better than the polynomial regression… Why? This is an interesting experiment that needs more digging!

To conclude, as we have experimented, the confusion matrix can also be used as a powerful tool for regression model visualization, analysis and benchmarking.

Even when the regression problem is multivariate, the regression confusion matrix offers an interesting visualization to help understand how the model fits our target and where the errors lie.

For more tools to help visualize multivariate regression, you can also try this trick: Visualization trick for multivariate regression problems


Dave Cote, M.Sc.

Data Scientist at an insurance company. More than 10 years in data science, delivering actionable "data-driven" solutions.