How to not build a three-point ROC curve

Otávio Vasques · DataLab Log · Feb 21, 2020

Introduction

A very common task in machine learning is to evaluate models. Model evaluation is a process in which someone tries to assign a number to a model with respect to some data. This number is called a metric, and there are many different ways to build one. Usually, higher metric values indicate better-performing models, but how we analyze this number and how we apply each kind of metric can have a huge impact on the process of building a machine learning model.

To better understand how the differences between all these metrics can impact our model, we must analyze how they are built and how our models make their final predictions. A very common mistake among people starting to work with machine learning is to ignore the role of the threshold in threshold independent metrics, such as the area under the Receiver Operating Characteristic curve (the famous ROC Curve). In this post, I will try to describe how this kind of metric is built and what to do to avoid the three-point ROC Curve.

It is important to notice that I'm only going to give an overview of binary classification problems; multi-class classification and regression problems raise other issues and discussions that will not be covered in this post.

How models make predictions

Let’s build a simple binary classification model for some sample data. For this, I will use only scikit-learn built-in tools.

Code of a sample binary classifier.
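The embedded gist isn’t reproduced here, but a minimal sketch of a classifier like the one described, using only scikit-learn built-ins, could look like this (the dataset parameters and variable names such as clf, X_test and y_test are my own assumptions, reused in the later snippets):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Generate sample data for a binary classification problem.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a logistic regression and build the confusion matrix from its hard predictions.
clf = LogisticRegression()
clf.fit(X_train, y_train)
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```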

The code above produces the following confusion matrix.

A confusion matrix example. Most of the examples fall on the main diagonal, indicating good performance.

We can see that the model is getting a lot of examples right, but if we take a closer look at the logistic regression expression and compare it with the results in the confusion matrix, we will find something curious.

Logistic function.
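The equation image is not rendered here; written out, the logistic function is

$$p(x) = \frac{1}{1 + e^{-\beta^{T} x}}$$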

The logistic function takes our feature vector x and our weights beta and outputs a value between 0 and 1: not just 0 or 1, but any continuous value in the interval [0, 1]. How does logistic regression map this continuous value of the logistic function to a hard prediction of 0 or 1?

The plot of the logistic function.

The way to convert the continuous output into a hard prediction is to establish a threshold value that splits the predictions into two parts: prediction values below the threshold are classified as class 0 and prediction values above the threshold are classified as class 1. We can visualize the threshold as a horizontal line in the plot below.

Three different options of a threshold for the same logistic function.
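As a sketch, applying a custom threshold to the continuous outputs of the classifier above (again using the assumed names clf and X_test) boils down to a single comparison:

```python
# Continuous scores for the positive class, one per test example.
probs = clf.predict_proba(X_test)[:, 1]

# Any value in [0, 1] could be chosen here; 0.7 is just an illustration.
threshold = 0.7
hard_predictions = (probs > threshold).astype(int)
```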

If we don’t take the model probabilities into account and plug the hard predictions used for the confusion matrix into a ROC curve, we get a three-point curve.

The three-point Receiver Operating Characteristic curve.

This happens because the ROC Curve is a threshold independent metric, i.e. it builds a confusion matrix for every possible threshold value, so we need the continuous output to build it properly. We will discuss the construction of the ROC Curve in detail below.
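A sketch of the difference, using scikit-learn’s roc_curve and the assumed names from the earlier snippet: feeding hard predictions gives only two distinct score values, so the curve collapses to three points, while feeding the probabilities gives one point per distinct score.

```python
from sklearn.metrics import roc_curve

# Wrong: hard 0/1 predictions. Only two distinct "scores" exist, so the curve
# has just three points: (0, 0), one interior point, and (1, 1).
fpr_hard, tpr_hard, thr_hard = roc_curve(y_test, clf.predict(X_test))
print(len(thr_hard))   # 3

# Right: continuous positive-class probabilities, swept over all thresholds.
fpr, tpr, thr = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])
print(len(thr))        # many points, one per distinct score
```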

Almost every model available makes a continuous approximation before mapping it to a hard prediction. In scikit-learn models, these continuous outputs can usually be accessed through the predict_proba method. Although these values lie in the [0, 1] interval, not every model is making a probability estimation. Logistic regression is, but with different loss functions a model may be estimating other quantities that just happen to be continuous and normalized to the [0, 1] interval.
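For illustration, both kinds of continuous output can be pulled from a fitted scikit-learn model (clf is the assumed estimator from the sketch above):

```python
# Probability-like scores: one column per class, we keep the positive class.
probs = clf.predict_proba(X_test)[:, 1]

# Raw decision scores: continuous but not probabilities. They work just as well
# for ranking-based metrics such as the ROC Curve.
scores = clf.decision_function(X_test)
```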

Metrics construction

The confusion matrix above is not a metric; it is a matrix showing all the kinds of errors and successes possible for a binary classifier. To compare models we need a single number that summarizes all the information in the confusion matrix.

The most obvious, and intuitive, metric construction is accuracy. Accuracy is just the ratio of correctly predicted examples over the total number of examples: the sum of the main diagonal of the confusion matrix over the sum of all of its cells.
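In code, with the confusion matrix cm from the earlier sketch, that is simply:

```python
import numpy as np

# Correct predictions (main diagonal) over all predictions (every cell).
accuracy = np.trace(cm) / cm.sum()
```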

The problem with this metric is that on highly unbalanced datasets, i.e. datasets where one class is much more frequent than the other, a very simple classifier gets a super high score: just guessing the most frequent class yields very good accuracy. To overcome this we need to look at other ratios in the confusion matrix.
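scikit-learn even ships this naive baseline as DummyClassifier; a quick sketch (reusing the assumed train/test split) shows how good its accuracy can look on an unbalanced dataset:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# A "model" that always predicts the most frequent class of the training set.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# On a dataset where, say, 95% of the examples belong to one class, this prints
# roughly 0.95 even though the model has learned nothing.
print(accuracy_score(y_test, dummy.predict(X_test)))
```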

The two most popular ratios are precision and recall.
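The formula image is not rendered here; the two ratios are

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}$$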

tp stands for true positive, fp for false positive and fn for false negative. If we take our naive classifier that guesses the most frequent class for every prediction, it will produce a high number of either false positives or false negatives, dragging precision or recall down. Although these new quantities can catch that kind of degenerate behavior, it all still comes down to generating the confusion matrix.

To generate the confusion matrix we must choose a threshold, but there isn’t any reason to choose one particular threshold value. Unbalanced datasets produce predicted probabilities centered around the class ratio, so choosing a value of 0.5 or 0.7 means nothing for classifiers in general. The default behavior of the predict method of scikit-learn linear models is to take the class with the highest continuous prediction, which corresponds to a 0.5 threshold for a two-class problem, but this doesn’t mean anything special for a dataset with unbalanced classes. To overcome the threshold choice, let’s choose all of them.
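A quick sanity check of that claim, with the assumed clf and X_test from before: thresholding the positive-class probability at 0.5 reproduces predict exactly for a binary linear model.

```python
import numpy as np

hard = clf.predict(X_test)
thresholded = (clf.predict_proba(X_test)[:, 1] > 0.5).astype(int)

# Both ways of producing hard predictions agree.
assert np.array_equal(hard, thresholded)
```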

Threshold independent metrics

When we decide not to choose a particular threshold, we start to build the threshold independent metrics such as the ROC Curve and the Precision-Recall Curve. But how are these metrics built?

The change from just taking ratios from a single confusion matrix to these threshold independent metrics is to take the ratios at every possible threshold value. As the name says, the Precision-Recall Curve uses the precision and recall ratios, and the ROC Curve uses the recall (the true positive rate) and the false positive rate.

Taking these ratios for every possible threshold, we can build a threshold-parametrized curve. Here is the ROC Curve for our sample data, but now using the probabilities.

Left: ROC Curve. Right: the same plot as on the left, but with some explicit threshold points.

We can see that it now forms a proper curve, not two line segments, and in the right plot I outlined some of the threshold points used to compute the confusion matrix ratios.
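To make the construction concrete, here is a sketch of what roc_curve does conceptually: sweep a grid of thresholds over the continuous scores and compute the confusion-matrix ratios at each one (names reused from the earlier sketches).

```python
import numpy as np

probs = clf.predict_proba(X_test)[:, 1]

tpr_list, fpr_list = [], []
for threshold in np.linspace(0, 1, 101):
    pred = (probs > threshold).astype(int)
    tp = np.sum((pred == 1) & (y_test == 1))
    fp = np.sum((pred == 1) & (y_test == 0))
    fn = np.sum((pred == 0) & (y_test == 1))
    tn = np.sum((pred == 0) & (y_test == 0))
    tpr_list.append(tp / (tp + fn))  # recall / true positive rate
    fpr_list.append(fp / (fp + tn))  # false positive rate

# Plotting fpr_list against tpr_list traces the ROC Curve.
```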

To turn this into a metric we just take the area under the curve. With this we have a single number: the higher the area, the better the model’s performance. This is because at every threshold we want low false positive rates and high true positive rates. The perfect classifier would produce a curve that is a straight line from (0, 0) to (0, 1) and then a straight line from (0, 1) to (1, 1).
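With scikit-learn this is a one-liner, again using the continuous scores rather than the hard predictions (assumed names as before):

```python
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(auc)
```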

It is important to notice that the dashed red line shows the behavior of a random classifier, so a random classifier should produce a ROC Curve that is almost equal to the diagonal. If your ROC Curve goes below the red dashed line, you probably swapped the positive and negative class probabilities.

Conclusion

There is a variety of metrics out there and a variety of problems. Choose the metric that best suits your problem, but always be aware of how your model produces its predictions and how your metric uses them. Especially in the case of threshold independent metrics, always check that your predictions are continuous outputs and how the metric expects to use them.
