ROC curve and AUC

Davide Bacchini
Published in AI Odyssey · Apr 22, 2024

Imagine that you’ve just built a machine-learning model to recognize a certain pattern and make predictions, but you don’t know how accurate it is. How can you validate it and verify that it’s the proper model for your prediction task? Suppose your model predicts who will pass an exam based on how many hours each student studied for it. In this problem, the input is a continuous variable (hours studied) and the output is a binary variable (pass or fail). Given its nature, logistic regression seems the best way to address this problem.
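As a minimal sketch of that setup (the hours and outcomes below are invented for illustration), a logistic regression can be fit in R with glm():

# Hypothetical training data: hours studied and exam outcome (1 = pass, 0 = fail)
hours  <- c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5)
passed <- c(0,   0, 0,   0, 1,   0, 1,   1, 1,   1)

# Fit a logistic regression: P(pass) as a function of hours studied
model <- glm(passed ~ hours, family = binomial)

# Predicted probability of passing for a student who studied 3 hours
predict(model, newdata = data.frame(hours = 3), type = "response")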

This logistic regression (blue line) tells us the probability that a student will pass the exam based on the hours they studied for it. However, if we want to classify students as passing or failing, we need a way to turn probabilities into classifications. One way is to set a threshold, for example 0.5, and classify every student with a predicted probability of passing greater than 0.5 as passing and the others as failing. To evaluate the effectiveness of this logistic regression, we can test it on students whose exam outcomes we already know. With this threshold, all the students are correctly classified except for the two at points 2 and 5, whose actual outcomes are the opposite of what the model predicted. We can now use a confusion matrix to summarize the classification and calculate the sensitivity and specificity of this logistic model with a threshold of 0.5.
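Continuing the sketch above (same invented data), the 0.5 threshold, the confusion-matrix counts, and the two metrics look like this:

# Turn probabilities into 0/1 classifications with a 0.5 threshold
probs     <- predict(model, type = "response")
predicted <- ifelse(probs > 0.5, 1, 0)

# Confusion-matrix counts
TP <- sum(predicted == 1 & passed == 1)   # true positives
FP <- sum(predicted == 1 & passed == 0)   # false positives
TN <- sum(predicted == 0 & passed == 0)   # true negatives
FN <- sum(predicted == 0 & passed == 1)   # false negatives

sensitivity <- TP / (TP + FN)   # proportion of actual passes caught
specificity <- TN / (TN + FP)   # proportion of actual fails caught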

[Figure: confusion matrix for the 0.5 threshold. Credits: StatQuest.org]

Depending on the results we want to obtain, we could set a different threshold in our model. For instance, if we are screening for people potentially infected by a disease, we set a low threshold and accept a larger number of false positives, because the model will then catch nearly every infected person. Conversely, with a high threshold we are unlikely to include any false positives, at the cost of missing more true positives. So, how do we find the best threshold value for our model? Calculating a confusion matrix for every possible threshold is tedious and expensive. Fortunately, the Receiver Operating Characteristic (ROC) graph provides a simple way to summarize all this information.
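To make that trade-off concrete, here is a small sketch (again on the invented data above) comparing a permissive and a strict threshold:

# A low threshold catches more positives but admits more false positives;
# a high threshold does the opposite
low  <- ifelse(probs > 0.2, 1, 0)
high <- ifelse(probs > 0.8, 1, 0)

c(FP = sum(low == 1 & passed == 0),  FN = sum(low == 0 & passed == 1))    # low threshold
c(FP = sum(high == 1 & passed == 0), FN = sum(high == 0 & passed == 1))   # high threshold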

[Figure: ROC curve, TPR vs. FPR. Credits: evidentlyai.com]

The True Positive Rate (TPR) is plotted on the y-axis against the False Positive Rate (FPR) on the x-axis. Here, the TPR is the proportion of students who actually passed the exam and were predicted correctly by the model, whereas the FPR is the proportion of students who failed but were classified as passing. The TPR is also called sensitivity, while the FPR equals 1 - specificity, where specificity is the proportion of the negative class that is correctly classified.
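In terms of the confusion-matrix counts from the earlier sketch, the two rates are:

TPR <- TP / (TP + FN)   # true positive rate = sensitivity
FPR <- FP / (FP + TN)   # false positive rate = 1 - specificity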

As you adjust the classification threshold of the logistic regression, the ROC curve captures the resulting change in TPR and FPR, storing each pair as a point (see the black points below and the sketch after the figure).

[Figure: ROC points traced as the threshold varies. Credits: StatQuest.org]
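A minimal way to trace those points by hand (still on the invented data) is to sweep the threshold and record one (FPR, TPR) pair per value:

# One (FPR, TPR) point for each threshold
thresholds <- seq(0, 1, by = 0.1)
roc_points <- t(sapply(thresholds, function(thr) {
  pred_t <- ifelse(probs > thr, 1, 0)
  c(FPR = sum(pred_t == 1 & passed == 0) / sum(passed == 0),
    TPR = sum(pred_t == 1 & passed == 1) / sum(passed == 1))
}))

# Plot the curve with the random-guessing diagonal for reference
plot(roc_points[, "FPR"], roc_points[, "TPR"], type = "b",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)   # chance-level diagonal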

The optimal threshold is the one that best matches your problem: it either maximizes the number of true positives while keeping false positives to a minimum, or maximizes the number of true negatives while keeping false negatives to a minimum. The green line shows the performance of random guessing (chance level) across all thresholds.
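One common heuristic for choosing that threshold (not mentioned in the original article) is Youden's J statistic, which picks the ROC point farthest above the diagonal:

# Youden's J: distance of each ROC point above the chance diagonal
j    <- roc_points[, "TPR"] - roc_points[, "FPR"]
best <- thresholds[which.max(j)]
best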

So, the farther the ROC curve sits above the diagonal line, the better the model distinguishes between passing and failing students. The AUC (Area Under the Curve) plays a fundamental role in analyzing the ROC curve, since it summarizes the model’s performance in a single metric from 0 (worst) to 1 (best). An AUC of 0.5 means our model predicts no better than random guessing, while a value closer to 1 indicates a strong model. The AUC is also fundamental for comparing different machine-learning models and identifying the most accurate among them. Imagine you used two models to predict whether a student will pass the exam: a logistic regression with an AUC of 0.85 and a random forest with an AUC of 0.7. Identifying the better model is immediate: the logistic regression is the more accurate one, since its AUC is greater. As a data scientist, you need to understand these tools and add them to your skill set to properly evaluate classification models.
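To compute that single number in R, one option is roc.area() from the same verification package used below; here it is applied to the invented data from the sketches above:

library(verification)

# roc.area(observed, predicted) returns a list; its $A element is the AUC
roc.area(passed, probs)$A

The snippet below then plots a full ROC curve with the package’s roc.plot() function on a small toy data set: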

install.packages("verification")
library(verification)

# Observed outcomes (1 = pass, 0 = fail)
obs <- c(0, 0, 0, 1, 1, 1)
# Predicted probabilities of passing from the model
pred <- c(0.7, 0.7, 0, 1, 0.5, 0.6)

data <- data.frame(obs = obs, pred = pred)

# roc.plot(observed, predicted) draws the ROC curve
roc.plot(data$obs, data$pred)
