ROC Curve and AUC in Machine Learning and the R pROC Package

Ruchi Deshpande · Published in The Startup · Jun 15, 2020

The world is facing a unique crisis these days, and we are all stuck in a never-seen-before lockdown. Since all of us are trying to use this time productively, I thought of writing a few blog posts on data concepts I know, not only to share them with the community but also to develop a deeper understanding of each concept as I write it down.

The first one is about the most loved evaluation metric: the ROC curve.

The ROC (Receiver Operating Characteristic) curve is a way to visualize the performance of a binary classifier.

Understanding the confusion matrix

In order to understand the ROC curve and AUC, it is important to understand the confusion matrix first.

[Figure: the confusion matrix. Image by author.]

TPR = TP/(TP+FN)

FPR = FP/(TN+FP)

TPR or True Positive Rate answers the question: when the actual classification is positive, how often does the classifier predict positive?

FPR or False Positive Rate answers the question: when the actual classification is negative, how often does the classifier incorrectly predict positive?
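To make these two formulas concrete, here is a tiny R sketch with made-up counts (the numbers are purely illustrative, not from any dataset used later in this post):

# Hypothetical confusion-matrix counts, purely for illustration
TP <- 80; FN <- 20   # actual positives
FP <- 10; TN <- 90   # actual negatives

TPR <- TP / (TP + FN)   # how often actual positives are predicted positive
FPR <- FP / (TN + FP)   # how often actual negatives are wrongly predicted positive

TPR   # 0.8
FPR   # 0.1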

To understand this more clearly, let us take an example from the current COVID situation. Assume that we have data for COVID patients and, using some classifier, we were able to classify each patient as positive or negative.
Without going into further details, let us now look at the distribution of the predicted scores for the two classes. For simplicity, assume that the data is balanced, i.e. the negative and positive classes are roughly equal in size, and that the scores for each class follow a normal distribution.

[Figure: overlapping score distributions for the negative (pink) and positive (green) classes. Image by author.]

In the above graph, my classifier is doing a great job of separating the positive and negative patients. If I calculate the accuracy of such a model, it will be quite high. Now, for different values of the threshold, I can go ahead and calculate my TPR and FPR. According to the graph, let us assume my threshold = 0.5: of the patients for which my classifier predicted a probability of 0.5, half were negative and half were positive. Similarly, I can check other thresholds as well. For every threshold, TPR would be all patients in the green area to the right of the threshold line divided by the total patients in the green area, and FPR would be all patients in the pink area to the right of the threshold line divided by the total patients in the pink area.
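To see this threshold sweep in code, here is a small sketch on simulated scores (the distributions, means and sample sizes are invented purely for illustration, not real COVID data):

# Simulated scores: negatives centred lower, positives centred higher
set.seed(1)
neg_scores <- rnorm(500, mean = 0.35, sd = 0.12)   # actual negatives (pink)
pos_scores <- rnorm(500, mean = 0.65, sd = 0.12)   # actual positives (green)

thresholds <- seq(0, 1, by = 0.01)

# For each threshold: TPR = share of positives to its right,
#                     FPR = share of negatives to its right
tpr <- sapply(thresholds, function(t) mean(pos_scores > t))
fpr <- sapply(thresholds, function(t) mean(neg_scores > t))

head(data.frame(threshold = thresholds, TPR = tpr, FPR = fpr))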

ROC Curve

Now, if I plot this data on a graph, I get a ROC curve.
The ROC curve is the graph of TPR on the y-axis against FPR on the x-axis for all possible thresholds. Both TPR and FPR vary from 0 to 1.

[Figure: ROC curves for a good classifier, a bad classifier, and the random-classifier diagonal. Image by author.]

Therefore, a good classifier will have a curve that bows towards the upper left, further away from the random-classifier line.
To quantify how good a classifier is from its ROC curve, we use the AUC (Area Under the Curve). From the graph it is quite clear that a good classifier will have a higher AUC than a bad one, since the area under its curve is larger.
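Continuing the simulated sketch above, we can plot TPR against FPR and approximate the area under the curve with the trapezoidal rule (this is only an illustration of the idea, not necessarily how any particular package computes AUC):

# Plot TPR against FPR for all thresholds
plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR", main = "ROC curve")
abline(0, 1, lty = 2)   # random-classifier diagonal

# Trapezoidal approximation of the area under the curve
# (sort the points by FPR first so the segment widths are positive)
ord <- order(fpr)
x <- fpr[ord]; y <- tpr[ord]
auc_approx <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
auc_approx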

From the above discussion, it is evident that ROC is a more robust evaluation metric than, say, accuracy or misclassification error, because ROC takes all possible threshold levels into account, whereas a metric like misclassification error considers only a single threshold.
The choice of threshold depends on the business problem or domain knowledge. In our COVID example above, I would be okay with a high FPR and would therefore keep my threshold low, to ensure that as many COVID-positive patients as possible are caught.

Key points for ROC Curve

There are a few important points regarding the ROC curve, which in a way also summarize this blog:

  • ROC curve is a curve plotted with FPR on x-axis and TPR on y-axis
  • ROC curve works well with unbalanced datasets also
  • ROC analysis can be extended to problems with more than two classes, e.g. by drawing one-vs-rest curves for each class
  • Closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test
  • ROC curve takes into account all possible threshold levels, although the threshold values themselves are not labelled on the curve
  • The terms Sensitivity and Recall of a classifier are the same as TPR, and FPR is also referred to as (1 - Specificity).

ROC curve using an example dataset

Now let us explore a simple dataset to build a classifier in R and use ROC as the evaluation metric.
I have used a college admissions dataset that I found on the UCLA site.

Let us read this data and view its summary:

raw <- read.csv("Admission.csv")
summary(raw)

##      admit             gre             gpa             rank
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000

head(raw)

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

dim(raw)

## [1] 400 4

Here ‘admit’ is the dependent variable and it is a binary classification problem.
I have checked for missing values but did not do any further pre-processing of the data, since my objective here is to demonstrate ROC curves and not model fine-tuning.

library(DataExplorer)
plot_missing(raw)

[Figure: plot_missing() output showing 0% missing values for every column.]

There are no missing values. Next, let us partition this data into training and validation datasets.

set.seed(123)
partition <- sample(2, nrow(raw), replace=TRUE, prob=c(0.7, 0.3))
tdata <- raw[partition==1,]
vdata <- raw[partition==2,]

dim(tdata)
## [1] 285 4

dim(vdata)
## [1] 115 4

vdata_X <- vdata[,-1]
vdata_Y <- vdata[,-(2:4)]

I have used two classifiers here: logistic regression and a support vector machine (SVM).

# Logistic Regression
LR_fit <- glm(admit ~ ., data = tdata, family = binomial())
summary(LR_fit)

##
## Call:
## glm(formula = admit ~ ., family = binomial(), data = tdata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.6226  -0.9052  -0.6161   1.1109   2.1483
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.727209   1.411025  -3.350 0.000808 ***
## gre          0.001796   0.001280   1.403 0.160601
## gpa          1.249248   0.395265   3.161 0.001575 **
## rank        -0.522473   0.150021  -3.483 0.000496 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 365.52  on 284  degrees of freedom
## Residual deviance: 329.29  on 281  degrees of freedom
## AIC: 337.29
##
## Number of Fisher Scoring iterations: 3

# Predicted probabilities on the validation set, binarised at 0.6
LR_predict <- predict(LR_fit, newdata = vdata_X, type = "response")
LR_predict_bin <- ifelse(LR_predict > 0.6, 1, 0)

# Confusion matrix
cm_lr <- table(vdata_Y, LR_predict_bin)

# Accuracy
accuracy <- sum(diag(cm_lr)) / sum(cm_lr)
accuracy

## [1] 0.7565217

The accuracy of the logistic regression model is about 75%, and I have chosen a threshold of 0.6 here.
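As a small side check (this snippet is my addition, not part of the original notebook), we can reuse LR_predict and vdata_Y from above to see how the accuracy moves with the threshold:

# Accuracy of the logistic regression at a few different thresholds
thr <- c(0.3, 0.4, 0.5, 0.6, 0.7)
acc_by_thr <- sapply(thr, function(t) {
  pred_bin <- ifelse(LR_predict > t, 1, 0)
  mean(pred_bin == vdata_Y)
})
data.frame(threshold = thr, accuracy = acc_by_thr)

Each threshold gives a different accuracy, which is exactly why looking at the full ROC curve is more informative than any single number. Now let us try the SVM.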

# SVM
library(e1071)
svm_fit <- svm(admit ~ ., data = tdata, kernel = "linear", cost = 1, scale = FALSE)
svm_predict <- predict(svm_fit, newdata = vdata_X, type = "response")

# SVM confusion matrix (binarising the continuous predictions at 0.5)
svm_predict_bin <- ifelse(svm_predict > 0.5, 1, 0)
cm_svm <- table(vdata_Y, svm_predict_bin)

# Accuracy
svm_accuracy <- sum(diag(cm_svm)) / sum(cm_svm)
svm_accuracy

## [1] 0.7565217

Here too, the accuracy is almost the same, around 75%. Now, let us plot a ROC curve for both models.

I have used the package pROC to plot ROC curves here.

library(pROC)

par(pty="s")
lrROC <- roc(vdata_Y ~ LR_predict, plot=TRUE, print.auc=TRUE, col="green", lwd=4,
             legacy.axes=TRUE, main="ROC Curves")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

svmROC <- roc(vdata_Y ~ svm_predict, plot=TRUE, print.auc=TRUE, col="blue", lwd=4,
              print.auc.y=0.4, legacy.axes=TRUE, add=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

legend("bottomright", legend=c("Logistic Regression","SVM"), col=c("green","blue"), lwd=4)

[Figure: roc() output, ROC curves for logistic regression (green) and SVM (blue) with their AUC values printed on the plot.]

So, although the accuracy of the two models is almost the same, the ROC curves give a better picture of which model is performing better. To quantify this, the AUC values are also printed on the plot, and they show the SVM to be a slightly better classifier than logistic regression in this scenario.
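If you prefer exact numbers to reading the AUC off the plot, the roc objects returned above can be queried directly; for example, auc() extracts the AUC and roc.test() compares the two curves (the choice of DeLong's test here is mine, not from the original post):

# Extract the AUC from each roc object
auc(lrROC)
auc(svmROC)

# Formal comparison of the two correlated ROC curves (DeLong's test)
roc.test(lrROC, svmROC, method = "delong")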

A few points about the pROC package and the roc function

A ROC curve can obviously be plotted in many ways, and it is not necessary to use the pROC package. In case some of you wish to use it, here are a few points to keep in mind:

  • The roc function by default plots Sensitivity against Specificity, not against (1 - Specificity), so the x-axis runs in reverse. If you want to plot it against (1 - Specificity), use legacy.axes=TRUE.
  • To get a neat square box around the ROC curve and remove the extra side margins, use par(pty="s").
  • By default, the roc function generates a single curve for a given predictor and response; if you want to overlay multiple curves in one plot, as I have done above, use add=TRUE. (A ggplot2-based alternative is sketched below.)
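If you prefer ggplot2 graphics, pROC also ships a ggroc() helper. A minimal sketch, reusing the lrROC and svmROC objects from above (the legend label and styling choices are my own):

library(ggplot2)

# Overlay both ROC curves in a single ggplot, with 1 - Specificity on the x-axis
ggroc(list("Logistic Regression" = lrROC, "SVM" = svmROC), legacy.axes = TRUE) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # random-classifier diagonal
  labs(colour = "Model")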

I hope this blog was helpful in building a better understanding of ROC curves. It certainly helped me, and I also learned how to share my R Notebooks on WordPress through this blog.
I will be documenting and sharing many more such tutorials.

Thanks for reading. Happy to receive your feedback, comments!!

References

  1. Tom Fawcett, "An Introduction to ROC Analysis"
  2. https://www.dataschool.io/roc-curves-and-auc-explained/
  3. https://www.youtube.com/watch?v=qcvAqAH60Yw

Happy to connect on LinkedIn!
