Area Under the ROC Curve — Explained

Sarath S
Mar 21, 2018 · 5 min read


Accuracy is the most common measure of classifier performance. It is easy to understand, but it ignores important factors such as the false positives and false negatives that a model introduces.

For example, consider a model that correctly predicts 97% of transactions as normal but misses the fact that the remaining 3% may be fraudulent. In the real world, fraudulent transactions typically account for less than 1% of all transactions.

Here, accuracy completely overlooks the fraudulent transactions while still reporting a score of 97%. By looking at how the true positive and false positive results trade off against each other, we can evaluate performance much better than with an accuracy score alone. One such evaluation metric is AUC.
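To make this concrete, here is a minimal R sketch with made-up numbers: 1,000 transactions of which only 10 are fraudulent, and a "classifier" that simply labels everything as normal. The accuracy looks excellent even though not a single fraud is caught.

# hypothetical imbalanced data: 990 normal transactions, 10 fraudulent ones
actual    <- factor(c(rep("normal", 990), rep("fraud", 10)), levels = c("normal", "fraud"))
# a trivial "classifier" that labels every transaction as normal
predicted <- factor(rep("normal", 1000), levels = c("normal", "fraud"))
mean(predicted == actual)                       # accuracy = 0.99, looks excellent
sum(predicted == "fraud" & actual == "fraud")   # frauds caught = 0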

Area Under the ROC Curve, often shortened to Area Under the Curve (AUC), is an evaluation metric for measuring the performance of a binary classifier.

Before getting into the details of AUC, let's go through the glossary.

AUC — A numerical summary of the performance of a binary classifier.

ROC — The Receiver Operating Characteristic curve is a visual representation of a binary classifier's performance. The True Positive Rate is plotted against the False Positive Rate to give a visual sense of how well the classifier separates the classes.

True Positive Rate (TPR):
TPR, or sensitivity, is the ratio of true positives to all condition-positive cases.

TPR = True Positive / (True Positive + False Negative)

This corresponds to the first column of the contingency table.

False Positive Rate (FPR):
FPR, or fall-out, is the ratio of false positives to all condition-negative cases.

FPR = False Positive / (False Positive + True Negative)

This corresponds to the second column of the contingency table.

2x2 Contingency table or Confusion Matrix, Source: Wikipedia
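As a quick illustration of the two formulas above, this sketch computes TPR and FPR from a toy 2x2 table with made-up counts (not from the real data used later):

# toy 2x2 contingency table (made-up counts), rows = predicted, columns = actual
TP <- 40; FP <- 5    # predicted positive row
FN <- 10; TN <- 45   # predicted negative row
TPR <- TP / (TP + FN)   # sensitivity: 40 / 50 = 0.8
FPR <- FP / (FP + TN)   # fall-out:     5 / 50 = 0.1
c(TPR = TPR, FPR = FPR)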

Let's work through the concept with the PimaIndiansDiabetes data:

library(mlbench)
data(PimaIndiansDiabetes)
data <- PimaIndiansDiabetes   # short alias used in the rest of the code
head(data, 5)
Top 5 records of PimaIndiansDiabetes

Let us fit a logistic regression model on this data of dimension 768 x 9, with the diabetes column as the target and the remaining 8 numeric columns as features.

library(nnet)
# with a two-level response, multinom() reduces to an ordinary logistic regression
fit <- multinom(diabetes ~ ., data = data)
pred <- predict(fit, data)
# confusion matrix: predicted vs actual labels
tab <- table(pred, data$diabetes)
tab
Confusion matrix

This confusion matrix has the same layout as the 2x2 contingency table we saw earlier, i.e. predicted results versus actual results.

The accuracy of the prediction can be calculated by adding the diagonal entries and dividing by the sum of all the values. The formula can be written as follows.

Accuracy = (True Positive + True Negative)/n

So here the accuracy is (445 + 156)/768, which is about 78%, leaving a misclassification rate of about 22% (1 − accuracy).
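The same number can be pulled straight out of the confusion matrix tab computed above, since its diagonal holds the correct predictions:

# accuracy = correct predictions (diagonal) / all predictions
accuracy <- sum(diag(tab)) / sum(tab)
accuracy        # roughly 0.78
1 - accuracy    # misclassification rate, roughly 0.22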

In addition to accurately predicting who has diabetes and who does not, it matters that we do not flag a non-diabetic patient as diabetic. Similarly, in fraud prediction, an innocent person should not be treated as a suspect.

To achieve this, we have two knobs that control how many false positive and true positive results the model's predictions admit. They are called specificity and sensitivity, and they deal with the false positive rate and the true positive rate respectively.
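For reference, both quantities can be read off the confusion matrix tab from earlier. This sketch assumes the diabetes factor levels are named "neg" and "pos" (as in PimaIndiansDiabetes), with "pos" treated as the positive class:

# sensitivity and specificity from the confusion matrix `tab`
sensitivity <- tab["pos", "pos"] / sum(tab[, "pos"])  # TP / (TP + FN)
specificity <- tab["neg", "neg"] / sum(tab[, "neg"])  # TN / (TN + FP)
c(sensitivity = sensitivity, specificity = specificity)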

Before that, we will look at the distribution of predicted probabilities and find the cut-off probability that distinguishes a diabetic from a non-diabetic patient.

library(ROCR)
# predicted probabilities: for a two-class multinom this is the probability of "pos"
prob <- predict(fit, data, type = 'prob')
hist(prob)
# ROCR prediction object built from the probabilities and the actual labels
pred <- prediction(prob, data$diabetes)
Histogram of predicted probability

From the histogram above, we can see that the predicted probabilities for non-diabetic patients mostly fall below 0.5. Next, we will identify the exact probability cut-off at which the prediction accuracy is highest.

The performance function computes accuracy at each probability cut-off, and the result can be plotted as shown. The maximum accuracy and the cut-off where it occurs can be located by drawing a horizontal and a vertical line using the evaluation summary.

# accuracy at each probability cut-off
eval <- performance(pred, 'acc')
plot(eval)
Probability vs Accuracy (evaluation plot)

In the graph above, accuracy peaks at a cut-off of around 0.5.

plot(eval)
abline(h=0.785,v = 0.486)

Note that the values of h and v come from the evaluation summary, i.e. the y and x coordinates at the maximum, and can be drawn as shown below. Therefore a cut-off probability of about 0.48 gives the maximum achievable accuracy of about 78%.

Accuracy vs Probability cut-off
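Rather than eyeballing the plot, the maximum accuracy and its cut-off can also be pulled from the ROCR performance object directly (its x.values slot holds the cut-offs and y.values the accuracies):

# cut-off with the highest accuracy, read from the performance object
acc_values <- eval@y.values[[1]]
cutoffs    <- eval@x.values[[1]]
best       <- which.max(acc_values)
c(accuracy = acc_values[best], cutoff = cutoffs[best])   # roughly 0.78 at a cut-off near 0.48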

We have seen how to find the cut-off; now we need to plot the ROC curve. To do that, we plot the true positive rate against the false positive rate, both of which can be derived from the model's prediction object.

# true positive rate vs false positive rate at every cut-off
roc <- performance(pred, "tpr", "fpr")
plot(roc)
# diagonal reference line of a random classifier
abline(a = 0, b = 1)

The diagonal line denotes a 50-50 partition of the graph. The closer the curve lies to this line, the worse the classifier performs; at the line it is no better than a random guess. Our model performs reasonably well, since its curve sits well away from the diagonal.

# Calculating Area under Curve
perf <- performance(pred,"auc")
auc <- as.numeric(perf@y.values)
auc

From the above code, the AUC value is calculated as 0.83, which is not bad for a binary classifier.
