Published in Analytics Vidhya

Deep intuition behind the ROC Curve and the Area Under the ROC Curve (AUC)

This blog is a continuation of this post.

For now, I am assuming that you know the Confusion matrix, Accuracy, Precision, Recall, and F1 score.

To repeat: before learning these metrics, you need to know how to build a classification model.

For now, I’ll build a model for you...

Let’s start with the same old code snippet:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

For now, you have a model.

So far, what you’ve understood is: “if you have an imbalanced dataset, then you should go with the F1-Score (which combines both precision and recall)”.

Note:- Please go through my previous blog about classification metrics to understand the things below…

Before diving into ROC_AUC, we need to know a few terms: True Positive Rate (TPR), False Positive Rate (FPR), False Negative Rate (FNR), True Negative Rate (TNR), sensitivity, and specificity.

  1. True Positive Rate (TPR):- The True Positive Rate is the same as Recall: it’s the rate at which actual positive samples get correctly predicted as positive.

TPR = TP / (TP + FN)

True Positive Rate is also called Sensitivity.

If the sensitivity of the model is high, then our model can predict positive samples efficiently.

2. True Negative Rate (TNR):- The True Negative Rate is the rate at which actual negative samples get correctly predicted as negative.

TNR = TN / (TN + FP)

True Negative Rate is also called Specificity.

If the specificity of our model is high, then our model can predict negative samples efficiently.

Note:- Whenever we are building our model, we need to design our model in such a way that sensitivity and specificity are very high.

3. False Negative Rate (FNR):- The False Negative Rate is the rate at which actual positive samples get incorrectly predicted as negative.

FNR = FN / (FN + TP)

Also, (False Negative Rate) + (True Positive Rate) = 1,

i.e., FNR = (1 - sensitivity).

4. False Positive Rate (FPR):- The False Positive Rate is the rate at which actual negative samples get incorrectly predicted as positive.

FPR = FP / (FP + TN)

Also, (False Positive Rate) + (True Negative Rate) = 1,

i.e., FPR = (1-specificity).

Now you know what TPR, TNR, FPR, and FNR are.
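To make these four rates concrete, here is a small sketch (the labels below are made up purely for illustration) that computes all of them from scikit-learn’s confusion matrix:

from sklearn.metrics import confusion_matrix

# made-up true and predicted labels, just for illustration
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]

# for binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # sensitivity / recall
tnr = tn / (tn + fp)   # specificity
fnr = fn / (fn + tp)   # = 1 - sensitivity
fpr = fp / (fp + tn)   # = 1 - specificity
print(tpr, tnr, fnr, fpr)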

If you’re a machine learning beginner, you’ll probably predict the classes like this:

y_pred = model.predict(X_test)

When we run the above line of code, this is what happens:

First, the model predicts the probability of each test sample being positive. Then it compares each probability with a threshold of 0.5: if the predicted probability > 0.5, the test sample is labelled positive, otherwise negative. We can implement this ourselves too.

Let’s see this with an example:

Suppose, for a test sample, predicted_probability = 0.68. This probability is compared with the threshold of 0.5, i.e., in code:

def predict_label(predicted_probability, threshold=0.5):
    if predicted_probability > threshold:
        return 1   # positive
    else:
        return 0   # negative

1 -> Positive

0 -> Negative

Here, by default, our model uses a threshold of 0.5. Whenever the predicted probability > 0.5, the predicted test label is 1; otherwise, it is 0. All of this happens automatically inside the model, and the resulting test labels are collected in y_pred.

y_pred will look something like this: [1, 0, 0, 1, 1, 1, 1, 0].

All the metrics discussed in the previous blog are calculated from predictions produced by this procedure. The complete code below puts the above explanation together.

Read it again if you didn’t understand…

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10, random_state=20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# this is predicted with the default threshold of 0.5, as explained above
y_pred = model.predict(X_test)
# y_pred will look like [1, 0, 0, 1, 1, 1, 1, 0]
accuracy = accuracy_score(y_test, y_pred)

Now, we’ll see another type of prediction in order to understand the ROC_AUC…

So far, you’ve seen a prediction method that you won’t use much in real applications. Now we’ll discuss a prediction method that is used in real-world problem-solving.

The method is simple: instead of relying on the default threshold of 0.5, we predict the probabilities of the test samples ourselves and then choose a threshold such that our metric comes out as good as possible.

Note:- From now on, instead of the model.predict() function, you should use the function below.

The function is:

model.predict_proba(X_test)

The above function predicts, for each test sample, the probability of it belonging to each class. The higher the positive-class probability, the higher the chance that the test sample is positive.
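For example (a small sketch continuing with the model built above; the exact numbers will depend on your model), the output has one row per test sample and one column per class, and the second column is the positive-class probability:

# each row is [P(class 0), P(class 1)] for one test sample
y_prob = model.predict_proba(X_test)
print(y_prob[:3])

# keep only the positive-class probabilities
positive_probs = y_prob[:, 1]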

ROC_AUC Score:- The ROC_AUC score is simply the area under the ROC Curve, where the ROC Curve is built from the different TPRs and FPRs obtained at different thresholds.

Let’s dive into the ROC_AUC score step by step:-

  1. Consider a set of thresholds, let’s say, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0].

Now, we’ll compare the test sample predicted probabilities with each threshold and predict the discrete values i.e. 0 or 1.

From these predicted discrete values, the True Positive Rate and False Positive Rate are calculated for each threshold.

See the implementation of this step below…

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)
print(y_prob)

threshold_list = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tpr = []
fpr = []
for threshold in threshold_list:
    # discrete predictions for this threshold
    preds = (y_prob[:, 1] > threshold).astype(int)
    # true_positive, false_positive, true_negative and false_negative
    # are the confusion-matrix helper functions from the previous blog
    tp = true_positive(y_test, preds)
    fp = false_positive(y_test, preds)
    tn = true_negative(y_test, preds)
    fn = false_negative(y_test, preds)
    tpr.append(tp / (tp + fn))
    fpr.append(fp / (fp + tn))

data = list(zip(tpr, fpr))
dataset = pd.DataFrame(data, columns=["tpr", "fpr"])
print(dataset.values)

The above code will give output something like the below:

In the output, the left values represent TPR, and the right values represent FPR.

The dataset I used above is just there to show how FPR & TPR are calculated; don’t worry about the dataset itself. Just concentrate on how FPR & TPR are calculated (you can see the dataset here).
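If you don’t want to loop over thresholds by hand, scikit-learn can compute the same pairs for you; a minimal sketch, assuming the y_test and y_prob from the code above:

from sklearn.metrics import roc_curve

# roc_curve chooses the thresholds itself and returns the matching FPRs and TPRs
fpr, tpr, thresholds = roc_curve(y_test, y_prob[:, 1])
print(list(zip(tpr, fpr)))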

2. Now, we’ll plot a graph called the ROC Curve, using the TPRs and FPRs that we calculated for the different thresholds.

See the code & output below:

import matplotlib.pyplot as plt

plt.figure(figsize=(7, 7))
plt.fill_between(dataset.fpr.values, dataset.tpr.values, alpha=0.4)
plt.plot(dataset.fpr.values, dataset.tpr.values, lw=2)
plt.title("ROC_AUC CURVE")
plt.xlim(0, 1.0)
plt.ylim(0, 1.0)
plt.xlabel("fpr", fontsize=16)
plt.ylabel("tpr", fontsize=16)
plt.show()

The output of the code is the ROC Curve, shown below:

Don’t worry about the shape of the curve; just concentrate on how to plot a ROC Curve. I didn’t preprocess the data much before modeling, so the curve is not great.
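As a side note, recent versions of scikit-learn (1.0 and above) can draw this plot in one call; a quick sketch, assuming the same y_test and y_prob as before:

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# plots the ROC curve from the predicted positive-class probabilities
# and shows the AUC in the legend
RocCurveDisplay.from_predictions(y_test, y_prob[:, 1])
plt.show()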

3. Now, we need to find the area of the blue region in the graph, which is called the area under the ROC Curve, also known as the Area Under the Curve (AUC).

We could use integration and other mathematical tools to calculate the area under the curve, but we don’t need to go that far. We can calculate the area using sklearn, which gives us the AUC (Area Under the Curve) directly.
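For intuition, the blue area can also be approximated numerically from the (FPR, TPR) points we computed above, e.g. with the trapezoidal rule. This is only a rough sketch, and it won’t exactly match sklearn’s value because we only used a handful of thresholds:

import numpy as np

# sort the points so FPR is increasing, then apply the trapezoidal rule
d = dataset.sort_values("fpr")
approx_auc = np.trapz(d.tpr.values, d.fpr.values)
print(approx_auc)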

This AUC is one of the most popular metrics in machine learning for skewed targets.

Let’s see the code for calculating this area (just one line, plus an import 😉):-

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_prob[:, 1]))

This will print the Area Under the Curve, i.e., the AUC score.

The AUC score lies between 0 and 1. If the AUC score is close to 1, then the model we developed is among the best models one can develop for the considered dataset (there is also a chance that it is overfitting, so be careful about that). If the AUC score is close to 0, then our model is even worse than random guessing.

If our model gives a score of 0.5, then our model is a naive model, no better than random guessing.

Big Note:-

We can use the ROC Curve for another purpose too. As we saw above, different thresholds produce different discrete predictions, and therefore different TPRs and FPRs. In industrial applications, a particular threshold is usually selected by domain experts, depending on whether they want high or low TPRs and high or low FPRs. Discrete predictions are then made from the predicted probabilities using that threshold, and the metrics are applied to those predictions.

As we are beginners in machine learning, we will select the threshold in such a way that TPR and FPR are balanced. The top-left point of the ROC curve gives such a balanced threshold, as shown in the sketch below.
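One common way to pick that “top-left” threshold (a sketch, not the only option) is to take the threshold that maximizes TPR - FPR, also known as Youden’s J statistic; here is a small example using roc_curve, assuming the y_test and y_prob from earlier:

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, y_prob[:, 1])

# the threshold that maximizes TPR - FPR is the point farthest above the
# chance diagonal, i.e. roughly the top-left corner of the curve
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]
print(best_threshold, tpr[best_idx], fpr[best_idx])

# use this threshold instead of the default 0.5 for discrete predictions
y_pred_balanced = (y_prob[:, 1] > best_threshold).astype(int)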

I think this is all about the ROC_AUC score…

If anything needs to be added, lemme know…

contact me here:- https://www.linkedin.com/in/vishnu-vardhan-varapalli-b6b454150/

Happy Learning✌!!
