Understanding ROC AUC: Pros and Cons. Why is Brier Score a Great Supplement?
Binary classification is a common machine learning task, and choosing the right metric to measure model performance is a tricky problem that people spend significant time on. ROC AUC is one of the most commonly used metrics for classification problems.
This blog will contain the following parts:
- What are the ROC curve and ROC AUC?
- What’s the meaning of the ROC AUC score?
- What are the pros and cons of using AUC ROC?
- What’s the Brier Score, and why is it a great supplementary metric?
What is the ROC curve?
The ROC (Receiver Operating Characteristic) curve shows the relationship between the True Positive Rate and the False Positive Rate at different classification thresholds. These two quantities are also known as Sensitivity and 1 - Specificity, respectively.
You don’t need to get terrified by those names. Here are some more intuitive explanations:
- True Positive Rate: among all the Positives, what percentage is correctly classified as Positive. For example, let’s say we are trying to classify apples and oranges. If we have 100 apples (positives) and 40 oranges (negatives) in total, and we correctly identify 80 of the 100 apples, the True Positive Rate is 80%.
- False Positive Rate: among all the Negatives, what percentage is falsely classified as Positive. Using the above example, if we classify 20 of the 40 oranges as apples, the False Positive Rate would be 50%. (A short code sketch of both rates follows this list.)
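To make the two rates concrete, here is a minimal sketch of the apples-and-oranges example. Only the counts (100 apples, 40 oranges, 80 correctly identified apples, 20 misclassified oranges) come from the example above; the rest is plain arithmetic.

```python
# Counts from the apples/oranges example above.
true_positives = 80    # apples correctly called apples
false_negatives = 20   # apples mistakenly called oranges
false_positives = 20   # oranges mistakenly called apples
true_negatives = 20    # oranges correctly called oranges

tpr = true_positives / (true_positives + false_negatives)   # sensitivity
fpr = false_positives / (false_positives + true_negatives)  # 1 - specificity

print(f"TPR = {tpr:.2f}")  # 0.80
print(f"FPR = {fpr:.2f}")  # 0.50
```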
From the plot, it is clear that there is an interesting trade-off between the True Positive Rate (TPR) and the False Positive Rate (FPR). The simplest way to maximize the TPR is to classify all 140 fruits as apples, which correctly covers all the apples (100% TPR) but also pushes the FPR to 100%. Conversely, if we want to drive the FPR down to 0, we can mark everything as an orange, which in turn makes the TPR 0. That is the intuition behind why the ROC curve starts at the origin and ends at (1, 1).
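If you want to see this threshold sweep in code, here is a small sketch using scikit-learn’s roc_curve. The labels and scores below are made-up illustrative values, not data from this post.

```python
# Sketch: sweeping thresholds with scikit-learn's roc_curve.
from sklearn.metrics import roc_curve

y_true = [1, 1, 1, 0, 0, 1, 0, 0]                      # 1 = apple, 0 = orange
y_score = [0.9, 0.8, 0.65, 0.6, 0.4, 0.35, 0.2, 0.1]   # predicted P(apple)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
# A threshold above every score labels everything as orange (TPR=0, FPR=0);
# a threshold below every score labels everything as apple (TPR=1, FPR=1).
```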
What is AUC ROC?
AUC ROC is the area under the ROC curve. It is a metric used to measure how well the model can distinguish between the two classes.
The better the classification algorithm, the higher the area under the ROC curve. The score typically ranges from 0.5 (no better than random guessing) to 1, and a score of 1 is the ideal case where TPR is 1 and FPR is 0, which means we correctly classify all positives and negatives. Whoopee!
The following images can give a more intuitive illustration:
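Alongside that illustration, here is a quick numerical sanity check of the range, using scikit-learn’s roc_auc_score on made-up scores.

```python
# Sketch: sanity-checking the two ends of the AUC range with made-up scores.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0]

perfect_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # every positive outranks every negative
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]    # all tied, no ranking information

print(roc_auc_score(y_true, perfect_scores))   # 1.0
print(roc_auc_score(y_true, uninformative))    # 0.5 (ties count as half-correct)
```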
Does AUC just mean the area under the curve? What else does it represent?
AUC can also be interpreted as “the probability that a randomly chosen positive example is ranked more highly than a randomly chosen negative example”, which in turn means it is the probability that a randomly selected positive-negative pair is correctly ranked.
So let’s say we have a small set of model predictions (y_predict) and real labels (y_true), with 3 positives and 2 negatives. Here is how we can calculate AUC manually:
There are 3 (# of positives) * 2 (# of negatives) = 6 positive-negative pairs in total. Then we sort the data by y_predict, from highest to lowest.
If our model were perfect, all the 1’s would come before the 0’s, i.e., every pair would be correctly ordered. But in this case, only 4 of the 6 pairs have the 1 ranked ahead of the 0, so our AUC is 4/6 ≈ 0.667. You can verify the result by calculating it in Python, as shown below.
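The exact table isn’t reproduced here, so the values below are one illustrative set of predictions with 3 positives, 2 negatives, and 4 of the 6 pairs correctly ordered. Both the manual pair count and scikit-learn’s roc_auc_score land on 4/6.

```python
# Illustrative values only: 3 positives, 2 negatives, 4 of 6 pairs correctly ordered.
from sklearn.metrics import roc_auc_score

y_true    = [1,   1,   1,   0,   0]
y_predict = [0.9, 0.4, 0.3, 0.5, 0.2]

# Manual AUC: count positive-negative pairs where the positive scores higher.
pos = [p for p, t in zip(y_predict, y_true) if t == 1]
neg = [p for p, t in zip(y_predict, y_true) if t == 0]
correct = sum(1 for p in pos for n in neg if p > n)

print(correct / (len(pos) * len(neg)))    # 0.666...
print(roc_auc_score(y_true, y_predict))   # 0.666...
```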
This example illustrates another perspective on AUC: it is a measure of how correctly the predictions are ordered.
What are the cons of AUC ROC?
As explained above, AUC ROC is essentially a measurement of the ordering of the predictions. You may realize that there is a way to trick the AUC ROC metric: if we divide all of the predictions by 100, we still get exactly the same AUC ROC! How crazy?!
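Here is a tiny sketch of that trick, reusing the illustrative predictions from the earlier example. Any monotonic rescaling leaves the ranking, and hence the AUC, unchanged.

```python
# Dividing every prediction by 100 does not change their ranking,
# so the AUC ROC stays exactly the same (illustrative values again).
from sklearn.metrics import roc_auc_score

y_true    = [1,   1,   1,   0,   0]
y_predict = [0.9, 0.4, 0.3, 0.5, 0.2]
y_scaled  = [p / 100 for p in y_predict]   # now at most 0.009 "probability"

print(roc_auc_score(y_true, y_predict))    # 0.666...
print(roc_auc_score(y_true, y_scaled))     # 0.666... -- identical
```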
Even though AUC ROC is a very useful metric, this is the part we need to pay attention to. When making predictions for a classification problem, we often want to be able to say “this is 80% likely to be an apple” rather than just “this fruit is more likely to be an apple than that one”. In other words, we care about the actual probabilities we output, not just their ranking.
Is there a way to avoid this? Yes, we have the Brier Score!
What is the Brier Score? Why is it a good supplement?
The Brier Score, defined by the formula above, is the mean squared difference between the predicted probability and the actual outcome (0 or 1), which can be interpreted as how close the predictions are to reality. For example, if the prediction we give to a positive data point is 0.7, its contribution to the Brier Score is (0.7 - 1) * (0.7 - 1) = 0.09. With more data points, we average these squared differences.
Since better predictions yield a lower Brier Score, it is also known as Brier Loss. You may notice it is somewhat similar to Log Loss, which also computes how far your predicted probabilities are from the actual outcomes.
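To check the single-point example above, here is a sketch using scikit-learn’s brier_score_loss, with log_loss shown for comparison; the extra data points are made up purely for illustration.

```python
# Sketch: Brier Score for the single-point example above, plus a small
# made-up batch so the averaging is visible. log_loss shown for comparison.
from sklearn.metrics import brier_score_loss, log_loss

# One positive point predicted at 0.7 -> (0.7 - 1)^2 = 0.09
print(brier_score_loss([1], [0.7]))      # 0.09

# A few more made-up points: the score is the average squared difference.
y_true = [1, 1, 0, 0]
y_prob = [0.7, 0.9, 0.2, 0.4]
print(brier_score_loss(y_true, y_prob))  # (0.09 + 0.01 + 0.04 + 0.16) / 4 = 0.075
print(log_loss(y_true, y_prob))          # related idea, different penalty shape
```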
The Brier Score is a great supplement to AUC ROC. Imagine our model has an AUC ROC of 0.9 and a Brier Score of 0.05: that tells us the predictions are good in both ordering and scale!
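Tying the two ideas together, here is a sketch (again with the illustrative predictions from earlier) where the divide-by-100 trick leaves the AUC untouched but is immediately exposed by the Brier Score.

```python
# AUC is blind to the divide-by-100 trick, but the Brier Score is not.
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true    = [1,   1,   1,   0,   0]
y_predict = [0.9, 0.4, 0.3, 0.5, 0.2]
y_scaled  = [p / 100 for p in y_predict]

print(roc_auc_score(y_true, y_predict), brier_score_loss(y_true, y_predict))
print(roc_auc_score(y_true, y_scaled),  brier_score_loss(y_true, y_scaled))
# The AUC is identical in both rows, but the Brier Score gets much worse
# for the rescaled predictions because they are badly calibrated.
```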
Conclusion
In practice, when we are evaluating model performance, it is usually very helpful to look at metrics like the Brier Score or Log Loss alongside AUC ROC so that the results can be evaluated in a more comprehensive way.
I hope this blog helps! Please let me know if there is anything unclear.