Evaluation Metrics

Don’t let AU-ROC trick you!

Area-under-ROC presents an optimistic view of performance on an imbalanced dataset

Pratyush Khare
Published in ILLUMINATION
4 min read · Jan 21, 2023

One of the most popular ways to assess a binary classifier is the receiver operating characteristic (ROC) curve. An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold values, showing how the classifier’s performance changes as the threshold moves. On imbalanced datasets, however, where positive instances are far fewer than negative ones, precision-recall (PR) curves may be the better choice. In this post, we’ll go through why PR curves are more effective for imbalanced datasets and why they can be more revealing than ROC curves.

ROC curve & PR curve for imbalanced dataset (Image by Author)
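
To make the contrast concrete, here is a minimal sketch in plain Python that computes one ROC point (TPR, FPR) and the matching precision from confusion-matrix counts. The counts and the helper name `rates` are hypothetical, chosen only for illustration; they are not taken from the figure above. The same TPR and FPR are evaluated once with a balanced class ratio and once with roughly 1% positives.

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall: share of actual positives that were found
    fpr = fp / (fp + tn)        # share of actual negatives that were wrongly flagged
    precision = tp / (tp + fp)  # share of flagged instances that are truly positive
    return tpr, fpr, precision


# Hypothetical balanced case: 100 positives, 100 negatives.
balanced = rates(tp=80, fp=4, fn=20, tn=96)

# Hypothetical imbalanced case: 100 positives, 10,000 negatives (about 1% positive).
# TPR and FPR are identical to the balanced case; only the class ratio differs.
imbalanced = rates(tp=80, fp=400, fn=20, tn=9600)

for name, (tpr, fpr, precision) in [("balanced", balanced), ("imbalanced", imbalanced)]:
    print(f"{name:>10}: TPR={tpr:.3f}  FPR={fpr:.3f}  precision={precision:.3f}")
```

Both cases land on the same point of the ROC curve, yet in the imbalanced case most flagged instances are false alarms. Precision depends on the class ratio while TPR and FPR do not, which is why a PR curve can expose a problem that the ROC curve hides.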

What is an imbalanced dataset?

An imbalanced dataset is one in which the classes are not equally represented. Specifically, one class, called the minority class, has significantly fewer instances than the other, the majority class. This imbalance can cause problems during training: the model may become biased towards the majority class and perform poorly on the minority class.
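
As a rough illustration of that bias, the sketch below scores a trivial model that always predicts the majority (negative) class. The class sizes and the function name `majority_baseline` are assumptions made up for this example.

```python
def majority_baseline(n_neg, n_pos):
    """Score a model that labels every instance as negative."""
    # Every negative is a true negative; every positive is a missed positive.
    tp, fp, fn, tn = 0, 0, n_pos, n_neg
    accuracy = (tp + tn) / (n_neg + n_pos)
    recall = tp / (tp + fn)  # share of positives the model actually finds
    return accuracy, recall


# Hypothetical dataset: 9,900 negatives and 100 positives (1% minority class).
accuracy, recall = majority_baseline(n_neg=9900, n_pos=100)
print(f"accuracy={accuracy:.2%}  recall={recall:.2%}")
```

The model looks excellent by accuracy while never detecting a single minority-class instance, so a metric that focuses on the positive class is needed to see the failure.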

--

Pratyush Khare
ILLUMINATION

Data scientist, tech buff, student-for-life, loves building AI/ML platforms/solutions, drawing insights from data.