AUK — a simple alternative to AUC
A better performance measure than AUC for unbalanced data
Binary classification and evaluation metrics
Classification problems are widespread in data science. With classification, a model is trained to label input data from a fixed set of labels. When this fixed set contains exactly two labels, the problem is called binary classification.
Typically, a trained binary classification model returns a real-valued score r for each input. If r is higher than a chosen threshold t, the input is assigned a positive label; otherwise, it is assigned a negative label.
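For example, with some made-up scores and a threshold of 0.5, this labelling step looks as follows (plain Python, values chosen purely for illustration):

scores = [0.1, 0.35, 0.4, 0.8, 0.65, 0.9]      # hypothetical model scores r
t = 0.5                                        # chosen threshold
preds = [1 if r >= t else 0 for r in scores]   # positive label when r reaches t
print(preds)                                   # [0, 0, 0, 1, 1, 1]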
Evaluation metrics calculated for binary classification problems are typically based on the confusion matrix.
To assess the performance of a classification model or to rank different models, one can pick a single threshold value t and calculate the precision, recall, F1-score and accuracy based on it. Alternatively, a model’s performance can be assessed over all threshold values (the number of unique r scores in the test set) instead of one chosen threshold. A commonly used graph based on this latter approach is the ROC curve, which plots the true positive rate (TP / (TP + FN)) against the false positive rate (FP / (FP + TN)). The area under this curve (AUC, a value between 0 and 1) is then used to assess a model’s quality and compare it to other models.
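As a small sketch of this threshold-free view, using scikit-learn together with the made-up scores from above and hypothetical ground-truth labels:

from sklearn.metrics import roc_curve, roc_auc_score

labels = [0, 0, 1, 0, 1, 1]                        # hypothetical ground-truth labels
scores = [0.1, 0.35, 0.4, 0.8, 0.65, 0.9]          # hypothetical model scores r
fpr, tpr, thresholds = roc_curve(labels, scores)   # one (FPR, TPR) point per threshold
print(roc_auc_score(labels, scores))               # area under the ROC curve

roc_curve walks over all thresholds for you, and scikit-learn computes the area under the resulting curve with the trapezoidal rule.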
Shortcomings of AUC
The AUC is one of the most used scalars for ranking model performance. However, its disadvantages are less well known: Hand (2009) has shown that the AUC implicitly uses different misclassification cost distributions for different classifiers (in this context, different distributions over the threshold values t), and that it does not account for class skewness in the data. Yet the misclassification loss should depend on the relative proportion of objects belonging to each class, and the AUC does not consider these priors. This is equivalent to saying that, using one classifier, misclassifying class 1 is p times as serious as misclassifying class 0, while, using another classifier, misclassifying class 1 is P times as serious, where p ≠ P. This is nonsensical, because the relative severities of different kinds of misclassifications of individual points are a property of the problem, not of the classifiers that happen to have been chosen.
AUK
To overcome these shortcomings, Kaymak, Ben-David and Potharst (2012) have proposed a related but different metric: the area under the Kappa curve (AUK), which is based on the well-established Cohen’s Kappa. It measures the area under the graph that plots Kappa against the false positive rate. Just like the AUC, the AUK can be seen as an indication of overall performance. Unlike the AUC, however, Kappa corrects for correct classifications that occur by chance, and therefore inherently accounts for class skewness.
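Cohen’s Kappa itself is readily available; as a quick sketch with scikit-learn, reusing the made-up scores and labels from above at a single threshold t = 0.5:

from sklearn.metrics import cohen_kappa_score

labels = [0, 0, 1, 0, 1, 1]                      # hypothetical ground-truth labels
scores = [0.1, 0.35, 0.4, 0.8, 0.65, 0.9]        # hypothetical model scores r
preds = [1 if r >= 0.5 else 0 for r in scores]   # predictions at threshold t = 0.5
print(cohen_kappa_score(labels, preds))          # agreement corrected for chance

The Kappa curve simply repeats this calculation for every threshold and plots the result against the false positive rate, which is what the AUK class below does.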
In the paper, the authors demonstrate a couple of characteristics and merits of the AUK:
- Kappa is a non-linear transformation of the difference between the true positive rate and the false positive rate (illustrated in the sketch after these lists).
- The convex Kappa curve has a unique maximum that can be used to select the optimal model.
Furthermore, if the dataset is balanced:
- Cohen’s Kappa provides precisely the same information as the ROC curve.
- AUK = AUC - 0.5 (AUK and AUC differ only by a constant).
- AUK = 0.5 · Gini (AUK is equal to half the Gini coefficient when there is no skew in the data set).
- The value of Kappa is maximized when the gradient of the ROC curve equals 1. Therefore, there is no added value in finding the optimal model through Kappa.
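The first bullet, and the balanced-data equivalences, can be made concrete with a small sketch that just rewrites the standard definition of Cohen’s Kappa in terms of the true positive rate (tpr), the false positive rate (fpr) and an assumed proportion of positives (pi):

def kappa_from_rates(tpr, fpr, pi):
    q = pi * tpr + (1 - pi) * fpr            # proportion predicted positive
    p_obs = pi * tpr + (1 - pi) * (1 - fpr)  # observed agreement
    p_exp = pi * q + (1 - pi) * (1 - q)      # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

print(kappa_from_rates(0.8, 0.3, 0.5))   # balanced data: equals tpr - fpr = 0.5
print(kappa_from_rates(0.8, 0.3, 0.1))   # skewed data: roughly 0.24, a non-linear function of both rates

For pi = 0.5, Kappa reduces to TPR - FPR, so integrating it over the false positive rate yields AUK = AUC - 0.5; once the classes are skewed, the prior enters the expression and Kappa no longer provides the same information as the ROC curve.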
Conclusion
That is, if the dataset is balanced, the AUK adds nothing over the AUC. For unbalanced datasets, however, the AUC and the AUK may produce different model rankings (please read the paper for examples), which can have huge implications when a model is taken into production. Since the AUK accounts for class skewness and the AUC does not, the AUK seems to be the better choice and should be part of any data scientist’s toolkit.
Code
Suppose you have:
- Probabilities: The output from your classification model; a k-length list with real-valued numbers.
- Labels: The actual labels for the classification model; a k-length list with zeros and ones.
Then, the class below can be called to calculate the AUK and/or get the Kappa curve.
class AUK:
    def __init__(self, probabilities, labels, integral='trapezoid'):
        self.probabilities = probabilities
        self.labels = labels
        self.integral = integral
        if integral not in ['trapezoid', 'max', 'min']:
            raise ValueError('"' + str(integral) + '" is not a valid integral value. Choose between "trapezoid", "min" or "max"')
        self.probabilities_set = sorted(list(set(probabilities)))

    # make predictions based on the threshold value and self.probabilities
    def _make_predictions(self, threshold):
        predictions = []
        for prob in self.probabilities:
            if prob >= threshold:
                predictions.append(1)
            else:
                predictions.append(0)
        return predictions

    # make list with kappa scores for each threshold
    def kappa_curve(self):
        kappa_list = []
        for thres in self.probabilities_set:
            preds = self._make_predictions(thres)
            tp, tn, fp, fn = self.confusion_matrix(preds)
            k = self.calculate_kappa(tp, tn, fp, fn)
            kappa_list.append(k)
        return self._add_zero_to_curve(kappa_list)

    # make list with fpr scores for each threshold
    def fpr_curve(self):
        fpr_list = []
        for thres in self.probabilities_set:
            preds = self._make_predictions(thres)
            tp, tn, fp, fn = self.confusion_matrix(preds)
            fpr = self.calculate_fpr(fp, tn)
            fpr_list.append(fpr)
        return self._add_zero_to_curve(fpr_list)

    # calculate confusion matrix (returned as proportions of the total, not counts)
    def confusion_matrix(self, predictions):
        tp, tn, fp, fn = 0, 0, 0, 0
        for i, pred in enumerate(predictions):
            if pred == self.labels[i]:
                if pred == 1:
                    tp += 1
                else:
                    tn += 1
            elif pred == 1:
                fp += 1
            else:
                fn += 1
        tot = tp + tn + fp + fn
        return tp/tot, tn/tot, fp/tot, fn/tot

    # calculate AUK
    def calculate_auk(self):
        auk = 0
        fpr_list = self.fpr_curve()
        for i, prob in enumerate(self.probabilities_set[:-1]):
            x_dist = abs(fpr_list[i+1] - fpr_list[i])
            preds = self._make_predictions(prob)
            tp, tn, fp, fn = self.confusion_matrix(preds)
            kapp1 = self.calculate_kappa(tp, tn, fp, fn)
            preds = self._make_predictions(self.probabilities_set[i+1])
            tp, tn, fp, fn = self.confusion_matrix(preds)
            kapp2 = self.calculate_kappa(tp, tn, fp, fn)
            y_dist = abs(kapp2 - kapp1)
            bottom = min(kapp1, kapp2) * x_dist
            auk += bottom
            if self.integral == 'trapezoid':
                top = (y_dist * x_dist) / 2
                auk += top
            elif self.integral == 'max':
                top = y_dist * x_dist
                auk += top
            else:
                continue
        return auk

    # calculate the false positive rate
    def calculate_fpr(self, fp, tn):
        return fp / (fp + tn)

    # calculate kappa score
    def calculate_kappa(self, tp, tn, fp, fn):
        acc = tp + tn
        p = tp + fn
        p_hat = tp + fp
        n = fp + tn
        n_hat = fn + tn
        p_c = p * p_hat + n * n_hat
        return (acc - p_c) / (1 - p_c)

    # add zero to appropriate position in list
    def _add_zero_to_curve(self, curve):
        min_index = curve.index(min(curve))
        if min_index > 0:
            curve.append(0)
        else:
            curve.insert(0, 0)
        return curve
To calculate the AUK, use the following steps:
auk_class = AUK(probabilities, labels)
auk_score = auk_class.calculate_auk()
kappa_curve = auk_class.kappa_curve()
Lastly, I strongly recommend using only the (default) trapezoid option for calculating the integral, as this is also how sklearn calculates the integral for the AUC.
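As a quick sanity check, here is a sketch that assumes the AUK class above and uses NumPy and scikit-learn to compare the AUK with sklearn’s AUC on a randomly generated, skewed toy set:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = (rng.random(500) < 0.2).astype(int).tolist()                         # ~20% positives (skewed)
probabilities = (0.25 * np.array(labels) + 0.75 * rng.random(500)).tolist()   # overlapping, imperfect scores
auk_class = AUK(probabilities, labels)
print('AUK:', auk_class.calculate_auk())
print('AUC:', roc_auc_score(labels, probabilities))

This is only a smoke test; the interesting comparison is whether AUK and AUC rank different models the same way on skewed data.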
References
Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
Kaymak, U., Ben-David, A., & Potharst, R. (2012). The AUK: A simple alternative to the AUC. Engineering Applications of Artificial Intelligence, 25(5), 1082–1089.