Fun with Classification Metrics

Ilia Kopylov
6 min read · Jul 9, 2018


Matthews Correlation Coefficient and Youden’s J Statistic

If you’ve been doing Machine Learning or Statistical Modelling (especially solving classification problems) for at least one month, I bet by now you’ve seen over 9000 articles about accuracy, precision and recall, sensitivity and specificity, F1 score, the confusion matrix, and the ROC curve. So you’re not easily fooled by an accuracy score over 99.9% alone. I hope I’ll tell you something new today.

This article is about two different, slightly less popular metrics I discovered while researching evaluation metrics for recommender systems. Depending on your goal, they might reflect the model’s performance way better than precision, recall, or F1 score: Matthews correlation coefficient (MCC) and Youden’s J statistic (YJS).

I already hear you asking: “What’s the problem with precision, recall, and F1 score?”. Well, there might or might not be a problem, depending on what the objective of your model is. Let’s take a look at some examples. Fun with spreadsheets! Yay!

Table 1

Grey columns represent the dataset: [+] is the number of positive data points our classifier is intended to detect, [-] is the number of negatives. Their sum is the total dataset size: 100 + 900 = 1000.

Blue and purple columns represent our model’s behaviour: specifically how many true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) it produces.

Accuracy (A) is the metric that everyone understands: number of correct decisions the model makes divided by total number of decisions: (TP + TN) / dataset size.

Precision (P) is the fraction of correct positive decisions among all the positive decisions the model makes: TP / (TP + FP)

Recall (a.k.a. Sensitivity) (R (S)) is the fraction of positive data points the model correctly identifies among all the positive data points: TP / [+]

F1 score is a combined single-value metric of precision and recall — their harmonic mean: 2 * P * R / (P + R)

Specificity (Sp) is the fraction of negative data points the model correctly identifies among all the negative data points: TN / [-]
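
To make those definitions concrete, here is a minimal Python sketch, assuming we already have the four confusion-matrix counts. The numbers at the bottom are one illustrative Table 1-style row (100 positives, 900 negatives, TP = FP = 50), not values copied from the spreadsheet:

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):          # a.k.a. sensitivity
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Illustrative row: 100 positives, 900 negatives, TP = FP = 50
tp, fp, tn, fn = 50, 50, 850, 50
print(accuracy(tp, fp, tn, fn))   # 0.9
print(precision(tp, fp))          # 0.5
print(recall(tp, fn))             # 0.5
print(f1(tp, fp, fn))             # 0.5
print(specificity(tn, fp))        # ~0.944
```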

Each row in the table represents a different classifier (from the least sensitive to the most sensitive). Green/red indicate whether the metric in that column increases or decreases as the classifier’s sensitivity grows.

Let’s assume that as the classifier’s sensitivity increases, the TP and FP counts grow by roughly the same amount (in this example from 50 to 99), which is a fairly reasonable assumption reflecting the accuracy-precision-recall trade-off. Don’t worry, we’ll make more pessimistic assumptions later :)

Finally, MCC and YJS are the reasons why you’re still reading this article. We’ll talk about them a bit later. Please, continue!

So I’ll paste the same table again (so you don’t have to scroll back) and also add Table 2 with the same classifiers but with more negative examples (9,900 compared to 900 in Table 1), all of which our classifiers correctly identify as negatives, so the TP, FP, and FN numbers remain the same.

Table 1 (1,000 data points)
Table 2 (10,000 data points)

The more we increase sensitivity (a.k.a. recall), the higher the F1 score gets, assuming precision stays the same. Also, since we get fewer TN, the specificity value drops (and turns red).

Now take a look at the precision, recall, and F1 score values in both tables. They are exactly the same within the same sensitivity level! No matter how many negative data points we add to the test set.

That’s it! This is the issue with precision, recall, and F1 score. They completely ignore true negatives. Now it’s time to introduce Matthews correlation coefficient:
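
MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))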

The formula might look either intimidating or quite familiar, but we’re not going to dive deep into the mathematical reasoning behind it. See that cute TN in the numerator? It makes MCC sensitive to true negatives, which is quite clear if you compare the MCC columns in Table 1 and Table 2.
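
If you’d rather not trust my spreadsheet, here’s a small Python sketch of that formula. The counts below are a made-up Table 1-style row, and the cross-check uses scikit-learn’s matthews_corrcoef, which computes the same quantity from label vectors:

```python
import math
from sklearn.metrics import matthews_corrcoef

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from raw confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Made-up row in the spirit of Table 1: 100 positives, 900 negatives
tp, fp, tn, fn = 90, 90, 810, 10
print(mcc(tp, fp, tn, fn))                # ~0.625

# Cross-check: rebuild label vectors with the same counts and ask scikit-learn
y_true = [1] * (tp + fn) + [0] * (fp + tn)
y_pred = [1] * tp + [0] * fn + [1] * fp + [0] * tn
print(matthews_corrcoef(y_true, y_pred))  # same ~0.625
```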

Now let’s take a look at Youden’s J statistic which is even simpler:

YJS = sensitivity + specificity - 1

As you remember, sensitivity and specificity are exactly the ratios TP / [+] and TN / [-] respectively. So YJS is just a linear combination of the two, ranging from 0 (no better than random guessing) to 1 (a perfect classifier); it only goes negative if the model does worse than chance.

We noticed a substantial increase in MCC and YJS when we added more negative examples that the model correctly identifies. But both metrics also grow as we increase the classifier’s sensitivity, because they actually care about both sensitivity and specificity. Pretty cool, right?
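
Here’s a quick sketch of that comparison in Python. The row is made up (TP = FP = 90 on 100 positives, which is inside the 50-to-99 range used above); the only difference between the two calls is 9,000 extra negatives that are all correctly rejected:

```python
import math

def metrics(tp, fp, tn, fn):
    p  = tp / (tp + fp)                 # precision
    r  = tp / (tp + fn)                 # recall / sensitivity
    f1 = 2 * p * r / (p + r)
    sp = tn / (tn + fp)                 # specificity
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    yjs = r + sp - 1                    # Youden's J statistic
    return p, r, f1, sp, mcc, yjs

table1 = metrics(tp=90, fp=90, tn=810,  fn=10)   # 100 [+],   900 [-]
table2 = metrics(tp=90, fp=90, tn=9810, fn=10)   # 100 [+], 9,900 [-]

for name, a, b in zip(("P", "R", "F1", "Sp", "MCC", "YJS"), table1, table2):
    print(f"{name:>3}: {a:.3f} -> {b:.3f}")
# P, R and F1 don't move at all; Sp, MCC and YJS all go up.
```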

It gets darker…

As I promised, things will get a little more pessimistic now. But for the sake of science of course. Data science!

In Table 3, FP is proportional to TP: it’s always 20 times larger. In Table 4, FP is the square of TP.

Table 3 (FP = 20 x TP)
Table 4 (FP = TP²)

In Table 3, accuracy drops as sensitivity increases because FP grows 20 times faster than TP. Precision remains the same because the TP-to-FP ratio itself is constant. In Table 4, all the metrics except recall drop (recall is exactly the sensitivity we gradually increase in these experiments).
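
If you prefer code to spreadsheets, here’s a sketch of both sweeps. I’m guessing a 100 / 9,900 split like Table 2, since the exact counts behind Tables 3 and 4 aren’t restated here, so treat the specific numbers as illustrative:

```python
import math

def report(tp, fp, tn, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)
    p   = tp / (tp + fp)
    r   = tp / (tp + fn)
    sp  = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    yjs = r + sp - 1
    return acc, p, r, sp, mcc, yjs

POS, NEG = 100, 9900          # assumed split, borrowed from Table 2

for label, fp_of in (("FP = 20 * TP", lambda tp: 20 * tp),
                     ("FP = TP ** 2", lambda tp: tp ** 2)):
    print(label)
    for tp in (50, 70, 90, 99):
        fp = fp_of(tp)
        acc, p, r, sp, mcc, yjs = report(tp, fp, NEG - fp, POS - tp)
        print(f"  TP={tp:2d}  A={acc:.3f}  P={p:.3f}  R={r:.2f}  "
              f"Sp={sp:.3f}  MCC={mcc:.3f}  YJS={yjs:.3f}")
# First sweep: precision is pinned at 1/21 while accuracy falls as sensitivity grows.
# Second sweep: everything except recall falls as sensitivity grows.
```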

So MCC and YJS take both sensitivity and specificity into account, and both are sensitive to the TP-to-FP ratio and to the rate at which that ratio changes. How awesome is that?!

Btw, a good representation of how that proportion changes with the model’s sensitivity is the ROC curve.
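
As a side note, if your model outputs scores rather than hard yes/no decisions, scikit-learn can trace that curve for you. A minimal sketch with made-up labels and scores; roc_curve returns the false-positive rate (1 - specificity) and the true-positive rate (sensitivity) per threshold:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up ground truth and model scores, just to show the API shape
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))             # (1 - specificity, sensitivity) pairs
print(roc_auc_score(y_true, y_score))  # area under that curve
```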

Conclusions

  1. MCC and YJS are simple and efficient classification metrics which might be more relevant than precision, recall, and F1 score for your classifier if you care about true negatives.
  2. If you do care about true negatives, then MCC and YJS dropping is a good indication that you’re heading in the wrong direction with your model development.
  3. Setting a minimum acceptable specificity value and maximising either MCC or YJS might be a good strategy.
  4. Distributions matter! Comparing models’ performance metrics on two different test sets isn’t always the right thing to do. Ideally you should strive for the same data distribution across the training/evaluation/testing sets, as close to real life as possible. But if you want insight into how your model behaves in the case of a data mismatch, MCC and YJS are here to help.
  5. Spreadsheets are awesome! If you’re the type of person who understands a general pattern more easily by looking at a set of specific examples (just as ML models do), you’ll definitely find it helpful to simulate them in a spreadsheet without even writing a line of code. As much as I hate to say it as a developer.

Thank you! Have clean data and choose your metrics wisely!
