Should I Never Use F-Measure?

Thomas Packer, Ph.D. · Published in TP on CAI · Oct 30, 2019

This short story is a response to Ian Soboroff’s presentation and tweet about never using the F-measure.

2023 update: I now feel foolish for taking the time to defend the use of F-measure, after thinking back through my motivation for doing so and realizing what must have been going on. This short story was actually a response to a co-worker of mine who randomly sent me a link to this rant from Ian Soboroff after I mentioned using F-measure. Why did this co-worker send me this link? The only explanation I have is prejudice: prejudice against something about me that he never expressed, but which bubbled to the surface of our conversation in the most illogical way (as most prejudice does). You see, if someone doesn't like you for reasons he knows are indefensible (as prejudice usually is), his cognitive dissonance will often manifest itself in other ways that he thinks are defensible. But if his opinion of you is low enough, he will flippantly give arguments that are actually less defensible than the original prejudice, and therefore blatantly illogical. Such was the case when this co-worker felt the need to convince me that I should not use F-measure, even though he himself probably used it many times before and after.

https://twitter.com/ian_soboroff/status/1110903162171465728

Four thoughts on his tweet.

First, his objective was to be controversial, probably because it's hard to get noticed these days without at least a little controversy. So I would take what he says about never using the F-measure with a grain of salt.

Second, he is coming from an IR perspective that focuses on ranking. By his own admission, he does not have a lot of experience in other fields that use the F-measure. He first tries to convince himself that all classification problems can be reduced to ranking, which is clearly not true. Even in ranking, it is sometimes possible to move a decision boundary in such a way that both precision and recall improve simultaneously (the toy sketch below illustrates this). In classification, it gets even more interesting as the dimensionality of the problem rises above one (one being the number of dimensions ranking has to work with).
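To make that concrete, here is a minimal Python sketch with a made-up ranked list (hypothetical relevance labels, not taken from any real system) showing how relaxing a single ranking cutoff can raise precision and recall at the same time:

```python
# Toy illustration (hypothetical data): lowering a ranking cutoff can
# improve precision and recall at the same time.
labels_by_rank = [1, 1, 0, 1, 1, 0, 0, 0]  # relevance of items, best-scored first
total_relevant = sum(labels_by_rank)

def precision_recall_at(k, labels, total_relevant):
    """Precision and recall when only the top-k ranked items are returned."""
    retrieved = labels[:k]
    true_positives = sum(retrieved)
    return true_positives / k, true_positives / total_relevant

for k in (3, 5):
    p, r = precision_recall_at(k, labels_by_rank, total_relevant)
    print(f"cutoff k={k}: precision={p:.2f}, recall={r:.2f}")

# cutoff k=3: precision=0.67, recall=0.50
# cutoff k=5: precision=0.80, recall=1.00  -> both metrics went up together
```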

Third, most of his premises and theses don't make any sense. “You should probably not average two values that are inversely correlated.” That inverse correlation is exactly why the F-measure is so useful: it is trivially easy to maximize recall while ignoring precision, and vice versa. This is also why one of his theses is nonsensical: “F is what gets used when someone couldn’t decide whether they wanted recall or precision.” If you want to look at both precision and recall every time you measure the performance of every experiment involving every model you work with, you can decide how best to use your time. But assuming you want a single metric to summarize these hard-to-reconcile metrics at least some of the time, and assuming you have a confidence threshold and weight the cost of false positives and false negatives equally, the F1-measure is one good option. If you don't have a single threshold you can rely on, the area under the precision-recall curve is better. If you know that false positives and false negatives should be weighted differently, an F-measure with a different beta parameter might be better. The sketch below shows all three options.
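As a rough illustration of those three options, here is a minimal sketch using scikit-learn's standard metric functions on made-up labels and scores (the 0.5 threshold and the beta of 2 are arbitrary choices for the example, not recommendations):

```python
# Hypothetical labels and model confidences, for illustration only.
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta > 1 weights recall
# more heavily, beta < 1 weights precision more heavily, beta = 1 gives F1.
from sklearn.metrics import f1_score, fbeta_score, average_precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                     # ground-truth labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]   # model confidences
y_pred = [1 if s >= 0.5 else 0 for s in y_score]      # one confidence threshold

# One threshold, false positives and false negatives cost the same: F1.
print("F1:", f1_score(y_true, y_pred))

# One threshold, false negatives considered costlier: F-beta with beta > 1.
print("F2:", fbeta_score(y_true, y_pred, beta=2))

# No single reliable threshold: summarize the whole precision-recall curve.
print("PR-AUC (average precision):", average_precision_score(y_true, y_score))
```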

As for his argument against using a single metric, that leads me to my last point.

Fourth, his bottom line was that you can learn more about a system by breaking an aggregate metric down into its components. Is this unexpected to anyone? It is true of any aggregate metric you care to use, and there are many used every day, everywhere. Sometimes you need to understand the details of a system, e.g. when you are debugging a problem with it. In those cases, you should look at more metrics. But sometimes you just need a simple reason to use one model instead of another: a reason that is good enough and avoids the analysis paralysis his tweet is probably perpetuating. For example, a single aggregate metric is very useful when you need to repeat a decision frequently, make it autonomously, or communicate it widely. In these cases, a single metric wins, assuming it is the right single metric. There is a lot of power in a minimalistic, good-enough metric. According to Jim Collins, finding the best single metric to focus on has helped a number of the greatest companies become great.

Join the CAI Dialog on Slack at cai-dialog.slack.com
