An Evaluation Metric Given Precision and Recall Goals

Quinn Wang
2 min read · Oct 18, 2019


Recently, I have been working with some AutoML libraries such as H2O and auto-sklearn, and both tripped me up with the same problem: the leaderboard ranks models by a metric that does not match my project's needs.

Some of the most common evaluation metrics today are accuracy, AUC, AUPRC, and F1. Libraries are often built on top of these metrics, and some, but not all, offer an option for specifying your own custom metric function. In this blog post, I would like to discuss some of the benefits of offering such a custom option, and propose a new evaluation metric that could be useful in many real-world business applications.
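As a concrete illustration, here is a sketch of how a custom metric plugs into scikit-learn's scorer interface (auto-sklearn offers a similar autosklearn.metrics.make_scorer). The metric below is just a stand-in, not the one proposed later:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_val_score

def precision_recall_average(y_true, y_pred):
    # Stand-in custom metric: the plain average of precision and recall.
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    return (p + r) / 2

# Wrap the plain function into a scorer object the library understands.
custom_scorer = make_scorer(precision_recall_average, greater_is_better=True)

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring=custom_scorer)
print(scores.mean())
```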

An example of this: using player location data in a soccer match, a researcher at a startup is trying to build a model that sends out an alert right before a player scores. This model would require very high recall, since there are only a few goals in each soccer game, and the audience would not want to miss any if they are relying on the alert system to call them back from their breaks. A slightly lower precision would be acceptable, since the audience loses little when a false alert is given, especially if false alerts only occur at moments that turn out to be highlights anyway. However, a precision that is too low is clearly not ideal for the business, since the audience would lose trust in the alert system. Say a precision of 20% is acceptable: anything above that is preferred but not required, while a recall of over 90% is mandatory. How do we train a model that takes these requirements into account in its evaluation?

I would like to propose a new scoring system, a simple modification of the F1 calculation that takes our precision and recall requirements into consideration. I will call it the bounded-F1 score. The bounded-F1 score takes in the following parameters:

  • Lower bound of precision (lbp)
  • Lower bound of recall (lbr)
  • (Optional, default = 1) score weights for precision and recall (denoted pw and rw)

And it is calculated by:
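(A sketch of the calculation, assuming the standard F1 harmonic mean applied to the offset terms described in the next paragraph; $P'$ and $R'$ are my shorthand for the adjusted precision and recall.)

$$P' = p_w \cdot (\text{precision} - lbp) + 1, \qquad R' = r_w \cdot (\text{recall} - lbr) + 1$$

$$\text{bounded-}F_1 = \frac{2 \, P' R'}{P' + R'}$$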

The intuition is that this is a version of the F1 score with an offset at the precision and recall boundaries. The +1 keeps each offset term, and therefore the score, no less than 0, and the respective precision and recall weights scale up or down the significance of improvement above the requirement or underperformance below it. Using bounded-F1, the model takes the desired precision-recall combination into consideration during evaluation.
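Here is a minimal Python sketch of bounded-F1 under the formulation above (the function name, defaults, and the scikit-learn wiring are my own illustration, not a fixed API); the defaults lbp=0.2 and lbr=0.9 match the soccer-alert requirements:

```python
from sklearn.metrics import make_scorer, precision_score, recall_score

def bounded_f1(y_true, y_pred, lbp=0.2, lbr=0.9, pw=1.0, rw=1.0):
    # Offset precision and recall by their required lower bounds,
    # scaled by the precision/recall weights; the +1 keeps each
    # term (and hence the score) non-negative when pw == rw == 1.
    p_adj = pw * (precision_score(y_true, y_pred) - lbp) + 1
    r_adj = rw * (recall_score(y_true, y_pred) - lbr) + 1
    # F1-style harmonic mean of the adjusted terms.
    return 2 * p_adj * r_adj / (p_adj + r_adj)

# Plug into the same custom-scorer machinery shown earlier;
# make_scorer forwards the keyword arguments to bounded_f1.
bounded_f1_scorer = make_scorer(bounded_f1, lbp=0.2, lbr=0.9)
```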


Quinn Wang

Data analyst with an interest in machine learning. Passionate about understanding the theoretical backings of ML algorithms.