Why You Might NOT Want to Use Area Under PR Curve

Quinn Wang
3 min read · Sep 1, 2019


Last time I introduced Mr. Potato and his approach to his tiny donut problem. Area Under the PR Curve looked like a better evaluation metric for his imbalanced class problem, because a scalar comparison is insensitive to changes in the decision threshold.

However, this metric still has intrinsic problems. Consider two models: model A with an area under the PR curve of 0.8, and model B with an area under the PR curve of 0.75. From a scalar-comparison point of view, model A is clearly and unambiguously superior. But let's look at the possible precision-recall curves (PR curves) of models A and B that are consistent with these two numbers:

Figure 1: a pair of PR curves where model A dominates model B at every threshold.

Figure 2: a pair of PR curves that cross.

Figure 3: another pair of PR curves that cross.

In figure 1, model A is clearly superior in both precision and recall at any threshold, so given the existing features, model B can be discarded. Figures 2 and 3, on the other hand, do not offer such a clear-cut conclusion. Perhaps recall matters more than precision for Mr. Potato, but he would still need to avoid terrible precision: a model operating at 10% precision is obviously useless.
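To make this concrete, here is a minimal sketch (assuming scikit-learn and matplotlib, with a synthetic imbalanced dataset and two off-the-shelf classifiers standing in for models A and B) that plots each model's PR curve next to its scalar average-precision score, so you can see whether the curves cross:

```python
# A minimal sketch: compare the PR curves of two models on an imbalanced
# synthetic dataset, rather than relying on the scalar AUC-PR alone.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly 9:1 class imbalance
X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "model A": RandomForestClassifier(random_state=0),
    "model B": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, scores)
    ap = average_precision_score(y_test, scores)  # scalar area under PR curve
    plt.plot(recall, precision, label=f"{name} (AP={ap:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()  # the curves may cross even when one AP is clearly higher
```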

Or consider the more extreme situation shown in figure 4:

Figure 4: model A's PR curve with a sharp precision drop.

In this case, model A behaves strangely. This can occur when clusters of true positives fall very close to false positives in score space: the model assigns, say, 61% confidence that the true positives are 1's, but also 60% confidence that the false positives should be predicted as 1's, so lowering the threshold only slightly lets in a flood of false positives. A PR curve with this shape is also likely to come with poor generalization to the test set: a little variation in the negative examples might cause the model to falsely categorize them as positive.
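A toy sketch of how such a cliff arises, using entirely hypothetical scores: 60 false positives scored at 0.60 sit just below 20 true positives scored at 0.61, so moving the threshold past 0.61 collapses precision almost instantly:

```python
# Toy illustration of the figure-4 pathology with hypothetical scores:
# a cluster of false positives at ~0.60 sits just below a cluster of
# true positives at ~0.61, so precision falls off a cliff between them.
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1] * 20 + [0] * 60 + [1] * 20)
scores = np.concatenate([
    np.full(20, 0.61),  # true positives the model barely prefers
    np.full(60, 0.60),  # false positives right behind them
    np.full(20, 0.95),  # a handful of easy, confident positives
])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# precision jumps from 0.40 to 1.00 between thresholds 0.60 and 0.61
```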

Furthermore, if your application must meet a minimum precision or recall for business or safety reasons (for example, a spam filter might need to guarantee 95% precision), a scalar comparison of areas under the PR curve may not be the right basis for a decision: you care about one operating region of the curve, not its average behaviour.
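One alternative is to compare models at the operating point you actually care about. A minimal sketch (the 95% floor, the sample data, and the helper name recall_at_precision are my own for illustration, not from the article): report the best recall each model achieves while honouring the precision floor:

```python
# Sketch: compare recall at a required minimum precision instead of AUC-PR.
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision=0.95):
    """Best recall achievable while keeping precision >= min_precision."""
    precision, recall, _ = precision_recall_curve(y_true, scores)
    feasible = precision >= min_precision
    return recall[feasible].max() if feasible.any() else 0.0

# Hypothetical held-out labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])
print(recall_at_precision(y_true, scores))  # recall at >= 95% precision
```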


Quinn Wang

Data analyst with an interest in machine learning. Passionate about understanding the theoretical backings of ML algorithms.