Why You Might Want to Use Area Under the PR Curve Instead of Precision and Recall

Quinn Wang
3 min read · Aug 23, 2019


There are various scalar metrics for evaluating your model's performance, and in real-world problems it can be difficult to choose the one that suits your task. For example, the commonly used accuracy might not be appropriate when you are dealing with skewed classes, where a model can simply predict the majority class and still yield high accuracy.

You might be more interested in precision and recall scores for your highly imbalanced data. However, these scores will likely vary with your decision threshold. Let's make this concrete with an example. Mr. Potato is a Krispy Kreme 🍩 fanatic, and with his impressive donut-buying history he has found that on some days, his donut (yes, he only buys one) is very noticeably smaller than usual. He comes up with a hypothesis (and yes, he refuses to just ask for a larger one): certain conditions on those days may have affected his donut's size. He quickly noted down some things that might have come into play the last time he received a smaller 🍩: time of day, number of the same donuts left in the display case, number of staff that day, number of customers that day, etc.

Photo by freepik

He does this for fifty years and gathers enough training data, but only 2% of it is labeled "tiny 🍩". His model gets 98% accuracy, but only 71% precision and 26% recall. He obviously wants the recall to be higher, and he is willing to trade some of that precision for recall; it would be nice if the two scores could simply switch places.
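To make the numbers concrete, here is a minimal sketch of how accuracy, precision, and recall can diverge on a roughly 98/2 class split. The data is synthetic, standing in for Mr. Potato's donut log, and the exact scores will not match his.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the donut log: ~2% positive ("tiny donut") class.
X, y = make_classification(n_samples=20000, n_features=6,
                           weights=[0.98, 0.02], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)  # uses the default 0.5 threshold under the hood

print("accuracy :", accuracy_score(y_test, pred))   # high, mostly from the majority class
print("precision:", precision_score(y_test, pred, zero_division=0))
print("recall   :", recall_score(y_test, pred))
```

Accuracy looks great even if the model catches only a small fraction of the tiny donuts, which is exactly Mr. Potato's problem.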

He knows his model predicts probabilities for the two classes, tiny and regular donut (1 and 0), and makes a decision with a default 0.5 decision threshold. So he pokes his model and makes it predict 1 as long as its probability surpasses 0.45, which lands him at 60% precision and 40% recall. He experiments with some other thresholds and finds this potentially very helpful, but it also troubles him: he has already discarded a bunch of models with 24% or 25% recall but higher precision. What if, after some threshold adjustments, one of those models comes out on top?
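Continuing the sketch above, the threshold tweak amounts to taking the predicted probability of the positive class and comparing it against a cutoff of your choosing instead of the default 0.5 (the exact numbers here will differ from Mr. Potato's):

```python
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]  # P(tiny donut) for each day

# Lowering the cutoff flags more days as "tiny donut": recall goes up,
# precision typically goes down.
for threshold in (0.50, 0.45, 0.40):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.2f}  "
          f"precision={precision_score(y_test, pred, zero_division=0):.2f}  "
          f"recall={recall_score(y_test, pred):.2f}")
```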

This is why you want to consider the area under the curve for your model comparisons. Here, area under the curve means the integral of the precision-recall curve: the curve traced out by sweeping the decision threshold from 0 to 1, with recall on the x-axis and precision on the y-axis. Comparing precision-recall curves between different models can bring valuable insights. If the precision-recall curve of model 1 is always above the precision-recall curve of model 2, then at any decision threshold model 1 performs at least as well in both precision and recall. In that sense, a higher area under the curve translates to better precision and recall across a wider range of decision thresholds.
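Continuing the same sketch, scikit-learn can trace the full precision-recall curve by sweeping the threshold, and then integrate it into a single threshold-free number. Average precision is a closely related summary that is often reported instead:

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

proba = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

pr_auc = auc(recall, precision)               # integrate precision over recall
ap = average_precision_score(y_test, proba)   # threshold-free summary of the same curve
print("area under PR curve:", pr_auc)
print("average precision  :", ap)
```

Either number lets Mr. Potato rank his candidate models without first committing to a particular threshold.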

You might have already spotted why this area under the curve is a more interesting scalar measure for comparing your different models: it is independent of the decision threshold. So there's no need to ask, "better, but at what threshold?" There are still other reasons why, even with a higher area under the curve, Mr. Potato should be hesitant to choose one model over another, but that's a story for next time.


Quinn Wang

Data analyst with an interest in machine learning. Passionate about understanding the theoretical backings of ML algorithms.