Ranking metrics from first principles

Amir Ziai
10 min read · Mar 31, 2023

Motivation

Ranking metrics are widely used in many applications. If you’re reading this, you’ve probably at least heard of, if not worked extensively with Precision @ K, Average Precision (AP), Area Under the Receiver Operating Characteristic (AUROC), and likely some other ones.

Despite their pervasiveness, ranking metrics are not as well understood as classification metrics. We’ll also see that some of them are actually somewhat unintuitive, which makes them prone to misinterpretation.

The goal of this series is to systematically explore each metric in detail. However, before getting there, we’ll lay the foundation for studying ranking metrics in this post.

Naturally, a systematic approach leads to a longer read. I will be summarizing takeaways for each section. If you’re strapped for time, I suggest starting with the takeaways. If the takeaways for a section are clear to you, feel free to skip to the next section. If not, that section is probably worth digging into.

How is ranking different from classification?

Let’s start at the very beginning. What is classification?

Classification

You have a set of items, and you want a way to classify them into C mutually exclusive classes. If C=2, this is called binary classification. If you’re familiar with multi-label classification, where classes are not mutually exclusive, think of that as applying one binary classifier to each class, so we can keep it consistent with our simple definition.

Here’s an interface that captures our definition abstractly:

def classifier(item: Item) -> Class
  • classifier is a function that does the classification given an input
  • item is an instance of type Item that we want to classify
  • The function outputs an instance of type Class

Let’s make this a bit more concrete.

Example: binary emoji food classification

Here’s a function that takes an emoji and outputs True if it’s a food emoji:

def emoji_food_classifier(emoji: Emoji) -> bool
  • emoji_food_classifier is now a binary classifier
  • emoji is an instance of type Emoji (subset of str)
  • The function outputs a bool

As an example, let’s take the following five emojis:

Figure 1- set of items to classify using emoji_food_classifier.

Ideally, we want emoji_food_classifier to make the following predictions:

Figure 2- applying an accurate emoji_food_classifier to the set of sample emojis.

In code:

emoji_tests = {
    '🌭': True,
    '🍔': True,
    '🍕': True,
    '🔑': False,
    '✈️': False,
}

assert all(
    emoji_food_classifier(emoji) is expected
    for emoji, expected in emoji_tests.items()
)
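
For completeness, here’s one minimal implementation that would satisfy these tests. The hardcoded lookup is purely illustrative and stands in for whatever real model or heuristic you’d use:

FOOD_EMOJIS = {'🌭', '🍔', '🍕', '🌮', '🍣'}  # illustrative hardcoded lookup

def emoji_food_classifier(emoji: str) -> bool:
    # a real classifier would use a model; set membership suffices for this sketch
    return emoji in FOOD_EMOJIS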

This formulation is highly versatile. You can cast a surprising number of real-life problems as binary classification.

The food delivery app problem

What if you care about capturing preferences? Say you’re building a food delivery app and want to personalize the app to each user. Here’s the current offering:

Figure 3- set of items offered by a food delivery app.

By design, everything you offer is food! The classification formulation becomes a bit awkward for personalization. There are usually degrees of preference, and it’s useful to tweak the formulation to accommodate this.

One way to do this is to rank the list of foods according to the user’s preferences. That is, foods we place at the top are preferred over the ones at the bottom. For instance, let’s say that these are the preferences for users 🙂 and 🤓:

Figure 4- food preferences for users 🙂 and 🤓.

This is a common design choice made by a lot of apps. This approach, typically called a feed, lends itself to a virtually infinite scrolling experience when many items exist.

Takeaway

Ranking cares about degrees of preference. Classification cares about distinguishing classes.

What are the benefits of ranking over classification?

Instead of identifying what the user likes or doesn’t like and only showing the predicted positives, we can rank all of the items. The user may be willing to look at more items than a binary classification approach would decide to show them.

For instance, say 🙂 has never tried 🍣, but today she’s willing to give it a try. A good classifier might’ve removed 🍣 from 🙂‘s recommendations altogether. A good ranker would just place it lower in the list.

Aren’t all classifiers also rankers?

Many classifiers can act as rankers: we can simply use the classifier’s score to order items. Ranking punts on picking a threshold and instead leaves it to the user to decide how far down the list to go.
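
As a minimal sketch (the helper name and the scores below are made up for illustration), turning classifier scores into a ranking is just a sort:

from typing import List

def rank_by_score(items: List[str], scores: List[float]) -> List[int]:
    # rank 1 goes to the item with the highest classifier score
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(items)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# e.g. scores coming from some classifier's predict_proba
assert rank_by_score(['🌭', '🍔', '🌮'], [0.2, 0.9, 0.5]) == [3, 1, 2]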

What’s the catch?

Preferences are not always very consistent over time and not necessarily transitive. E.g. even though 🤓 prefers 🌮 over 🌭, and 🌭 over 🍔, given a choice between 🌮 and 🍔, they may prefer 🍔. Consistency is not a ranking-specific problem though.

Even if we assume that users are perfectly consistent over time and have transitive preferences, it’s not easy to explicitly collect a list of pairwise preferences that we can use to construct this ranking.

The easiest type of data that can be collected is binary labels. These labels can be collected explicitly (e.g. user thumbs up/down an item) or implicitly (e.g. user orders from a restaurant and doesn’t leave a bad review).

Ranking evaluation

The rest of this post assumes that we have collected binary ground truth labels. In future posts we will look at non-binary labels.

How do we evaluate rankers? First, we need to define an interface for ranking.

Interface

We assume that we have a ranker function (could be ML or not) for producing a ranking of items given a set of items:

def ranker(items: List[Item]) -> List[int]

ranker takes a sequence items of type Item and produces a parallel sequence of ranks, where rank 1 is the most preferred item.

Here’s a ranker that takes a list of emojis and produces the ranking:

def food_emoji_ranker(items: List[str]) -> List[int]

Our food emoji ranker in action:

assert food_emoji_ranker(
    items=['🌭', '🍔', '🌮', '🍕', '🍣'],
) == [3, 1, 2, 4, 5]
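
One way this could be implemented is with a hardcoded preference order (reverse-engineered from the assertion above, purely for illustration):

from typing import List

# illustrative stand-in for a real model: a fixed preference order
_PREFERENCE = ['🍔', '🌮', '🌭', '🍕', '🍣']

def food_emoji_ranker(items: List[str]) -> List[int]:
    # sort the given items by preference, then report 1-based ranks
    by_preference = sorted(items, key=_PREFERENCE.index)
    return [by_preference.index(item) + 1 for item in items]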

Visually:

Figure 5- Ranker takes a list of items and ranks them.

In reality, the ranker will probably take additional context as input (e.g. user features), so that it can personalize the ranking to the context. To keep things simple, let’s just overlook that detail for now and keep our focus on evaluation. If it helps, you can imagine that the ranker is implicitly aware of the context/user that it’s trying to personalize the output to.

Toy dataset

Let’s extend our example from Figure 4 and imagine that we have binary labels for our favorite users 🙂 and 🤓 as follows:

Figure 6- binary labels appended to the preference we saw in Figure 4 for our favorite users.

Given this dataset, we can now start evaluating rankers. Our ML team has built four rankers. We ask the team to generate rankings for each user and we get back this table:

Figure 7- Four candidate rankers built by our ML team that we want to evaluate against the toy dataset.

The earlier items in the list are ranked higher. For instance, ranker_1 thinks that 🙂 prefers 🌭 to everything else.

To be very concrete, let’s map the output in the table to the interface we defined earlier for ranker_1 and user 🙂:

# Rankings for user 🙂 in the table is:
# [🌭, 🍔, 🌮, 🍕, 🍣]
# mapping this to the interface we defined earlier
ranker_1(['🌭', '🍕', '🍔', '🌮', '🍣']) == [1, 4, 2, 3, 5]

Which ranker is better? We need to define metrics to quantify the quality of each ranker.

But, let’s start by looking at the results for user 🙂 to build some intuition for this setup.

Figure 8- Ranker outputs with binary labels overlaid for user 🙂.

Best-case (perfect) ranking

ranker_1 produces the exact same ranking we saw in Figure 6. Intuitively, this ranking should be considered perfect; we can’t improve on it.

ranker_2 does not produce the exact same ranking, but notice that it still places all the positive items at the top. Should ranker_2 be penalized for the difference?

The answer is that the two orderings are indistinguishable from our perspective. We only have binary labels for the items, so as long as we put all the positive items on top of the negative ones, the order within each group is irrelevant.

Another way to think about this is that as long as we fill the first three spots (positive ones) with positive items and the next two spots with negative items, we’ve done our job perfectly.

Figure 9- the perfect ranker would put the positive items on top of negatives.

That is, any of the 3! = 6 permutations of the positive spots can be combined with either of the 2! = 2 permutations of the negative spots, for a total of 12 orderings that are all equally valid.

Figure 10- any of the 6 green (positive) orderings can be paired with one of the 2 red (negative) orderings to produce a perfect ordering (i.e. 6 x 2 = 12 total orderings).
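
We can sanity-check that count with a few lines of Python (assuming, per the figure, that 🙂’s positives are 🌭, 🍔, 🌮 and the negatives are 🍕, 🍣):

from itertools import permutations

positives = ['🌭', '🍔', '🌮']
negatives = ['🍕', '🍣']

# every perfect ordering is some permutation of the positives
# followed by some permutation of the negatives
perfect_orderings = [
    list(pos) + list(neg)
    for pos in permutations(positives)
    for neg in permutations(negatives)
]
assert len(perfect_orderings) == 6 * 2  # 3! x 2! = 12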

This is a key observation to keep in mind: in our example ranker_1 and ranker_2 are indistinguishable and should both receive a perfect score.

Takeaway

With binary ground truth labels, some orderings are indistinguishable. Putting all the positives at the top, in any order, results in a perfect ranking.

Worst-case ranking

Now we know what the ideal case is. How about the worst case? It’s probably not too hard to see that doing the opposite is the worst we can do, i.e. putting all the negative items on top of the positive items.

We can construct 12 worst-case orderings by just putting the negatives at the top (similar to the previous section).

Figure 11- the worst ranking entails placing all of the negative items on top of the positive ones.

ranker_3 is producing this worst-case ranking and should receive the lowest possible score.

Takeaway

Putting all the positives at the bottom, in any order, results in worst-case ranking.

Everything in between

We have established the best and worst cases. What about orderings in between? What if we mix one negative item with the positives? What if the one negative item is placed at the top of the list but the rest is correctly ordered? Intuitively, the more we “contaminate” positives with negatives, and the higher up these “contaminants” are, the worse the experience will be for the user.

ranker_4 is interleaving positives and negatives, and should intuitively receive a score somewhere between the two extremes.

Random ranking

For any metric, it’s critical to establish a baseline that can be trivially accomplished. In the case of binary classification, a good baseline is majority-class classification. What’s a trivial ranker?

We can trivially produce random rankings. Any useful ranker should do better than this baseline, hopefully by a wide margin.
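
A trivial random baseline is only a few lines (a minimal sketch following the ranker interface from earlier):

import random
from typing import List

def random_ranker(items: List[str]) -> List[int]:
    # assign each item a distinct rank between 1 and len(items), uniformly at random
    ranks = list(range(1, len(items) + 1))
    random.shuffle(ranks)
    return ranks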

It turns out that we can actually do better (in expectation) than the worst-case scenario by randomly ranking items. Intuitively speaking, if a ranker produces a random ranking, it should interleave positives and negatives and land somewhere between the two extremes.

Metric bounds

Before using any metric, we should establish the three values that we covered in this section (worst case, random baseline, and perfect score):

Figure 12- Before using any metric, we have to make sure that we have a good understanding of its bounds.

We certainly want to avoid rankers that fall in region A (below the random baseline); we can do better by just using a random ranker at that point. For rankers in region B (between the baseline and perfect), obviously the closer to perfect, the better.

To drive this point home, take binary classification accuracy for a perfectly balanced dataset (i.e. 50% positive prevalence). Worst-case is 0%, baseline (i.e. either random guessing or constant prediction) is 50%, and perfect accuracy is 100%.
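
For a ranking flavor of the same idea, AUROC (which we’ll cover properly later in the series) has well-known anchor values that we can verify with scikit-learn:

from sklearn.metrics import roc_auc_score

y_true = [True, True, True, False, False]

# perfect: positives receive the highest scores
assert roc_auc_score(y_true, [0.9, 0.8, 0.7, 0.2, 0.1]) == 1.0

# worst case: negatives receive the highest scores
assert roc_auc_score(y_true, [0.2, 0.1, 0.05, 0.9, 0.8]) == 0.0

# a random ranker lands around 0.5 in expectation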

Takeaway

Worst case ≤ baseline ≤ perfect
It’s very important to understand these 3 numbers for a metric before you start using it. These values are not always very intuitive as we’ll see in future posts.

Metric requirements

Without loss of generality, we’ll assume that higher is better for a ranking metric. Here’s the set of requirements that a metric needs to satisfy:

  1. Highest score should be assigned to perfect (all positives at the top)
  2. Lowest score should be assigned to worst-case (all negatives at the top)
  3. Worst-case < expected score for a random ranking < perfect (i.e. the three values should not all collapse to the same number)
  4. The order of the items within the positive and negative sets should not matter.
  5. Bumping a positive item higher in the ranking (if there is room) should produce a strictly positive improvement in the score.

Interface for ranking metrics

Given the requirements we just laid out, we’re now ready to turn our attention to the study of ranking metrics.

All of these metrics will share the following interface (the same as sklearn’s):

def ranking_metric(y_true: List[bool], y_pred: List[float]) -> float

where y_true is the list of ground truth labels, and y_pred is a list (parallel to y_true) that captures a score that can be used for ranking.

For example, let’s say that we have items [🌭, 🍣, 🍔, 🍕, 🌮] and the following corresponding values:

y_true = [True, False, True, False, True]
y_pred = [0.1, 0.7, 0.3, 0.5, 0.99]

These scores induce the following ranking: [🌮, 🍣, 🍕, 🍔, 🌭].

Visually:

Figure 13- Example of converting the input expected by our ranking metric interface into a ranked list of ground truth binary labels.

As you might’ve noticed, the scores are just a means of producing rankings and are not used in the actual computation. A simplified interface is to just take a ranked list of binary labels:

def ranking_metric_simplified(ranked_y_true: List[bool]) -> float

Note that because we’re only interested in the ranking, the actual scale of the scores is not important, so long as the induced ranking remains the same. In other words, the metric is invariant to any order-preserving transformation of the scores.
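
To make the connection between the two interfaces explicit, here’s a small conversion helper (the name to_ranked_labels is mine, not part of any library):

from typing import List

def to_ranked_labels(y_true: List[bool], y_pred: List[float]) -> List[bool]:
    # sort the labels by descending score; the stable sort keeps the
    # original order for tied scores
    pairs = sorted(zip(y_pred, y_true), key=lambda p: p[0], reverse=True)
    return [label for _, label in pairs]

# the example above: [🌭, 🍣, 🍔, 🍕, 🌮] -> [🌮, 🍣, 🍕, 🍔, 🌭]
assert to_ranked_labels(
    y_true=[True, False, True, False, True],
    y_pred=[0.1, 0.7, 0.3, 0.5, 0.99],
) == [True, False, False, True, True]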

Summary

Here’s what we went over in this post:

  • Motivated the ranking formulation and contrasted it to classification
  • Narrowed down our study to evaluating rankers with binary labels
  • Explored the extremes: worst and best case
  • Covered the need for establishing baselines
  • Came up with a set of requirements for a ranking metric

We now have the requisite foundation to embark on our exploration of ranking metrics.

Next up

The first concrete metric we’ll study in depth is Average Precision.
