Ranking metrics from first principles: Average Precision

Amir Ziai
23 min read · Sep 20, 2023


Haven’t read part 1 yet? Ranking metrics from first principles lays down some of the required foundations that you may find useful when reading this post.

Note 1

The goal of this post is to systematically explore Average Precision in detail. Naturally, a systematic approach leads to a longer read. I will be summarizing takeaways for each section. If you’re strapped for time, I suggest starting with the takeaways. If the takeaways for a section are clear to you, feel free to skip to the next section. If not, that section is probably worth digging into.

Note 2

I’ve tried to minimize code clutter, so the code presented in the post is not self-contained. If you want to run the code as you read, I suggest referencing this standalone notebook alongside the post.

Intro

This post is focused on Average Precision (AP). It is (arguably) one of the harder ranking metrics to develop a deep intuition for.

Questions you might’ve asked yourself about AP

Here’s a number of related questions that you might’ve asked yourself as you’ve come across AP:

  1. Why aren’t we assuming a threshold in the AP computation?
  2. Precision is just a scalar given a threshold, is AP an average across a bunch of thresholds?
  3. Wikipedia says that AP is the area under the precision-recall curve, so I guess I should just plot the curve and compute it?!
  4. I understand precision @ K, is AP the average across all K?
  5. I tried to compute AP using scikit-learn (or some other ML package) and got NaN, what does that mean?
  6. Is AP = 0.8 good? How do we determine that? What’s the baseline?
  7. Is AP sensitive to imbalance? Is it sensitive to the number of items in the dataset?
  8. What is mAP?

If you don’t have very crisp answers to all of these questions, I think that you’ll get some use out of this post.

Recap: metric requirements

In the earlier post, we claimed that a ranking metric (for binary labels) needs to satisfy the following requirements:

  1. Highest score should be assigned to perfect (all positives at the top)
  2. Lowest score should be assigned to worst-case (all negatives at the top)
  3. Worst-case ≤ expected score for random ranking ≤ perfect
  4. The order of the items within the positive and negative sets should not matter.
  5. Bumping a positive item higher in the ranking (if there is room) should produce a non-zero improvement in the score.

We’ll see that AP satisfies all of these.

The black box approach

Before defining AP, let’s just treat it as a black box and kick the tires a bit. We’ll use scikit-learn:

from sklearn.metrics import average_precision_score as ap

Recall the interface we defined for ranking metrics:

def ranking_metric(y_true: List[bool], y_pred: List[float]) -> float

ap uses this exact interface, for instance:

assert ap([True, False, True], [1, 0.5, 1]) == 1.0

Toy data

We’ll reuse the dataset of food preferences we introduced in part 1:

Users and items

items = ('🌭', '🍔', '🌮', '🍕', '🍣')
users = ('🙂', '🤓')

ground_truth = {
    '🙂': {
        '🌭': True,
        '🍔': True,
        '🌮': True,
        '🍕': False,
        '🍣': False,
    },
    '🤓': {
        '🍕': True,
        '🍣': True,
        '🌮': False,
        '🌭': False,
        '🍔': False,
    },
}

Recall that we’re working with binary labels, i.e. each user either likes an item or not. E.g. user 🙂 likes 🌭 and does not like 🍕.

Rankers

We have 4 rankers that we want to evaluate. Each ranks all 5 items for each user in the decreasing order of preference:

ranker_results = {
    1: {
        '🙂': ['🌭', '🍔', '🌮', '🍕', '🍣'],
        '🤓': ['🍕', '🍣', '🌮', '🌭', '🍔'],
    },
    2: {
        '🙂': ['🌮', '🌭', '🍔', '🍣', '🍕'],
        '🤓': ['🍣', '🍕', '🌭', '🌮', '🍔'],
    },
    3: {
        '🙂': ['🍣', '🍕', '🌮', '🌭', '🍔'],
        '🤓': ['🌮', '🍔', '🌭', '🍣', '🍕'],
    },
    4: {
        '🙂': ['🌭', '🍣', '🍔', '🍕', '🌮'],
        '🤓': ['🍕', '🌮', '🍣', '🌭', '🍔'],
    },
}

E.g. ranker 1 believes that 🙂 would prefer items in this order: 🌭, 🍔, 🌮, 🍕, and 🍣.

  • Note that rankers 1 and 2 are correctly putting all positive items on top of the negative ones. We expect these to be perfect.
  • The only difference between rankers 1 and 2 is that the order of items within the positive and negative sets has changed. As we saw in part 1, there’s no way to distinguish these cases and they should both get identical (perfect) scores.
  • Ranker 3 is putting all the negatives at the top. It’s the worst thing that you can do.
  • Ranker 4 is somewhere in between.

Helper function for eval

Here’s a helper function to simplify our evaluations using AP:

import typing as t

def user_ap(
    user: str,
    ranking: t.Sequence[str],
) -> float:
    # turn the ranking into descending scores: the first item gets the highest score
    y_pred = list(range(len(ranking), 0, -1))
    y_true = [
        ground_truth[user][item]
        for item in ranking
    ]
    return ap(y_true, y_pred)

Here’s an example of using it to evaluate ranker 1 for user 🙂:

user_ap(user='🙂', ranking=ranker_results[1]['🙂'])

This should return 1.0.

Ranker AP scores

Let’s use user_ap to compute AP for each (user, ranker):

import pandas as pd

pd.DataFrame(
    dict(
        ranker=ranker,
        user=user,
        ap=user_ap(user=user, ranking=ranking),
    )
    for ranker, res in ranker_results.items()
    for user, ranking in res.items()
).groupby(['ranker', 'user']).ap.mean().unstack().round(2)

This returns a table of AP scores with one row per ranker and one column per user.

Observations:

  1. Perfect rankers 1 and 2 get AP = 1, which is actually the max possible value for AP.
  2. Rankers 1 and 2 are indistinguishable from the perspective of AP. I.e. changing the order within a cluster of positives or negatives, without placing them above or below another cluster, does not change the score. This was explored in more depth in part 1, so head over there if it’s not super clear.
  3. Worst case (i.e. ranker 3) AP is 0.48 and 0.33 for 🙂 and 🤓 respectively. Hmmmm! You might’ve intuitively expected this number to be zero. We’ll come back to this.
  4. Ranker 4 is somewhere between rankers 1 (or 2) and 3. This is expected: it improves on the worst case by placing some positives higher up, in line with requirement #5.

According to this toy dataset, AP satisfies most of the requirements we mentioned earlier.
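
If you want to spot-check a cell of that table, here's a quick sanity check for ranker 3 (the worst case for both users), reusing user_ap and ranker_results from above:

round(user_ap(user='🙂', ranking=ranker_results[3]['🙂']), 2)  # 0.48
round(user_ap(user='🤓', ranking=ranker_results[3]['🤓']), 2)  # 0.33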

Let’s inspect requirement 5 a bit more, i.e. bumping a positive item higher in the ranking (if there is room) should produce a non-zero improvement in the score. To check this, we can see if we can improve ranker 4’s AP score in two cases:

  1. For 🙂, we’ll swap 🍔 and 🍣.
  2. For 🤓, we’ll swap 🌮 and 🍣.

Make sure that it makes sense to you that these changes will result in an improvement.

Here’s a handy helper function:

def swap(
    ranking: t.Sequence[str],
    i: int,
    j: int,
) -> t.List[str]:
    # copy first so we don't mutate the original ranking
    tmp = ranking[::]
    tmp[i], tmp[j] = tmp[j], tmp[i]
    return tmp

Let’s use it:

# swap burger and sushi for 🙂
user_ap(
    user='🙂',
    ranking=swap(ranker_results[4]['🙂'], i=1, j=2),
)

This returns 0.87, which is higher than ranker 4’s original score for 🙂 of about 0.76 ✅.

# swap taco and sushi for 🤓
user_ap(
    user='🤓',
    ranking=swap(ranker_results[4]['🤓'], i=1, j=2),
)

This returns 1.0, which is higher than ranker 4’s original score for 🤓 of about 0.83 ✅.

Expected AP for random rankings

What is the expected value of AP if we were to randomly rank items? Let’s generate 10k random rankings for each user and look at some summary stats:

import numpy as np

SEED = 0
N_ITER = 10_000

np.random.seed(SEED)

df_user_ap_random_rankings = pd.DataFrame(
    dict(
        user=user,
        ap=user_ap(
            user=user,
            ranking=np.random.choice(
                items,
                replace=False,
                size=len(items),
            ),
        ),
    )
    for user in users
    for _ in range(N_ITER)
)
df_user_ap_random_rankings.groupby('user').ap.describe().round(2)
Looking at the summary statistics:

  • Sanity check: we see that max and min are the same as we saw before.
  • ✅ Average is between the two extremes, as we expected from requirement #3.

Without knowing anything about the inner workings of AP, we have learned a few things:

  1. Perfect / max score is 1
  2. It seems to fit the requirements we stated in the previous section, at least for our toy dataset.
  3. Curiously, the worst-case score is not 0, and not even consistent between the two users.
  4. Similarly, random rankings produce different average APs between the two users.

As we’ll see later in this post, #3 and #4 make AP very easy to misunderstand and misinterpret. That’s about all we can learn about the metric without actually knowing how it is defined.

Takeaways

It’s useful to understand metrics as a black box first
Maximum value of AP is 1
Worst-case and baseline (i.e. expected value of random ranking) depend on the dataset

Let’s open the box

Don’t be turned off by the bit of math that follows; I’ll explain everything.

Definition

Here’s the mathematical definition according to Wikipedia:

AP = ∫₀¹ p(r) dr

This equation just means that we should compute the area under whatever p(r) is over the [0, 1] interval. Why [0, 1]? Well, both precision and recall are confined to that range.

To make sure that I haven’t lost you already, p is a function that takes r (recall) as an input and outputs a scalar (precision) that’s also in the [0, 1] interval. Here’s a picture:

The blue patch is the space of all possible p(r) curves that could be drawn. You might’ve heard that AP is the area under the precision-recall (PR) curve. I haven’t told you much yet, but at least now you know how the precision-recall curve connects to AP.

If you’ve taken a calculus class, you may be expecting me to present some integrable function p(r) at this point, so you can test your integration skills. Alas, we don’t have a continuous function, but we do have data! Let’s see if we can somehow put our data points on the curve.

Drawing precision-recall (PR) curves

Let’s revisit the sample data for 🙂. The rankings produced by the 4 rankers:

Overlaying ground truth:

Green cells are positive and red ones are negative.

Visually you should see that ranker 1 and 2 are perfect, 3 is worst case, and 4 is somewhere in-between the two. Make sure this makes sense before you move forward. If it doesn’t, go back to part 1.

The x-axis: recall

Since the x-axis of the curve is recall (r), we can ask the following question: what are all the possible values of recall?

In case you don’t recall (pun intended) the definition:
r = # of correctly predicted positives / # of actual positives

We have 3 actual positives for this user, so the four possible values are: 0/3, 1/3, 2/3, and 3/3.

There’s no other possible value. Make sure that this makes sense to you.

When do we get each of those recall values? Since recall is a classification metric, we need to have a way to binarize. One way to do this is to use a threshold: any item above the threshold is considered a positive prediction, and items below are considered negative.

To build some intuition, let’s add a column for scores that’d induce a ranking. Looking at ranker #1, we see that setting the threshold to a value that is higher than the largest score would result in a recall of 0:

If we reduce the threshold to a value between the first and second score, we’d predict the first ranked item 🌭 as positive. Because 🌭 is actually positive, we’d have recall = 1/3.

The following table continues this process for a few other thresholds:

Make sure you can map the table to the picture. I.e. visualize placing different thresholds and trace through the prediction column and compute recall.

What do we observe?

  1. We have created all possible recall values using these thresholds.
  2. You can always trivially get 0 if you set threshold > max(score). I.e. if you don’t predict anything as positive, you’re not going to recall anything.
  3. You can always trivially get 1 if you set a threshold ≤ min(score). I.e. If you predict everything as positive, you can recall everything.
  4. More interestingly, the value of recall only changes when the threshold results in the inclusion of a new positive item. E.g. between 0.95 and 0.8, we include 🌭 in the set of positive predictions, and recall jumps from 0 to 1/3. Conversely, when moving the threshold from 0.4 to 0.2, there’s no new actual positives added, so recall stays the same at 3/3 = 1.
  5. The actual value of the threshold t we use does not matter, as long as it’s within these ranges:

Recall monotonically decreases as we increase the threshold (i.e. it may stay constant as you increase the threshold, but it can never go up):

You might ask: what happens if there are no positive items? Recall is not defined in that case (division by zero).

We now know that there’s only P + 1 values that we need to worry about on the x-axis, where P is the number of actual positives in the dataset.
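
To make the threshold sweep concrete, here’s a minimal sketch with made-up scores (the exact values are hypothetical; only the ordering they induce matters). It reproduces the behavior described above for ranker 1’s output for 🙂:

# ranker 1's output for 🙂, best to worst, with hypothetical scores
y_true = [True, True, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

def recall_at_threshold(y_true, scores, thr):
    # true positives: items predicted positive (score >= thr) that are actually positive
    tp = sum(yt for yt, s in zip(y_true, scores) if s >= thr)
    return tp / sum(y_true)

for thr in [0.95, 0.85, 0.75, 0.65, 0.55]:
    print(thr, recall_at_threshold(y_true, scores, thr))
# recall only ever takes the values 0/3, 1/3, 2/3, and 3/3,
# and it can only stay flat or drop as the threshold increases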

Precision

Let’s repeat the same process of thresholding. We can even reuse some of the same thresholds that we used for recall:

What do we observe in this case?

  1. The denominator is no longer a constant, it’s now a function of the threshold
  2. The numerator is the same as recall’s numerator (i.e. number of true positives)
  3. This means that we may have as many as N unique precision values, where N is the total number of items.
  4. #3 means that for the same value of recall, we can have different values of precision, depending on the threshold (see the last 3 rows).

Sweet! So, precision is monotonic too? It happens to be in this case, but that’s not generally true. To make sure I don’t leave you with that impression, here’s a quick counter-example (with a code check right after the list):

  • Ground truth: [True, False, True]
  • Scores: [0.9, 0.8, 0.7]
  • For increasing values of threshold 0.65, 0.75, and 0.85, precision is 2/3, 1/2, and 1.
  • Starting from 2/3, it goes down to 1/2, and then up to 1 => not monotonic.
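
Here’s a small sketch checking those three numbers (the helper below is just for illustration):

y_true = [True, False, True]
scores = [0.9, 0.8, 0.7]

def precision_at_threshold(y_true, scores, thr):
    predicted = [(yt, s >= thr) for yt, s in zip(y_true, scores)]
    tp = sum(yt and yp for yt, yp in predicted)
    return tp / sum(yp for _, yp in predicted)

for thr in [0.65, 0.75, 0.85]:
    print(thr, precision_at_threshold(y_true, scores, thr))
# 0.65 -> 2/3, 0.75 -> 1/2, 0.85 -> 1/1: down, then up => not monotonic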

You can find all precision and recall curves for 🙂 in Appendix 1.

Summary of what we observed about precision and recall

The y-axis: p(r), or precision at all possible values of recall

We need to put the two together.

Let’s continue with ranker 1’s output (items are ranked by ranker 1) for user 🙂 and try to fill precision and recall values in this table:

One way to do this is to start at the top and work our way down. At each row, assume that you’re predicting that item and everything above it as positive. Now you can calculate precision and recall for row 1 (i.e. 🌭):

For precision, you’re predicting 1 item to be positive and it actually is, so 1/1 (the 3rd column is the ground truth). For recall, we know that the dataset has 3 positives (looking at the third column again), and the first item is predicted to be positive, so recall is 1/3.

Here’s the rest:

Make sure these values all make sense to you.
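
If you’d rather check them in code, here’s a small sketch that computes the row-by-row (precision, recall) pairs for ranker 1’s output for 🙂, treating each row as “predict this item and everything above it as positive”:

# ranked ground-truth labels for ranker 1's output for 🙂: [🌭, 🍔, 🌮, 🍕, 🍣]
ranked_labels = [True, True, True, False, False]
P = sum(ranked_labels)  # number of actual positives (3)

tp = 0
for k, label in enumerate(ranked_labels, start=1):
    tp += label
    print(f'row {k}: precision = {tp}/{k}, recall = {tp}/{P}')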

How is this exercise useful? I claim that these are all the points that you need, in order to be able to draw the PR curve for this (ranker, user). Let’s just put them on the graph:

Green + points indicate positive items and the red - ones indicate negative items.

Fine, how do we get AP from this?

What is an integral, really?

It took us a while to get here, but if you recall, we defined AP as an integral earlier. So, we need to calculate the area under some curve. However, we have a few discrete points and not a continuous curve. That’s actually great: we can sum a finite number of things instead:

AP = Σ p(r) Δr

Why? The integral ∫ turns into a sum Σ because we’re dealing with discrete points, and dr turns into Δr for the same reason.

Doesn’t a sum feel much better to work with than an integral?

But, wait! What are we summing over? What is this Δr thing?

Here’s a picture of the kinds of elements that we’re summing over:

At each value of r, we have a corresponding p(r), that we’re multiplying by Δr. So, we’re summing over a bunch of areas of rectangles.

Geometric intuition

Here’s one way we can draw these rectangles for our example:

You may object at this point. There are 3 choices at r=1. Why did I draw only one of them? I.e. I could’ve drawn the rectangle up to one of the negative points, or maybe I should draw all 3?

Let’s take a closer look at what we’re trying to do. I’ve reproduced the table from earlier, with two additional columns: Δr and ∑ p(r)Δr. Let’s work from top to bottom, like last time. We already know recall r and precision at that recall p(r):

What should we use for Δr? We have a current value for r, so what should we subtract from it to get Δr? If we don’t predict any positives, recall is zero, so Δr = 1/3 − 0 = 1/3. The last column is the running sum of p(r)Δr, which so far is just 1 × 1/3 = 1/3.

Here’s another 2 rows filled out:

Note that the running sum is already 1, and we said earlier that the max value of AP is 1. So, we’d better not exceed 1 with the last two rows:

Since r is already maxed out at 3/3, Δr=0 for the last two rows and there’s no contribution from those.

Now we can think back to the three rectangles we drew: we’re only drawing the rows with Δr > 0, and all three of them have p(r)=1.
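
Here’s the same running sum in code, for ranker 1’s output for 🙂 (a sketch of the table above, not a general implementation):

ranked_labels = [True, True, True, False, False]
P = sum(ranked_labels)

tp = 0
prev_recall = 0.0
running_sum = 0.0
for k, label in enumerate(ranked_labels, start=1):
    tp += label
    precision = tp / k
    recall = tp / P
    delta_r = recall - prev_recall  # 0 whenever the item is negative
    running_sum += precision * delta_r
    prev_recall = recall
    print(f'row {k}: p(r)={precision:.2f}, Δr={delta_r:.2f}, running sum={running_sum:.2f}')
# the running sum hits 1.0 at row 3 and stays there: AP = 1 for a perfect ranking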

Check your understanding: ranker 4 output for user 🙂

Recall:

  • Ground truth: {🌭: True, 🍔: True, 🌮: True, 🍕: False, 🍣: False}
  • Ranker 4 output: [🌭, 🍣, 🍔, 🍕, 🌮]

Fill out the following table:

Remember to work from top to bottom and that the last column is a running sum. Check your answer against appendix 2.

Finally, double check the correspondence between the table and this graph:

Simplifying the AP calculation

It’s instructive to work through a few examples. You’ll likely develop a good intuition for what AP really is doing. I suggest repeating the same exercise for the rest of the toy dataset with pen and paper.

Here’s a few things that you might start to notice:

  • We don’t need to use any thresholds in the AP calculation
  • We don’t need to use records where the item is not positive for the simple reason that Δr=0, and there’s no contribution in the sum.
  • Δr is always either 0 or 1 / P, where P is the number of positive items (i.e. ground truth is positive). Instead of multiplying it as we go, we can just do a one time multiplication at the end.

In other words: AP = (1/P) × Σ precision@k, where the sum runs over the positions k that hold a positive item.

If that’s not super clear, here’s a concrete implementation:

def ap_simple(
    ranked: t.Sequence[bool]
) -> float:
    pos_cnt = 0
    precision_sum = 0
    for idx, r in enumerate(ranked):
        if r:
            pos_cnt += 1
            # precision @ rank (idx + 1), accumulated only at positive positions
            precision_sum += pos_cnt / (idx + 1)
    if pos_cnt == 0:
        return float('nan')
    return precision_sum / pos_cnt

Here’s a few examples to work through:

assert ap_simple([True]) == ???
assert ap_simple([True] * 10) == ???
assert ap_simple([True, False]) == ???
assert ap_simple([False, True]) == ???
assert ap_simple([False, False, True]) == ???

I suggest working through each answer in your head, using ap_simple as a guide. Then replace the ??? values, run the code, and see if you got it right.

Want a bit more practice? Head over to Appendix 3.

Two final questions to wrap this part up:

  1. What’s AP if there are no positive items? E.g. what is ap_simple([False])?
  2. What’s AP if all items are positive?

Takeaways

Calculating AP involves calculating precision @ k at positions that correspond to a positive item.

AP is not defined if no positive items exist

AP equals 1 if all items are positive. In this special case, all the following values are the same: worst-case, expected, and perfect.

Inspecting AP

We’ve come a long way. We started from the definition of AP and got to a (hopefully at this point) intuitive way to compute AP given a ranked set of ground truth labels.

We’re now ready to understand AP more deeply.

Here’s what we know so far:

  1. AP = 1 for a perfect ranking
  2. Anything other than a perfect ranking has AP < 1
  3. We noted that worst case and expected values are not 0 and 0.5 as we might’ve intuitively expected.

Just as a reminder, we need to know the worst-case and baseline values in order to locate the region that we should avoid (region A in the following figure). A solid ranker should perform at least statistically significantly better than the baseline.

Computing worst-case AP

If you remember from our examples earlier:

Ranker 3 produces the worst results. We also saw that these numbers correspond to the min(AP) when we simulated 10k random rankings.

So, what is ranker #3 doing? It’s placing all negatives on top of positives.

Therefore, knowing only N (total number of items) and P (number of positive items), we can calculate the worst-case AP as:

AP_worst = (1/P) × Σ i/(N − P + i), summing over i = 1, …, P

Why? Well, remember that we only need to compute precision at the positive items, and in the worst case we know exactly where they sit: the bottom P positions.

Sanity check

  • Let’s try this for user 🙂 with N=5 and P=3: 1/3(1/3+2/4+3/5)=0.48.
  • And for user 🤓 with N=5 and P=2: 1/2(1/4+2/5)=0.325.

These match with what we calculated before using ap (and ap_simple).

Code

Here’s the same thing in code:

def ap_worst_case(N: int, P: int) -> float:
    assert 0 < P <= N
    # all P positives sit at the bottom, i.e. at positions N - P + 1, ..., N
    return sum(
        i / (N - P + i)
        for i in range(1, P + 1)
    ) / P

Worst-case AP as a function of P and N

Let’s take a look at two examples:

  1. N=2 and P=1
  2. N=1000 and P=500

Both have p=P/N=1/2. Worst-case APs for these are 1/2 and 0.31. That’s a big difference for two datasets that have the same rate of positives p=P/N.
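
Using the ap_worst_case helper from above to check these two numbers:

round(ap_worst_case(N=2, P=1), 2)      # 0.5
round(ap_worst_case(N=1000, P=500), 2) # ~0.31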

Here’s worst case AP as a function of N for a few values of p:

Worst-case AP plateaus as N grows.

Now that we know how worst-case AP is calculated, it should make a bit more sense why it’s not just a function of p (but also of N).

Who cares?

The ML team you work with declares that they’ve trained a very accurate ranker with AP=0.8. Upon inspecting the evaluation set, you notice that N=1000 and P=950, you run ap_worst_case(N=1000, P=950), which gives you 0.84. I.e. it’s not even possible to get a smaller value of AP than 0.84 for this dataset, so something is seriously wrong!

This example should remind you of some of the well-known issues with using accuracy without accounting for class imbalance. In addition to class imbalance (i.e. p), AP is also sensitive to the number of instances N.

You have to be very careful when using this metric. Your intuition may not be super reliable.

Small vs. large values of N

In general, worst case AP gets larger as N decreases. The following figure captures this difference:

AP_small_N is the smallest possible integer N for that p. E.g. for p=0.1, N=10.

Why is this? Let’s think about it from the user’s perspective. Think back to the difference between N=2 and N=10 for p=1/2. If there are only 2 spots (i.e. N=2), showing 1 negative followed by a positive is not that bad. If there are 10 spots, the user has to go through 5 negative items before getting to the positives. That’s just more negatives to wade through, which translates to a worse experience, and a worse AP.

There’s also something interesting happening with p > 0.5 in the figure above. Try to explain it using a similar UX-focused argument as above.

Computing expected AP (i.e. baseline)

We’ve established worst-case AP. Now we need to establish the baseline.

Intuition

How well can you expect to do if you were to just randomly rank items?

Little bit of math

We can define this expectation as follows:

E(AP) = Σ_{r ∈ R} prob(r) · AP(r)

where E(AP) is the expectation of AP, R is the set of all possible rankings, r is an element of this set, prob(r) is the probability of this ranking, and AP(r) is the AP given this ranking.

Conveniently, prob(r) is equal for all rankings, i.e. prob(r)=1/|R|. We can simplify the expression:

E(AP) = (1/|R|) Σ_{r ∈ R} AP(r)

Let’s start with N=2 and P=1. There are two possible rankings, each with equal probability. So E(AP)=1/2(1/1+1/2)=0.75.

Calculation

Naively, we can enumerate all possible rankings, calculate AP for each, and then just divide by the number of possible rankings. I.e. something like this:

import itertools
import math


def expected_ap_naive(N: int, P: int) -> float:
    # enumerate all N! permutations (duplicates included); only feasible for tiny N
    return sum(
        ap_simple(r)
        for r in itertools.permutations([True] * P + [False] * (N - P))
    ) / math.factorial(N)
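
For example, this reproduces the hand computation above:

assert expected_ap_naive(N=2, P=1) == 0.75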

This should terrify you if you know how permutations and factorials scale. Let’s see if we can do any better.

A simple case: P = 1

If P=1, there are N spots that this item can fall into, which significantly simplifies the computation. Recall that we said that the order within the negatives (or positives) does not matter with binary labels.

Here’s an efficient function to compute E(AP) for this special case with P=1:

def expected_ap_p1(N: int) -> float:
    # the single positive lands in position i with probability 1/N, and AP is then 1/i
    return sum(
        1 / i
        for i in range(1, N + 1)
    ) / N
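
A quick consistency check against the naive version (N is small enough here that full enumeration is still cheap):

assert expected_ap_p1(2) == 0.75  # matches the N=2, P=1 example above
assert abs(expected_ap_p1(4) - expected_ap_naive(N=4, P=1)) < 1e-9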

Removing redundant computations

Can we get rid of all those redundant computations with indistinguishable orderings?

We’ll work through a small example to develop our intuition. Say we have 3 items, 2 of which are positive: {🍣: True, 🍔: True, 🍕: False}.

Here are all the possible rankings (there are 3! of them):

What’s the AP for each? At this point you should be able to calculate AP in your sleep. APs are: 1, 1, 0.83, 0.83, 0.58, and 0.58 respectively.

This suggests that we can collapse the indistinguishable cases as follows:

This is good news. Instead of dealing with N! rankings, we can work with N choose P rankings. Why? We only need to choose which P of the N positions hold the positive items, and the order within them does not matter. In this example, the positions are {1, 2, 3} and you want to see how many sets of size 2 (i.e. P) you can make. There are 3 such sets: {1,2}, {1,3}, and {2,3}.

Here’s an implementation:

def expected_ap_naive2(N: int, P: int) -> float:
    N_choose_P = math.comb(N, P)
    return (1 / N_choose_P) * sum(
        ap_simple(
            # place the positives at the chosen (1-indexed) positions
            [x in set(rs) for x in range(1, N + 1)]
        )
        for rs in itertools.combinations(range(1, N + 1), P)
    )
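
A quick consistency check on the 3-item example above (2 positives, 1 negative), which should agree with the fully naive version:

assert abs(expected_ap_naive2(3, 2) - expected_ap_naive(3, 2)) < 1e-9
round(expected_ap_naive2(3, 2), 2)  # 0.81, the average of the six APs listed above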

Simulation

Another approach to computing E(AP) is to just generate a bunch of random rankings and take the mean:

def expected_ap_simulate(N: int, P: int, iters: int = 1_000) -> float:
    r = [True] * P + [False] * (N - P)
    return sum(
        ap_simple(np.random.permutation(r))
        for _ in range(iters)
    ) / iters

This is not exact, but we can increase iters to get a more accurate estimate.

The bad news is that this function scales linearly with N. The good news is that E(AP) tends to p=P/N as we increase N. Here’s pictorial evidence:

Dashed line is p = P / N.

For practical purposes, if you have more than 1k examples (i.e. N ≥ 1k), you can just use p=P/N as your baseline. For smaller N, you’ll have to actually calculate or estimate it.
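
As a rough check of that rule of thumb (this is a simulation, so the exact number will wiggle a bit from run to run):

# for large N, the simulated baseline should land in the vicinity of p = P / N
round(expected_ap_simulate(N=1_000, P=100), 3)  # close to p = 0.1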

Putting worst-case and expected AP together

As I mentioned earlier, things are relatively straightforward for large N (i.e. N≥1k): just make sure your ranker does better than p=P/N. For smaller N, you have to be more careful. The following figure is an attempt to capture all that we covered in this section:

AP for two cases: (a) smallest possible N at each value of p, and (b) N large enough that increasing it further barely changes AP, at the exact same values of p. For both parts: the grey region is not possible, the red region should be avoided as it’s worse than baseline, and the green region is safe. The y-axis is AP.

Takeaways

Placing all negative items at the bottom results in worst-case AP.

Expected AP (i.e. baseline) needs to be calculated or simulated for small values of N. For large values of N, it’s safe to use p = P / N.

AP is sensitive to class imbalance and N. In particular, you have to be extra careful when working with a small dataset.

Is AP=0.6 better than random guessing for a dataset with p=0.5? Intuitively you might think that the answer is yes because AP > p. However, we saw that expected AP can be as high as 0.75 (i.e. for N=2).

What is mean AP (mAP)?

Say you have to produce rankings for 3 different users. You compute AP for each of these rankings and get 2/3, 3/4, and 1/2. How is your ranker doing? The simplest thing you can do is to take the average of these 3 values and report 0.64. You’ve just calculated mAP.
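
In code, that’s all mAP is (a minimal sketch; the variable names are just for illustration):

per_user_ap = [2/3, 3/4, 1/2]
mean_ap = sum(per_user_ap) / len(per_user_ap)
round(mean_ap, 2)  # 0.64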

Is mAP=0.64 good? I hope that all sorts of alarm bells are going off for you after all the discussion we had in the previous section. The answer is that we have no idea just based on the information presented here.

It’s useful once you establish E(mAP), which you can calculate by just taking the average of the E(AP) for each user. From there you can compare 0.64 to that baseline, and also compare mAP for different rankers.

Is it an interpretable metric? Not really! Try explaining 0.64 to your non-technical stakeholders (or yourself after 3 months of not thinking about AP) in the context of the nuances of N and p for each user.

How is AP related to binary classification metrics?

Binary classification metrics require binary predictions, so we need a threshold. For instance, say you have a perfect ranker that ranks five items as follows:

Is there a threshold we can pick to get perfect accuracy? Sure, we can pick 0.5. Is there a threshold to get less than perfect? Yes, 0.8. Is there a threshold to get accuracy = 0? How about precision or recall?

It’s useful to think through some special cases:

  1. If accuracy = 1, does AP = 1?
  2. If AP = 1, does accuracy = 1?
  3. If AP < 1, is it possible to get accuracy = 1?
  4. If precision = 1, does AP = 1?
  5. If AP < 1, is it possible to get precision = 1?
  6. If recall = 0, what’s the possible range of AP?
  7. How about recall = 1?

You can find answers in Appendix 4.

Summary

Here’s what we went over in this post:

  • Looked at AP as a black box and studied its behavior in the context of ranking metric requirements.
  • Defined AP and went through the details of how it’s computed.
  • Inspected worst-case and expected AP as a function of N (number of items) and P (number of positives).
  • Pointed out that a few properties of AP can make it unintuitive to interpret and explain: (1) AP is sensitive to class imbalance, and (2) in addition to imbalance, AP is sensitive to the number of items (N) in the dataset.
  • Briefly discussed mAP.

Next up

We’ll talk about Area Under the Receiver Operating Characteristics curve (AUROC) next. Stay tuned!

Appendix

Appendix 1

Precision and recall curves for user 🙂

Appendix 2

Answers to the ap_simple exercises

assert np.isnan(ap_simple([False]))
assert ap_simple([True]) == 1
assert ap_simple([True] * 10) == 1
assert ap_simple([True, False]) == 1
assert ap_simple([False, True]) == 1/2
assert ap_simple([False, False, True]) == 1/3

Appendix 3: More AP calculation practice

Calculate AP for each user. We’re assuming that the ranker is producing the exact same rank for all 5 users, the only thing changing between users is the ground truth.

Answers are: 29/36, 9/20, 1, NaN, 1/2

Appendix 4: classification metrics quiz

  1. Yes.
  2. Not necessarily, it depends on the threshold. But, there’s some threshold for which this is possible.
  3. No.
  4. Not necessarily, it depends on the threshold. But, there’s some threshold for which this is possible.
  5. Yes.
  6. [0, 1]
  7. [0, 1]
