Evaluating Object Detectors: Average Precision (AP), and Localization-Recall-Precision (LRP)

Kemal Öksüz
10 min read · Dec 5, 2018


Measuring the performance of an algorithm accurately is crucial, since the community strives to outperform the state of the art in terms of these performance metrics/measures. Thus, in a way, performance measures and metrics lead the community.

In object detection, Average Precision (AP) went unchallenged until it was questioned in our ECCV paper “Localization Recall Precision (LRP): A New Performance Metric for Object Detection”. In this article, my aim is to introduce the basic ideas of that paper. I reserve the definitions of AP and oLRP for Sections 3 and 4, since they involve some equations that may seem intimidating, and instead begin directly with the intuition behind our performance metric, Optimal Localization-Recall-Precision (oLRP), and a comparison on a toy example. For readers who prefer the formal definitions first, I suggest reading the sections in the order 3, 4, 1, 2.

1- Which one to choose for an application: AP or oLRP?

Here, I try to relate the intuition behind AP and oLRP to real-world applications. When evaluating an object detector, the main difference between AP and oLRP is:

AP: Considers the entire recall domain and determines a performance score.

oLRP: Moves along the Recall-Precision (RP) curve, computing a performance score composed of localization, recall and precision errors at each point (this is the LRP score of an RP point), and finally finds the best configuration the detector can achieve (the minimum achievable LRP is the oLRP).

Now let’s look at an example of how these two ideas can conflict. Below, there are 3 object detectors: B, G and S. Let’s compare B and S. The area under the curve of B is obviously larger than that of S; however, the best (top-right) point of the curve of S seems better. As a result, S has a better recall & precision pair than B in its best configuration. Now, the questions are: If you were developing a real-world application (e.g., a robot), which one would you prefer, B or S? Would you set a threshold or use the entire detection set?

Here is another example that shows the output of an object detector. The image on the left is the input and the one on the right shows all the detection results of Faster R-CNN. Although there is only one train in the image, the large number of detections (most of them with very low confidence scores) makes it difficult to use them all.

An image from the Pascal VOC dataset

Finally, let’s look at some statistics. There are 36,781 annotations in the MS COCO dataset. RetinaNet produces 486,108 bounding boxes, while Faster R-CNN yields 127,039 (thanks to its RPN). For RetinaNet, the confidence scores of 57% of the detections are under 0.1, and 87% of them are under 0.25 (these values are 29% and 56% for Faster R-CNN). So, there are lots of detections with very low confidence scores, which are not easy to use due to their very low precision. That is why I believe this large number of detections, which corresponds to the tail of the RP curve, must not be included in the performance evaluation. The same applies to the very-low-recall, higher-precision portion of the RP curve. Thus, we believe that object detectors should be evaluated at the best recall-precision configuration they can provide, since this is how they are used in real-world applications.
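As a quick illustration of how such statistics can be gathered, here is a tiny sketch; the scores array is a random placeholder standing in for the confidence scores a detector would actually produce on the validation set:

    import numpy as np

    # Placeholder: in practice `scores` would hold the confidence score of every
    # detection the model produced on the validation set (e.g. loaded from a
    # results file); here it is random data of roughly RetinaNet's output size.
    scores = np.random.rand(486_108)

    for threshold in (0.1, 0.25):
        fraction = (scores < threshold).mean()
        print(f"{fraction:.0%} of the detections have a confidence score below {threshold}")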

2- Comparing oLRP with AP on a Toy Example

The Scenario: Below, there are 3 object detectors. Each example image is paired vertically with the RP curve of its corresponding detector. Note that these 3 object detectors have completely different limitations. The leftmost detector, say D1, is very precise but has a recall error. The detector in the middle, say D2, has a duplication problem, which results in low precision. Finally, the rightmost detector, say D3, behaves like a conventional object detector with several low-confidence false positives, so it reaches high recall values only at very low precision. In addition, D3 does not produce tight boxes.

Next, let’s compare AP with oLRP. Note that for AP higher is better, whereas for oLRP lower is better, since it is an error by definition.

1- Even though the characteristics of the detectors are entirely different, they all have the same AP of 0.5 (check the area under each curve). So AP obviously cannot distinguish between very different RP curves. However, using the components of oLRP, we can directly pinpoint the peak points of the detectors. For D1, 1-FP Error=1 and 1-FN Error=0.5; for D2, these two components are interchanged, as expected; and D3 has the errors of both D1 and D2. This is why we can diagnose the problem of an object detector via the components of oLRP.

2- Localization is not taken into account by AP, but oLRP has a localization component, which penalizes D3 the most. D3 has the highest localization error at 0.4, so the average tightness of its detections is 1-Localization Error=0.6. Overall, D3 is the worst, with an error of 0.93, since it has the errors of both D1 and D2 plus worse localization.

3- AP is not sensitive to confidence scores. As long as the sorted order is preserved, one can change the confidence scores of the detections, squeeze them into a very small interval, and AP reports nothing about it. In contrast, the confidence score threshold is a parameter of LRP/oLRP.

4- AP cannot find the best configuration of an object detector since it considers the entire recall domain; oLRP, however, specifies class-specific optimal thresholds.

5- AP uses the interpolated RP curve, while oLRP is determined on the RP curve itself. Since these RP curves are plotted on validation/test data, they reflect the real-life behaviour of an object detector, and we assert that performance should be estimated without interpolation.

6- AP is not fair when comparing thresholded object detectors. To see this, consider the following figure, in which a video object detector S uses a thresholded subset of the detections of the still-image detector B for faster association between frames. Since S uses only some of B’s detections, it directly loses the tail of the recall domain. On the other hand, thanks to its association mechanism, S reaches a better precision & recall combination than B. Note that in their optimal configurations, marked by crosses, S is higher than B in both precision and recall. Since a large portion of the area under the curve, corresponding to the tail, is voluntarily conceded by S for the sake of faster association, it is not fair to compare S and B in terms of AP. However, we can fairly compare their best configurations in terms of oLRP.

7- AP is not a metric, since basically the symmetry property is violated, whereas LRP is a proven metric.

3- How AP Evaluates: Considering the Entire Recall Domain

When I first started working on the object detection problem, I spent some time looking for a thorough resource that explains AP step by step. That is why I include a definition of AP here.

First of all, AP is a “per-class” measure. For a class, the 5-step process to compute AP is as follows (a minimal code sketch follows the steps):

I) For the entire dataset, sort all the detections labeled with the same class in decreasing order of their confidence scores.

II) Now, go over this sorted detection list and check whether each detection can be assigned to a ground truth. The assignment (i.e., labeling as a True Positive) is based on the Intersection over Union (IoU) of the detection with a ground truth. There is a TP validation threshold in terms of IoU, and it is generally 0.5. In this step, note that a ground truth can be matched with only one detection (and this detection is the one with the highest confidence score, since we go over a sorted list).

IoU is computed by dividing the area of overlap by the area of union, so it is an indicator of the tightness of the detection boxes.

III) At the end of the previous step, we have identified the True Positive (TP) detections, the False Positive (FP) detections and the False Negative (FN) ground truth boxes. So, by going over the sorted detection list again, we can compute the precision at each recall value and draw the Recall-Precision (RP) curve. The blue curve in the figure is the RP curve.

IV) To remove the wiggles of the RP curve, interpolate the blue curve at each recall point to the highest precision attained at that recall or any higher recall. Thus, the red curve, called the interpolated RP curve, is obtained.

The blue curve is the RP curve and the red curve is the interpolated RP curve

V) Finally, as far as I know, there are 3 different methods to compute AP using the interpolated RP curve:

  • “Area under the curve approach”: simply compute the area under this curve to find the AP. This is used in the ImageNet Object Detection Challenge.
  • “Arithmetic average approach”: divide the recall domain into evenly spaced points, read the precision values at these recall points and take their average. The older Pascal VOC metric computed AP in this way using 11 recall points.
  • MS COCO style AP: an extended version of the arithmetic average approach. It uses 101 recall points and computes AP at 10 different TP validation thresholds (0.5, 0.55, 0.6, …, 0.95) in terms of IoU in order to implicitly include the localization error. So indeed, it is “the arithmetic average of the arithmetic average approach” over different TP validation thresholds.
The first and second AP computation approaches. The third one (COCO-style AP) uses the arithmetic average of the arithmetic average approach.

Mean Average Precision (mAP for short) is the performance measure assigned to an object detector as a whole (not to a single class). In all three cases, mAP is the average of the per-class APs.
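To make these five steps concrete, here is a minimal single-class sketch in Python. The data layout (detections as (image_id, box, score) tuples with boxes in (x1, y1, x2, y2) corner format, ground truths grouped per image) and all helper names are my own assumptions rather than anything fixed by the paper; the final step follows the older Pascal VOC 11-point average, and the COCO-style number would repeat the same computation over 101 recall points and ten IoU thresholds.

    def iou(box_a, box_b):
        # Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    def average_precision(detections, ground_truths, tau=0.5):
        # detections: (image_id, box, score) tuples of ONE class over the whole dataset.
        # ground_truths: dict mapping image_id -> list of ground-truth boxes of that class.
        # tau: TP validation threshold on IoU (0.5, as in Pascal VOC).
        # Step I: sort the detections by decreasing confidence score.
        detections = sorted(detections, key=lambda d: d[2], reverse=True)
        matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
        num_gt = sum(len(boxes) for boxes in ground_truths.values())

        # Step II: greedily label each detection as TP or FP.
        tps, fps = [], []
        for img, box, _score in detections:
            best_iou, best_idx = 0.0, -1
            for i, gt_box in enumerate(ground_truths.get(img, [])):
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_idx = overlap, i
            if best_idx >= 0 and best_iou >= tau and not matched[img][best_idx]:
                matched[img][best_idx] = True  # a ground truth is matched only once
                tps.append(1)
                fps.append(0)
            else:
                tps.append(0)
                fps.append(1)

        # Step III: precision and recall after each detection in the sorted list.
        precisions, recalls = [], []
        tp_cum = fp_cum = 0
        for tp, fp in zip(tps, fps):
            tp_cum += tp
            fp_cum += fp
            precisions.append(tp_cum / (tp_cum + fp_cum))
            recalls.append(tp_cum / num_gt if num_gt else 0.0)

        # Steps IV-V: interpolate (highest precision at this recall or beyond)
        # and average over 11 evenly spaced recall points (older Pascal VOC style).
        def interpolated_precision(r):
            candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
            return max(candidates) if candidates else 0.0

        return sum(interpolated_precision(r / 10.0) for r in range(11)) / 11.0

mAP would then simply be the mean of average_precision over all classes.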

4- Optimal Localization-Recall-Precision: Evaluating Each Point on the RP Curve

Rather than considering the entire recall domain to assign a single performance value to the object detector, as AP does, the Localization-Recall-Precision metric determines a performance value for each point on the RP curve (not the interpolated RP curve). The performance metric for a class is the oLRP, defined as the minimum achievable LRP error on the RP curve. So, LRP is the performance at an ordinary RP point, while oLRP is the minimum LRP value.

The way we move along the RP curve is by changing the confidence score threshold, denoted by s. We first take all detections without applying any threshold and compute a performance score (the LRP score) based on the localization error of the TPs, the number of FPs and the number of FNs. Then, we increase the confidence score threshold to 0.01 and compute the LRP again. We continue until the confidence score threshold reaches 1. So, we have 101 LRP values, each corresponding to a different threshold. The minimum of these 101 values is the oLRP, and the corresponding confidence score (s* in the figure) is the class-specific optimal threshold. (Thresholding object detectors is beyond the scope of this article.)
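A sketch of this sweep, assuming the same (image_id, box, score) detection layout as the AP sketch above and an lrp_error helper like the one sketched at the end of this section:

    def optimal_lrp(detections, ground_truths, tau=0.5):
        # Sweep the confidence threshold s over 0.00, 0.01, ..., 1.00 and keep the
        # minimum LRP error (the oLRP) and the class-specific optimal threshold s*.
        best_error, best_s = float("inf"), None
        for step in range(101):
            s = step / 100.0
            y_s = [d for d in detections if d[2] >= s]  # Y_s: detections surviving s
            error = lrp_error(y_s, ground_truths, tau)  # LRP of this RP point
            if error < best_error:
                best_error, best_s = error, s
        return best_error, best_s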

So, the main idea distinguishing oLRP from AP is that oLRP finds the best performance configuration that the detector can achieve for a class, while AP considers the entire recall domain to yield a single performance value for that class.

Now let’s talk about how a single point is evaluated (i.e., what is LRP?). For the LRP score of a point, we need the number of FPs, the number of FNs and the IoU of each TP. These values are obtained as in AP. There are two definitions of the LRP metric below: a component-based one and one without components (my favorite is the second one). The parameters are s, the confidence score threshold, and τ, the TP validation threshold. The inputs are X, the ground truth set, and Y_s, the detection set thresholded at s.

The component-based definition is as follows:

Component Based Definition of LRP

Here, LRP is defined as the “normalized weighted combination of the localization component, the FP component and the FN component”. Let’s go over each part of this definition.

Weights: The weights are proportional to the set sizes, as noted. There is 1-τ in the denominator of the localization component’s weight since we want each box to add an error within [0, 1]. Without this denominator, a TP’s error would end up within the interval [0, 1-τ]. These weights also preserve the metricity of LRP and prevent the total error from being undefined even when a single component is undefined.

Normalization Constant (Z): It is simply the total number of error contributors (TPs, FPs and FNs).

Localization Component: It penalizes each TP by the missing portion of its IoU. Note that 1-Localization Error is exactly the average IoU of the TPs, so this component is an indicator of the tightness of the boxes.

False Positive/Negative Components: They are simply the missing parts of the precision and the recall, respectively. At the oLRP, 1-FP Error and 1-FN Error together represent the peak point of the RP curve. So, we can understand the shape of the RP curve by inspecting these two components at the oLRP.

Next comes my favorite definition:

LRP as the per-bounding box error

Here, the normalization constant is the same, but there are no weights or components anymore. The part of the definition in parentheses is the total error. Each TP is again penalized by the missing portion of its IoU, normalized to the [0, 1] interval using 1-τ. Each FP and each FN adds an error of exactly 1, which is the upper bound of the penalty for a box. Then, using the denominator, the total error is normalized by the total number of error contributors. In this sense, LRP is “the per-bounding-box error”.
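A minimal sketch of this per-bounding-box form, reusing the iou helper and the (image_id, box, score) / per-image ground-truth layout assumed in the AP sketch of Section 3 (the TP/FP matching mirrors the one used there):

    def lrp_error(detections, ground_truths, tau=0.5):
        # detections: the (image_id, box, score) tuples surviving the threshold s (Y_s).
        # ground_truths: dict mapping image_id -> list of ground-truth boxes (X).
        detections = sorted(detections, key=lambda d: d[2], reverse=True)
        matched = {img: [False] * len(boxes) for img, boxes in ground_truths.items()}
        num_gt = sum(len(boxes) for boxes in ground_truths.values())

        tp_ious = []  # IoU of every TP
        num_fp = 0
        for img, box, _score in detections:
            best_iou, best_idx = 0.0, -1
            for i, gt_box in enumerate(ground_truths.get(img, [])):
                overlap = iou(box, gt_box)
                if overlap > best_iou:
                    best_iou, best_idx = overlap, i
            if best_idx >= 0 and best_iou >= tau and not matched[img][best_idx]:
                matched[img][best_idx] = True
                tp_ious.append(best_iou)
            else:
                num_fp += 1

        num_tp = len(tp_ious)
        num_fn = num_gt - num_tp
        z = num_tp + num_fp + num_fn  # total number of error contributors
        if z == 0:
            return 0.0  # no ground truths and no detections: nothing to penalize
        # Each TP contributes its missing IoU normalized by 1 - tau;
        # each FP and each FN contributes exactly 1.
        localization_total = sum((1.0 - overlap) / (1.0 - tau) for overlap in tp_ious)
        return (localization_total + num_fp + num_fn) / z

This is numerically the same as the weighted component form above; the individual components can be read off as sum(1 - o for o in tp_ious) / num_tp (localization), num_fp / (num_tp + num_fp) (1 - precision) and num_fn / num_gt (1 - recall). The oLRP of the class is then the minimum of lrp_error over the 101 thresholds, exactly as in the sweep sketched earlier in this section.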

The ranges of the total error and of the components are [0, 1], and a lower value implies better performance (as opposed to AP). Similarly, moLRP is computed by averaging oLRP over the classes.
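For completeness, aggregating over classes is then a one-liner (the per-class values below are made up):

    # Hypothetical per-class oLRP errors; moLRP is simply their mean over the classes.
    per_class_olrp = {"person": 0.62, "car": 0.58, "train": 0.71}
    molrp = sum(per_class_olrp.values()) / len(per_class_olrp)
    print(f"moLRP = {molrp:.3f}")  # lower is better, range [0, 1]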
