Object Detection Evaluation Metric Explained

Jiang Wang
5 min read · Dec 11, 2019


An object detection model predicts a bounding box for each object in an image in order to localize it. We usually use mean average precision (mAP) to evaluate object detection model quality. In this article, I will explain in detail how mAP is computed.

IOU-based True and False Positives

Since object detection models output both bounding boxes and a label for each bounding box, a good evaluation metric needs to evaluate the correctness of the bounding box locations and labels.

Given a collection of detected boxes and corresponding groundtruth boxes for an image, we need to be able to determine which of the detections are “correct” (i.e. true positives) and which are “incorrect” (false positives). Since in detection the predicted boxes are never exactly the same as the groundtruth boxes, it is necessary to match the predicted boxes to the groundtruth.

For a predicted box to be a true positive for a given groundtruth box, it has to satisfy the following three criteria (otherwise it is considered to be a false positive):

  • Category match: the predicted category must match the groundtruth category.
  • Location match: To decide whether two boxes are a location match, we measure their overlap by computing the intersection-over-union or IOU (i.e., the area of intersection divided by the area of union, shown in the following figure) and compare it against a threshold value. A typical threshold value is 50%: if the overlap between a predicted box and a groundtruth box is 50% or higher (and the other two bullet points hold), we say that the detected box is a true positive.
  • Faithful: Finally, a predicted box cannot be a true positive for multiple groundtruth boxes (and multiple predicted boxes cannot simultaneously be true positives for a single groundtruth box). In practice, this is enforced by greedily matching predicted boxes to groundtruth boxes, in descending order of score, based on IOU overlap; a sketch of this matching procedure appears after the figure below.
Intersection Over Union
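
As a concrete illustration, here is a minimal Python sketch of the IOU computation and the greedy matching step described above. It assumes axis-aligned boxes in [x_min, y_min, x_max, y_max] format and a single category; the function names and the NumPy dependency are my own choices, not part of any particular detection library.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x_min, y_min, x_max, y_max]."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(pred_boxes, pred_scores, gt_boxes, iou_threshold=0.5):
    """Greedily match predictions (highest score first) to unmatched groundtruth boxes.

    Returns a boolean array marking each prediction as a true positive (True)
    or a false positive (False). Assumes all boxes share the same category.
    """
    order = np.argsort(-np.asarray(pred_scores))
    gt_matched = [False] * len(gt_boxes)
    is_tp = np.zeros(len(pred_boxes), dtype=bool)
    for idx in order:
        best_iou, best_gt = 0.0, -1
        for g, gt in enumerate(gt_boxes):
            if gt_matched[g]:
                continue  # each groundtruth box can be matched at most once
            overlap = iou(pred_boxes[idx], gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, g
        if best_iou >= iou_threshold:
            gt_matched[best_gt] = True
            is_tp[idx] = True
    return is_tp
```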

Precision and Recall

For a detector to be considered “good”, we want: (1) a large fraction of the produced detections to be true positives and (2) the groundtruth boxes corresponding to these detections to comprise a large fraction of all groundtruth boxes in the dataset. When using a fixed IOU threshold, for example 50%, precision and recall can be computed as follows (a code sketch appears after the list):

  • Precision: the fraction of produced detections that are true positives.
  • Recall: the fraction of groundtruth boxes in the dataset that are matched to some produced detection.
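
Given the boolean true-positive flags produced by a matching step like the one sketched earlier, precision and recall at a fixed score threshold can be computed as below. This is a minimal sketch; the function name and inputs are illustrative.

```python
def precision_recall(is_tp, num_groundtruth):
    """Precision and recall at a fixed score threshold.

    is_tp: boolean array over the detections kept at this threshold.
    num_groundtruth: total number of groundtruth boxes in the dataset.
    """
    tp = int(is_tp.sum())
    fp = len(is_tp) - tp
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / num_groundtruth if num_groundtruth > 0 else 0.0
    return precision, recall
```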

Precision Recall Curve

In practice there is typically a trade-off between precision and recall, which is controlled by discarding boxes whose detection scores are below some fixed score threshold. Assuming that the scores give a meaningful ordering of confidence, setting the threshold low allows us to find more objects within an image at the expense of false positives; this would be considered a low precision, high recall setting. Setting the threshold high, on the other hand, allows us to minimize the number of false positives among our detections, which we would consider a high precision, low recall setting. One would choose an operating point depending on the need (high precision, or more detections with false positives allowed).

We can visualize the trade-off between precision and recall more explicitly by creating a so-called precision-recall curve at a given IOU threshold: sweep over many score thresholds and plot the corresponding recall and precision of the detector along the x and y axes of the plot. Good detectors are those that manage to achieve points that are up and to the right in these plots. The following image shows an example of a precision-recall curve, and a short code sketch after the figure shows how such a curve can be traced.

An example of precision-recall curve
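
The sweep can be done efficiently by sorting all detections by score and accumulating true-positive and false-positive counts, which yields one precision/recall point per possible threshold. The sketch below assumes the same true-positive flags as before; the names are illustrative.

```python
import numpy as np

def precision_recall_curve(scores, is_tp, num_groundtruth):
    """Precision/recall pairs obtained by sweeping the score threshold.

    scores: confidence of every detection in the dataset.
    is_tp:  whether each detection was matched to a groundtruth box.
    Returns (recalls, precisions), ordered by decreasing threshold.
    """
    order = np.argsort(-np.asarray(scores))
    tp_flags = np.asarray(is_tp, dtype=float)[order]
    tp_cum = np.cumsum(tp_flags)
    fp_cum = np.cumsum(1.0 - tp_flags)
    precisions = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    recalls = tp_cum / max(num_groundtruth, 1)
    return recalls, precisions
```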

Mean Average Precision

Based on precision-recall curves at one or more IOU thresholds, we can compute the following metrics:

  • Average Precision (AP): We can compute the average precision at a given IOU threshold by computing the area under the precision-recall curve. This is commonly approximated by selecting 11 equally spaced recall values [0, 0.1, 0.2, …, 1.0] and sampling the precision at these recall values, where the precision at each recall value is taken to be the maximum precision achieved at any recall greater than or equal to it (the interpolated precision). A code sketch of this computation appears after this list.
  • Integrated AP: To compute integrated AP, we average the AP over multiple values of the IOU threshold. For example, we can average over thresholds from 50% to 95% in increments of 5%.
  • Category AP: To compute category AP, we compute AP independently for each class. This lets us evaluate a detector's performance separately for people, dogs, cats, etc. Category AP can be computed at a fixed IOU threshold or averaged over multiple IOU thresholds, as with integrated AP.
  • Mean AP (mAP): The mean AP (or mAP) is the category AP averaged over all classes. Mean AP is often taken as a single number that represents the performance of a detector on a dataset. Note that this metric ignores class imbalance: even if a dataset contains 20 dogs for every cat, mean AP still gives equal weight to both categories.
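
Putting the pieces together, the sketch below computes an 11-point interpolated AP from a precision-recall curve and an unweighted mean over per-class APs. This is an illustrative implementation of the procedure described above, not the exact code of any particular benchmark; the input names are assumptions.

```python
import numpy as np

def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP: sample interpolated precision at recall = 0, 0.1, ..., 1.0."""
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        # Interpolated precision: the maximum precision at any recall >= r.
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

def mean_average_precision(per_class_pr):
    """mAP: unweighted mean of per-class AP values.

    per_class_pr maps class name -> (recalls, precisions) arrays.
    """
    aps = [average_precision_11pt(r, p) for r, p in per_class_pr.values()]
    return float(np.mean(aps)) if aps else 0.0
```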

Conclusion and Caveats

mAP is often treated as the gold-standard metric because it is a good summary of the performance of detection models. However, it might not be the best metric for your problem. Here are some things to consider when you want to use mAP to evaluate your object detection models.

  • How to choose an IOU threshold: Integrated mAP aggregates over a range of IOU thresholds. However, if the accuracy of bounding box locations is critical for your application, you might want to use mAP@0.75IOU as your evaluation metric.
  • Data imbalance: mAP does not apply any weights when averaging the category APs. However, if the distribution of classes actually reflects the importance of each class, you might want to consider using a weighted average instead (a small sketch follows this list).
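
For the second caveat, a weighted variant is straightforward: weight each class's AP by, for example, its groundtruth frequency. The sketch below is purely illustrative and assumes you already have per-class AP values and groundtruth counts.

```python
import numpy as np

def weighted_map(per_class_ap, per_class_count):
    """Weighted mAP: weight each class AP by its groundtruth frequency (illustrative variant)."""
    classes = list(per_class_ap)
    weights = np.array([per_class_count[c] for c in classes], dtype=float)
    aps = np.array([per_class_ap[c] for c in classes], dtype=float)
    return float((aps * weights).sum() / weights.sum())
```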
