Fifteen Minutes with FiftyOne: Seeing without Looking

Rescoring object detections to maximize mAP

Eric Hofesmann
Voxel51
6 min read · Sep 15, 2020


Countless models exist for detecting (localizing and classifying) objects in images. Nearly all of them have one thing in common: they output a set of detections, each containing bounding box coordinates, a predicted class, and a confidence score.

COCO 2017 image with Cascade R-CNN original and rescored predictions visualized in FiftyOne

Seeing without Looking [1] is a paper published at CVPR 2020 that introduces a new approach called contextual rescoring. Contextual rescoring takes detections as input and recomputes their confidence scores to improve the mean average precision (mAP) of the detections against the ground truth. Most object detectors score each prediction independently, with no context shared between predictions, resulting in duplicate or out-of-place detections. Incorporating the context of other detections and rescoring the classifications can help avoid these issues and boost performance.

Seeing without Looking

Contextual Rescoring Architecture

Seeing without looking architecture used to recompute confidence scores (source)

The idea behind contextual rescoring is to retain all predicted bounding boxes and classes and only change the confidence scores. These rescored confidences will then lead to different precision-recall curves when computing mAP.

To train a model to rescore predictions, targets for the rescoring must first be generated. The targets that maximize mAP are computed by:

  1. Changing how detections are matched with ground truth
  2. Selecting the optimal confidence score for each detection
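As a rough illustration of step 1 (not the authors' exact procedure; the helper names and the greedy highest-confidence-first matching are simplifications), a binary target of 1.0 can be assigned to the single best detection for each ground-truth box and 0.0 to all other detections:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) in absolute coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def rescoring_targets(detections, ground_truth, iou_thresh=0.5):
    """Target confidence 1.0 for the detection matched to each ground-truth
    box (greedy, highest original confidence first), 0.0 for the rest."""
    targets = [0.0] * len(detections)
    matched_gt = set()
    order = sorted(range(len(detections)),
                   key=lambda i: -detections[i]["confidence"])
    for i in order:
        det = detections[i]
        for j, gt in enumerate(ground_truth):
            if (j not in matched_gt and det["label"] == gt["label"]
                    and iou(det["box"], gt["box"]) >= iou_thresh):
                targets[i] = 1.0  # this detection "claims" the ground truth
                matched_gt.add(j)
                break
    return targets
```

Under targets like these, a duplicate box over an already-claimed object receives target 0.0, which is what pushes the rescorer toward suppressing duplicates.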

The network architecture used to perform this contextual rescoring consists of a recurrent neural network that takes all detections for an image as input, followed by a self-attention layer to capture long-range dependencies between the detections. A multi-layer perceptron then regresses a new confidence score between 0 and 1 for every detection. More details of this implementation can be found in the original work [1].
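A sketch of such an architecture in PyTorch might look like the following. This is not the authors' exact implementation: the layer sizes are arbitrary, and the assumed input feature of size 85 per detection (4 box coordinates + 1 confidence + an 80-way class one-hot for COCO) is my own choice.

```python
import torch
import torch.nn as nn

class ContextualRescorer(nn.Module):
    """Sketch: bidirectional GRU over all detections in an image,
    self-attention for long-range context between detections, then an
    MLP that regresses one new confidence per detection."""

    def __init__(self, feat_dim=85, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # confidence in (0, 1)
        )

    def forward(self, dets):  # dets: (batch, n_detections, feat_dim)
        h, _ = self.rnn(dets)
        h, _ = self.attn(h, h, h)  # every detection attends to every other
        return self.mlp(h).squeeze(-1)  # (batch, n_detections)

# Rescore 30 detections for each of 2 images
scores = ContextualRescorer()(torch.randn(2, 30, 85))
```

Such a network would be trained with a regression loss between `scores` and the per-detection targets described above.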

Error Analysis

The authors performed an error analysis following [2] on Faster R-CNN [3] and Cascade R-CNN [4] on the COCO 2017 dataset [5]. This analysis looks at false-positive detections in the dataset and accumulates the different types of errors that occurred.

The specific errors that were analyzed for R-CNN false positives:

  • Localization error: correct class but wrong location (IoU < 0.5), or correct location but a duplicate detection
  • Confusion with similar class: same COCO supercategory (IoU > 0.1)
  • Confusion with dissimilar class: different COCO supercategory (IoU > 0.1)
  • Confusion with background: remaining false positives (IoU < 0.1)

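Given each false positive's best-matching ground-truth box, these categories could be assigned with logic along these lines (a simplification of [2]; the toy supercategory table stands in for the full COCO taxonomy):

```python
# Toy supercategory lookup standing in for the full COCO taxonomy
SUPERCATEGORY = {"car": "vehicle", "truck": "vehicle", "person": "person"}

def fp_error_type(pred_label, best_match_label, best_iou, duplicate=False):
    """Categorize one false positive following the taxonomy of Hoiem et al.
    `best_match_label`/`best_iou` describe its best-overlapping ground truth."""
    if best_match_label == pred_label:
        # Right class, but poorly localized -- or a duplicate detection
        if best_iou < 0.5 or duplicate:
            return "localization"
        return None  # well-localized, non-duplicate: a true positive
    if best_iou > 0.1:
        same_super = (SUPERCATEGORY.get(pred_label)
                      == SUPERCATEGORY.get(best_match_label))
        return "similar" if same_super else "dissimilar"
    return "background"  # barely overlaps any ground truth
```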
Results of error analysis on COCO 2017 validation for models with a ResNet-101 backbone (source)

These results show that the majority of high-confidence predictions are false positives of some kind rather than correct detections.

Results of error analysis on COCO 2017 validation after contextual rescoring (source)

The number of background false positives was reduced in both models after rescoring. There was also a decrease in the number of localization errors, indicating that duplicate detections were likely suppressed.

As a result of rescoring, the mAP of Faster R-CNN with a ResNet-101 backbone improved by 0.5% and that of Cascade R-CNN with ResNet-101 by 0.7%.

This kind of analysis is exactly what the dataset visualization and model exploration tool FiftyOne [6] is designed for. In fact, the same analysis was performed on Google’s Open Images dataset and it was found that a third of false positives were in fact dataset errors [7]!

Digging in with FiftyOne

I have previously written an article on why looking only at differences in mAP between models may not be as helpful as digging into the model results and examining individual samples hands-on. To that point, let's go through some contextual rescoring outputs and see how this method has modified detections in individual samples.

COCO 2017 image with Cascade R-CNN original and rescored predictions visualized in FiftyOne | Pink are Cascade detections | Blue are rescored detections

One of the proposed outcomes of contextual rescoring is a form of learned non-maximum suppression, where duplicate detections are suppressed so that only one detection per object remains. The example above shows how contextual rescoring has suppressed multiple carrot detections so that only a single carrot detection remains, at a confidence of 0.2.

Below is an example of two tie detections, similar to an example from the paper. After contextual rescoring, the confidence of the false-positive tie detection was decreased from 0.83 to 0.31 while the confidence of the true positive was only decreased from 0.98 to 0.85.

COCO 2017 image with Cascade R-CNN original and rescored predictions visualized in FiftyOne | Pink are Cascade detections | Blue are rescored detections

This deduplication of detections seems rather hit-or-miss, though. Below is another example where the rescoring function did not actually help disentangle the duplicate skis and instead just decreased the confidence scores of both boxes, even reducing the gap in confidence between them.

COCO 2017 image with Cascade R-CNN original and rescored predictions visualized in FiftyOne | Pink are Cascade detections | Blue are rescored detections

Looking at more examples in FiftyOne, it became clear that the rescoring function most often just decreases the confidence of all boxes.

This is not necessarily a bad thing, especially since it does not decrease all confidences equally. Lowering the confidences of all predictions generally results in fewer false positives, since higher-confidence predictions are more likely to be true positives. Looking back at the pie charts above, it makes sense that the percentage of true positives increases even if their absolute number stays the same, because the number of false positives decreases.
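This effect can be illustrated in a few lines with toy numbers (not from the paper): if all confidences shrink, but low-confidence detections shrink more, then at a fixed confidence threshold the mostly-false low-confidence detections fall away and precision rises.

```python
# (confidence, is_true_positive) pairs for a toy set of detections;
# the low-confidence detections here are all false positives
dets = [(0.9, True), (0.8, True), (0.6, False), (0.5, False), (0.4, False)]

def precision_at(dets, thresh):
    kept = [tp for conf, tp in dets if conf >= thresh]
    return sum(kept) / len(kept) if kept else 0.0

# Rescoring-style shrink: every confidence decreases, but the drop is
# proportionally larger for lower-confidence detections
rescored = [(conf * conf, tp) for conf, tp in dets]

before = precision_at(dets, 0.5)     # 2 TP of 4 kept
after = precision_at(rescored, 0.5)  # 2 TP of 2 kept
```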

Conclusion

There is no doubt that contextual rescoring is an efficient and clever method for boosting the mAP of object detection models. The ability to apply it to any object detections that contain a bounding box, classification, and confidence is extremely powerful. Our analysis indicates that contextual rescoring may not entirely replace non-maximum suppression. However, it has also shown that this method is effective at reducing confidences, resulting in fewer false positives and higher mAP.

References

[1] Pato L., et al., Seeing without Looking: Contextual Rescoring of Object Detections for AP Maximization, CVPR (2020)

[2] Hoiem D., et al., Diagnosing Error in Object Detectors, ECCV (2012)

[3] Ren S., et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NeurIPS (2015)

[4] Cai Z. and Vasconcelos N., Cascade R-CNN: Delving into High Quality Object Detection, CVPR (2018)

[5] Lin T., et al., Microsoft COCO: Common Objects in Context, ECCV (2014)

[6] Voxel51, FiftyOne: Explore, Analyze and Curate Visual Datasets (2020)

[7] Ganter T., I performed Error Analysis on Google's Open Images dataset and now I have trust issues, Towards Data Science (2020)

About Me

My name is Eric Hofesmann. I received my master's in computer science, specializing in computer vision, at the University of Michigan. During my graduate studies, I realized that it was incredibly difficult to thoroughly analyze a new model or method without serious scripting to visualize and search through outputs. Working at the computer vision startup Voxel51, I helped develop FiftyOne so that researchers (myself included) can quickly load up and start looking through datasets and model results. This series of posts goes through state-of-the-art computer vision models and datasets and analyzes them with FiftyOne.
