Object detection: speed and accuracy comparison (Faster R-CNN, R-FCN, SSD, FPN, RetinaNet and YOLOv3)

It is very hard to make a fair comparison among different object detectors. There is no straight answer on which model is the best. For real-life applications, we make choices to balance accuracy and speed. Besides the detector type, we need to be aware of other choices that impact performance:

  • Feature extractors (VGG16, ResNet, Inception, MobileNet).
  • Output strides for the extractor.
  • Input image resolutions.
  • Matching strategy and IoU threshold (how predictions are excluded in calculating loss).
  • Non-max suppression IoU threshold.
  • Hard example mining ratio (positive vs. negative anchor ratio).
  • The number of proposals or predictions.
  • Boundary box encoding.
  • Data augmentation.
  • Training dataset.
  • Use of multi-scale images in training or testing (with cropping).
  • Which feature map layer(s) for object detection.
  • Localization loss function.
  • Deep learning software platform used.
  • Training configurations including batch size, input image resize, learning rate, and learning rate decay.

Worse, the technology evolves so fast that any comparison becomes obsolete quickly. Here, we summarize the results from individual papers so you can view them together. Then we present a survey from Google Research. By presenting multiple viewpoints in one context, we hope to understand the performance landscape better.
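Two of the choices listed above, the matching IoU threshold and the non-max suppression IoU threshold, rest on the same box-overlap measure. Below is a minimal sketch of both, assuming boxes are (x1, y1, x2, y2) tuples. The function names `iou` and `nms` are illustrative and do not come from any of the papers' code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: return indices of kept boxes, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Illustrative boxes: the second overlaps the first heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)  # the second box is suppressed
```

Raising `iou_threshold` keeps more overlapping boxes, which is one reason results measured with different thresholds are not directly comparable.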

Performance results

In this section, we summarize the performance reported by the corresponding papers. Feel free to browse through this section quickly.

Faster R-CNN (Source)

These are the results on the PASCAL VOC 2012 test set. We are interested in the last 3 rows, which represent the Faster R-CNN performance. The second column is the number of RoIs produced by the region proposal network. The third column is the training dataset used. The fourth column is the mean average precision (mAP), which measures accuracy.

Results on PASCAL VOC 2012 test set.

VOC 2012 for Faster R-CNN.

Results on MS COCO.

COCO for Faster R-CNN

Timing on a K40 GPU in milliseconds on the PASCAL VOC 2007 test set.

R-FCN (Source)

Results on PASCAL VOC 2012 test set.

VOC 2012 for R-FCN

(Multi-scale training and testing are used on some results.)

Results on MS COCO.

COCO for R-FCN

SSD (Source)

These are the results on PASCAL VOC 2007, 2012 and MS COCO using 300 × 300 and 512 × 512 input images.

SSD

(SSD300* and SSD512* apply data augmentation for small objects to improve mAP.)

Performance:

Speed is measured with a batch size of 1 or 8 during inference.

(YOLO here refers to v1, which is slower than YOLOv2 or YOLOv3.)

Result on MS COCO:

COCO for SSD

YOLO (Source)

Results on PASCAL VOC 2007 test set.

VOC 2007 for YOLOv2

(We add the VOC 2007 test here because it has the results for different image resolutions.)

Results on PASCAL VOC 2012 test set.

VOC 2012 for YOLOv2

Results on MS COCO.

COCO for YOLOv2

YOLOv3 (Source)

Results on MS COCO

COCO for YOLOv3

Performance for YOLOv3

Performance for YOLO2 with COCO

FPN (Source)

Results on MS COCO.

COCO for FPN

RetinaNet (Source)

Results on MS COCO

COCO for RetinaNet

Speed (ms) versus accuracy (AP) on MS COCO test-dev.

COCO for RetinaNet

Comparing paper results

It is unwise to compare results side-by-side from different papers. Those experiments are done in different settings that were not designed for apples-to-apples comparisons. Nevertheless, we plot them together so that you at least get a big picture of approximately where they stand. But be warned that these numbers should never be compared directly.

For the results presented below, the models are trained with both PASCAL VOC 2007 and 2012 data. The mAP is measured on the PASCAL VOC 2012 test set. For SSD, the chart shows results for 300 × 300 and 512 × 512 input images. For YOLO, it shows results for 288 × 288, 416 × 416 and 544 × 544 images. For the same model, higher resolution images give better mAP but are slower to process.

* denotes that small object data augmentation is applied.

** indicates that the results are measured on the VOC 2007 test set. We include those because the YOLO paper is missing many VOC 2012 test results. Since VOC 2007 results are generally better than VOC 2012 results, we add the R-FCN VOC 2007 result as a cross reference.

Input image resolutions and feature extractors impact speed. Below are the highest and lowest FPS reported by the corresponding papers. Yet the results below can be highly biased, in particular because they are measured at different mAP.

Result on COCO

For the last couple of years, many results have been measured exclusively on the COCO object detection dataset. The COCO dataset is harder for object detection, and detectors usually achieve a much lower mAP. Here is the comparison for some key detectors.

FPN and Faster R-CNN* (using ResNet as the feature extractor) have the highest accuracy (mAP@[.5:.95]). RetinaNet builds on top of FPN using ResNet, so the high mAP achieved by RetinaNet is the combined effect of pyramid features, the feature extractor’s complexity and the focal loss. Yet, be warned that this is not an apples-to-apples comparison. We will present the Google survey later for a better comparison, but it is useful to view everyone’s claims first.
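The mAP@[.5:.95] metric used for COCO averages the AP over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05. Below is a minimal sketch of that averaging step only. The helper names and the AP values are made up for illustration and are not results from any paper.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * k, 2) for k in range(10)]


def map_over_iou_range(ap_by_threshold):
    """Average per-threshold AP values into a single mAP@[.5:.95] number."""
    thresholds = coco_iou_thresholds()
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


# Made-up AP values that fall as the IoU requirement tightens.
example_ap = {t: 0.60 - 0.5 * (t - 0.5) for t in coco_iou_thresholds()}
example_map = map_over_iou_range(example_ap)
```

Because AP usually drops at stricter IoU thresholds, the averaged number is lower than AP at IoU 0.5 alone. This is part of why COCO numbers look much lower than PASCAL VOC numbers.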

Takeaway so far

Single shot detectors achieve pretty impressive frames per second (FPS) using lower resolution images, at the cost of accuracy. Those papers try to prove they can beat the region based detectors’ accuracy. However, that is less conclusive since higher resolution images are often used for such claims. Hence, their scenarios are shifting. In addition, different optimization techniques are applied, which makes it hard to isolate the merit of each model. In fact, single shot and region based detectors are becoming much more similar in design and implementation. But with some reservations, we can say:

  • Region based detectors like Faster R-CNN demonstrate a small accuracy advantage if real-time speed is not needed.
  • Single shot detectors are here for real-time processing. But applications need to verify whether they meet their accuracy requirements.

Comparison of SSD MobileNet, YOLOv2, YOLO9000 and Faster R-CNN

Here is a video comparing detectors side-by-side.

Report by Google Research (Source)

Google Research offers a survey paper that studies the tradeoff between speed and accuracy for Faster R-CNN, R-FCN, and SSD. (YOLO is not covered by the paper.) It re-implements those models in TensorFlow using the MS COCO dataset for training. It establishes a more controlled environment and makes tradeoff comparisons easier. It also introduces MobileNet, which achieves high accuracy with much lower complexity.

Speed vs. accuracy

The most important question is not which detector is the best. That may not be possible to answer. The real question is which detector and which configuration give the best balance of speed and accuracy for your application’s needs. Below is the comparison of the accuracy vs. speed tradeoff (time measured in milliseconds).

In general, Faster R-CNN is more accurate while R-FCN and SSD are faster.

  • Faster R-CNN using Inception ResNet with 300 proposals gives the highest accuracy of all the tested cases, at about 1 FPS.
  • SSD on MobileNet has the highest mAP among the models targeted for real-time processing.

This graph also helps us to locate sweet spots to trade accuracy for good speed return.

  • R-FCN models using Residual Network strike a good balance between accuracy and speed.
  • Faster R-CNN with ResNet can attain similar performance if we restrict the number of proposals to 50.
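The selection logic this section argues for, picking the most accurate configuration that still fits a speed budget, can be sketched as follows. The configuration names, latencies and mAP values are placeholders rather than figures from the paper, and `best_under_budget` is a hypothetical helper.

```python
def best_under_budget(candidates, max_latency_ms):
    """Return the highest-mAP candidate whose latency fits the budget, or None."""
    feasible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["map"])


# Placeholder measurements: replace with numbers measured on your own hardware.
candidates = [
    {"name": "config_a", "latency_ms": 900.0, "map": 35.0},
    {"name": "config_b", "latency_ms": 120.0, "map": 30.0},
    {"name": "config_c", "latency_ms": 40.0, "map": 20.0},
]

choice = best_under_budget(candidates, max_latency_ms=150.0)  # picks config_b
```

Framing the problem as a budget makes the "sweet spot" explicit: loosening the budget only helps when a more accurate configuration becomes feasible.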

Feature extractor

The paper studies how the accuracy of the feature extractor impacts the detector accuracy. Both Faster R-CNN and R-FCN can take advantage of a better feature extractor, but it is less significant with SSD.

Source

(The x-axis is the top-1 classification accuracy of each feature extractor.)

Object size

For large objects, SSD performs pretty well even with a simple extractor. SSD can even match other detectors’ accuracies using a better extractor. But SSD performs much worse on small objects compared to other methods.

Source

For example, SSD has trouble detecting the bottles in the middle of the table below, while other methods detect them.

Source

Input image resolution

Higher resolution improves object detection for small objects significantly while also helping large objects. When the resolution is decreased by a factor of two in both dimensions, accuracy is lowered by 15.88% on average, but the inference time is also reduced by 27.4% on average.
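To make the tradeoff concrete, here is a small worked calculation. It assumes both percentages are relative reductions, which is an assumption: the text does not state whether the accuracy figure is relative or absolute. The baseline mAP and latency are made-up inputs, and `halve_resolution` is a hypothetical helper.

```python
def halve_resolution(map_value, time_ms, acc_drop=0.1588, time_drop=0.274):
    """Apply the average reported effects of halving width and height.

    Assumes both drops are relative (multiplicative) reductions.
    """
    return map_value * (1.0 - acc_drop), time_ms * (1.0 - time_drop)


# Made-up baseline: 30 mAP at 100 ms per image.
new_map, new_time_ms = halve_resolution(30.0, 100.0)
```

Under these assumptions the example baseline lands at roughly 25.2 mAP and 72.6 ms. These are averages, so an individual model may gain or lose more.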

Source

Number of proposals

The number of proposals generated can impact the speed of Faster R-CNN (FRCNN) significantly without a major decrease in accuracy. For example, with Inception ResNet, Faster R-CNN runs 3x faster when using 50 proposals instead of 300, and the drop in accuracy is only 4%. Because R-FCN does much less work per RoI, its speed improvement is far less significant.
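One way to see why the per-RoI cost matters is a simple linear cost model: a shared computation plus a cost per proposal. The model and all timing numbers below are assumptions for illustration and are not measurements from the paper.

```python
def inference_time_ms(shared_ms, per_roi_ms, num_proposals):
    """Assumed cost model: fixed shared work plus a cost for every proposal."""
    return shared_ms + per_roi_ms * num_proposals


def speedup(shared_ms, per_roi_ms, before=300, after=50):
    """Speed gain from cutting the number of proposals under the model above."""
    slow = inference_time_ms(shared_ms, per_roi_ms, before)
    fast = inference_time_ms(shared_ms, per_roi_ms, after)
    return slow / fast


# Hypothetical detectors with the same shared cost but different per-RoI cost.
heavy_per_roi = speedup(shared_ms=100.0, per_roi_ms=3.0)  # 1000 ms -> 250 ms
light_per_roi = speedup(shared_ms=100.0, per_roi_ms=0.1)  # about 130 ms -> 105 ms
```

With a heavy per-RoI head the example gains 4x, while the light per-RoI head gains less than 1.3x. This matches the direction of the observation about Faster R-CNN and R-FCN.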

Source

GPU time

Here is the GPU time for different models using different feature extractors.

Source

While many papers use FLOPs (the number of floating point operations) to measure complexity, this does not necessarily reflect the actual speed. The density of a model (sparse vs. dense) impacts how long it takes. Ironically, the less dense model usually takes longer on average to finish each floating point operation. In the diagram below, the slope (the ratio of FLOPs to GPU time) for most dense models is greater than or equal to 1, while for the lighter models it is less than 1. That is, less dense models are less efficient per operation even though their overall execution time is smaller. However, the reason is not yet fully studied by the paper.
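The distinction between total time and time per operation can be illustrated with a small calculation. The two models and their numbers below are hypothetical and are only chosen to show the effect.

```python
def ms_per_gflop(gpu_time_ms, gflops):
    """Average GPU time spent per billion floating point operations."""
    return gpu_time_ms / gflops


# Hypothetical models: the light one finishes sooner but is slower per operation.
heavy = {"gpu_time_ms": 200.0, "gflops": 100.0}
light = {"gpu_time_ms": 20.0, "gflops": 1.0}

heavy_cost = ms_per_gflop(**heavy)  # 2 ms per GFLOP
light_cost = ms_per_gflop(**light)  # 20 ms per GFLOP
```

In this example the light model is 10x faster overall, yet each operation costs 10x more time. This is why a FLOP count alone can overstate the speed advantage of a small model.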

Source

Memory

MobileNet has the smallest footprint. It requires less than 1 GB of total memory.

Source

2016 COCO object detection challenge

The winning entry for the 2016 COCO object detection challenge is an ensemble of five Faster R-CNN models using ResNet and Inception ResNet. It achieves 41.3% mAP@[.5, .95] on the COCO test set and achieves a significant improvement in locating small objects.

Lessons learned

Some key findings from the Google Research paper:

  • R-FCN and SSD models are faster on average but cannot beat Faster R-CNN in accuracy if speed is not a concern.
  • Faster R-CNN requires at least 100 ms per image.
  • Using only low-resolution feature maps for detection hurts accuracy badly.
  • Input image resolution impacts accuracy significantly. Reducing the image size by half in width and height lowers accuracy by 15.88% on average but also reduces inference time by 27.4% on average.
  • The choice of feature extractor impacts detection accuracy for Faster R-CNN and R-FCN, but SSD is less reliant on it.
  • Post-processing, including non-max suppression (which only runs on the CPU), takes up the bulk of the running time for the fastest models at about 40 ms, which caps speed at 25 FPS.
  • If mAP is calculated with one single IoU only, use mAP@IoU=0.75.
  • With an Inception ResNet network as the feature extractor, using stride 8 instead of 16 improves the mAP by 5% but increases running time by 63%.
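The 25 FPS ceiling in the list above follows directly from the 40 ms post-processing time, since a fixed per-image cost bounds the frame rate no matter how fast the network is. Here is a minimal sketch; the function name and the 10 ms network time are illustrative.

```python
def fps_ceiling(fixed_ms, network_ms=0.0):
    """Frames per second when every image costs fixed_ms plus network_ms."""
    return 1000.0 / (fixed_ms + network_ms)


cap = fps_ceiling(40.0)                         # post-processing alone
with_network = fps_ceiling(40.0, network_ms=10.0)  # hypothetical 10 ms network
```

Even a network that took no time at all could not exceed 25 FPS with 40 ms of post-processing, and a hypothetical 10 ms network would bring the rate down to 20 FPS.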

Most accurate

  • The most accurate single model is Faster R-CNN using Inception ResNet with 300 proposals. It takes about 1 second per image.
  • The most accurate model overall is an ensemble with multi-crop inference. It achieves state-of-the-art detection accuracy on the 2016 COCO challenge. It uses the vector of average precision to select the five most different models.
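Selecting "most different" models from their average precision vectors can be sketched as a greedy farthest-first pick. This is one plausible reading and not the paper's exact procedure: the cosine distance, the greedy strategy and the choice to start from the model with the highest total AP are all assumptions, and the model names and AP values are made up.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def select_diverse(ap_vectors, k):
    """Greedily pick k models whose per-category AP vectors differ the most.

    ap_vectors maps a model name to its list of per-category AP values.
    """
    names = sorted(ap_vectors, key=lambda n: sum(ap_vectors[n]), reverse=True)
    if not names:
        return []
    chosen = [names[0]]  # assumption: start from the strongest single model
    while len(chosen) < min(k, len(names)):
        rest = [n for n in names if n not in chosen]
        # Pick the model farthest from its nearest already chosen model.
        best = max(
            rest,
            key=lambda n: min(
                cosine_distance(ap_vectors[n], ap_vectors[c]) for c in chosen
            ),
        )
        chosen.append(best)
    return chosen


# Made-up per-category AP vectors for three hypothetical models.
ap_vectors = {
    "model_a": [0.5, 0.5],
    "model_b": [0.5, 0.49],
    "model_c": [0.1, 0.6],
}
picked = select_diverse(ap_vectors, k=2)  # model_a, then model_c
```

The idea is that models with similar per-category strengths add little to an ensemble, so a near-duplicate such as `model_b` is passed over in favor of a model with a different error profile.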

Fastest

  • SSD with MobileNet provides the best accuracy tradeoff among the fastest detectors.
  • SSD is fast but performs worse on small objects compared with the others.
  • For large objects, SSD can outperform Faster R-CNN and R-FCN in accuracy with lighter and faster extractors.

Good balance between accuracy and speed

  • Faster R-CNN can match the speed of R-FCN and SSD at 32 mAP if we reduce the number of proposals to 50.