Published in Zyl Story

Review of Deep Learning Algorithms for Object Detection

Comparison between image classification, object detection and instance segmentation.

Why object detection instead of image classification?

Datasets and Performance Metric

Examples of segmented objects from the 2015 COCO dataset. Source: T.-Y. Lin et al. (2015)
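Detection benchmarks such as PASCAL VOC and COCO score models with mean Average Precision (mAP), which relies on the Intersection over Union (IoU) between a predicted box and a ground-truth box. A minimal sketch of IoU for axis-aligned boxes given as corner coordinates:

```python
def iou(box_a, box_b):
    # boxes given as (x1, y1, x2, y2) corner coordinates
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # intersection is empty when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

A detection typically counts as a true positive when its IoU with a ground-truth box exceeds a threshold (0.5 for PASCAL VOC; COCO averages over several thresholds).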

Region-based Convolutional Network (R-CNN)

Selective Search application. Top: visualisation of the segmentation results of the algorithm; bottom: visualisation of the region proposals of the algorithm. Source: J.R.R. Uijlings et al. (2012)
Region-based Convolutional Network (R-CNN). Each region proposal feeds a CNN to extract a feature vector, possible objects are detected using multiple SVM classifiers, and a linear regressor modifies the coordinates of the bounding box. Source: J. Xu’s Blog
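Because many of the ~2000 region proposals overlap the same object, R-CNN-style pipelines prune the scored boxes with greedy non-maximum suppression (NMS). A minimal sketch (the 0.5 overlap threshold is a common but illustrative choice):

```python
def iou(a, b):
    # intersection over union of two (x1, y1, x2, y2) boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.5):
    # greedily keep the highest-scoring box, drop boxes that overlap it, repeat
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two near-duplicate boxes and one distant box reduce to two detections: the weaker duplicate is suppressed.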

Fast Region-based Convolutional Network (Fast R-CNN)

The entire image feeds a CNN model to detect RoIs on the feature maps. Each region is extracted using a RoI pooling layer and fed to fully-connected layers. The resulting vector is used by a softmax classifier to detect the object and by a linear regressor to modify the coordinates of the bounding box. Source: J. Xu’s Blog
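The RoI pooling layer turns each variable-sized region of the feature map into a fixed-size grid by max-pooling over sub-windows, so the fully-connected layers always see the same input shape. A single-channel sketch (real RoI pooling applies this per channel, typically to a 7x7 output):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    # feature_map: 2-D array (one channel); roi: (x1, y1, x2, y2) in
    # feature-map coordinates; each output bin keeps the max of its sub-window
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w = region.shape
    oh, ow = output_size
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            ys, ye = i * h // oh, max((i + 1) * h // oh, i * h // oh + 1)
            xs, xe = j * w // ow, max((j + 1) * w // ow, j * w // ow + 1)
            out[i, j] = region[ys:ye, xs:xe].max()
    return out
```

Pooling a 4x4 region into a 2x2 grid keeps the maximum of each quadrant, whatever the region's original size.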

Faster Region-based Convolutional Network (Faster R-CNN)

Detecting the anchor boxes for a single 3x3 window. Source: S. Ren et al. (2016)
The entire image feeds a CNN model to produce anchor boxes as region proposals, each with a confidence of containing an object. A Fast R-CNN takes as inputs the feature maps and the region proposals. For each box, it produces probabilities of detecting each object and a correction to the location of the box. Source: J. Xu’s Blog
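At each sliding-window position, the Region Proposal Network predicts offsets relative to k reference anchors, one per combination of scale and aspect ratio. A sketch of anchor generation with the three scales and three ratios used in S. Ren et al. (2016), giving k = 9:

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # one anchor per (scale, ratio) pair, centered at a feature-map position
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5   # ratio r = width / height, so the
            h = s / r ** 0.5   # anchor area stays s * s for every ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Each anchor keeps the same area at every aspect ratio; the RPN then scores and refines these 9 boxes at every position of the feature map.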

Region-based Fully Convolutional Network (R-FCN)

The input image feeds a ResNet model to produce feature maps. An RPN model detects the Regions of Interest, and a score is computed for each region to determine the most likely object, if there is one. Source: J. Dai et al. (2016)
Source: J. Dai et al. (2016)

You Only Look Once (YOLO)

Example of application. The input image is divided into an SxS grid, B bounding boxes are predicted (regression), and a class is predicted among C classes (classification) for the most confident ones. Source: J. Redmon et al. (2016)
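The whole prediction is a single tensor: each of the S*S cells outputs B boxes of 5 values (x, y, w, h, confidence) plus C class probabilities shared across the cell's boxes. A sketch with the paper's PASCAL VOC settings (S=7, B=2, C=20):

```python
def yolo_output_shape(S=7, B=2, C=20):
    # each of the S*S grid cells predicts B boxes, each with 5 values
    # (x, y, w, h, confidence), plus C class probabilities per cell
    return (S, S, B * 5 + C)
```

With the default settings this is a 7x7x30 tensor, i.e. 1470 numbers produced in a single forward pass, which is why the network only looks once.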
YOLO architecture: it is composed of 24 convolutional layers and 2 fully-connected layers. Source: J. Redmon et al. (2016)
Real-Time Systems on PASCAL VOC 2007. Comparison of speed and performance for models trained with the 2007 and 2012 PASCAL VOC datasets. The published results correspond to the implementations of J. Redmon et al. (2016).

Single-Shot Detector (SSD)

Comparison between the SSD and the YOLO architectures. The SSD model uses extra feature layers from different feature maps of the network in order to increase the number of relevant bounding boxes. Source: W. Liu et al. (2016)
SSD Framework. (a) The model takes an image and its ground truth bounding boxes. Small sets of boxes with different aspect ratios are fixed on the different feature maps ((b) and (c)). During training, the box localizations are adjusted to best match the ground truth. Source: W. Liu et al. (2016)
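The default boxes grow with the depth of the feature map: earlier, higher-resolution maps handle small objects, later maps handle large ones. A sketch of the scale formula from W. Liu et al. (2016), which spreads m scales evenly between a minimum and a maximum:

```python
def ssd_scales(m=6, s_min=0.2, s_max=0.9):
    # scale of the default boxes for the k-th of m feature maps:
    # s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
```

With the paper's defaults, the first feature map uses boxes at 20% of the image size and the last at 90%, with the intermediate maps regularly spaced in between.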

YOLO9000 and YOLOv2

YOLOv2 architecture. Source: J. Redmon and A. Farhadi (2016)
Prediction on ImageNet vs WordTree. Source: J. Redmon and A. Farhadi (2016)
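In WordTree, YOLO9000 predicts conditional probabilities at each node of the label hierarchy; the absolute probability of a label is the product of the conditionals along its path from the root. A minimal sketch (the labels and probability values below are illustrative, not from the paper):

```python
def wordtree_prob(cond, path):
    # absolute probability of a node = product of the conditional
    # probabilities along its path from the root, e.g.
    # P(terrier) = P(animal | root) * P(dog | animal) * P(terrier | dog)
    p = 1.0
    for node in path:
        p *= cond[node]
    return p
```

This lets the model back off gracefully: if it is unsure of the exact breed, the probability mass still concentrates on the parent concept ("dog").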

Neural Architecture Search Net (NASNet)

Example of object detection results. Comparison of two Faster R-CNN pipelines: one uses Inception-ResNet as the feature maps generator (top) and the other the NASNet model (bottom). Source: B. Zoph et al. (2017)

Mask Region-based Convolutional Network (Mask R-CNN)

Examples of Mask R-CNN applied to the COCO test dataset. The model detects each object in an image, along with its localization and its precise pixel-level segmentation. Source: K. He et al. (2017)
Mask R-CNN framework for instance segmentation. Source: K. He et al. (2017)
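Pixel-accurate masks require Mask R-CNN to replace RoI pooling with RoIAlign, which samples the feature map at exact fractional coordinates via bilinear interpolation instead of rounding to the nearest cell. A single-point sketch of that interpolation (RoIAlign applies it at several sample points per output bin):

```python
def bilinear_sample(fmap, x, y):
    # sample a 2-D feature map (list of rows) at fractional (x, y):
    # the interpolation at the heart of the RoIAlign layer
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(fmap[0]) - 1)
    y1 = min(y0 + 1, len(fmap) - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

Avoiding the quantization of RoI pooling preserves the sub-pixel alignment between the region and the feature map, which the paper reports is what makes per-pixel masks accurate.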

Conclusion

Overview of the mAP scores on the 2007, 2010, and 2012 PASCAL VOC datasets and the 2015 and 2016 COCO datasets.