SSD (Single Shot MultiBox Detector) for real-time object detection

Convolutional neural networks (CNNs) outperform other neural network architectures at detecting objects in images. Researchers soon improved the CNN for object localization and detection and called the resulting architecture R-CNN (Region-based CNN). The output of R-CNN is the image with rectangular boxes drawn around the detected objects, together with the class of each object. R-CNN works in the following steps:

  1. Scan the input image for possible objects using an algorithm called Selective Search, generating around 2,000 region proposals,
  2. Run a CNN over each of these region proposals,
  3. Take the output of each CNN and feed it into:
  • an SVM to classify the region,
  • a linear regressor to tighten the bounding box of the object, if such an object exists.
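The steps above can be sketched in code. Everything below is a hypothetical stand-in (there is no real Selective Search, CNN, or SVM here); it only illustrates the flow, and in particular the one-CNN-pass-per-proposal structure that makes R-CNN slow:

```python
# Illustrative sketch of the R-CNN pipeline. All helper functions are
# hypothetical stand-ins, not real implementations.
import random

def selective_search(image, max_proposals=2000):
    """Stand-in: return up to ~2,000 candidate boxes (x, y, w, h)."""
    h, w = image["height"], image["width"]
    return [(random.randint(0, w - 10), random.randint(0, h - 10), 10, 10)
            for _ in range(max_proposals)]

def cnn_features(image, box):
    """Stand-in for cropping the region and running it through a CNN."""
    return [float(v) for v in box]  # pretend these are learned features

def svm_classify(features):
    """Stand-in SVM: return a class label for the region."""
    return "horse" if features[2] * features[3] > 50 else "background"

def bbox_regress(features, box):
    """Stand-in linear regressor: nudge the box to fit the object tighter."""
    x, y, w, h = box
    return (x + 1, y + 1, max(1, w - 2), max(1, h - 2))

def rcnn_detect(image):
    detections = []
    # The CNN runs once per proposal -- the main bottleneck of R-CNN.
    for box in selective_search(image):
        feats = cnn_features(image, box)
        label = svm_classify(feats)
        if label != "background":
            detections.append((label, bbox_regress(feats, box)))
    return detections

dets = rcnn_detect({"height": 300, "width": 300})
```

Even with these trivial stand-ins, the loop makes the cost visible: every proposal pays for a full feature-extraction pass.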
R-CNN for object detection

Although R-CNN made a lot of progress over a plain CNN for object localization, detection, and classification, it still falls short of real time. Some of the problems are:

  1. Training is long and the training data is difficult to handle, since features must be computed for every one of the ~2,000 region proposals per image,
  2. Training happens in separate stages (e.g. training the CNN features, the SVM classifiers, and the bounding-box regressors),
  3. The network is slow at inference time (i.e. when dealing with new, unseen data), because the CNN must be run once per region proposal.

To improve on R-CNN, other algorithms such as Fast R-CNN and Faster R-CNN were developed. The latter gives more accurate results for object detection, but both are still a bit slow for real-time detection. This is where SSD comes into play: it strikes a good balance between accuracy and speed.

SSD (Single Shot MultiBox Detector) Meaning

Single Shot: object localization and classification are done in a single forward pass of the network.

MultiBox: the technique used for bounding-box regression.

Detector: the network classifies the detected objects.

Architecture

SSD architecture

The architecture of SSD is built on the VGG-16 architecture, with one tweak: instead of the fully connected layers, we use a set of auxiliary convolutional layers from the Conv6 layer onwards. VGG-16 is used as the base network because of its high-quality image classification and its suitability for transfer learning, which improves results. The auxiliary convolutional layers let us extract features at multiple scales, progressively decreasing the spatial size at each following layer. I have discussed how this works in the following section. You can see the VGG-16 architecture in the image below; note that it still contains the fully connected layers.
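To make the multi-scale idea concrete, here is a small sketch of how the feature maps shrink from one detection layer to the next. The layer names and sizes below follow the SSD300 configuration described in the original SSD paper; treat them as illustrative:

```python
# Feature-map sizes of SSD300's detection layers. Each layer covers
# the image with a coarser grid than the one before it.
feature_maps = {
    "conv4_3": 38,   # 38x38 grid, detects the smallest objects
    "conv7":   19,   # the old fc7, converted to a convolution
    "conv8_2": 10,
    "conv9_2":  5,
    "conv10_2": 3,
    "conv11_2": 1,   # 1x1 grid, for objects covering the whole image
}
for name, size in feature_maps.items():
    print(f"{name}: {size}x{size} = {size * size} grid cells")
```

Each of these grids is used for detection, which is what lets SSD handle objects of very different sizes in one forward pass.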

VGG-16 architecture

Working Mechanism

To train our algorithm, we need a training set that contains images with objects, and those objects must be annotated with bounding boxes. Learning this way, the algorithm learns where on an object to put a rectangle. We minimize the error between the inferred bounding boxes and the ground truth to optimize our model to detect objects correctly. Unlike a plain classification CNN, we don't only predict whether there is an object in the image; we also need to predict where in the image the object is. During training, the algorithm learns to adjust the height and width of the rectangle to fit the object.
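The agreement between an inferred box and a ground-truth box is commonly measured with Intersection over Union (IoU). A minimal sketch of this standard formula, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # 1.0 (perfect overlap)
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # 0.0 (disjoint boxes)
```

During training, default boxes whose IoU with a ground-truth box exceeds a threshold (0.5 in the SSD paper) are treated as positive matches.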

The image above is an example from a training set for object detection. Such a dataset must contain objects labeled with their classes in each image. Using more default boxes results in more accurate detection, but at a cost in speed.
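To give a sense of that trade-off, we can count the default boxes that a single forward pass evaluates. The feature-map sizes and boxes-per-cell below follow the SSD300 configuration from the paper:

```python
# (feature-map side length, default boxes per grid cell) for SSD300
sizes_and_boxes = [
    (38, 4),  # conv4_3
    (19, 6),  # conv7
    (10, 6),  # conv8_2
    (5, 6),   # conv9_2
    (3, 4),   # conv10_2
    (1, 4),   # conv11_2
]
total = sum(side * side * boxes for side, boxes in sizes_and_boxes)
print(total)  # 8732 default boxes per image
```

Every one of these 8,732 boxes gets a class score and box offsets, so adding more boxes (finer grids or more aspect ratios) directly increases the per-image computation.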

The Pascal VOC and COCO datasets are good starting points for beginners.

Dealing With the Scale Problem

On the left we have an image with a few horses. We divide the input image into a grid of cells, and around each cell we place a couple of rectangles of different aspect ratios. We then apply convolution within those boxes to decide whether each grid cell contains an object. Here, one of the black horses is close to the camera, so it spans many grid cells. None of the small rectangles can identify it as a horse, because no single rectangle covers enough of the horse to capture its identifying features.
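Placing rectangles of different aspect ratios around a grid cell can be sketched as follows. Coordinates are normalized to [0, 1], and the scale and aspect-ratio values are illustrative, not taken from any particular configuration:

```python
from math import sqrt

def default_boxes_for_cell(cx, cy, scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes centred on one grid cell."""
    boxes = []
    for ar in aspect_ratios:
        w = scale * sqrt(ar)   # wider boxes for ar > 1
        h = scale / sqrt(ar)   # taller boxes for ar < 1
        boxes.append((cx, cy, w, h))
    return boxes

# Three boxes (square, wide, tall) around the centre cell of the grid:
boxes = default_boxes_for_cell(0.5, 0.5, scale=0.2,
                               aspect_ratios=[1.0, 2.0, 0.5])
for cx, cy, w, h in boxes:
    print(f"centre=({cx}, {cy}) w={w:.3f} h={h:.3f}")
```

Note that every box has the same area (scale squared); only the shape changes, which is what lets one cell propose boxes for thin, wide, and square objects alike.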

Looking at the SSD architecture above, we can see that from the Conv6 layer onwards the spatial size of the feature maps shrinks substantially at each step. Every operation we discussed (dividing the image into grids and looking for objects in those grid cells) is applied to every one of these feature maps, from the back of the network to the front, and a classifier is applied at every step as well. Because each deeper feature map covers the image with a coarser grid, a large object that spans many cells in an early layer fits inside a single cell of a later layer, so objects of different sizes are easily identified at the appropriate scale.

The SSD algorithm does not literally move backward through the network; instead, each feature map produces its own set of detections, and the detections from all scales are collected together. If, for example, the horse is detected at the conv4 feature map, that prediction is kept, overlapping duplicate boxes from other scales are discarded, and the algorithm draws the rectangle around the horse.
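Merging detections across scales is typically done with non-maximum suppression (NMS): keep the highest-scoring box and drop any box that overlaps it too much. A minimal sketch, with made-up scores and boxes:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(detections, iou_threshold=0.5):
    """detections: list of (score, box); keep the strongest boxes."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for score, box in detections:
        # Keep a box only if it does not overlap an already-kept box
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept

# Two overlapping detections of one horse (from different scales)
# plus one detection far away:
dets = [(0.9, (10, 10, 60, 60)),
        (0.8, (12, 12, 62, 62)),
        (0.6, (200, 200, 250, 250))]
print(nms(dets))  # keeps the 0.9 box and the distant 0.6 box
```

The 0.8 box is suppressed because it overlaps the stronger 0.9 box almost completely, so only one rectangle ends up around the horse.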

If you liked this post, don't forget to clap and to follow me on Medium and on Twitter.