Understand Single Shot MultiBox Detector (SSD) and Implement It in Pytorch

Hao Gao
Jun 6, 2018

SSD (Single Shot MultiBox Detector) is a popular object detection algorithm. It’s generally faster than Faster RCNN. In this post, I will explain the ideas behind SSD and its neural architecture, and then discuss how to implement it. After this, I believe you can implement your own SSD with some patience. You can also check out my implementation https://github.com/qfgaohao/pytorch-ssd and try the live demo. In this post, I will follow the original architecture from the paper. In the next post, we will plug in MobileNet as the base net to make it faster.

The Ideas

A typical CNN gradually shrinks the feature map size and increases the depth as it goes to deeper layers. The deep layers cover larger receptive fields and construct more abstract representations, while the shallow layers cover smaller receptive fields. For more information on receptive fields, check this out. By utilising this information, we can use shallow layers to predict small objects and deeper layers to predict big objects, as small objects don’t need bigger receptive fields and bigger receptive fields can be confusing for small objects.

The following chart shows the architecture of SSD using VGG net as the base net. The middle column shows the feature map sets the net generates from different layers. For example, the first feature map set is generated from VGG net layer 23, and has a size of 38x38 and a depth of 512. Every point in the 38x38 feature map covers a part of the image, and the 512 channels are the features for that point. By using the features in the 512 channels, we can do image classification to predict the label and regression to predict the bounding box for small objects at every point. The second feature map set has a size of 19x19, which can be used for slightly larger objects, as its points cover bigger receptive fields. Down at the last layer, there is only one point in the feature map set, which is ideal for big objects.

For the Pascal VOC dataset, there are 21 classes (20 objects + 1 background). You may have noticed there are 4x21 outputs for every feature point in the classification results. The number 4 comes from the fact that we predict 4 objects with different bounding boxes for every point. It’s a common trick also used in Yolo and Faster RCNN. In SSD, the multiple boxes for every feature point are called priors, while in Faster RCNN they are called anchors. I won’t draw them here. However, you can check the visualisation of anchors in the Faster RCNN post. They are the same concept. For every prior, we predict one bounding box shared by all the classes, so there are only 4 regression values for every prior (4x4 for every feature point). Beware that this is different from Faster RCNN, whose detection head typically predicts a separate box for each class. It may lead to worse bounding box prediction due to the confusion among different classes.

VGG based SSD Architecture. (Notation: Conv o256, k3, s2, p1 means Conv2D with 256 output channels, kernel 3x3, stride 2x2 and padding 1x1. Orange represents the classification head, pink represents the regression head.)
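To make the output sizes concrete, here is a small sketch that counts the priors and the raw outputs of the whole net. The feature map sizes and the number of priors per point are my assumption, taken from the SSD300 configuration in the paper (some layers use 6 priors per point rather than 4), so they may not match the chart exactly. The function name is made up for illustration.

```python
# Feature map sizes of SSD300 with a VGG base net (assumed from the SSD paper).
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
# Priors per feature point on each map (assumed from the paper's configuration).
PRIORS_PER_POINT = [4, 6, 6, 6, 4, 4]
NUM_CLASSES = 21  # Pascal VOC: 20 object classes + 1 background


def count_outputs(sizes, priors_per_point, num_classes):
    """Return (num_priors, num_class_scores, num_box_values) for the whole net."""
    num_priors = sum(s * s * k for s, k in zip(sizes, priors_per_point))
    # Every prior gets one score per class and 4 box regression values.
    return num_priors, num_priors * num_classes, num_priors * 4


num_priors, num_scores, num_box_values = count_outputs(
    FEATURE_MAP_SIZES, PRIORS_PER_POINT, NUM_CLASSES
)
print(num_priors)  # 8732
```

Most of the priors come from the 38x38 map, which is why the shallow layer matters so much for small objects.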

The above network is a pure CNN. Constructing it shouldn’t be difficult, so we move directly to the juicy part of the implementation. By the way, I am a big advocate of Pytorch, as it enables me to focus on the algorithm rather than the framework itself.

Training

Generate Priors

For every feature point we generate a number of priors, which are then matched with ground-truth boxes to determine the target labels and bounding boxes.

For a better understanding, please check out the anchor generation in this post about Faster RCNN.

Code is the best documentation. The following implementation is a standalone gist file; try running it and reading it to understand.
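Independently of the gist, the idea can be sketched in a few lines of plain Python. This is only a rough illustration for one square feature map, assuming the scheme from the paper: two square boxes plus a pair of boxes per aspect ratio, centered on every feature point. The function name, the parameter names, and the example sizes (30 and 60 pixels on a 300 pixel image) are my assumptions, not values read from the implementation.

```python
import itertools
import math


def generate_priors(feature_map_size, min_size, max_size, aspect_ratios, image_size=300):
    """Generate priors for one feature map as (cx, cy, w, h), normalized to [0, 1]."""
    small = min_size / image_size
    large = math.sqrt(min_size * max_size) / image_size
    priors = []
    for row, col in itertools.product(range(feature_map_size), repeat=2):
        # Center of the feature point, in relative image coordinates.
        cx = (col + 0.5) / feature_map_size
        cy = (row + 0.5) / feature_map_size
        priors.append((cx, cy, small, small))  # small square box
        priors.append((cx, cy, large, large))  # larger square box
        for ratio in aspect_ratios:
            r = math.sqrt(ratio)
            priors.append((cx, cy, small * r, small / r))  # wide box
            priors.append((cx, cy, small / r, small * r))  # tall box
    # Clamp every value so no prior is larger than the image.
    return [tuple(min(max(v, 0.0), 1.0) for v in p) for p in priors]


priors = generate_priors(38, min_size=30, max_size=60, aspect_ratios=[2])
print(len(priors))  # 5776 = 38 * 38 * 4
```

Calling it once per feature map and concatenating the results gives the full list of priors for the net.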

Match Priors With Ground-Truth Boxes

You won’t have the targets of the training dataset until you match priors with ground-truth boxes.

The criterion for matching a prior and a ground-truth box is IoU (Intersection over Union), which is also called the Jaccard index. The more overlap, the better the match. The matching process looks as follows:
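Here is a minimal sketch of the matching in plain Python, assuming the strategy described in the paper: every prior takes the ground-truth box it overlaps most if the IoU passes a threshold (0.5 in the paper), and every ground-truth box is guaranteed its best prior. Boxes are in corner form (x_min, y_min, x_max, y_max). The function names and the (label, gt_index) return format are mine, and a real implementation would do this with tensors rather than loops.

```python
def iou(box_a, box_b):
    """IoU of two boxes in corner form (x_min, y_min, x_max, y_max)."""
    left = max(box_a[0], box_b[0])
    top = max(box_a[1], box_b[1])
    right = min(box_a[2], box_b[2])
    bottom = min(box_a[3], box_b[3])
    intersection = max(right - left, 0.0) * max(bottom - top, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def match_priors(gt_boxes, gt_labels, priors, iou_threshold=0.5):
    """Return one (label, gt_index) per prior; (0, -1) means background."""
    if not gt_boxes:
        return [(0, -1)] * len(priors)
    matches = []
    for prior in priors:
        overlaps = [iou(gt, prior) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda g: overlaps[g])
        if overlaps[best] >= iou_threshold:
            matches.append((gt_labels[best], best))
        else:
            matches.append((0, -1))
    # Make sure every ground-truth box owns at least its best-overlapping prior.
    for g, gt in enumerate(gt_boxes):
        best_prior = max(range(len(priors)), key=lambda p: iou(gt, priors[p]))
        matches[best_prior] = (gt_labels[g], g)
    return matches


example_priors = [(0.0, 0.0, 0.5, 0.5), (0.05, 0.05, 0.55, 0.55), (0.5, 0.5, 1.0, 1.0)]
print(match_priors([(0.0, 0.0, 0.5, 0.5)], [5], example_priors))
# [(5, 0), (5, 0), (0, -1)]
```

Note how one ground-truth box ends up assigned to two priors, while the prior that doesn't overlap it becomes background.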

Scale Ground-Truth Boxes

Intuitively, it’s beneficial to scale the representations of ground-truth boxes to the same scale. In SSD, the scale is as follows:

“variance0” and “variance1” are 0.1 and 0.2 in the paper.
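As a rough sketch, this is the center-form encoding commonly found in SSD implementations, which I assume here: the center offset is divided by the prior size and by variance0, and the log of the size ratio is divided by variance1. The constant and function names are mine. The decode function is the inverse, which you need at prediction time.

```python
import math

CENTER_VARIANCE = 0.1  # "variance0" in the text
SIZE_VARIANCE = 0.2    # "variance1" in the text


def encode(gt, prior):
    """Turn a ground-truth box into regression targets relative to a prior.

    Both boxes are in center form (cx, cy, w, h).
    """
    gcx, gcy, gw, gh = gt
    pcx, pcy, pw, ph = prior
    return (
        (gcx - pcx) / pw / CENTER_VARIANCE,
        (gcy - pcy) / ph / CENTER_VARIANCE,
        math.log(gw / pw) / SIZE_VARIANCE,
        math.log(gh / ph) / SIZE_VARIANCE,
    )


def decode(loc, prior):
    """Inverse of encode: turn regression values back into a (cx, cy, w, h) box."""
    dx, dy, dw, dh = loc
    pcx, pcy, pw, ph = prior
    return (
        dx * CENTER_VARIANCE * pw + pcx,
        dy * CENTER_VARIANCE * ph + pcy,
        math.exp(dw * SIZE_VARIANCE) * pw,
        math.exp(dh * SIZE_VARIANCE) * ph,
    )


gt_box = (0.5, 0.5, 0.4, 0.2)
prior_box = (0.45, 0.55, 0.3, 0.3)
targets = encode(gt_box, prior_box)
restored = decode(targets, prior_box)  # should equal gt_box up to rounding
```

A ground-truth box identical to its prior encodes to all zeros, so the net only has to learn small corrections around each prior.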

Hard Negative Mining

In the above matching phase, we boost the number of positive targets (the priors that have an object assigned to them) by matching ground-truth boxes to multiple priors. However, there are still far more unmatched priors. In other words, the huge number of priors labelled as background makes the dataset very unbalanced. To make it more balanced, Hard Negative Mining is often used. The idea is to count only the background priors with the highest confidence loss, i.e. the ones the model gets most wrong, in the computation of the total loss. The others are ignored. The ratio between background priors and matched priors becomes much lower (at most 3:1 in the paper).
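The selection step can be sketched as follows. I assume each prior already has a classification loss with respect to the background class (for example the negative log of its predicted background probability) and a matched label where 0 means background. The function and argument names are made up for illustration.

```python
def hard_negative_mining(background_losses, labels, neg_pos_ratio=3):
    """Return a boolean mask of the priors that contribute to the classification loss.

    background_losses[i]: classification loss of prior i for the background class.
    labels[i]: matched class of prior i, where 0 means background.
    """
    num_pos = sum(1 for label in labels if label > 0)
    num_neg = num_pos * neg_pos_ratio
    negatives = [i for i, label in enumerate(labels) if label == 0]
    # Hardest negatives first: the background priors with the highest loss.
    negatives.sort(key=lambda i: background_losses[i], reverse=True)
    keep = set(negatives[:num_neg])
    # All positives are kept, plus the selected hard negatives.
    return [label > 0 or i in keep for i, label in enumerate(labels)]


example_labels = [1, 0, 0, 0, 0, 0]
example_losses = [0.1, 0.9, 0.2, 0.8, 0.7, 0.05]
print(hard_negative_mining(example_losses, example_labels))
# [True, True, False, True, True, False]
```

With one positive prior and a ratio of 3, only the three background priors with the highest loss survive.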

The Loss Function

The loss function is the combination of classification loss and regression loss. The regression loss used here is Smooth-L1 loss, the same as in Faster RCNN and Fast RCNN. Pytorch has documentation for Smooth-L1 Loss.

Smooth-L1 Loss
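For reference, here is a small plain-Python sketch of Smooth-L1 and of how I assume the regression loss is aggregated: summed over the 4 box values of the positive priors only and divided by the number of positive priors, as in the paper. The beta parameter (the switch point between the quadratic and linear parts) and the function names are my choices for illustration.

```python
def smooth_l1(x, beta=1.0):
    """Smooth-L1 of a single residual x = prediction - target."""
    x = abs(x)
    # Quadratic near zero, linear for large errors (less sensitive to outliers).
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta


def regression_loss(predicted, targets, labels):
    """Sum Smooth-L1 over the box values of positive priors, divided by their count."""
    positives = [i for i, label in enumerate(labels) if label > 0]
    if not positives:
        return 0.0
    total = sum(
        smooth_l1(p - t)
        for i in positives
        for p, t in zip(predicted[i], targets[i])
    )
    return total / len(positives)


print(smooth_l1(0.5))   # 0.125
print(smooth_l1(-2.0))  # 1.5
```

Background priors have no box to regress to, so they are skipped in the regression loss and only show up in the classification loss.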

Data Augmentation

Now you have your complete training data. There is only one step still missing: data augmentation. It can help the algorithm learn the invariances of the data. In fact, unlike in Faster RCNN, data augmentation plays an essential role in SSD. The data augmentation used in the implementation is as follows:

Data Augmentation

Now you can start training using your favorite optimizer.

Prediction

Prediction is simple. By feeding an image into the network, every prior gets a predicted bounding box and a set of class scores. Remember we boosted the number of positive priors by matching one object to multiple priors? Now we have multiple priors predicting the same object. To remove the duplicates, NMS (Non-Maximum Suppression) is used.

NMS

NMS keeps only the bounding boxes with the highest probabilities and removes the bounding boxes that have lower probabilities and a high IoU with the kept ones. The process is better demonstrated in pseudo code.
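A minimal runnable version of greedy NMS in plain Python might look like this. It handles the boxes of a single class, so you would run it once per class. Boxes are in corner form (x_min, y_min, x_max, y_max). The default threshold of 0.45 is my assumption, and a real implementation would use a vectorized or native NMS instead of these loops.

```python
def iou(box_a, box_b):
    """IoU of two boxes in corner form (x_min, y_min, x_max, y_max)."""
    width = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0.0)
    height = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0.0)
    intersection = width * height
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


example_boxes = [(0.0, 0.0, 1.0, 1.0), (0.05, 0.05, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)]
example_scores = [0.9, 0.8, 0.7]
print(nms(example_boxes, example_scores))  # [0, 2]
```

The second box overlaps the first one heavily and has a lower score, so it is dropped, while the third box is far away and survives.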

The Drawbacks

Shallow layers in a neural network may not generate enough high-level features to predict small objects. Therefore, SSD does worse on smaller objects than on bigger ones.

The need for complex data augmentation also suggests that it needs a large amount of data to train. For example, SSD does better on Pascal VOC if the model is pretrained on the COCO dataset. So make sure your model is pretrained on big datasets such as Pascal VOC, COCO and Open Images before training it on your own data. That’s the lowest hanging fruit, I guess.

The design of prior boxes is an open question, so choose their sizes and aspect ratios with care.


The ideas are simple and yet powerful. However, the implementation is not trivial. Good luck with the details! Feel free to check out or fork my implementation https://github.com/qfgaohao/pytorch-ssd. Suggestions and bug reports are more than welcome.

In the next post, we will replace the VGG Net with MobileNet to make it faster.

Other Resources

  1. A Pytorch SSD implementation. https://github.com/amdegroot/ssd.pytorch
  2. Detectron. https://github.com/facebookresearch/Detectron. It has good NMS implementations.
  3. Caffe2 Operators. https://github.com/pytorch/pytorch/tree/master/caffe2/operators. You may find some of the implementations useful.
  4. Soft NMS. https://arxiv.org/abs/1704.04503. Another version of NMS.
  5. A Faster RCNN Implementation. https://github.com/jwyang/faster-rcnn.pytorch
  6. The almighty TensorFlow object detection implementation. https://github.com/tensorflow/models/tree/master/research/object_detection