SSD: Single Shot MultiBox Detector explained

Aug 21, 2017

Original paper: https://arxiv.org/pdf/1512.02325v5.pdf

The key idea is to use a single network (for speed) with no region proposal step. Instead, SSD starts from a fixed set of default bounding boxes and adjusts each box as part of the prediction. Predictions for different box sizes come from the last few layers of the network: their feature maps decrease in size progressively, and each one is responsible for boxes at its own scale (the coarser the map, the larger the objects it covers). The final prediction is the union of the predictions from all these layers.
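
To make the default boxes concrete, here is a minimal sketch in plain Python. The layer sizes, scales, and aspect ratios in `LAYERS` and `ASPECT_RATIOS` are made-up values for illustration, not the configuration from the paper, and the helper name `default_boxes` is hypothetical. The width and height rule (scale times or divided by the square root of the aspect ratio) follows the paper; the extra aspect-ratio-1 box the paper adds per location is left out to keep the sketch short.

```python
import math
from itertools import product


def default_boxes(fmap_size, scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h) in normalized [0, 1] image
    coordinates for a square fmap_size x fmap_size feature map."""
    boxes = []
    for row, col in product(range(fmap_size), repeat=2):
        # Each feature map cell gets boxes centered on the cell.
        cx = (col + 0.5) / fmap_size
        cy = (row + 0.5) / fmap_size
        for ar in aspect_ratios:
            # Every aspect ratio keeps the same area: w * h == scale ** 2.
            w = scale * math.sqrt(ar)
            h = scale / math.sqrt(ar)
            boxes.append((cx, cy, w, h))
    return boxes


# Assumed toy configuration: (feature map size, box scale) per prediction layer.
# Finer maps get small boxes, coarser maps get large boxes.
LAYERS = [(8, 0.2), (4, 0.5), (2, 0.8)]
ASPECT_RATIOS = (1.0, 2.0, 0.5)

# The full set of default boxes is the union over all prediction layers.
all_boxes = []
for fmap_size, scale in LAYERS:
    all_boxes.extend(default_boxes(fmap_size, scale, ASPECT_RATIOS))

print(len(all_boxes))  # (64 + 16 + 4) cells * 3 aspect ratios = 252
```

Every one of these boxes gets its own class scores and its own box adjustment from the network, which is why no separate proposal stage is needed.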

In the paper's words: SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage.
Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications — especially using multiple layers for prediction at different scales — we can achieve high-accuracy using relatively low resolution input, further increasing detection speed.
We summarize our contributions as…
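
The prediction step described in the excerpt above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names `head_channels`, `decode`, and `best_class` are hypothetical, and the offset convention (center shifts relative to the default box size, width and height in log space) is the inverse of the encoding used in the paper's localization loss. Many implementations also scale the offsets by extra "variance" constants, which are omitted here.

```python
import math


def head_channels(num_classes, boxes_per_location):
    """Number of small convolutional filters applied to one feature map:
    num_classes scores plus 4 box offsets for each default box."""
    return (num_classes + 4) * boxes_per_location


def decode(default_box, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to a default box (cx, cy, w, h)."""
    d_cx, d_cy, d_w, d_h = default_box
    tx, ty, tw, th = offsets
    cx = d_cx + tx * d_w      # center shift is relative to the box width
    cy = d_cy + ty * d_h      # and to the box height
    w = d_w * math.exp(tw)    # size changes are predicted in log space
    h = d_h * math.exp(th)
    return (cx, cy, w, h)


def best_class(scores):
    """Index of the highest-scoring category for one default box."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Assumed example: 21 categories and 4 default boxes per location.
print(head_channels(21, 4))  # 100

# Zero offsets leave the default box unchanged.
print(decode((0.5, 0.5, 0.2, 0.2), (0.0, 0.0, 0.0, 0.0)))  # (0.5, 0.5, 0.2, 0.2)

# Made-up scores for one box over three categories.
print(best_class([0.1, 0.7, 0.2]))  # 1
```

Because the same small filters slide over every location of a feature map, an m x n map yields (num_classes + 4) * boxes_per_location * m * n outputs in a single pass.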

Written by Manish Chablani