Let’s dive into the recent Dual Shot Face Detector (DSFD) through a review of two famous detection algorithms: Faster R-CNN and the Single Shot Detector (SSD).

The State Of The Art in Object Detection

Face detection is a fundamental step for many applications, from recognition to image processing. It is a challenging task, as faces in real-world images present a very high degree of variability in scale, pose, occlusion, expression, appearance, and illumination. Blur, makeup, and reflection are good examples of variability that explain why face detection is still extensively studied.

To this day, the best-performing approaches can be roughly divided into two categories:

  • Methods based on region proposal, and more specifically on Region Proposal Networks (RPN).
  • Single-shot methods such as the Single Shot Detector.

These two architectures are classic, well-known algorithms in the broader field of object detection. Let us dig a little deeper into both approaches. The aim of object detection is to predict a set of bounding boxes around the objects in an image or a video frame, as well as their respective classes.

Note that in the specific field of face detection, there are only two classes: faces and not-faces (or background).

Region-proposal-based methods

The idea of using region proposals to perform object detection was first introduced in 2014, in the R-CNN article. It was based on the observation that a detection task is similar to a classification task performed on various regions of the input image.

A simple representation of the idea behind methods based on region-proposal.

The term Region Proposal Network was coined in 2015 by the authors of the Faster R-CNN network, and this network is the core component of this kind of architecture. These methods work as two-stage detection schemes:

  • A Region Proposal Network hypothesizes possible object locations in the image. This region proposal is class-agnostic: it detects areas that are likely to contain objects instead of just background.
  • A region-based convolutional network performs class-specific detections: it classifies the objects located in the proposed regions and refines their bounding-box coordinates.

The main components of the Faster R-CNN architecture. The Region Proposal Network outputs coarse regions of interest that the subsequent layers of the architecture use to perform detection. Here, the region-based detection is performed by the Fast R-CNN network, which shares some of its convolutional layers with the RPN.

Note that the two components need not be fully separate networks. Faster R-CNN indeed brought a great improvement in computation time by sharing the layers of a fully convolutional network between the RPN and the class-specific detection network (a Fast R-CNN network in the case of the Faster R-CNN architecture).

Details of the Region Proposal Network, as presented in the original article. A convolutional layer with 256 3x3 filters outputs a 256-dimensional vector at each position of the feature map. This vector is then used to classify the corresponding receptive field as object or background for k possible reference boxes, called anchors, and to predict an offset to the coordinates of each of these k anchors. cls is a classification layer; reg is a regression layer.

One key concept in the Faster R-CNN architecture is the use of anchor boxes. Anchors are reference boxes with various shapes and scales that will parametrize the k proposed regions at each point of the feature map. At each position of a sliding window over the convolutional feature map, a region is proposed for each anchor:

  • The regressor layer outputs the coordinates of a refined version of the anchor.
  • The classifier layer outputs a confidence score for the binary object/background classification of the anchor.

A common configuration is to have 9 pre-defined anchors, combining three scales and three height-width ratios (usually 1:1, 2:1 and 1:2). As a region is proposed for each anchor box at each point of the feature map, this step outputs k⨉(number of points in the feature map) proposed regions. Finally, the proposed regions are filtered to keep only the best ones (for instance, those with the highest object classification scores).
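
The anchor configuration above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scale values (128, 256, 512 pixels), the 40⨉60 feature map and the helper names `make_anchors` and `count_proposals` are chosen for the example and do not come from this article.

```python
import math

def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Return one (x_min, y_min, x_max, y_max) box per (scale, ratio) pair.

    `ratios` are height/width ratios; every anchor keeps an area of scale**2.
    The default scales are assumed values, picked for illustration.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors

def count_proposals(feature_height, feature_width, anchors_per_position):
    """Number of regions proposed before filtering: k x (points in the feature map)."""
    return feature_height * feature_width * anchors_per_position

anchors = make_anchors(0.0, 0.0)
print(len(anchors))                           # 9 anchors: 3 scales x 3 ratios
print(count_proposals(40, 60, len(anchors)))  # 21600 for a hypothetical 40x60 feature map
```

Even for a moderately sized feature map, the number of raw proposals is large, which is why the filtering step matters.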

In order to train the RPN, we need a matching strategy between the predicted bounding boxes and the ground-truth bounding boxes. In Faster R-CNN, the predicted bounding boxes are assigned either:

  • to the ground-truth box with which they achieve the highest Intersection over Union (IoU, or Jaccard index) overlap,
  • or to any ground-truth box with which they achieve an IoU overlap higher than 0.7.

Matching strategy between the predicted box (black) and the ground-truth boxes (color).

These predicted boxes are assigned a positive label, that is, their ground truth is considered to be the object class. The predicted boxes that meet neither of those two criteria and have an IoU lower than 0.3 with all ground-truth boxes are assigned a negative label: their ground truth is considered to be the background class. The remaining predicted boxes are ignored.
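
As a rough illustration of this labelling scheme, here is a minimal pure-Python sketch. It reads the first criterion as "for each ground-truth box, the predicted box that overlaps it most is positive"; that reading, the function names `iou` and `assign_labels`, and the toy boxes are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_labels(boxes, ground_truths, high=0.7, low=0.3):
    """Return 1 (object), 0 (background) or -1 (ignored) for each box."""
    overlaps = [[iou(box, gt) for gt in ground_truths] for box in boxes]
    labels = []
    for row in overlaps:
        best = max(row, default=0.0)
        if best > high:
            labels.append(1)    # IoU above 0.7 with some ground-truth box
        elif best < low:
            labels.append(0)    # IoU below 0.3 with every ground-truth box
        else:
            labels.append(-1)   # in between: ignored
    # Assumed reading of the first criterion: the best box of each ground truth is positive.
    for g in range(len(ground_truths)):
        best_box = max(range(len(boxes)), key=lambda b: overlaps[b][g], default=None)
        if best_box is not None and overlaps[best_box][g] > 0:
            labels[best_box] = 1
    return labels

ground_truths = [(0, 0, 10, 10), (50, 50, 60, 60)]
boxes = [(0, 0, 10, 8), (0, 0, 10, 5), (50, 50, 60, 56), (20, 20, 30, 30)]
print(assign_labels(boxes, ground_truths))  # [1, -1, 1, 0]
```

The third box only reaches an IoU of 0.6, yet it is labelled positive because no other box overlaps the second ground-truth box better.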

Once each predicted box has been assigned a label, the RPN is trained by minimizing the mean, over all the predicted boxes, of the following loss function:

Loss function of the RPN network.

This loss function is a weighted sum of two terms, balanced by the parameter λ:

  • The classification loss of the predicted box.
  • The regression loss of the predicted box. Note that this regression loss is only computed for the positive anchors (p* = 0 for negative anchors), as the negative anchors are not matched with any ground-truth box.
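
To make the structure of this loss concrete, here is a minimal sketch of the per-box loss and of its mean. It assumes a log loss for the classification term and a smooth L1 loss for the regression term, as in the Faster R-CNN paper; the function names and the simple mean are illustrative choices (the original paper normalizes the two terms separately).

```python
import math

def smooth_l1(x):
    """Smooth L1 loss on one coordinate difference."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def box_loss(p, p_star, t, t_star=None, lam=1.0):
    """Loss of one predicted box.

    p: predicted probability that the box contains an object.
    p_star: label, 1 for a positive anchor and 0 for a negative one.
    t, t_star: predicted and ground-truth box offsets (4 values each).
    lam: the balancing weight, written lambda in the text (value assumed here).
    """
    eps = 1e-12  # avoids log(0)
    cls = -(p_star * math.log(p + eps) + (1 - p_star) * math.log(1 - p + eps))
    reg = 0.0
    if p_star == 1:  # p* = 0 switches the regression term off for negative anchors
        reg = sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    return cls + lam * reg

def mean_loss(samples, lam=1.0):
    """Mean of the per-box loss over (p, p_star, t, t_star) tuples."""
    return sum(box_loss(*s, lam=lam) for s in samples) / len(samples)

positive = (0.9, 1, (0.1, 0.0, 0.2, 0.0), (0.0, 0.0, 0.0, 0.0))
negative = (0.2, 0, (0.5, 0.5, 0.5, 0.5), None)
print(mean_loss([positive, negative]))
```

Ignored boxes are simply left out of `samples`, so they contribute neither to the classification nor to the regression term.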

The detector is trained using a similar loss function. The main difference lies in the classification loss: the detector component performs a multi-class classification task instead of a binary one.

Single Shot Face Detection

The DSFD architecture is mainly based on the 2016 SSD: Single Shot MultiBox Detector architecture, by Wei Liu et al. This architecture differs from RPN-based networks in that there is no region proposal step. The coordinates and the content of the bounding boxes are predicted directly from the feature map, hence the name of the network and its shorter prediction times.

Additionally, instead of using a single feature map for detection, classifiers and regressors are run on several feature maps, located at various depths of a core network, as pictured in the figure below. This core network is composed of the layers of the VGG-16 network (truncated before the classification layers) followed by extra convolutional feature layers. If you want a quick overview of the VGG-16 architecture, you can refer to this blog post. Also note that the VGG-16 layers can be replaced with layers from other fully convolutional networks, such as ResNet.

The architecture of the SSD network. Point-wise classifiers and regressors are run on feature maps at various depths of the base network. The classifiers predict for each point of the feature map a vector of size 84 = (20 classes + background) ⨉ 4 anchors. Similarly, the regressors predict for each point a vector of size 16 = 4 coordinates ⨉ 4 anchors. Hence 4 boxes are predicted for each position in the feature map. For instance, 38⨉38⨉4 = 5776 boxes are predicted for the shallowest feature map, but only 1⨉1⨉4 = 4 for the deepest one.
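
The arithmetic in this caption can be checked with a short script. The 38⨉38 and 1⨉1 sizes come from the caption; the intermediate sizes (19, 10, 5, 3) are an assumption based on the usual SSD configuration for 300⨉300 inputs, and 4 anchors per position are used everywhere, following the caption's simplification.

```python
NUM_CLASSES = 20 + 1                      # 20 object classes plus background
ANCHORS_PER_POSITION = 4                  # as in the caption
FEATURE_MAP_SIZES = (38, 19, 10, 5, 3, 1) # intermediate sizes are assumed

# Output channels of the point-wise classifier and regressor.
classifier_channels = NUM_CLASSES * ANCHORS_PER_POSITION
regressor_channels = 4 * ANCHORS_PER_POSITION

# Number of predicted boxes for each (square) feature map.
boxes_per_map = {size: size * size * ANCHORS_PER_POSITION
                 for size in FEATURE_MAP_SIZES}
total_boxes = sum(boxes_per_map.values())

print(classifier_channels, regressor_channels)  # 84 16
print(boxes_per_map[38], boxes_per_map[1])      # 5776 4
print(total_boxes)                              # 7760 under these assumptions
```

Most of the predicted boxes come from the shallowest feature maps, which is consistent with their role in detecting small objects.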

Each feature map corresponds to a different receptive field size. The receptive field of a feature map is the area of the input image whose pixels are involved in the computation of each point of that feature map. The deeper the feature map, the wider the receptive field.

Intuitively, this means that deep feature maps enable the detection of large objects (which take up a larger area in the input image), while shallow feature maps enable the detection of smaller objects.
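
The growth of the receptive field with depth can be illustrated with the standard recurrence for stacked convolution and pooling layers. The layer stack below is a small hypothetical VGG-like example, not the actual SSD configuration, and padding and dilation are ignored.

```python
def receptive_fields(layers):
    """Receptive field size (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs. The recurrence is:
    rf <- rf + (kernel_size - 1) * jump, then jump <- jump * stride,
    where `jump` is the distance in input pixels between two neighbouring outputs.
    """
    rf, jump = 1, 1
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: two 3x3 convolutions, a 2x2 pooling, twice in a row.
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_fields(stack))  # [3, 5, 6, 10, 14, 16]
```

Each pooling step doubles the jump, so later layers widen the receptive field faster than earlier ones.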

As in RPN-based architectures, reference boxes (anchors) are used to parametrize the detection. These boxes are also called priors, as their coordinates are refined by the regressors. In the case of the SSD architecture, fewer anchors are required: they only need to account for the various possible shapes (width-height ratios) of the bounding boxes, as the detection is already performed at different scales. The scale of the anchors is hence fixed for each feature map and depends on the depth at which we are performing detection.
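
One way to fix a scale per feature map is the linear rule proposed in the SSD paper, where the scale grows evenly from a minimum to a maximum value across the feature maps. The sketch below assumes that rule with the paper's default bounds (0.2 and 0.9, relative to the image size); the function names and the aspect ratios are illustrative.

```python
import math

def ssd_scales(num_maps, s_min=0.2, s_max=0.9):
    """One anchor scale per feature map, from shallowest to deepest."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]

def prior_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of the priors of one feature map.

    Aspect ratios are width/height; every prior keeps an area of scale**2.
    """
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

scales = ssd_scales(6)
for depth, scale in enumerate(scales):
    shapes = [(round(w, 3), round(h, 3)) for w, h in prior_shapes(scale)]
    print(depth, round(scale, 2), shapes)
```

Within a feature map only the shape of the priors varies, while the scale changes from one feature map to the next.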

The number of positive bounding boxes, that is, the ones not associated with the background by the classifiers, is finally reduced using non-maximum suppression based on the confidence of the classifiers. We will not cover this subject here.

A matching strategy is used to match predicted boxes to ground-truth ones, or to the background class if they share only a small IoU overlap with every ground-truth box. The network is then trained using a loss function similar to the one used by the Faster R-CNN architecture. Note, however, that there is no need here for a region proposal loss function.

Now that we have reviewed these two famous baselines in object detection, let us not forget that the DSFD face detector is the architecture we are interested in. It is time to dive deeper into the novel ideas proposed by its authors.

The Contributions of DSFD

The article introduces three novel alterations to the previous SSD architecture:

  • A new way of computing the feature maps on which classifications and regressions are conducted.
  • A variant of the loss functions to be minimized during the training of the architecture.
  • An improved strategy to match the predictions to the faces in the image.

