The Evolution of Computer Vision Techniques on Face Detection, Part 2

Alvin Prayuda Juniarta Dwiyantoro
Published in Nodeflux
7 min read · Mar 29, 2018

In the previous part, we discussed conventional methods that rely on manual feature extraction. In this post, we will focus on automatic feature learning, using deep learning to learn much more complex features that describe the human face. It is also worth noting that these techniques can be applied to any other object detection task.

Source : https://devblogs.nvidia.com/accelerate-machine-learning-cudnn-deep-neural-network-library/

Deep Learning Methods

These methods focus on how the computer can automatically learn the pattern of our target object. The core techniques are the Convolutional Neural Network (CNN) and the object proposal mechanism. The CNN provides automatic feature learning from the training data. The object proposal mechanism addresses how the machine proposes several areas that are more likely to contain the targeted object, so that we do not need to search every possible image patch as in the sliding window approach. This results in an efficient object detection mechanism. The following discussion assumes that you have already grasped the basics of neural networks.

Convolutional Neural Network (CNN)-based Feature Extraction

The conventional methods described previously focused on finding the best feature descriptor to directly describe the visual pattern of a face. Current developments in deep learning push this effort even further by making the machine learn the pattern automatically.

This can be achieved with deep learning, specifically with a convolutional neural network. A convolution is the integral measuring how much two functions overlap as one passes over the other. We can think of a convolution as a way of mixing two functions by multiplying them.

In image analysis, the underlying function is the input image being analyzed, and the second, mobile function is known as the 'filter', which picks up a pattern or feature in the image.
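To make this concrete, here is a minimal sketch (assuming SciPy; the filter values are illustrative, not from the post) of a single hand-crafted 3x3 filter passing over an image patch:

```python
# Convolve one 3x3 filter over a grayscale image patch with SciPy.
import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(8, 8)          # stand-in for a grayscale image patch
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])  # responds strongly to vertical edges

feature_map = convolve2d(image, edge_filter, mode="valid")
print(feature_map.shape)              # (6, 6): each value measures the overlap
```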

Image 2 : Convolutional Neural Network demo with n_filters = 2, stride = 2, padding = 1, and filter size = 3x3 ( Source : https://cs231n.github.io/convolutional-networks/ )

In a convolutional neural network, we pass many filters over a single image patch. Each filter picks up a different pattern and produces a feature map. The deeper the CNN layer, the more complex the patterns (feature maps) it picks up.
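As a sketch of the same operation in a deep learning framework, the following PyTorch snippet mirrors the configuration of the Image 2 demo: two 3x3 filters with stride 2 and padding 1 over a 3-channel input (sizes assumed from that demo):

```python
# Two learnable 3x3 filters, stride 2, padding 1, on a 3-channel 5x5 input.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 5, 5)  # (batch, channels, height, width)
feature_maps = conv(x)
print(feature_maps.shape)    # torch.Size([1, 2, 3, 3]): one 3x3 map per filter
```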

Image 3 : Result of convolution operation which is called feature maps ( Source : https://killianlevacher.github.io/blog/posts/post-2016-03-01/img/layeredRepresentation.jpg )

These learned feature maps serve as the feature descriptor that can be used to classify whether an image patch is a face or not. They can be combined with a fully connected network (not explained in this scope) to perform classification, and with the sliding window and image pyramid from the previous part to perform face detection at multiple scales.
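As an illustration, here is a hypothetical sketch of such a face/non-face patch classifier in PyTorch; the layer sizes and the 24x24 patch size are assumptions, not taken from any particular paper:

```python
# A tiny face/non-face classifier: convolutional feature extraction followed
# by a fully connected head. All sizes are illustrative.
import torch
import torch.nn as nn

class FacePatchClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 24x24 -> 12x12
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),  # 12x12 -> 6x6
        )
        self.classifier = nn.Linear(32 * 6 * 6, 2)  # face vs. non-face

    def forward(self, x):  # x: (batch, 3, 24, 24)
        feats = self.features(x)
        return self.classifier(feats.flatten(1))

logits = FacePatchClassifier()(torch.randn(4, 3, 24, 24))
print(logits.shape)  # torch.Size([4, 2])
```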

Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (MTCNN)

This method was proposed by Kaipeng Zhang et al. in their paper 'Joint Face Detection and Alignment Using Multi-task Cascaded Convolutional Networks', IEEE Signal Processing Letters, Volume 23, Issue 10.

They propose a new framework consisting of three stages that perform face detection and facial landmark detection simultaneously. The first stage quickly proposes several candidate windows through a shallow CNN. The second network then refines the windows, rejecting a large number of non-face windows through a more complex CNN. Finally, an even more powerful CNN refines the result and outputs the facial landmark positions.

Image 4: Pipeline of MTCNN ( Source : https://arxiv.org/ftp/arxiv/papers/1604/1604.02878.pdf )

Given an image, they use an image pyramid to obtain the image at multiple scales. The image is then given as input to the following three-stage cascaded framework:

  1. In the first stage, a fully convolutional network called the Proposal Network (P-Net) is used to obtain candidate regions and their bounding box regression vectors. The regression vectors are used to calibrate the candidate regions, and non-maxima suppression (NMS) is then applied to merge highly overlapping regions.
  2. All remaining candidates are fed to another CNN called the Refine Network (R-Net), which rejects a large number of false candidates, performs another calibration with bounding box regression, and again merges candidates with NMS.
  3. The last stage, called the Output Network (O-Net), is similar to the second stage. To describe the face in more detail, it also outputs five facial landmark positions. (A usage sketch follows Image 5 below.)
Image 5 : Overall Architecture of P-Net, R-Net, and O-Net ( Source : https://arxiv.org/ftp/arxiv/papers/1604/1604.02878.pdf )
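For readers who want to try this cascade directly, one widely used open-source implementation is the facenet-pytorch package. The sketch below assumes its API and a hypothetical image file:

```python
# A sketch using the facenet-pytorch package, a popular open-source MTCNN
# implementation (API assumed from that library; the image path is hypothetical).
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(keep_all=True)   # keep every detected face, not just the largest
img = Image.open("group_photo.jpg")

# Internally runs the full P-Net -> R-Net -> O-Net cascade over an image pyramid.
boxes, probs, landmarks = mtcnn.detect(img, landmarks=True)
# boxes: (n_faces, 4) corner coordinates, probs: face confidence scores,
# landmarks: (n_faces, 5, 2) eye, nose, and mouth-corner positions
```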

As mentioned in their paper, they use the following loss functions in their network:

  • Categorical Cross-Entropy Loss : used to perform face classification for the candidate regions
  • Euclidean Loss : used to perform bounding-box regression and facial landmark regression
  • Multi-Source Training Loss : some training images contain not only faces but also background. In that case, not all of the losses are used; for example, when training on background images, only the face detection loss is used and the others are set to 0. The overall learning target is formulated as follows:
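Reconstructed from the paper, the overall learning target is

$$\min \sum_{i=1}^{N} \sum_{j \in \{\text{det},\, \text{box},\, \text{landmark}\}} \alpha_j \, \beta_i^j \, L_i^j$$

where $N$ is the number of training samples, $\alpha_j$ is the task importance (the paper uses $\alpha_{det} = 1$ and $\alpha_{box} = \alpha_{landmark} = 0.5$ for P-Net and R-Net, with $\alpha_{landmark} = 1$ for O-Net), $\beta_i^j \in \{0, 1\}$ is the sample-type indicator that switches losses off, and $L_i^j$ is the corresponding task loss.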

Single Shot Multibox Detector

This method was proposed by Wei Liu et al. in their paper 'SSD: Single Shot MultiBox Detector', presented at ECCV 2016. The name Single Shot Multibox Detector comes from the following:

  • The term Single Shot means that the localization and classification tasks are completed in a single forward pass of the network
  • The term Multibox comes from the work of one of the team members, Christian Szegedy et al., 'Scalable, High-Quality Object Detection', published here. It is a bounding box regression technique that can adapt to objects at multiple scales
  • The term Detector means that this framework detects and classifies the objects present in an image
Image 6 : SSD Architecture ( Source : https://arxiv.org/pdf/1512.02325.pdf )

The SSD architecture consists of a base network, a high-quality network already proven to deliver outstanding classification results, for example VGG-16 as shown in the paper. They use it up to the Conv5 layer and substitute the remaining fully connected layers with several auxiliary convolution layers that enable feature extraction at multiple scales. These auxiliary layers also reduce the input size for the subsequent layers (see the sketch below).
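The following hypothetical PyTorch sketch illustrates this idea: a truncated VGG-16 base followed by auxiliary convolutions whose stride-2 layers shrink the feature maps, so each extra layer sees the image at a coarser scale. The channel sizes are illustrative, not the exact paper configuration:

```python
# Multi-scale feature extraction in the SSD style: a truncated VGG-16 base
# plus auxiliary stride-2 convolutions that halve the feature map size.
import torch
import torch.nn as nn
from torchvision.models import vgg16

base = vgg16().features[:23]  # VGG-16 up to conv4_3, a common SSD cut point
extras = nn.Sequential(
    nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(512, 128, kernel_size=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)

x = torch.randn(1, 3, 300, 300)  # SSD300-style input
f1 = base(x)                     # (1, 512, 37, 37): finest detection scale
f2 = extras[:4](f1)              # (1, 512, 19, 19): coarser scale
f3 = extras[4:](f2)              # (1, 256, 10, 10): coarser still
# Each of f1, f2, f3 would feed its own box/confidence predictors.
```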

Inspired by the Multibox method, they follow its technique for regressing the bounding box location and confidence score, applying the following loss functions for these tasks:

  • Confidence Loss : measures how confident the network is that an area contains an object. It is calculated using categorical cross-entropy
  • Location Loss : a smooth L1 loss measuring how far the predicted bounding box coordinates are from the ground truth
  • Combined Loss : the overall loss is a weighted sum of the confidence and location losses. Alpha is a hyper-parameter controlling how much the location loss contributes (a sketch follows this list)
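A minimal sketch of this combined objective, under simplifying assumptions: the inputs are predictions and targets for anchors that have already been matched to ground truth, all names are hypothetical, and the hard negative mining used in the paper is omitted:

```python
# SSD-style combined loss: cross-entropy confidence loss plus alpha-weighted
# smooth L1 location loss, normalized by the number of matched anchors.
import torch
import torch.nn.functional as F

def ssd_loss(cls_logits, cls_targets, loc_preds, loc_targets, alpha=1.0):
    n_matched = max(len(cls_targets), 1)  # N: number of matched anchors
    conf_loss = F.cross_entropy(cls_logits, cls_targets, reduction="sum")
    loc_loss = F.smooth_l1_loss(loc_preds, loc_targets, reduction="sum")
    return (conf_loss + alpha * loc_loss) / n_matched
```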

For the bounding box regression technique, SSD uses pre-computed, fixed-size default boxes that closely match the distribution of the ground-truth boxes. These boxes are what we call priors or anchors. Each anchor acts as an image window to be classified, much like the sliding window method mentioned in the previous part. A sliding window will most likely produce a huge number of image windows to inspect across multiple scales; the anchors significantly limit the number of windows that need to be inspected, thus speeding up the whole process (a generation sketch follows below).
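A hypothetical sketch of generating such anchors for one feature map, where every cell receives boxes at several scales and aspect ratios (all values are illustrative):

```python
# Generate anchor (default) boxes for one square feature map. Each cell gets
# one box per (scale, aspect ratio) pair, as (cx, cy, w, h) relative to the
# image. Scales and ratios here are illustrative, not the paper's settings.
import itertools

def make_anchors(fmap_size, scales=(0.2, 0.35), aspect_ratios=(1.0, 2.0, 0.5)):
    anchors = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size  # cell center
        for s in scales:
            for ar in aspect_ratios:
                anchors.append((cx, cy, s * ar ** 0.5, s / ar ** 0.5))
    return anchors

print(len(make_anchors(8)))  # 8x8 cells x 2 scales x 3 ratios = 384 boxes
```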

In SSD, these anchors are inspected on the feature maps produced by the convolutional layers. Each feature map is split into multiple fixed-size cells, as shown in Image 7 below. Each cell inspects all of its anchors and generates a location and a confidence score for each respective anchor.

Image 7 : Feature maps cells and Anchor boxes ( Source : https://arxiv.org/pdf/1512.02325.pdf )

At training time, each anchor in each cell tries to fit its prediction (location and confidence) to the ground-truth data. At inference time, SSD generates many predictions from the anchors, so non-maxima suppression (NMS) is deployed to keep only the box with the highest confidence score among predictions that tightly overlap each other (a minimal sketch follows).
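A minimal sketch of NMS itself: repeatedly keep the highest-scoring box and drop any remaining box that overlaps it beyond a threshold:

```python
# Non-maxima suppression over boxes given as (x1, y1, x2, y2) tuples.
def iou(a, b):
    # Intersection-over-union of two boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the boxes to keep
```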

Those are several well-known methods for face detection, already widely used and tested by the community. In the next post, we will present the available libraries you can use to experiment yourself, along with a comprehensive performance benchmark for each of them.

Thank you for reading~
