The Evolution of Computer Vision Techniques on Face Detection, Part 1

Alvin Prayuda Juniarta Dwiyantoro
Published in Nodeflux
9 min read · Mar 29, 2018

Introduction

The increase in computing capability and the decreasing price/performance ratio of computer systems have contributed to the rapid development of computer applications. One focus of this development is the computer's ability to mimic one of the human senses: the eyes. Up to this very moment, humans have already succeeded in building devices that imitate how the eyes work. Now it is time to make computers understand the information captured by those devices, which is the field of study called computer vision.

One focus of computer vision is enhancing surveillance systems. Traditional surveillance is heavily dependent on a human operator's ability to perceive information from video camera feeds. Supported by today's advanced computing capability, however, many techniques and methods have already been deployed to enable automatic surveillance systems.

Face detection is among the focuses of automatic surveillance. It is based on the premise that a person's face can feed several applications that extract useful information: the system can quickly notify us when an unwanted person enters an area (face recognition), detect whether people are looking toward or away from our camera (gaze estimation), estimate age and gender, and many more.

Several methods have already been proposed to extract patterns from images and detect faces in them. In general, every proposed method consists of three parts:

  • An algorithm to inspect parts of an image, e.g. sliding window, region proposal, max-margin object detection
  • A way to extract features (image patterns) from those parts, e.g. Haar features, Histogram of Oriented Gradients (HOG) features, deep learning features
  • A machine learning classifier to decide whether each part is a face or not, e.g. Support Vector Machine, AdaBoost, fully connected neural networks

In the past, traditional methods relied on modeling the features or patterns manually, based on our own understanding of what matters: we look for edges or blobs in filtered images, define them as features, and classify them. Recent developments, however, show that it is better to delegate that task to the computer and let it learn the features itself. The following sections briefly explain some of the most famous traditional methods introduced so far.

Traditional Methods

Haar Cascade Classifier

This method was proposed by Paul Viola and Michael Jones in their paper "Rapid Object Detection using a Boosted Cascade of Simple Features" (CVPR 2001). The algorithm needs a lot of positive images (in this case, face images) and negative images (background images). To extract features from them, the Haar features shown in the image below are used.

Image 1 : Haar features (Source : https://docs.opencv.org/3.4.1/d7/d8b/tutorial_py_face_detection.html)

These rectangles are what we call 'kernels'. The kernels are selected in a way that captures facial features such as the nose, the distance between the two eyebrows, the mouth, etc. Basically, we perform a matrix operation on our image's pixel values using these kernels, so each feature is a single value obtained by subtracting the sum of pixels under the white rectangle from the sum of pixels under the black rectangle.

Image 2 : Applying haar features on training faces (Source : https://www.researchgate.net/figure/Haar-features-examples-for-face-detection_fig1_237049645)

Now, a lot of features need to be calculated for all possible sizes and locations of each kernel. To reduce the computational cost of this calculation, the authors proposed the integral image, which reduces the sum of pixels over any rectangle to an operation involving just four array references.
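To make the trick concrete, here is a minimal sketch in Python/NumPy (the function names and the two-rectangle feature layout are our own illustration, not the original paper's code):

```python
import numpy as np

def integral_image(img):
    # ii[y, x] holds the sum of all pixels above and to the left of (y, x).
    # A leading row/column of zeros removes the need for bounds checks.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # Sum of the h x w rectangle with top-left corner (y, x): four lookups only.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    # A horizontal two-rectangle Haar feature: one half minus the other.
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)
```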

After all features are calculated, only the important ones are selected. For example, in Image 2 above, some good features in the top row focus on the eye region being darker than the cheek region (top-left), or on the corners of the eyes being darker than the upper nose (top-middle). But if the same windows are applied to the cheeks or anywhere else, they become irrelevant. That is where the AdaBoost algorithm comes in: it selects the best features out of all of them.

In the training phase, AdaBoost finds, for each feature, the best threshold that separates faces from non-faces. The features with the minimum error rate are selected, and the process iterates until the required accuracy is achieved or the required number of features is found.
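The core of that selection step is a decision stump search: for one feature at a time, try every candidate threshold and polarity and keep the combination with the lowest weighted error. A simplified sketch (variable names are ours, not the paper's):

```python
import numpy as np

def best_stump(feature_values, labels, weights):
    # feature_values: one Haar feature evaluated on every training window.
    # labels: +1 for faces, -1 for non-faces; weights: AdaBoost sample weights.
    best_err, best_thresh, best_polarity = np.inf, None, None
    for thresh in np.unique(feature_values):
        for polarity in (1, -1):
            preds = np.where(polarity * feature_values < polarity * thresh, 1, -1)
            err = weights[preds != labels].sum()
            if err < best_err:
                best_err, best_thresh, best_polarity = err, thresh, polarity
    return best_err, best_thresh, best_polarity
```

AdaBoost keeps the feature whose stump has the smallest weighted error, increases the weights of the misclassified samples, and repeats.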

Image 3 : Overall Architecture (Source : https://www.quora.com/How-can-I-understand-Haar-like-feature-for-face-detection )

When evaluating an image, the image is split into smaller parts called windows, and each window is evaluated individually. Instead of applying all the selected features to a window at once, the features are grouped into different stages of classifiers and applied one by one. This is called a Cascade of Classifiers. As shown in the 'Features Cascading' part of Image 3, if a window fails the first stage, it is discarded; if it passes, the second stage of features is applied, and so on. A window that passes all stages is classified as a face.
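The evaluation logic of the cascade itself is simple. A hypothetical sketch, where each stage is just a function returning True or False:

```python
def cascade_classify(window, stages):
    # stages: classifiers ordered cheapest-first (few features -> many features).
    for stage in stages:
        if not stage(window):
            return False  # most non-face windows exit at the early, cheap stages
    return True           # only windows passing every stage are reported as faces
```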

To detect faces at multiple scales, the sliding window method can be combined with an image pyramid. In the sliding window method, a fixed-size window is 'slid' from corner to corner so that it covers every part of the image. Upon finishing, the image is resized to a smaller size, so that the same window covers a greater area of the image. This way, the classifier is also able to detect faces that are larger than the window size.

Image 4 : Sliding window with Image pyramid (Source : https://www.pyimagesearch.com/2015/03/23/sliding-windows-for-object-detection-with-python-and-opencv/)
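A minimal sketch of both loops, assuming OpenCV for resizing (the window size, scale factor, and step are illustrative values):

```python
import cv2

def pyramid(image, scale=1.25, min_size=(24, 24)):
    # Yield the image at progressively smaller sizes until it no longer
    # fits the detection window.
    while image.shape[0] >= min_size[1] and image.shape[1] >= min_size[0]:
        yield image
        image = cv2.resize(image, (int(image.shape[1] / scale),
                                   int(image.shape[0] / scale)))

def sliding_windows(image, win=(24, 24), step=4):
    # Yield (x, y, window) for every position of a fixed-size window.
    for y in range(0, image.shape[0] - win[1] + 1, step):
        for x in range(0, image.shape[1] - win[0] + 1, step):
            yield x, y, image[y:y + win[1], x:x + win[0]]
```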

Upon finishing, there will be multiple detected boxes pointing at a single face. To unify them, non-maxima suppression is used to group all the boxes that tightly overlap each other into a single box.

Image 5 : Non-maxima suppression (Source : https://www.pyimagesearch.com/2015/02/16/faster-non-maximum-suppression-python/)
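A common greedy implementation keeps the highest-scoring box, drops every remaining box that overlaps it too much, and repeats. A sketch (the IoU threshold is an illustrative value):

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.3):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # best box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the best box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily-overlapping boxes
    return keep
```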

Histogram of Oriented Gradients (HOG) Classifier

Histogram of Oriented Gradients (HOG) was introduced at CVPR 2005 by Navneet Dalal and Bill Triggs in their work "Histograms of Oriented Gradients for Human Detection". This method produces a feature descriptor, a representation of an image patch that simplifies the image by extracting useful information and throwing away extraneous information. In general, a feature descriptor 'describes' an image of size width x height x 3 (channels) with a feature vector of fixed length n. For example, in this explanation HOG describes any image patch of size 64x128x3 with a feature vector of length 3780. An important note: this explanation follows the paper, which uses pedestrian detection; however, the approach is easily adapted to detect faces by using a face detection dataset as training data, e.g. dlib's HOG-based face detector.

In the HOG feature descriptor, the distribution (histogram) of gradient directions (oriented gradients), obtained from pixel differences, is used as the feature. The underlying hypothesis is that image gradients are useful because their magnitude is large around edges and corners (where the intensity changes significantly), and those regions convey much more information about object shape than flat regions do.

To calculate the HOG descriptor, the horizontal and vertical gradients need to be calculated first. This is achieved by filtering the image with the following kernels.

Image 6 : Kernels for calculating horizontal gradients (left) and vertical gradients (right) (Source : https://www.learnopencv.com/histogram-of-oriented-gradients/)
Image 7 : Resulting gradients [x-gradient (left), y-gradient (center), magnitude (right)] (Source : https://www.learnopencv.com/histogram-of-oriented-gradients/)

The gradient image removes a lot of non-essential information (e.g. a constant background color) and highlights the important outlines; we can still easily perceive that there is a person in the image.

At every pixel, the gradient has a magnitude and a direction. For color images, the gradients of all three channels are evaluated: the maximum of the three channels' gradient magnitudes is selected as the magnitude of the gradient at that pixel, and the direction is the angle corresponding to that maximum.
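In code, these steps can be reproduced with OpenCV in a few lines, roughly as the learnopencv tutorial cited in the captions does ('person.jpg' is a placeholder filename):

```python
import cv2
import numpy as np

img = np.float32(cv2.imread('person.jpg')) / 255.0  # placeholder input image

# ksize=1 makes Sobel use the plain [-1, 0, 1] derivative kernels shown above.
gx = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=1)  # horizontal gradients
gy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=1)  # vertical gradients

# Per-pixel gradient magnitude and direction (in degrees), per channel.
mag, angle = cv2.cartToPolar(gx, gy, angleInDegrees=True)
```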

For example, you can inspect Image 8 below:

Image 8 : Example of image data ( Source : http://mccormickml.com/2013/05/07/gradient-vectors/)

The gradient value in the x-direction is 94-56=38, and in the y-direction it is 93-55=38. Putting them together, we have the gradient vector [38, 38]. Its magnitude is sqrt(38^2 + 38^2) ≈ 53.7 and its direction is arctan(38/38) = 45 degrees, which is drawn like this:

Image 9: Example of gradients visualization ( Source : http://mccormickml.com/2013/05/07/gradient-vectors/)

After understanding how the gradients look and how they are calculated, the next step is to divide the image patch into 8x8 cells, each contributing its own histogram of gradients. This gives a more compact representation that is also more robust to noise: a global histogram may be noisy, but a histogram over an 8x8 patch is much less sensitive to it. The 8x8 size was chosen initially because the paper used it to detect pedestrians, and at that scale it is big enough to capture interesting features (the face, the top of the head, etc.).

Image 10 : 8 x 8 patch over 64 x 128 x 3 image patch gradients ( Source : https://www.learnopencv.com/histogram-of-oriented-gradients/ )

The histogram is essentially a vector of 9 bins corresponding to the angles 0, 20, 40, …, 160. Each pixel's magnitude is then distributed among the direction bins in proportion to how close its direction is to each bin. For an illustration, see Image 11 below. For the pixel encircled in blue, the magnitude of 2 is assigned directly to the 80 bin because its direction is exactly 80 degrees. A different case applies to the pixel encircled in red: its direction is 10 degrees, halfway between the 0 bin and the 20 bin, so its magnitude is split equally between the two.

Image 11 : Distribution of magnitude value into direction bin based on direction degree ( Source : https://www.learnopencv.com/histogram-of-oriented-gradients/ )
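A sketch of this proportional binning for a single 8x8 cell (the function name and the wraparound handling between 160 and 180 degrees are our own):

```python
import numpy as np

def cell_histogram(mag, angle, n_bins=9, bin_width=20):
    # mag, angle: 8x8 arrays of gradient magnitudes and unsigned directions.
    hist = np.zeros(n_bins)
    for m, a in zip(mag.ravel(), angle.ravel() % 180):
        low = int(a // bin_width) % n_bins
        high = (low + 1) % n_bins              # 160-180 wraps back into the 0 bin
        frac = (a - low * bin_width) / bin_width
        hist[low] += m * (1 - frac)            # share proportional to closeness
        hist[high] += m * frac
    return hist
```

For the pixels in Image 11: a direction of exactly 80 puts the whole magnitude in the 80 bin (frac = 0), while a direction of 10 splits the magnitude 50/50 between the 0 and 20 bins.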

In order to make the descriptor robust to lighting variations, the authors propose 16x16 block normalization. Each block covers 4 of the 8x8 cells from the previous step, so it normalizes the 4 histogram vectors produced by those cells, using the L2 norm. At the end of the process, the 4 normalized vectors are concatenated into a feature vector of 36x1 elements (4 of the 9x1 vectors obtained from the 8x8 cells). The normalization block moves as a sliding window with 50% overlap, so it 'slides' 8 pixels at each iteration.

Image 12 : 16x16 block normalization ( Source : https://www.learnopencv.com/histogram-of-oriented-gradients/ )
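As a sketch (the small epsilon guarding against division by zero is our own addition):

```python
import numpy as np

def normalize_block(cell_hists):
    # cell_hists: the four 9-bin cell histograms inside one 16x16 block.
    v = np.concatenate(cell_hists)          # shape (36,)
    return v / (np.linalg.norm(v) + 1e-6)   # L2 normalization
```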

Finally, for an image patch of size 64x128x3, there are 7 horizontal and 15 vertical positions of the 16x16 block (the 64-pixel width holds 8 cells, giving 8-1=7 block positions; the 128-pixel height holds 16 cells, giving 16-1=15). Hence, the descriptor is a feature vector of size 7x15x36 = 3780.

To train a classifier for detecting faces, we need a lot of positive and negative images, just as for the Haar classifier. Positive and negative HOG feature descriptors are extracted from those images, and then a machine learning technique such as a Support Vector Machine (SVM) can be deployed to classify them.

Image 13 : SVM classifier applied to the training data ( Source : https://www.researchgate.net/figure/Optimal-Separating-Hyperplane-Hard-Margin-linear-SVM_fig1_269987578 )
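A hypothetical training sketch using scikit-image's hog and scikit-learn's LinearSVC (face_patches and background_patches are placeholder datasets of 64x128 grayscale crops; C=0.01 is an illustrative value):

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_hog(patch):
    # 64x128 patch -> 3780-dim descriptor: 9 bins, 8x8 cells,
    # 2x2-cell blocks with L2 normalization, matching the layout above.
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2')

X = np.array([extract_hog(p) for p in face_patches + background_patches])
y = np.array([1] * len(face_patches) + [0] * len(background_patches))
classifier = LinearSVC(C=0.01).fit(X, y)
```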

When evaluating an image, the same strategy mentioned in the previous section can be applied: a sliding window combined with an image pyramid, followed by non-maxima suppression.

Those are several traditional methods which are famous for their use in real-world applications. You can find their implementations in well-known libraries such as OpenCV and dlib. In the next part, we will introduce the state-of-the-art techniques that learn features automatically using deep learning.
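For example, both detectors can be run in a few lines ('people.jpg' is a placeholder filename; the detectMultiScale parameters are typical illustrative values):

```python
import cv2
import dlib

img = cv2.imread('people.jpg')                  # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# OpenCV's pre-trained Viola-Jones cascade, bundled with the library.
haar = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
haar_faces = haar.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# dlib's HOG + linear SVM detector; the '1' upsamples the image once
# so smaller faces can be found.
hog_faces = dlib.get_frontal_face_detector()(gray, 1)

print(f'Haar: {len(haar_faces)} faces, HOG: {len(hog_faces)} faces')
```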

Next Part : The Evolution of Computer Vision Techniques on Face Detection, Part 2

Thank you for reading~
