Facial Detection — Understanding Viola Jones’ Algorithm

AaronWard
Jan 24, 2020


Multi-face detection using Viola Jones

Introduction

From my time researching this topic, I have come to the realisation that a lot of people don't actually understand it, or only understand it partly. Many tutorials also do a bad job of explaining in layman's terms what exactly it is doing, or leave out certain steps that would otherwise clear up some confusion. So I'm going to explain it from start to finish in the simplest way possible.

There are many approaches to implement facial detection, and they can be separated into the following categories:

Knowledge Based

  • Rule based (e.g. X must have eyes, X must have a nose)
  • Too many rules and variables with this method

Feature Based

  • Locate and extract structural features in the face
  • Find a differential between facial and non-facial regions in an image

Appearance Based

  • Learn the characteristics of a face
  • Example: CNNs
  • Accuracy depends on training data (which can be scarce)

Template

  • Using predefined templates for edge detection
  • Quick and easy
  • Trades accuracy for speed

The approach we are going to look at is a mix of feature-based and template-based detection. One of the easiest and fastest ways of implementing facial detection is the Viola Jones algorithm.

Haar-like Features

Before learning about Viola Jones, we need to take a quick look at Haar-like features (which I'll just be calling haar features from now on) and their inspiration, Haar wavelets. Haar wavelets were proposed by the mathematician Alfred Haar in 1909 and are used in applications such as signal and image compression in electrical and computer engineering. To put it simply, haar features are essentially collections of pixels in rectangular shapes. They are conceptually similar to kernels in convolutional neural networks, with the difference that these features are created programmatically rather than learned from the raw image data, as in deep learning.

But don't worry, you don't need to sit there and write thousands of fancy functions to generate these features, as they are widely available online in the form of XML files. There are thousands of possible features you can use, because all they really are is rectangles with regions for calculating delta values.

The rationale behind haar features is that if you apply a feature to an area of the image and subtract the sum of the unshaded pixel values from the sum of the shaded pixel values, you get a certain delta value.

Example: if region X with 100 pixels has a summed value of 200, and region Y of the same size has a summed value of 150, then the delta value is 50. Simples :)
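As a minimal sketch of that calculation (NumPy, with a made-up two-rectangle edge feature and a toy patch, so the numbers are purely illustrative):

import numpy as np

def edge_feature_delta(window: np.ndarray) -> float:
    """Two-rectangle edge feature: delta = sum(shaded top half) - sum(unshaded bottom half)."""
    half = window.shape[0] // 2
    shaded = window[:half, :].sum()     # pixel sum of the shaded region
    unshaded = window[half:, :].sum()   # pixel sum of the unshaded region
    return float(shaded - unshaded)

# Toy 4x4 grayscale patch: dark rows (eyebrow) sitting above light rows (skin)
patch = np.array([[ 10,  12,  11,   9],
                  [ 13,  10,  12,  11],
                  [200, 210, 205, 198],
                  [199, 202, 207, 201]], dtype=np.float32)

print(edge_feature_delta(patch))  # a large-magnitude delta signals a strong edge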

These feature values are then used for training an AdaBoost variant (but more on this later).

Haar feature types

Example of some possible feature shapes

Although there are thousands of possible feature shapes that can be created, the two most common are Edge Features and Line Features.

Edge Features

So let's say, for example, you want to detect part of a face, in this case an eyebrow. Naturally, the shade of the pixels on an eyebrow in an image will be darker and then abruptly get lighter (skin). Edge features are great for finding this.

Edge Features

Line Features

Now let's say you want to detect a mouth: naturally, the lip region of your face goes from light to dark to light again. For this, line features prove to be the best.

Inverse Line Features

The cool thing about these features is that they can also be used inversely, meaning they apply to both dark-light-dark and light-dark-light patterns. So to summarise…

For each feature type:
1. Move the feature across the image
2. Calculate the delta: sum(shaded) - sum(unshaded)
3. Use these delta values to train an AdaBoost variant model
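As a rough, self-contained sketch of that loop (reusing the same toy two-rectangle edge feature idea as above; the feature size, stride and the idea of stacking every response into one array are my own simplifications):

import numpy as np

def edge_feature_responses(image: np.ndarray, fh: int = 4, fw: int = 4, step: int = 2) -> np.ndarray:
    """Slide one two-rectangle edge feature over the image and collect its delta values."""
    deltas = []
    for y in range(0, image.shape[0] - fh + 1, step):
        for x in range(0, image.shape[1] - fw + 1, step):
            window = image[y:y + fh, x:x + fw]
            shaded = window[:fh // 2, :].sum()     # top (shaded) rectangle
            unshaded = window[fh // 2:, :].sum()   # bottom (unshaded) rectangle
            deltas.append(shaded - unshaded)       # delta = sum(shaded) - sum(unshaded)
    return np.array(deltas)                        # one response per placement

image = np.random.randint(0, 256, size=(24, 24)).astype(np.float32)
print(edge_feature_responses(image).shape)         # these responses feed the boosting step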

Using all of these features together on an image will give you a probability of something being a face. But this is all a simplified view of things; now to get into the nitty-gritty.

Problems with Only Using Haar Features

So can we just use these magical rectangle things and BOOM, we have facial detection? Unfortunately, no. There are a number of problems with just using haar features:

  • In real-life scenarios, images aren't just collections of black and white pixels. Most likely the images you'll be working with will be coloured (RGB) or grayscale, meaning you won't be working with binary pixel intensities.
  • Summing up pixel values for all feature types in all images in your dataset can be very computationally expensive, especially depending on the resolution of your images.
  • There are over 160,000 possible feature combinations that can fit into a 24x24 pixel image, and over 250,000 for a 28x28 image (the sketch below shows one way to count them)
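As a quick sanity check on those counts (a sketch that assumes the five standard two-, three- and four-rectangle feature shapes; the exact total depends on which shapes you include):

def count_features(W: int = 24, H: int = 24,
                   shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))) -> int:
    """Count every position and scale at which the base feature shapes fit in a WxH window."""
    total = 0
    for w, h in shapes:                     # base width/height of each feature type
        for fw in range(w, W + 1, w):       # scaled widths (multiples of the base width)
            for fh in range(h, H + 1, h):   # scaled heights (multiples of the base height)
                total += (W - fw + 1) * (H - fh + 1)  # placements at this size
    return total

print(count_features(24, 24))  # 162336, the "over 160,000" figure for a 24x24 window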

So how do we get around this problem of having to sum all of these pixel values? This is where the Viola Jones algorithm comes in.

Integral Image Explained

Viola and Jones introduced the concept of the Integral Image.

The What?

  • A precomputed version of the source image
  • Store it in an intermediate form

The How?

Source: Detecting Faces (Viola Jones Algorithm) — Computerphile
  • Each point in the integral image is the sum of the pixels above and to the left of the corresponding pixel in the source image

The Why?

  • Removes the need to sum pixels every time for every feature classifier
  • Instead of doing additions over every pixel value for all features, the integral image lets you use a few subtractions to get the same result
Source: Detecting Faces (Viola Jones Algorithm) — Computerphile

Integral Imaging in Action

Source Image and an integral image

So how are these integral images even created? Using the pixel values in the source image, we compute a new value for each position in the integral image. Here's an example:

For the first row of pixels:

1 = 1
1 + 2 = 3
1 + 2 + 4 = 7
and so on…

For the second row of pixels:

21 + 1 = 22
22 + 2 + 2 = 26
26 + 2 + 4 = 32
...
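That running-sum construction fits in a couple of lines; here is a minimal NumPy sketch using the handful of source values visible in the example above (1, 2, 4 on the first row and 21, 2, 2 on the second):

import numpy as np

def integral_image(src: np.ndarray) -> np.ndarray:
    """Each entry is the sum of all source pixels above and to the left of it (inclusive)."""
    return src.cumsum(axis=0).cumsum(axis=1)

src = np.array([[ 1, 2, 4],
                [21, 2, 2]])
print(integral_image(src))
# [[ 1  3  7]
#  [22 26 32]]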
Example edge feature

You get the idea. So if we precompute these values, we can cleverly use them to make summing redundant during training. Let's look at an example of getting a delta value using the edge feature seen above:

Source Image and an integral image 2
  • You want to get the delta of the two areas
  • Instead of summing up, take the bottom-right corner value of each region in the integral image and subtract
  • 178 - 67 = 111, which is the sum of the pixels in the shaded area
  • Delta = 111 - 67 = 44

Instead of doing 12 addition operations to get the summed value of the shaded region, you do one subtraction operation. Now isn't that cool? ( ͡° ͜ʖ ͡° )

This even works for features of all shapes and sizes. Let's take a look at another example:

Source and integral image example 3
# Long way
1 + 2 + 4 + 6 + 7 + 21 + 2 + 2 + 4 + 3 + 1 + 2 + 1 + 3 + 4 + 1 + 2 + 7 + 8 + 9 = 90
9 + 10 + 11 + 2 + 1 + 1 + 5 + 9 + 10 + 2 + 27 + 1 = 88
Delta: 90 - 88 = 2

# Short way
178 - 90 = 88
Delta = 90 - 88 = 2

In total we have reduced what would have been more than 30 addition operations down to just two subtraction operations. Now that's fast! This may not seem like much for a computer, but take into account that when you have tens of thousands of images to compute, the naive approach becomes completely infeasible.
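Note that the examples above use regions anchored at the image origin, so a single corner value per region is enough. For an arbitrary rectangle you combine four corner look-ups; a minimal sketch (the zero-padding convention and function names are my own):

import numpy as np

def integral_image(src: np.ndarray) -> np.ndarray:
    """Integral image padded with a leading row/column of zeros to simplify look-ups."""
    return np.pad(src.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def rect_sum(ii: np.ndarray, top: int, left: int, bottom: int, right: int) -> int:
    """Sum of src[top:bottom, left:right] from four corners of the integral image."""
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

src = np.arange(1, 26).reshape(5, 5)   # toy 5x5 source image
ii = integral_image(src)
print(rect_sum(ii, 1, 1, 4, 4))        # 117
print(src[1:4, 1:4].sum())             # 117, the same value with no region summing at runtime

A feature's delta is then just rect_sum over its shaded rectangle minus rect_sum over its unshaded one.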

Training

Okay, so now we know how to create features, but how do we actually learn the feature representations in the data? An ensemble of AdaBoost models is used to achieve the goal of Viola and Jones' algorithm. One classifier is created for each haar feature. Each of these classifiers is considered a "weak" classifier, meaning that on its own it doesn't provide much predictive power, but when you combine a lot of them they become a strong classifier.

Here's how it works in a nutshell:

  • You have N features
  • One weak classifier is trained for each feature using AdaBoost
  • AdaBoost assigns a weight to each training sample; after each round, the weights of misclassified samples are increased so that later classifiers focus on the hard examples
  • When training is complete, sort the classifiers from the lowest error rate to the highest (best models first)
  • Select the best weak classifiers based on a threshold value (drop the "useless" ones)

This is saved to what is known as an Attentional Cascade.
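As a very rough sketch of what one weak classifier looks like (a decision stump over a single feature's delta values, trained against sample weights; the class and variable names are mine, and real AdaBoost additionally re-weights the samples and picks the single best stump each round):

import numpy as np

class WeakClassifier:
    """Decision stump on one haar feature: predicts 'face' when the delta
    falls on the learned side of a threshold."""

    def fit(self, deltas, labels, sample_weights):
        # Keep the threshold/polarity with the lowest *weighted* error, so
        # heavily weighted (hard) samples influence the choice the most.
        best_err = np.inf
        for polarity in (1, -1):
            for t in np.unique(deltas):
                preds = np.where(polarity * deltas < polarity * t, 1, 0)
                err = sample_weights[preds != labels].sum()
                if err < best_err:
                    best_err, self.threshold, self.polarity = err, t, polarity
        self.error = best_err
        return self

    def predict(self, deltas):
        return np.where(self.polarity * deltas < self.polarity * self.threshold, 1, 0)

# Toy data: one feature's delta per training window, 1 = face, 0 = non-face
deltas = np.array([5.0, 7.0, 6.5, -2.0, -3.0, 0.5])
labels = np.array([1, 1, 1, 0, 0, 0])
weights = np.full(len(labels), 1 / len(labels))   # AdaBoost starts with uniform weights
stump = WeakClassifier().fit(deltas, labels, weights)
print(stump.error, stump.threshold, stump.polarity)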

Attentional Cascade

In a real-life scenario, images aren't going to be 24x24 pixels. Also, it's not guaranteed that an image will contain only one face. You want a clever way to use the best classifiers in order to get the most accurate results. For this, you use something called an Attentional Cascade: a series of the weak classifiers we trained, used together to make a strong classifier.

Remember: this cascade is sorted from the strongest classifiers to the weakest. This architecture is used to "weed out" negative samples early on, which decreases computation time at inference and allows for super-fast detection. However, it comes with the trade-off of increased training time.

We feed the source image to the cascade and run inference on image regions using the best feature classifier first. If the region has that feature, it moves on to the next classifier; if it doesn't, the region is dropped and no further classification is done. These steps are repeated until we run out of classifiers in the cascade. A region of an image is only classified as a face if all criteria are met, meaning only if all features are found.

Cascade of weak classifiers
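A minimal sketch of that early-rejection flow (Stage, passes, and the toy brightness checks below are placeholders for the real boosted stage classifiers):

import numpy as np

class Stage:
    """Hypothetical cascade stage wrapping a pass/fail decision for a region."""
    def __init__(self, decision):
        self.decision = decision            # callable: region -> bool

    def passes(self, region) -> bool:
        return self.decision(region)

def is_face(region, stages) -> bool:
    """Run a candidate region through the cascade, rejecting at the first failed stage."""
    for stage in stages:                    # strongest stages first: cheap early rejection
        if not stage.passes(region):
            return False                    # most non-face regions exit here quickly
    return True                             # survived every stage: classify as a face

# Toy usage: brightness/contrast checks stand in for real boosted stages
region = np.random.rand(24, 24)
stages = [Stage(lambda r: r.mean() > 0.1), Stage(lambda r: r.std() > 0.01)]
print(is_face(region, stages))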

Summary

That may have been a lot to take in, so I'll summarise all the steps explained above…

Preprocessing

1. Create handmade simple haar features

Training:

1. Convert the image to an integral image
2. Compute delta values for each feature over an image region
3. Train an AdaBoost model for each feature
4. Sort the classifiers from strongest to weakest
5. Drop the "useless" classifiers
6. Add the useful classifiers to the attentional cascade

Inference:

1. Load the cascade
2. Pass the image through each classifier in the cascade
3. Get the result
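In practice you rarely wire the cascade up yourself; OpenCV, for example, ships pretrained haar cascade XML files. A minimal usage sketch ("people.jpg" is a placeholder for any test image of yours):

import cv2

# Load the pretrained frontal-face haar cascade bundled with opencv-python
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("people.jpg")                 # placeholder test image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # the detector works on grayscale

# Each detection is an (x, y, w, h) bounding box around a face candidate
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("people_detected.jpg", image)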

Advantages and Disadvantages

Advantages

  • Detection is very fast
  • Simple to understand and implement
  • Less data needed for training than other ML models
  • No rescaling of images needed (unlike with CNNs)
  • Much more interpretable than contemporary models

Disadvantages

  • Training time is very slow
  • Restricted to binary classification
  • Mostly effective when face is in frontal view
  • May be sensitive to very high/low exposure (brightness)
  • High true detection rate, but also high false detection rate

Running Example

Here is a Node.js web application I put together as an example:

Additional Take-Aways

Snapchat and Instagram Logos
  • Although this is an old and relatively simple solution, it is still widely used today
  • It is used in camera viewfinders, and even in Snapchat and Instagram face filters
