Classy Robots

Andrew Timmons
10 min read · Apr 15, 2015


Building a face detection algorithm with sophisticated taste using OpenCV and Python

Computers can quickly detect whether there are faces in an image, and it is amazing. While I have known about this capability for years, I had never really dug into how it works. To learn, I decided to build a system that would pull images from Instagram, analyze them for faces, log the results, and build an average face.

A Huge Advancement

There are actually many ways to detect a face, but one popular method was introduced by Viola and Jones in their 2001 paper titled Rapid Object Detection using a Boosted Cascade of Simple Features. This paper has been cited over 10,000 times in academic journals and is a great read. The key takeaways for fast face detection are:

  • Transforming the image into a new data structure called an integral image
  • Using a machine learning algorithm based on AdaBoost
  • Chaining the best results of AdaBoost in a cascade for efficiency

Before learning about integral images it helps to understand how images are stored. Each color pixel of an image can be represented as an array of the values of red, green, and blue in the range of 0 to 255 inclusive. This pixel is then stored in an array that represents the row of pixels. This row is then stored in another array that holds all the rows of the image.

A sample color image from fallout software

Imagine each square of this image is one pixel so that it is 8 pixels by 8 pixels. The first three pixels in the top row would have these values [[[0,0,0],[0,0,0],[235,185,47]….]] since the first two pixels are black and the third one is yellow.

Grayscale images are structured similarly but have just one value per pixel.

In this case the values would be [[0, 0, 190…]…]. Note that in grayscale each value is a pixel, and each array contained in the outer array is a row of pixels.
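
If you want to poke at this structure yourself, here is a quick sketch using OpenCV (the filename is a placeholder):

```python
import cv2

# Load a color and a grayscale version of the same image to inspect the
# structures described above. Note: cv2.imread returns color pixels in
# BGR order, not RGB. "sample.png" is a placeholder filename.
color = cv2.imread("sample.png")
gray = cv2.imread("sample.png", cv2.IMREAD_GRAYSCALE)

print(color.shape)   # (rows, cols, 3): rows of pixels, each pixel [B, G, R]
print(gray.shape)    # (rows, cols): one 0-255 value per pixel
print(color[0][2])   # the third pixel of the top row
print(gray[0][2])    # the same pixel as a single intensity value
```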

The following is a quick summary of the key insights of the Viola and Jones paper:

Integral images are a transformation of grayscale images. Integral images are similar to integrals in calculus. The integral of a function is the area under a curve; an integral image holds, at each point, the sum of all pixel values up and to the left of that point. Why not below the point, like a calculus integral? Because the 0,0 index of an image is in the top left; it is the origin of the image, just as 0,0 is the origin of the x-y plane.

Formula for calculating the values in an integral image from Viola and Jones in 2001

The value at any point is the pixel's own value plus the sum of all pixels up and to the left of it. This can be computed in one pass over the image with the following recurrence:

Image from Viola and Jones in 2001

s(x, y) = s(x, y-1) + i(x, y)

ii(x, y) = ii(x-1, y) + s(x, y)

where s(x, y) is the cumulative row sum, s(x, -1) = 0 and ii(-1, y) = 0.
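
Here is a minimal sketch of that recurrence in Python with NumPy. (You can get the same result from np.cumsum(np.cumsum(i, axis=0), axis=1), or from cv2.integral, which pads an extra row and column of zeros.)

```python
import numpy as np

def integral_image(i):
    # i is a 2D grayscale array indexed i[y, x] (row, column).
    rows, cols = i.shape
    s = np.zeros((rows, cols), dtype=np.int64)   # cumulative sums s(x, y)
    ii = np.zeros((rows, cols), dtype=np.int64)  # integral image ii(x, y)
    for y in range(rows):
        for x in range(cols):
            # s(x, y) = s(x, y-1) + i(x, y)
            s[y, x] = (s[y - 1, x] if y > 0 else 0) + i[y, x]
            # ii(x, y) = ii(x-1, y) + s(x, y)
            ii[y, x] = (ii[y, x - 1] if x > 0 else 0) + s[y, x]
    return ii
```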

With this new data structure you can quickly calculate the total value of a rectangle with just four lookups. To calculate the value of D in the above image you would take the values at points 4, 3, 2, and 1 and compute 4 + 1 - (2 + 3). Note that the value of D is the whole image minus the values in A, B, and C. When you subtract the values at points 2 and 3 you are actually subtracting the value at point 1 twice, so you have to add it back once to get the right value.

Once an integral image is computed you can quickly compute the sum of any rectangle. You can then compare the sums of two or more rectangles to find borders of light and dark areas.
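
A sketch of that four-lookup sum, using the integral image from the snippet above (the boundary checks handle rectangles that touch the top or left edge, where points 1, 2, or 3 fall outside the image):

```python
def rect_sum(ii, top, left, bottom, right):
    # Sum of all pixels in the rectangle spanning rows top..bottom and
    # columns left..right, inclusive: 4 + 1 - (2 + 3) from the figure.
    total = ii[bottom, right]                  # point 4
    if top > 0:
        total -= ii[top - 1, right]            # point 2
    if left > 0:
        total -= ii[bottom, left - 1]          # point 3
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]         # point 1
    return total
```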

Image from Viola and Jones in 2001

These can then be used on faces to find boundaries of areas that have a strong light and dark contrast.

Image from Viola and Jones in 2001

Each of these combinations of light and dark rectangles is called a Haar feature. The learning model evaluates over 180,000 of them for each 24x24 pixel subwindow of an image. This is far too many features for a final model to look at efficiently, so it chains the top-performing features into a cascade to detect probable faces. That is the gist of the paper, but be sure to check out the whole thing. It is a great read.

Face Detection in OpenCV

I have a simple demo for OpenCV in my GitHub repo linked below. Just fire up the IPython notebook and run the cells. One of the more important tuning parameters is minNeighbors. This determines how many consecutive positives a classifier must find in a series of subwindows. In the top-left image it is set to 1, so any one detection of a subwindow that might be a face registers as a face. In the second image it is set to 2, then 3 in the third, and 4 in the fourth.
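
For reference, the core of that demo boils down to a few lines. This is a minimal sketch, assuming the frontal face cascade XML that ships with OpenCV sits next to the script:

```python
import cv2

# Haar cascade for frontal faces; this XML file ships with OpenCV, but
# the path below assumes it has been copied into the working directory.
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

img = cv2.imread("photo.jpg")  # placeholder filename
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# minNeighbors is the tuning parameter discussed above: higher values
# require more overlapping detections before a region counts as a face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)

# Draw a box around each detected face.
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
```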

So just set it to 4 and we are all set? Sadly, no. Different images might have many false faces at a setting of 4. Check out this photo of ice cream with minNeighbors set to 4:

Image from Instagram that clearly has multiple faces in it

You could turn minNeighbors up to 11, or until it excludes ice cream, but this has two issues: soon you will start excluding actual faces, and something else could still fire a false positive. So instead of a very strict minNeighbors setting, I opted to use the face classifier in concert with eye and mouth classifiers.

Building a Refined Classifier in Python

The code is here on GitHub but the gist is pretty straightforward. There are three classes in the main file, face.py:

  • API call
  • Image
  • Face

The code generates an API object that gets image metadata from Instagram. Each entry in the response is then turned into an image object from the Image class. The Image class downloads the actual image and holds info like lat and long coordinates and the rest of the image metadata from the API call. Once the image object is created it will run a face detector using the methods described above. If one or more potential faces are found it will create a face object from the Face class for each of them. The image object metadata is logged in the images table of the database for later analysis. The actual image is saved to disk with a file name corresponding to a column in the DB for later lookup.

The Face class takes the specific section of the image that is a probable face and looks for exactly two eyes and exactly one mouth located below the eyes. If there are fewer or more than the required number of eyes or mouths, the face is discarded. If a face meeting the above properties is found, it is logged in the faces table of the database.
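
The gist of that check looks something like this. This is a simplified sketch, not the exact code from the repo; the eye cascade ships with OpenCV, while the mouth cascade is a third-party file and its name here is an assumption:

```python
import cv2

eye_cascade = cv2.CascadeClassifier("haarcascade_eye.xml")
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")  # third-party

def is_real_face(face_gray):
    # face_gray is the grayscale crop of a probable face. Accept it only
    # if it contains exactly two eyes and exactly one mouth below them.
    eyes = eye_cascade.detectMultiScale(face_gray)
    if len(eyes) != 2:
        return False
    lowest_eye = max(y + h for (x, y, w, h) in eyes)  # bottom of the lower eye
    mouths = mouth_cascade.detectMultiScale(face_gray)
    mouths_below = [m for m in mouths if m[1] > lowest_eye]
    return len(mouths_below) == 1
```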

The Data

I collected 2523 images from Instagram that were posted on February 17th, 2015. The samples were roughly evenly split between 5km radii in Manhattan and San Francisco.

Images from Google Maps and generated on freemaptools

To space out the timing I pulled ten images for every ten-minute block of the day and removed any video content. This is one of the photos:

Image from Instagram

The blue boxes represent a probable face, green are the eyes, and pink/light blue are the mouth and smile.

Labeling the data

Since all of my data was unlabeled I had to count the number of faces in each image and log them in my database.

I built a tool called face_reviewer.py. It loads an image, shows where it saw potential faces and which faces it determined to be real, and lets me score how many actual faces are in the image.

This classifier works only on faces that are facing forward. There are separate cascades for faces turned sideways, since a profile looks quite different from a frontal face. Of course, saying a face is facing forward vs. sideways could be somewhat subjective without photographs taken in a lab. So I used a protractor to make marks on the wall at set angles and checked when OpenCV would stop detecting my face. Straight ahead worked fine. Then I tested 10 degrees, 20, 25, 30, 35 and 40; those photos are pictured below. You can see detection breaks down somewhere between 30 and 35 degrees for my face.

These photos also served as a handy guide during review. I could compare the angle of my face to the person in the picture to see if they could reasonably be assumed to be detected by the Haar cascade.

I had a few other restrictions. One was that the face could not be rotated more than 30 degrees, like this guy:

Image from Instagram

Another was that the whole face had to be visible in the frame:

Image from Instagram

The face also could not be obscured in the frame:

Image from Instagram

And the face had to be larger than 50 pixels, because of the granularity of the features we are looking for in the eyes and mouth. I used a set of calipers to enforce this; 1 cm is ~50 pixels on the pictured display.

Results

The classifier has a 95.89% precision score, detecting 140 true faces and 6 incorrect ones.

Of the 6 misses, 5 were actually faces where the labeling of the eyes or mouth went wrong, like the one on the left. The one non-face mislabeled photo was a drawing of a person. Technically a miss, but at least it resembled a face and wasn't ice cream like in the demo!

Images from Instagram

Recall was mediocre. It found 140 correct faces out of the 740 faces I labeled during review. Most of the missed faces have a relatively easy solution, though. There were three main problems. The first was faces with glasses.
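
For the record, the arithmetic behind those two scores:

```python
tp = 140       # detections confirmed as real faces
fp = 6         # detections that were not valid faces
labeled = 740  # faces I counted by hand during review

precision = tp / float(tp + fp)  # 140 / 146 = 0.9589
recall = tp / float(labeled)     # 140 / 740 = 0.1892
```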

Images from Instagram

The second major issue was closed eyes.

Images from Instagram

The glasses and closed eyes could be fixed with different Haar cascades trained for those scenarios.

The last problem was small faces. I think my feature selection for eyes and mouths was too detailed for faces close to the 50 pixel limit I set. Here is a good example of faces around that size where no mouths or eyes can be found.

Image from Instagram

Originally I wanted to get data on smiling but I never got the smile detector working well. Kanye is clearly not smiling in this image, but the blue box shows the detector thinks he is.

Images from Instagram

While I did encounter those areas for improvement, the face detector otherwise worked quite well.

Images from Instagram

My favorite part of looking at the data was finding the average (in orange below) and median faces. They vary slightly but are pretty close.

Images taken by me, overlaid with face data compiled from Instagram

The average facial features appear here in an image:

The actual average face looks like this:

Average face from the faces collected.

I think this is pretty cool considering that the faces could be angled by up to ~30 degrees to the left or right and still be included. While this face is somewhat blurred due to angled faces, it is clearly a face.

To create this composite image I took the region of each face in an image, scaled them all to 640x640 pixels, summed their grayscale pixel values in an array, and divided each value by the number of images.
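
In code, that averaging step looks roughly like this (the faces/ directory is an assumed location for the saved face crops):

```python
import glob

import cv2
import numpy as np

# Assumed location of the face crops saved during detection.
face_files = glob.glob("faces/*.png")

size = (640, 640)
total = np.zeros(size, dtype=np.float64)

for path in face_files:
    face = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    total += cv2.resize(face, size).astype(np.float64)

# Divide the summed grayscale values by the number of faces.
average = (total / len(face_files)).astype(np.uint8)
cv2.imwrite("average_face.png", average)
```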

Parting thoughts

OpenCV is a really fun library to play with. It has a great API for Python, C and Java, so anyone can get started. Given the huge number of image services with APIs, I bet there are a lot of fun projects to be had.

If you are interested in OpenCV then check out their site!

There are many ways this could be improved with new features. Additional classifiers would capture more data. Logging the reason a potential face was rejected could be used for refinement. But given the short amount of time spent on this I am quite pleased. Feel free to fork the code and play with it yourself.
