One of the most important drivers for ‘democratizing’ data science has been transfer learning: the ability to take a model that a tech giant like Google has invested significant resources in, and reuse it, really brings AI within the grasp of any tinkerer with an attitude. Real-time object detection, for example, has become easy with the TensorFlow Object Detection API: we can simply use a pre-trained model such as the SSD MobileNet detector to recognize many objects in real time on any normal computer. But what if we want to do face recognition? You know, like in the movies:
Can you use a pre-trained face recognition model to recognize your friends? Not quite, because:
- a pre-trained object detection model can recognize most common objects in the world
- but a pre-trained face recognition model can only recognize the faces it was trained on
So we need something extra to be able to use such a model. Suppose you want to build an app that can recognize the faces of some of your friends/family members/colleagues. You couldn’t possibly achieve that in one hour…
Or could you…? Let’s find out!
What you need
- Some images containing faces of the people you want to be able to recognize (they don’t need to be cropped to their faces).
Note: you don’t actually need millions of images to train a model to recognize your set of people of interest. It suffices to have just one image per person! If you want to understand how that works, see this paper.
- Laptop with Python 3 and a webcam
- Download and install some stuff, most notably dlib… we’ll get to that
So, what’s the plan?
- Take a stream of images from the webcam using OpenCV
- The face detection model detects where in the image faces are located; it doesn’t tell us whose face it is
- We feed the face to the face embedding model to get an embedding, or feature vector, of the face: a vector of size 128
- Now compare this vector to those of your buddies and find the most similar one.
☝️Ehm yeah: this means you should already have run through this process for the images of your buddies, to even have those embeddings!
- Display nice boxes around the recognized faces and label them with their name and maybe something silly like “Match 100%”
So we need two models:
- A face detection model
- A face embedding model that transforms an image into a meaningful feature vector (or embedding), specifically trained on faces
For both, there are multiple options you could use, but the easiest option (since it’s good to be lazy and I promised you’d be done within the hour) is to use the models available in the dlib library:
- the HOG face detector (Histogram of Oriented Gradients features + a linear SVM classifier)
- the face embedding model: a slightly modified ResNet-34 classification model trained on about 3 million faces, with the last classifier layer(s) removed so that it outputs an embedding instead of a class.
Sounds like a hassle already? No worry, we’re going to use some packages that have already taken care of the hard parts:
- dlib for the actual detection and recognition stuff
- face_recognition that acts as a nice wrapper to make our lives even easier
- OpenCV to use the webcam and mess around with images a bit
Now let’s get to work!
First off: set up a Python environment and install dlib. Unfortunately this is a little more work than a plain pip install, since it requires CMake. See this guide on how to install dlib: https://www.pyimagesearch.com/2018/01/22/install-dlib-easy-complete-guide/ which, to keep it simple (on macOS), comes down to:
- If you don’t have Homebrew installed, do so first
- Use that to
brew install cmake
- Now activate your Python environment and do
pip install dlib
- Install the face_recognition wrapper package:
pip install face_recognition
- OpenCV is also a fun package because you need to install it using
pip install opencv-python
but use it in Python under the name cv2
Now that that’s out of the way, we can start with Python!
First: do some imports and define some constants:
Now let’s start with a function to get an embedding of any faces in an image. It’s simpler than you might think: just two lines from the face_recognition package!
Btw: I use the words “embedding” and “encoding” interchangeably
Note: OpenCV reads images in BGR format, face_recognition in RGB format, so sometimes you need to convert them, sometimes not.
Using this, now make a little identity database, containing the encodings of our reference images:
Now that we’ve got that covered, let’s get a video stream going from the webcam, using OpenCV. The basic principle works as follows:
From here, all we need to do is insert the recognition part in this loop, plus some code to compare the face embedding with our ‘database’ to find the best match:
Wait, we didn’t define that last function. No biggy… just a couple of OpenCV function calls. Do make sure not to get top, right, bottom, left mixed up!
Alright, almost there! Just merge some of those last script parts into one main function for the app:
Now we’re all set to run it!
```python
database = setup_database()
main(database)
```
Results & improvements
It works, right? OK, you might be a bit disappointed by the frame rate. Unfortunately, reading the camera, running detection and recognition, and showing the image all in one process is a bit much… But there are a couple of things you could do:
- Downscale the frames (to about 25% of the original size) before feeding them into the models. This already makes a huge difference, while recognition still works well!
- Skip some frames. Do the detection/recognition part only on every other frame.
- Use separate threads for camera reading, model inference and image display. This takes some more work though.
- When every bit of performance matters (like maybe on a Raspberry Pi), you could use the Haar Cascade face detector instead of the HOG. It’s not provided in dlib though, so you’ll need to hack away a bit for that.