Leveraging Deep Learning on the Browser for Face Recognition
By Daitan’s Innovation team
Face recognition is probably one of the most anticipated technologies of recent decades. From Hollywood movies and sci-fi TV series to actual cell phone solutions, the face seems to be the perfect authenticator. But despite the hype, the technology did not look ready for a long time.
Recent advances in machine learning, however, suggest it was worth the wait. To get an idea, let’s take a look at what the big four tech companies are doing in this area.
Microsoft, Amazon, and Google all offer production-ready solutions for face applications. Apple, on the other hand, recently changed the way we access our smartphones with its Face ID technology.
And it is everywhere.
Face technology is now a core component of the promise of self-driving cars. In this scenario, it can be used to detect how distracted the driver is, from peeking at a cell phone while on the road to looking sideways in the middle of an intersection.
Moreover, business payment services are popping up everywhere, most of them based on a face authentication system. Apple Pay, Samsung Pay, and MasterCard Selfie Pay are some examples.
Fields such as health care, advertising, and security are either performing experiments or already deploying face technologies for their users.
If you want a broader look, here is a summary of the top 10 facial recognition APIs of 2018. Note that most of these services offer more than facial recognition capabilities. For example, gender and age prediction, emotional information, and similarity scores are common to most of them.
In this post, we propose a face identification system running in a messenger app like Skype or Slack. Our system registers new people’s faces using raw images from a VGA camera. Then, we put in place a locking system that blocks non-registered users from using the app, all of it running at near real-time speed in the browser.
Face detection and recognition are classic tasks in the field of computer vision. Approaches such as Haar cascades and eigenfaces combined with SVMs are popular solutions for both problems. The main concern with these algorithms is that they do not seem to generalize well to real-world applications. Although they were starting to show good accuracy, the technology, again, did not appear to have reached production level.
However, with the deep learning revolution, things started to change. It all began with the availability of large labeled datasets and parallel hardware. Suddenly, relatively old algorithms (combined with new techniques) were being used to solve many hard problems, in particular perceptual problems like image classification, object detection, and segmentation. In a short time, most of the classical solutions were superseded by a new generation of algorithms.
Face-related applications were no different. Nowadays, most solutions for face recognition are based on a very similar idea: training deep convolutional networks on large datasets. Let’s get into the details.
Deep Learning for Face Recognition
A paper that really set the tone for real-world applications was FaceNet: A Unified Embedding for Face Recognition and Clustering.
FaceNet achieved 99.63% accuracy on the very popular Labeled Faces in the Wild (LFW) dataset. To give you some perspective, the paper reports that this cut the error rate by about 30% compared with the best previously published result.
To do that, the FaceNet authors proposed an effective combination of three components: model architecture, objective function, and training procedure.
In short, FaceNet tackles three related computer vision problems with a single solution.
First, the verification problem. Here, given two face images, the aim is to verify whether they are of the same person. Second, the recognition problem answers the question: Who is this person? Lastly, the clustering problem consists of finding common faces among a collection of faces.
The core idea is to learn a mapping from images to an embedding space. That is, the system takes images as input and outputs vectors associated with the faces’ features. Then, these vectors are optimized in such a way that the distances between them directly reflect how similar the faces are. In other words, faces of the same person have small distances, and faces of distinct people have large distances.
One key component to make learning efficient is the concept of the triplet loss function. In a nutshell, the triplet loss does two things.
- It minimizes the distance between an anchor and a positive image.
- It maximizes the distance between the anchor and a negative sample.
Here, both the anchor and the positive sample are faces of the same person. That is, they have the same identity. The negative sample, on the other hand, can be a random face of a different person. Moreover, since each image is represented as a 128-dimensional vector, at each iteration the algorithm adjusts all three vectors according to the two constraints above.
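To make the two constraints concrete, here is a minimal TypeScript sketch of the loss for a single triplet. It assumes squared Euclidean distances and a margin of 0.2, which is our reading of the FaceNet paper's formulation. The function names are illustrative and not taken from any library, and real embeddings would have 128 values rather than the short vectors used when trying it out.

```typescript
// Squared Euclidean (L2) distance between two equal-length vectors.
function squaredDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// Triplet loss for one (anchor, positive, negative) triplet:
//   max(d(anchor, positive) - d(anchor, negative) + margin, 0)
// The default margin of 0.2 is an assumption based on the FaceNet paper.
function tripletLoss(
  anchor: number[],
  positive: number[],
  negative: number[],
  margin: number = 0.2
): number {
  const positiveDistance = squaredDistance(anchor, positive);
  const negativeDistance = squaredDistance(anchor, negative);
  return Math.max(positiveDistance - negativeDistance + margin, 0);
}
```

The loss is zero once the negative sample is farther from the anchor than the positive one by at least the margin, so only triplets that still violate the constraints contribute to training.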
The following image shows the FaceNet model architecture. During training, the model receives batches of triplets as described above. These triplets have to be chosen with care so as to balance positive and negative samples. The convolutional neural network takes in the triplets and outputs the corresponding 128-dimensional vectors. L2 normalization follows, which results in the face-embedding vectors. Lastly, the embeddings are fed to the triplet loss function.
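The L2 normalization step in this pipeline can be sketched in a few lines. This is a plain TypeScript illustration with a hypothetical function name, not code from FaceNet or from any library: it rescales a vector so that its Euclidean length is 1, which puts all embeddings on the same scale before distances are compared.

```typescript
// Scale a vector to unit Euclidean length (L2 normalization).
// A zero vector has no direction, so it is returned unchanged.
function l2Normalize(vector: number[]): number[] {
  const norm = Math.sqrt(vector.reduce((acc, x) => acc + x * x, 0));
  if (norm === 0) {
    return vector.slice();
  }
  return vector.map((x) => x / norm);
}
```

For example, `l2Normalize([3, 4])` yields a vector pointing in the same direction with length 1.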
There are some advantages to representing faces as embedding vectors. First, thanks to an intrinsic characteristic of the convolution operation, these models are largely insensitive to translation. In this context, it means that the system can recognize the same person regardless of where the face appears in the frame.
Moreover, the FaceNet authors trained on a large dataset of labeled faces to attain the appropriate invariances to pose and illumination, two of the most challenging problems in computer vision applications.
To put these ideas into practice, we decided to implement a proof of concept. The idea is very simple. We built a Slack-like messenger app that uses a VGA camera to keep track of who is in front of the computer screen. If the application senses that no previously registered user is in front of the camera, the messenger app locks itself. Conversely, if a registered user shows up on the camera feed, the app unlocks and grants them access.
To keep this demonstration simple and easy to show, we set two requirements. First, we want to use state-of-the-art algorithms for face detection and recognition. Put differently, that means using a ConvNet-based model that transforms images into embedding vectors.
Second, we want our application to run in the browser at near real-time speed. That means processing close to 30 frames per second (FPS).
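To put the frame rate target in perspective, the short sketch below converts a target FPS into a per-frame time budget and estimates the achieved FPS from frame timestamps. Both helpers are hypothetical names written for this post, and the timestamps are assumed to be in milliseconds.

```typescript
// Time available to process one frame at a given target frame rate.
function frameBudgetMs(targetFps: number): number {
  return 1000 / targetFps;
}

// Average frames per second over a series of frame timestamps (in ms).
// Returns 0 when there are too few samples to measure an interval.
function averageFps(frameTimestampsMs: number[]): number {
  if (frameTimestampsMs.length < 2) {
    return 0;
  }
  const first = frameTimestampsMs[0];
  const last = frameTimestampsMs[frameTimestampsMs.length - 1];
  const elapsedMs = last - first;
  if (elapsedMs <= 0) {
    return 0;
  }
  // N timestamps delimit N - 1 frame intervals.
  return ((frameTimestampsMs.length - 1) * 1000) / elapsedMs;
}
```

At 30 FPS the budget is roughly 33 ms per frame, which has to cover both detection and recognition.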
face-api.js is a library for face detection and recognition in the browser.
face-api.js offers pre-trained models for the following tasks:
- Face detection
- Face recognition
- Face tracking
- Face similarity
- Face expression recognition
- Face landmark detection
For this example, we focus on the first two: face detection and recognition. We encourage you to have a look at the project’s GitHub repository for further details.
For face recognition, they trained a ResNet-34-like architecture to compute 128-dim vector descriptors. As explained above, the model takes face images in and outputs vectors that characterize the person’s face. Once we have the vector descriptors for, say, two faces, we can measure how alike they are by computing the Euclidean distance between them.
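As a rough illustration of this comparison, the sketch below computes the Euclidean distance between two descriptors and applies a cutoff to decide whether they belong to the same person. The helper names and the 0.6 threshold are assumptions made for this example and should be tuned on real data. The code is plain TypeScript and does not call face-api.js.

```typescript
// A face descriptor: any indexable list of numbers (plain array or typed array).
type Descriptor = ArrayLike<number>;

// Euclidean distance between two descriptors of equal length.
function euclideanDistance(a: Descriptor, b: Descriptor): number {
  if (a.length !== b.length) {
    throw new Error("descriptors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Two faces are treated as the same person when their descriptors are
// closer than the threshold. The 0.6 default is an assumed value.
function isSamePerson(
  a: Descriptor,
  b: Descriptor,
  threshold: number = 0.6
): boolean {
  return euclideanDistance(a, b) < threshold;
}
```

A lower threshold makes the check stricter (fewer false matches, more false rejections), and a higher one does the opposite.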
This model achieves a prediction accuracy of 99.38% on the LFW benchmark for face recognition, close to the original FaceNet implementation accuracy of 99.63%. The API also offers quantized versions of the models, intended especially for scenarios with scarce computing power. For face recognition, the quantized ResNet-34 model is roughly 6.2 MB.
Now, let’s dive into our demo.
The first thing we need to do is to register a user in the messenger app. To do that, we built a user-friendly registration interface. Here, the user needs to follow a simple procedure. Basically, our registration interface takes multiple pictures of the user, one for each specific face orientation. This step is required to improve the accuracy of the model. The idea is that different face orientations generate slightly different representations, which together yield a better overall representation of the user’s identity.
After registration, we output a visual cue on the webcam video, as well as the name of the registered user. At this point, any other face found by the application is classified as “unknown.”
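The labeling step described here can be sketched as a nearest-neighbor search over the descriptors stored at registration. This is a simplified illustration, not the demo's actual code: the `RegisteredUser` shape, the `identify` function, and the 0.6 threshold are all assumptions made for this example.

```typescript
// A face descriptor: any indexable list of numbers (plain array or typed array).
type FaceDescriptor = ArrayLike<number>;

// Hypothetical record kept for each registered user.
interface RegisteredUser {
  name: string;
  descriptors: FaceDescriptor[]; // one per face orientation captured at registration
}

// Euclidean distance between two descriptors of equal length.
function descriptorDistance(a: FaceDescriptor, b: FaceDescriptor): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Return the name of the registered user whose stored descriptor is closest
// to the detected face, or "unknown" if none is closer than the threshold.
function identify(
  face: FaceDescriptor,
  users: RegisteredUser[],
  threshold: number = 0.6 // assumed cutoff, to be tuned on real data
): string {
  let bestName = "unknown";
  let bestDistance = threshold;
  for (const user of users) {
    for (const reference of user.descriptors) {
      const distance = descriptorDistance(face, reference);
      if (distance < bestDistance) {
        bestDistance = distance;
        bestName = user.name;
      }
    }
  }
  return bestName;
}
```

Keeping every registration descriptor, instead of a single one, means a face seen at an angle only needs to be close to one of the stored orientations to be recognized.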
We assume that the screen should stay unlocked even if other faces are present in the camera feed, as long as a registered user is there. There could be situations where any snooper should trigger the screen lock, but we did not consider this the most common case.
Whenever the person leaves the video feed, the locking status starts to change. Then, if the user stays away for more than three seconds, the screen gets locked.
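The locking behavior can be modeled as a small state machine driven by one update per processed frame. The sketch below is our own illustration of that logic, with a hypothetical `PresenceLock` class and timestamps passed in explicitly. It assumes the app starts locked and that the three-second grace period is configurable.

```typescript
type LockState = "locked" | "unlocked";

// Tracks whether the app should be locked, based on when a registered
// user was last seen on the camera feed.
class PresenceLock {
  private state: LockState = "locked"; // assumption: the app starts locked
  private lastSeenMs: number | null = null;
  private readonly graceMs: number;

  constructor(graceMs: number = 3000) {
    this.graceMs = graceMs;
  }

  // Call once per processed frame.
  // registeredUserVisible: true if any detected face matches a registered user;
  // other (unknown) faces in the frame do not affect the result.
  update(registeredUserVisible: boolean, nowMs: number): LockState {
    if (registeredUserVisible) {
      this.lastSeenMs = nowMs;
      this.state = "unlocked";
    } else if (
      this.state === "unlocked" &&
      this.lastSeenMs !== null &&
      nowMs - this.lastSeenMs > this.graceMs
    ) {
      this.state = "locked";
    }
    return this.state;
  }
}
```

Passing the current time as an argument, for instance from `performance.now()` in the browser, keeps the class easy to test without a real camera or timers.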
In this situation, other users can’t unlock the application; only registered users can. We can see that in the GIF below: the unlocking process only starts when the original user comes back and shows up on the camera feed.
The functionality is quite simple, but as we explained, there is actually a lot going on behind the scenes!
Face recognition is an emerging technology. Recently, advances in machine learning have boosted algorithm performance in this area. Hence, face recognition applications are more and more present in production-level solutions.
As we can see, face-api.js offers a very good starting point. However, if we want to deploy a scalable solution, we should consider improving not only the security aspects but also the model for the specific use case. That might include retraining the neural network on a more representative dataset, or even adding features from external sensors, such as depth information.
Note that in this demonstration we did not tackle the problem of presentation attacks, or spoofing. To deal with it, we would need a model trained to tell genuine faces from such attacks. This work by Bresan et al., for instance, uses features such as depth, salience, and illumination maps to distinguish attacks from non-attacks in video frames.
Thanks to Ewerton Menezes for advice on UX/UI in this PoC.