Real-time face recognition with Android + TensorFlow Lite

esteban uri
Jun 17, 2020 · 11 min read


The impressive effect of having the state of the art running in your hands

Introduction

A friend of mine reacted to my last post with the following questions: “Is it possible to make an app that compares faces on mobile without an Internet connection? How accurate could it be?” At that time I didn’t know the answer to his questions. Surely a deep learning model would do the job, but which one? Would it be light enough to fit on a mobile device? Would it be fast enough? And how accurate could it be? These questions remained in my mind like a “UNIX daemon” until I found the answers. In this article I walk through all of those questions in detail, and as a corollary I provide a working example application that solves the problem in real time, using a state-of-the-art convolutional neural network to accurately verify faces on mobile.

What are the key features of this app?

  • It recognizes faces very accurately
  • It works offline, in real time
  • It uses a mobile-oriented deep learning architecture
An example of the working app. Will Ferrell (the comedian) vs. Chad Smith (the drummer). People usually confuse them. First the faces are registered in the dataset, then the app recognizes them at runtime. Tested on my Google Pixel 3.

An installable .apk demo can be downloaded from here.

Overview

Well, but … what’s the big deal here? I saw my old digital camera detecting faces many years ago. So why is this different?

Face recognition vs Face detection

First of all, let’s see what “face detection” and “face recognition” mean. While many people use both terms interchangeably, they are actually two very different problems.

Face detection, in short, is: given an input image, decide whether there are people’s faces present in it. For each face found, determine where it is located (e.g. a bounding box that encloses it) and possibly also the position of the eyes, the nose and the mouth (known as face landmarks).

Figure 1. Face detection. We know that faces are present, but we don’t know who they are.

Face recognition: given an image of a person’s face, identify who the person is (from a known dataset of registered faces). Let’s say we have a dataset with the registered faces of the people in Figure 1 (Bach, Beethoven and Mozart). For any new face image we want to know who the face belongs to.

Figure 2. Face recognition. Beethoven’s and Bach’s faces are registered. The lower-right face (Salieri) is not registered, so the result must be “unknown”.

Solving this problem involves finding a metric to compare the similarity between faces.

We are looking for an offline solution

  • When I upload photos to the cloud, Facebook and Google know who is in the images and automatically tag the faces.

All that processing is done on the server farms those companies run, packed with GPUs and TPUs. We want to solve this problem offline, with our “modest” ARM processor.

  • How does my mobile banking app verify my face? It works on my smartphone.

The picture of your face is sent over the Internet through a web service to a back end (which probably interacts with Amazon AWS Rekognition behind the scenes).

Using Google’s ML Kit for face recognition

  • Why not use Google’s ML Kit to recognize faces?

Well, Google’s ML Kit does provide face detection, but it does not provide face recognition (yet). I will use ML Kit for the first part of the pipeline, and something else, explained later, for the recognition step.

How does a face recognition system work?

Figure 3 shows the pipeline of a facial recognition system.

Figure 3. pipeline of a face recognition system. (image from OpenFace)
  • First step, the face is detected on the input image.
  • Second, the image is warped using the detected landmarks to align the face [4] (so that all cropped faces have the eyes in the same position).
  • Third, the face is cropped and properly resized to feed the recognition deep learning model. Some image pre-processing operations are also done in this step (e.g. normalizing and “whitening” the face).
  • Fourth, the most “juicy part”, is the one depicted as “Deep Neural Network”. We are going to focus more on this step.

The main idea is that the deep neural network (DNN) takes a face F as input and outputs a D = 128-dimensional vector of floats. This vector E is known as the embedding. The embeddings are built so that the similarity between two faces F1 and F2 can be computed simply as the Euclidean distance between their embeddings E1 and E2.

Equation 1. Similarity between faces can be computed as the Euclidean distance between their embeddings: d(F1, F2) = ‖E1 − E2‖.

Simple, right?

  • We can now compare two faces F1 and F2 by computing their distance and checking it against some threshold: if the distance is lower than the threshold, we can say that both faces belong to the same person.
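As a minimal sketch in Java (the language of the app we will build later), the whole comparison boils down to a few lines. The 1.1 threshold below is only an illustrative placeholder; the right value depends on the model and has to be chosen empirically.

```java
// Minimal sketch: compare two face embeddings by Euclidean distance.
static float distance(float[] e1, float[] e2) {
    float sum = 0f;
    for (int i = 0; i < e1.length; i++) {
        float diff = e1[i] - e2[i];
        sum += diff * diff;
    }
    return (float) Math.sqrt(sum);
}

// The 1.1f threshold is only a placeholder; tune it for the model you use.
static boolean isSamePerson(float[] e1, float[] e2) {
    return distance(e1, e2) < 1.1f;
}
```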

For more details, here is a great article [3] from Satya Mallick that explains the basics in more depth, shows how a new face is registered in the system, and introduces some important concepts such as the triplet loss and the k-NN algorithm.

Adrian’s approach

In this great article [5], Adrian Rosebrock solves the problem in Python using OpenCV with the nn4.small2 pre-trained model from the OpenFace project, and he achieves a throughput of around 14 FPS on his MacBook Pro. The performance reported for this model is around 58.9 ms/frame on an 8-core 3.70 GHz CPU. The published accuracy for this model is around 93% on the LFW “deep funneled” dataset.

This could possibly be an approach for our mobile application, using the OpenCV SDK for Android, but:

  • Would it be fast enough on mobile?
  • Will it fit in the smartphone RAM? (the nn4.small2 model file size is more than 30 MB)
  • What about accuracy? Adrian himself says in the post that his implementation has some limitations and drawbacks in terms of accuracy.
  • If I go this way, I’m not sure how different the Android and iOS implementations would end up being.

These are all big questions … so let’s see if there is another approach available …

The FaceNet approach

FaceNet: A Unified Embedding for Face Recognition and Clustering [2]. FaceNet is a face recognition system developed in 2015 by researchers at Google that achieved state-of-the-art results on a range of face recognition benchmark datasets (99.63% on LFW). This work introduced the novel concept of the triplet loss.

In this great article [6], Jason Brownlee describes how to develop a face recognition system using FaceNet in Keras. Although the model used is heavy, its high accuracy makes it tempting to try. Also, since FaceNet is a very relevant work, there are many very good implementations available, as well as pre-trained models. Perhaps, by applying post-training quantization, the model could be shrunk and its speed would be good enough on mobile…

So I decided to give it a chance and converted David Sandberg’s FaceNet implementation to TensorFlow Lite. I chose this implementation because it is very well done and has become a de facto standard for FaceNet. I thought it was going to be an easy task, but I ran into several difficulties. I explain how I did it in this post.

Once I had my FaceNet model on TensorFlow Lite, I did some tests with Python to verify that it works. I took some face images, cropped them and computed their embeddings. The embeddings matched their counterparts from the original model. I also noticed that the Lite version ran much lighter and faster on my laptop’s CPU.

As all of this was promising, I finally imported the Lite model into my Android Studio project to see what happened. What I found is that the model works fine, but it takes around 3.5 seconds per inference on my Google Pixel 3. The answers to the questions from the beginning start to emerge.
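For reference, this is roughly how such a measurement can be done with the TensorFlow Lite Java API. The file name and the tensor shapes below are placeholders, not the exact values of the converted FaceNet model; use whatever shapes your conversion produced.

```java
import android.content.Context;
import android.content.res.AssetFileDescriptor;
import android.os.SystemClock;
import android.util.Log;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.tensorflow.lite.Interpreter;

public class TfliteTiming {

    // Memory-map a .tflite file from the assets folder.
    static MappedByteBuffer loadModel(Context ctx, String assetName) throws IOException {
        AssetFileDescriptor fd = ctx.getAssets().openFd(assetName);
        try (FileInputStream fis = new FileInputStream(fd.getFileDescriptor())) {
            return fis.getChannel().map(FileChannel.MapMode.READ_ONLY,
                    fd.getStartOffset(), fd.getDeclaredLength());
        }
    }

    // Run one dummy inference and log how long it takes. The 160x160 input and
    // 512-D output are placeholders; adjust them to the tensors of your model.
    static void timeInference(Context ctx) throws IOException {
        Interpreter tflite = new Interpreter(loadModel(ctx, "facenet.tflite"));
        float[][][][] input = new float[1][160][160][3];
        float[][] embedding = new float[1][512];
        long start = SystemClock.elapsedRealtime();
        tflite.run(input, embedding);
        long elapsed = SystemClock.elapsedRealtime() - start;
        Log.d("FaceNet", "Inference took " + elapsed + " ms");
        tflite.close();
    }
}
```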

Using FaceNet with TFLite we can:

  • Compare faces offline on mobile
  • Get highly accurate comparisons
  • Expect, as a baseline, an execution time of around 3.5 seconds per inference

Although this is not real time, there are many useful applications that could be built this way, if the user is willing to wait a bit.

But… nowadays as users, we want it all and we want it now, don’t we? So is there any other alternative?

Well, if we want speed and lightness we should give a try to a Mobile DNN Architecture!

Using the MobileFaceNet

MobileFaceNets [1] is a great work by researchers at Watchdata Inc. in Beijing, China. They presented a very efficient CNN model specifically designed for high-accuracy, real-time face verification on mobile devices. They achieved impressive speeds with very high accuracy using a model of just 4.0 MB. The accuracy they obtained is very similar to that of much heavier models (such as FaceNet).

I looked for a MobileFaceNet implementation to bring to TensorFlow Lite. Most available implementations are for PyTorch, which could be converted using the ONNX conversion tool, but since that tool is still in early stages of development, I opted for this excellent MobileFaceNet implementation in TensorFlow, from sirius-ai. Note: the answers from this thread were very helpful for converting the model.

The resulting file is very lightweight, only 5.2 MB, which is really good for a mobile application.

Figure 4 — The model size once converted is only 5.2 MB

Once I had my Lite model I did some tests in Python to verify that the conversion worked correctly. And the results were good, so I was ready to get my hands on mobile code.

Creating the mobile application

We are going to modify TensorFlow’s canonical object detection example to work with the MobileFaceNet model. In that repository we can find the source code for Android, iOS and Raspberry Pi. Here we will focus on making it work on Android, but doing it on the other platforms would simply be a matter of following the analogous procedure.

The code for this app can be found in my GitHub repository.

Adding the Face Recognition Step

The original sample comes with a different DL model and computes the results in a single step. For this app we need a multi-step process. Most of the work consists of splitting the pipeline into two stages: first face detection and then face recognition. For the face detection step we are going to use Google’s ML Kit.

Let’s add the ML Kit dependency to our project by adding the following to the app’s build.gradle file:
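The exact artifact name and version depend on the ML Kit release you target; at the time of writing, the standalone face detection artifact looks like this (the version shown is only an example):

```groovy
dependencies {
    // ML Kit standalone face detection (use the latest available version)
    implementation 'com.google.mlkit:face-detection:16.0.0'
}
```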

Once the project sync finishes, we are ready to use the FaceDetector in our DetectorActivity. The face detector is created with options that prioritize performance over other features (e.g. we don’t need landmarks for this application).
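A sketch of that setup, assuming the standalone com.google.mlkit face detection API (constant names may vary slightly between ML Kit versions):

```java
import com.google.mlkit.vision.face.FaceDetection;
import com.google.mlkit.vision.face.FaceDetector;
import com.google.mlkit.vision.face.FaceDetectorOptions;

// Favor speed: fast mode, no landmarks, no classification (smile, open eyes, etc.).
FaceDetectorOptions options =
        new FaceDetectorOptions.Builder()
                .setPerformanceMode(FaceDetectorOptions.PERFORMANCE_MODE_FAST)
                .setLandmarkMode(FaceDetectorOptions.LANDMARK_MODE_NONE)
                .setClassificationMode(FaceDetectorOptions.CLASSIFICATION_MODE_NONE)
                .build();

FaceDetector faceDetector = FaceDetection.getClient(options);
```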

The original app defines two bitmaps (the rgbFrameBitmap, where the preview frame is copied, and the croppedBitmap, which is originally used to feed the inference model). We are going to define two additional bitmaps for processing: portraitBmp and faceBmp. The first is simply used to rotate the input frame to portrait orientation for devices whose sensor is mounted in landscape. The faceBmp bitmap is used to draw every detected face, cropping it from its detected location and re-scaling it to 112 x 112 px to be used as input for our MobileFaceNet model. The frameToCropTransform converts coordinates from the original bitmap to the cropped bitmap space, and cropToFrameTransform does it in the opposite direction.
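A rough sketch of how those bitmaps and transforms can be set up (based on the example app’s ImageUtils helper; variable names and exact parameters are indicative):

```java
// Portrait copy of the camera frame, and the 112x112 crop fed to MobileFaceNet.
int targetW = (sensorOrientation % 180) == 90 ? previewHeight : previewWidth;
int targetH = (sensorOrientation % 180) == 90 ? previewWidth : previewHeight;
portraitBmp = Bitmap.createBitmap(targetW, targetH, Bitmap.Config.ARGB_8888);
faceBmp = Bitmap.createBitmap(TF_OD_API_INPUT_SIZE, TF_OD_API_INPUT_SIZE,
        Bitmap.Config.ARGB_8888);

// Maps coordinates from the full preview frame to the smaller cropped bitmap
// (and back). getTransformationMatrix() comes from the example's ImageUtils class.
frameToCropTransform = ImageUtils.getTransformationMatrix(
        previewWidth, previewHeight,
        cropSize, cropSize,
        sensorOrientation, MAINTAIN_ASPECT);
cropToFrameTransform = new Matrix();
frameToCropTransform.invert(cropToFrameTransform);
```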

When a frame arrives, the face detector is run. Face detection is done on the croppedBitmap; since it is smaller, it speeds up the detection process.
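A sketch of that call with the ML Kit API; onFacesDetected here is a placeholder for whatever method hands the detected faces to the recognition step described next:

```java
import android.util.Log;
import com.google.mlkit.vision.common.InputImage;

// Run ML Kit face detection on the already-cropped (smaller) bitmap.
InputImage image = InputImage.fromBitmap(croppedBitmap, 0 /* rotationDegrees */);
faceDetector.process(image)
        .addOnSuccessListener(faces -> {
            if (faces.isEmpty()) {
                return;  // nothing to recognize in this frame
            }
            onFacesDetected(faces);  // placeholder: triggers the recognition step
        })
        .addOnFailureListener(e -> Log.e("FaceDetector", "Detection failed", e));
```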

When the faces are detected, the original frame is drawn in the portraitBmp bitmap. For each detected face, its bounding box is retrieved and mapped from the cropped space to portrait space. This way we can get a better resolution image to feed the recognition step. Face cropping is done by translating the portrait bitmap to the face’s origin and scaling to match the DNN input size.
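Sketched in code, this looks roughly as follows. Here frameToPortraitTransform and cropToPortraitTransform are hypothetical matrices, built the same way as the transforms above, that rotate the frame to portrait and map the detection-crop coordinates to the portrait bitmap:

```java
// Draw the whole frame rotated into portrait orientation.
Canvas portraitCanvas = new Canvas(portraitBmp);
portraitCanvas.drawBitmap(rgbFrameBitmap, frameToPortraitTransform, null);

for (Face face : faces) {
    // Bounding box in detection-crop coordinates, mapped to portrait coordinates.
    RectF boundingBox = new RectF(face.getBoundingBox());
    cropToPortraitTransform.mapRect(boundingBox);

    // Translate so the face's corner becomes the origin, then scale so the face
    // fills the 112x112 input of the MobileFaceNet model.
    float sx = (float) TF_OD_API_INPUT_SIZE / boundingBox.width();
    float sy = (float) TF_OD_API_INPUT_SIZE / boundingBox.height();
    Matrix faceTransform = new Matrix();
    faceTransform.postTranslate(-boundingBox.left, -boundingBox.top);
    faceTransform.postScale(sx, sy);

    Canvas faceCanvas = new Canvas(faceBmp);
    faceCanvas.drawBitmap(portraitBmp, faceTransform, null);
    // faceBmp is now ready to be passed to the recognition model.
}
```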

Adding the face recognition step

First we need to add the TensorFlow Lite model file to the assets folder of the project:

And we adjust the required parameters to fit our model requirements in the DetectorActivity configuration section. We set the input size of the model to TF_OD_API_INPUT_SIZE = 112, and TF_OD_IS_QUANTIZED = false.
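In the configuration section this boils down to a couple of constants (the model file name below is illustrative; use whatever name you gave the file in the assets folder):

```java
// DetectorActivity configuration for the MobileFaceNet model.
private static final int TF_OD_API_INPUT_SIZE = 112;
private static final boolean TF_OD_IS_QUANTIZED = false;
private static final String TF_OD_API_MODEL_FILE = "mobile_face_net.tflite";
```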

Let’s rename the Classifier interface to SimilarityClassifier, since what the model returns now is a similarity and its behavior is a little different. It also allows us to register recognition items in the dataset. We rename the confidence field to distance, because exposing a true confidence in the Recognition definition would require some extra work. For now we simply use distance as the measure of similarity; it behaves the opposite of a confidence (the smaller the value, the more certain we are that the recognition corresponds to the same person). For example, a value of zero means it is exactly the same image.
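A simplified sketch of what the renamed interface might look like, following the description above (the exact shape in the repository may differ slightly):

```java
import android.graphics.Bitmap;
import java.util.List;

// Simplified sketch of the renamed interface.
public interface SimilarityClassifier {

    // Adds a labeled face (and its embedding) to the dataset.
    void register(String name, Recognition recognition);

    // Runs recognition; when storeExtra is true the raw embedding is kept in
    // the returned Recognition so it can be registered later.
    List<Recognition> recognizeImage(Bitmap bitmap, boolean storeExtra);

    class Recognition {
        private final String id;
        private final String title;
        // Distance to the closest registered embedding: the smaller the value,
        // the more certain we are that it is the same person.
        private Float distance;
        private Object extra;  // holds the embedding when storeExtra is true

        public Recognition(String id, String title, Float distance) {
            this.id = id;
            this.title = title;
            this.distance = distance;
        }

        public String getId() { return id; }
        public String getTitle() { return title; }
        public Float getDistance() { return distance; }
        public void setDistance(Float distance) { this.distance = distance; }
        public Object getExtra() { return extra; }
        public void setExtra(Object extra) { this.extra = extra; }
    }
}
```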

Now let’s change the model implementation. For now we implement our dataset in the simplest possible way: a dictionary that maps each person’s name to their recognition (which holds the embedding). The recognizeImage method is modified to retrieve the embedding and, if requested, store it in the recognition result. Once we have the embedding, we simply look for its nearest neighbor in the dataset by performing a linear search, as sketched below.
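A minimal sketch of that linear search, living inside the model class, assuming the embedding was stored in the Recognition’s extra field as a float[1][D] model output buffer (names are simplified with respect to the repository):

```java
import android.util.Pair;
import java.util.HashMap;
import java.util.Map;

// Registered faces: person name -> Recognition whose "extra" holds the embedding.
private final Map<String, SimilarityClassifier.Recognition> registered = new HashMap<>();

// Linear nearest-neighbor search over the registered embeddings.
private Pair<String, Float> findNearest(float[] embedding) {
    Pair<String, Float> best = null;
    for (Map.Entry<String, SimilarityClassifier.Recognition> entry : registered.entrySet()) {
        // The embedding was stored as a float[1][D] output buffer.
        float[] known = ((float[][]) entry.getValue().getExtra())[0];
        float sum = 0f;
        for (int i = 0; i < embedding.length; i++) {
            float diff = embedding[i] - known[i];
            sum += diff * diff;
        }
        float distance = (float) Math.sqrt(sum);
        if (best == null || distance < best.second) {
            best = new Pair<>(entry.getKey(), distance);
        }
    }
    return best;
}
```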

The rest is pretty straightforward, all the code is provided and any additional details can be seen in the repository.

Testing the App

Some examples:

Example 1. Left: registering Satya Mallick. Middle: registering Adrian Rosebrock (two great CV publishers). Right: the app at runtime.
Example 2. Left: Katy Perry (the singer). Middle: Zooey Deschanel (the actress). Right: the app at runtime.
Example 3. Recognition on an image from the movie Being John Malkovich.

Improvements and future work

Face pre-processing

Ideally the face should be aligned and whitened before use. In my case I use the result as it comes from ML Kit, just scaling it to the required input size, and that’s it. This is a very important improvement point, but in Java or Kotlin it might be more laborious than in Python. It would undoubtedly improve the accuracy of the results (although even without alignment, the results are very good). This is something I will add as future work.

References

[1]: Chen, Sheng, et al. "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices." Chinese Conference on Biometric Recognition, Apr 2018.

[2]: F. Schroff, et al. "FaceNet: A Unified Embedding for Face Recognition and Clustering." 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, pp. 815–823, Jun 2015.

[3]: Satya Mallick, et al. "Face Recognition: An Introduction for Beginners." learnopencv, https://www.learnopencv.com/face-recognition-an-introduction-for-beginners/, Apr 2019.

[4]: Adrian Rosebrock. "Face Alignment with OpenCV and Python." pyimagesearch, https://www.pyimagesearch.com/2017/05/22/face-alignment-with-opencv-and-python/, May 2017.

[5]: Adrian Rosebrock. "OpenCV Face Recognition." pyimagesearch, https://www.pyimagesearch.com/2018/09/24/opencv-face-recognition/, Sep 2018.

[6]: Jason Brownlee. "How to Develop a Face Recognition System Using FaceNet in Keras." machinelearningmastery, https://machinelearningmastery.com/how-to-develop-a-face-recognition-system-using-facenet-in-keras-and-an-svm-classifier/, Jun 2019.
