Recognizing Face in Android using Deep Neural Network + TensorFlow Lite
In the previous article, we explored how to implement face detection in Android apps as the first step of a face recognition pipeline on mobile devices. In this article, we continue with the recognition step itself.
Before we start, there are several terms that readers should know: Deep Neural Network, Convolutional Neural Network, Triplet Loss, and Inference Time. I’ll happily explain each of them below:
- Deep Neural Network
A neural network is one of the main building blocks of modern machine learning. It consists of many connected units called nodes. These nodes loosely mimic how neurons in a human brain work: each node processes its input, produces a result, and passes that result to the next node. These chains of nodes are basically what a neural network is.
These nodes are grouped into layers, and each layer takes a specific kind of input and produces a specific kind of output.
Most real problems can’t be solved by a single processing step. To handle them, we stack additional layers, each one transforming the output of the previous layer in a different way.
When a network has multiple layers between its input and its output, it has “depth”, hence Deep Neural Network.
- Convolutional Neural Network
In mathematical terms, a convolution is an operation on two functions, say f and g, that produces a third function h expressing how the shape of one is modified by the other.
In the context of neural networks, it means that some layers replace the general matrix multiplication normally used inside a neural network with a convolution operation.
This tends to give better results when processing visual imagery, including facial recognition, object recognition, and other kinds of visual processing.
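To make the idea concrete, here is a minimal Kotlin sketch of the sliding-window operation a convolutional layer applies to a single-channel image. The function name `convolve2d` is illustrative, and the sketch assumes stride 1, no padding, and (as in most CNN libraries) no kernel flip, so strictly speaking it computes a cross-correlation.

```kotlin
/**
 * Slides `kernel` over `image` (stride 1, no padding) and returns the feature map.
 * The kernel is not flipped, which matches what CNN layers usually compute.
 */
fun convolve2d(image: Array<FloatArray>, kernel: Array<FloatArray>): Array<FloatArray> {
    val outRows = image.size - kernel.size + 1
    val outCols = image[0].size - kernel[0].size + 1
    require(outRows > 0 && outCols > 0) { "kernel must fit inside the image" }
    return Array(outRows) { r ->
        FloatArray(outCols) { c ->
            var sum = 0f
            // Multiply the kernel with the image patch under it and add everything up
            for (i in kernel.indices) {
                for (j in kernel[i].indices) {
                    sum += image[r + i][c + j] * kernel[i][j]
                }
            }
            sum
        }
    }
}
```

A real network learns the kernel values during training and applies many kernels per layer; this sketch only shows the arithmetic for one of them.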
- Triplet Loss
What is triplet loss? Triplet loss is a loss function for machine learning algorithms. A loss function maps values (in this case, arrays) to a real number that represents the cost, or loss, associated with them.
This loss function works by comparing an input (called the “anchor”) to a known matching input (one we know is the same person as the anchor, called the “positive”) and a known non-matching input (a totally different person, called the “negative”).
The goal of this loss function is to minimize the distance from the anchor to the positive and maximize the distance from the anchor to the negative.
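As a rough illustration, the loss for a single triplet can be sketched in Kotlin as below. The function names are hypothetical, and the squared Euclidean distance and the margin of 0.2 are assumptions borrowed from the common FaceNet-style formulation rather than something this article specifies.

```kotlin
import kotlin.math.max

/** Squared Euclidean distance between two embeddings of equal length. */
fun squaredDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "embeddings must have the same length" }
    var sum = 0f
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sum
}

/**
 * Triplet loss for one (anchor, positive, negative) triple.
 * It is zero once the negative is at least `margin` farther from the anchor than the positive.
 */
fun tripletLoss(
    anchor: FloatArray,
    positive: FloatArray,
    negative: FloatArray,
    margin: Float = 0.2f // assumed value
): Float =
    max(squaredDistance(anchor, positive) - squaredDistance(anchor, negative) + margin, 0f)
```

This function is only used while training the model. On the phone we run the already-trained model and never compute the loss.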
- Inference Time
Inference time measures how long a machine learning model takes to produce a result. In computing terms, it is the time needed to produce one output for one input.
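One simple way to measure it is to time the inference call directly. The helper below is a sketch with a hypothetical name (`averageInferenceMillis`); `runInference` stands in for whatever call actually runs the model.

```kotlin
import kotlin.system.measureNanoTime

/** Returns the average wall-clock time of `runInference`, in milliseconds, over `runs` calls. */
fun averageInferenceMillis(runs: Int, runInference: () -> Unit): Double {
    require(runs > 0) { "runs must be positive" }
    runInference() // warm-up call, not timed
    val totalNanos = measureNanoTime {
        repeat(runs) { runInference() }
    }
    return totalNanos / 1_000_000.0 / runs
}
```

Averaging over several runs after a warm-up call smooths out one-off costs such as the first model load.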
Kinda understand it now? Great! Now let’s move on to the implementation of Face Recognition.
What is TensorFlow Lite? TensorFlow Lite is a library developed by the TensorFlow team to run machine learning models built with TensorFlow on edge and mobile devices. It lets us convert existing machine learning models into a format that even mobile phones can run.
I won’t explain how to install this library in an Android project. You can learn how to set it up properly here, but I will explain the steps I personally took to complete this project.
There are various models that can be used, but for brevity’s sake, I’ll use two particular models:
MobileFaceNet: a Convolutional Neural Network based on MobileNet V2 for face recognition, with a reduced parameter count that lets it run on a mobile device at reasonable accuracy. The output of this model is a 192-dimensional embedding in Euclidean space. As the authors explain:
We present a class of extremely efficient CNN models, MobileFaceNets, which use less than 1 million parameters and are specifically tailored for high-accuracy real-time face verification on mobile and embedded devices.
FaceNet: another Convolutional Neural Network, which embeds a face into a 128-dimensional Euclidean space so that the face is represented as a likeness array. It uses Triplet Loss during training and achieves very high recognition accuracy.
Each of these two models has advantages and disadvantages when run in Android apps:
MobileFaceNet:
- Based on MobileNet V2 (retrofit to V3)
- Very fast inference time (168–320 ms for dataset source, around 160 ms typical)
- Reduced parameters, so it can run on a less powerful SoC (System on Chip, what we usually call the “processor” or “brain” of a mobile phone)
- Less accurate at recognizing faces with an expression
- Less accurate at recognizing rotated faces
- More accurate with straight-on face photos
- Input format: ARGB8888 112x112 px Bitmap
- Output format: Float Array, 192D
- Small model size (5.1 MB unquantized)
FaceNet:
- Relatively slow inference time (560–1496 ms for dataset source, around 200 ms typical)
- Does not run well on a slow SoC
- Can handle more variation in environment and facial expression, including dark environments, smiles, and visible teeth
- Can recognize a person even when they wear a mask
- Can handle face rotation to some degree
- Input format: ARGB8888 160x160 px Bitmap
- Output format: Float Array, 128D
- Large model size (45 MB unquantized)
Currently, our implementation of Face Recognition in Android uses precomputed DNN outputs for each known person, stored like this:
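A minimal sketch of such a record could look like the following. Apart from the Likeness field, which the text refers to, the names (`Person`, `name`) are assumptions.

```kotlin
/**
 * One known person and the embeddings ("likenesses") precomputed from their reference photos.
 * Each FloatArray holds one model output (128 or 192 values, depending on the model).
 */
class Person(
    val name: String,
    val likeness: List<FloatArray>
)
```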
Where Likeness is the embedding in Euclidean space produced by the DNN model through TensorFlow Lite. It represents a face as an N-dimensional array, on which we can do some simple arithmetic to calculate the average distance to each known face.
Then following the flowchart below:
We can determine which person is which from our list of people with precomputed likenesses. Each step of the flowchart above is explained below:
- Following the earlier Face Detection step, we already have an ARGB8888 image at the n x n resolution the model requires (112 x 112 px for MobileFaceNet and 160 x 160 px for FaceNet).
- This image is fed into the DNN model, and we wait for the output.
- If the inference succeeds, we have an N-dimensional array that represents the likeness of a face in Euclidean space.
- This likeness is then compared to our existing list of people above.
- For each person in the list, we calculate the Euclidean (L2) distance from our result to each likeness that person has. These distances are then averaged per person.
- Once every person’s distances are averaged, we take the smallest average. The person associated with it is our candidate match.
- This smallest average is then compared to our predetermined maximum distance, which we took from the DeepFace library as a reference (1.0 for FaceNet and 1.0 for MobileFaceNet).
- If the smallest average is bigger than our maximum distance, we assume that this is not the same person as our candidate. Otherwise, we assume that it is indeed the same person.
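The comparison steps above can be sketched in Kotlin as follows. This is an illustration under assumptions, not the app’s actual code: the database is modelled as a plain map from a person’s name to their likeness arrays, and the names `euclideanDistance` and `recognize` are hypothetical.

```kotlin
import kotlin.math.sqrt

/** Euclidean (L2) distance between two likeness arrays of equal length. */
fun euclideanDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "arrays must have the same length" }
    var sum = 0f
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sqrt(sum)
}

/**
 * Averages the distance from `embedding` to each person's likenesses and picks the smallest.
 * Returns the name and average distance, or null when even the best candidate is
 * farther away than `maxDistance`.
 */
fun recognize(
    embedding: FloatArray,
    database: Map<String, List<FloatArray>>,
    maxDistance: Float = 1.0f
): Pair<String, Float>? {
    val best = database
        .filterValues { it.isNotEmpty() }
        .mapValues { (_, likenesses) ->
            likenesses.map { euclideanDistance(embedding, it) }.average().toFloat()
        }
        .minByOrNull { it.value }
        ?: return null
    return if (best.value <= maxDistance) best.key to best.value else null
}
```

Returning null for "no match" keeps the threshold decision in one place, so the caller only has to handle a recognized name or an unknown face.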
Then, for the implementation in the Android app, we need the dependency below:
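As a hedged example, a Gradle Kotlin DSL entry for the TensorFlow Lite runtime might look like this. The version number is an assumption, so use whichever release fits your project.

```kotlin
// app/build.gradle.kts (module-level build file)
dependencies {
    // TensorFlow Lite runtime for Android; the version shown here is an assumption
    implementation("org.tensorflow:tensorflow-lite:2.14.0")
}
```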
Then make sure our model (which should be a .tflite file) is added to the /app/src/main/assets path.
For the implementation, after we get the cropped image from the Face Detection process, we need to convert it to a ByteBuffer:
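A minimal sketch of this conversion is shown below. To keep it independent of Android classes, it takes the ARGB pixel array that `Bitmap.getPixels` fills in rather than the Bitmap itself. The function name and the `(value - 127.5) / 128` scaling are assumptions, so use the normalization your model was trained with.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
 * Packs ARGB_8888 pixels into a float32 RGB buffer suitable as model input.
 * The alpha channel is dropped; each colour channel is scaled to roughly [-1, 1].
 */
fun pixelsToByteBuffer(pixels: IntArray, width: Int, height: Int): ByteBuffer {
    require(pixels.size == width * height) { "expected width * height pixels" }
    // 3 colour channels per pixel, 4 bytes per float
    val buffer = ByteBuffer.allocateDirect(width * height * 3 * 4)
        .order(ByteOrder.nativeOrder())
    for (pixel in pixels) {
        buffer.putFloat((((pixel shr 16) and 0xFF) - 127.5f) / 128f) // red
        buffer.putFloat((((pixel shr 8) and 0xFF) - 127.5f) / 128f)  // green
        buffer.putFloat(((pixel and 0xFF) - 127.5f) / 128f)          // blue
    }
    buffer.rewind()
    return buffer
}
```

On Android, the pixel array would come from something like `bitmap.getPixels(pixels, 0, width, 0, 0, width, height)` on the cropped 112 x 112 or 160 x 160 face image.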
Then run the inference on the DNN model:
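The sketch below shows the shape of that call. So that it runs without the TensorFlow Lite library, the model invocation is passed in as a lambda (`runModel`, a hypothetical parameter) that stands in for TensorFlow Lite’s `Interpreter.run(input, output)`. The function name `embed` is also an assumption.

```kotlin
import java.nio.ByteBuffer

/**
 * Runs one inference and returns the likeness array.
 * `embeddingSize` is 128 for FaceNet and 192 for MobileFaceNet.
 */
fun embed(
    input: ByteBuffer,
    embeddingSize: Int,
    runModel: (ByteBuffer, Array<FloatArray>) -> Unit
): FloatArray {
    // The output tensor has shape [1, embeddingSize]: a batch containing one embedding
    val output = Array(1) { FloatArray(embeddingSize) }
    input.rewind()
    runModel(input, output)
    return output[0]
}
```

On Android, assuming an `Interpreter` already created from the .tflite model in the assets folder, the call would look like `embed(buffer, 128) { input, output -> interpreter.run(input, output) }`.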
The resulting FloatArray is the likeness array produced by the DNN model. It is then compared against our database following the steps described above.
As for how we calculate the distance, we have two possible methods. One is the Euclidean distance with L2 normalization: we normalize both arrays (one is the result from above and the other comes from our precalculated database) and then measure the distance between them, which we can do with the code below:
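A minimal sketch of this method, with hypothetical function names, could be:

```kotlin
import kotlin.math.sqrt

/** Scales a vector to unit length (L2 normalization). A zero vector is returned unchanged. */
fun l2Normalize(v: FloatArray): FloatArray {
    val norm = sqrt(v.fold(0f) { acc, x -> acc + x * x })
    return if (norm == 0f) v.copyOf() else FloatArray(v.size) { v[it] / norm }
}

/** Euclidean distance between the L2-normalized versions of `a` and `b`. */
fun l2Distance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "arrays must have the same length" }
    val na = l2Normalize(a)
    val nb = l2Normalize(b)
    var sum = 0f
    for (i in na.indices) {
        val d = na[i] - nb[i]
        sum += d * d
    }
    return sqrt(sum)
}
```

Because both arrays are scaled to unit length first, the resulting distance always falls between 0 and 2, which makes a fixed threshold easier to reason about.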
The other method is the Cosine Similarity between the two arrays:
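A sketch of this calculation, again with hypothetical function names, might look like this. Turning the similarity into a distance as `1 - similarity` is one common convention and is an assumption here.

```kotlin
import kotlin.math.sqrt

/** Cosine similarity in [-1, 1]; 1 means the two arrays point in the same direction. */
fun cosineSimilarity(x1: FloatArray, x2: FloatArray): Float {
    require(x1.size == x2.size) { "arrays must have the same length" }
    var dot = 0f
    var norm1 = 0f
    var norm2 = 0f
    for (i in x1.indices) {
        dot += x1[i] * x2[i]
        norm1 += x1[i] * x1[i]
        norm2 += x2[i] * x2[i]
    }
    val denominator = sqrt(norm1) * sqrt(norm2)
    // A zero vector has no direction, so treat it as "not similar"
    return if (denominator == 0f) 0f else dot / denominator
}

/** Cosine distance: 0 for the same direction, up to 2 for opposite directions. */
fun cosineDistance(x1: FloatArray, x2: FloatArray): Float = 1f - cosineSimilarity(x1, x2)
```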
What is this distance calculation for? Basically this: we have the result from the face recognition model, and this result comes as an array. The model itself doesn’t know which face belongs to whom, but the array is basically the coordinate of a point in an n-dimensional space, with n being the size of the array (128 for FaceNet or 192 for MobileFaceNet).
Previously, we passed our known face data through the same model, so those arrays represent the same kind of coordinates. From this, we can calculate the distance between the array from the new face and the arrays in our database that represent specific faces.
This n-dimensional distance can be calculated with the Cosine Similarity mentioned above, with x1 being the new array and x2 being an array from our database. It is also possible to apply L2 normalization to both arrays as a preliminary step.
As for the result of the Android implementation using the FaceNet model, we can see the image below:
As you can see, the average distance for each person in our database is:
- Wyndham: 0.70820
- Zidni: 1.190301
- Alfin: 1.075332
- Reza: 1.012211
The person with the lowest average distance is Wyndham, and because that distance is lower than our maximum limit of 1.0, we can conclude that this person is indeed Wyndham.
Great! We have implemented a proper Face Recognition system in an app and seen the result. Implementation aside, there are some requirements and limitations that may or may not be dealbreakers for your particular application. I list a few below:
- Minimum SDK is 21 (Lollipop 5.0)
- Minimum Tested System RAM is 2GB
- Requires camera access
- Can be fooled with a photograph
- The model size is large and incompressible
As we have seen throughout this article, it is possible to do Face Recognition on mobile/edge devices. This implementation in particular uses pre-existing models to recognize faces. On the implementation side, we use TensorFlow Lite, which is available on various platforms, including Android.
Before we can do proper face recognition, though, we need to prepare our face data first, which we touched upon in the first article. Once we have a proper face image, we convert it to a byte buffer and run it through the pre-trained model. We then use the result to calculate the distance between the scanned face and our existing database of faces.
When it comes to the model, there are several available that use the same basic method: a Convolutional Neural Network trained with Triplet Loss. In this example, I’m using two in particular: MobileFaceNet and FaceNet.
In the next article, we will explore other areas of machine learning on mobile devices, including text recognition and filling in missing data in such a way that we can extract proper information from it.
Thank you very much and see you in the next article!