Face detection and tracking on Android using ML Kit — Part 1

Sérgio Moura
Onfido Product and Tech
6 min read · Jan 30, 2019

At Onfido, we’re creating an open world where identity is the key to access.

Since our goal is to prevent fraud while maintaining a great user experience for legitimate people, we launched advanced facial verification. With this new approach, we ask people to record live videos instead of static selfies. Users record themselves moving their head and talking out loud to prove they are real people.

Video selfie flow example

Our clients adopted this new feature. But after a few weeks, we noticed that the videos were hard to process, making identity verification more difficult. Users were not performing the right actions. They would retry multiple times and many eventually gave up. This meant that legitimate users were falling through the cracks.

We needed to find a solution, so over 5 days, we talked to machine learning engineers, biometric specialists, mobile developers, product managers and UX designers. Our goal was to radically improve our selfie video verification.

This series of blog posts will document our process to improve the selfie video feature. You can try it now on our demo app.

To improve our selfie video feature, we decided to tell users whether they were performing the actions correctly. This leads to less uncertainty for users and better quality videos.

To start the video recording, we need to make sure the user’s face is inside the screen, so we can be confident that:

a) There is one face in the image

b) This face is completely contained inside the video we are recording, and not cropped

We call this set of guarantees face detection. Once those two conditions are verified, the video recording automatically starts.

Next, for the head turn, we also want to make sure users perform the right movement. In this case, there are two movements: they must turn their head to the left or right side, and then turn back to face the camera. We call this face tracking.

ML Kit

To detect and track a face in a picture, we used Firebase ML Kit, a tool that tries to simplify bringing machine learning algorithms to mobile devices. When it comes to face detection and tracking, ML Kit gives us the ability to provide pictures to a FirebaseVisionFaceDetector and then receive callbacks with information about the faces in that picture. I won’t get into many details about the tool itself, since Joe Birch already published a great blog post on that.

In our case, we were interested in 2 features that ML Kit offers:

  • Face bounding box in a picture, for detection purposes
  • Face Euler Y angle, for head turn tracking

We built our own FaceDetector wrapper around this library, which takes care of the detector setup and leverages RxAndroid to expose a method returning an Observable<FaceDetectionData>.

Kotlin FaceDetectionData class with info for face detection and tracking
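As a rough sketch of what that class and the wrapper around the detector could look like (field names and the wrapper shape are illustrative, assuming Firebase ML Kit's FirebaseVisionFaceDetector and RxJava 2, not our exact implementation):

```kotlin
import android.graphics.Bitmap
import android.graphics.Rect
import com.google.firebase.ml.vision.FirebaseVision
import com.google.firebase.ml.vision.common.FirebaseVisionImage
import io.reactivex.Observable

// Illustrative data class: bounding box and Euler Y angle of the detected face,
// plus the dimensions of the frame that was analysed.
data class FaceDetectionData(
    val boundingBox: Rect,
    val faceAngle: Float,
    val frameWidth: Int,
    val frameHeight: Int
)

// Illustrative wrapper: feeds a frame to ML Kit and emits a FaceDetectionData
// item when exactly one face is found in it.
class FaceDetector {

    private val detector = FirebaseVision.getInstance().visionFaceDetector

    fun detect(frame: Bitmap): Observable<FaceDetectionData> =
        Observable.create<FaceDetectionData> { emitter ->
            detector.detectInImage(FirebaseVisionImage.fromBitmap(frame))
                .addOnSuccessListener { faces ->
                    faces.singleOrNull()?.let { face ->
                        emitter.onNext(
                            FaceDetectionData(
                                boundingBox = face.boundingBox,
                                faceAngle = face.headEulerAngleY,
                                frameWidth = frame.width,
                                frameHeight = frame.height
                            )
                        )
                    }
                    emitter.onComplete()
                }
                .addOnFailureListener { emitter.onError(it) }
        }
}
```

The real class likely carries more information (frame rotation, detector options, and so on); the point here is only the Observable-based shape described above.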

Face detection

For this step, there were two main things to solve:

  • How to ensure the face is inside the oval
  • How to provide meaningful feedback after a successful face detection

Oval where the face should be positioned

It’s unlikely that an Android device produces a camera feed with the same dimensions as the view in which we’re displaying it. That leaves us with two different sets of dimensions to take into account when evaluating the position of a face in a frame:

  • Frame dimensions + face bounding box: The dimensions of the frame sent to the detector and the resulting bounding box for the detected face relative to this frame.
  • Displaying view dimensions + oval bounding box: The dimensions of the view where the frame is being displayed and the position of the oval inside this view.

When checking if the face is contained inside the oval, we had to map the oval position inside the view to its corresponding position inside the camera frame. Assuming we have these mapped dimensions as ovalTop, ovalBottom, ovalLeft and ovalRight, the containment check becomes straightforward:
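As a sketch of that check (the names match the mapped dimensions above; the mapping itself is just a scaling between view and frame coordinates):

```kotlin
import android.graphics.Rect

// Illustrative containment check: the face bounding box (in frame coordinates)
// must lie entirely inside the oval bounds mapped into the same coordinate space.
fun isFaceInsideOval(
    faceBoundingBox: Rect,
    ovalTop: Int,
    ovalBottom: Int,
    ovalLeft: Int,
    ovalRight: Int
): Boolean =
    faceBoundingBox.top >= ovalTop &&
        faceBoundingBox.bottom <= ovalBottom &&
        faceBoundingBox.left >= ovalLeft &&
        faceBoundingBox.right <= ovalRight
```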

After a face is detected inside the oval, we want to provide feedback to users. We chose to provide both visual and haptic (vibration) feedback, leveraging Material Design guidelines and native Android APIs.

Success feedback for face detection
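As a small illustration of the haptic part, a vibration cue could be triggered with the platform Vibrator API (a sketch, not necessarily what we ship; it needs the VIBRATE permission and the duration is an arbitrary placeholder):

```kotlin
import android.content.Context
import android.os.Build
import android.os.VibrationEffect
import android.os.Vibrator

// Sketch only: a short vibration as success feedback after the face is detected.
fun vibrateOnFaceDetected(context: Context) {
    val vibrator = context.getSystemService(Context.VIBRATOR_SERVICE) as Vibrator
    if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.O) {
        vibrator.vibrate(VibrationEffect.createOneShot(50, VibrationEffect.DEFAULT_AMPLITUDE))
    } else {
        @Suppress("DEPRECATION")
        vibrator.vibrate(50)
    }
}
```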

Face tracking

Regarding the face tracking movement, the action we want people to perform can be described as follows:

  1. From a central position, users must rotate their head to either the left or the right (depending on the instructions we give), until a certain Euler Y angle (face angle relative to the Y axis) — let’s call it α — is reached. This angle can be α or -α, again depending on the movement orientation.
  2. After the α/-α angle is reached, users must turn their head back to the centre. However, we don’t need them to finish right at the perfect 0º, but instead at some β/-β closer to the centre than α (0º < β < α).
Head turn movement explanation

Here, the faceAngle property is the useful piece of data. We defined our α and β angles according to our backend processing requirements, but we decided to abstract away from this idea of angles as thresholds.

Instead, we created a LiveVideoProgressManager which maps an angle θ to a decimal number representing the progress, where 0 < progress < 1. This way, if our requirements change, we can restrict the changes to this LiveVideoProgressManager and leave the rest of our business logic untouched.
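A minimal sketch of what such a progress manager could look like; the threshold values below are placeholders, not our real α and β:

```kotlin
import kotlin.math.abs

// Illustrative progress manager: maps the face's Euler Y angle to a progress value.
// targetAngle plays the role of α, returnAngle the role of β (placeholder values).
class LiveVideoProgressManager(
    private val targetAngle: Float = 30f,
    private val returnAngle: Float = 10f
) {

    // Progress of the outward head turn, clamped to [0, 1].
    fun turnProgress(faceAngle: Float): Float =
        (abs(faceAngle) / targetAngle).coerceIn(0f, 1f)

    // True once the user has turned back close enough to the centre (|angle| <= β).
    fun hasReturnedToCentre(faceAngle: Float): Boolean =
        abs(faceAngle) <= returnAngle
}
```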

As soon as the face tracking action starts, we start feeding the face detector with frames coming from the camera (just as we explained for the face detection part). We also subscribe to the FaceDetectionData results coming from the detector and transform them into progress using our progress manager, ending up with live notifications of how much of the intended movement the user has achieved so far.
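Put together, the wiring might look something like the sketch below, where cameraFrames and updateProgressUi are assumed helpers and the other types come from the sketches above:

```kotlin
import android.graphics.Bitmap
import io.reactivex.Observable
import io.reactivex.android.schedulers.AndroidSchedulers
import io.reactivex.disposables.Disposable

// Illustrative pipeline: every camera frame goes through the detector, the detected
// angle is mapped to progress, and the UI is updated on the main thread.
fun startFaceTracking(
    cameraFrames: Observable<Bitmap>,          // assumed source of camera frames
    faceDetector: FaceDetector,
    progressManager: LiveVideoProgressManager,
    updateProgressUi: (Float) -> Unit          // assumed UI callback
): Disposable =
    cameraFrames
        .concatMap { frame -> faceDetector.detect(frame) }
        .map { data -> progressManager.turnProgress(data.faceAngle) }
        .observeOn(AndroidSchedulers.mainThread())
        .subscribe { progress -> updateProgressUi(progress) }
```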

Finally, we perform realtime updates in our UI to make sure the user understands that the progress is being tracked and whether they are getting closer to or further from their movement goal. For that, we used custom views together with the Canvas and ValueAnimator classes from the Android framework to provide a nice user experience.
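For illustration, a custom view animating the tracked progress could look roughly like this; the geometry, styling and animation duration are placeholders, not our actual UI:

```kotlin
import android.animation.ValueAnimator
import android.content.Context
import android.graphics.Canvas
import android.graphics.Color
import android.graphics.Paint
import android.graphics.RectF
import android.util.AttributeSet
import android.view.View

// Sketch of a custom view that animates between progress values and draws an arc.
class ProgressArcView @JvmOverloads constructor(
    context: Context,
    attrs: AttributeSet? = null
) : View(context, attrs) {

    private val paint = Paint(Paint.ANTI_ALIAS_FLAG).apply {
        style = Paint.Style.STROKE
        strokeWidth = 12f
        color = Color.GREEN
    }
    private var displayedProgress = 0f

    // Smoothly animate from the currently displayed progress to the new value.
    fun setProgress(progress: Float) {
        ValueAnimator.ofFloat(displayedProgress, progress).apply {
            duration = 200L
            addUpdateListener { animator ->
                displayedProgress = animator.animatedValue as Float
                invalidate()
            }
            start()
        }
    }

    override fun onDraw(canvas: Canvas) {
        super.onDraw(canvas)
        // Allocation in onDraw kept only for brevity in this sketch.
        val bounds = RectF(0f, 0f, width.toFloat(), height.toFloat())
        // Sweep proportionally to the progress, starting from the top of the oval.
        canvas.drawArc(bounds, -90f, 360f * displayedProgress, false, paint)
    }
}
```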

That’s it! I hope you enjoyed reading this overview of the development of our selfie video improvements. In the next post of this series we will dive into more detail about the solution and the technical challenges we faced along the way.

PS: In case you want to come work on this and other interesting features, we’re hiring!

Also, read Charlotte Sferruzza’s post about the motivation behind these improvements, from a UI/UX point of view:
