By Robert Bragg, Senior Software Engineer at Impossible Labs
Since joining Impossible Labs recently, I’ve jumped straight in at the deep end with Glimpse:
Until physicists start hacking space time at human scales, we are unfortunately constrained by those pesky newtons in how we present ourselves. Add social conformance to the mix and we might not pick that Wonder Woman outfit in the morning. Glimpse is about extending self expression without real-world limits.
A system such as Glimpse needs to be able to recognise who we’re looking at, and to start with we are going to base that on facial features. We also want to be able to track mouth and limb movements, so that we can animate 3D avatars in sync with someone’s real body.
When it comes to understanding the face we can break that down into face detection (“oh look there’s a face”), recognition (“oh look it’s Pete”) and landmark detection (“Pete, you’re gawping!”).
Face detection is a good place to start, since it is a prerequisite for landmark detection, which is itself a prerequisite for face recognition. We also expect skeletal tracking (tracking a person’s body, arms and legs) to be simpler if we already know where the head is.
Methods for real-time face detection (Oh I’m sorry, is that your face?)
Support for real-time face detection is quite commonplace these days. The Viola-Jones cascaded classifier method was a major advance back in 2001, and since 2005 Histogram of Oriented Gradients (HOG) methods have proven even more effective.
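To give a feel for what a HOG descriptor measures, here is a minimal sketch of the core step: building a histogram of gradient orientations for one small cell of a grayscale image. The function name `cell_histogram`, the 9-bin layout and the nested-list image format are assumptions made for illustration. This is deliberately simplified (no vote interpolation, no block normalisation) and is not Dlib's implementation.

```python
import math


def cell_histogram(cell, bins=9):
    """Histogram of unsigned gradient orientations for one grayscale cell.

    `cell` is a list of rows of pixel intensities. Border pixels are skipped
    because the central-difference gradient needs a neighbour on each side.
    """
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins  # unsigned orientations span 0..180 degrees

    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the horizontal/vertical gradient.
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            # Fold the angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


# A vertical edge: intensity changes left to right, so the gradient is horizontal.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
# A horizontal edge: intensity changes top to bottom, so the gradient is vertical.
horizontal_edge = [[0] * 4, [0] * 4, [10] * 4, [10] * 4]

vertical_hist = cell_histogram(vertical_edge)
horizontal_hist = cell_histogram(horizontal_edge)
```

With these toy inputs, the vertical edge puts all of its weight in the first bin (orientations near 0 degrees) and the horizontal edge puts all of its weight in the middle bin (orientations near 90 degrees). A full descriptor would concatenate many such cell histograms across a detection window.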
These methods first rely on some machine learning techniques to train a system based on labeled images of faces and in the case of HOG methods we can visualize the resulting representation used for detection like this:
Looking around for existing, usable implementations quickly led to Dlib, which has a well-respected, open-source implementation of fast HOG-based object detection, including a pre-trained face detector model. That made it a natural first choice to investigate. Dlib also has a fast implementation of face landmark detection that will pinpoint these 68 features:
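As a rough illustration of how the detector and landmark predictor fit together, here is a minimal sketch using Dlib's Python bindings. The post does not say which bindings were used on the phone, so treat the choice of Python as an assumption. The helper names `shape_to_points` and `detect_landmarks` are hypothetical, and the default model path assumes the 68-point landmark model file commonly distributed alongside Dlib's examples has been downloaded separately.

```python
def shape_to_points(shape):
    """Convert a landmark result into a list of (x, y) tuples.

    Accepts any object exposing `num_parts` and `part(i)` with `.x` and `.y`,
    which is the interface of dlib's full_object_detection.
    """
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


def detect_landmarks(image_path,
                     model_path="shape_predictor_68_face_landmarks.dat"):
    """Return one list of landmark points per face found in the image.

    `model_path` is an assumption: it must point at a trained landmark model.
    """
    # Imported lazily so shape_to_points stays usable without dlib installed.
    import dlib

    detector = dlib.get_frontal_face_detector()   # pre-trained HOG face detector
    predictor = dlib.shape_predictor(model_path)  # landmark model loaded from disk
    image = dlib.load_rgb_image(image_path)

    # Second argument is the number of times to upsample before detecting;
    # 0 keeps the image at its original size, which is the cheapest option.
    faces = detector(image, 0)
    return [shape_to_points(predictor(image, face)) for face in faces]
```

Calling `detect_landmarks("photo.jpg")` (a hypothetical file name) would return one list per detected face, each holding 68 points when the 68-point model is used.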
However, “fast” face detection is relative. Anecdotal claims that these algorithms are ‘fast’ usually refer to them running on PC-class hardware, but we want them to run on mobile phones. My first (naive) cut at running Dlib’s face detector on a Lenovo Phab 2 took about 20 seconds to process a single camera frame. Yikes!
After instrumenting Dlib to measure where time was being spent, it became apparent that there was some low-hanging fruit to pick. Without too much trouble we can now process a single camera frame in about 85 milliseconds, or about 12 updates per second. Incidentally, more than half of that time is spent just rotating and downsampling the large 1920x1080 camera frames we get from Google Tango, before Dlib handles the actual detection in about 30 milliseconds and landmark detection in about 10 milliseconds.
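The arithmetic behind those figures, together with the two preprocessing steps, can be sketched as follows. This is a pure-Python illustration rather than the code that runs on the phone. The timings are the approximate numbers quoted above. The box-filter downsample, the clockwise rotation direction and the function names are assumptions, since the post does not say how either step is implemented.

```python
def updates_per_second(frame_ms):
    """Convert a per-frame processing time in milliseconds to a frame rate."""
    return 1000.0 / frame_ms


def downsample(gray, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor boxes."""
    out_height = len(gray) // factor
    out_width = len(gray[0]) // factor
    area = factor * factor
    result = []
    for by in range(out_height):
        row = []
        for bx in range(out_width):
            total = 0
            for dy in range(factor):
                for dx in range(factor):
                    total += gray[by * factor + dy][bx * factor + dx]
            row.append(total / area)
        result.append(row)
    return result


def rotate90_clockwise(gray):
    """Rotate a grayscale image (list of rows) by 90 degrees clockwise."""
    # Reversing the rows and then transposing gives a clockwise rotation.
    return [list(row) for row in zip(*gray[::-1])]


# Approximate timings quoted in the post, in milliseconds.
TOTAL_MS = 85
DETECT_MS = 30
LANDMARK_MS = 10
# Whatever is left over is attributed to rotating and downsampling.
PREP_MS = TOTAL_MS - DETECT_MS - LANDMARK_MS

prep_share = PREP_MS / TOTAL_MS
rate = updates_per_second(TOTAL_MS)
```

Here `PREP_MS` comes out at roughly 45 milliseconds, which is a little over half of the 85 millisecond total. One possible design choice, offered only as a suggestion, is to downsample before rotating so that the rotation touches fewer pixels: with a hypothetical factor of 4, a 1920x1080 frame becomes 480x270 before it is rotated.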
Next steps: Tracking body movements
There is still lots of opportunity to speed this up, but it’s also good enough for now that we can start looking ahead to some of the other interesting challenges for Glimpse.
We are now reviewing the state of the art in research around skeletal tracking. Microsoft’s research on real-time skeletal tracking using depth images might provide a solid base, but more recent research on skeletal tracking using standard RGB cameras offers the tantalising prospect that we might not require specialized depth cameras.
We’ll keep posting updates on Glimpse’s progress here, so keep your eyes peeled.