AR is Empty — where are all the people?

Introducing Human-Centric Augmented Reality — how it works and why it’s the next step
Apple’s ARKit understands surfaces but misses out on people (source: Spilly)

In the late 1960s, universities around the world began asking whether computers would one day be able to see and, in particular, whether they could see and interact with human beings. This quest led to the creation of a new field known as “computer vision”. Interest surged in the early 1990s: by ’91 researchers at MIT had a machine detecting a human head, and by ’98 a team at Manchester University had one detecting facial features. But it wasn’t until 2013, six years after the emergence of the smartphone and three years after the sudden emergence of deep learning AI technology in the form of neural networks, that a team at KTH Stockholm was able to get this running on mobile.

Facial alignment tracking — Manchester University ’98, 17 years before Snapchat Lenses

Up to this point the work had been entirely academic, but in 2014 a Ukrainian startup known as “Looksery” used this technology to create a digital makeup selfie app for consumers that saw over a million downloads. Snapchat, seeing an even bigger opportunity, acquired the company. Six months on, Looksery had become Snapchat’s now-famous “Lenses” product. Facebook, sensing the need to match its rival feature for feature, followed up by acquiring the team behind the app “MSQRD” early the following year. Human-centric computer vision, applied in public form as “Selfie AR”, suddenly had worldwide appeal and had become a key weapon in the battle for attention between the two social media heavyweights.

From Looksery to Snapchat Lenses (source: TechCrunch)

In 2017, Apple and Snapchat both premiered SLAM-based (simultaneous localization and mapping) technology, “ARKit” and “World Lenses” respectively, that allows digital objects to be placed on surfaces, while Facebook opened up its “AR Studio” for designers to create their own filters. None of these efforts is said to have translated into a runaway hit with users.

So what comes next? For us, it’s the emergence of a new kind of Social AR (detailed here) that will not only bridge the gap between the Selfie AR stage and emergent, glasses-based AR, but whose underlying technology is also likely to be a critical component for years to come. To make this possible, we needed to develop a neural network to detect and track people in real time on mobile, in all configurations, not just for selfies. This presented a large set of challenges for us.

Timeline of Consumer AR — origins to 2020 and beyond

Tracking with the front camera for Selfie AR is essentially a single special case of all the possible cases you could have when identifying and tracking a person. Going from front to back camera exposed us to many of these other cases, among them:

  1. The subject is much more likely to be off centre relative to the camera
  2. They can appear at different distances / sizes
  3. They are often not facing the camera, so we can’t just look for faces; we also have to look for the backs of heads, hair, hats and all sorts of other features
  4. There are often multiple people in the shot

These were all things we had to overcome in order for our technology to work. So what does the technology do exactly? We can break it into four parts.

1. Multi-person head & body detection

We detect multiple people’s heads and bodies simultaneously in real time.

Given the camera image of the user, the application needs to identify areas in the image showing heads and their corresponding bodies.

What does this enable? The head size lets us estimate how far the person is from the camera, and the body gives us an anchor that keeps any visual information attached to the person as they move.
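To make the head-size idea concrete, here is a minimal sketch in Python using the standard pinhole camera relation (distance = focal length × real size / size in the image). It is an illustration under stated assumptions, not a description of our actual pipeline: the function name, the 0.23 m average head height and the focal length value are all hypothetical placeholders.

```python
# Rough distance estimate from the height of a detected head box.
# ASSUMPTIONS (not from the article): pinhole camera model, an average
# real-world head height of about 0.23 m, and a focal length in pixels.

ASSUMED_HEAD_HEIGHT_M = 0.23  # hypothetical average, chin to top of head


def estimate_distance_m(head_height_px: float,
                        focal_length_px: float,
                        real_head_height_m: float = ASSUMED_HEAD_HEIGHT_M) -> float:
    """Return the approximate camera-to-person distance in metres."""
    if head_height_px <= 0:
        raise ValueError("head_height_px must be positive")
    # Similar triangles: real_height / distance == pixel_height / focal_length
    return focal_length_px * real_head_height_m / head_height_px


# A head that appears half as tall is roughly twice as far away.
near = estimate_distance_m(head_height_px=230, focal_length_px=1000)
far = estimate_distance_m(head_height_px=115, focal_length_px=1000)
print(round(near, 2), round(far, 2))
```

The estimate is only as good as the assumed head height, so in practice it is better suited to relative ordering (who is nearer, who is farther) than to precise measurement.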

2. Scene/shot persistent individual tracking

We compare the information of the multi-person head and body detection over multiple frames in order to track the movement and identity of the people in the scene. This enables us to have visual information anchored to a specific individual even when they are surrounded by other people and even if they leave the camera view and reenter it.
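One simple way to compare detections across frames is to match each new bounding box to the existing track it overlaps most, measured by intersection over union (IoU). The sketch below shows that technique in Python; it is a stand-in for illustration, not our production tracker. The class name, the 0.3 threshold and the box format are assumptions, and recognising someone who leaves and re-enters the view would need additional appearance cues that this sketch does not include.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


class IouTracker:
    """Hypothetical greedy tracker: best-overlap matching, frame to frame."""

    def __init__(self, min_iou: float = 0.3) -> None:
        self.min_iou = min_iou            # assumed threshold for a match
        self.tracks: Dict[int, Box] = {}  # track id -> last known box
        self._next_id = 0

    def update(self, detections: List[Box]) -> List[int]:
        """Return one track id per detection, in the same order."""
        # Score every (track, detection) pair, highest overlap first.
        pairs = sorted(
            ((iou(box, det), tid, i)
             for tid, box in self.tracks.items()
             for i, det in enumerate(detections)),
            reverse=True,
        )
        assigned: Dict[int, int] = {}  # detection index -> track id
        used_tracks = set()
        for score, tid, i in pairs:
            if score < self.min_iou:
                break  # remaining pairs overlap too little to match
            if tid in used_tracks or i in assigned:
                continue
            assigned[i] = tid
            used_tracks.add(tid)

        ids = []
        for i, det in enumerate(detections):
            tid = assigned.get(i)
            if tid is None:  # nobody matched: start a new identity
                tid = self._next_id
                self._next_id += 1
            self.tracks[tid] = det
            ids.append(tid)
        return ids


tracker = IouTracker()
print(tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)]))
# Same two people, slightly moved and listed in the opposite order.
print(tracker.update([(101, 100, 111, 110), (1, 0, 11, 10)]))
```

Because identities follow overlap rather than list position, an effect anchored to one track id stays with the same person even when the detector returns people in a different order.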

3. Individual background & all-body segmentation

For each tracked person, we further classify which pixels belong to their face, skin, hair, clothes and the background (the last of which we leave out). This gives us clear separations in the form of a series of layers that we can use for advanced blending of AR effects. This would otherwise only be possible with light-field or depth-sensing capture technology, neither of which is readily available on smartphones.
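As a rough illustration of what "layers" means here, the Python sketch below splits a per-pixel label map into one binary mask per class and then blends an effect into a single layer only. The label ids, function names and the greyscale nested-list image format are assumptions made for the example; a real implementation would work on camera frames and network outputs.

```python
from typing import Dict, List

Grid = List[List[float]]

# Hypothetical label ids; the article names the classes but not their encoding.
LABELS = {"background": 0, "face": 1, "skin": 2, "hair": 3, "clothes": 4}


def split_layers(label_map: List[List[int]]) -> Dict[str, Grid]:
    """Turn a per-pixel label map into one 0/1 mask per class.

    The background class is dropped, matching the description above.
    """
    return {
        name: [[1 if px == idx else 0 for px in row] for row in label_map]
        for name, idx in LABELS.items()
        if name != "background"
    }


def blend(frame: Grid, effect: Grid, mask: Grid, opacity: float) -> Grid:
    """Alpha-blend `effect` over `frame`, but only where `mask` is 1."""
    return [
        [(1 - opacity * m) * f + opacity * m * e
         for f, e, m in zip(frame_row, effect_row, mask_row)]
        for frame_row, effect_row, mask_row in zip(frame, effect, mask)
    ]


# Tiny 2x2 example: tint only the hair pixel at 50% opacity.
label_map = [[0, 3],
             [1, 4]]
layers = split_layers(label_map)
frame = [[10, 10], [10, 10]]
effect = [[20, 20], [20, 20]]
print(blend(frame, effect, layers["hair"], opacity=0.5))
```

Keeping each class as its own mask is what lets an effect touch the hair without bleeding onto the face or clothes.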

4. Editor

We trained our neural networks specifically to produce these layers in a way that any designer can easily interact with and manipulate. Because the networks are based on simple math, it is easy for us to have them running at equal quality on both desktop and mobile. This allows designers to quickly iterate and design the visual effects powering our Spilly apps using our custom editor.

We’ve seen how this can be done. Now let’s look at some use cases:

  • Our three little social AR apps: people are encouraged to get together and act as their favorite stars, play pranks back and forth, and more.
  • Gaming experiences: people can now become characters in gameplay that are targetable and visually manipulable, e.g. reacting to being hit with an attack, with avatars or sidekick characters that are unique to them.
  • Fashion applications: placing outfits/filters on the body of a person, both for fun and to “try out” and purchase.
  • Putting yourself into 3rd party content: with our segmentation, users can place their moving heads on any person in any video, all tracked in real time. Anyone up for starring in their favorite movie?

And there is a lot more to come. In the glasses-based future, people are likely to be contextual triggers for a whole host of interactions. Initially these will be indoors, such as people-specific reminders (ask your husband to do X), personal details (profession, compatible interests, etc.) or richer gaming experiences (imagine a puzzle game with the person as the canvas that starts and ends whenever you see them). Slightly later we’re likely to see outdoor interactions involving commercial transactions, like in-person payment for “Craigslist” items, as well as visual augmentations: expect the same motivations behind Tumblr and Pinterest to be extended to your person.

In glasses-based AR, people will act as contextual triggers for a whole host of information and interactions

In short, people run the world, the world is powered by technology, and this advance in human-centric vision technology will only bind the two closer together. We are well on our way to a future where smartphone-based commerce, entertainment and self-expression are set to explode off the screen and into our world. We will need to proceed cautiously, but the benefits for all are clear.