How crucial is context in AR?

What is context in AR and how does Selerio SDK help?

We recently announced the release of our SDK here and briefly introduced how the SDK exposes contextual information. In this post, we dig a little deeper into what context actually means and why it is useful in mobile AR.

As your phone moves through the world, its position and orientation need to be tracked relative to the physical world; this is often referred to as six-degrees-of-freedom (6-DoF) tracking. ARKit and ARCore provide this functionality by tracking feature points with the phone’s camera. They also expose horizontal and vertical surfaces detected in the scene (these may correspond to planes on the floor, walls, etc.). This spatial awareness allows us to anchor virtual content to the real world and takes us one step closer to fully immersive AR experiences.
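To make the idea concrete, a 6-DoF pose is just three translational and three rotational degrees of freedom. Here is a minimal, framework-free sketch (plain Python; the names and the simplified yaw-only rotation are ours, not any SDK's API) of how such a pose can anchor a point in the world:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    # 3 translational degrees of freedom (metres)
    x: float
    y: float
    z: float
    # 3 rotational degrees of freedom (radians)
    yaw: float
    pitch: float
    roll: float

def transform_point(pose: Pose6DoF, px: float, py: float, pz: float):
    """Rotate a point about the vertical (yaw) axis, then translate it
    into the pose's frame -- a deliberately simplified anchor placement."""
    c, s = math.cos(pose.yaw), math.sin(pose.yaw)
    rx = c * px + s * pz
    rz = -s * px + c * pz
    return (pose.x + rx, pose.y + py, pose.z + rz)

# A phone held half a metre up, turned 90 degrees:
camera = Pose6DoF(0.0, 0.5, 0.0, math.pi / 2, 0.0, 0.0)
# A point one metre "in front" lands one metre to the side:
anchored = transform_point(camera, 0.0, 0.0, 1.0)
```

A full implementation would use a rotation matrix or quaternion for all three rotational axes; the single-axis version above is just to show how position and rotation combine.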

Source: https://developers.google.com/ar/discover/concepts

However, the spatial awareness AR needs isn’t limited to generic planar surfaces. To create more realistic experiences, we also need information such as location, time, scene type, and the objects in the scene with their attributes (category, orientation, position, shape, material) and relationships; for example, a couch is on the ground, with a cushion on top. This is what we call contextual information, and it is the key to unlocking the true potential of AR. Let us illustrate with a short example:
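One way to picture contextual information is as a small scene graph: objects with attributes, connected by relations. The sketch below is purely illustrative (the structure and names are ours, not the SDK's data model), encoding the "couch on the ground, cushion on top" example:

```python
# Hypothetical scene graph for the couch-and-cushion example.
scene = {
    "location": "living room",
    "objects": {
        "couch":   {"category": "couch",   "material": "fabric"},
        "cushion": {"category": "cushion", "material": "fabric"},
        "floor":   {"category": "plane",   "orientation": "horizontal"},
    },
    "relations": [
        ("couch", "on", "floor"),    # a couch is on the ground...
        ("cushion", "on", "couch"),  # ...with a cushion on top
    ],
}

def supported_by(scene, obj):
    """Follow 'on' relations to list what ultimately supports an object."""
    chain, current = [], obj
    while True:
        nxt = next((b for a, r, b in scene["relations"]
                    if a == current and r == "on"), None)
        if nxt is None:
            return chain
        chain.append(nxt)
        current = nxt

# supported_by(scene, "cushion") walks cushion -> couch -> floor
```

Queries like `supported_by` are exactly what a virtual dog needs to "know" before it can jump onto the couch.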

Imagine designing an experience such as a virtual dog jumping on the couch or a hammer cracking the laptop screen (because: why not? Hulk angry! Hulk smash!). There is so much contextual information embedded in those two short sentences.

It is impossible to build such experiences with the tools available today without resorting to some very cumbersome workarounds.

How is this achieved today?

As mentioned in our earlier article about occlusion in AR, developers rely on a myriad of tricks, such as designing bespoke experiences for designated spaces. In addition:

  • Many developers use location APIs to tailor content to where the user is.
  • Apple and Vuforia have released object recognition APIs to take this further.
  • Unity is experimenting with its MARS project to support building against reality. We tested the alpha release and found it super cool; if you are experimenting with MARS, expect an integration plugin from us soon.
Source: Unity Unite Berlin Keynote ’18

However, we believe all of the above lack a key understanding of scene dynamics. Let’s walk through what has to happen to build the “cracking my laptop” experience:

  1. We need to be able to detect the laptop (read: object detection).
  2. Once the laptop is detected, we need to understand the position and orientation of the screen (read: 6-DoF pose).
  3. We need to attach an animation that exactly matches the position and scale of the laptop’s screen (read: scaled content).
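The three steps above can be sketched as a pipeline. Everything below is a stand-in: the function names, the hard-coded detection, and the assumed 0.34 m laptop-screen width are all illustrative, not the SDK's actual API or values:

```python
def detect_objects(frame):
    # Step 1: 2D object detection (a stub standing in for a real detector).
    # Returns a label and a 2D bounding box in pixels.
    return [{"label": "laptop", "box": (120, 80, 420, 310)}]

def estimate_pose(detection):
    # Step 2: lift the 2D detection to a 6-DoF pose (position + rotation).
    return {"position": (0.0, 0.75, -1.2), "rotation_deg": (0, 15, 0)}

def scaled_anchor(detection, pose, true_width_m=0.34):
    # Step 3: attach real-world scale so content matches the screen.
    x0, y0, x1, y1 = detection["box"]
    aspect = (y1 - y0) / (x1 - x0)
    return {**pose, "width_m": true_width_m,
            "height_m": true_width_m * aspect}

frame = object()  # placeholder for a camera frame
anchors = [scaled_anchor(d, estimate_pose(d)) for d in detect_objects(frame)]
```

The point of the sketch is the data flow: each stage adds the information the next one needs, ending with an anchor that carries label, pose, and true scale.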

In technical terms, we first want to integrate object detection into our application. Any developer who has integrated deep learning models for tasks such as object detection knows how painful this is. Worse, these models are restricted to 2D images and say nothing about objects in 3D. The next step is to figure out each object’s orientation and scale in the real world. Remember, all of this 3D information needs to be extracted from the 2D RGB camera image. Lastly, we need the object’s location in the physical world to curate the experience.
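Why is extracting 3D information from a 2D image so hard? The standard pinhole camera model makes it clear: projection divides by depth, so depth is lost. A small worked example (generic computer-vision math, with made-up focal length and principal point):

```python
def project(fx, fy, cx, cy, X, Y, Z):
    """Pinhole projection of a 3D point (metres) to 2D pixels.
    fx, fy: focal lengths; cx, cy: principal point.
    Depth Z is divided out, which is why a single RGB image cannot
    distinguish a large, far object from a small, near one."""
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return (u, v)

# A point 1 m away, and the same point scaled 2x at 2 m away,
# project to exactly the same pixel:
p_near = project(500, 500, 320, 240, 0.2, 0.1, 1.0)
p_far  = project(500, 500, 320, 240, 0.4, 0.2, 2.0)
```

Resolving that scale/depth ambiguity is precisely what requires extra knowledge, such as a learned prior over the true sizes of everyday objects.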

How did we “crack” this?

Again, no points if you guessed we have a solution. We used state-of-the-art computer vision research (years of our PhD work put to good use) to perform object detection and pose estimation with real-world scaling, doing the heavy lifting in the app. We figured out a clever way to train a single deep learning model to perform all of these tasks jointly; as a result, our model infers 1) object detection, 2) pose estimation, and 3) true-scale estimation in the physical world from 2D images.

Screen grab of the Selerio SDK’s real-time meshing and scene understanding, captured on an iPhone 7.

If you guessed that all of this runs on a GPU in the cloud, you couldn’t be more wrong. We managed to bake this model into our SDK and expose all of this information with on-device inference. It is fast (~100 ms inference time on an iPhone 7). Combined with our real-time meshing, this delivers the complete contextual information you need for your AR application. But hey, don’t take our word for it; take the sample app for a spin.

So, how good are our object detection and pose estimation?

We trained our object detection network on state-of-the-art image datasets, and it can recognize up to 80 commonly occurring everyday objects (full list). We are adding more shortly. If you would like to add custom objects, give us a shout and we would be glad to help.

Below are a few sample pose estimation predictions on images we tested from Pascal3D.

Pose estimation on a few images from the Pascal3D dataset, using inference from our trained model. Green cubes show the estimated pose from a given 2D camera image.

All of this is fine but what does it mean to me as an AR app developer?

Remember the experience we mentioned above, “cracking a laptop screen”? Let’s see how we can build it using our SDK on iOS (iOS only for now, sorry! Android will have a similar workflow). Designing this becomes as easy as:

  1. Extend your ViewController with our Delegate. You can see a sample here.
  2. Every time an object is detected, you are notified with an anchor object that carries geometry, position, orientation, and scale information.
  3. You can trigger your desired experience when the object anchor is detected (SceneKit to the rescue!).
Final result of the screen-cracking experience on an iPhone 7 using the Selerio SDK

Yes! It’s that easy. Go ahead, hammer away!
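In spirit, the delegate pattern from the three steps above looks like this. This is a language-agnostic sketch in Python, not the SDK's real Swift API; `ObjectAnchor`, `on_object_detected`, and all the values are our own illustrative stand-ins:

```python
class ObjectAnchor:
    """Stand-in for the anchor the SDK hands to the delegate."""
    def __init__(self, label, position, orientation_deg, scale):
        self.label = label
        self.position = position              # (x, y, z) in metres
        self.orientation_deg = orientation_deg
        self.scale = scale                    # real-world (w, h, d)

class MyController:
    """Plays the role of a ViewController adopting the delegate."""
    def __init__(self):
        self.triggered = []

    def on_object_detected(self, anchor):
        # Step 3: trigger the desired experience for matching objects.
        if anchor.label == "laptop":
            self.triggered.append(f"crack-screen at {anchor.position}")

controller = MyController()
controller.on_object_detected(
    ObjectAnchor("laptop", (0.0, 0.7, -1.0), (0, 10, 0), (0.34, 0.24, 0.01)))
```

On iOS the callback would hand you enough geometry to position a SceneKit node; the structure of the flow — adopt the delegate, receive anchors, react — is the same.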

Limitations:

This is the fine print to keep in mind while designing your AR application.

  1. Object location is accurate to about 5 cm.
  2. Pose is currently accurate to within about 22 degrees of rotation.
  3. Objects supported currently are: laptop, bottle, cup, bowl, chair, couch, and more.
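If you want to sanity-check whether the stated pose accuracy is good enough for your experience, a small helper like this (our own illustrative code, comparing two yaw angles) does the job:

```python
def rotation_error_deg(predicted_deg, true_deg):
    """Smallest angular difference between two yaw angles, in degrees,
    handling wrap-around at 0/360."""
    diff = abs(predicted_deg - true_deg) % 360
    return min(diff, 360 - diff)

def within_tolerance(predicted_deg, true_deg, tol_deg=22.0):
    # 22 degrees is the pose accuracy bound stated above.
    return rotation_error_deg(predicted_deg, true_deg) <= tol_deg

# within_tolerance(350, 10) -> True  (only 20 degrees apart, across 0)
# within_tolerance(90, 140) -> False (50 degrees apart)
```

An experience that snaps content to a surface can tolerate this error budget far more gracefully than one that must align pixel-perfectly with a screen edge.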

Of course, we are continuously working to improve our object detection and pose estimation metrics, so expect even better accuracy in subsequent versions. If you are interested in custom objects or improved pose estimation, get in touch (experimental, like cookies fresh from the oven).

Phew! That was a little longer than the previous post. If you have made it this far and have a question, why not join our Slack and ask?

If you are wondering how this is different from existing tools such as Vuforia and ARKit, keep an eye on the blog; we will be releasing articles to answer this. We invite you to join our closed beta and get access to the API documentation.