State of mobile AR development

Pavan Kumar · Published in Selerio
Feb 6, 2019 · 6 min read

What can you do with Vuforia, ARKit and Selerio?

If you have been following our recent articles, you may have read about how we solved a few roadblocks to mobile AR adoption. In this article, we describe the current state of AR development tools: the typical functionality available through the tools currently on the market, their shortcomings, and how you can leverage our SDK for some of your development needs.

For any kind of markerless AR, you essentially need these components:

  • World camera tracking: tracking the device’s position and orientation.
  • Planar surface tracking and detection (vertical and horizontal planes).
  • Scale and light estimation.

With these features, you will be able to anchor/augment content to the physical world and explore it as you navigate the space, as seen in the figure below.

Fig. 1 AR experience — explore content in the physical world

Fortunately, these features are exposed to developers through SDKs such as ARKit and ARCore, and content can be consumed via iOS and Android apps built on top of them. Recently, there have also been AR experiments in web browsers (WebAR).
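To make these building blocks concrete, here is a minimal ARKit sketch in Swift that turns on world tracking, horizontal and vertical plane detection, and light estimation (ARCore exposes equivalent concepts on Android). The view-controller boilerplate is illustrative and not taken from any sample referenced in this article.

```swift
import UIKit
import ARKit

// Minimal ARKit setup: world camera tracking, plane detection and
// light estimation -- the three building blocks listed above.
class WorldTrackingViewController: UIViewController, ARSCNViewDelegate {
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        sceneView.delegate = self

        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = [.horizontal, .vertical]
        configuration.isLightEstimationEnabled = true
        sceneView.session.run(configuration)
    }

    // Called whenever ARKit adds an anchor for a newly detected plane.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard let planeAnchor = anchor as? ARPlaneAnchor else { return }
        print("Detected plane with extent \(planeAnchor.extent)")
    }
}
```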

Since content consumption via AR as a medium is quite novel, it is currently done in the following ways:

  1. Designated spaces for games, experiences, etc.
  2. Triggering experiences via recognition of images (2D) or objects (3D).
  3. Triggering experiences using ultra-precise localisation from the AR Cloud. An excellent write-up about the AR Cloud by Ori Inbar can be found here.

So how can you consume content in AR today?
In the interest of brevity, we will not discuss the use of designated spaces at length here. This approach has been showcased extensively in the past. It essentially involves demarcated areas for AR interactions.

Next, we look at how AR experiences can be triggered using known features of the user’s environment such as 2D images or 3D objects. An example is a museum app where a virtual curator appears when the user points their device at a painting (2D). In another AR experience, virtual pieces might be placed on a board game when the player points their device at a game piece (3D). This is currently the most popular way to consume content in AR.

Source: NY Times

A recent campaign from the NY Times showcases Lakeith Stanfield’s Balancing Act in AR. When you place the book on the floor and point the camera at a specific page, a life-size hologram of the artist comes to life, and you can watch him teetering on a beam, high above, right in front of your eyes.

Such experiences are impressive and present novel ways of augmenting content. However, they are cumbersome to set up, and obstacles such as a couch in the living room can disrupt the experience. For truly immersive AR, imagine if the animated actor could actually infer information from the environment and interact with it, for example by dodging the chairs in your living room.

A third way of consuming content in AR is using ultra-precise location in a 3D map of the physical world. This 3D map is often referred to as the “AR Cloud”. Using an ultra-precise location (centimetre-level accuracy) in this map, we can anchor content to the physical world and bridge the gap between both worlds in a persistent manner. The AR Cloud focuses on solving two issues with regard to persistence in AR:

  1. Short-term persistence, e.g., when you answer a phone call, the content in your AR session persists.
  2. Long-term persistence, e.g., when you place content in the real world, it stays there forever, so others can consume it too with their own devices.
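ARKit’s ARWorldMap offers a small, on-device taste of this kind of persistence: a session’s map can be serialised and relocalised against later, or shared with another device. The Swift sketch below is a minimal illustration of that idea, not the AR Cloud itself; the file handling is illustrative boilerplate.

```swift
import ARKit

// Save the current session's world map so that anchored content can be
// restored later (short-term persistence) or shared with another device.
func saveWorldMap(from session: ARSession, to url: URL) {
    session.getCurrentWorldMap { worldMap, error in
        guard let map = worldMap else {
            print("World map unavailable: \(error?.localizedDescription ?? "unknown")")
            return
        }
        if let data = try? NSKeyedArchiver.archivedData(withRootObject: map, requiringSecureCoding: true) {
            try? data.write(to: url)
        }
    }
}

// Relocalise against a previously saved map; anchors stored in the map
// reappear at the same physical locations.
func restoreWorldMap(into session: ARSession, from url: URL) {
    guard let data = try? Data(contentsOf: url),
          let map = try? NSKeyedUnarchiver.unarchivedObject(ofClass: ARWorldMap.self, from: data)
    else { return }
    let configuration = ARWorldTrackingConfiguration()
    configuration.initialWorldMap = map
    session.run(configuration, options: [.resetTracking, .removeExistingAnchors])
}
```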

The AR Cloud promises the best possible experiences in AR. However, it is still in its infancy, and we will have to wait for it to evolve further.

How do you design such experiences currently?
Experiences that trigger content when specific images (2D) or objects (3D) are detected can be built with ARKit and Vuforia. Let us see how below.

With ARKit, your iOS app is distributed with one or more known 2D images, and ARKit tells you when and where those images are detected during an AR session. Vuforia provides Image Targets to achieve similar functionality.

Source: Lightseekers game

Full code sample — 2D image detection using ARKit.
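For reference, the core of such a sample boils down to loading the bundled reference images and reacting to the anchors ARKit creates when they are detected. The sketch below is a minimal illustration, not the full sample linked above; it assumes the reference images live in an asset catalog group named "AR Resources".

```swift
import UIKit
import ARKit
import SceneKit

// Detect known 2D images bundled with the app and place a simple overlay
// on each detected image.
class ImageDetectionViewController: UIViewController, ARSCNViewDelegate {
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        sceneView.delegate = self

        guard let referenceImages = ARReferenceImage.referenceImages(
            inGroupNamed: "AR Resources", bundle: nil) else { return }

        let configuration = ARWorldTrackingConfiguration()
        configuration.detectionImages = referenceImages
        sceneView.session.run(configuration)
    }

    // Called when one of the reference images is recognised in the camera feed.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard let imageAnchor = anchor as? ARImageAnchor else { return }
        let size = imageAnchor.referenceImage.physicalSize
        let overlayNode = SCNNode(geometry: SCNPlane(width: size.width, height: size.height))
        overlayNode.eulerAngles.x = -.pi / 2   // lie flat on top of the detected image
        node.addChildNode(overlayNode)
    }
}
```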

A more compelling and immersive experience can be achieved by detecting 3D objects in the user environment. For example, interactive 3D visualizations can appear when the user points their device at a displayed sculpture or artifact in a museum.

Source: Apple WWDC Keynote

The process to achieve such experiences roughly follows this pattern:

  1. Scan the 3D objects to capture key descriptors. The picture below (Fig. 2) shows the capture process.
  2. Build and distribute an AR app with the captured descriptors.
  3. The phone detects the objects in the physical world using the descriptors and triggers the intended AR experience.
Fig. 2 Apple’s 3D object capture process shown on the left, an example from Vuforia on the right

Full code samples — 3D object detection using ARKit, using Vuforia
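The detection step itself is only a few lines once the .arobject files produced by the scanning stage are bundled with the app. The ARKit sketch below is a minimal illustration, not the full samples linked above; the asset catalog group name "Gallery Objects" is an assumption.

```swift
import UIKit
import ARKit

// Detect previously scanned 3D objects distributed with the app and
// trigger content when one is recognised in the physical world.
class ObjectDetectionViewController: UIViewController, ARSCNViewDelegate {
    @IBOutlet var sceneView: ARSCNView!

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        sceneView.delegate = self

        guard let referenceObjects = ARReferenceObject.referenceObjects(
            inGroupNamed: "Gallery Objects", bundle: nil) else { return }

        let configuration = ARWorldTrackingConfiguration()
        configuration.detectionObjects = referenceObjects
        sceneView.session.run(configuration)
    }

    // Called when a scanned object is detected; attach the experience to `node`.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard let objectAnchor = anchor as? ARObjectAnchor else { return }
        print("Detected \(objectAnchor.referenceObject.name ?? "object")")
    }
}
```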

Limitations:
Although very impressive, these approaches rest on a few underlying assumptions that are potential blockers for widespread AR adoption.

  1. There are no interactions between the virtual content and the space around the detected object (e.g., occlusion).
  2. As the number of 3D objects to detect increases, or as the objects need updating over time, distribution at scale becomes a challenge.

What do we do differently, and where do we fit?
Let’s look at our process. The Selerio SDK enables you to have any number of experiences in a single app. All of this is made possible by training deep learning networks on large datasets of images and 3D models. We ingest 2D images or 3D objects into our machine learning (ML) pipeline to generate ML models. Using these models, our SDK lets you not only detect objects in the real world but also enable interactions between them.

We can currently detect about 80 common objects on-device, for free. For custom object detection, we rely on our cloud-based solution. This means that you do not need to ship your app with data on the objects of interest.

For 3D recognition, once an object is detected in the world, we retrieve its 3D model and apply the computed 6-DoF pose to accurately match the detected object. This means that you do not need an a-priori scanning stage, which is how we enable designing unlimited experiences, i.e., scalability.

To show how this all works, let us take a scenario in which a virtual character interacts with a physical object in the real world: for example, a virtual bunny sits on a chair. For this to happen, the app needs to (a rough code sketch follows below):

  1. Map the terrain, e.g., the office space shown below
  2. Detect a chair, retrieve the 3D model and apply a computed 6-DoF pose
  3. Make the virtual character (bunny) walk to the chair on the mapped terrain and sit down
In-app footage of an AR experience using the Selerio SDK
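To make the flow concrete, here is a rough Swift sketch of the application-side logic. The types and calls below (SceneUnderstanding, DetectedObject, the straight-line walk) are hypothetical placeholders for illustration only and are not the actual Selerio SDK API.

```swift
import ARKit
import SceneKit

// Hypothetical interface for the scene-understanding output described above.
// These names are illustrative placeholders, not the real SDK types.
protocol SceneUnderstanding {
    /// Terrain mesh reconstructed from the mapped space (step 1).
    var terrain: SCNNode { get }
    /// Objects recognised in the scene, each with a retrieved 3D model
    /// already placed at its estimated 6-DoF pose (step 2).
    var detectedObjects: [DetectedObject] { get }
}

struct DetectedObject {
    let label: String          // e.g. "chair"
    let modelNode: SCNNode     // retrieved 3D model, posed in world space
}

// Step 3: move the virtual bunny to the detected chair and "sit" it down.
// A straight-line move stands in for proper path-finding over the terrain.
func placeBunny(_ bunny: SCNNode, using scene: SceneUnderstanding) {
    guard let chair = scene.detectedObjects.first(where: { $0.label == "chair" }) else { return }
    let seat = chair.modelNode.worldPosition
    let walk = SCNAction.move(to: seat, duration: 3.0)
    let turn = SCNAction.rotateBy(x: 0, y: .pi, z: 0, duration: 0.5)
    bunny.runAction(.sequence([walk, turn]))
}
```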

Note that the SDK did not require any a-priori scanning of the chair. To train the scene-understanding AI above, i.e., to recognize the chair and retrieve its 3D model (and similarly for the other 80 objects), we used publicly available datasets. For detecting custom objects in 3D, our ingestion pipeline will handle the integration into the app seamlessly.

If you have any questions about how the ML training pipeline works or how you can integrate this with 2D images and 3D objects, please get in touch with us. Join our Slack to ask for clarifications.

We invite you to join our closed beta and get access to the API documentation.
