Signatures: Augmented Reality at the world scale

HERE Technologies is first and foremost a mapping company: a company that captures reality and represents it as a map, a format as old as civilization itself. While maps aren't going away, the methods we use to capture reality, the ways we represent it, and how we interact with those representations are evolving quickly.

A prime example is Augmented Reality. AR superimposes information directly on top of what a person is currently seeing. Fitting into the paradigm above: in AR we capture what a person sees, connect it to a representation of the world, and modify their view based on that representation.

Most AR today relies on 3-D point-cloud maps of visually distinctive descriptors. A user aims their camera at a scene, the scene is decomposed into descriptors, and those descriptors are matched against the 3-D point-cloud map. Once the user's pose relative to the map is established, we can map back to the original 2-D scene and project elements onto the user's view.
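For concreteness, here is a minimal sketch of that classic pipeline using OpenCV. The ORB descriptor, the brute-force matcher, and the localize_frame function are illustrative assumptions, not a description of any particular production system.

```python
# A minimal sketch of classic point-cloud AR localization: extract descriptors,
# match against a 3-D descriptor map, and recover the camera pose with PnP.
import cv2
import numpy as np

def localize_frame(frame_gray, map_points_3d, map_descriptors, K):
    """Estimate the camera pose of one frame against a 3-D descriptor map."""
    orb = cv2.ORB_create()
    keypoints, descriptors = orb.detectAndCompute(frame_gray, None)

    # Match 2-D frame descriptors against descriptors attached to 3-D map points.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(descriptors, map_descriptors)

    pts_2d = np.float32([keypoints[m.queryIdx].pt for m in matches])
    pts_3d = np.float32([map_points_3d[m.trainIdx] for m in matches])

    # Solve the perspective-n-point problem to recover the camera pose.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    return (rvec, tvec) if ok else None
```

With the recovered pose, 3-D overlay anchors can then be projected back into the frame (e.g., with cv2.projectPoints) to render the augmentation.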

While that AR process is feasible at small scale, scaling it to the size of a town, a city, or beyond is far from obvious. There are two main reasons for this. The first is the underlying 3-D point-cloud map representation. The point clouds are built using a process called bundle adjustment, which orients and aligns multitudes of images and descriptors into a coherent map. This process is compute-intensive and iterative, needing significant resources to converge, hopefully, to an accurate solution. The second difficulty is deployment. Ideally, visual positioning would run entirely on an edge device, but it is also computationally expensive and is typically done partially server-side for larger locations.
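For context, the bundle adjustment mentioned above jointly refines all camera poses and 3-D points by minimizing reprojection error over every observation; in its standard form the objective is roughly

```latex
% x_{ij}: observation of point j in image i; \pi: camera projection with
% intrinsics K_i; \rho: a robust loss. This is the textbook formulation,
% not a HERE-specific objective.
\min_{\{R_i, t_i\},\, \{X_j\}} \; \sum_{i,j} \rho\!\left( \left\| x_{ij} - \pi\!\left( K_i \left( R_i X_j + t_i \right) \right) \right\|^2 \right)
```

which has to be solved over every image and every point in the map, hence the cost of keeping such maps fresh at city scale.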

Signatures

To do AR at the scale of the world, the scale of HERE's maps, we need to rethink things. Three-dimensional representations seem like a natural transition from typical 2-D maps, and hence the natural framework for AR. But as we saw above, they are hard to scale. In addition, it is not obvious that they are necessary. Most people operate and navigate the real world without carrying 3-D maps in their heads. We recognize what we're looking at and navigate using landmarks, perhaps aided by an internal map of relationships between places.

From this realization we came up with the idea for Signatures. What if we could have small, portable representations of the places and things a user might see? Suppose these representations could be efficiently applied to an image to detect whether a place was there and where in the image it was. What possibilities does this open up?

A Signature allows us to detect a specific place in an image. It is a small digital footprint of the place that is recognizable across different images.

Navigation Application

Let's consider one application: last-meter navigation. Here a user is navigating to a destination. They have a dashcam active, and on their screen they see the outside scene with useful information superimposed. As they pass guidance features, for example "make a right at Joe's Pizza", or as they approach their final destination, the screen highlights the relevant places.

Dashcam AR Navigation application.

Signatures make such an application possible. The application would regularly send a server its approximate location and the nearby places it needs, and the server would respond with Signatures for those places. The application would then detect those places in images directly on the edge device and augment the user's view of the outside accordingly. All in all, this is a lightweight and robust process. The Signatures themselves are precalculated and on the order of kilobytes, requiring minimal bandwidth and minimal server-side processing. At no point do we need accurate positioning; even a noisy GPS signal suffices.

Signatures require little bandwidth and are robust to positioning errors.
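To make that data flow concrete, below is a minimal client-side sketch in Python. The endpoint URL, payload fields, and the detect_place and draw_highlight callables are hypothetical; the point is simply that a coarse position goes out, small Signatures come back, and all recognition happens on the device.

```python
# A minimal sketch of the dashcam client loop described above.
import requests

SIGNATURE_SERVICE = "https://signatures.example.com/lookup"  # hypothetical URL

def fetch_signatures(lat, lon, place_ids):
    """Request precomputed Signatures for places near an approximate location."""
    resp = requests.get(SIGNATURE_SERVICE,
                        params={"lat": lat, "lon": lon, "places": ",".join(place_ids)})
    resp.raise_for_status()
    return resp.json()  # e.g. {place_id: signature}, a few kilobytes per place

def annotate_frame(frame, signatures, detect_place, draw_highlight):
    """Run on-device detection for each Signature and overlay any hits."""
    for place_id, signature in signatures.items():
        box = detect_place(frame, signature)  # edge-side model, no server call
        if box is not None:
            draw_highlight(frame, box, place_id)
    return frame
```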

This is but one of many potential applications. For example, Signatures could be used for indoor navigation. Rather than building a 3-D map of a venue and finding a way to do indoor positioning, a non-trivial task, we can instead use a handheld device to recognize places and things indoors. If we know the relative relationships between places, e.g., the men's clothing store is to the left of the candy shop, that is enough to navigate indoors. This is very similar to how people give directions, but with the added benefit that the device recognizes the landmarks and knows the relationship map between them.

An illustration of how Signatures could be used to navigate in an indoor venue.
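As a toy illustration of navigating by relationships rather than metric positions, a venue can be represented as a small adjacency graph between recognizable landmarks. The venue layout and place names below are made up; any shortest-path routine over such a graph yields landmark-to-landmark directions.

```python
# Navigating by landmark relationships: a tiny adjacency map plus BFS.
from collections import deque

VENUE_GRAPH = {  # hypothetical "X is next to Y" relationships
    "entrance": ["candy shop"],
    "candy shop": ["entrance", "men's clothing"],
    "men's clothing": ["candy shop", "food court"],
    "food court": ["men's clothing"],
}

def route(start, goal, graph=VENUE_GRAPH):
    """Breadth-first search over landmark adjacencies."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# route("entrance", "food court")
# -> ['entrance', 'candy shop', "men's clothing", 'food court']
```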

Encoding Architecture

Many pieces go into making Signatures. Some belong to the data-gathering stage, such as identifying places to encode and annotating them semantically. In this section, though, we will focus on the high-level architecture of how Signatures are encoded and used.

A Signature needs to be small and portable, but it also needs to be robust to variations in pose, scale, and observation conditions. To achieve these properties we use a two-stage architecture. Conceptually we might think of a one-to-one mapping between a Signature and a place, but in practice each place can have multiple Signatures depending on the observer's viewpoint. In fact, we think of Signatures as having two levels: one an encoding used to match a user's viewpoint to similar previously seen viewpoints, and the other an encoding of a place that is specialized for that approximate viewpoint.

Signatures have two levels. The first is to match a user's query image to previously seen viewpoints. The second is to recognize a place from a similar viewpoint.

The first-level Signatures are closely related to image or instance retrieval, for which there is a broad body of ongoing and historical research. We will therefore focus on the second level, what we call Image Demarcation (ID) Signatures. ID Signatures are used to find a place (or thing) in an image, assuming the Signature was encoded for a similar image, as in the following figure.

Conceptually, the high-level architecture for learning and using Signatures is summarized in the following diagram. A Backbone network extracts features from a query image, and a Demarcate network identifies the place in the query image. Signatures are produced by a specialized encoder, and the Signature is incorporated into the computation of the Backbone and Demarcate networks. This incorporation can take many forms: it can be additional inputs applied to the demarcation process; it can be a feature mask that is correlated or matched against the Backbone's features; or it can even be aspects of a hypernetwork, where parts of the Signature represent weights or parameters of the other two networks.
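As an illustrative sketch, here is one way the feature-mask variant could look in PyTorch: the Signature is projected into the Backbone's feature space and correlated with every spatial location before the Demarcate head produces a mask. The layer sizes and the exact conditioning mechanism are assumptions for illustration, not HERE's implementation.

```python
# A minimal Backbone + Signature-conditioned Demarcate sketch.
import torch
import torch.nn as nn

class SignatureDemarcator(nn.Module):
    def __init__(self, feat_dim=128, sig_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.sig_proj = nn.Linear(sig_dim, feat_dim)   # map Signature into feature space
        self.demarcate = nn.Sequential(                # turn correlation map into a mask
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),
        )

    def forward(self, image, signature):
        feats = self.backbone(image)                   # (B, C, H, W)
        template = self.sig_proj(signature)            # (B, C)
        # Correlate the Signature template with every spatial location.
        corr = (feats * template[:, :, None, None]).sum(dim=1, keepdim=True)
        return torch.sigmoid(self.demarcate(corr))     # per-pixel "place is here" score

# model = SignatureDemarcator()
# mask = model(torch.randn(1, 3, 256, 256), torch.randn(1, 128))  # (1, 1, 64, 64)
```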

Training in this architecture is a kind of co-training, as the different pieces ultimately need to work together, even though in the field the Signatures will be precalculated and the Signature encoder is typically used offline. Different training schedules can be used to maximize efficiency and improve generalization: the initial goal is to produce accurate, crisp demarcation, and only later to consider the variations in image properties an edge device might run into.
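A rough sketch of such a staged schedule is shown below, reusing a demarcator like the one above together with a jointly trained Signature encoder. The two phases, the loss, and the augmentations are illustrative assumptions, not a published recipe.

```python
# An illustrative two-phase co-training loop: crisp demarcation first,
# then robustness to varied imaging conditions via augmentation.
import torch
import torch.nn.functional as F
import torchvision.transforms as T

PHASES = [
    {"epochs": 20, "augment": T.Compose([])},                      # phase 1: clean images
    {"epochs": 20, "augment": T.ColorJitter(0.4, 0.4, 0.4, 0.1)},  # phase 2: condition changes
]

def train(signature_encoder, demarcator, loader, optimizer, device="cpu"):
    for phase in PHASES:
        for _ in range(phase["epochs"]):
            for ref_image, query_image, target_mask in loader:
                # Encode the Signature from a reference view of the place...
                signature = signature_encoder(ref_image.to(device))
                # ...and demarcate the same place in a (possibly augmented) query view.
                pred = demarcator(phase["augment"](query_image).to(device), signature)
                loss = F.binary_cross_entropy(pred, target_mask.to(device))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```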

Conclusion

Signatures are a way to achieve Augmented Reality at large scale with limited resources, by reframing it as a recognition problem in computer vision rather than a large-scale 3-D reconstruction and positioning problem. This is one of the ways that research at HERE is trying to reinvent what maps are, how they are created, and how they can be used. This is a great space with lots of interesting data, many potential applications, and wide-reaching impact.


Ofer Melnik
Machine Learning & AI in Automated Map Making

Currently Sr Principal Data Scientist at HERE research. Enjoys working on tough problems that involve AI/ML.