Bringing an advanced Visual Re-localization pipeline using scene coordinate regression with deep learning to ARwayKit
Visual Re-localization is one of the core components of ARwayKit: our developers can deliver their location-based AR experiences only after accurate localization, so improving localization accuracy and robustness across diverse environments at scale is a top priority for us.
Over the last few months, we have been researching many ways to improve localization robustness, especially in scenarios with changing lighting and viewpoints, ranging from modifying traditional CV-based methods to tuning state-of-the-art machine learning methods.
In the new localization pipeline, we want to take advantage of recent advancements in Edge processing and 5G to offload some of the processing from end-user devices to Edge devices during the mapping process.
Scene Coordinate Regression Localization method
Camera re-localization, i.e. estimating the camera pose from an image, is a fundamental problem in computer vision and robotics. Recent state-of-the-art approaches use learning-based methods, such as Random Forests (RFs) and Convolutional Neural Networks (CNNs), to regress, for each pixel in the image, its corresponding position in the scene’s world coordinate frame, and then solve for the final pose via a RANSAC-based optimization scheme using the predicted correspondences.
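To make that pose-solving step concrete, here is a minimal sketch of recovering a camera pose from predicted per-pixel scene coordinates. It assumes a pinhole camera model and uses OpenCV’s solvePnPRansac as the robust solver; these are illustrative choices, not necessarily the exact solver used in ARwayKit.

```python
# Minimal sketch: robust pose fitting from dense 2D-3D correspondences.
# The scene coordinates and intrinsics are placeholder inputs.
import numpy as np
import cv2

def solve_pose_from_scene_coordinates(scene_coords, camera_matrix):
    """Recover a camera pose from predicted scene coordinates.

    scene_coords: (H, W, 3) array of predicted world coordinates per pixel.
    camera_matrix: (3, 3) pinhole intrinsics.
    """
    h, w, _ = scene_coords.shape
    # Build the 2D pixel grid matching the predicted 3D points.
    ys, xs = np.mgrid[0:h, 0:w]
    image_points = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    object_points = scene_coords.reshape(-1, 3).astype(np.float32)

    # Robustly fit the pose with RANSAC over PnP hypotheses.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, camera_matrix, None,
        iterationsCount=100, reprojectionError=8.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the estimated pose
    return R, tvec, inliers
```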
The framework consists of a deep neural network and a fully differentiable pose optimization. The neural network predicts scene coordinates, i.e. dense correspondences between the input image of the environment and the 3D scene space created at mapping time. The pose optimization implements robust fitting of the pose parameters using differentiable RANSAC (RAndom SAmple Consensus).
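As an illustration only, such a scene coordinate regression network can be sketched in PyTorch as a small fully convolutional model. The layer sizes and the training loss mentioned in the comments are assumptions for clarity, not the architecture used in our pipeline.

```python
# Illustrative PyTorch sketch of a scene coordinate regression network:
# a fully convolutional net mapping an RGB image to a subsampled grid
# of 3D scene coordinates. Layer sizes are assumptions.
import torch.nn as nn

class SceneCoordinateNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Regress x, y, z world coordinates for each cell of the feature grid.
        self.head = nn.Conv2d(256, 3, 1)

    def forward(self, image):
        # image: (B, 3, H, W) -> scene coordinates: (B, 3, H/8, W/8)
        return self.head(self.encoder(image))

# Training would minimize the distance between predicted and map-derived
# scene coordinates, optionally followed by end-to-end refinement through
# a differentiable RANSAC (DSAC-style) pose loss.
```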
Mapping Process for the new method
Compared to the traditional SLAM-based method (Simultaneous Localization And Mapping), the on-device experience will be different because real-time tracking across frames won’t be required; hence mapping will be much easier for beginners. The underlying AR service, such as ARKit or ARCore, will keep running in the background and is still required for mapping.
During the mapping process, the device will collect visual data, camera poses, and other metadata, which will be processed in the cloud on state-of-the-art GPUs such as the NVIDIA A100. Once the map data is successfully uploaded, training can take from 6 to 24 hours per map, depending on the map size and the complexity of the environment.
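As a rough illustration of the kind of data collected per frame, the sketch below packs captured images, ARKit/ARCore camera poses, and intrinsics into a single upload payload. The field names and payload layout are hypothetical, not ARwayKit’s actual schema or API.

```python
# Hypothetical sketch of the per-frame record a device could collect
# during mapping and upload for cloud training. Field names are illustrative.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class MappingFrame:
    image_path: str   # JPEG captured by the camera
    pose: list        # 4x4 camera-to-world matrix from ARKit/ARCore, row-major
    intrinsics: list  # 3x3 camera matrix, row-major
    timestamp: float  # capture time in seconds

def serialize_frames(frames):
    """Pack the collected frames plus map metadata into one upload payload."""
    return json.dumps({
        "map_metadata": {"created_at": time.time()},
        "frames": [asdict(f) for f in frames],
    })
```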
Dynamically updating the Maps over time
This new method will enable dynamic updating of maps in the cloud once a map is used for production localization. Retraining is triggered when the count of successful localization requests reaches a threshold, and the updated DL model is then applied to all subsequent localization requests, enabling high accuracy in continuously changing environments.
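A simplified sketch of such a retraining trigger is shown below; the threshold value, counters, and helper names are illustrative assumptions rather than our production logic.

```python
# Illustrative sketch of the retraining trigger: once successful production
# localizations on a map cross a threshold, the newly collected query data
# is used to retrain the model, and the updated weights serve all
# subsequent requests. Threshold and helper names are assumptions.
RETRAIN_THRESHOLD = 1000  # assumed value

def retrain(model, queries):
    """Placeholder for the cloud fine-tuning job (assumed, not shown here)."""
    return model

class MapModelManager:
    def __init__(self, model):
        self.model = model
        self.successful_localizations = 0
        self.pending_queries = []

    def on_localization_success(self, query_image, estimated_pose):
        # Keep successful queries as additional training data.
        self.pending_queries.append((query_image, estimated_pose))
        self.successful_localizations += 1
        if self.successful_localizations >= RETRAIN_THRESHOLD:
            self._retrain_and_swap()

    def _retrain_and_swap(self):
        # Fine-tune the scene coordinate model on the new queries,
        # then serve the updated model for all subsequent requests.
        self.model = retrain(self.model, self.pending_queries)
        self.successful_localizations = 0
        self.pending_queries.clear()
```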
Try ARwayKit
ARwayKit is on the horizon — and you and your team can be among the first to harness the power of Spatial Computing by unlocking its location capabilities for building your Spatial AR apps.
Sign up today and get started building🚀