Bringing an advanced Visual Re-localization pipeline using scene coordinate regression with deep learning to ARwayKit
Visual Re-localization is one of the core components of ARwayKit: our developers can deliver their location-based AR experiences only after accurate localization, so improving localization accuracy and robustness across diverse environments at scale is a top priority for us.
Over the last few months, we have been researching many ways to improve localization robustness, especially in scenarios with changing lighting and viewpoints, ranging from modifying traditional CV-based methods to tuning state-of-the-art machine learning methods.
In the new localization pipeline, we want to take advantage of recent advancements in Edge processing and 5G to offload some of the processing from end-user devices to Edge devices during the mapping process.
Scene Coordinate Regression Localization method
Camera re-localization, i.e. estimating the camera pose from an image, is a fundamental problem in computer vision and robotics. Recent state-of-the-art approaches use learning-based methods, such as Random Forests (RFs) and Convolutional Neural Networks (CNNs), to regress, for each pixel in the image, its corresponding position in the scene’s world coordinate frame, and then solve for the final pose via a RANSAC-based optimization scheme using the predicted correspondences.
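To make that pose-solving step concrete, here is a minimal sketch of recovering a camera pose from predicted per-pixel scene coordinates. It assumes a pinhole camera model and uses OpenCV’s solvePnPRansac as the robust solver; these are illustrative choices, not necessarily the exact solver used in ARwayKit.

```python
# Minimal sketch: robust pose fitting from dense 2D-3D correspondences.
# The scene coordinates and intrinsics are placeholder inputs.
import numpy as np
import cv2

def solve_pose_from_scene_coordinates(scene_coords, camera_matrix):
    """Recover a camera pose from predicted scene coordinates.

    scene_coords: (H, W, 3) array of predicted world coordinates per pixel.
    camera_matrix: (3, 3) pinhole intrinsics.
    """
    h, w, _ = scene_coords.shape
    # Build the 2D pixel grid matching the predicted 3D points.
    ys, xs = np.mgrid[0:h, 0:w]
    image_points = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    object_points = scene_coords.reshape(-1, 3).astype(np.float32)

    # Robustly fit the pose with RANSAC over PnP hypotheses.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, camera_matrix, None,
        iterationsCount=100, reprojectionError=8.0)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the estimated pose
    return R, tvec, inliers
```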
The framework consists of a deep neural network and a fully differentiable pose optimization. The neural network predicts scene coordinates, i.e. dense correspondences between the input image of the environment and the 3D scene space created at mapping time. The pose optimization implements robust fitting of the pose parameters using differentiable RANSAC (RAndom SAmple Consensus).
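As an illustration only, such a scene coordinate regression network can be sketched in PyTorch as a small fully convolutional model. The layer sizes and the training loss mentioned in the comments are assumptions for clarity, not the architecture used in our pipeline.

```python
# Illustrative PyTorch sketch of a scene coordinate regression network:
# a fully convolutional net mapping an RGB image to a subsampled grid
# of 3D scene coordinates. Layer sizes are assumptions.
import torch.nn as nn

class SceneCoordinateNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Regress x, y, z world coordinates for each cell of the feature grid.
        self.head = nn.Conv2d(256, 3, 1)

    def forward(self, image):
        # image: (B, 3, H, W) -> scene coordinates: (B, 3, H/8, W/8)
        return self.head(self.encoder(image))

# Training would minimize the distance between predicted and map-derived
# scene coordinates, optionally followed by end-to-end refinement through
# a differentiable RANSAC (DSAC-style) pose loss.
```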
Mapping Process for the new method
Compared to the traditional SLAM-based method (Simultaneous Localization And Mapping), the on-device experience will be different because real-time tracking across frames won’t be required; hence mapping will be much easier for beginners. The underlying AR service, such as ARKit or ARCore, will keep running in the background and is still required for mapping.
During the mapping process, the device will collect visual data, camera poses, and other metadata, which will be processed in the cloud on state-of-the-art GPUs such as the NVIDIA A100. Once the map data is successfully uploaded, training can take from 6 to 24 hours per map, depending on the map size and the complexity of the environment.
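As a rough illustration of the kind of data collected per frame, the sketch below packs captured images, ARKit/ARCore camera poses, and intrinsics into a single upload payload. The field names and payload layout are hypothetical, not ARwayKit’s actual schema or API.

```python
# Hypothetical sketch of the per-frame record a device could collect
# during mapping and upload for cloud training. Field names are illustrative.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class MappingFrame:
    image_path: str   # JPEG captured by the camera
    pose: list        # 4x4 camera-to-world matrix from ARKit/ARCore, row-major
    intrinsics: list  # 3x3 camera matrix, row-major
    timestamp: float  # capture time in seconds

def serialize_frames(frames):
    """Pack the collected frames plus map metadata into one upload payload."""
    return json.dumps({
        "map_metadata": {"created_at": time.time()},
        "frames": [asdict(f) for f in frames],
    })
```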
Dynamically updating the Maps over time
This new method will enable dynamic updating of maps in the cloud once a map is used for production localization. Retraining is triggered when the count of successful localization requests reaches a threshold, and the updated DL model is then applied to all subsequent localization requests, enabling high accuracy in continuously changing environments.
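A simplified sketch of such a retraining trigger is shown below; the threshold value, counters, and helper names are illustrative assumptions rather than our production logic.

```python
# Illustrative sketch of the retraining trigger: once successful production
# localizations on a map cross a threshold, the newly collected query data
# is used to retrain the model, and the updated weights serve all
# subsequent requests. Threshold and helper names are assumptions.
RETRAIN_THRESHOLD = 1000  # assumed value

def retrain(model, queries):
    """Placeholder for the cloud fine-tuning job (assumed, not shown here)."""
    return model

class MapModelManager:
    def __init__(self, model):
        self.model = model
        self.successful_localizations = 0
        self.pending_queries = []

    def on_localization_success(self, query_image, estimated_pose):
        # Keep successful queries as additional training data.
        self.pending_queries.append((query_image, estimated_pose))
        self.successful_localizations += 1
        if self.successful_localizations >= RETRAIN_THRESHOLD:
            self._retrain_and_swap()

    def _retrain_and_swap(self):
        # Fine-tune the scene coordinate model on the new queries,
        # then serve the updated model for all subsequent requests.
        self.model = retrain(self.model, self.pending_queries)
        self.successful_localizations = 0
        self.pending_queries.clear()
```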
Try ARwayKit
ARwayKit is on the horizon — and you and your team can be among the first to harness the power of Spatial Computing by unlocking its location capabilities for building your Spatial AR apps.
Sign up today and get started building🚀