Navigation Using Cameras and 5G: How Visual Positioning Service Works

Team Members — Landis Huffman, Josh Finken, Sanjay Boddhu


The Global Positioning System (GPS) is used every day for navigation, but its limitations can be frustrating. GPS is unreliable indoors and in dense urban areas, where buildings block and reflect satellite signals; and even when GPS does find your position, it can't determine which direction your phone is facing.

Instead of using satellites for positioning, a Visual Positioning Service (VPS) positions you using photos or video from a camera, bringing the power of positioning anywhere, including indoors. Mobile 5G devices can use VPS in real time with a connection to Multi-access Edge Computing (MEC), a high-end computing platform with nodes distributed in the service area. The device uploads video frames to the MEC, which computes and returns camera positions and orientations in real time with sub-meter accuracy. VPS provides a robust alternative to GPS for navigating urban canyons and indoors. Furthermore, the real-time 3D pose estimate from VPS can be paired with augmented reality for new levels of interaction and entertainment.
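To make this flow concrete, here is a rough device-side sketch in Python. The MEC endpoint URL and the response fields are hypothetical placeholders for illustration, not the actual service API.

```python
import cv2
import requests

# Hypothetical MEC endpoint; a real deployment would use the operator's service URL.
MEC_URL = "https://mec.example.com/vps/localize"

def localize_frame(frame):
    """Upload one camera frame and receive an estimated 6-DoF pose (assumed schema)."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("failed to encode frame")
    response = requests.post(MEC_URL, files={"image": jpeg.tobytes()}, timeout=2.0)
    response.raise_for_status()
    # Assumed response: {"position": [lat, lon, alt], "orientation": [qw, qx, qy, qz]}
    return response.json()

# Grab a frame from the default camera (a stand-in for the phone's camera stream).
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if ok:
    print(localize_frame(frame))
```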

Here is a glimpse into the cutting-edge computer vision advancements that enable VPS.

How VPS Positions an Image

A single photograph is all that's needed for VPS to estimate both the location and orientation of the camera. The processing pipeline consists of three steps, illustrated below.

How VPS works

1. Feature Extraction. Analyze the query image to find visually salient features that could help orient the image. For example, a tall church steeple in the image suggests to VPS which direction you are facing.

2. Feature Matching. Check whether the extracted features look like any that have been seen before. Does that church steeple look like any known landmark?

3. Pose Estimation. Compute the camera pose from matches between pixels of the query image and known 3D point cloud landmarks.

Now let's break down each step in more detail.

Feature Extraction

VPS starts by finding visually distinct locations, called keypoints, in the image. Keypoints are elementary features like corners or textured patches. The appearance of the pixels around each keypoint is then numerically encoded in a vector called a descriptor. There are many popular algorithms for keypoint and descriptor extraction. For example, the Scale-Invariant Feature Transform (SIFT) is a public-domain algorithm that identifies keypoints in the image and computes descriptors in the form of histograms of gradient orientations. The SIFT descriptor computed at each keypoint is a 128-dimensional vector.
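As a quick illustration, here is roughly what the extraction step looks like with OpenCV's SIFT implementation (the image filename is just a placeholder):

```python
import cv2

# Load the query image in grayscale; "query.jpg" is a placeholder filename.
image = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute a 128-dimensional descriptor for each one.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

print(f"Extracted {len(keypoints)} keypoints")
print(f"Descriptor array shape: {descriptors.shape}")  # (number of keypoints, 128)
```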

SIFT descriptors extracted from an image (source)

Feature Matching

VPS next looks to see if any of the extracted visual descriptors resemble descriptors that have been seen before. A large database of photographs of the scene has been collected ahead of time and used to build a 3D point cloud of the surrounding area with techniques like structure-from-motion (SfM). This point cloud consists of a large collection of 3D point "landmarks," each of which has a fixed (latitude, longitude, altitude) position and a visual descriptor extracted from the source imagery. The feature matching step compares descriptors from the query image to those in the point cloud to find similarities. Descriptors that are similar enough to pass the ratio test are matched, providing a correspondence between a query image keypoint pixel and a 3D point from the cloud. Note that exhaustively comparing all pairs of query and point cloud descriptors is immensely expensive, so approximate nearest-neighbor search is required for practical runtimes.
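Continuing the sketch above, here is a simplified version of the matching step. It assumes the landmark descriptors and their 3D positions were loaded from a prebuilt point cloud; the file names are hypothetical, and `keypoints` and `descriptors` come from the extraction example.

```python
import numpy as np
import cv2

# Hypothetical prebuilt map: one 128-D SIFT descriptor and one 3D position per landmark.
landmark_descriptors = np.load("landmark_descriptors.npy").astype(np.float32)  # (N, 128)
landmark_positions = np.load("landmark_positions.npy")                          # (N, 3)

# FLANN performs approximate nearest-neighbor search, which keeps matching practical
# even against millions of landmark descriptors.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(descriptors.astype(np.float32), landmark_descriptors, k=2)

# Ratio test: keep a match only if it is clearly better than the runner-up.
pixel_points, world_points = [], []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
        pixel_points.append(keypoints[pair[0].queryIdx].pt)
        world_points.append(landmark_positions[pair[0].trainIdx])

pixel_points = np.array(pixel_points, dtype=np.float32)  # (M, 2) query pixel coordinates
world_points = np.array(world_points, dtype=np.float32)  # (M, 3) matched 3D landmarks
```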

The iconic Art Institute of Chicago as a point cloud

Pose Estimation

In the final step, VPS uses the matched pixel/point pairs to recover the full 3D camera pose; thus, VPS not only positions the camera but also estimates the direction it faces. The correspondences between 2D image pixels and 3D world coordinates provide the geometric constraints needed to estimate the camera's position and orientation on the Earth. A Perspective-n-Point (PnP) algorithm solves for the six degrees of freedom of the camera's 3D pose, and the VPS positioning is complete! This entire pipeline is executed on the MEC in real time, so the pose is returned to the user in the blink of an eye.
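Continuing the sketch, OpenCV's RANSAC-wrapped PnP solver can recover the pose from the matched 2D/3D pairs. The camera intrinsics below are assumed values for illustration; a real system would use the device's calibration.

```python
import numpy as np
import cv2

# Assumed pinhole intrinsics for a 1280x720 camera; replace with calibrated values.
camera_matrix = np.array([[1000.0,    0.0, 640.0],
                          [   0.0, 1000.0, 360.0],
                          [   0.0,    0.0,   1.0]])
dist_coeffs = np.zeros(5)  # assume lens distortion is negligible or already corrected

# Solve the Perspective-n-Point problem; RANSAC rejects mismatched pairs (outliers).
success, rvec, tvec, inliers = cv2.solvePnPRansac(
    world_points, pixel_points, camera_matrix, dist_coeffs, reprojectionError=4.0)

if success:
    rotation, _ = cv2.Rodrigues(rvec)     # 3x3 camera orientation
    camera_position = -rotation.T @ tvec  # camera center in world coordinates
    print("Estimated camera position:", camera_position.ravel())
```

Together, the rotation and the camera position give the full six-degree-of-freedom pose that VPS returns.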

A PnP solver finds the absolute camera pose from pixel/point correspondences (source)
