Mapping the World — Part 2: A New Hope

Huub Heijnen
Published in Scape Technologies
Apr 25, 2019 · 8 min read

At Scape Technologies, we believe in a future in which humans and machines can intelligently co-exist. But for any phone, vehicle, headset, or delivery drone to be an intelligent agent, it will need a human-like understanding of the world around it. That is why we at Scape built a Vision Engine that connects the digital and physical world. This week, part 2: A New Hope.

Chicken-Inspired Mapping

In our last blog post, chickens taught us how we can create 3D maps from simple 2D images. Consequently, we are able to build complete 3D worlds using every picture from every smartphone camera, digital camera, dash cam, body cam and drone camera available. This method is called Structure-from-Motion.

Unfortunately, modern implementations of this technology do not generate maps with the

  • accuracy
  • robustness
  • scale

that we require for our Visual Positioning Service… until now.

New power, old principles

Powerful modern tools like GPU programming, Deep Learning, and Cloud Computing have achieved mass adoption and incredible results in recent years. So we decided to unleash the power of these 21st-century tools on this 20th-century algorithm: Structure-from-Motion.

Jigsaw Puzzle

Structure-from-Motion can be thought of as putting together a jigsaw puzzle. Instead of having puzzle pieces, we have images that we trim into puzzle pieces and slot together.

Step 1: Creating the puzzle pieces

The part in which we create puzzle pieces from images is called Feature Extraction. The shape of a puzzle piece is a representation of its observable features. For example, if two images both observe the clock of Big Ben, the Feature Extractor has to cut them a matching loop and socket that allow them to fit together. That particular loop and socket are then a description of the feature “Big Ben’s clock”.

The Feature Extractor describes the images with loops and sockets based on their features.
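
For readers who want to play with this step themselves, here is a minimal sketch of Feature Extraction using OpenCV’s classical SIFT descriptor (the one we discuss further below). The filename is a placeholder, and this is an illustration rather than our production Feature Extractor.

```python
import cv2

# Load one image and detect its features: where the "loops and sockets"
# are (keypoints) and what shape they have (descriptors).
image = cv2.imread("big_ben.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filename

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

# Each feature is described by a 128-dimensional vector.
print(f"{len(keypoints)} features, {descriptors.shape[1]} numbers per descriptor")
```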

Step 2: Fitting them together

Once the features are described, we create a list of which loops and sockets should fit together. This process is called Feature Matching, and the list serves as input to the Reconstructor: the module that “clicks” all the puzzle pieces together.

The Feature Matcher creates a list of similar descriptors.
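
To make the matching step concrete, here is a minimal sketch using OpenCV’s brute-force matcher with Lowe’s ratio test. The filenames are placeholders, and this stands in for, rather than reproduces, our Feature Matcher.

```python
import cv2

sift = cv2.SIFT_create()
img_a = cv2.imread("view_a.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder filenames
img_b = cv2.imread("view_b.jpg", cv2.IMREAD_GRAYSCALE)
kps_a, desc_a = sift.detectAndCompute(img_a, None)
kps_b, desc_b = sift.detectAndCompute(img_b, None)

# Compare every descriptor in image A with every descriptor in image B
# and keep the two closest candidates for each.
matcher = cv2.BFMatcher(cv2.NORM_L2)
candidates = matcher.knnMatch(desc_a, desc_b, k=2)

# Lowe's ratio test: keep a match only if the best candidate is clearly
# better than the runner-up, i.e. the socket fits one loop much better
# than any other.
matches = [m for m, n in candidates if m.distance < 0.8 * n.distance]
print(f"{len(matches)} putative correspondences between the two images")
```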

Step 3: Puzzle to real world

After the puzzle is completed, the Geo-Aligner aligns it with real-world coordinates. Do we know the exact GPS coordinates of Big Ben? If so, we can calculate the GPS coordinates of the surrounding buildings as well.

The Geo-Aligner aligns the 3D model with GPS coordinates.
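
One textbook way to do such an alignment is a least-squares similarity transform (scale, rotation, translation) between a few reconstructed landmarks and their known real-world positions. The numpy sketch below uses made-up coordinates and assumes the GPS positions have already been converted to a local metric frame; it illustrates the idea, not our actual Geo-Aligner.

```python
import numpy as np

def align_model_to_world(model_pts, world_pts):
    """Closed-form least-squares similarity transform (Umeyama/Kabsch)
    that maps reconstructed model points onto known real-world points."""
    mu_m, mu_w = model_pts.mean(axis=0), world_pts.mean(axis=0)
    M, W = model_pts - mu_m, world_pts - mu_w
    U, S, Vt = np.linalg.svd(W.T @ M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / np.sum(M ** 2)
    t = mu_w - scale * R @ mu_m
    return scale, R, t

# Made-up example: three reconstructed landmarks and their surveyed
# positions in a local metric frame (e.g. GPS converted to ENU metres).
model_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
world_pts = np.array([[10.0, 5.0, 0.0], [10.0, 7.0, 0.0], [6.0, 5.0, 0.0]])

s, R, t = align_model_to_world(model_pts, world_pts)
aligned = (s * (R @ model_pts.T)).T + t  # the whole model, now geo-referenced
```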

Great Idea, But…

Now, imagine that a puzzle has millions of pieces and that every piece has thousands of loops and sockets with tiny imperfections. Imagine also that a single puzzle piece fits with hundreds of other pieces in a high-dimensional space.

Though this makes the puzzle more complex, it does not explain why the resulting map can be inaccurate.

What is going wrong?

We have all laid out a puzzle and slotted together two pieces that were not supposed to fit. Consequently, even though subsets of the puzzle are correct, the “completed” puzzle is not.

When two different features are described with the same descriptor, the resulting model might end up with inconsistencies.

This exact same thing happens in Structure-from-Motion, but which module is to blame?

Blame it on the 90's

If two puzzle pieces fit together when they are not supposed to, it means that two different features were described with the same descriptor: the same loop and socket.

The default feature descriptor, SIFT, was created in 1999 and is a clever sequence of hand-crafted mathematical operations. It can recognize the same feature in similar images, but often fails when comparing images taken with different cameras or in different seasons.

Benchmark shattering

Nowadays, Deep Learning is reaching human levels of visual recognition, allowing us to describe visual features with far more precision (“this is Big Ben’s clock!”). We called our learned feature descriptor SOSNet and, unsurprisingly, we shattered all academic benchmarks.

Before, Structure-from-Motion pipelines might have incorrectly glued a part of the London Eye to Big Ben. Consequently, our VPS would have told you that you were standing in front of Big Ben while you were actually standing in front of the London Eye. But no longer.
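
For the curious: open-source libraries ship implementations of the published SOSNet descriptor. A minimal sketch, assuming kornia’s packaged version (which takes 32x32 grayscale patches and returns 128-dimensional descriptors), might look like the code below; it is an illustration, not our production pipeline.

```python
import torch
import kornia.feature as KF

# SOSNet turns 32x32 grayscale patches into 128-dimensional descriptors
# (assuming kornia's implementation and its pretrained weights).
sosnet = KF.SOSNet(pretrained=True).eval()

patches = torch.rand(16, 1, 32, 32)  # dummy patches; real ones come from detected keypoints
with torch.no_grad():
    descriptors = sosnet(patches)    # shape: (16, 128)

# Two patches showing the same feature should now lie close together,
# even across different cameras or seasons.
distance = torch.norm(descriptors[0] - descriptors[1])
```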

These learned descriptors also come with a bonus.

Robustness in, robustness out

Because we can now use any two images to create a 3D map, we can also use any image to localize against it, implicitly solving another goal: robustness.

SOSNet is the Structure-from-Motion equivalent of orthopedic insoles: a minor change in the foundation that improves the system as a whole.

Now, all that is left is to do this at a very large scale.

Puzzling Scalability

With classical implementations of Structure-from-Motion, two modules prevented us from creating a 3D map the size of London.

Using prior information

First, there is the Feature Matcher module. An endless stream of images means an endless number of descriptor comparisons as well.

Of course, there is some prior information available for the imagery. For example, we can use GPS to only compare images from the same neighborhood. But this still forces us to do tens of times more comparisons than necessary, so we look at our jigsaw puzzle again.

Pile the pieces

When laying out a puzzle of central London, we would put all the pieces that show a part of Big Ben on one pile, the pieces that show the London Eye on another, and so on. After that, we would only try to fit a London Eye piece with other London Eye pieces.

If we know which pieces observe the same building we can avoid a lot of matching attempts.

Why not introduce a similar approach when comparing images?

Again, this seems like a task that requires a human level of visual recognition (“this piece has a part of the London Eye!”). So again, we leverage the power of Deep Learning.

We created a proprietary global image descriptor that beats the state of the art in image categorization and is 8x more lightweight.
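
To show what this pair selection buys us, here is a small numpy sketch of the idea: one global descriptor per image (random placeholders below, standing in for the learned descriptor), after which expensive feature matching is only attempted between each image and its most similar neighbours.

```python
import numpy as np

# Random placeholders standing in for a learned global image descriptor;
# in practice each vector would come from the network described above.
num_images, dim, k = 1000, 256, 20
global_desc = np.random.randn(num_images, dim)
global_desc /= np.linalg.norm(global_desc, axis=1, keepdims=True)

# Cosine similarity between every pair of images (fine at this size;
# at city scale an approximate nearest-neighbour index would be used).
similarity = global_desc @ global_desc.T

# Only attempt expensive feature matching between each image and its
# k most similar images -- the same "pile" of puzzle pieces.
candidate_pairs = set()
for i in range(num_images):
    for j in np.argsort(-similarity[i])[1:k + 1]:  # skip the image itself
        candidate_pairs.add((min(i, j), max(i, j)))

print(f"{len(candidate_pairs)} candidate pairs instead of {num_images * (num_images - 1) // 2}")
```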

One module’s scalability problem solved, one remaining: the Reconstructor.

Divide and Conquer

The second-to-last step before we have a model is an optimization step called Bundle Adjustment, which scales terribly with the amount of input data.

For example, if the number of images used to create a map increases from 50 to 500, the time needed for Bundle Adjustment increases roughly 1000-fold. If it took 1 hour before, it will now take almost 6 weeks!
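
The factor of 1000 follows if you assume, as a rough rule of thumb, that the solver’s runtime grows with the cube of the number of images, which is the behaviour of classical dense solvers:

```python
# Back-of-the-envelope, assuming roughly cubic growth in the number of images.
images_before, images_after = 50, 500
hours_before = 1

factor = (images_after / images_before) ** 3        # 10**3 = 1000
hours_after = hours_before * factor
print(f"{hours_after:.0f} hours, or {hours_after / (24 * 7):.1f} weeks")  # ~6 weeks
```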

Teamwork makes the dream work

Here again, our puzzle metaphor unlocks the solution. Let us say you were laying out the puzzle with a friend. One of you would likely start reconstructing Big Ben and the other the London Eye. Once the two “super-pieces” are constructed, you would merge them into one larger super-piece. Furthermore, if you both take an espresso before starting, you would be done even quicker. Our solution to reconstructing at scale is therefore two-fold.

Dividing the problem into multiple smaller problems allows us to parallelize its solution.

Espresso first

The industry default for Bundle Adjustment is Ceres Solver by Google Research: a generic non-linear optimization library built to solve “large, complicated optimization problems”, and it runs on a CPU. The operative words here are “generic” and “CPU”.

Bundle Adjustment is an algorithm that deals with very large matrix operations, which are perfect for a GPU to process. Moreover, if we were to write an optimization library that is itself optimized for the specific problem we are solving, it should be faster than a generic one.

Therefore, we built one and — just like Ceres — named it after an asteroid: Apophis. Apophis is an optimization library for the GPU that solves bundle adjustment problems up to 100x faster than Ceres Solver.
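
A toy example shows why the GPU is such a good fit: each iteration of Bundle Adjustment boils down to building and solving a large linear system from the residuals and their Jacobian. The sketch below uses a small, dense, random stand-in for that system in PyTorch; real problems are sparse and far larger, and this is not Apophis.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy Gauss-Newton step: r holds reprojection residuals, J their Jacobian.
# In real Bundle Adjustment J is huge and sparse; here it is a dense random
# stand-in, just to show the matrix work a single iteration involves.
num_residuals, num_params = 20000, 3000
J = torch.randn(num_residuals, num_params, device=device)
r = torch.randn(num_residuals, device=device)

# Normal equations (with a little damping, Levenberg-Marquardt style):
# (J^T J + lambda I) dx = -J^T r
JtJ = J.T @ J
rhs = -(J.T @ r)
dx = torch.linalg.solve(JtJ + 1e-6 * torch.eye(num_params, device=device), rhs)
```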

Creating super-pieces

Dividing the big task into smaller tasks already reduces the total time spent on Bundle Adjustment. However, we only see its real potential when we also parallelize these tasks: if you find 800 friends, you will finish your puzzle 800 times faster.

Our innovations reduce the time needed for large-scale Structure-from-Motion, making it a feasible mapping solution for the first time.
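
As a sketch of the divide-and-conquer idea (with a hypothetical reconstruct_cluster standing in for running Structure-from-Motion on one cluster of images), the parallel part can be as simple as:

```python
from multiprocessing import Pool

def reconstruct_cluster(image_paths):
    """Hypothetical stand-in for reconstructing one cluster of images
    (e.g. everything showing Big Ben) into a partial model: a super-piece."""
    return f"partial model built from {len(image_paths)} images"

if __name__ == "__main__":
    # Images grouped per landmark, e.g. by the global image descriptor above.
    clusters = [
        ["big_ben_001.jpg", "big_ben_002.jpg", "big_ben_003.jpg"],
        ["london_eye_001.jpg", "london_eye_002.jpg"],
    ]
    with Pool(processes=len(clusters)) as pool:
        # Every worker builds its own super-piece in parallel; the partial
        # models are then merged into one city-scale map.
        super_pieces = pool.map(reconstruct_cluster, clusters)
```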

Old Idea, New Application

Given the large-scale mapping problem and all of the improvements mentioned above, we need to process high volumes of data in bursts, in parallel, and on GPUs. No wonder we turned to the proven Visual Effects industry for inspiration when designing our pipeline. But does this mean we have to build our own “render farm” equivalent?

Renting a farm

Fortunately not. With the highly scalable, powerful, and affordable cloud infrastructures available nowadays, spinning up hundreds of GPU machines in parallel is trivial.

No matter if we are mapping a building, a neighborhood, or a whole city: we can process it in less than a day.

Hope Becomes Reality

These are just a handful of the many innovations we are putting in place to lift Structure-from-Motion into the future. We are closing the gap between what the physical world shows us to be possible and what has, until now, been the digital limit.

In the next posts of this series, we will describe the innovations mentioned above in detail, starting with how we developed an optimization library that is up to 100 times faster than Google’s.

Huub is co-founder and CTO of Scape Technologies, a computer vision startup in London, working towards unifying human-machine understanding by connecting the physical and the digital world.

Follow Huub and Scape on Twitter.
