Mapping the World — Part 3: Introducing ‘Apophis’, a lighting-fast nonlinear optimization library

Published in

Scape Technologies

9 min readMay 24, 2019

This week: Math Strikes Back — A deep dive into optimization algorithms.

Scape Technologies has built a large-scale visual positioning service that allows devices to pinpoint their location, using a camera.

As stated in the previous blog posts in this series (part one and two available here), in order for us to create an accurate visual positioning service, we need accurate maps. We do this by taking thousands of images and processing them using an algorithm called Structure-from-Motion.

However, in its pure form, Structure-from-Motion, or SfM for short, can lead to inaccurate or blurry maps. We, therefore, refine the resulting maps using a mathematical optimization process called Bundle Adjustment.

For accurate localization, we must refine crude reconstructions and make them crispy and defined.

In pursuit of the perfect optimization library, Scape Technologies has developed ‘Apophis’: a nonlinear solver which is 100x faster than the leading library ‘Ceres’, developed by Google Research.

However, before diving into what Apophis does, it’s useful to understand how Structure-from-Motion works.

SfM Fundamentals

The first step of Structure-from-Motion is to take a collection of images, divide them into pairs, and estimate the relative camera position between the two image pairs at the time the images were captured.

We compute the relative position and orientation information between image pairs.

Whilst this gives us a good starting point, for most image pairs, there will be some inaccuracies in the calculated position. For example, if the difference in angle between the two images was 90°, the calculated angle could be 89° or 91°.

Furthermore, as shown below, these relative positionings may not be globally consistent.

For example, we can calculate the difference in angle between image pairs A/B and B/C. But if we add the rotations from A/B and B/C, we might not get the same answer as if we calculate the difference in angle between images A/C directly.

Example of inconsistency in pairwise relative position/orientation data.

How can we get a refined result if individual measurements are prone to inaccuracies?

The answer: We should not rely on any individual measurement!

The Wisdom of Crowds

In his book The Wisdom of Crowds, James Surowiecki observed people in an animal fair that try and gauge the weight of an animal. He points out a remarkable phenomenon whereby the average of all guesses will beat any individual guess! This because all individual biases will simply disappear in the multitude while taking the average.

To apply this idea to the camera placement problem, we consider all pairwise positioning information and accommodate for any differences in a fair way as illustrated in the diagram below.

The difference in angles between all images pairs have been globally considered and optimized. Adjusting each estimate by 2° to make it globally consistent.

Reconstructing the scene

Having estimated the optimal position of cameras when they captured the images, we can plan the 3D reconstruction of the observed scene.

We do this using a process called triangulation. This is the act of deducing the position of 3D points from two (or more) observations. However, just like estimating the position of cameras, observations do not always agree on where points are. The result can be quite significant: a small positioning error can yield huge errors in the triangulation of points.

For example, assume two cameras positioned 1m apart observed the same point 40m in the distance. If the calculated angle of one camera is off by just 0.4° this will amount to an inaccuracy of 9–10m for the triangulated point!

Even after applying the Wisdom of the Crowd’s principle and averaging all camera estimates, the result of the triangulation will be inaccurate. This step is just too sensitive to camera placement errors, even tiny ones.

We, therefore, need another method to refine and optimise the results of the triangulated 3D model. We do this using ‘mathematical optimization’.

What is ‘Mathematical Optimization’?

‘Mathematical Optimization’ is the challenge of finding the values that give an optimal solution to a problem, given some criteria. The measure of how well a solution is performing is described by a function. In the simplest case, this means that some measure of a function needs to be minimized or maximized.

In our case, the values that we are trying to determine are the positions of cameras and points within a scene. The function that we are trying to minimize is the mismatch between the position on the image where an observed point actually appeared, and where it would appear on the image according to our current estimate.

Therefore, the optimal solution represents the best possible consistency between the estimated location of cameras and the observed points within the 3D reconstruction.

In the context of Structure from Motion, we call this specific optimization challenge Bundle Adjustment since it consists of the adjustment of ‘bundles’ of light rays such as the ones in the figure below.

Bundle of light rays that reached the camera — they indicate possible positions for objects.

We use Bundle Adjustment to minimize or eliminate all mismatches between cameras and points.

We iteratively optimize the global reconstruction by giving each image a ‘weight’ to their predicted location in the ‘Wisdom of Crowds’ analogy and use these weights to calculate the best solution possible using a nonlinear optimization library.

The state of the art

For the last few years, the industry standard for implementing mathematical optimization has been a software library called ‘Ceres’, named after an asteroid that lies between the orbits of Mars and Jupiter.

Developed at Google Research for refining Google Maps, Ceres has been the go-to library for Bundle Adjustment since 2007 and is used in the majority of Structure-from-Motion pipelines today.

However, while Ceres can produce excellent results, it was less suited for our 3D reconstruction pipeline, due to the massively large scale at which we are working. After all, we don’t just want to reconstruct a building or a street, we need our camera-based mapping and localization to be available everywhere in the world!

Even an area as small as a few city blocks could consist of more than 2,000 images as in the example below with 2,372 images and 1,331,812 points.

Using Ceres, a reconstruction like the one above could take several days to complete, which would be far too slow and expensive to process.

Therefore, in order to be truly scalable, we needed to write our own mathematical optimizer, designed to solve massive bundle adjustment instances in a fraction of the time.

Code that writes code

In pursuit of the perfect mathematical optimizer, we stumbled upon a little trick that we could use to improve upon state of the art. We discovered that code that computes mathematical data can be tricked to do something else.

We realized that code could be ‘tricked’ to re-write itself in order to become more efficient!

So, how does it work?

Simply put, all computer languages allow you to redefine arithmetic operations (like plus, minus, multiply, etc) in order to extend the scope of a given program. For example, changing the format of a cell from numbers to time in Excel changes the nature of how particular arithmetic operations work.

We used this inherent flexibility to reassign mathematical symbols to generate additional data that we can use at a later date.

Our code generator is a smart way to redefine the meaning of such symbols to produce code which we use whenever we create 3D reconstructions. In the optimization process, we use this generated code to compute auxiliary data. Compared to a direct approach our method inherently avoids calculations which we can predict to be unnecessary. Ultimately, this saves us time in the process.

How much time?

Well, the complexity of any mathematical function can be measured in the number of mathematical operations the function requires. Simply plugging our auto-generated code into Ceres reduced the number of numeric operations from 1652 to 680, resulting in a nice speedup of 2x.

Whilst this was good, however, it was still not enough for our world-conquering plans. For that, we needed to harness the power of the GPU.

C++ code on a trip to the GPU

Ceres is currently designed to operate on the CPU, which comes with inherent limitations, due to a lack of large-scale parallel processing. On your laptop, for example, you might have 4 CPU cores, allowing you to work on 4 operations at the same time. However, a GPU has hundreds of cores, allowing you to work on hundreds of operations simultaneously.

While we could have ported Ceres to the GPU, moving any complex operation from the CPU to the GPU can be a cumbersome process. This is because with many tasks operating simultaneously in parallel, it is a challenge to ensure individual operations do not ‘trip up’ over each other.

Thankfully, the code produced by our ‘code generator’ was perfect for parallel computation on a GPU, allowing us to build our own optimizer from scratch and forget about Ceres all together.

Not only does the code generator allow us to implement lightning-fast bundle adjustment, it’s also incredibly flexible. Furthermore, it can even be used to solve a wider variety of large, complicated optimization problems anywhere in our pipeline!

We call our new optimization library ‘Apophis’ which, like Ceres, is also named after an asteroid. Unlike Ceres, however, the asteroid Apophis comes with a serious chance of impacting the Earth!

Ceres on the left. Apophis on the right.

Apophis — The headline stats

Apophis is a highly efficient library for nonlinear optimization which:

Uses GPUs at their full power making the computation entirely parallel.
Is especially fine-tuned for bundle adjustment and geometric optimization problems.
Is still general enough to be fed different optimization problems or possibly implement variations of the bundle adjustment.
Is highly configurable and allows to fine tune memory usage/speed trade-offs.
Can compute movements that give us some play in deforming structures, in a way that is still rather compatible with our image data (see image below)

Animation of deformations of cameras in the above reconstruction, σₙ representing resistance/stiffness of the with respect to the displayed movement.

The results of Apophis exceeded expectations yielding performance improvements up to 100x in bundle adjustment compared with Google’s Ceres.

In practice, this means that an area that previously would take 2 weeks to process can now be achieved in hours! Not only does this save time, but it also means that we can process city-scale 3D reconstructions on the cloud, without breaking the bank! 💰💰💰

In the next blog of this technical series, we will dive into our benchmark-shattering research on ‘image features and descriptors’, sharing how we created the most robust method for matching points-of-interest between RGB images.

If you are keen to learn more about the work we are doing at Scape or what you can do to partner with us, please, get in touch.

Additionally, if you are interested in learning more about our research projects, would like to collaborate, or would like to join the team, reach out to research@scape.io

This article is part three of a technical series on building Scape’s ‘Vision Engine’. Read part one and two of the series on Medium.

Maurizio is Principal Mathematician at Scape Technologies, a computer vision startup in London, working on mathematical optimization and numerical problems connected to Computer Vision.

Follow Maurizio and Scape on Twitter.