Mapping the World — Part 4: SOSNet to the Rescue!

Vassileios Balntas
Scape Technologies
Jun 14, 2019 · 11 min read

At Scape Technologies, we believe in a future in which humans and machines can intelligently co-exist. But for any phone, vehicle, headset, or delivery drone to successfully operate within the physical world, those devices will need a human-like understanding of the world around them.

That is why at Scape, we have built the ‘Vision Engine’, an infinitely scalable mapping and localization pipeline to connect the digital and physical worlds.

This video shows our Visual Positioning Service determining the correct location of a mobile phone, at night, during a thunderstorm.

However, the underpinning map used by our VPS was created using data captured roughly a year before, at noon.

Nighttime localization using a ‘daytime map’ was one of the hardest problems in computer vision until SOSNet came to the rescue!

Our previous blog post introduced Apophis, our lightning-fast non-linear optimization library that is used for solving large-scale problems commonly found in Structure-from-Motion (SfM). One of the core steps of SfM is inferring the relative camera motion between two frames, which is done with a technique called “Image Matching”. To perform image matching at the scale and quality that is required for our Vision Engine, we had to create a more robust matching method, without sacrificing computational efficiency.

In this post, we introduce SOSNet, which will be presented as an oral at CVPR 2019 and delivers significant improvements over state-of-the-art methods. We built SOSNet by making neural networks go from learning similarities to learning similarities between similarities.

However, before digging into SOSNet and taking a look at how vastly it improves the accuracy of reconstructed 3D maps, it is useful to describe the problem of image matching.

Recovering relative camera motion

For humans, it is relatively easy to estimate how a camera moved between two images. Given the example images below, imagine that we are standing so that we observe scene A. To get to scene B, we would take a few steps to the left and then rotate our head to the right. On the other hand, to go from scene A to scene C, we would need to move right and then tilt our head to the left.

Recovering the relative camera motion between two images allows us to build large scale 3D maps of the world.

Recovering relative camera motion from image matching

Christopher Longuet-Higgins

To recover the relative camera motion between two images, we need to identify a set of point-to-point correspondences in the pixel space where cameras live. This was originally described in a 1981 article published in Nature by Christopher Longuet-Higgins. In particular, through the magic of two-view geometry, we can estimate a mathematical representation of the camera motion, which is known as the Essential Matrix.

Fun fact: One of Christopher Longuet-Higgins’ PhD students is Geoffrey Hinton, who later went on to receive the Turing Award for his contributions to neural networks, which helped spawn the “deep learning era” in computer vision. It’s this same era of deep learning that led to our SOSNet method.

We can recover the relative camera motion between two images by finding many correspondences between specific areas of the left image and the “matching” areas of the right image. This process is called “image matching”.

In theory, the well-known eight-point algorithm requires just eight point-to-point correspondences between the two images to recover the relative camera pose. In practice, however, things are not that ideal.

The algorithm assumes that the matching process is error-free, which is not realistic. Thus, a common practice is to find as many correspondences as possible, which can then be used, together with robust estimation, to recover the relative camera pose. The classical pipeline can be summarised as: Detect, Describe & Match.
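Before looking at that pipeline in detail, here is a minimal sketch of the pose-recovery step itself, assuming a set of putative correspondences (pts1, pts2) and a known camera intrinsics matrix K are already available. It uses OpenCV, and robust estimation with RANSAC handles the inevitable wrong matches:

```python
import cv2
import numpy as np

# pts1, pts2: Nx2 float arrays of corresponding pixel coordinates (hypothetical inputs)
# K: 3x3 camera intrinsics matrix, assumed known from calibration
def relative_pose(pts1: np.ndarray, pts2: np.ndarray, K: np.ndarray):
    # Robustly estimate the Essential Matrix; RANSAC tolerates a fraction of wrong matches
    E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                          prob=0.999, threshold=1.0)
    # Decompose E into a rotation R and a translation t (t is only known up to scale)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inlier_mask)
    return R, t, inlier_mask
```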

Detect, Describe & Match

The classical image matching pipeline. After a set of points are detected in each image, small patches are cropped around them. Then, a descriptor is computed for each patch, and these descriptors are compared between pairs of patches to find correspondences.

In the detection stage, we focus on specific areas of the image that are deemed interesting. In the description stage, we extract small patches around these areas and assign one descriptor per patch. We can think of the descriptor as a special type of barcode that encodes information about the appearance of the patch. In the final matching stage, descriptors are compared across the two images, and if two descriptors are sufficiently similar, a match is recorded. Repeating this process for all detected points can lead to a large number of matches between the images.
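As a concrete illustration, the whole Detect, Describe & Match pipeline fits in a few lines of OpenCV. This is a rough sketch: the detector choice, the ratio-test threshold and the file names are illustrative assumptions, not what our Vision Engine uses.

```python
import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical image paths
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect interesting points and describe the patches around them
sift = cv2.SIFT_create()
kps1, desc1 = sift.detectAndCompute(img1, None)
kps2, desc2 = sift.detectAndCompute(img2, None)

# Match: for each descriptor, find its two nearest neighbours in the other image
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn_matches = matcher.knnMatch(desc1, desc2, k=2)

# Lowe's ratio test: keep a match only if it is clearly better than the runner-up
good = [m for m, n in knn_matches if m.distance < 0.8 * n.distance]
print(f"{len(good)} putative correspondences found")
```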

There are many problems that arise in this pipeline including:

  • which points or areas should we label as interesting?
  • how many interesting points should we try to match?
  • how do we extract patches around points?

However, one specific problem is very important: How do we go from an area of an image which is deemed interesting, to a patch descriptor, which can be used to compare the area against other interesting areas?

Below, we briefly discuss the history of patch descriptors and describe our SOSNet method, which outperforms state-of-the-art methods across several academic benchmarks.

A Blast From the Past: SIFT

The year is 1999, we enter The Matrix, and the Y2K Scare is making some people think the end of the world is near.

At the same time, David Lowe, a professor at UBC in Canada, published a paper that would end up being one of the most popular computer vision papers of all time. The paper’s title is Object Recognition from Local Scale-Invariant Features, and among other things, it introduces SIFT, a method to go from interesting points, to patches, to descriptors.

To understand how SIFT works, we need to know what the image gradient is. Simply put, the gradient shows us the direction and the magnitude (strength) of the steepest increase in image brightness.

Visualisation of gradients for three simple image cases. Each pixel’s gradient is represented by an arrow, whose direction points towards increasing image brightness, and whose length shows how drastic the brightness change is.

So what does the gradient have to do with describing patches?

We can use the gradients of an image to build a descriptor that encodes information about the image and is somewhat invariant to visual changes. In SIFT, the patch is split into a 4x4 grid. By counting, in each grid cell, how many pixels exhibit each gradient direction, we end up with a set of histograms that, concatenated, form the SIFT descriptor.
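A toy NumPy version of this idea, with made-up parameters, looks roughly like the following; the real SIFT additionally applies Gaussian weighting, soft binning, orientation normalisation and careful clipping of the histogram:

```python
import numpy as np

def sift_like_descriptor(patch: np.ndarray, n_bins: int = 8) -> np.ndarray:
    """Toy SIFT-style descriptor: a 4x4 grid of gradient-orientation histograms."""
    patch = patch.astype(np.float32)
    # Image gradients: direction and strength of the steepest brightness increase
    gy, gx = np.gradient(patch)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # orientation in [0, 2*pi)

    cell = patch.shape[0] // 4                          # assume a square patch, e.g. 32x32
    hist = np.zeros((4, 4, n_bins), dtype=np.float32)
    for i in range(4):
        for j in range(4):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            bins = (a / (2 * np.pi) * n_bins).astype(int) % n_bins
            # Accumulate gradient strength into the orientation bins of this cell
            np.add.at(hist[i, j], bins, m)

    desc = hist.ravel()                                 # 4*4*8 = 128 values, like SIFT
    return desc / (np.linalg.norm(desc) + 1e-7)
```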

Below we see several images, their individual gradient direction responses, and the full SIFT descriptor (histogram) below each image.

Examples of images and their SIFT descriptors.

The original SIFT method is far more complex than the simplified description above and contains several implementation tricks. The computer vision library VLFeat documents several SIFT details for anyone who wants to dive deeper. Interested readers who want to dig into the C code can also take a look here.

The SIFT paper has been cited over 65,000 times, and it remained the de facto standard method for image matching even as recently as 2015.

However, SIFT often fails in difficult cases. For example, consider the set of patches shown below: their variation in appearance is too complex for a simple aggregation of gradient statistics to cope with.

An example of a challenging set of patches. SIFT is not able to handle the variation in appearance.

SIFT comes from 1999, which was a completely different era in computer vision. More specifically, it predates the field’s shift to deep learning, which took hold after Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton won the ImageNet 2012 competition by a significant margin and surprised the computer vision world.

The deep learning era in computer vision has also had an impact on image matching methods.

Deep Learning Descriptors

Instead of manually designing a descriptor, deep learning focuses on learning from examples.

A network is fed a large number of annotated patches and predicts whether pairs of patches match. If it makes a mistake, the network parameters are updated so that the mistake is less likely to be repeated. After many iterations, the network correctly predicts most of the training examples and is considered trained. For more information about deep learning, you can check some of the relevant tutorials. In addition, there is an awesome online page from Google where you can play with your own networks.

Using a neural network as a local patch descriptor. Based on the output of the network, we can estimate whether the two patches show the same point, or not.

To understand more about how deep learning descriptors are trained, we use a simplified setting where the descriptor is a point in the familiar two-dimensional space (i.e. it is represented by just two values). In this case, the similarity between patches can be computed from their distance in the 2D plane. For example, in the image below, the two patches connected by the shorter, green line are more similar than the patches connected by the longer, red line.

(left) For a two-dimensional case, the position of the patch in the space indicates the two descriptor values. (right) Similarities between descriptors can be defined by measuring distances in the two-dimensional plane.
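In code, this amounts to nothing more than measuring Euclidean distances between 2D points (the descriptor values below are made up for illustration):

```python
import numpy as np

# Three patches described by hypothetical 2D descriptors
a = np.array([0.2, 0.9])    # patch A
b = np.array([0.25, 0.85])  # patch B, visually similar to A
c = np.array([0.9, 0.1])    # patch C, visually different

print(np.linalg.norm(a - b))  # small distance -> high similarity (the green line)
print(np.linalg.norm(a - c))  # large distance -> low similarity (the red line)
```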

But how can we make use of similarities in the 2D space to train descriptors using deep learning?

How to train your deep descriptor

The state-of-the-art architecture for training patch descriptors with neural networks is the triplet network.

To train a triplet network, we feed it with triplets of patches. Two of the patches come from the same 3D point, while the remaining patch comes from a separate, non-matching 3D point. The job of the network is to produce descriptors such that the similarity between the matching patches is higher than the similarity between the non-matching patches.

A triplet network learns to describe patches such that two patches from the same 3D point have higher descriptor similarity than two patches from different 3D points.

Most state-of-the-art methods presented at recent conferences, such as TFeat, L2-Net and HardNet, use the triplet architecture and descriptor similarities to learn descriptors.
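At the heart of these methods is a triplet margin loss on L2-normalised descriptors. The PyTorch sketch below uses a small stand-in network rather than any of the published architectures, just to show the shape of the training objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDescriptor(nn.Module):
    """Toy patch descriptor: 32x32 grayscale patch -> 128-D unit-length vector."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)  # unit-norm descriptors

def triplet_loss(anchor, positive, negative, margin: float = 1.0):
    # The matching pair should be closer than the non-matching pair by at least `margin`
    d_pos = (anchor - positive).pow(2).sum(1).sqrt()
    d_neg = (anchor - negative).pow(2).sum(1).sqrt()
    return F.relu(d_pos - d_neg + margin).mean()
```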

However, training with triplets has one specific downside: there is no enforcement of consistency between the distances of dissimilar pairs of patches. This can be seen in the figure below, where, after training, the configuration of the four patches in 2D space is not consistent. More specifically, the distances between the two different families of patches (illustrated using the blue and red lines) are not similar.

Illustration of the effect of learning 2D descriptors with a triplet loss. Similar patches are pushed close to each other, and dissimilar ones are pushed away. However, no consistency is enforced between the different families of patches.

This inconsistency between distances in the case of triplet networks was the observation that led to the design of our SOSNet descriptor.

But how do we enforce consistency between different patches? We thought: what if we took it one step further? What if we made use of the similarity between similarities?

SOSNet: From similarities, to similarity of similarities

Second-order matching has been shown to be really useful in graph matching problems. This observation inspired us to devise a training method based on second-order similarity information.

Second-order similarity goes from comparing similarities between pairs to comparing the similarity between similarities. E.g., how similar are sim₁ and sim₂?

But what is second-order similarity (SOS)? To define it, we need 4 patches instead of 3. Essentially, second-order similarity measures how similar or different similarities between pairs of patches are. In the image on the left, the second-order similarity is low, since the two similarities sim₁ and sim₂ are not equal.

We incorporate higher-order constraints by adding second-order similarity to the training of triplet networks, and we name our method SOSNet. In the figure below, we can observe that the distances between the different families of patches (illustrated using the green lines) are now consistent, and thus their second-order similarity is high.

Illustration of the effect of second-order similarity (SOS). By enforcing SOS, the final configuration of the descriptors is more consistent between different patches.
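In terms of the loss, the idea can be sketched as an extra regulariser on top of the first-order triplet loss: for every two matching pairs in a batch, the distance between the anchors should agree with the distance between their positives. The snippet below is a simplified sketch, not the exact formulation from the paper (which, among other details such as how pairs are selected, is described there):

```python
import torch

def second_order_similarity_reg(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """Sketch of an SOS-style regulariser.

    anchors, positives: (N, D) L2-normalised descriptors, where anchors[i] and
    positives[i] describe the same 3D point. For every pair (i, j), the distance
    between anchors i and j should match the distance between positives i and j.
    """
    d_a = torch.cdist(anchors, anchors)      # pairwise distances among anchors
    d_p = torch.cdist(positives, positives)  # pairwise distances among positives
    return ((d_a - d_p) ** 2).sum(dim=1).sqrt().mean()

# Combined objective (sketch): first-order triplet loss plus the SOS regulariser
# total_loss = triplet_loss(a, p, n) + second_order_similarity_reg(a, p)
```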

For more information about the technical details of our method, please refer to our paper on arXiv, where it is described in detail together with several experimental results.

Overall, we are able to achieve significant performance gains and, at the time of writing, SOSNet ranks #1 in the ‘WISW contest’, which is devoted to image matching using local image descriptors in planar and non-planar scenes.

In addition, we provide several experiments that analyse the configuration of descriptors in descriptor space. This is useful for comparing how different methods utilise the descriptor space and for discovering their strengths and weaknesses.

SOSNet in action: Image Matching

Below we show some examples of how SOSNet compares against SIFT, which is still one of the most commonly used methods for image matching. We can observe that SOSNet correctly matches most of the points even in extremely challenging cases, while SIFT fails to keep up.

SOSNet in action: Building 3D Maps

Finally, we compare SOSNet with a standard triplet network trained without our second-order similarity loss. To do this, we build 3D maps from a set of 2D images of a small road in London. Note that in both cases, we use exactly the same images to build the map.

Map of a London area built with a standard triplet network.
Map of the same London area, built using our SOSNet descriptor.

With SOSNet, we can observe that the camera motion paths (shown in red) are recovered in their entirety, and the resulting 3D map is significantly more accurate.

A peek into the future

While SOSNet is able to handle significant viewpoint variations, it still relies on the standard Detect-Describe-Match pipeline, which can fail under extreme changes.

We are now working on the next generation of matching methods, which utilise attention mechanisms to perform global image matching. More details about this work will be shared soon. For now, we offer a sneak preview below!

SOSNet (left) is not able to handle extreme scene variations. Our current research (right) into methods that don't use the classical Detect-Describe-Match pipeline is very promising in such cases.

If you are interested in hearing more about SOSNet, please come to our oral presentation at CVPR 2019, where our research intern Yurun Tian will give his talk. We will also be presenting a poster after the oral, where you can have an in-depth conversation with members of our team. The full conference program can be found here.

Oral Presentation. 13:48, Thursday 20th June 2019.

Poster Presentation. 15:20, Thursday 20th June 2019.

CVPR 2019, Long Beach, California

If you are keen to learn more about the work we are doing at Scape or what you can do to partner with us, please, visit our website or email us.

Additionally, if you are interested in learning more about our research projects, would like to collaborate, or would like to join the team, reach out to research@scape.io

Vassileios Balntas leads the research team at Scape Technologies, which focuses on several problems related to large-scale visual localization.

Follow Vassileios and the company on Twitter.

Interested to learn more about Scape Technologies?

We send a newsletter every couple of months, making sense of the AR industry and sharing our progress.

Sign Up to our newsletter.
