At Scape Technologies, we believe in a future in which humans and machines can intelligently co-exist. But for any phone, vehicle, headset, or delivery drone to successfully operate within the physical world, those devices will need a human-like understanding of the world around them.
That is why at Scape, we have built the ‘Vision Engine’, an infinitely-scalable mapping and localization pipeline to connect the digital and physical world.
This video shows our Visual Positioning Service determining the correct location of a mobile phone, at night, during a thunderstorm.
However, the underpinning map used by our VPS was created using data captured roughly a year before, at noon.
Nighttime localization using a ‘daytime map’ was one of the hardest problems in computer vision until SOSNet came to the rescue!
Our previous blog post introduced Apophis, our lightning-fast non-linear optimization library that is used for solving large-scale problems commonly found in Structure-from-Motion (SfM). One of the core steps of SfM is inferring the relative camera motion between two frames, which is done with a technique called “Image Matching”. To perform image matching at the scale and quality that is required for our Vision Engine, we had to create a more robust matching method, without sacrificing computational efficiency.
In this post, we introduce SOSNet, which will be presented as an oral at CVPR 2019 and which significantly improves on state-of-the-art methods. We built SOSNet by making neural networks go from learning similarities to learning similarities between similarities.
However, before digging into SOSNet and taking a look at how vastly it improves the accuracy of reconstructed 3D maps, it is useful to describe the problem of image matching.
Recovering relative camera motion
For humans, it is relatively easy to estimate how a camera moved between two images. Given the example images below, imagine that we are standing so that we observe scene A. To get to scene B, we should take a few steps to the left and then rotate our head right. On the other hand, to go from scene A to C we need to move right and then tilt our head left.
Recovering relative camera motion from image matching
In the pixel space where cameras live, we need to identify a set of point-to-point correspondences in order to recover the relative camera motion between two images. This was originally described in a 1981 article published in Nature by Christopher Longuet-Higgins. In particular, through the magic of two-view geometry, we can estimate a mathematical representation of the camera motion, which is known as the Essential Matrix.
In theory, the well-known eight-point algorithm requires just 8 point-to-point correspondences between the two images to recover the relative camera pose. In practice, however, things are not that ideal.
The algorithm assumes that the matching process is done without errors, something that is not realistic. Thus, a common practice is to find as many correspondences as possible, which can then be used to recover the relative camera pose. The classical pipeline can be summarised as: Detect, Describe & Match.
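The eight-point estimate mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration on noise-free synthetic data, not our production pipeline: the function names, the synthetic scene, and the assumption that points are already in normalized (calibrated) image coordinates are all ours. Each correspondence contributes one linear equation x2ᵀ E x1 = 0 in the nine entries of E, and the solution is the null vector of the stacked system.

```python
import numpy as np

def to_homogeneous(x):
    """Append a 1 to each 2D point: (N, 2) -> (N, 3)."""
    return np.hstack([x, np.ones((len(x), 1))])

def eight_point_essential(x1, x2):
    """Estimate E from N >= 8 correspondences in normalized image coordinates.

    x1, x2: arrays of shape (N, 2). Returns a 3x3 matrix E such that
    x2_h^T @ E @ x1_h is close to zero for every correspondence.
    """
    x1_h, x2_h = to_homogeneous(x1), to_homogeneous(x2)
    # Row n holds the outer product x2_h[n] x1_h[n]^T, flattened row-major,
    # so that A @ E.ravel() = 0 encodes the epipolar constraint.
    A = np.einsum("ni,nj->nij", x2_h, x1_h).reshape(len(x1), 9)
    _, _, vt = np.linalg.svd(A)
    E = vt[-1].reshape(3, 3)  # right singular vector of the smallest singular value
    # Project onto the set of essential matrices: singular values (1, 1, 0).
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Hypothetical synthetic scene: 20 points in front of both cameras.
rng = np.random.default_rng(0)
X1 = np.column_stack([rng.uniform(-1, 1, 20),
                      rng.uniform(-1, 1, 20),
                      rng.uniform(4, 8, 20)])
angle = 0.1  # small rotation about the vertical axis, in radians
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.2])
X2 = X1 @ R.T + t                      # same points in the second camera frame

x1 = X1[:, :2] / X1[:, 2:]             # perspective projection, camera 1
x2 = X2[:, :2] / X2[:, 2:]             # perspective projection, camera 2

E = eight_point_essential(x1, x2)
residuals = np.einsum("ni,ij,nj->n", to_homogeneous(x2), E, to_homogeneous(x1))
print("largest epipolar residual:", np.abs(residuals).max())
```

With perfect correspondences the residuals vanish up to floating-point error. With real matches, some of which are wrong, a linear solver like this is commonly wrapped in a robust estimator such as RANSAC, which is one reason a large pool of candidate matches is useful.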
Detect, Describe & Match
In the detection stage, we focus on specific areas of the image that are deemed interesting. In the description stage, we extract small patches around them, and we assign one descriptor per patch. We can think of the descriptor as a special type of barcode that encodes information about the appearance of the patch. In the final matching stage, random pairs of descriptors are compared across two images, and if two descriptors are similar, a match is recorded. Repeating this process many times can lead to a large number of matches between the images.
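To make the matching stage concrete, here is a small sketch of one common variant: instead of sampling random pairs, it compares every descriptor in one image against every descriptor in the other and keeps mutual nearest neighbours that are close enough. The function name, the distance threshold, and the toy descriptors are illustrative assumptions, not part of our pipeline.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, max_distance=0.5):
    """Return (i, j) index pairs where desc_a[i] and desc_b[j] are mutual
    nearest neighbours and their Euclidean distance is at most max_distance."""
    # Pairwise Euclidean distances, shape (len(desc_a), len(desc_b)).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = dists.argmin(axis=1)  # best candidate in B for each descriptor in A
    nn_ba = dists.argmin(axis=0)  # best candidate in A for each descriptor in B
    matches = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] == i and dists[i, j] <= max_distance:
            matches.append((i, int(j)))
    return matches

# Toy 4-dimensional descriptors: image B contains slightly perturbed,
# shuffled copies of the descriptors in image A, plus one unrelated descriptor.
desc_a = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
desc_b = np.array([[0.0, 0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0, 0.1],
                   [0.95, 0.0, 0.0, 0.05],
                   [0.0, 0.0, 0.0, 1.0]])

matches = match_descriptors(desc_a, desc_b)
print(matches)
```

The mutual check discards the unrelated fourth descriptor in image B, because no descriptor in image A picks it as its nearest neighbour.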
There are many problems that arise in this pipeline including:
- which points or areas should we label as interesting?
- how many interesting points should we try to match?
- how do we extract patches around points?
However, one specific problem is very important: How do we go from an area of an image which is deemed interesting, to a patch descriptor, which can be used to compare the area against other interesting areas?
Below we briefly discuss the history of patch descriptors, and we describe our SOSNet method which outperforms several state-of-the-art methods across several academic benchmarks.
A Blast From the Past: SIFT
In 1999, David Lowe, a professor at UBC in Canada, published a paper that would end up being one of the most popular computer vision papers of all time. The paper’s title is Object Recognition from Local Scale-Invariant Features and among other things, it discusses SIFT, a method to go from interesting points, to patches, to descriptors.
To understand how SIFT works, we need to know what the image gradient is. Simply put, the gradient shows us the direction and the magnitude (strength) of the steepest increase in image brightness.
So what does the gradient have to do with describing patches?
We can use the gradients of an image to build up a descriptor that encodes information about the image and is somewhat invariant to visual changes. In SIFT, the patch is split into a 4x4 grid. By computing the number of pixels observed exhibiting specific gradient directions in each individual grid segment, we end up with a histogram, which is the SIFT descriptor.
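The simplified recipe above can be sketched as follows. This is a toy illustration rather than SIFT itself: the function name is hypothetical, the patch is assumed to be square with a side divisible by the grid size, and we make the common choice of weighting each pixel's vote by its gradient magnitude instead of counting pixels. Eight orientation bins per cell on a 4x4 grid give a 128-dimensional descriptor.

```python
import numpy as np

def sift_like_descriptor(patch, grid=4, bins=8):
    """Histogram of gradient orientations per grid cell, concatenated and
    normalized to unit length. Returns a vector of length grid * grid * bins."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)            # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)           # gradient strength per pixel
    orientation = np.arctan2(gy, gx) % (2 * np.pi)  # direction in [0, 2*pi)
    bin_idx = np.minimum((orientation / (2 * np.pi) * bins).astype(int), bins - 1)

    cell = patch.shape[0] // grid          # side length of one grid cell in pixels
    desc = np.zeros((grid, grid, bins))
    for r in range(grid):
        for c in range(grid):
            rows = slice(r * cell, (r + 1) * cell)
            cols = slice(c * cell, (c + 1) * cell)
            # Magnitude-weighted orientation histogram of this cell.
            desc[r, c] = np.bincount(bin_idx[rows, cols].ravel(),
                                     weights=magnitude[rows, cols].ravel(),
                                     minlength=bins)
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# Toy patch: brightness increases from left to right, so every gradient
# points along the positive x direction and falls into orientation bin 0.
patch = np.tile(np.arange(16.0), (16, 1))
descriptor = sift_like_descriptor(patch)
print(descriptor.shape)
```

Because the histogram depends on gradient directions rather than raw brightness, adding a constant to every pixel of the patch leaves the descriptor unchanged.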
Below we see several images, their individual gradient direction responses, and the full SIFT descriptor (histogram) below each image.
The original SIFT method is far more complex than the simplified description above and contains several implementation tricks. The computer vision library VLFeat documents several SIFT details for anyone who wants to dive more deeply. Interested readers who want to dig into the C code can also take a look here.
The SIFT paper has been cited over 65,000 times, and it has been the de-facto standard method used for image matching even as recently as 2015.
However, SIFT often fails to work, especially in difficult cases. For example, consider the set of patches shown below. Their variation is too complex to handle since the simple aggregation of gradient statistics is not enough to cope with the significant change in appearance.
SIFT comes from 1999, which was a completely different era in computer vision. More specifically, it was before the whole field of computer vision shifted its focus to deep learning, especially since Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton won the Imagenet 2012 competition by a significant margin and surprised the computer vision world.
The deep learning era also reshaped image matching methods.
Deep Learning Descriptors
Instead of manually designing a descriptor, deep learning focuses on learning from examples.
A network is fed a large number of annotated patches and predicts whether the patches match. If it makes a mistake, the network parameters are updated such that the mistake is likely to be fixed. After many iterations, the network correctly predicts most of the training examples and it is considered trained. For more information about deep learning, you can check some of the relevant tutorials. In addition, there is an awesome online page from Google where you can play with your own networks.
To understand more about the training of deep learning descriptors, we use a simplified version where the descriptor is a point in the familiar two-dimensional plane (i.e. it is represented by two values). In such a case, the similarity between patches can be computed from their distance in the 2D plane. For example, in the image below, the two patches connected with the shorter, green line are more similar than the patches connected with the longer, red line.
But how can we make use of similarities in the 2D space to train descriptors using deep learning?
How to train your deep descriptor
The state-of-the-art architecture for training patch descriptors using neural networks is the triplet network.
To train a triplet network, we feed it with triplets of patches. Two of the patches are from the same 3D point, while the remaining patch comes from a separate, non-matching 3D point. The job of the network is to produce descriptors such that the similarity between the matching patches is higher than the similarity between the non-matching patches.
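The objective described above is often written as a hinge loss on descriptor distances. The sketch below uses 2D descriptors, as in our simplified picture, and only evaluates the loss without training a network. The function name and the margin value are illustrative assumptions, and this margin formulation is one common choice rather than the exact loss used in any particular method.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean hinge loss over a batch of triplets of descriptors.

    The loss is zero for a triplet once the non-matching patch is at least
    `margin` farther from the anchor than the matching patch.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)  # distance to matching patch
    d_neg = np.linalg.norm(anchor - negative, axis=1)  # distance to non-matching patch
    return np.maximum(0.0, margin + d_pos - d_neg).mean()

# Toy 2D descriptors for a single triplet.
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.1, 0.0]])   # same 3D point, lands nearby
negative = np.array([[2.0, 0.0]])   # different 3D point, lands far away

good_loss = triplet_margin_loss(anchor, positive, negative)
bad_loss = triplet_margin_loss(anchor, negative, positive)  # roles swapped
print(good_loss, bad_loss)
```

Note that the loss only looks at distances measured from each anchor. Nothing in it relates the distances of one triplet to those of another.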
However, training with triplets has one specific downside: There is no enforcement of consistency of distances between dissimilar pairs of patches. This can be seen in the figure below, where after the training, the configuration of the 4 patches in the 2D space is not consistent. More specifically, the distances between the two different families of patches (illustrated using the blue and red lines) are not similar.
This inconsistency between distances in the case of triplet networks was the observation that led to the design of our SOSNet descriptor.
But how can we enforce consistency between different patches? We thought, what if we take it one step further? What if we made use of the similarity between similarities?
SOSNet: From similarities, to similarity of similarities
Second-order matching has been shown to be really useful in graph matching problems. This observation inspired us to devise a training method based on second-order similarity information.
But what is second-order similarity (SOS)? To define it, we need 4 patches instead of 3. Essentially, second-order similarity measures how similar or different similarities between pairs of patches are. In the image on the left, the second-order similarity is low, since the two similarities sim₁ and sim₂ are not equal.
We incorporate higher-order constraints by adding second-order similarity into the training of triplet networks, and we name our method SOSNet. From the figure below, we can observe that the distances between different families of patches (illustrated using the green lines) are now consistent, and thus their second-order similarity is high.
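As a rough sketch of the idea, the penalty below compares two sets of distances: those between the first patches of different 3D points, and those between their matching second patches. It is zero when the two sets agree, which is the situation where second-order similarity is high. The function names and the exact aggregation (a root of summed squared differences, averaged over the batch) are simplifying assumptions made for illustration. The paper gives the precise formulation used for training.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x, shape (N, N)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=2)

def second_order_similarity_penalty(anchors, positives):
    """Penalize inconsistent distances between matching families of patches.

    anchors[i] and positives[i] are descriptors of two patches of the same
    3D point. For every other point j, the distance between anchors i and j
    should equal the distance between positives i and j.
    """
    d_anchor = pairwise_distances(anchors)
    d_positive = pairwise_distances(positives)
    # The diagonals are zero in both matrices, so they contribute nothing.
    per_patch = np.sqrt(((d_anchor - d_positive) ** 2).sum(axis=1))
    return per_patch.mean()

# Toy 2D descriptors for two 3D points.
anchors = np.array([[0.0, 0.0], [1.0, 0.0]])

# Consistent: positives are the anchors shifted by the same offset,
# so the distance between the two families is preserved.
consistent = anchors + np.array([0.0, 0.5])
# Inconsistent: the second positive is much farther from the first.
inconsistent = np.array([[0.0, 0.0], [3.0, 0.0]])

low_penalty = second_order_similarity_penalty(anchors, consistent)
high_penalty = second_order_similarity_penalty(anchors, inconsistent)
print(low_penalty, high_penalty)
```

In training, a term of this kind would be added to the first-order triplet objective, so that the network is rewarded both for pulling matching patches together and for keeping the distances between different families of patches consistent.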
For more information about the technical details of our method, please refer to our paper on arXiv, where it is described in detail together with several experimental results.
Overall, we achieve significant performance gains and, at the time of writing, SOSNet ranks #1 in the ‘WISW contest’, which is devoted to image matching using local image descriptors in both planar and non-planar scenes.
In addition, we provide several experiments that analyse the configuration of descriptors in the space. This is useful for analysing the utilization of the descriptor space across different methods and discovering their strengths and weaknesses.
SOSNet in action: Image Matching
Below we show some examples of how SOSNet compares against SIFT, which is still one of the most commonly used methods for image matching. We can observe that SOSNet correctly matches most of the points even in extremely challenging cases, while SIFT fails to keep up.
SOSNet in action: Building 3D Maps
Finally, we compare SOSNet with a standard triplet network, trained without our second-order similarity loss. To do this, we build 3D maps using a set of 2D images in a small road in London. Note that in both cases, we use exactly the same images in order to build the map.
We can observe that, with SOSNet, the camera motion paths (shown in red) are recovered in their entirety, and the resulting 3D map is significantly more accurate.
A peek into the future
While SOSNet is able to handle significant viewpoint variations, it still relies on the standard Detect-Describe-Match pipeline, which can still fail under extreme changes.
We are now working on the next generation of matching methods, which use attention mechanisms to perform global image matching. More details about this work will be shared soon. For now, we offer a sneak preview below!
If you are interested in hearing more about SOSNet, please come to our Oral presentation at CVPR 2019, where our research intern Yurun Tian will give his talk. We will also be presenting a poster after our oral, where you can have an in-depth conversation with the members of our team. The full conference program can be found here.
Oral Presentation. 13:48, Thursday 20th June 2019.
Poster Presentation. 15:20, Thursday 20th June 2019.
Additionally, if you are interested in learning more about our research projects, would like to collaborate, or would like to join the team, reach out to firstname.lastname@example.org