TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition


This article summarizes the paper NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. The paper is influential in image retrieval and provides a key solution for place recognition tasks.

Image Retrieval Result

Convolutional Neural Networks (CNNs) have been at the heart of Computer Vision. As computational resources grew tremendously, researchers focused on improving model performance while keeping computational complexity low. In image retrieval, CNN architectures are therefore used for feature extraction. However, plain CNN features alone still suffer from low performance.

Image retrieval is the task of finding the most similar image in a database. The keyword "similar" is quite subjective, since there is no strict definition of similarity. Moreover, we cannot compute a meaningful similarity directly on raw pixel arrays. To address this, we define a feature extraction function f and a distance function d. This scheme is called Metric Learning, and further details about it are explained below.

The Visual Place Recognition problem focuses on correctly localizing a query image using the information in a database. A candidate solution is image retrieval: for a given query, approximate its location by the location of the most similar database image. This approximation methodology is an instance retrieval task.
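The retrieval-based localization above can be sketched as follows. This is a minimal NumPy illustration with made-up features and locations, not the paper's pipeline:

```python
import numpy as np

def localize(query_feat, db_feats, db_locs):
    """Approximate the query's location by the location of the most
    similar database image (nearest neighbor in feature space)."""
    dists = np.linalg.norm(db_feats - query_feat, axis=1)
    return db_locs[np.argmin(dists)]

# toy database: three images with 4-D features and (x, y) locations
db_feats = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])
db_locs = np.array([[0.0, 0.0], [10.0, 5.0], [30.0, 40.0]])
query = np.array([0.1, 0.9, 0.0, 0.0])          # most similar to image 2
estimated = localize(query, db_feats, db_locs)  # location of image 2: [10., 5.]
```

Everything then hinges on how good the features are, which is exactly what NetVLAD addresses.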

In this article, we are going to introduce NetVLAD, a CNN architecture for weakly supervised place recognition. The paper was published in 2016 and introduced a widely used CNN layer for Visual Place Recognition tasks.

Before NetVLAD

There were many attempts to extract strong local features from images. Unfortunately, plain CNN structures were not well suited to Visual Place Recognition tasks. Furthermore, many "off-the-shelf" components prevent training in an end-to-end manner. The main contributions of this paper are the following:

  1. Create a CNN architecture that is trainable in an end-to-end manner for the Visual Place Recognition.
  2. Gather data that is sufficient to train the CNN architecture.
  3. Use the CNN architecture for feature extraction and evaluate its performance.

Metric Learning

A Slide in Presentation Material

The main idea of Metric Learning is to learn both the distance function and the feature extraction function. For convenience, we usually use parametric functions and learn their parameters. Because an image is just an integer array, it is hard to define a meaningful distance directly on raw pixels, so we use a feature extraction function to obtain local descriptors. NetVLAD uses a CNN with the NetVLAD layer as the feature extraction function and Euclidean distance as the distance function; Euclidean distance was selected because it worked well in the experiments.

VLAD (Vector of Locally Aggregated Descriptor)

To better understand what VLAD is, I recommend you refer to the following links.

VLAD is nothing but a feature quantization technique. It is similar to the familiar Bag of Words and Fisher Vector representations. Thinking of k-means clustering may help build intuition.

VLAD is a K x D matrix that stores information about the clusters. K, the number of clusters, is given as a hyper-parameter, and the K cluster centers are initialized randomly in the embedding space (D is the dimension of the local descriptors). Each of the K rows represents the sum of residuals within one cluster. The key ingredient is the assignment function a_k, which outputs 0 or 1: it is 1 only when cluster k is the closest center, and 0 otherwise.
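A hard-assignment VLAD aggregation can be sketched in NumPy as follows; this is an illustrative re-implementation, not the paper's code:

```python
import numpy as np

def vlad_hard(descriptors, centers):
    """Classic VLAD: each local descriptor x_i is assigned to its nearest
    cluster center c_k (a_k(x_i) = 1, all other a's are 0), and the
    residual x_i - c_k is accumulated into row k of a K x D matrix."""
    K, D = centers.shape
    V = np.zeros((K, D))
    for x in descriptors:
        k = np.argmin(np.linalg.norm(centers - x, axis=1))  # hard a_k
        V[k] += x - centers[k]                              # sum of residuals
    return V
```

Row k stays zero if no descriptor falls into cluster k, and the argmin over centers is precisely the non-differentiable step discussed next.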

Since the function a_k is not differentiable, it breaks end-to-end training. Thus, NetVLAD uses a soft assignment instead: a softmax over negative scaled squared distances, ā_k(x_i) = exp(−α‖x_i − c_k‖²) / Σ_k' exp(−α‖x_i − c_k'‖²). When we make the value of α sufficiently large, this approximates the original hard a_k function.
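A minimal sketch of this soft assignment (α is the sharpness parameter from the paper; the toy values are my own):

```python
import numpy as np

def soft_assign(x, centers, alpha=10.0):
    """Differentiable soft assignment: softmax over -alpha * ||x - c_k||^2.
    As alpha grows, the weights approach the one-hot hard assignment."""
    logits = -alpha * np.sum((centers - x) ** 2, axis=1)
    logits -= logits.max()          # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

With a large α the closest center takes almost all of the weight, recovering the hard assignment in the limit.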

Moreover, we can factorize this function: expanding the squared distance, the exp(−α‖x_i‖²) factor cancels between numerator and denominator, leaving ā_k(x_i) = exp(w_k^T x_i + b_k) / Σ_k' exp(w_k'^T x_i + b_k'), where w_k = 2αc_k and b_k = −α‖c_k‖².

It seems we only need to learn c_k. However, the paper shows that decoupling the dependencies of {c}, {w}, and {b} improves performance: learning {c}, {w}, and {b} as independent parameters gains better performance than learning {c} alone. Further details are introduced in the paper below.
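The factorization can be checked numerically. The sketch below (my own illustration) verifies that at initialization, with w_k = 2αc_k and b_k = −α‖c_k‖², the linear-score softmax equals the distance-based one; NetVLAD then trains {w}, {b}, and {c} independently:

```python
import numpy as np

def soft_assign_dist(x, centers, alpha):
    """Distance-based softmax: exp(-alpha * ||x - c_k||^2), normalized."""
    logits = -alpha * np.sum((centers - x) ** 2, axis=1)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def soft_assign_linear(x, w, b):
    """Linear-score softmax: exp(w_k^T x + b_k), normalized."""
    logits = w @ x + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
alpha = 5.0
centers = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
w_init = 2 * alpha * centers                    # w_k = 2*alpha*c_k
b_init = -alpha * np.sum(centers ** 2, axis=1)  # b_k = -alpha*||c_k||^2
# the alpha*||x||^2 term cancels in the softmax, so both forms agree:
assert np.allclose(soft_assign_dist(x, centers, alpha),
                   soft_assign_linear(x, w_init, b_init))
```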

The overall network structure is the following.

NetVLAD Structure

The NetVLAD layer is attached after the conv5 layer and aggregates the convolutional features into the VLAD format. It performs intra-normalization and then L2-normalization at the end. Further details are explained in the paper.
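Putting the pieces together, here is a NumPy sketch of the NetVLAD layer's forward pass (soft assignment, residual aggregation, intra-normalization, global L2-normalization). It is an illustration under my own conventions, not the authors' implementation:

```python
import numpy as np

def netvlad_forward(descriptors, centers, alpha=10.0):
    """descriptors: N x D local features (e.g. conv5 activations at each
    spatial location); centers: K x D cluster centers.
    Returns the (K*D)-dimensional normalized NetVLAD vector."""
    diff = descriptors[:, None, :] - centers[None, :, :]   # N x K x D residuals
    logits = -alpha * np.sum(diff ** 2, axis=2)            # N x K
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                      # soft assignments
    V = np.einsum('nk,nkd->kd', a, diff)                   # weighted residual sums
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                 # final L2-normalization
```

With K = 64 centers and D = 512 conv5 channels this yields a 32,768-D descriptor, which the paper further compresses with PCA.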

Annotating Data

Unfortunately, in 2016 there was no dataset with ground truth for this task. Thus, the authors used Weak Supervision as a solution.

Weak Supervision is supervision with noisy labels, used due to a lack of manually annotated data.

The authors used Google Street View Time Machine, which provides only images and their locations. For each query image, the other database images are classified into potential positives and definite negatives: potential positives are the images within 10 m of the query image, and definite negatives are the images more than 25 m away. With this, we can intuitively understand the following equations.

The loss becomes large when a definite negative is as close to the query as the best potential positive, and small when the most similar potential positive is closer to the query than every definite negative by a margin. This loss function is called a triplet loss. To understand what the triplet loss function is, I recommend you this article
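The weakly supervised triplet ranking loss can be sketched as follows (a hinge over squared distances; the `margin` value and shapes here are illustrative):

```python
import numpy as np

def weak_triplet_loss(q, positives, negatives, margin=0.1):
    """For each definite negative n_j, require the best (closest) potential
    positive to beat it by `margin`:
        sum_j max(0, min_i d^2(q, p_i) + margin - d^2(q, n_j))"""
    d_best_pos = np.min(np.sum((positives - q) ** 2, axis=1))
    d_neg = np.sum((negatives - q) ** 2, axis=1)
    return float(np.sum(np.maximum(0.0, d_best_pos + margin - d_neg)))
```

The min over positives is what handles the weak labels: we only assume that at least one potential positive truly depicts the query's place.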

Evaluation Protocol and Experiment Details

The experiments use the Pittsburgh (Pitts250k, Pitts30k) and Tokyo 24/7 datasets, which are based on Google Street View Time Machine. The evaluation protocol is recall, the percentage of correctly recognized queries: a query is deemed correctly localized when at least one of the top-N retrieved database images is within 25 m of it. The hyper-parameter K is set to 64.
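The recall@N protocol can be sketched as follows (planar locations in meters are assumed for the distance check, which is a simplification of real GPS coordinates):

```python
import numpy as np

def recall_at_n(q_feats, db_feats, q_locs, db_locs, n=5, radius=25.0):
    """Fraction of queries for which at least one of the top-n retrieved
    database images lies within `radius` meters of the query's location."""
    hits = 0
    for qf, ql in zip(q_feats, q_locs):
        top = np.argsort(np.linalg.norm(db_feats - qf, axis=1))[:n]
        if (np.linalg.norm(db_locs[top] - ql, axis=1) <= radius).any():
            hits += 1
    return hits / len(q_feats)
```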

Result

  1. NetVLAD based on VGG16 convincingly outperforms RootSIFT + VLAD + whitening, a pipeline composed of "off-the-shelf" techniques.
  2. NetVLAD produces rich yet compact image representations for place recognition.
  3. NetVLAD performs better than max-pooling in visual place recognition tasks.
  4. Regardless of which network backbone is selected (AlexNet, VGG16, Places205), it outperforms other state-of-the-art techniques.

The above image shows the result of another experiment, which measures how performance changes as the lowest trained layer is moved deeper into the network. Training more layers improves performance; however, training all layers causes overfitting, so performance drops.

Conclusion

NetVLAD offers a powerful pooling mechanism with learnable parameters that can be easily plugged into any other CNN architecture. Since all of the functions in NetVLAD are differentiable, the whole network can be trained in an end-to-end manner. Due to this convenience, it is still a beloved method in Visual Place Recognition tasks.

Presentation URL: https://docs.google.com/presentation/d/168ErmavKMUHGHdNAG9j-IVXhcmPgSudTLnOWu_4McxQ/edit?usp=sharing

Paper URL: https://arxiv.org/pdf/1511.07247.pdf

Contact me: jeongyw12382@postech.ac.kr
