Describe-to-Detect (D2D): A Novel Approach for Feature Detection

A new framework for detecting highly informative and discriminative keypoints from dense descriptors.

Bala Manikandan
The Startup
5 min read · Oct 30, 2020

--

In recent years, computer vision has opened up new possibilities in almost every field we can imagine. The focus of research has shifted from classification to applications such as SLAM, multiple object tracking, information retrieval, and camera localization. Extracting meaningful descriptions from the data is a crucial problem in their pipelines. It involves detecting keypoints with a feature detector and extracting information with a feature descriptor. The final features are expected to be highly informative, easily localizable, and adaptable to the different needs of an application.

The history of keypoint detector algorithms goes back to the 1980s, and even with significant progress on neural-network-based descriptors, such detectors are still used in many real-world applications. Most approaches follow one of two main strategies: detect-then-describe, or a joint detect-and-describe framework. While each has its advantages, both suffer from sub-optimal compatibility between detection and description. Describe-to-Detect (D2D) inverts the traditional process by first describing and then detecting the keypoint locations.

Traditional (left, middle) & proposed (right) architectures | source: Paper

There are three main research directions toward improving image matching: non-detector-specific descriptors, non-descriptor-based detection, and jointly learned detection-description. CNN-based descriptors perform significantly better when trained and applied in the same data domain. Similarly, different keypoint detectors suit different tasks. Hence, finding an optimal detector-descriptor pair for a given task requires extensive experiments. D2D aims to provide an approach that adapts a keypoint detector to any CNN-based descriptor without training.

Inspired by detectors based on saliency measures, D2D builds on the idea of Shannon entropy. It defines keypoints based on the absolute and relative saliency of the deep feature maps produced by CNNs. D2D detects keypoints via descriptor similarity in the metric space and therefore makes use of the rich information content across the entire depth. Hold on! Before going into details, let me define some of the keywords used above.

Keypoints: Points in an image that can be repeatably detected under different imaging conditions.

Detector: Finds those keypoints in the given image.

Descriptor: Encodes the local image information around those keypoints.

Absolute saliency: The amount of information contained in a point's descriptor.

Relative saliency: The measure of how unique a point's descriptor is within its spatial neighborhood.

D2D Pipeline | source: Paper

D2D takes dense features from a descriptor as input rather than the image itself. Two properties follow: if the descriptor of a particular point is highly informative, the point has high absolute saliency; alternatively, if the descriptor is highly discriminative within its spatial neighborhood, the point has high relative saliency. However, neither measure alone is sufficient for identifying keypoints. For instance, highly informative but spatially non-discriminative structures cannot be localized, while highly discriminative structures carrying little information are useless. So a point is considered a keypoint only if both saliency measures are high.

Absolute saliency can be measured by computing the entropy of the descriptor. Analogously to binary descriptors, it is computed from F̄(x, y), the mean value of the descriptor F(x, y) across its dimensions.
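As an illustration, here is a minimal sketch of an entropy-style absolute-saliency map in Python. It assumes descriptor values have been normalized to [0, 1] so that the mean F̄ can be treated like a probability; the function name and the binary-entropy form are my reading, not the paper's exact equation.

```python
import numpy as np

def absolute_saliency(F):
    """Entropy-style absolute saliency for a dense descriptor map.

    F: array of shape (C, H, W) -- a C-dimensional descriptor per location,
    assumed normalized to [0, 1] so the mean can act like a probability
    (an assumption for this sketch).
    Returns an (H, W) saliency map.
    """
    eps = 1e-8
    p = F.mean(axis=0)                    # F̄(x, y): mean across dimensions
    p = np.clip(p, eps, 1.0 - eps)        # avoid log(0)
    # Binary entropy of p: maximal when p = 0.5, near zero when p is extreme.
    return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
```

Descriptors whose values cluster at the extremes carry little entropy under this proxy, while balanced descriptors score highly.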

Similarly, the relative saliency of a point can be measured using an autocorrelation-style function, which captures the relationship between a variable's current value and its neighboring values. To accommodate dense descriptors, this is implemented as a sum of squared differences (SSD) between the descriptor F(x, y) centered at location (x, y) and the descriptors at neighboring locations, where || · ||₂ denotes the L2 distance. A high value indicates that the point stands out from its neighbors according to the description provided by the pre-trained descriptor model.
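A minimal sketch of the SSD-style relative saliency, assuming a small square neighborhood (the window size is an assumption, and `np.roll` wraps around the borders, so values near the image edges are only approximate):

```python
import numpy as np

def relative_saliency(F, radius=1):
    """Relative saliency via squared L2 distances to spatial neighbors.

    F: (C, H, W) dense descriptor map. For each location, sum the squared
    L2 distance between its descriptor and each neighbor within `radius`.
    The window size and border handling are implementation assumptions.
    Returns an (H, W) saliency map.
    """
    C, H, W = F.shape
    S = np.zeros((H, W))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the center itself
            shifted = np.roll(F, shift=(dy, dx), axis=(1, 2))
            S += ((F - shifted) ** 2).sum(axis=0)  # ||F - F'||² per location
    return S
```

A location whose descriptor matches its neighbors (e.g., a flat or repeated region) scores near zero, while a locally unique descriptor scores highly.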

Based on the two saliency measures above, a score is assigned to each point by combining them, so that only points with both high absolute and high relative saliency receive a high score.

In practice, we can compute the above quantities from the dense features produced by a fully convolutional network (FCN). For the network architecture used in the paper, an input image of size H × W yields an output feature map of size (H/4 − 7) × (W/4 − 7), and the receptive field is of size 51 × 51. Therefore, each descriptor F(x, y) describes a 51 × 51 region centered at (4x + 14, 4y + 14), with a stride of 4. The research paper provides the implementation results on HardNet as follows,

source: Paper

As shown, Sₐₛ highlights all regions with high intensity variations, while Sᵣₛ scores highly in structured areas. Finally, S combines the two, yielding low scores for repeated or non-textured areas and for edges. Points with Sₐₛ greater than Sᵣₛ are informative but not locally discriminative; this includes repeated textures such as tree leaves and tiles on a building roof, as well as intensity noise in visually homogeneous regions. Conversely, line structures are less informative but can be discriminative relative to adjacent regions, which results in Sᵣₛ greater than Sₐₛ. Note also that the amount of content the network sees within a 51 × 51 patch depends on the resolution of the image.
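The feature-map geometry described earlier can be captured in two small helpers; the function names are mine, while the numbers (stride 4, 51 × 51 receptive field, centers at (4x + 14, 4y + 14)) come straight from the text.

```python
def output_feature_size(H, W, stride=4, border=7):
    """Dense feature-map size for an H x W image: (H/4 - 7) x (W/4 - 7)."""
    return H // stride - border, W // stride - border

def feature_to_image_coords(x, y, stride=4, offset=14):
    """Image-space center of the 51x51 receptive field of feature cell (x, y):
    (4x + 14, 4y + 14)."""
    return stride * x + offset, stride * y + offset
```

These mappings are what let keypoints detected on the coarse feature grid be reported at image-space coordinates.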

D2D offers a significant boost to the matching performance of various descriptors, as shown below,

source: Paper

To conclude, D2D is simple, efficient, requires no training, and can be combined with any existing descriptor. Descriptor saliency is its key property: the absolute and relative saliency measures select keypoints that are highly informative in descriptor space and discriminative in their local spatial neighborhood.

Acknowledgments

This article summarizes the research paper "D2D: Keypoint Extraction with Describe to Detect Approach". Kindly refer to the paper for detailed information on the approach.
