[CV] 11. Scale-Invariant Local Feature Extraction(1): Auto Scale Selection
1. Motivation: Why scale-invariant?
Let’s recall the local feature extractors we studied in the last article: Harris and Hessian. Both are strong corner detectors, and both are rotation-invariant. However, they are not scale-invariant, which is a crucial drawback for feature detection. The reason why scale invariance matters is visualized intuitively in the following figure, taken from a slide in [2].
Let’s say we have the same image at two different scales (one is an upsampled version of the other), as shown in Figure 1. When the image is at a scale at which the detection window contains the whole corner, the Harris and Hessian detectors can find the corner (left panel).
What happens when the image is upscaled, so that the corner is magnified and becomes larger than the detection window?
As illustrated in the right panel of Figure 1, each detection window now perceives an edge, not a corner. Therefore, the detectors can no longer find the corner that they detected before the scale change.
In short, whether the local feature detectors mentioned above (Harris and Hessian) detect a corner depends on the scale of the image. This is not desirable: we want the detectors to find corners regardless of the scale. This article therefore addresses how to build detectors that find local features in a scale-invariant manner.
2. Fundamentals of Scale-Invariant Detectors
Before getting into the details of scale invariance, two concepts need to be introduced first.
2.1 From point to region
The Harris and Hessian operators return the interest points (x, y) where corners are located. However, to define feature descriptors (feature vector representations) for the interest points and to compare points between images, we need more than just a point of interest: we need a region around it.
Think of the HOG feature descriptor. It computes histograms of gradient orientations over the pixels of a given image patch (region) and builds a feature representation from them.
The simplest way to define a region is to take a fixed window of 16 × 16 pixels around the detected point.
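As a minimal sketch of this fixed-size region, the snippet below cuts a 16 × 16 window around an interest point with NumPy. The function name extract_patch and the choice to return None near the image border are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def extract_patch(image, x, y, size=16):
    """Return a size x size window around (x, y), or None if it leaves the image.

    (x, y) is given as (column, row). With an even size the point sits at
    index (size // 2, size // 2) of the returned patch.
    """
    half = size // 2
    top, left = y - half, x - half
    if top < 0 or left < 0:
        return None
    if top + size > image.shape[0] or left + size > image.shape[1]:
        return None
    return image[top:top + size, left:left + size]
```

Because the window size is fixed, the same 16 × 16 window covers different image content once the image is rescaled, which is exactly the problem the next section deals with.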
2.2 Scale selection
By using feature representations extracted from regions around the interest points, we can compare and measure how similar two image regions are to each other.
Now the question is: given that Harris and Hessian are not scale-invariant, how do we choose the correct scale for the region around each interest point?
2.2.1 Exhaustive search
One naive approach is to compare the feature representation of a region around point A at one fixed scale with the feature representations of regions around point B at all possible scales. This is called ‘exhaustive search’. It is computationally inefficient, but it is still a feasible way to do matching.
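The exhaustive search can be sketched roughly as follows. This is only an illustration under stated assumptions: the descriptor (describe) is a hypothetical stand-in that resizes the patch to 16 × 16 with nearest-neighbour sampling and normalizes the intensities, and all function names are made up for this example. A real pipeline would plug in a proper descriptor.

```python
import numpy as np

def crop_square(image, x, y, half):
    """Square window of side 2 * half + 1 centred on (x, y)."""
    return image[y - half:y + half + 1, x - half:x + half + 1]

def resize_nearest(patch, out_size):
    """Nearest-neighbour resize to out_size x out_size (keeps the sketch NumPy-only)."""
    rows = (np.arange(out_size) * patch.shape[0] / out_size).astype(int)
    cols = (np.arange(out_size) * patch.shape[1] / out_size).astype(int)
    return patch[np.ix_(rows, cols)]

def describe(patch, out_size=16):
    """Hypothetical descriptor: resized, zero-mean, unit-norm intensities."""
    v = resize_nearest(patch.astype(float), out_size).ravel()
    v = v - v.mean()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def exhaustive_scale_search(img_a, pt_a, half_a, img_b, pt_b, candidate_halves):
    """Compare one region around A with regions of every candidate size around B."""
    desc_a = describe(crop_square(img_a, pt_a[0], pt_a[1], half_a))
    best_half, best_dist = None, np.inf
    for half in candidate_halves:
        desc_b = describe(crop_square(img_b, pt_b[0], pt_b[1], half))
        dist = np.linalg.norm(desc_a - desc_b)
        if dist < best_dist:
            best_half, best_dist = half, dist
    return best_half, best_dist

def disk_image(size, radius):
    """Synthetic test image: a bright disk centred in a dark square image."""
    c = size // 2
    yy, xx = np.mgrid[:size, :size]
    return ((yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2).astype(float)

# Image B shows the same disk as image A, but twice as large.
img_a, img_b = disk_image(65, 6), disk_image(129, 12)
best_half, best_dist = exhaustive_scale_search(
    img_a, (32, 32), 10, img_b, (64, 64), range(4, 41, 2))
```

In this toy setup the best-matching window around B is about twice the size of the window around A, but finding it required one descriptor computation per candidate size, for every pair of points to be matched.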
2.2.2 Automatic scale selection
Another approach to selecting the optimal scale is to design a signature function on the region (the image patch around the interest point) that is scale-invariant. More specifically, the function takes a region around the point and outputs a scalar response for that region. Keep in mind that the varying input of this signature function is not the point (x, y), but the size of the region around the point (x, y). Therefore, we can consider the signature function as a function of region size (or image patch width) for a given point in one image.
This definition of the signature function might not sound clear yet. To clarify, the behavior of a signature function f for different region sizes around an interest point is illustrated below.
In Fig 5, we can observe that the function outputs different response values depending on the region size. More importantly, a local maximum of this function is a strong cue: the region size at which it occurs changes together with the image scale, so the region selected this way covers the same image content regardless of the scale.
Note that the signature function we are talking about here is not the Harris or Hessian response function! They are different.
The sequence of images in Fig 6 visualizes the process of automatic scale selection for two images (the same object seen from a different angle, at a different scale, etc.), adapted from [1, 2].
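To make the idea of a function of region size concrete, here is a small sketch. The signature used here (center_surround: mean intensity inside radius s minus mean intensity in the surrounding ring) is a hypothetical stand-in chosen only because its behavior is easy to follow; it is not the function used later in this article. The second image contains the same disk at twice the radius and stands in for a 2x upscaled version of the first.

```python
import numpy as np

def disk_image(size, radius):
    """Synthetic test image: a bright disk centred in a dark square image."""
    c = size // 2
    yy, xx = np.mgrid[:size, :size]
    return ((yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2).astype(float)

def center_surround(image, x, y, s):
    """Stand-in signature: mean inside radius s minus mean in the ring (s, 2s]."""
    yy, xx = np.mgrid[:image.shape[0], :image.shape[1]]
    d2 = (yy - y) ** 2 + (xx - x) ** 2
    inner = d2 <= s ** 2
    ring = (d2 > s ** 2) & (d2 <= (2 * s) ** 2)
    return image[inner].mean() - image[ring].mean()

def select_region_size(image, x, y, sizes):
    """Evaluate the signature for every region size and keep the maximum."""
    responses = [center_surround(image, x, y, s) for s in sizes]
    return sizes[int(np.argmax(responses))], responses

sizes = list(range(2, 31))
small = disk_image(129, 8)    # disk of radius 8
large = disk_image(129, 16)   # same disk at twice the scale
s_small, _ = select_region_size(small, 64, 64, sizes)
s_large, _ = select_region_size(large, 64, 64, sizes)
```

Here the selected sizes are 8 and 16: the maximum of the signature moves with the image scale, so each image yields its region independently, with no need to compare against all scales of the other image.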
In section 2.2.2, we have seen how the optimal scale for an interest point is selected. However, one major question remains unanswered:
Which function can be the signature function used here?
3. Choice of signature function: Laplacian-of-Gaussian
A popular choice is the Laplacian-of-Gaussian (LoG).
If LoG does not ring a bell, please find more detail here, or google it.
Because of its shape, the LoG detects ‘blobs’ in the input image and returns its strongest response around them. To understand why this is the case, think of the nature of filters. Filters have the shape of what they are designed to detect, and they output the maximum response when the input looks like the filter itself. Around the center of the LoG filter we find a blob shape, and this is why LoG is a blob detector.
In short, we are now looking for blobs in the image as local features, together with their suitable scales, by applying LoG filters with varying scale σ, as illustrated in Fig 8.
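The following sketch evaluates a LoG filter at the center of a synthetic blob for a range of σ values and keeps the σ with the strongest response. Two things here are assumptions added for the example and are not stated in the text above: the kernel is multiplied by σ² (the usual scale normalization, so that responses at different σ are comparable), and the absolute value of the response is used so that bright and dark blobs are treated alike. All function names are made up for illustration.

```python
import numpy as np

def normalized_log_kernel(sigma):
    """Scale-normalized LoG kernel: sigma^2 times the Laplacian of a 2D Gaussian."""
    half = int(np.ceil(4 * sigma))
    ax = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    gauss = np.exp(-r2 / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    laplacian = (r2 - 2 * sigma ** 2) / sigma ** 4 * gauss
    return sigma ** 2 * laplacian

def log_response_at(image, x, y, sigma):
    """Filter response at a single pixel (the point must be far enough from the border)."""
    kernel = normalized_log_kernel(sigma)
    half = kernel.shape[0] // 2
    patch = image[y - half:y + half + 1, x - half:x + half + 1]
    return float(np.sum(kernel * patch))

def characteristic_scale(image, x, y, sigmas):
    """Sigma with the strongest absolute LoG response at (x, y)."""
    responses = np.array([abs(log_response_at(image, x, y, s)) for s in sigmas])
    return float(sigmas[int(np.argmax(responses))]), responses

# Synthetic blob: a bright disk of radius 10 in the middle of a dark image.
size, radius = 161, 10
c = size // 2
yy, xx = np.mgrid[:size, :size]
blob = ((yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2).astype(float)

sigmas = np.arange(2.0, 16.01, 0.25)
best_sigma, responses = characteristic_scale(blob, c, c, sigmas)
```

For an ideal disk of radius r, the scale-normalized response at the center peaks at σ = r / √2, so best_sigma should land close to 7.07 here, up to discretization. A larger blob gives a proportionally larger σ, which is what makes the LoG usable as a signature function.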
One thing to keep in mind from Fig 8 is the difference from Fig 5, where the scale corresponding to the maximum of the function is selected for a given interest point. Here, a point (x, y) in the image is considered a candidate interest point if it is a local maximum among its adjacent pixels in the same and neighboring scales (8 pixels from the same scale and 18 from the two neighboring scales, 26 neighbors in total).
Of course, this leads to a larger number of candidates, so some form of thresholding is applied to remove unlikely candidates and to reduce computation.
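The 26-neighbor test and the thresholding can be sketched as below. The input is assumed to be a stack of response maps with shape (number of scales, height, width), for example absolute LoG responses, one map per σ. The function name, the strict comparison, and the way the borders are handled (positions outside the stack count as minus infinity, so the first and last scale can also produce candidates) are assumptions made for this example.

```python
from itertools import product

import numpy as np

def scale_space_maxima(stack, threshold):
    """Find points that beat all 26 neighbours in (scale, y, x) and exceed threshold.

    Returns an array of (scale_index, y, x) rows.
    """
    stack = np.asarray(stack, dtype=float)
    n_scales, height, width = stack.shape
    # Pad with -inf so that border positions can still be compared.
    padded = np.pad(stack, 1, mode="constant", constant_values=-np.inf)

    is_max = stack > threshold  # thresholding removes weak candidates
    for ds, dy, dx in product((-1, 0, 1), repeat=3):
        if (ds, dy, dx) == (0, 0, 0):
            continue  # skip the point itself
        neighbour = padded[1 + ds:1 + ds + n_scales,
                           1 + dy:1 + dy + height,
                           1 + dx:1 + dx + width]
        is_max &= stack > neighbour
    return np.argwhere(is_max)
```

Each returned row gives the scale index and the pixel position of one candidate; the scale index maps back to the σ that produced that response map.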
The above figure is a local feature detection result using LoG both as the detector and as the signature function for automatic scale selection. The center of each red circle indicates the position of an interest point (x, y), and the radius indicates the corresponding scale detected as in Fig 8. This means that interest points found at a larger scale have a larger radius.
So far, we have taken a look at how to detect local features and how to find suitable scales (region sizes) for the detected local feature points. In the next article, we will take a look at how to build a feature representation from the detected local feature points and scales.
References
[1] RWTH Aachen, computer vision group
[2] CS 376: Computer Vision of Prof. Kristen Grauman
[3] S. Seitz
[4] Forsyth & Ponce
[5] Prof. Svetlana Lazebnik
[6] Prof. M. Irani and R. Basri
[7] Prof. Li Fei-Fei
[8] Denis Simakov
[9] Alexej Efros
[10] Prof. Krystian Mikolajczyk
Any corrections, suggestions, and comments are welcome.