[CV] 10. Local Feature Descriptors: Harris and Hessian Corner Detector
1. Motivation
1.1 Why do we need to find local features?
In the last article, we extracted global features that represent a given image patch in order to detect objects within it. While global feature representations do their job nicely, they have some major limitations in certain situations, such as occlusions, intra-category variations, and articulations.
An alternative is to extract (describe) and match only local regions, which brings increased robustness in the mentioned situations.
Fig 1 shows the difference between global and local feature extraction approaches.
While the global feature representations take a whole image (or image patch) as input, the local feature representations are built for each local region within the image (or patch).
1.2 Application of local features
(1) Image matching
Local features are useful when it comes to image matching tasks. Let’s say we have two images taken by two tourists. How do we determine whether these two pictures show the same tourist spot?
To answer the question, one can extract local features from the pictures and compare them. If their similarity is higher than a certain threshold, we conclude that the two pictures were taken at the same tourist spot.
(2) Image stitching
Another application of local feature representations is image stitching. Think of two images of the same object taken from different angles that we want to combine properly so that they become one picture.
Fig 3 shows two images of a mountain, and the task is to stitch them. To do this, we can begin with
- detecting feature points in both images, noted as white dots in the following figure,
- then finding corresponding pairs of detected feature points (upper panel of Fig 4),
- and lastly, using the found pairs to align the images so that we obtain the final stitched image (lower panel of Fig 4).
I briefly introduced how local features can be used to solve the two tasks mentioned. Now let’s get into the details.
2. Overview of Local Feature Matching
The general approach for both applications, image matching and image stitching, with local feature representations can be summarized as the following steps (Fig 5):
(1) Find distinctive key points from given images.
(2) Define a local region around each keypoint.
(3) Extract and normalize the region content.
(4) Compute a local descriptor for each normalized region using pixel information (e.g. gradient orientation and magnitude).
(5) Match the local descriptors between images (see the sketch below).
Note that the reason we need to define a region around each keypoint is that we cannot generate a feature representation from a single-pixel keypoint. Instead, we need a set of pixels (a region) around the keypoint to produce a feature representation (think of how feature descriptors such as HOG are defined).
If this is not clear, please read the previous article regarding how feature representations are built using pixels within an image patch.
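To make step (5) concrete, here is a minimal sketch of descriptor matching, assuming the descriptors have already been computed as fixed-length vectors (the arrays desc_a and desc_b below are hypothetical placeholders):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, max_dist=0.5):
    """Match two sets of local descriptors by Euclidean distance.

    desc_a: (N, D) array, one D-dimensional descriptor per keypoint in image A
    desc_b: (M, D) array, descriptors of image B
    Returns a list of (i, j) index pairs that are mutual nearest
    neighbors and closer than max_dist.
    """
    # Pairwise Euclidean distances between all descriptor pairs, shape (N, M)
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)

    matches = []
    for i in range(dists.shape[0]):
        j = np.argmin(dists[i])           # nearest descriptor in B for a_i
        if np.argmin(dists[:, j]) == i and dists[i, j] < max_dist:
            matches.append((i, j))        # keep only mutual nearest neighbors
    return matches
```

A mutual-nearest-neighbor check like this is one common way to reject ambiguous matches before estimating the alignment between the images.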
2.1 Common problems
The local feature descriptors must meet two requirements in order to do their job successfully:
- The same keypoints should be detected independently in both images.
- For each keypoint in one image, the descriptor should produce a feature representation reliable enough to find the corresponding keypoint in the other image.
If there is no keypoint pair found, then matching and stitching are impossible.
2.2 Keypoint candidates
The keypoints we are looking for should have the following properties:
- should be repeatably detected
- should be precisely localized
- should contain interesting contents
There are several types of keypoints with the mentioned properties, such as blobs and corners. Among them, this article addresses two detectors that are specialized in finding corners.
The main idea behind detecting corners as keypoints is that the region around a corner has two (or more) dominant gradient directions, which can be used as a repeatable and distinctive feature representing the region.
2.3 Why corners as distinctive interest points?
Fig 6 shows three types of regions we can generally observe in images.
From the localization perspective, a keypoint should be distinctive from the other points in the region around it, and the corner is the most suitable type of region for this purpose: unlike the other two region types, it contains a uniquely localizable point.
Let’s elaborate on this with the example in Fig 6. The windows in each panel indicate where our detector is currently looking. Firstly, in flat regions, regardless of the direction we move the window, the gradient magnitudes and orientations of the pixels in the window do not change. This means there is no distinctive keypoint in this region, as all pixels look the same.
Secondly, in the edge regions, moving the window along the edge does not make any change in pixel gradients. Shifting the window in the direction of the pixel gradient, however, does make a change. Nevertheless, it is not sufficiently strong.
Lastly, the corner, unlike the previous two types of regions, yields a significant change in gradient magnitude and orientation when the window is shifted in any direction. This happens because the pixel RGB (or intensity) values in the window change greatly as it is shifted. This behavior is greatly helpful for good localization, and therefore corners in the image are the local features our detector is looking for.
Now, let’s take a look at how detectors work to find local features (specifically, corners).
3. Harris Detector
3.1 Second-moment matrix
We know that corners are regions of an image with a large variation in intensity between pixels in all directions (i.e. large gradient magnitudes). The Harris detector starts from this characteristic of corners: it measures, for a window shift (u, v), the change in pixel intensities inside the window via the function E(u, v), and looks for points where E is large for every shift direction, as expressed below.
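For reference, the standard form of these equations is (W is the window, w(x, y) a window function):

```latex
E(u, v) = \sum_{(x, y) \in W} w(x, y)\,\bigl[ I(x + u,\, y + v) - I(x, y) \bigr]^2
\;\approx\; \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix},
\qquad
M = \sum_{(x, y) \in W} w(x, y) \begin{bmatrix} I_x^{2} & I_x I_y \\ I_x I_y & I_y^{2} \end{bmatrix}
```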
By applying a Taylor expansion, we can represent E(u, v) in the form of a matrix multiplication, as in the second line of the equations above. What is important here is the matrix M, called the second-moment matrix, where I𝑥 and I𝑦 are the x- and y-derivatives of the image, respectively.
3.1.1 Why is the second-moment matrix important?
It’s because, by looking at the matrix, we can get a clue about whether a corner lies inside the window or not.
For example, let’s say our image window is like the one in Fig 7, which contains an axis-aligned corner. Then M is guaranteed to be a diagonal matrix, since the off-diagonal entries (I𝑥 × I𝑦) are zero.
If it is not clear why I𝑥 × I𝑦 = 0, take a look at the edge regions, which are marked with arrows in Fig 7.
- On the vertical edge, I𝑦 = 0, as there is no change in pixel intensity along the edge.
- Similarly, on the horizontal edge, I𝑥 = 0.
- In the flat regions, both I𝑥 and I𝑦 are zero, as pixel intensities change in neither the x nor the y direction.
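Written out, the axis-aligned case therefore gives a diagonal M whose diagonal entries are its eigenvalues:

```latex
M = \sum_{(x, y) \in W} w(x, y) \begin{bmatrix} I_x^{2} & 0 \\ 0 & I_y^{2} \end{bmatrix}
  = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}
```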
As described, the second-moment matrix gives us a strong clue about the existence of a corner in a window. However, assuming that corners are axis-aligned is too naive in practice. We need to ensure that the second-moment matrix M works in more general cases, i.e. that it can detect non-axis-aligned corners as well. Fortunately, it can!
The following figure shows the constructed M in the general case where a corner is not axis-aligned. By definition, M is symmetric, and therefore it is orthogonally diagonalizable (recall the property of symmetric matrices regarding diagonalization).
M becomes a product of two rotation (orthogonal) matrices and a diagonal matrix whose entries are the eigenvalues. As one might notice, this diagonal matrix is, again, what we need in order to determine the existence of a corner.
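Concretely, the eigendecomposition of the symmetric matrix M has the standard form:

```latex
M = R^{-1} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} R
```

where R is a rotation (orthogonal) matrix built from the eigenvectors of M, and λ₁, λ₂ are the eigenvalues.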
For further detail regarding how the ellipse in Fig 7 is constructed, please refer to [12].
3.2 Interpreting the second-moment matrix
So far, we have constructed the matrix M, which is the key component for corner detection. Now, the following content addresses how exactly the detector determines the existence of a corner, making use of the second-moment matrix M.
The bird’s-eye view for interpreting M according to its eigenvalues is as below.
For each image point (pixel) in the window, we compute the second-moment matrix M to obtain its eigenvalues, and based on their values, the detector classifies the type of region the pixel belongs to. The verbalized decision process for region classification is as follows.
- If one eigenvalue is significantly higher than the other => the derivative with respect to one direction is much stronger than the other => the pixel lies on an edge
- If both eigenvalues are small => pixel intensities do not change in any direction => the pixel lies in a flat region
- If both eigenvalues are large => pixel intensities change largely in both the x and y directions => the pixel lies on a corner
In order to automate the above process in a mathematical manner, the corner response function R is defined as shown in Fig 9.
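For reference, the standard definition of the Harris response, with an empirical constant k typically chosen around 0.04–0.06, is:

```latex
R = \det(M) - k\,(\operatorname{trace} M)^{2} = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^{2}
```

R is large and positive near corners, negative near edges (one dominant eigenvalue), and small in magnitude in flat regions, so the three cases above can be separated by simply thresholding R, without computing the eigenvalues explicitly.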
3.3 Choice of window function w(x, y)
Before summing up the Harris detector, one thing left is the choice of the window function w(x, y) used when the second-moment matrix M is computed. In the figures above, a uniform window is chosen as the window function. But it has one severe problem: the matrix M computed with a uniform window is not rotation invariant.
An alternative window function that guarantees rotation invariance is a Gaussian: each derivative map (I𝑥², I𝑥 × I𝑦, I𝑦²) is convolved with a Gaussian filter.
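With a Gaussian window g(σ), the second-moment matrix can then be written compactly as a convolution:

```latex
M = g(\sigma) * \begin{bmatrix} I_x^{2} & I_x I_y \\ I_x I_y & I_y^{2} \end{bmatrix}
```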
3.4 Steps of Harris Detector
We have now gained all the necessary knowledge about how the Harris detector finds local features (corners). All the details above can be summarized in 5 steps (including non-maximum suppression, NMS, after the 4th step).
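As an illustration, here is a minimal NumPy/SciPy sketch of the five steps; the values of sigma, k, and the threshold are illustrative assumptions, not tuned settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris_corners(img, sigma=1.0, k=0.05, rel_thresh=0.01):
    """Return (row, col) coordinates of detected corners in a grayscale image."""
    img = img.astype(np.float64)

    # Step 1: image derivatives I_x, I_y (np.gradient returns the derivative
    # along axis 0 (rows, y) first, then along axis 1 (columns, x))
    Iy, Ix = np.gradient(img)

    # Step 2: products of derivatives at each pixel
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    # Step 3: Gaussian window -> entries of the second-moment matrix M
    Sxx = gaussian_filter(Ixx, sigma)
    Syy = gaussian_filter(Iyy, sigma)
    Sxy = gaussian_filter(Ixy, sigma)

    # Step 4: corner response R = det(M) - k * trace(M)^2
    R = (Sxx * Syy - Sxy * Sxy) - k * (Sxx + Syy) ** 2

    # Step 5: threshold + non-maximum suppression in a 3x3 neighborhood
    mask = (R > rel_thresh * R.max()) & (R == maximum_filter(R, size=3))
    return np.argwhere(mask)
```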
3.5 Experiment with Harris Detector
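For experiments, OpenCV [11] provides a built-in implementation via cv2.cornerHarris; a brief usage sketch (the image path is a hypothetical placeholder):

```python
import cv2
import numpy as np

img = cv2.imread("mountain.jpg")          # hypothetical input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)

# blockSize: neighborhood (window) size, ksize: Sobel aperture for the
# derivatives, k: the empirical constant in the response function
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

img[R > 0.01 * R.max()] = (0, 0, 255)     # mark strong responses in red
cv2.imwrite("harris_result.jpg", img)
```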
4. Hessian Detector
While the basic idea of detecting corners remains the same as in the Harris detector, the Hessian detector makes use of the Hessian matrix and its determinant instead of the second-moment matrix M and the corner response function R, respectively.
Note that the entries of the Hessian matrix are second derivatives.
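In their standard form, the Hessian matrix and the determinant-based response are:

```latex
H(x, y) = \begin{bmatrix} I_{xx} & I_{xy} \\ I_{xy} & I_{yy} \end{bmatrix},
\qquad
\det(H) = I_{xx} I_{yy} - I_{xy}^{2}
```

The detector keeps the points where det(H) is large, again followed by non-maximum suppression.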
5. Limitation
Harris and Hessian detectors are rotation invariant. However, they are not scale invariant, which is a crucial drawback for feature detection. The reason why scale invariance is important is intuitively visualized in the following slide from [2].
Once a corner gets magnified by zooming and becomes bigger than the window, the Harris and Hessian detectors can no longer detect it: what the detectors perceive through the window is no longer a corner but an edge, due to the scale change.
In order to overcome this limitation, a great deal of research has been devoted to scale-invariant feature detectors, which will be the main topic of the next article.
6. Reference
[1] RWTH Aachen, Computer Vision Group
[2] CS 376: Computer Vision, Prof. Kristen Grauman
[3] S. Seitz
[4] Forsyth & Ponce
[5] Prof. Svetlana Lazebnik
[6] Prof. M. Irani and R. Basri
[7] Prof. Fei-Fei Li
[8] Denis Simakov
[9] Alexei Efros
[10] Hessian-Laplace
[11] OpenCV
[12] Harris ellipse derivation
Any corrections, suggestions, and comments are welcome.