A Beginner's Guide to Computer Vision (Part 5) - Scale Invariant Feature Transform (SIFT), Part 1
One of the most cited papers in the history of computer science. Let's learn about it and implement it in Python from scratch.
We have already discussed the Harris interest point detector in my previous article. In case you cannot recall it, go through the article given below.
What is SIFT?
SIFT is an interest point detector and a descriptor. The algorithm was developed by David Lowe, and its patent rights are held by the University of British Columbia. It is the fourth most cited paper in the history of computer science. The full form of SIFT is Scale Invariant Feature Transform, and the name itself gives us a lot of information: SIFT is a detector that is invariant to scale and rotation, and partially invariant to affine transformations. It is robust to noise and also invariant to changes in illumination.
There are four major steps in the SIFT algorithm: 1) Extrema Detection from Scale Space, 2) Keypoint Localization, 3) Orientation Assignment, and 4) Keypoint Descriptor.
The code in this article is adapted from here, which is implemented in C++.
1) Extrema Detection from Scale Space
Interest points should be invariant to scale or affine transformations. That means we should search for interest points across different scales.
Have a look at the image above. Flowers closer to the camera appear larger than flowers farther away due to the perspective effect. We need to detect flowers both close to and far from the camera.
Do you remember the Laplacian of Gaussian (LoG) filter, which we discussed in the Marr-Hildreth edge detector article? If you cannot recall it, go through the link below.
The Laplacian of Gaussian function is shaped like an inverted Mexican hat. Look at the image below.
We have three blobs of increasing size on the X-axis and Laplacian of Gaussian filters of increasing size (sigma) on the Y-axis. If you interpret the responses, the response is strongest where the size of the LoG filter matches the size of the blob.
From the above intuition, you might have realized that applying one filter (one sigma) is not going to detect all the interest points in an image. So we generate a stack of images by applying the LoG filter with increasing sigma values. This stack of images can be written as a discrete function f(x, y, σ). When we build the stack in a pyramid format, it is called a scale space.
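As a toy illustration (my own sketch, not part of the SIFT implementation, assuming NumPy and SciPy are available), the snippet below shows that the scale-normalized LoG response of a synthetic blob peaks when sigma matches the blob size; for a disc of radius r, the peak is expected near σ = r/√2.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

# Synthetic image: a single bright disc (blob) of radius 8 on a dark background.
size, radius = 64, 8
y, x = np.mgrid[:size, :size]
img = ((x - size // 2) ** 2 + (y - size // 2) ** 2 <= radius ** 2).astype(float)

# Scale-normalized LoG response at the blob centre for increasing sigma.
# Multiplying by sigma^2 makes responses comparable across scales.
for sigma in (2.0, 4.0, radius / np.sqrt(2), 8.0, 12.0):
    response = sigma ** 2 * gaussian_laplace(img, sigma)
    centre = abs(response[size // 2, size // 2])
    print(f"sigma = {sigma:5.2f}  |response at centre| = {centre:.4f}")
```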
Difference of Gaussian (DoG or D): an approximation to LoG
The difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G.
You can clearly understand this relation from the heat diffusion equation,

∂G/∂σ = σ∇²G
Rewriting the above equation using the difference of nearby scales at kσ and σ,

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)
and therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G
The above equation shows that the difference between two Gaussian-blurred images approximates the scale-normalized Laplacian of Gaussian. We exploit this phenomenon to create our scale space. Look at the image below to understand it in more detail.
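As a quick numerical sanity check (again a sketch of my own, assuming NumPy and SciPy), the DoG of a test image should track (k − 1)σ²∇²G up to discretization error:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

rng = np.random.default_rng(0)
img = gaussian_filter(rng.random((128, 128)), 2.0)  # smooth random test image

sigma, k = 1.6, 2 ** (1 / 3)  # Lowe's base sigma and interval ratio (s = 3)
dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
log = (k - 1) * sigma ** 2 * gaussian_laplace(img, sigma)

# The two responses should be close, up to discretization error.
print(f"max relative deviation: {np.abs(dog - log).max() / np.abs(log).max():.3f}")
```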
Lowe suggested using a sigma value of 1.6 for the first image and increasing sigma by multiplying it with k = 2^(1/s) until we complete an octave (stack). Here s is the number of intervals, whose value is taken as 3 as suggested by Lowe, and the number of images per octave is given by s + 3. When we move to the next level of the pyramid, we resize the image that sits two images from the top of the previous stack (the one with twice the base sigma) to half its size. We call each such stack of images an octave.
Look at the code snippet below to understand scale space generation.
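Here is a minimal sketch of the idea (a simplified version of my own, not the linked repository's code; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_scale_space(img, n_octaves=4, s=3, sigma0=1.6):
    """Return per-octave lists of Gaussian images and their DoG differences."""
    k = 2 ** (1 / s)  # scale ratio between adjacent images in an octave
    gaussians, dogs = [], []
    # Bring the input to the base blur level sigma0. (Lowe additionally
    # doubles the input image first; that step is omitted here for brevity.)
    base = gaussian_filter(img.astype(float), sigma0)
    for _ in range(n_octaves):
        octave = [base]
        for i in range(1, s + 3):  # s + 3 Gaussian images per octave
            # Incremental blur taking sigma0 * k^(i-1) up to sigma0 * k^i.
            prev, curr = sigma0 * k ** (i - 1), sigma0 * k ** i
            octave.append(gaussian_filter(octave[-1], (curr**2 - prev**2) ** 0.5))
        gaussians.append(octave)
        dogs.append([octave[i + 1] - octave[i] for i in range(s + 2)])
        # Next octave: the image two from the top (blur 2 * sigma0),
        # downsampled by a factor of 2.
        base = octave[s][::2, ::2]
    return gaussians, dogs
```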
Local Extrema Detection
In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared to its eight neighbors in the current image and its nine neighbors in each of the scales above and below. Look at the image below.
It is selected as an extremum only if it is larger than all 26 of these neighbors or smaller than all of them.
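A minimal sketch of this test (NumPy assumed; the function name is my own):

```python
import numpy as np

def is_extremum(dog_below, dog_curr, dog_above, i, j):
    """Check pixel (i, j) of the middle DoG image against its 26 neighbours."""
    cube = np.stack([dog_below[i-1:i+2, j-1:j+2],
                     dog_curr[i-1:i+2, j-1:j+2],
                     dog_above[i-1:i+2, j-1:j+2]])
    centre = cube[1, 1, 1]
    others = np.delete(cube.ravel(), 13)  # the 26 neighbours (drop the centre)
    return bool((centre > others).all() or (centre < others).all())
```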
2) Keypoint Localization
We refine these extrema in two stages: 1) removing unstable extrema caused by low contrast, and 2) removing poorly localized edge responses. Before keypoint localization, the image values are assumed to be normalized to the range [0, 1], so that the contrast threshold below is meaningful.
Removing the unstable extrema points due to low contrast
We represent a level in scale space as D(x, y, σ). We expand this function at a sample point using a Taylor series,

D(x) = D + (∂D/∂x)ᵀ x + ½ xᵀ (∂²D/∂x²) x

where x = (x, y, σ)ᵀ is the offset from the sample point. We can find the extremum location by taking the derivative of the above equation with respect to x and setting it to zero. Doing this gives the shift vector in the x, y, and σ directions,

x̂ = −(∂²D/∂x²)⁻¹ (∂D/∂x)
If the offset x̂ is larger than 0.5 in any dimension, the extremum lies closer to a different sample point. In this case, the sample point is changed and the interpolation is performed about that point instead. The final offset x̂ is added to the location of its sample point to get the interpolated estimate of the extremum's location. The function value at the extremum is

D(x̂) = D + ½ (∂D/∂x)ᵀ x̂
If |D(x̂)| is less than 0.03, the keypoint is rejected.
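Below is a sketch of this quadratic fit using central finite differences on the 3×3×3 neighbourhood (NumPy assumed; the names are mine, not Lowe's):

```python
import numpy as np

def localize(cube):
    """cube: 3x3x3 DoG values (scale, y, x) centred on a candidate extremum."""
    c = cube[1, 1, 1]
    # First derivatives of D by central differences.
    grad = 0.5 * np.array([cube[2, 1, 1] - cube[0, 1, 1],   # dD/d(sigma)
                           cube[1, 2, 1] - cube[1, 0, 1],   # dD/dy
                           cube[1, 1, 2] - cube[1, 1, 0]])  # dD/dx
    # Second derivatives (the 3x3 Hessian) by central differences.
    dss = cube[2, 1, 1] + cube[0, 1, 1] - 2 * c
    dyy = cube[1, 2, 1] + cube[1, 0, 1] - 2 * c
    dxx = cube[1, 1, 2] + cube[1, 1, 0] - 2 * c
    dsy = 0.25 * (cube[2, 2, 1] - cube[2, 0, 1] - cube[0, 2, 1] + cube[0, 0, 1])
    dsx = 0.25 * (cube[2, 1, 2] - cube[2, 1, 0] - cube[0, 1, 2] + cube[0, 1, 0])
    dyx = 0.25 * (cube[1, 2, 2] - cube[1, 2, 0] - cube[1, 0, 2] + cube[1, 0, 0])
    hess = np.array([[dss, dsy, dsx],
                     [dsy, dyy, dyx],
                     [dsx, dyx, dxx]])
    offset = -np.linalg.solve(hess, grad)  # the shift vector x_hat
    value = c + 0.5 * grad.dot(offset)     # D(x_hat)
    return offset, value  # reject if |value| < 0.03 (image values in [0, 1])
```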
Removing poorly localized edge responses
A poorly defined peak in the difference-of-Gaussian function has a large principal curvature across the edge but a small one in the perpendicular direction. The principal curvatures can be computed from a 2 × 2 Hessian matrix, H, computed at the location and scale of the keypoint:

H = | Dxx  Dxy |
    | Dxy  Dyy |

where the derivatives are estimated by taking differences of neighboring sample points.
The eigenvalues of H indicate the principal curvatures of D. We already discussed the same concept in my article on Harris interest point detection. If you cannot recall it, go through the link below.
Let α be the eigenvalue with the larger magnitude and β the smaller one. Then we can compute the sum of the eigenvalues from the trace of H and their product from the determinant:

Tr(H) = Dxx + Dyy = α + β
Det(H) = Dxx·Dyy − (Dxy)² = αβ
Let r be the ratio between the larger-magnitude eigenvalue and the smaller one, so that α = rβ. Then

(Tr(H))² / Det(H) = (α + β)² / (αβ) = (r + 1)² / r

which depends only on the ratio of the eigenvalues and grows with r.
We will reject the keypoint if the condition below is not satisfied (Lowe suggests r = 10):

(Tr(H))² / Det(H) < (r + 1)² / r
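A sketch of this check (NumPy assumed; passes_edge_test is my own name):

```python
import numpy as np

def passes_edge_test(dog, i, j, r=10.0):
    """Return True if the keypoint at (i, j) is not a poorly localized edge."""
    dxx = dog[i, j + 1] + dog[i, j - 1] - 2 * dog[i, j]
    dyy = dog[i + 1, j] + dog[i - 1, j] - 2 * dog[i, j]
    dxy = 0.25 * (dog[i + 1, j + 1] - dog[i + 1, j - 1]
                  - dog[i - 1, j + 1] + dog[i - 1, j - 1])
    trace, det = dxx + dyy, dxx * dyy - dxy ** 2
    if det <= 0:  # curvatures have different signs: discard as an edge
        return False
    return trace ** 2 / det < (r + 1) ** 2 / r  # Lowe suggests r = 10
```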
Look at the code snippet below to understand extrema detection and localization.
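As a rough driver sketch tying the previous pieces together (it reuses the helper functions defined above; note that a faithful implementation re-interpolates at the shifted sample point when the offset exceeds 0.5, rather than discarding the point as done here):

```python
import numpy as np

def detect_keypoints(dog_octave, contrast_thresh=0.03):
    """Scan one octave's DoG stack for stable, well-localized extrema."""
    keypoints = []
    for s in range(1, len(dog_octave) - 1):
        below, curr, above = dog_octave[s - 1], dog_octave[s], dog_octave[s + 1]
        h, w = curr.shape
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                if not is_extremum(below, curr, above, i, j):
                    continue
                cube = np.stack([im[i-1:i+2, j-1:j+2]
                                 for im in (below, curr, above)])
                try:
                    offset, value = localize(cube)
                except np.linalg.LinAlgError:  # degenerate Hessian
                    continue
                if abs(value) < contrast_thresh or np.abs(offset).max() > 0.5:
                    continue  # low contrast, or extremum nearer another sample
                if passes_edge_test(curr, i, j):
                    keypoints.append((s, i, j, offset))
    return keypoints
```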
Running all this analysis on an image taken in front of my college gate gives results like those shown below.
We are left with orientation assignment and the keypoint descriptor. I will end this article at this point and continue with the remaining topics in the next part. Below is the link to the code shown in this article.
Stay tuned for more on Computer Vision :)