SIFT (Scale Invariant Feature Transform)

TsungYu (James) Chien
Jul 30, 2019 · 5 min read


Summary: SIFT extracts image features in a way that is invariant to changes in scale.

What does Scale Invariant mean?

from OpenCV

Let me define the "feature" first.

A feature can be a corner or an edge. Features carry meaningful information about an image without needing every pixel.

Hence, as you can see, the corners on the left side can be extracted with a small filter, such as a Gaussian filter. However, when the image becomes larger, as on the right side, the same filter no longer fits the corner. This is the scale-invariance problem.

SIFT comes to help.

1/5 Scale-space Extrema Detection

from OpenCV

A Gaussian with a smaller scale factor fits small corners; for bigger corners, a bigger scale factor is used.

Computing the Laplacian of Gaussian (LoG) at every scale is costly. Hence, SIFT instead takes the difference between the image blurred with two different scale factors of the Gaussian filter, the Difference of Gaussians (DoG), which approximates the LoG.
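
As a rough sketch (the image path and the σ values below are just placeholders, not the exact values SIFT uses internally), the DoG for one pair of scales is simply the difference of two Gaussian-blurred copies of the image:

```python
import cv2
import numpy as np

# Load a grayscale image; "image.jpg" is a placeholder path.
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Two neighbouring scales; 1.6 and 1.6*sqrt(2) are illustrative values.
sigma1 = 1.6
sigma2 = 1.6 * np.sqrt(2)

# ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
blur1 = cv2.GaussianBlur(img, (0, 0), sigma1)
blur2 = cv2.GaussianBlur(img, (0, 0), sigma2)

# Difference of Gaussians: an approximation of the Laplacian of Gaussian.
dog = blur2 - blur1
```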

from OpenCV

To find candidate keypoints, each pixel in the DoG images is compared with its 8 neighbours in the same scale, as well as 9 pixels in the scale above and 9 in the scale below. If the pixel is a local extremum among these 26 neighbours, it is a potential keypoint.

from OpenCV
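
A minimal NumPy sketch of that comparison, assuming `dog` is a small stack of DoG images at neighbouring scales (previous, current, next):

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """Return True if dog[s, y, x] is the max or min of its 3x3x3 neighbourhood.
    dog: array of shape (scales, height, width); s, y, x must not lie on a border."""
    cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]  # the 26 neighbours plus the pixel itself
    value = dog[s, y, x]
    return value == cube.max() or value == cube.min()
```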

What is the Laplacian?

The Laplacian is the divergence of the gradient of a function.

What is the Divergence?

Divergence measures the net flow of a vector field out of a point (I treat it as the rate of change of the gradient).

Intuitively, if you picture the vector field as a flow, the divergence measures how much flow is generated (or absorbed) at each point.
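
In symbols, for a 2-D image intensity function f(x, y):

```latex
\nabla^2 f = \nabla \cdot (\nabla f)
           = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}
```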

2/5 Keypoint Localization

From the previous step, we get candidate keypoint locations at each scale. However, these locations are only approximate, since they sit on the discrete pixel and scale grid. Hence, we have to localise them more precisely.

SIFT uses a Taylor series expansion of the scale space to locate the extremum more accurately.

Taylor series expansion: https://ch-hsieh.blogspot.com/2009/01/blog-post.html
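
From Lowe's paper, the DoG function D is expanded around the candidate point x = (x, y, σ)ᵀ, and the offset of the true extremum is found by setting the derivative to zero:

```latex
D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^{T} \mathbf{x}
              + \frac{1}{2}\, \mathbf{x}^{T} \frac{\partial^2 D}{\partial \mathbf{x}^2}\, \mathbf{x},
\qquad
\hat{\mathbf{x}} = -\left(\frac{\partial^2 D}{\partial \mathbf{x}^2}\right)^{-1} \frac{\partial D}{\partial \mathbf{x}}
```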

Then, if the intensity value at this refined extremum is below a threshold (0.03 in the paper), the keypoint is rejected as low contrast. In OpenCV this threshold is named contrastThreshold.

However, DoG also gives a strong response along edges, which should be removed so that only strong interest points such as corners remain. To do this, SIFT uses an idea similar to the Harris corner detector: a 2x2 Hessian matrix is used to compute the principal curvatures.

For an edge, one eigenvalue of this Hessian (one principal curvature) is much larger than the other. If the ratio between them exceeds a threshold, the keypoint is discarded; in OpenCV this threshold is called edgeThreshold, with a value of 10.

from OpenCV
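
For reference, the test from Lowe's paper: with H the 2x2 Hessian of the DoG at the keypoint, the keypoint is kept only when

```latex
\frac{\operatorname{Tr}(H)^2}{\operatorname{Det}(H)} < \frac{(r+1)^2}{r},
\qquad r = 10 \ \text{(the edgeThreshold)}
```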

Harris Corner: https://docs.opencv.org/3.4/dc/d0d/tutorial_py_features_harris.html
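
In OpenCV, both thresholds are exposed as parameters of the SIFT constructor. A small sketch, assuming OpenCV 4.x where SIFT lives in the main cv2 module (in some older builds it is under cv2.xfeatures2d), using the values discussed above:

```python
import cv2

# contrastThreshold rejects low-contrast keypoints (the 0.03 test above);
# edgeThreshold rejects edge-like keypoints (the curvature-ratio test).
sift = cv2.SIFT_create(contrastThreshold=0.03, edgeThreshold=10)
```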

3/5 Orientation Assignment

Assigning an orientation to each keypoint makes its descriptor invariant to image rotation.

This can be broken into 4 steps.

  1. Determine sample points.
  2. Determine the gradient magnitude and orientation of each sample point.
  3. Create an orientation histogram.
  4. Extract dominant directions.

Determine sample points

From the previous step, we have the refined keypoint, and we know its Gaussian scale factor.

  1. Apply a Gaussian with this scale factor to smooth the image.
  2. Take the region within a certain radius around the keypoint, called the Gaussian-weighted circular window.

from reference 1

Determine the gradient magnitude and orientation of each sample point

from reference 1
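
The standard formulas from Lowe's paper, with L the Gaussian-smoothed image at the keypoint's scale:

```latex
m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^2 + \big(L(x, y+1) - L(x, y-1)\big)^2}

\theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)
```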

Create an orientation histogram

from reference 1

SIFT builds a 36-bin orientation histogram to determine the dominant direction.

  1. The histogram has 36 bins, each covering 10 degrees.
  2. Each sample within the Gaussian-weighted circular window contributes its gradient magnitude, weighted by the Gaussian window.

SIFT takes the highest peak of the histogram as the keypoint's orientation, and any other peak above 80% of the highest is also used. This creates extra keypoints with the same location and scale but different orientations, which contributes to the stability of matching.

from reference 1
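
A minimal NumPy sketch of this step. The function name and the magnitude/orientation/weight inputs are hypothetical, just to illustrate the binning and the 80% rule; real SIFT also interpolates the peak position:

```python
import numpy as np

def dominant_orientations(magnitudes, orientations_deg, gauss_weights):
    """magnitudes, gauss_weights: gradient magnitude and Gaussian weight per sample;
    orientations_deg: gradient direction per sample, in degrees [0, 360)."""
    hist, _ = np.histogram(orientations_deg, bins=36, range=(0, 360),
                           weights=magnitudes * gauss_weights)
    peak = hist.max()
    # Keep the highest bin plus any bin reaching at least 80% of it.
    return [i * 10 for i, h in enumerate(hist) if h >= 0.8 * peak]
```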

4/5 Keypoint Descriptor

Every keypoint now has a location, scale, and orientation. Next we want a distinctive descriptor vector that is also partially invariant to changes in viewpoint and illumination.

  1. SIFT takes a 16 x 16 neighbourhood around the keypoint.
  2. This 16 x 16 array is divided into 16 sub-blocks of 4 x 4 each.
  3. For each sub-block, an 8-bin orientation histogram is created.
  4. In total, the descriptor has 4 x 4 x 8 = 128 values.

from reference 1
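
With OpenCV, detection and description run together, and the descriptor array has one 128-dimensional row per keypoint ("image.jpg" is a placeholder path):

```python
import cv2

img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(len(keypoints), descriptors.shape)  # N keypoints -> descriptors of shape (N, 128)
```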

5/5 Keypoint Matching

Keypoints between two images are matched by identifying their nearest neighbours.

However, the second-closest match may be very near the first because of noise or other factors. We can use the ratio of the closest distance to the second-closest distance: if it is greater than 0.8, the match is rejected.

According to the paper, this eliminates around 90% of false matches while discarding only about 5% of correct matches.
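
A sketch of this ratio test with OpenCV's brute-force matcher, following the OpenCV tutorial (image paths are placeholders):

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("train.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# For each descriptor in img1, find its 2 nearest neighbours in img2.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)

# Ratio test: keep a match only if it is clearly better than the runner-up.
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
```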

Recap:

  1. Use the Difference of Gaussians to find scale-space local extrema.
  2. Refine each keypoint with a Taylor series expansion of the scale space (reject if the contrast is below 0.03) and drop edge-like features (reject if the curvature ratio exceeds 10).
  3. Assign an orientation to each keypoint by weighting gradient magnitudes within the Gaussian-weighted circular window; then take the peaks of the 36-bin histogram (the highest one plus any above 80% of it).
  4. Describe each keypoint with a 128-dimensional vector that is robust to viewpoint changes.
  5. Match keypoints and reduce false matches using the ratio of the closest distance to the second-closest distance.

Reference

  1. https://slideplayer.com/slide/6276090/
  2. https://docs.opencv.org/3.4.3/da/df5/tutorial_py_sift_intro.html
