Implementing SIFT in Python: A Complete Guide (Part 2)

Dive into the details and solidify your computer vision fundamentals

Russ Islam
7 min read · Feb 18, 2020

In this article, we continue our discussion of the implementation details behind the scale-invariant feature transform (SIFT). If you haven’t read Part 1, you can find it here.

Recap

So far we’ve generated a series of scales at which to blur our input image, which we used to generate a Gaussian-blurred image pyramid and a difference-of-Gaussians image pyramid. We found our keypoint positions by looking for local maxima and minima along the image width, height, and scale dimensions. We localized these keypoint positions to subpixel accuracy with the help of quadratic interpolation.

What’s left? We need to compute orientations for each of our keypoints. In doing so, we’ll likely produce some keypoints with identical positions but different orientations. Then we’ll sort our finished keypoints and remove duplicates. Finally, we’ll generate descriptor vectors for each of our keypoints, which allow keypoints to be identified and compared.

At the end of this tutorial, we’ll see SIFT in action with a template matching demo.

Let’s get started.

Keypoint Orientations

Take a look at computeKeypointsWithOrientations(). The goal here is to create a histogram of gradients for the pixels in a neighborhood around each keypoint. Note that we use a square neighborhood here, as the OpenCV implementation does. One detail to note is the radius_factor, which is 3 by default. This means the neighborhood will cover pixels within 3 * scale of each keypoint, where scale is the standard deviation associated with the Gaussian weighting. Remember that 99.7% of the probability density of a Gaussian distribution lies within three standard deviations, which in this case means 99.7% of the weight lies within 3 * scale pixels of the keypoint.
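
To make that relationship concrete, here is a minimal sketch of how the neighborhood radius and the Gaussian weight depend on the keypoint’s scale. The names radius_factor and scale follow the text; the example value of scale and the helper function are purely illustrative.

```python
import numpy as np

# Illustrative sketch: how the neighborhood radius and Gaussian weight relate
# to the keypoint's scale. The example value of `scale` is made up.
radius_factor = 3
scale = 1.6                                   # std dev of the Gaussian weighting
radius = int(round(radius_factor * scale))    # neighborhood half-width in pixels

def gaussian_weight(i, j, scale):
    """Weight for a pixel at offset (i, j) from the keypoint."""
    return np.exp(-0.5 * (i ** 2 + j ** 2) / scale ** 2)

print(radius, gaussian_weight(radius, 0, scale))   # the weight is tiny at the edge
```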

Next, we compute the magnitude and orientation of the 2D gradient at each pixel in this neighborhood. We create a 36-bin histogram for the orientations — 10 degrees per bin. The orientation of a particular pixel tells us which histogram bin to choose, but the actual value we place in that bin is that pixel’s gradient magnitude with a Gaussian weighting. This makes pixels farther from the keypoint have less of an influence on the histogram. We repeat this procedure for all pixels in the neighborhood, accumulating our results into the same 36-bin histogram.
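
The accumulation step might look roughly like the sketch below: it takes a square grayscale patch centered on the keypoint, computes finite-difference gradients, and adds Gaussian-weighted magnitudes into a 36-bin histogram. This is an illustrative outline, not the repo’s exact code, which also handles image borders and subpixel keypoint positions.

```python
import numpy as np

def orientation_histogram(patch, scale, num_bins=36):
    """Accumulate a 36-bin orientation histogram over a square grayscale patch
    centered on a keypoint. `scale` is the std dev of the Gaussian weighting.
    Illustrative sketch only."""
    histogram = np.zeros(num_bins)
    center_row, center_col = (np.array(patch.shape, dtype=float) - 1) / 2
    for row in range(1, patch.shape[0] - 1):
        for col in range(1, patch.shape[1] - 1):
            dx = patch[row, col + 1] - patch[row, col - 1]        # finite differences
            dy = patch[row - 1, col] - patch[row + 1, col]
            magnitude = np.sqrt(dx * dx + dy * dy)
            orientation = np.rad2deg(np.arctan2(dy, dx)) % 360.0  # in [0, 360)
            weight = np.exp(-0.5 * ((row - center_row) ** 2
                                    + (col - center_col) ** 2) / scale ** 2)
            histogram[int(orientation // (360.0 / num_bins)) % num_bins] += weight * magnitude
    return histogram
```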

When we’re all done, we smooth the histogram. The smoothing coefficients correspond to a 5-point Gaussian filter, and you can find a neat explanation of how to derive these coefficients from Pascal’s triangle here.
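
The coefficients in question form the binomial kernel [1, 4, 6, 4, 1] / 16, which is the fourth row of Pascal’s triangle normalized to sum to one. A circular smoothing pass over the 36 bins might look like this sketch:

```python
import numpy as np

def smooth_orientation_histogram(raw_histogram):
    """Circularly smooth the histogram with the binomial kernel
    [1, 4, 6, 4, 1] / 16, a 5-point approximation of a Gaussian filter."""
    num_bins = len(raw_histogram)
    smoothed = np.zeros(num_bins)
    for n in range(num_bins):
        smoothed[n] = (6 * raw_histogram[n]
                       + 4 * (raw_histogram[n - 1] + raw_histogram[(n + 1) % num_bins])
                       + raw_histogram[n - 2] + raw_histogram[(n + 2) % num_bins]) / 16.0
    return smoothed
```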

To help us visualize what’s going on, I’ve plotted the raw histogram and smoothed histogram below for an actual keypoint produced when our code is applied to the template image box.png (found in the repo).

The raw orientation histogram.
The Gaussian-smoothed orientation histogram with peaks in red.

Now we look for peaks in this histogram that lie above the threshold specified in the SIFT paper: 80% of the highest peak. We localize each peak using quadratic interpolation (finite differences again!). You can find a derivation of the interpolation formula here. We create a separate keypoint for each peak, and these keypoints will be identical except for their orientation attributes. As stated in the SIFT paper, these additional keypoints contribute significantly to detection stability in real applications.
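
Putting the peak search and the parabolic refinement together might look like the sketch below. The 0.8 ratio comes from the paper; the final conversion from bin index to angle follows OpenCV’s convention of storing 360 minus the interpolated bin angle.

```python
import numpy as np

def find_orientation_peaks(smoothed, peak_ratio=0.8, num_bins=36):
    """Return interpolated orientations (degrees) for histogram peaks within
    80% of the global maximum. Sketch under the conventions described above."""
    orientations = []
    global_max = max(smoothed)
    for n in range(num_bins):
        left, center, right = smoothed[n - 1], smoothed[n], smoothed[(n + 1) % num_bins]
        if center > left and center > right and center >= peak_ratio * global_max:
            # Quadratic (parabolic) interpolation of the peak position
            interpolated_index = (n + 0.5 * (left - right)
                                  / (left - 2 * center + right)) % num_bins
            # Convert bin index to an angle (OpenCV-style convention)
            orientations.append(360.0 - interpolated_index * 360.0 / num_bins)
    return orientations
```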

Cleaning Up Keypoints

Now we’ve found all our keypoints. Before moving on to descriptor generation, we need to do some cleanup.

We sort and remove duplicates with removeDuplicateKeypoints(). The comparison function we use for keypoints is identical to that implemented in OpenCV.
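
The details of that comparison live in the repo, but the overall shape of the cleanup is simple: impose a total ordering on keypoints, sort, and drop consecutive entries that share the same attributes. A simplified sketch (not the repo’s exact comparator, which mirrors OpenCV attribute by attribute, including response and octave):

```python
def remove_duplicate_keypoints(keypoints):
    """Sort keypoints and keep only one representative of each duplicate.
    Simplified sketch of the deduplication step."""
    if len(keypoints) < 2:
        return keypoints
    keypoints.sort(key=lambda kp: (kp.pt[0], kp.pt[1], kp.size, kp.angle))
    unique = [keypoints[0]]
    for kp in keypoints[1:]:
        last = unique[-1]
        if last.pt != kp.pt or last.size != kp.size or last.angle != kp.angle:
            unique.append(kp)
    return unique
```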

We also need to convert our keypoints from base image coordinates to input image coordinates, which we can accomplish by simply halving the relevant attributes.
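
In code, that conversion is about as simple as it sounds. A sketch is below; a full implementation may also need to adjust the packed octave field that records which pyramid layer the keypoint came from.

```python
def convert_keypoints_to_input_image_size(keypoints):
    """Map keypoints from base-image coordinates back to input-image
    coordinates. The base image is the input image doubled in size, so
    positions and sizes are simply halved. Sketch only."""
    for keypoint in keypoints:
        keypoint.pt = (0.5 * keypoint.pt[0], 0.5 * keypoint.pt[1])
        keypoint.size *= 0.5
    return keypoints
```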

Generating Descriptors

At last, descriptor generation. Descriptors encode information about a keypoint’s neighborhood and allow comparison between keypoints. SIFT descriptors are particularly well designed, enabling robust keypoint matching.

For each keypoint, our first step is to create another histogram of gradient orientations. We consider a square neighborhood (different side length this time) around each keypoint, but now we rotate this neighborhood by the keypoint’s angle. This is what makes SIFT invariant to rotation. From this rotated neighborhood, we do something special. We compute row and column bins, which are just indices local to the neighborhood that denote where each pixel lies. We also calculate each pixel’s gradient magnitude and orientation, like we did when computing keypoint orientations. However, instead of actually building a histogram and accumulating values, we simply store the histogram bin index and bin value for each pixel. Note that here our histograms have only 8 bins to cover 360 degrees, instead of 36 bins as before.
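
As a rough sketch of the coordinate handling, each pixel offset is rotated into the keypoint’s reference frame and then mapped to fractional row and column bins. The helper and names below are illustrative assumptions: hist_width stands for the pixel width of one histogram cell, and window_width is 4.

```python
import numpy as np

def rotated_bins(row, col, angle_degrees, hist_width, window_width=4):
    """Rotate a pixel offset (row, col) from the keypoint into the keypoint's
    reference frame and return its fractional (row_bin, col_bin) indices.
    Illustrative sketch of the coordinate handling."""
    angle = np.deg2rad(angle_degrees)
    row_rotated = col * np.sin(angle) + row * np.cos(angle)
    col_rotated = col * np.cos(angle) - row * np.sin(angle)
    # Shift so the bins span 0 .. window_width across the neighborhood
    row_bin = row_rotated / hist_width + 0.5 * window_width - 0.5
    col_bin = col_rotated / hist_width + 0.5 * window_width - 0.5
    return row_bin, col_bin

# The orientation bin comes from the gradient orientation measured relative to
# the keypoint's angle, e.g. ((gradient_orientation - angle_degrees) % 360) * 8 / 360.
```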

Now comes the tricky part. Imagine that we take our square neighborhood, a 2D array, and replace each pixel with a vector of length 8. The orientation bin associated with each pixel will index into its 8-length vector, and at this location we’ll store the weighted gradient magnitude for this pixel. This will form a 3D array of size (window_width, window_width, num_bins), which evaluates to (4, 4, 8). We’re going to flatten this 3D array to serve as our descriptor vector. Before that, however, it’s a good idea to apply some smoothing.

What kind of smoothing? Hint: It involves more finite differences. We’ll smooth the weighted gradient magnitude for each neighborhood pixel by distributing it among its eight neighbors in three dimensions: row bin, column bin, and orientation bin. We’ll use a method known as trilinear interpolation, or more precisely, we’ll use its inverse. Wikipedia provides the formulas and a good visualization of the method. Put simply, each neighborhood pixel has a row bin index, column bin index, and orientation bin index, and we want to distribute its histogram value proportionately to its eight neighbor bins, all while making sure the distributed parts add up to the original value.

You may wonder why we don’t simply allocate one-eighth of the histogram value to each of the eight neighbors. The problem is that the neighborhood pixel may have fractional bin indices. Imagine each neighborhood pixel is represented by the 3D point [row_bin, col_bin, orientation_bin], shown by the red point in the figure below. This 3D point may not lie at the exact center of the cube formed by its eight integer-valued neighbor points, the blue points below. If we want to accurately distribute the value at the red point to the blue points, the blue points that are closer to the red point should receive a larger proportion of the red point’s value. This is precisely what (the inverse of) trilinear interpolation does. We split the red point into two green points, split those two green points into four points, and finally split those four points into our final eight points.

We perform the inverse of trilinear interpolation, taking the possibly off-center red point and distributing it among its eight neighbors (image from Wikipedia).
We first divide the red point into two possibly unequal green points, and recurse until we reach the blue points (image from Wikipedia).
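
Here is a sketch of that distribution step for a single neighborhood pixel. The fractional weights along each axis multiply together, so the eight shares always sum to the original value. Border handling is omitted here, so treat this as an outline rather than the repo’s exact code.

```python
import numpy as np

def distribute_value(tensor, row_bin, col_bin, orientation_bin, value, num_bins=8):
    """Distribute `value` (a weighted gradient magnitude) among the eight
    integer-index neighbors of the fractional point (row_bin, col_bin,
    orientation_bin), i.e. the inverse of trilinear interpolation.
    `tensor` has shape (window_width, window_width, num_bins). Sketch only."""
    row_floor = int(np.floor(row_bin))
    col_floor = int(np.floor(col_bin))
    orient_floor = int(np.floor(orientation_bin))
    row_frac = row_bin - row_floor
    col_frac = col_bin - col_floor
    orient_frac = orientation_bin - orient_floor
    for row_step, row_weight in ((0, 1 - row_frac), (1, row_frac)):
        for col_step, col_weight in ((0, 1 - col_frac), (1, col_frac)):
            for orient_step, orient_weight in ((0, 1 - orient_frac), (1, orient_frac)):
                tensor[row_floor + row_step,
                       col_floor + col_step,
                       (orient_floor + orient_step) % num_bins] += (
                    value * row_weight * col_weight * orient_weight)
    return tensor
```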

Our last step is to flatten our smoothed 3D array into a descriptor vector of length 128. Then we’ll apply a threshold and normalize. In the OpenCV implementation, the descriptor is then scaled and saturated to lie between 0 and 255 for efficiency when comparing descriptors later.
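
A condensed sketch of those final steps is below, assuming the 0.2 clipping ratio from the SIFT paper and a 512 scale factor for mapping unit-norm descriptors into the 0–255 range; treat both constants as illustrative.

```python
import numpy as np

def finalize_descriptor(histogram_tensor, clip_ratio=0.2, scale_factor=512):
    """Flatten the (4, 4, 8) tensor into a length-128 vector, clip unusually
    large components, renormalize, then scale and saturate into 0..255.
    Sketch only; constants follow the SIFT paper and common practice."""
    descriptor = histogram_tensor.astype(float).flatten()
    threshold = np.linalg.norm(descriptor) * clip_ratio
    descriptor[descriptor > threshold] = threshold             # damp large gradient magnitudes
    descriptor /= max(np.linalg.norm(descriptor), 1e-7)        # renormalize to unit length
    return np.clip(np.round(scale_factor * descriptor), 0, 255)  # integer range for fast matching
```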

We repeat this procedure to generate one descriptor vector for each keypoint.

That’s the entire SIFT algorithm. I know it was a lot, but you’ve made it this far! Now let’s see SIFT in action with a template matching demo.

Application: Template Matching

This script from the repo is adapted from OpenCV’s template matching demo. Given a template image, the goal is to detect it in a scene image and compute the homography. We compute SIFT keypoints and descriptors on both the template image and the scene image, then perform an approximate nearest-neighbor search on the two sets of descriptors to find similar keypoints. Keypoint pairs whose descriptor distances pass a ratio test (the best match must be clearly closer than the second-best) are considered good matches. Finally, we perform RANSAC on the keypoint matches to compute the best-fit homography.
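
A condensed sketch of that pipeline is below. The file name box_in_scene.png, the 0.7 ratio, and the use of OpenCV’s built-in cv2.SIFT_create() in place of the repo’s own keypoint-and-descriptor function are assumptions for illustration.

```python
import cv2
import numpy as np

MIN_MATCH_COUNT = 10

template = cv2.imread('box.png', 0)            # template image, grayscale
scene = cv2.imread('box_in_scene.png', 0)      # scene image, grayscale (assumed file name)

# Keypoints and descriptors on both images (swap in the repo's function here)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(template, None)
kp2, des2 = sift.detectAndCompute(scene, None)

# Approximate nearest-neighbor search (FLANN) plus a distance ratio test
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

if len(good) > MIN_MATCH_COUNT:
    src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC estimates the homography while rejecting outlier matches
    homography, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
    print(homography)
```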

Give it a try. You should see something like this:

Output plot of template_matching_demo.py, showing the detected template along with keypoint matches.

Try this script with your own template and scene images to get a feel for the stability of SIFT keypoints, and explore the conditions under which they work well and those under which they break down.

Wrapping Up

You’ve made it! SIFT is an indispensable component of any computer vision engineer’s toolkit. I hope this tutorial has shed light on the myriad implementation details that elevate SIFT from just a nice idea on paper to practical, robust code. Don’t forget to try SIFT on your future vision projects, whether you use it on its own, as part of a larger image processing pipeline, or just as a benchmark against learning-based approaches.

Work your way through the code here and really understand it. It’ll be its own reward. Don’t give up on the details!

