2D Feature Tracking — Part 2: Feature Description

Nikhil Nair
11 min read · Aug 15, 2023


A set of binary descriptors for keypoints detected on an image of a cat. (Image Courtesy: Udacity Sensor Fusion Nanodegree)

In Part 1 of this series, I explained feature/keypoint detectors, which locate regions of interest in images. We also compared the performance of several types of keypoint detectors. If you haven't read my previous article, I encourage you to do so.

Once you are familiar with locating keypoints in an image, the next step is to track them from one image to the next. To do so, we need to employ a concept called “descriptors”, which describe (hence the name) the area surrounding a keypoint in a unique manner, so that the same keypoint can be located again in the next image provided by the camera.

When we extract keypoints, why do we need to use descriptors?

While keypoints are points of interest in an image, they only provide location information and do not capture the visual characteristics or appearance of the region around the keypoint. This is where descriptors come into play.

Matching keypoints based solely on their locations can be problematic because keypoints may be affected by changes in scale, rotation, or lighting conditions. By using descriptors, we can overcome these challenges and improve the accuracy and robustness of keypoint matching.

Descriptors are used to describe the visual properties of the region surrounding each keypoint. They encode information about the shape, texture, color, or other visual features of the region. By comparing the descriptors of keypoints in different images, we can determine if they represent the same or similar visual content.

For example, let’s say we have two images of the same object, but one image is taken from a different angle or under different lighting conditions. The keypoints in both images may have distinct locations, but their descriptors can still capture similar visual characteristics. By comparing the descriptors, we can find matching keypoints and establish correspondences between the two images.

An excellent explanation of this topic can be found here as well.

Types of Descriptors

Designing a good local feature descriptor requires an effective and discriminative representation of local ROIs (Regions of Interest). Local features can be hand-crafted or learned using algorithms such as convolutional neural networks (CNNs) and deep learning. The typical pipeline for hand-crafted descriptors is to detect a set of repeatable interest points (keypoints) in an image and then represent the neighborhood around each keypoint as an efficient and discriminative feature vector, known as a keypoint feature descriptor. Hand-crafted descriptors involve finding the right balance between accuracy and computational efficiency, while learned features depend on the training set and require careful selection of a representative dataset.

For this article, we will focus on hand-crafted or conventional descriptors. There are two main categories of such descriptors: floating-point and binary descriptors.

1. Floating Point Feature Descriptors

Floating-point feature descriptors such as SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features) are computed and stored as real-valued feature vectors. These descriptors can be divided into distribution-based descriptors, differential descriptors, and spatial-frequency descriptors. The most frequently used floating-point descriptors are the distribution-based ones, which count the occurrence of gradient orientations in the local ROIs of the image, because local object appearance and shape can be described by the distribution of intensity gradients or edge directions.

Unfortunately, these descriptors are based on histogram statistics of gradient information (i.e., magnitude and orientation), which are expensive to compute and hence have high computational complexity. Besides, they produce high-dimensional feature vectors, which become a significant concern on devices with limited computational and storage capability. Since each dimension of the descriptor is stored as a floating-point number, the descriptor also requires more memory: a 128-dimensional SIFT descriptor stored as 32-bit floats occupies 512 bytes, whereas a typical 256-bit binary descriptor fits in 32 bytes.

The above limitations make floating-point descriptors less useful for real-time applications, especially those running on platforms with limited computing power and memory, such as smartphones and embedded devices.

2. Binary Feature Descriptors

With the rapid growth of real-time applications, binary descriptors that achieve fast runtimes and compact storage have become increasingly popular. They show similar quality to SIFT-like descriptors but at significantly lower computational cost and with far smaller memory requirements.

Some of the most recent and promising binary feature descriptors are BRIEF (Binary Robust Independent Elementary Features), ORB (Oriented FAST and Rotated BRIEF), BRISK (Binary Robust Invariant Scalable Keypoints), FREAK (Fast Retina Keypoint), and AKAZE (Accelerated KAZE). The operations used to compute these descriptors are quite different from those involved in the floating-point descriptors.

The binary descriptors are a compact representation of ROIs in the form of a binary string resulting from pixel intensity comparisons at sampled predefined locations in the region. The binary descriptors require lower memory compared with gradient-based descriptors, and the similarity of descriptors can be evaluated by computing the Hamming distance, which can be computed very efficiently.
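To see why matching binary descriptors is so cheap, consider how the Hamming distance is computed: an XOR between the two bit strings followed by a count of the set bits. Below is a minimal sketch comparing two 256-bit descriptors packed into 32 bytes; the function name hammingDistance and the toy values are made up for illustration. In practice, you would simply call cv::norm(d1, d2, cv::NORM_HAMMING) on the descriptor rows OpenCV returns.

#include <array>
#include <bitset>
#include <cstdint>
#include <iostream>

// Hamming distance between two 256-bit descriptors packed into 32 bytes:
// XOR the bytes, then count the set bits in the result.
int hammingDistance(const std::array<uint8_t, 32> &d1,
                    const std::array<uint8_t, 32> &d2)
{
    int dist = 0;
    for (size_t i = 0; i < d1.size(); ++i)
        dist += std::bitset<8>(d1[i] ^ d2[i]).count(); // count differing bits
    return dist;
}

int main()
{
    std::array<uint8_t, 32> a{}, b{};
    a[0] = 0b10110100; b[0] = 0b10011100; // toy values: differ in 2 bits
    std::cout << hammingDistance(a, b) << "\n"; // prints 2
}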

Let's look at examples of SIFT and ORB descriptors to better understand each descriptor type.

SIFT (Scale Invariant Feature Transform)

SIFT is a floating-point descriptor belonging to the Histogram of Oriented Gradients (HOG) family. SIFT was presented in 1999 by David Lowe and includes both a keypoint detector and a descriptor. The basic idea behind HOG is to describe the structure of an object by the distribution of its intensity gradients in a local neighborhood. To achieve this, an image is divided into cells in which gradients are computed and collected in a histogram. The set of histograms from all cells is then used as a similarity measure to uniquely identify an image patch or object.

The SIFT method follows a five-step process, which is briefly outlined in the following.

1. First, keypoints are detected in the image using an approach called the Difference of Gaussians (DoG), an approximation of the Laplacian-of-Gaussian (LoG) that is based on second-degree intensity derivatives. The DoG is applied at various scale levels of the image and tends to detect blobs rather than corners. One or more orientations are assigned to each keypoint location based on local image gradient directions. All future operations are performed on image data that has been transformed relative to the assigned orientation, scale, and location of each feature, thereby providing invariance to these transformations.

2. Second, for every keypoint, its surrounding area is transformed by removing the orientation, thus ensuring a canonical orientation. The area is also resized to 16 x 16 pixels, providing a normalized patch.

3. Third, the orientation and magnitude of each pixel within the normalized patch are computed based on the intensity gradients Ix and Iy.

4. Fourth, the normalized patch is divided into a grid of 4 x 4 cells. Within each cell, the orientations of pixels whose gradient magnitude exceeds a threshold are collected in a histogram consisting of 8 bins.

5. Last, the 8-bin histograms of all 16 cells are concatenated into a 128-dimensional vector (the descriptor), which is used to uniquely represent the keypoint. In addition, several measures are taken to achieve robustness against illumination changes, rotation, etc. A simplified sketch of steps 3 to 5 follows the figure below.

SIFT — concatenating histograms from different squares
SIFT descriptors illustration (Source: https://gilscvblog.com/2013/08/18/a-short-introduction-to-descriptors/)
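To make steps 3 to 5 concrete, here is a minimal sketch of the histogram stage, assuming the 16 x 16 normalized patch from step 2 is available as a CV_32F cv::Mat. The function name siftLikeDescriptor is made up for illustration, and the sketch omits the magnitude threshold from step 4, Gaussian weighting, interpolation, and the final normalization that real SIFT implementations apply; treat it as an illustration of the 4 x 4 x 8 = 128 layout rather than a faithful reimplementation.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Build a simplified 128-dimensional SIFT-like vector from a 16x16 patch:
// 4x4 cells, each contributing an 8-bin orientation histogram.
std::vector<float> siftLikeDescriptor(const cv::Mat &patch) // CV_32F, 16x16
{
    cv::Mat Ix, Iy;
    cv::Sobel(patch, Ix, CV_32F, 1, 0); // horizontal gradient
    cv::Sobel(patch, Iy, CV_32F, 0, 1); // vertical gradient

    std::vector<float> desc(128, 0.f);
    for (int y = 0; y < 16; ++y)
        for (int x = 0; x < 16; ++x)
        {
            float gx = Ix.at<float>(y, x), gy = Iy.at<float>(y, x);
            float mag = std::sqrt(gx * gx + gy * gy);
            float ang = std::atan2(gy, gx); // in (-pi, pi]
            // Map the angle to one of 8 orientation bins.
            int bin = cvFloor((ang + CV_PI) / (2.0 * CV_PI) * 8) % 8;
            int cell = (y / 4) * 4 + (x / 4); // which of the 4x4 cells
            desc[cell * 8 + bin] += mag;      // magnitude-weighted vote
        }
    return desc; // concatenated histograms: 16 cells x 8 bins = 128 values
}

Real SIFT additionally normalizes the 128-vector and clamps large values before renormalizing, which is part of the illumination robustness mentioned in step 5.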

The SIFT detector/descriptor is able to robustly identify objects even among clutter and under partial occlusion. It is invariant to uniform changes in scale, to rotation, to changes in both brightness and contrast and it is even partially invariant to affine distortions.

The downside of SIFT is its low speed, which prevents it from being used in real-time applications on, for example, smartphones. Other members of the HOG family (such as SURF and GLOH) have been optimized for speed. However, they are still too computationally expensive for real-time applications.

Now, let us see how to implement the SIFT detector and descriptor in OpenCV.

#include <opencv2/features2d.hpp>
#include <vector>

/**
 * @brief SIFT keypoint detector
 * @param keypoints output detected keypoints
 * @param img grayscale input image
 */
void detKeypointsSIFT(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img)
{
    cv::Ptr<cv::FeatureDetector> detector = cv::SIFT::create();
    // detector = cv::xfeatures2d::SIFT::create(); // use this for OpenCV < 4.4
    detector->detect(img, keypoints);
}

/**
 * @brief SIFT keypoint descriptor
 * @param keypoints input detected keypoints
 * @param img grayscale input image
 * @param descriptors output descriptor matrix (one row per keypoint)
 */
void descKeypointsSIFT(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img, cv::Mat &descriptors)
{
    cv::Ptr<cv::DescriptorExtractor> extractor = cv::SIFT::create();
    // extractor = cv::xfeatures2d::SIFT::create(); // use this for OpenCV < 4.4
    extractor->compute(img, keypoints, descriptors);
}
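A quick usage sketch of these two functions might look like the following; the file name test.png is just a placeholder path.

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Load the image and convert to grayscale, as the functions above expect.
    cv::Mat img = cv::imread("test.png"); // placeholder path
    cv::Mat gray;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);

    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    detKeypointsSIFT(keypoints, gray);
    descKeypointsSIFT(keypoints, gray, descriptors);

    // Each row of 'descriptors' is one 128-dimensional float vector.
    std::cout << "Keypoints: " << keypoints.size()
              << ", descriptor size: " << descriptors.cols << std::endl;
    return 0;
}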

A more detailed explanation of SIFT can also be found in this Medium article by Minghao Ning.

ORB (Oriented FAST and Rotated BRIEF)

A much faster alternative to HOG-based methods is the family of binary descriptors, which offer only slightly worse accuracy at dramatically lower runtime. ORB, which was introduced in 2011, belongs to this family of binary descriptors.

From a high-level perspective, binary descriptors consist of three major parts:

  1. A sampling pattern that describes where sample points are located around the location of a keypoint.
  2. A method for orientation compensation, which removes the influence of rotation of the image patch around a keypoint location.
  3. A method for sample-pair selection, which generates pairs of sample points that are compared against each other with regard to their intensity values. If the first value is larger than the second, we write a ‘1’ into the binary string, otherwise, we write a ‘0’. After performing this for all point pairs in the sampling pattern, the binary chain (or ‘string’) is created (hence the family name of this descriptor class).

Just like SIFT, ORB is also a combination of a keypoint detector and descriptor algorithms. It works in two steps — Keypoint detection using FAST and Description using BRIEF. Let us go through each step in detail to understand how ORB works.

Keypoint detection using FAST

FAST stands for Features from Accelerated Segment Test. FAST features are widely used because of their computational efficiency. However, FAST features do not have an orientation component, which is what ORB builds upon to make the detector rotation invariant.

ORB starts by detecting a set of keypoints in the image using the FAST detector, which classifies a pixel as a keypoint if a contiguous arc of pixels on a circle around it is all brighter or all darker than the center pixel by more than a threshold (a simplified version of this test is sketched after the figure below). It then employs the Harris corner measure to rank the keypoints and exclude non-corner points.

FAST feature detection (Image Courtesy: Udacity Sensor Fusion Nanodegree)
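Here is a minimal sketch of that segment test, assuming a single grayscale image and a pixel safely away from the border; the function name isFastCorner is made up for illustration. Real FAST implementations add a fast rejection pre-test on four of the sixteen circle pixels and a machine-learned decision tree for speed, both omitted here.

#include <opencv2/opencv.hpp>

// Pixel offsets on a Bresenham circle of radius 3 around the candidate.
const int DX[16] = {0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3, -3, -3, -2, -1};
const int DY[16] = {-3, -3, -2, -1, 0, 1, 2, 3, 3, 3, 2, 1, 0, -1, -2, -3};

// Segment test: pixel (x, y) is a corner if at least 'arc' contiguous circle
// pixels are all brighter than center + t or all darker than center - t.
bool isFastCorner(const cv::Mat &gray, int x, int y, int t, int arc = 12)
{
    int c = gray.at<uchar>(y, x);
    int state[16]; // +1 brighter, -1 darker, 0 similar
    for (int i = 0; i < 16; ++i)
    {
        int p = gray.at<uchar>(y + DY[i], x + DX[i]);
        state[i] = (p > c + t) ? 1 : (p < c - t) ? -1 : 0;
    }
    for (int s : {1, -1}) // look for a contiguous bright or dark arc
    {
        int run = 0;
        for (int i = 0; i < 32; ++i) // 2 x 16 passes to handle wrap-around
        {
            run = (state[i % 16] == s) ? run + 1 : 0;
            if (run >= arc) return true;
        }
    }
    return false;
}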

In order to produce multi-scale features, a scale pyramid of the image is employed, and a set of FAST features filtered by Harris are detected at each level in the pyramid.

Image pyramid for scale invariance (Image Courtesy: Udacity Sensor Fusion Nanodegree)

Since FAST does not provide a measure of corner orientation, the intensity centroid measure is used to compute the keypoint orientation, as sketched after the figure below.

Intensity centroid for a keypoint (Image Courtesy: Udacity Sensor Fusion Nanodegree)
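A minimal sketch of that intensity centroid computation, assuming a square grayscale patch already centered on the keypoint; the function name orientationByCentroid is made up for illustration. The first-order patch moments m10 and m01 define a vector from the patch center to the intensity centroid, and the angle of that vector is taken as the keypoint orientation.

#include <opencv2/opencv.hpp>
#include <cmath>

// Keypoint orientation from the intensity centroid (as used by ORB):
// theta = atan2(m01, m10), with moments taken over a circular patch.
float orientationByCentroid(const cv::Mat &patch) // CV_8U, odd side length
{
    int r = patch.rows / 2;
    double m10 = 0.0, m01 = 0.0;
    for (int y = -r; y <= r; ++y)
        for (int x = -r; x <= r; ++x)
        {
            if (x * x + y * y > r * r) continue; // keep the patch circular
            double I = patch.at<uchar>(y + r, x + r);
            m10 += x * I; // first-order moment along x
            m01 += y * I; // first-order moment along y
        }
    return std::atan2(m01, m10); // angle of the center-to-centroid vector
}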

The final keypoints created by ORB are shown below. Note that the size of each circle depicts the scale at which the keypoint was found (keypoints with large sizes were found at higher levels of the image pyramid), and the orientation is shown by the angle of the line drawn inside each circle.

Final keypoints created by ORB (Image Courtesy: Udacity Sensor Fusion Nanodegree)

Keypoint description using BRIEF

BRIEF (Binary Robust Independent Elementary Features) is a feature descriptor that uses simple binary tests between pixels in a smoothed image patch. Its performance is similar to SIFT in many respects, including robustness to lighting, blur, and perspective distortion. However, it is very sensitive to in-plane rotation. The BRIEF descriptor works in the following steps:

1. To reduce the sensitivity of the filter to high-frequency noise, the BRIEF algorithm first applies a Gaussian kernel to smooth the given image.

2. Given a keypoint, BRIEF then selects a random pair of pixels inside a defined neighborhood or patch around the keypoint. The first pixel in the pair is drawn from a Gaussian distribution centered on the keypoint with a standard deviation (spread) of σ. The second pixel is drawn from another Gaussian distribution, centered on the first pixel, with a standard deviation of σ/2.

Random pixel pair selection (Image Courtesy: Udacity Sensor Fusion Nanodegree)

3. BRIEF then starts constructing the binary descriptor by comparing the brightness of the two pixels. If the first pixel is brighter than the second, it assigns the value “1” to the corresponding bit in the feature vector; otherwise, it assigns “0”. The first bit of the feature vector corresponds to the first random pair of points for this keypoint.

4. Now, for the same keypoint, BRIEF takes a new random pair of pixels and again compares their brightness to assign the second bit in the vector. For a 256-bit vector, BRIEF repeats this process 256 times for the same keypoint before moving to the next one, creating a 256-bit feature vector for each keypoint in the image. A simplified sketch of this sampling-and-comparison loop follows.
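This minimal sketch of steps 2 to 4 assumes the image has already been Gaussian-smoothed (step 1) and that every keypoint lies far enough from the border that all sampled pixels fall inside the image; real implementations clamp or discard border keypoints. The helper names makePairs and briefDescriptor are made up for illustration, and as in the original BRIEF paper, the pair layout is drawn once and reused for all keypoints.

#include <opencv2/opencv.hpp>
#include <array>
#include <random>
#include <vector>

// Draw the 256 test pairs once: the first point ~ N(0, sigma^2) around the
// keypoint, the second ~ N(first, (sigma/2)^2), as described in step 2.
std::vector<std::array<cv::Point, 2>> makePairs(float sigma, std::mt19937 &rng)
{
    std::normal_distribution<float> first(0.f, sigma), second(0.f, sigma / 2.f);
    std::vector<std::array<cv::Point, 2>> pairs(256);
    for (auto &p : pairs)
    {
        cv::Point a(cvRound(first(rng)), cvRound(first(rng)));
        cv::Point b(a.x + cvRound(second(rng)), a.y + cvRound(second(rng)));
        p[0] = a;
        p[1] = b;
    }
    return pairs;
}

// Steps 3 and 4: build the 256-bit descriptor, packed into 32 bytes.
std::array<uchar, 32> briefDescriptor(const cv::Mat &smoothed, cv::Point kp,
                                      const std::vector<std::array<cv::Point, 2>> &pairs)
{
    std::array<uchar, 32> desc{};
    for (int i = 0; i < 256; ++i)
    {
        uchar p1 = smoothed.at<uchar>(kp + pairs[i][0]);
        uchar p2 = smoothed.at<uchar>(kp + pairs[i][1]);
        if (p1 > p2)
            desc[i / 8] |= 1 << (i % 8); // '1' if the first pixel is brighter
    }
    return desc;
}

Matching two such descriptors then reduces to the Hamming distance computation shown earlier.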

ORB uses a modified version of BRIEF called rBRIEF (Rotation-aware BRIEF) to make the feature vector invariant to rotation. It follows the same steps as BRIEF and selects random pairs of pixels around each keypoint. However, ORB then rotates each pair by the orientation of the keypoint before comparing the brightness to create the feature vector. The final 256-bit feature vector created by this process is the ORB descriptor.

Now, let us see how to implement the ORB detector and descriptor in OpenCV.

/**
 * @brief ORB keypoint detector
 * @param keypoints output detected keypoints
 * @param img grayscale input image
 */
void detKeypointsORB(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img)
{
    int nfeatures = 5000; // the maximum number of features to retain
    cv::Ptr<cv::FeatureDetector> detector = cv::ORB::create(nfeatures);
    detector->detect(img, keypoints);
}

/**
 * @brief ORB keypoint descriptor
 * @param keypoints input detected keypoints
 * @param img grayscale input image
 * @param descriptors output descriptor matrix (one row per keypoint)
 */
void descKeypointsORB(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img, cv::Mat &descriptors)
{
    cv::Ptr<cv::DescriptorExtractor> extractor = cv::ORB::create();
    extractor->compute(img, keypoints, descriptors);
}
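Since both functions wrap OpenCV's cv::Feature2D interface, detection and description can also run in a single call, which can avoid redundant work such as rebuilding the image pyramid. A minimal sketch, with the hypothetical helper name detectAndDescribeORB:

#include <opencv2/opencv.hpp>

void detectAndDescribeORB(cv::Mat &img, std::vector<cv::KeyPoint> &keypoints,
                          cv::Mat &descriptors)
{
    // One pass through cv::Feature2D: detect keypoints and compute their
    // descriptors together; cv::noArray() means no region-of-interest mask.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(5000); // same nfeatures as above
    orb->detectAndCompute(img, cv::noArray(), keypoints, descriptors);
}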

Comparing the performance of different descriptors

To compare the descriptors, I analyzed a consecutive series of 10 images depicting a car navigating through traffic, using various combinations of detectors and descriptors. Each combination is run to detect and then describe keypoints in all 10 images, and the aggregated data is compiled to produce the graph below. You can also access the full implementation in my GitHub repository.

Comparison of median descriptor execution time

As you can see in the above graph, the BRIEF descriptor has the lowest execution times across different detector combinations, followed by BRISK and ORB.

In the next article, we will see how to match these descriptors between two images so that we can track features, which is our end goal.
