2D Feature Tracking — Part 1: Detection

Nikhil Nair
11 min read · Aug 4, 2023


Key points detected on a traffic image

Have you ever wondered how your phone can track your face as you move around in a video? Or how self-driving cars keep track of other vehicles on the road? These amazing feats are made possible by 2D feature tracking, a computer vision technique that allows objects to be tracked in videos.

2D feature tracking is broadly a three-step process (a minimal end-to-end code sketch follows the list below):

  1. Keypoint Detection: This step involves finding distinctive points or features in the image. These keypoints are typically invariant to changes in scale, illumination, and rotation, which makes them useful for tracking objects over time.
  2. Feature Description: This step involves creating a unique descriptor for each keypoint. This descriptor is used to match keypoints between frames, which allows the object to be tracked.
  3. Feature Matching: This step involves finding the matches between keypoints in different frames. This is done by comparing the descriptors of the keypoints.
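
To make these three steps concrete, here is a minimal end-to-end sketch in OpenCV. This is my own illustration, not the code used later in this series; ORB with brute-force Hamming matching is just one possible choice for each step, and the image file names are placeholders:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Load two consecutive frames as grayscale images (file names are placeholders)
    cv::Mat img1 = cv::imread("frame_0000.png", cv::IMREAD_GRAYSCALE);
    cv::Mat img2 = cv::imread("frame_0001.png", cv::IMREAD_GRAYSCALE);

    // 1. Keypoint detection and 2. feature description (ORB does both in one call)
    cv::Ptr<cv::ORB> orb = cv::ORB::create();
    std::vector<cv::KeyPoint> kpts1, kpts2;
    cv::Mat desc1, desc2;
    orb->detectAndCompute(img1, cv::noArray(), kpts1, desc1);
    orb->detectAndCompute(img2, cv::noArray(), kpts2, desc2);

    // 3. Feature matching: brute-force matcher with Hamming distance for binary descriptors
    cv::BFMatcher matcher(cv::NORM_HAMMING, true); // cross-check enabled
    std::vector<cv::DMatch> matches;
    matcher.match(desc1, desc2, matches);

    std::cout << "Matched " << matches.size() << " keypoints between the two frames" << std::endl;
    return 0;
}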

2D feature tracking is a powerful technique that is used in a wide variety of applications, including:

  • Object tracking: This is the most common application of 2D feature tracking, used to follow people, vehicles, and other objects in videos. It is also the primary use case in a self-driving car.
  • Video stabilization: This technique can be used to remove camera shake from videos. This makes the videos smoother and easier to watch.
  • Structure from motion: This technique can be used to reconstruct the 3D structure of a scene from a sequence of images.

2D feature tracking is a rapidly evolving field, and new algorithms are being developed all the time. As these algorithms become more accurate and efficient, they will play an even greater role in a wide variety of applications.

I will cover all three steps of 2D feature tracking in a series of Medium articles, starting with keypoint detection in this article.

If you’re interested in learning more about 2D feature tracking, I encourage you to read my upcoming articles on feature description and feature matching.

Keypoint Detection

To start tracking features, we must first locate them within an image. These “points of interest” in an image are also commonly referred to as keypoints. Keypoints are like the “landmarks” of an image. They are the points most likely to be unique and invariant to changes in the image, such as rotation, scale, or illumination.

In this article, we will look at several ways to detect keypoints and assess their advantages and disadvantages. Our focus will be on the HARRIS, SHITOMASI, FAST, BRISK, ORB, AKAZE, and SIFT algorithms. We will evaluate their effectiveness based on the number of keypoints they detect and how long they take to run.

Most keypoint detectors work by analyzing the distribution of brightness across an image. Areas with significant changes in brightness exhibit high intensity gradients and are potential keypoints. In this discussion, we will focus on the HARRIS detector to establish the core concepts of keypoint detection, while briefly touching on the other detectors.

1. HARRIS Detector

The Harris Corner Detector is one of the earliest keypoint detection algorithms. It identifies corners by looking for significant changes in intensity in different directions. The algorithm calculates the corner response function, which is a measure of local intensity changes. Keypoints are then selected based on thresholding of this response function.

The idea of keypoint detection is to find a unique structure in an image that can be precisely located in both coordinate directions. Corners are ideally suited for this purpose, as shown in the image below.

Image Credits: Shree Nayar (https://youtu.be/Z_HwkG90Yvw)

Next, to programmatically find these corners in an image, we need to calculate the intensity gradients along both the x and y directions. Corners show strong gradients in both directions, unlike other features such as edges, which show a strong gradient in only one direction.

Image Credits: Shree Nayar (https://youtu.be/Z_HwkG90Yvw)
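
As a minimal sketch of this step (my own illustration; img is assumed to be a grayscale cv::Mat), the gradients can be computed in OpenCV with the Sobel operator:

// Intensity gradients in the x and y directions (img is a grayscale cv::Mat)
cv::Mat Ix, Iy;
cv::Sobel(img, Ix, CV_32F, 1, 0, 3); // first derivative in x, 3x3 Sobel kernel
cv::Sobel(img, Iy, CV_32F, 0, 1, 3); // first derivative in y, 3x3 Sobel kernel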

This comparison of intensity changes is formulated using the SSD (sum of squared differences), expressed mathematically below. The window function ‘w’ is either a rectangular window or a Gaussian window that assigns weights to the pixels underneath it.
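
In standard notation, this sum of squared differences is written as:

$$E(u, v) = \sum_{x, y} w(x, y)\,\bigl[I(x + u,\, y + v) - I(x, y)\bigr]^2$$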

After shifting the window ‘w’ by an amount u in the x-direction and v in the y-direction, the equation sums up the squared differences of all pixels within ‘w’ at the old and the new window positions.

We have to maximize this function E(u,v) for corner detection, which means maximizing the squared-difference term. Applying a Taylor expansion to the above equation and a few mathematical steps, we get the final equation:
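
In standard notation, with I_x and I_y denoting the image gradients in the x and y directions, this is:

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}, \qquad M = \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$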

The result of our mathematical transformations is a matrix M, which can now be conveniently analyzed to locate structural change in a local window ‘w’ around every pixel position in an image. In the literature, the matrix M is often referred to as the structure tensor or the gradient covariance matrix.

To do this, it helps to visualize the matrix M as an ellipse whose axis lengths and directions are given by its eigenvalues and eigenvectors. As can be seen in the following figure, the eigenvector corresponding to the larger eigenvalue (denoted by λ) points in the direction of maximal intensity change, whereas the eigenvector of the smaller eigenvalue points in the direction of minimal change. So, to identify corners, we need to find positions in the image where both eigenvalues of M are significantly large.

Robert Collins: CSE486, Penn State Lecture

Based on the eigenvalues of M, the Harris detector evaluates the following expression to derive a corner response measure at every pixel location, with the factor k being an empirical constant that usually lies in the range k = 0.04–0.06.
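
In its standard form, this corner response measure is:

$$R = \det(M) - k\,\bigl(\operatorname{trace}(M)\bigr)^2 = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^2$$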

The magnitudes of these eigenvalues decide whether a region is a corner (both large), an edge (one large, one small), or flat (both small). In the response image, the brighter a pixel, the higher its Harris corner response. This concept is represented pictorially below.

Robert Collins: CSE486, Penn State Lecture

The result of Harris corner detection is a grayscale image with these scores. Thresholding at a suitable score gives you the corners in the image. To locate corners, we now have to perform a non-maxima suppression (NMS) to:

  • ensure that we get the pixel with maximum cornerness in a local neighborhood and
  • prevent corners from being too close to each other as we prefer an even spread of corners throughout the image.

The idea behind non-maxima suppression is to take the pixel with the highest response within a window, keep it as the corner point, and set the rest to zero.

Below is the implementation of Harris corner detection in OpenCV.

You can also find the implementation in my GitHub Repo:

/**
 * @brief HARRIS keypoint detector
 *
 * @param keypoints detected keypoints
 * @param img input grayscale image
 * @param time execution time in seconds
 * @param bVis visualization flag
 */
void detKeypointsHarris(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img, double &time, bool bVis)
{
    // Detector parameters
    int blockSize = 2;     // for every pixel, a blockSize x blockSize neighborhood is considered
    int apertureSize = 3;  // aperture parameter for the Sobel operator (must be odd)
    int minResponse = 100; // minimum value for a corner in the 8-bit scaled response matrix
    double k = 0.04;       // Harris parameter (see equation for details)

    double t = (double)cv::getTickCount();

    // Detect Harris corners and normalize the output
    cv::Mat dst, dst_norm, dst_norm_scaled;
    dst = cv::Mat::zeros(img.size(), CV_32FC1);
    cv::cornerHarris(img, dst, blockSize, apertureSize, k, cv::BORDER_DEFAULT);
    cv::normalize(dst, dst_norm, 0, 255, cv::NORM_MINMAX, CV_32FC1, cv::Mat());
    cv::convertScaleAbs(dst_norm, dst_norm_scaled);

    // Locate local maxima in the Harris response matrix and perform a
    // non-maximum suppression (NMS) in a local neighborhood around each maximum.
    // The resulting coordinates are stored in a list of keypoints of the type
    // `std::vector<cv::KeyPoint>`.
    double maxOverlap = 0.0; // max. permissible overlap between two keypoints in %
    for (int j = 0; j < dst_norm.rows; ++j)
    {
        for (int i = 0; i < dst_norm.cols; ++i)
        {
            int response = (int)dst_norm.at<float>(j, i);
            if (response > minResponse) // only store points above a threshold as keypoints
            {
                cv::KeyPoint newKeyPoint;
                newKeyPoint.pt = cv::Point2f(i, j);
                newKeyPoint.size = 2 * apertureSize;
                newKeyPoint.response = response;

                // Check whether the new keypoint overlaps with any existing keypoint
                bool bOverlap = false;
                for (auto it = keypoints.begin(); it != keypoints.end(); ++it)
                {
                    double kptOverlap = cv::KeyPoint::overlap(newKeyPoint, *it);
                    if (kptOverlap > maxOverlap)
                    {
                        bOverlap = true;
                        // If the new keypoint has a higher response, replace the old one
                        if (newKeyPoint.response > (*it).response)
                        {
                            *it = newKeyPoint;
                            break;
                        }
                    }
                }
                // If there is no overlap, add the new keypoint to the list
                if (!bOverlap)
                {
                    keypoints.push_back(newKeyPoint);
                }
            }
        }
    }

    time = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
    std::cout << "HARRIS with n= " << keypoints.size() << " keypoints in " << 1000 * time << " ms" << std::endl;

    // Visualize the results
    if (bVis)
    {
        cv::Mat visImage = img.clone();
        cv::drawKeypoints(img, keypoints, visImage, cv::Scalar::all(-1), cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);
        std::string windowName = "Harris Corner Detection Results";
        cv::namedWindow(windowName);
        cv::imshow(windowName, visImage);
        cv::waitKey(0);
    }
}
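
A minimal usage sketch for the function above (the image path is a placeholder):

std::vector<cv::KeyPoint> keypoints;
double time = 0.0;
cv::Mat imgGray;
cv::cvtColor(cv::imread("images/0000.png"), imgGray, cv::COLOR_BGR2GRAY); // load and convert to grayscale
detKeypointsHarris(keypoints, imgGray, time, true); // detect corners, measure time, and visualize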

2. SHITOMASI (Shi-Tomasi Corner Detector):

The Shi-Tomasi corner detector is a simple but effective keypoint detector based on the Harris corner detector. Instead of the Harris response, it uses the minimum eigenvalue of the 2 x 2 gradient covariance matrix at each pixel as the corner score; a pixel stands out as a corner when this minimum eigenvalue, and therefore both eigenvalues, is above a threshold.
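In other words, the corner score is simply the smaller eigenvalue of the same matrix M used by the Harris detector:

$$R = \min(\lambda_1, \lambda_2)$$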
Advantages: Simple, better performance than the Harris Corner Detector, less sensitive to noise.
Disadvantages: Limited to corner detection, not fully invariant to scale and rotation.
OpenCV Implementation:

// Detect keypoints in an image using the traditional Shi-Tomasi detector
void detKeypointsShiTomasi(std::vector<cv::KeyPoint> &keypoints, cv::Mat &img, double &time, bool bVis)
{
    // Compute detector parameters based on image size
    int blockSize = 4;       // size of an average block for computing a derivative covariation matrix over each pixel neighborhood
    double maxOverlap = 0.0; // max. permissible overlap between two features in %
    double minDistance = (1.0 - maxOverlap) * blockSize;
    int maxCorners = img.rows * img.cols / std::max(1.0, minDistance); // max. number of keypoints

    double qualityLevel = 0.01; // minimal accepted quality of image corners
    double k = 0.04;

    // Apply corner detection
    double t = (double)cv::getTickCount();
    std::vector<cv::Point2f> corners;
    cv::goodFeaturesToTrack(img, corners, maxCorners, qualityLevel, minDistance, cv::Mat(), blockSize, false, k);

    // Add corners to the result vector
    for (auto it = corners.begin(); it != corners.end(); ++it)
    {
        cv::KeyPoint newKeyPoint;
        newKeyPoint.pt = cv::Point2f((*it).x, (*it).y);
        newKeyPoint.size = blockSize;
        keypoints.push_back(newKeyPoint);
    }

    time = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
    std::cout << "Shi-Tomasi with n= " << keypoints.size() << " keypoints in " << 1000 * time << " ms" << std::endl;

    // Visualize the results
    if (bVis)
    {
        cv::Mat visImage = img.clone();
        cv::drawKeypoints(img, keypoints, visImage, cv::Scalar::all(-1), cv::DrawMatchesFlags::DRAW_RICH_KEYPOINTS);
        std::string windowName = "Shi-Tomasi Corner Detection Results";
        cv::namedWindow(windowName);
        cv::imshow(windowName, visImage);
        cv::waitKey(0);
    }
}

3. FAST (Features from Accelerated Segment Test):

FAST is designed for real-time applications where speed is crucial. It identifies corners by comparing the intensity of a central pixel to its surrounding pixels. FAST tests a set of pixels in a circular pattern around a candidate corner pixel. If enough contiguous pixels are brighter or darker than the central pixel, it is considered a corner.
Advantages: Extremely fast, suitable for real-time applications.
Disadvantages: Prone to false positives, not invariant to scale and rotation.
OpenCV Implementation:

// FAST detector
int threshold = 30; // difference between intensity of the central pixel and pixels of a circle around this pixel
bool bNMS = true; // perform non-maxima suppression on keypoints
cv::FastFeatureDetector::DetectorType type = cv::FastFeatureDetector::TYPE_9_16;
detector = cv::FastFeatureDetector::create(threshold, bNMS, type);
t = (double)cv::getTickCount();
detector->detect(img, keypoints);
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
cout << "FAST with n= " << keypoints.size() << " keypoints in " << 1000 * t / 1.0 << " ms" << endl;

4. BRISK (Binary Robust Invariant Scalable Keypoints):

BRISK is designed to combine speed with robustness and uses a binary descriptor for efficiency. It detects keypoints with an AGAST-based corner score computed across a scale-space pyramid, which makes the detection robust to scale changes. A binary descriptor is then generated by comparing the brightness of pixel pairs arranged in concentric circles around the keypoint, with an estimated keypoint orientation providing rotation invariance.
Advantages: Robust to rotation and scale changes, faster computation due to binary descriptors.
Disadvantages: May not be as accurate as some other detectors.
OpenCV Implementation:

// BRISK detector
int threshold = 30; // AGAST detection threshold score
int octaves = 3; // detection octaves
float patternScale = 1.0f; //apply this scale to the pattern used for sampling the neighbourhood of a keypoint.

detector = cv::BRISK::create(threshold, octaves, patternScale);
t = (double)cv::getTickCount();
detector->detect(img, keypoints);
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
cout << "BRISK with n= " << keypoints.size() << " keypoints in " << 1000 * t / 1.0 << " ms" << endl;

5. ORB (Oriented FAST and Rotated BRIEF):

ORB combines corner detection and feature description in a single algorithm. ORB first identifies corners using a FAST-like approach. It then computes oriented BRIEF descriptors to capture keypoint features. The orientation helps improve invariance to rotation.
Advantages: Combines corner detection and feature description, performs well in real-time applications.
Disadvantages: May not be as robust as other detectors in certain scenarios.
OpenCV Implementation:

// ORB detector
int nfeatures = 5000; // The maximum number of features to retain
detector = cv::ORB::create(nfeatures);
t = (double)cv::getTickCount();
detector->detect(img, keypoints);
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
cout << "ORB with n= " << keypoints.size() << " keypoints in " << 1000 * t / 1.0 << " ms" << endl;

6. AKAZE (Accelerated-KAZE):

AKAZE is an extension of the KAZE keypoint detector, designed for robustness across different scales and transformations. AKAZE uses nonlinear diffusion to enhance the detection of scale-invariant keypoints. It employs a multiscale approach to capture keypoints at various levels of image detail.
Advantages: Robust across scales and rotations, performs well under varying transformations.
Disadvantages: Relatively slower compared to some other detectors.
OpenCV Implementation:

// AKAZE detector
detector = cv::AKAZE::create();
t = (double)cv::getTickCount();
detector->detect(img, keypoints);
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
cout << "AKAZE with n= " << keypoints.size() << " keypoints in " << 1000 * t / 1.0 << " ms" << endl;

7. SIFT (Scale-Invariant Feature Transform):

SIFT is a widely used algorithm known for its robustness to scale and rotation changes. SIFT detects keypoints by identifying maxima and minima in the difference of Gaussian (DoG) scale space. Keypoint descriptors are generated by considering gradients in various directions around the keypoint.
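For reference, the difference-of-Gaussian function whose scale-space extrema SIFT searches for can be written as:

$$D(x, y, \sigma) = \bigl(G(x, y, k\sigma) - G(x, y, \sigma)\bigr) * I(x, y)$$

where G is a Gaussian kernel, I is the input image, * denotes convolution, and k is a constant multiplicative factor between adjacent scales.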
Advantages: Robust to scale, rotation, and lighting changes, widely used and reliable.
Disadvantages: Computationally more intensive than binary detectors; it was long patent-protected, although the patent expired in March 2020 and SIFT is now included in the main OpenCV module.
OpenCV Implementation:

// SIFT detector
detector = cv::SIFT::create();
//detector = cv::xfeatures2d::SIFT::create(); //Use this for old versions of OpenCV
t = (double)cv::getTickCount();
detector->detect(img, keypoints);
t = ((double)cv::getTickCount() - t) / cv::getTickFrequency();
cout << "SIFT with n= " << keypoints.size() << " keypoints in " << 1000 * t / 1.0 << " ms" << endl;

Performance Evaluation

To compare the algorithms, I ran each detector on a series of 10 images depicting a car navigating through traffic. Each algorithm detects keypoints in all 10 images, and the aggregated data is compiled to produce the graphs below. You can also access the full implementation in my GitHub repository.

The following graphs compare all the detector algorithms in terms of the number of keypoints detected, the execution time, and the detection rate (keypoints per millisecond).

Based on the above comparison, ORB detects the maximum number of keypoints but FAST is the “fastest” in terms of execution.

Once you have identified the keypoints in an image, the next ‘key’ step is to give them unique descriptors so that they can be easily tracked in consecutive images. We will be discussing the topic of feature descriptors in our upcoming article, so stay tuned!
