2D Feature Tracking — Part 3: Feature Matching

Nikhil Nair
9 min read · Sep 17, 2023

--

Matched keypoints between consecutive images of a sequence (Images from the KITTI dataset)

In the first two parts of this series, we learned about detecting keypoints in an image, which are significant landmarks, and describing them. Once you have located and described a set of keypoints for each image in a sequence of frames, the next step in the tracking process is to find the best match for each keypoint in successive frames. To do so, you need to implement a suitable similarity measure so your tracking algorithm can uniquely assign keypoint pairs. In the final part of our 2D feature tracking series, we will learn how to match the feature descriptors between two images.

I highly recommend reading the first two parts before this article.

Part 1: Feature Detection

Part 2: Feature Description

We will start with an overview of available similarity measures, then look at how to deal with ambiguities that arise when features look very similar, and conclude with an overview of performance measures that describe the quality of a matching method, so you can compare algorithms and make an informed choice when selecting a suitable method for your project.

Similarity Measures

Measuring the similarity of data samples is important to understand how close or related they are to each other. There are different methods to compute the distance between two descriptors and transform their differences into a single number. This number is then used as a simple similarity measure.

One of the distance functions used in computer vision is the Sum of Absolute Differences (SAD). To compute it, two descriptor vectors, da and db, are used as inputs. The SAD is calculated by subtracting every component of db from the corresponding component of da, taking the absolute value of each difference, and summing the results over all components. The SAD is also known as the L1 norm in the literature.

When comparing two descriptor vectors, another distance function that can be used is the Sum of Squared Differences (SSD). It is similar to SAD in that it computes differences between individual components of both vectors, but it sums the squared differences instead of the absolute differences. The SSD is closely related to the L2 norm in the literature: the Euclidean (L2) distance is the square root of the SSD.

When it comes to measuring the geometric distance between both vectors, the L2-norm provides a more accurate measure.

When dealing with a binary descriptor that contains only ones and zeros, the most effective and efficient way to compare them is by using the Hamming distance. This measure calculates the difference between two vectors using an XOR function. The XOR function returns zero if two bits are identical and one if two bits are different. Therefore, by adding up all the XOR operations, we can easily determine the number of differing bits between both descriptors.

Types of descriptor distance functions (Image Courtesy: Udacity Sensor Fusion Nanodegree)

It is important to use the appropriate distance measure depending on the type of descriptor being used. For gradient-based methods like SIFT, the L2-norm is the most suitable measure. On the other hand, when dealing with binary descriptors, the Hamming distance should be used. This ensures that the distance measure is adapted to the type of descriptor being used.

Finding Matches

Brute Force Matching

Assuming we have N keypoints and their associated descriptors in one image and M keypoints in another image, the most straightforward method of finding corresponding pairs would be to compare all features with each other. This involves performing N x M comparisons.

For a given keypoint from the first image, the distance to every keypoint in the second image is calculated, and the keypoint with the smallest distance is considered its match. This approach is called Brute Force Matching or Nearest Neighbor Matching, and it is available in OpenCV as the BFMatcher.

The result of brute force matching in OpenCV is a list of keypoint pairs arranged by the distance of their descriptors under the chosen distance function. Brute force matching is effective for small keypoint numbers but can be computationally expensive as the number of keypoints grows.

Example of BF Matcher implementation in OpenCV:

void BFMatcher(cv::Mat &descSource, cv::Mat &descRef, std::vector<cv::DMatch> &matches)
{
    // configure matcher
    bool crossCheck = false;
    int normType = cv::NORM_HAMMING; // use cv::NORM_L2 for SIFT and other gradient-based descriptors
    cv::Ptr<cv::DescriptorMatcher> matcher = cv::BFMatcher::create(normType, crossCheck);

    // find the best match for each descriptor in descSource using nearest-neighbor selection
    matcher->match(descSource, descRef, matches);

    std::cout << "(NN) with n=" << matches.size() << " matches" << std::endl;
}

When performing BF matching, a descriptor distance threshold T is applied to limit the matches to only the 'good' ones; any pair whose descriptor distance exceeds T is discarded. Matches between truly corresponding pairs are labeled 'True Positives (TP),' while mismatches are called 'False Positives (FP).' The challenge is to select a value for T that maximizes the number of TP matches while minimizing FP matches. Balancing TP and FP depends on the image content and the detector/descriptor combination used. The following figure presents the SSD distributions of TP and FP, which helps illustrate the threshold selection process.

Threshold Selection (Image Courtesy: Udacity Sensor Fusion Nanodegree)

When matching features, we set a threshold to determine which matches are true positives (TP) and which are false positives (FP). The first threshold, T1, is set to allow for some TP matches while avoiding most FP matches. However, with this setting, many TP matches are also discarded. By increasing the threshold to T2, more TP matches can be selected, but at the same time, the number of FP matches also increases significantly. In practice, separating TP and FP matches is difficult, so setting a matching threshold is always a compromise between balancing good and bad matches. Although FP matches cannot be avoided in most cases, the goal is always to reduce their number as much as possible.

FLANN Matcher (Fast Library for Approximate Nearest Neighbors)

Marius Muja and David Lowe, the creator of SIFT, released an open-source library in 2009 called the Fast Library for Approximate Nearest Neighbors (FLANN). The library uses machine learning concepts to train an indexing structure for identifying potential matching pairs quickly. FLANN builds a highly efficient search data structure, such as a KD-tree, to look up matching pairs, avoiding the exhaustive search of the brute force approach. As a result, it is faster while still producing excellent results, which depend on the matching parameters used.

To use a FLANN-based matcher in OpenCV, replace the matcher creation call as shown below; the rest of the code remains the same as for the BF matcher. Note that the default FLANN index works on floating-point descriptors, so binary descriptors (e.g., BRISK, ORB) must first be converted to CV_32F.

matcher = cv::FlannBasedMatcher::create();

Selecting Matches

When using brute force matching to match keypoints between two images, there will always be a match for a keypoint in the first image, even if it's not actually present in the second image, as long as the selected threshold T is not exceeded. This can lead to several false matches. To address this issue, cross-check matching can be used. This involves applying the matching procedure in both directions and only keeping the matches where the best match in one direction is the same as the best match in the other direction.

To reduce the number of false positives, a helpful technique is to calculate the nearest-neighbor distance ratio for each keypoint. D. Lowe introduced this technique in his work on SIFT. Instead of applying the threshold directly to the SSD, the method locates the two best matches for each keypoint in the reference image and computes the ratio between their descriptor distances. Ambiguous matches are then filtered out by applying a threshold to this ratio. The diagram below illustrates the approach.

Image Courtesy: Udacity Sensor Fusion Nanodegree

The example demonstrates how an image patch with a descriptor da is compared to two other image patches with descriptors db1 and db2. The patches look very similar and can result in unreliable matches. To filter out weak candidates, the SSD ratio between the best and second-best match is computed.

In practice, a threshold value of 0.8 has been effective in balancing true and false positives. In the original SIFT paper, this setting eliminated 90% of false matches while losing less than 5% of the correct matches.

Here's an implementation of Lowe's ratio test with k-NN matching using the BF matcher in OpenCV.

void BFMatcher(cv::Mat &descSource, cv::Mat &descRef, std::vector<cv::DMatch> &matches)
{
    // configure matcher (cross-checking must be disabled when requesting k > 1 neighbors)
    bool crossCheck = false;
    int normType = cv::NORM_HAMMING;
    std::vector<std::vector<cv::DMatch>> knn_matches;
    cv::Ptr<cv::DescriptorMatcher> matcher = cv::BFMatcher::create(normType, crossCheck);

    // find the k = 2 nearest neighbors for each descriptor
    matcher->knnMatch(descSource, descRef, knn_matches, 2);

    // filter matches using Lowe's ratio test
    const float ratio_thresh = 0.8f;
    for (size_t i = 0; i < knn_matches.size(); i++)
    {
        if (knn_matches[i].size() == 2 &&
            knn_matches[i][0].distance < ratio_thresh * knn_matches[i][1].distance)
        {
            matches.push_back(knn_matches[i][0]);
        }
    }

    std::cout << "(KNN) with n=" << matches.size() << " good matches" << std::endl;
}

Evaluating Matcher Performance

When it comes to detecting and describing features, many different types of algorithms are available. Choosing the right pair of algorithms depends on the specific problem that needs to be solved, considering factors like the accuracy of detected features and the number of matching pairs. The following section overviews some of the most common measures used in this context.

  • The True Positive Rate (TPR) is the ratio between keypoints that have been matched correctly (true positives — TP) and the sum of all potential matches, including ones missed by the detector/descriptor (TP plus false negatives — FN). A perfect matcher would have a TPR of 1.0, as every possible correct match would be found. In the literature, the TPR is also called recall and quantifies how many of the possible correct matches were actually found.
  • The False Positive Rate (FPR) is the ratio between keypoints that have been incorrectly matched (false positives — FP) and the sum of all features that should have been left without a match (FP plus true negatives — TN). A perfect matcher would have an FPR of 0.0. The FPR is also referred to as the false alarm rate and describes how likely it is that a detector/descriptor selects a wrong keypoint pair.
  • The Precision of a matcher is the number of correctly matched keypoints (TP) divided by the number of all matches (TP plus FP). This measure is also referred to as the inlier ratio.

To determine whether a match or non-match is accurate, it is necessary to have ground-truth information about the image material used for evaluation. Many publications have used image sequences in which all keypoints lie on a flat surface. In such cases, a model-based estimation can be used to separate true positives and true negatives from false positives and false negatives.

The Receiver Operating Characteristic (ROC) is a graphical plot that illustrates the ability of a detector/descriptor to distinguish between true and false matches as its discrimination threshold is adjusted. The ROC can be used to visually compare different detectors and descriptors and select a suitable discrimination threshold for each. The name ROC originated during World War II when radar operators used the method to identify enemy targets.

An ideal detector / descriptor would have a TPR of 1.0 while the FPR would be close to 0.0 at the same time.

ROC Curve (Image Courtesy: Udacity Sensor Fusion Nanodegree)

The graph below shows ROC curves of different descriptors, including SIFT, BRISK, and others, for visual comparison.

Image source — Research paper titled Efficient discriminative projections for compact binary descriptors, in European Conference on Computer Vision, by Trzcinski et al., 2012

I compared the performance of various detector/descriptor and matcher combinations on a sample dataset from KITTI and visualized the results in graphs. The reference code is available on my GitHub.

Descriptor matching applied on a sequence of traffic images from the KITTI dataset. The charts compare:
  • Number of good matches
  • Percentage of good matches out of all detected keypoints
  • Matcher execution time
  • Total execution time

As evident from the comparison study, FAST + BRIEF or FAST + ORB offer a good balance between faster execution time and achieving over 80% suitable matches.

--
