Next-Generation Stereo Vision Autocalibration Software for Advanced Driver Assistance Systems (ADAS) and Autonomous Vehicles (AV)

NODAR, Inc.
NODAR Blog
Sep 9, 2020
Depth map where color encodes the distance to each pixel. Red is closer than blue.

We compared NODAR’s stereo vision autocalibration algorithm to state-of-the-art calibration methods and found that NODAR’s patent-pending technology achieves greater than ten-fold improvement in estimating the relative camera orientation. The improved performance opens up a new operating regime for stereo vision cameras: long-baseline (> 0.5-m) stereo vision systems that maintain performance in harsh vibration environments over 15 years without the need for tedious and expensive manual calibration.

Figure panels: typical road scene; depth map with standard autocalibration; depth map with NODAR autocalibration.

Estimating the relative position and orientation of the stereo camera pair — the so-called camera extrinsic parameters or relative camera pose — is needed for row-aligning the left and right images (a process called image rectification) since virtually every stereo matching algorithm searches for common features along rows to make the computation fast enough for real-time operation. A misalignment of the cameras that causes the relative orientation to shift, such as a tiny deformation of the camera mount from a small temperature change, can cause the left and right rows to misalign and completely ruin the depth map. Autocalibration is the key component for stereo vision systems that must stay aligned for years in outdoor environments, as those systems must remain aligned to better than 1/100th of a degree to ensure accurate depth estimation of every pixel.
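To get a feel for the scale involved, here is a rough back-of-the-envelope sketch (assuming a hypothetical focal length of about 1,000 pixels, not a value from any specific camera) showing how quickly a small relative pitch error between the cameras uses up a sub-pixel row-alignment budget:

```python
import math

# Hypothetical focal length in pixels (illustrative value only).
focal_length_px = 1000.0

# A relative pitch error of theta shifts rows near the image center by
# roughly f * tan(theta) pixels, breaking the row alignment that stereo
# matchers rely on.
for pitch_error_deg in (0.01, 0.1, 1.0):
    shift_px = focal_length_px * math.tan(math.radians(pitch_error_deg))
    print(f"pitch error {pitch_error_deg:5.2f} deg -> row shift ~ {shift_px:.2f} px")

# 0.01 deg -> ~0.17 px (close to the sub-pixel limit of typical matchers)
# 0.10 deg -> ~1.75 px (rows no longer line up)
# 1.00 deg -> ~17.5 px (the depth map is destroyed)
```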

A quick search on Google Scholar turns up 68 papers on stereo vision autocalibration, explaining how to automatically rectify the camera images without a manual calibration using a checkerboard or other calibration target. The theory of autocalibration is so well developed that an entire chapter of Hartley and Zisserman's textbook, Multiple View Geometry in Computer Vision, is devoted to the topic [1]. Although the multitude of journal articles and the textbook chapter give the impression that stereo autocalibration is a solved problem, that is certainly not the case for automotive stereo vision with natural scenes. The angular accuracy requirements for automotive applications are far more stringent than for indoor robotics (the application area for most of the aforementioned published work) because the distances of interest are much longer than the baseline length, so higher angular resolution is needed to achieve the same range resolution. For example, Hartley and Zisserman note that their indoor autocalibration system can recover angles of the metric reconstruction to within 1 degree [1, p. 497], but an error of that size in automotive applications leads to completely garbled depth maps, as will be shown later in the article.

Previous methods are based on matching features, or keypoints, in the left and right images. Through an epipolar geometry constraint, one can then estimate the relative orientation and translation direction of the stereo camera pair. The main challenges of this approach are:

  1. Finding feature points to sub-pixel accuracy
  2. Matching features reliably from the left image to the right image
  3. Having a good distribution of feature points across the image

Because of these challenges with keypoint matching algorithms, NODAR does not use this approach in our autocalibration software. Still, many of the 68 papers on stereo camera autocalibration were published 10–20 years ago, so we wondered whether a more modern keypoint algorithm using deep learning feature descriptors and matchers could improve the accuracy of the rectified images. The short answer is no: the improvement from deep learning feature descriptors and matchers over ORB and SIFT is small.

OpenCV includes multiple keypoint matching algorithms, including SIFT [2] and ORB [3], that are commonly used for feature detection. Here, we investigate the performance of a modern, deep-learning-based feature matching technique called SuperGlue [4] (which drew attention at CVPR 2020) to see how it compares to the more traditional methods and to determine whether there should be a shift toward these deep learning approaches as this area of research grows.

General Approach to Camera Calibration

Estimating the relative camera pose from matched keypoints requires several steps, each with room for modification and improvement: first, the matching feature points, or keypoints, in the image pair are extracted; then those correspondences are used to estimate the relative pose; and finally the estimates are used to rectify the images. Once the images are rectified, further analysis can be done, including computing accurate disparity and depth maps. While the method of feature point extraction differs from approach to approach, the common strategy for camera calibration follows the general outline in MathWorks' autocalibration example [5]. In the following code, we use OpenCV and Python.

With the images undistorted and converted to grayscale, you can begin by finding the keypoints. If using SIFT, you would do:
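For example, a minimal sketch of that step, assuming left_gray and right_gray hold the undistorted grayscale images:

```python
import cv2

# SIFT is available as cv2.SIFT_create() in OpenCV >= 4.4; older builds
# expose it as cv2.xfeatures2d.SIFT_create().
sift = cv2.SIFT_create()

# detectAndCompute returns the keypoints and their 128-dimensional descriptors.
kp_left, desc_left = sift.detectAndCompute(left_gray, None)
kp_right, desc_right = sift.detectAndCompute(right_gray, None)
```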

If using ORB, you would do:
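Again as a sketch, with the same assumed grayscale inputs:

```python
import cv2

# nfeatures is an illustrative cap on the number of keypoints to detect.
orb = cv2.ORB_create(nfeatures=2000)

# ORB returns binary descriptors, so they are later matched with the
# Hamming distance rather than the L2 norm.
kp_left, desc_left = orb.detectAndCompute(left_gray, None)
kp_right, desc_right = orb.detectAndCompute(right_gray, None)
```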

From there, the two matching algorithms follow similar approaches for retrieving the matching keypoints:
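A sketch of that matching step using OpenCV's brute-force matcher (cv2.NORM_L2 suits SIFT's float descriptors; cv2.NORM_HAMMING would be used for ORB's binary descriptors), keeping the 100 best matches by descriptor distance:

```python
import cv2
import numpy as np

# desc_left / desc_right and kp_left / kp_right come from the detection step above.
matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)  # NORM_HAMMING for ORB
matches = matcher.match(desc_left, desc_right)

# Keep the 100 best matches (smallest descriptor distance).
matches = sorted(matches, key=lambda m: m.distance)[:100]

# Pixel coordinates of the matched keypoints in the left and right images.
points1 = np.float32([kp_left[m.queryIdx].pt for m in matches])
points2 = np.float32([kp_right[m.trainIdx].pt for m in matches])
```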

Where points1 and points2 hold the corresponding matching keypoints for the two images.

With the matching keypoints and intrinsic information, you can retrieve the extrinsics with the following code:
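A sketch of that pose recovery via the essential matrix, assuming K is the 3x3 intrinsic matrix (taken here as identical for both cameras for simplicity):

```python
import cv2

# Estimate the essential matrix from the matched points, using RANSAC to
# reject bad correspondences.
E, inlier_mask = cv2.findEssentialMat(points1, points2, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)

# Decompose the essential matrix into the relative rotation R and the
# translation direction t (unit length: the metric baseline is not
# observable from image correspondences alone).
_, R, t, _ = cv2.recoverPose(E, points1, points2, K, mask=inlier_mask)
```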

The recovered rotation matrix and translation vector, i.e., the camera extrinsics, can then be used along with the camera intrinsics to rectify the image pair.
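One way to do that with OpenCV is sketched below, assuming K1, K2, dist1, dist2, and image_size hold the intrinsics, distortion coefficients, and (width, height), and scaling the unit-length translation by a nominal baseline (0.54 m, roughly the KITTI stereo baseline) to make the result metric:

```python
import cv2

# Scale the unit-length translation direction by the known baseline length.
T = t * 0.54  # metres; roughly the KITTI stereo baseline (illustrative)

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(
    K1, dist1, K2, dist2, image_size, R, T, alpha=0)

# Build per-camera remapping tables and warp both images into the
# row-aligned (rectified) frame.
map1x, map1y = cv2.initUndistortRectifyMap(K1, dist1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, dist2, R2, P2, image_size, cv2.CV_32FC1)
left_rect = cv2.remap(left_gray, map1x, map1y, cv2.INTER_LINEAR)
right_rect = cv2.remap(right_gray, map2x, map2y, cv2.INTER_LINEAR)
```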

SIFT and ORB

Both SIFT and ORB follow the same general approach: using changes in intensity and contrast between neighboring pixels to find local extrema, filtering to select the best keypoints, and then matching keypoints by nearest-neighbor search on their descriptors. Both methods are scale and rotation invariant. The main differences between the two are outlined below:

  1. Detector: SIFT finds extrema of a difference-of-Gaussians scale space; ORB detects FAST corners over an image pyramid, ranked by a Harris score
  2. Descriptor: SIFT produces a 128-dimensional floating-point gradient histogram; ORB produces a compact binary descriptor (rotation-aware BRIEF)
  3. Matching metric: L2 distance for SIFT; Hamming distance for ORB
  4. Speed: ORB is substantially faster to compute and match; SIFT is slower but typically more repeatable

You can read more about SIFT here and more about ORB here.

SIFT and ORB are well known, widely used, and perform fairly well. But what if we could make our keypoint detection better, and in turn make the recovered extrinsic information closer to the ground truth? With a deep learning model, we were hoping to do just that.

SuperGlue

While SIFT's and ORB's feature matching methods are grounded in finding changes in intensity and contrast from pixel to pixel, SuperGlue's approach is a bit different. SuperGlue is "a neural network that matches two sets of local features by jointly finding correspondences and rejecting non-matchable points." It uses SuperPoint, another local feature detection algorithm, to find the "visual descriptors" of the feature points, and then runs those points through the neural network to determine the best matches. The first component of the network is an "Attentional Graph Neural Network," which "computes matching descriptors" from the keypoints "by letting the features communicate with each other." It also encodes each keypoint's "visual appearance and location" into a single vector so the two can be evaluated together. After this network comes an "optimal matching layer," which creates "a partial soft assignment matrix."

The general structure of the architecture is described in the figure below, taken from the SuperGlue paper:

Figure 1: “SuperGlue Architecture” as seen in SuperGlue paper [4]

SuperGlue provides a public pre-trained network with weights for indoor and outdoor scenes, and seems to be gaining popularity. You can read more about the SuperGlue algorithm here and view their demo code here.

We believed this method would be promising because it compares keypoints both to other keypoints within the same image and to keypoints in the other image, gathering more information. Furthermore, SuperGlue scores its matches and gives confidence ratings, making it easier to detect which matches are good.

Dataset

To evaluate the performance of the different matching algorithms on real data taken from a vehicle, we chose sample imagery from the well-known KITTI Vision Benchmark Suite. More specifically, we used the unrectified drive 15 Road data from the KITTI raw dataset, which consists of just over 300 frames, or about 30 seconds of images. We chose KITTI because it is widely used, has reliable ground-truth intrinsics and extrinsics, and provides many frames. Six image pairs are included in the git repository, and the rest are available here.

Method

To compare the performance of SIFT, ORB, and SuperGlue, we developed a script that lets the user select which matching algorithms to compare. The user can specify the number of matching keypoints to generate for each keypoint/feature detector; the best 100 matches are then used for camera pose estimation.

This script can be run with KITTI or any other dataset uploaded into the proper directory, following the same format as the KITTI data. You can find our code here.

Results

All the results in this section come from running the three keypoint matching algorithms and NODAR's algorithm over 303 frames of KITTI data with each algorithm's default maximum number of points, of which the top 100 matches are used for pose recovery. Running SuperGlue, ORB, and SIFT in Python on a MacBook took around 10 seconds, 0.5 seconds, and 1 second per frame, respectively. All three feature descriptor algorithms are fast enough for daily or even hourly re-calibration of a vehicle-mounted stereo vision system.

Below are the pitch, yaw, and roll estimates (for the rotation matrix) in degrees for the different frames compared to the ground truth over 303 frames:

Figure 2: Rotation values over 303 frames

The recovered values from NODAR's autocalibration software are so close to the ground-truth values that it is hard to see the difference in the plot above. Among the keypoint approaches, ORB bounces around, sometimes being off by over 20 degrees, while SIFT and SuperGlue stay closer to the ground truth. The pose estimation from the keypoint approaches was so poor on a frame-by-frame basis that we decided to average their orientations over all 303 frames to see if we could come close to obtaining recognizable disparity maps. Taking the absolute difference between the ground truth and the averaged result (e.g., for pitch, abs(actual pitch - sum(all pitch estimates)/303)) gives the errors below:

In the table of errors above, NODAR has the lowest error for pitch, yaw, and roll. For example, NODAR's estimate of roll is 47 times better than ORB's. SuperGlue does show slightly lower error than the other keypoint approaches for pitch and yaw, while ORB shows the poorest performance across all orientations.

The average pitch, yaw, roll, and ground truth translation vector were then used to rectify the raw KITTI images. The disparity map was then computed from the rectified images using semi-global block matching.
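A minimal OpenCV semi-global block matching sketch over a rectified pair (the parameter values below are illustrative, not necessarily those used to produce the figures):

```python
import cv2

block_size = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,              # must be a multiple of 16
    blockSize=block_size,
    P1=8 * block_size ** 2,          # smoothness penalties (common heuristics
    P2=32 * block_size ** 2,         # for single-channel images)
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2)

# compute() returns fixed-point disparities scaled by 16.
disparity = sgbm.compute(left_rect, right_rect).astype("float32") / 16.0
```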

ORB, SIFT, and SuperGlue result in noisy disparity maps with many missing values, but NODAR's disparity map accurately captures the car, the posts, the trees, and other road structures for reliable obstacle detection. Also, Figures 4–6, corresponding to the disparity maps for the keypoint approaches, show disparity maps recovered using orientations averaged over all 303 frames, which is not possible in practical systems (it uses future information to predict the past!). Even with that averaging, the keypoint approaches still produce noisy and uninformative disparity maps, showing that the recovered orientation averaged over all image pairs does not approach the ground-truth values.

The disparity maps for the keypoint approaches are much worse in practice; below you can see the disparity maps corresponding to each frame's recovered pose estimate. The topmost video is the original image, followed by NODAR, then SuperGlue, then SIFT, then ORB. Notice that the disparity maps using the non-averaged orientation estimates are considerably noisier for SuperGlue, SIFT, and ORB. While some frames correctly capture relevant information, on a frame-by-frame basis the estimated orientation parameters are clearly not sufficient for rectification, leading to poor disparity maps. Whereas the video below shows that SuperGlue, SIFT, and ORB fail to correctly estimate the camera pose for each frame and merely average to a slightly better answer over many frames, NODAR produces a clean disparity map for every frame. As a result, the extra work of averaging over many frames is not necessary with NODAR's autocalibration software.

Figure 8

Discussion

To understand the poor performance of keypoint approaches and the differences in the resulting disparity maps among SIFT, ORB, and SuperGlue, we next looked at the detected keypoints and their matches.

ORB does a good job detecting key features; however, its keypoints often cluster solely around these features, leading to a poor distribution across the whole image. Take, for example, frame 2 of the KITTI dataset with the ORB feature points overlaid on the image pair (shown in Figure 9). There are 100 matches, yet only about 30 distinct features (or clusters of keypoints), mainly found in a narrow horizontal stripe in the center of the image.

Figure 9: Frame 2 (left and right) of KITTI with ORB keypoint overlay

ORB picks out good features, like the car on the main road and the cars on the side road. However, all the matching keypoints returned are heavily clustered around the key objects with few keypoints in other areas of the image, like the road, grass on the side, or nearby trees. So, recovering the pose without uniformly sampling the image leads to poor results.

SIFT has a clustering issue similar to ORB's, though it is less pronounced, which leads to better results. SIFT also tends to cluster keypoints around key features, but it is able to detect keypoints in other areas of the image as well. Below is an example of the same frame shown in Figure 9, but with the SIFT keypoints overlaid:

Figure 10: Frame 2 (left and right) of KITTI with SIFT keypoint overlay

Instead of just clustering around the cars and the area near them, SIFT’s keypoints are present in more areas of the image, like the lines defining the road space, the wires above the street, and the trees in the distance. Still, even though the keypoints encompass more space, there are many along one feature. So, for example, instead of having lots of overlapping keypoints on the cars like with ORB in Figure 9, there are many keypoints along the white line on the right side of the street. They’re not overlapping, but they are still essentially detecting the same feature. In addition, notice that the keypoints on the telephone lines in the upper left part of the image do not correctly match between the left and right images, which is another source of error for the pose estimate. For these test images, SIFT has a slightly better spread of keypoints compared to ORB, but still not good enough to recover the pose accurately.

Meanwhile, SuperGlue has the best distribution of keypoints of the three matching algorithms. Below is the same frame as the above figures with the SuperGlue keypoints overlaid:

Figure 11: Frame 2 (left and right) of KITTI with SuperGlue keypoint overlay

In Figure 11, all of the keypoints are distributed across a large portion of the image, very few keypoints are overlapping or directly next to each other, and they fill the image space more evenly. While SIFT was able to capture keypoints across the whole image, its points tended to cluster around the features it picked, so although it was better distributed, the scope of the keypoints was still pretty narrow. SuperGlue’s keypoints are well spread out, covering much of the image, and finding good features, which could lend itself to good pose estimation and disparity map recovery. Furthermore, the matching of keypoints in the left and right images seems to be more accurate than for ORB or SIFT.

To see these differences together, observe the image below, which shows frame 2 overlaid with the keypoints from all of the keypoint matching algorithms. Notice that there is very little overlap between the keypoints found by the different methods, which further explains why the choice of feature detector matters for pose estimation.

Figure 12: Frame 2 of KITTI with keypoints from all methods overlaid

Conclusion

Compared to older keypoint approaches for stereo vision autocalibration, NODAR's software provides the accuracy and robustness needed for natural road scenes, with better relative camera pose estimation, better-rectified images, and cleaner, more informative disparity maps. In comparing deep learning feature matching to traditional keypoint matching algorithms, SuperGlue performed better, but not well enough for automotive applications. SuperGlue had lower errors than SIFT and ORB for pitch and yaw, but even when its estimated orientations were averaged over 303 frames, the resulting disparity map was only of fair quality.

Applying keypoint matching approaches to stereo autocalibration on real road scenes has highlighted the fragility and weaknesses of this approach for automotive applications. Even with more sophisticated learned feature descriptors and matchers (SuperGlue), only marginal improvements in autocalibration accuracy are obtained. And, as seen with NODAR's autocalibration results, perhaps the solution is to forego feature matching altogether, as there are clearly alternatives that outperform the current widely used approach.

Moreover, although the focus of this article was on calibration over long time scales (hours, days, weeks), an automotive sensor is subject to vibration and shock, which can cause deflection over shorter timescales. None of the keypoint approaches that we are aware of can compensate for fast perturbations, which is necessary for a practical car sensor. Our software compensates for fast perturbations as well as slow variations, but that is for another article!

By Zoe Weiss, Brown University BSCS, NODAR Intern Summer 2020

www.nodarsensor.com

Citations

[1] Hartley, Richard, and Andrew Zisserman. Multiple View Geometry in Computer Vision, ch.19. Cambridge Univ. Press, 2004.

[2] “Introduction to SIFT (Scale-Invariant Feature Transform).” OpenCV.

[3] Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF.

[4] Sarlin, P.-E., DeTone, D., Malisiewicz, T., & Rabinovich, A. (2019). SuperGlue: Learning Feature Matching with Graph Neural Networks.

[5] "Uncalibrated Stereo Image Rectification." MATLAB & Simulink documentation, MathWorks.


NODAR develops high-performance 3D vision systems for use in mass-market applications such as Advanced Driver Assistance, Autonomous Vehicles, and robotics.