Stereo Vision, Re-envisioned
From stationary installations to rugged sensors for cars. And a quick look at one of the latest learning-based stereo vision algorithms.
Binocular vision for capturing 3D information is older than the human species itself and has done a good job of keeping us safe and aware of our surroundings. Stereo vision technology is also old, really old; consumer stereo film cameras pre-date the 1950s. It was not until the 1980s that personal computers became fast enough to complete a stereo reconstruction in minutes per frame. Then, in the 1990s, digital cameras with CCD and CMOS sensors became widely available. A few more decades of development were required before CMOS sensors could meet and exceed human eyesight performance (2–3 electron readout noise and 120 dB dynamic range) and before special-purpose, automotive-grade ASICs could run stereo reconstruction at more than 30 FPS. Although the individual components of the stereo camera have advanced greatly since 1950, the mechanical structure of the camera has remained essentially the same: a thick, rigid multilayer chassis that holds the left and right camera assemblies firmly in place. The chassis is often kept very short (less than 5–10 cm) lest any perturbation change the position or orientation of the cameras. Stereo vision requires that the left and right cameras be row-aligned at the pixel level, so even a 1/100th of a degree shift in the relative orientation of the left and right cameras can completely ruin the output depth map.
At NODAR, we realized that the Achilles' heel of stereo vision was its extremely tight alignment tolerances, which could only be met in stationary settings: laboratories, package inspection systems, manufacturing floors, or any facility where the cameras could be mounted on a solid, stationary surface in a temperature-controlled environment. By removing the requirement that the left and right cameras be hard-mounted to each other, an untethered stereo vision system can be used in dynamic applications such as cars, trucks, and robots that must operate outdoors, vibrate, and experience shock and extreme temperature variations, yet must stay aligned for over 15 years! Instead of mechanically forcing the cameras into alignment, our system allows the cameras to move freely while our software tracks their relative pose using only the vision feed, which allows us to correct the calibration dynamically in real time and provide high-quality 3D data without any need for manual calibration.
Another approach to building stereo systems that are insensitive to changes in the relative alignment of the left and right cameras (the so-called extrinsic calibration parameters) is to design a stereo correspondence algorithm that can somehow convert poorly rectified images into accurate depth maps. We analyze one such stereo correspondence algorithm and show that, although it is less sensitive to perturbations than the traditional semi-global block matching (SGBM) technique, it is still very sensitive to perturbations and does not remove the need for frequent manual calibrations. The remainder of this article compares the sensitivity of two stereo matching algorithms to changes in the relative pitch and roll angles of the stereo cameras:
- A learning-based stereo matching algorithm that was trained to be insensitive to pitch changes between the left and right cameras
- A traditional SGBM technique that serves as a benchmark for comparison
Depth estimation basics
Depth estimation from a pair of RGB cameras is typically achieved through a technique called block matching, which exploits the fact that features appear displaced between the two images because each camera views the same scene from a slightly different perspective. This relative displacement gives rise to a disparity map: an image that stores, for each pixel, how far that pixel has shifted between the two camera views. Depth then follows directly from disparity: for a rectified pair with focal length f and baseline B, a pixel with disparity d lies at depth Z = f·B/d, so large disparities correspond to nearby objects and small disparities to distant ones.
Block matching generates a disparity map by taking a small patch of pixels, say from the left camera image, and searching the right camera image for the most similar patch. The most common similarity measure is the sum of absolute differences, or SAD. To compute the SAD between the template and a candidate block, each pixel in the template is subtracted from the corresponding pixel in the block, the absolute values of the differences are taken, and the results are summed. This yields a single value that roughly measures the similarity of the two patches; a lower value means the patches are more alike. In this way, depth is obtained from a stereo pair through matching cost computation, cost aggregation, disparity optimization, and some post-processing. At first glance, searching the whole right image for every patch in the left image would appear to have quadratic time complexity and therefore be impractically slow. This is where rectification comes in: using the camera calibration parameters, the two images are warped so that matching points lie on the same row, which limits the search to the same row as the patch in the reference image. Consequently, to get an accurate depth estimate, the images must be rectified to within half the window size of the block matcher.
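To make the procedure concrete, here is a minimal NumPy sketch of SAD block matching along a single scanline of a rectified pair. The function name, window size, and disparity range are illustrative choices, not a production implementation.

```python
import numpy as np

def sad_scanline_disparity(left, right, row, block=5, max_disp=64):
    """Brute-force SAD matching for one interior row of a rectified grayscale pair.

    left, right : 2-D arrays of identical shape (already rectified).
    Returns an integer disparity for each pixel on the given row.
    """
    half = block // 2
    h, w = left.shape
    disparities = np.zeros(w, dtype=np.int32)

    for x in range(half + max_disp, w - half):
        template = left[row - half:row + half + 1, x - half:x + half + 1].astype(np.float32)
        best_cost, best_d = np.inf, 0
        # Rectification lets us search only along the same row, to the left,
        # over at most max_disp candidate positions.
        for d in range(max_disp):
            candidate = right[row - half:row + half + 1,
                              x - d - half:x - d + half + 1].astype(np.float32)
            cost = np.abs(template - candidate).sum()   # sum of absolute differences
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x] = best_d
    return disparities
```

A lower SAD cost means a better match, and the winning offset d is the disparity; real implementations vectorize this search and add the aggregation and post-processing steps described above.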
This basic block matching takes only local features into account and is therefore highly sensitive to depth discontinuities (due to occlusion, for example). To address this, global optimization techniques such as graph cuts and belief propagation skip the cost aggregation step and instead define a global energy function; a disparity map with fewer discontinuities is obtained by minimizing this energy function iteratively. However, these methods are highly time-consuming. Semi-global methods strike a good balance between accuracy and speed by optimizing a pathwise form of the energy function along many directions, and they are the current industry standard for obtaining depth from stereo image pairs. Unfortunately, the performance of these traditional stereo matching methods is limited by their handcrafted cost functions, so there is a lot of interest in using deep learning techniques to generate depth maps. As mentioned before, both deep learning networks and semi-global methods depend on well-rectified images, which in turn depend on precise calibration of the cameras. But in most real-world applications the cameras are subjected to external forces that change the extrinsic, and in some cases the intrinsic, camera parameters. This is why NODAR's fast and continuous auto-calibration of the cameras becomes important, especially in safety-critical applications such as autonomous vehicles.
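Before moving on to deep networks, it is worth seeing what the semi-global baseline looks like in practice. The sketch below uses OpenCV's StereoSGBM on an already-rectified pair; the parameter values and file paths are illustrative starting points, not the exact settings used for the experiments in this article.

```python
import cv2
import numpy as np

# Load an already-rectified grayscale stereo pair (paths are placeholders).
left = cv2.imread("left_rect.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right_rect.png", cv2.IMREAD_GRAYSCALE)

block_size = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,          # search range; must be divisible by 16
    blockSize=block_size,
    P1=8 * block_size ** 2,      # penalty for small disparity changes between neighbors
    P2=32 * block_size ** 2,     # larger penalty for large disparity changes
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# StereoSGBM returns fixed-point disparities scaled by 16.
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0
```

The P1/P2 penalties are what give the semi-global method its smoothness: they discourage disparity jumps along each optimization path while still allowing sharp edges where the image evidence demands them.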
Deep learning networks for stereo matching
The deep learning network we explore in this article is called 'Hierarchical Deep Stereo Matching on High-resolution Images' [1], or HSM for short. The network uses a coarse-to-fine correspondence search: coarse-resolution images are used to estimate large disparities (nearby objects), and those estimates are then used to bias the correspondence search at higher resolutions. In this way, the network can output depth estimates for nearby objects much sooner, while estimates for the smaller disparities (far-away objects) are still being computed. Fig 1 shows the network architecture, as described in the paper.
In the above network, given a pair of rectified images, the feature encoder generates multiscale feature descriptors for each image of the pair with a custom ResNet encoder-decoder network. These descriptors are used to construct 4-D feature volumes at each scale of the pyramid by taking the difference of potentially matching features extracted from epipolar scanlines. Each feature volume is decoded, or filtered, with 3D convolutions, using striding along the disparity dimension to minimize memory. The decoded output is (a) used to predict 3D cost volumes that generate on-demand disparity estimates for the given scale, and (b) upsampled so that it can be combined with the next feature volume in the pyramid.
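As an illustration of the feature-volume idea (a conceptual sketch, not the authors' exact code), the snippet below builds a difference-based feature volume by comparing left features against right features shifted along the epipolar (horizontal) axis, one candidate disparity at a time.

```python
import torch

def build_feature_volume(feat_left, feat_right, max_disp):
    """Difference-based feature volume for a rectified pair.

    feat_left, feat_right : tensors of shape (N, C, H, W) from a shared encoder.
    Returns a volume of shape (N, C, max_disp, H, W); slice d holds the
    feature difference when the right features are shifted d pixels.
    """
    n, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(n, c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = feat_left - feat_right
        else:
            # A pixel at column x in the left image matches column x - d
            # in the right image, so only the overlapping region is compared.
            volume[:, :, d, :, d:] = feat_left[..., d:] - feat_right[..., :-d]
    return volume

# The volume is then filtered with 3-D convolutions (e.g. torch.nn.Conv3d)
# to produce a cost volume from which disparities are regressed.
```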
The HSM network uses “y-disparity augmentation” to reduce the output’s sensitivity to changes in the relative pitch angle between the left and right camera. When the left and right images are perfectly rectified, all matching points lie on the same horizontal scan line. However, it is difficult to perfectly rectify images because of misalignments from material deformations and vibrations. Degradation of the disparity map output is most sensitive to changes in the relative pitch of the camera pair because the images are displaced vertically with respect to each other and are no longer row-aligned. Y-disparity augmentation forces the network to learn robustness to such errors during training time.
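A minimal sketch of this kind of augmentation is shown below: during training, the right image is resampled with a small random vertical offset so the network learns to tolerate imperfect rectification. This illustrates the idea only; the function name and shift range are our assumptions, not the authors' augmentation code.

```python
import random
import torch
import torch.nn.functional as F

def y_disparity_augment(right_img, max_shift_px=3.0):
    """Shift a (N, C, H, W) batch of right images vertically by a random
    sub-pixel amount to simulate a small relative pitch error."""
    shift = random.uniform(-max_shift_px, max_shift_px)
    n, c, h, w = right_img.shape

    # Build a sampling grid that is the identity except for a vertical offset.
    ys = torch.linspace(-1, 1, h, device=right_img.device)
    xs = torch.linspace(-1, 1, w, device=right_img.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    grid_y = grid_y + 2.0 * shift / (h - 1)            # offset in normalized coords
    grid = torch.stack((grid_x, grid_y), dim=-1)        # (H, W, 2), x before y
    grid = grid.unsqueeze(0).expand(n, -1, -1, -1)

    return F.grid_sample(right_img, grid, align_corners=True)
```

Because the vertical offset changes every batch, the network cannot rely on matching points lying exactly on the same row and instead learns features that tolerate a few pixels of vertical misalignment.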
The authors of the paper have made the trained weights, as well as the code, available on their GitHub page. We tested the network on the KITTI stereo vision dataset by running the pretrained model following the usage instructions in the repository (the model path must point to the downloaded weights).
Analysis
For our analysis, we used the KITTI road dataset (drive 15): a 30-second video with 303 frames at a resolution of 1392 x 512 pixels, containing 33 cars, 1 van, 1 truck, and 1 cyclist. Starting from the ground-truth values of the camera parameters, we gradually perturbed the pitch and roll of the right camera in increments of 0.001 and 0.004 rad, respectively. Note that the depth map is roughly four times more sensitive to pitch errors than to roll errors, which is why the roll increment is four times larger.
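One way to inject such a perturbation, sketched below under the assumption that the calibrated intrinsics, distortion coefficients, and extrinsics (K1, D1, K2, D2, R, T) are available from the KITTI calibration files, is to rectify with the correct calibration and then corrupt only the right camera's rectification rotation. The function and variable names are ours, and the exact pipeline used for the experiments may differ.

```python
import cv2
import numpy as np

def misrectified_right_maps(K1, D1, K2, D2, R, T, image_size,
                            pitch_rad=0.0, roll_rad=0.0):
    """Return remap tables that rectify the right image with a small,
    unmodeled pitch/roll error, simulating a stale calibration.

    K*, D*, R, T come from the ground-truth calibration;
    image_size is (width, height); angles are in radians.
    """
    # Rectification rotations computed from the correct calibration.
    R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)

    # Corrupt only the right camera's rectification rotation: pitch is a
    # rotation about the x-axis, roll a rotation about the z (optical) axis.
    R_err, _ = cv2.Rodrigues(np.array([pitch_rad, 0.0, roll_rad]))
    R2_bad = R_err @ R2

    map_left = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
    map_right = cv2.initUndistortRectifyMap(K2, D2, R2_bad, P2, image_size, cv2.CV_32FC1)
    return map_left, map_right

# Usage (raw image variables are placeholders):
# left_rect  = cv2.remap(left_raw,  *map_left,  cv2.INTER_LINEAR)
# right_rect = cv2.remap(right_raw, *map_right, cv2.INTER_LINEAR)
```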
Fig 2a shows the first frame of the image sequence. The image has a good mix of vehicles, rails, poles, trees, grass, and road. Fig 2b shows the corresponding lidar point cloud with the car at 33.4 m in front of the test vehicle. Fig 5 and Fig 6 show the HSM stereo vision depth maps constructed for different pitch and roll angles. As the angles increase, the depth map degrades. Even though HSM is trained to work with imperfect rectification, a pitch or roll change by a mere 0.006 radians or 0.016 radians, respectively, can ruin the results.
The depth maps generated with the traditional SGBM algorithm for different pitch and roll offsets are plotted in Fig 9 and Fig 10, respectively. The disparity maps become noticeably degraded with a pitch offset of 0.003 radians or a roll offset of 0.008 radians. Comparing these offsets to those of HSM, we can see that SGBM is approximately twice as sensitive to angular errors as HSM.
Although SGBM is twice as sensitive to changes in pitch and roll as HSM, SGBM is more accurate than HSM on our test data set. Table 1 summarizes the accuracy scores for the different pitch and roll offsets for both the HSM and SGBM algorithms. To compute the accuracy score, we re-projected the lidar point cloud onto the disparity map and, for each lidar point, took the difference between the disparity implied by the lidar range and the disparity estimated at the corresponding pixel. If that difference exceeded 3 pixels (or 5 pixels), we counted the point as a disparity error. Table 1 reports the accuracy score as the percentage of evaluated pixels whose disparity difference is within the threshold, for different pitch and roll offset angles. Fig 11 shows an example of the lidar point cloud aligned with a disparity map to illustrate this process (zoom into the image to see the lidar points). The SGBM algorithm performs better than the HSM network when the stereo images are well rectified; however, as can be seen from Table 1, the HSM network is more robust to slight changes in the rotation parameters of the stereo setup.
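A minimal sketch of this evaluation is shown below, assuming the lidar points have already been transformed into the rectified left camera frame; the function name, thresholds, and argument layout are placeholders, and the intrinsics and baseline would come from the KITTI calibration files.

```python
import numpy as np

def disparity_accuracy(lidar_xyz, disparity_map, K, baseline, threshold_px=3.0):
    """Percentage of lidar points whose predicted disparity is within
    `threshold_px` of the disparity implied by the lidar range.

    lidar_xyz     : (N, 3) points in the left camera frame (x right, y down, z forward).
    disparity_map : (H, W) predicted disparities in pixels.
    K             : 3x3 intrinsic matrix of the rectified left camera.
    baseline      : stereo baseline in metres.
    """
    h, w = disparity_map.shape
    f = K[0, 0]

    # Keep points in front of the camera and project them into the image.
    pts = lidar_xyz[lidar_xyz[:, 2] > 0.5]
    u = np.round(f * pts[:, 0] / pts[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[:, 1] / pts[:, 2] + K[1, 2]).astype(int)
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    pts, u, v = pts[in_image], u[in_image], v[in_image]

    # Ground-truth disparity from the lidar range: d = f * B / Z.
    d_lidar = f * baseline / pts[:, 2]
    d_pred = disparity_map[v, u]

    valid = d_pred > 0                       # ignore pixels with no estimate
    bad = np.abs(d_pred[valid] - d_lidar[valid]) > threshold_px
    return 100.0 * (1.0 - bad.mean())        # accuracy score in percent
```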
The SGBM algorithm is not only more accurate than HSM on the KITTI Road Drive 15 data, it is also faster than the HSM network. While most deep learning networks can produce high-quality depth maps from well-rectified images, they tend to be either highly resource-intensive or to produce a significantly downsampled depth map, which can hurt the accuracy of long-range depth estimation. The time taken to generate a 1392 x 512 depth map is 162 ms for SGBM and 450 ms for the HSM network, which corresponds to about 6.5 fps and 2 fps, respectively (running on an Intel Core i7 CPU with 32 GB of memory and an Nvidia Quadro M1200 GPU with 4 GB of GPU memory). It is also worth noting that the SGBM implementation is written in C++, while the HSM network runs in PyTorch. The maximum disparity search range was 128 pixels for both SGBM and HSM.
Deep learning networks for depth estimation have come a long way, but HSM still lags behind traditional pixel-correlation techniques such as SGBM, both in accuracy and in speed of execution, especially on embedded hardware. Furthermore, we have not yet seen a deep learning technique that is sufficiently robust to camera parameter changes, which is important for autonomous driving applications. Fortunately, NODAR's auto-calibration software obviates the need for such an algorithm and allows the stereo matching algorithm to be trained for high accuracy without degrading the training set with y-disparity augmentation.
By Harish Satishchandra, Boston University MS, NODAR Engineer, Summer 2020
References
[1] Yang, G., Manela, J., Happold, M., & Ramanan, D. (2019). Hierarchical deep stereo matching on high-resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5515–5524).