Seeing Clearly: Advancing Robotic Stereo Vision

Toyota Research Institute · Oct 27, 2021
Figure 1: Corresponding color and depth images for a difficult scene in a real home using our learned stereo algorithm.

Why Learned Stereo?

At TRI, we are constantly striving to invent new ways for robots to sense and perceive the world around them. 3D depth sensors have been a common choice for robotic perception and scene understanding. Most commercial 3D depth sensors utilize active illumination via light projectors to “paint” the world with some texture or an encoded light pattern. This makes depth calculations easier and more precise on textureless surfaces (e.g. white walls or flat doors).

While this works most of the time, there are still drawbacks to this approach, particularly on surfaces that are highly light-absorbent or highly light-reflective. Consider the following image (left) taken from our lab using a hi-res RGB camera and the corresponding depth image (right) generated from a collection of Intel RealSense D415 depth cameras.

Figure 2: Hi-res RGB image (left) compared to its corresponding depth image (right) captured using commercially available RGB-D depth sensors. (bottom) The resultant point cloud has large missing areas due to the presence of challenging surfaces.

Notice the gaps in the depth image and the resulting point cloud, particularly on the black trash bin and black kitchen counter (light-absorbent) on the right and on the stainless steel refrigerator (light-reflective) on the left. Such surfaces are a common challenge for commercially available depth sensors, which rely on projected infrared light to add texture for classical stereo matching to work.

At TRI, we’re leveraging advances in machine learning to develop an efficient learned stereo-matching algorithm that provides dense, hi-resolution depth for any stereo color-camera pair. Key to our approach are a number of optimizations, including a post-processing step that combines a learned confidence measure with classical filtering to ensure metrically accurate point clouds. Using a mix of real and synthetic training data, our approach produces accurate depth maps on challenging surfaces and objects.

Figure 3: Hi-res RGB image (left) compared to our learned-stereo depth image (right) using only a left and right RGB image pair. (bottom) The resultant point cloud is more dense with accurate depth on challenging surfaces.

Returning to the same scene, notice how our stereo method perceives accurate, dense depth on challenging surfaces like the reflective stainless steel fridge and the light-absorbent black counter. Pixels in the far field are ignored beyond a reasonable depth, allowing our stereo-depth system to be optimized for use in homes at robotic manipulation range.
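
This post-processing can be pictured as a simple filter over the predicted disparity and confidence maps. Below is a minimal sketch in Python, assuming NumPy arrays and known calibration (focal length in pixels, stereo baseline in meters); the function name and thresholds are illustrative placeholders, not the values used on our robot.

```python
import numpy as np

def filter_depth(disparity, confidence, focal_px, baseline_m,
                 min_conf=0.7, max_depth_m=3.0):
    """Convert disparity to metric depth, keeping only confident, in-range pixels."""
    with np.errstate(divide="ignore"):
        depth = focal_px * baseline_m / disparity  # depth = f * B / d

    # Drop low-confidence matches and far-field pixels beyond manipulation range.
    valid = (confidence >= min_conf) & (disparity > 0) & (depth <= max_depth_m)
    return np.where(valid, depth, 0.0)  # 0 marks "no depth", as in typical depth images
```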

Our Model Architecture

Figure 4: The network diagram illustrates how our network processes two input images to predict a full resolution disparity image.

Our learned model is outlined in Figure 4. It follows a fairly common approach used in other learned methods — a structure which is quite similar to more classical methods, though with key components replaced with optimized functions found via deep learning. Table 1 illustrates a summary of the main differences between a classical stereo approach and a learned method like ours.

Table 1: Stereo architecture breakdown: Classical Method vs. Learned Method

More specifically, our learned model comprises the following stages:

Feature Extraction: The feature extractor is based on a dilated ResNet [1]. This gives the model a large receptive field without requiring a layer depth that would prohibit real-time inference at high resolutions. The output of the feature extractor is a 16-dimensional feature map downsampled from the input resolution.
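
As a rough illustration, the kind of dilated residual block described in [1] looks like the PyTorch sketch below; the 16-channel width matches the feature dimension mentioned above, but the dilation rate and layer layout are placeholders rather than our exact configuration.

```python
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with dilated 3x3 convolutions (illustrative layout)."""
    def __init__(self, channels=16, dilation=2):
        super().__init__()
        pad = dilation  # keeps spatial size for 3x3 kernels
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # Dilation grows the receptive field without extra layers or downsampling.
        return self.relu(out + x)
```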

Cost Volume Creation: A cross-correlation cost volume [2] is used to create a 4D feature volume at a configurable number of disparities. For our robot manipulator, we’ve chosen 384 disparities.
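
A straightforward (if unoptimized) way to build such a correlation volume from the left and right feature maps is sketched below; `max_disp` here would be the 384 disparities divided by the feature downsampling factor. This illustrates the operation from [2], not our optimized implementation.

```python
import torch

def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Correlate left features with right features shifted by each candidate disparity."""
    b, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            volume[:, d, :, d:] = (left_feat[:, :, :, d:] *
                                   right_feat[:, :, :, :-d]).mean(dim=1)
    return volume  # (B, D, H, W): matching score per pixel and disparity
```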

Cost Aggregation: The volume is passed through a series of 3D and 2D convolutions to produce a 3D cost volume at the same resolution as the feature extraction stage, similar to [3], but with fewer operations dedicated to 3D convolution and more 2D filtering, which yields higher-quality depth maps at faster rates.
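
The split between 3D and 2D filtering can be sketched roughly as follows; the layer counts and channel widths are placeholders, since the layout of the real network is tuned for speed and not reproduced here.

```python
import torch.nn as nn

class CostAggregation(nn.Module):
    """Illustrative aggregation: a little 3D filtering, then cheaper 2D filtering."""
    def __init__(self, max_disp_ds=96):
        super().__init__()
        self.agg3d = nn.Sequential(
            nn.Conv3d(1, 8, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(8, 1, 3, padding=1),
        )
        # 2D convolutions treat the disparity axis as channels.
        self.agg2d = nn.Sequential(
            nn.Conv2d(max_disp_ds, max_disp_ds, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(max_disp_ds, max_disp_ds, 3, padding=1),
        )

    def forward(self, cost):               # cost: (B, D, H, W)
        x = self.agg3d(cost.unsqueeze(1))  # (B, 1, D, H, W)
        x = x.squeeze(1)                   # back to (B, D, H, W)
        return self.agg2d(x)
```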

Disparity Computation: A differentiable soft argmin operation [4] is used to regress a continuous disparity estimate per pixel. A matchability operation [5] is used to estimate the confidence of the disparity estimate.
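
The soft argmin of [4] is simply an expectation over a softmax of the cost volume; the sketch below also includes a crude peak-probability stand-in for the learned matchability of [5], purely for illustration.

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost):
    """Regress sub-pixel disparity and a rough confidence from a (B, D, H, W) volume."""
    prob = F.softmax(-cost, dim=1)  # negate if the volume stores costs (lower = better)
    disps = torch.arange(cost.shape[1], device=cost.device, dtype=prob.dtype)
    disparity = (prob * disps.view(1, -1, 1, 1)).sum(dim=1)  # expected disparity, (B, H, W)
    confidence = prob.max(dim=1).values                      # peak probability as a proxy
    return disparity, confidence
```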

Disparity Refinement: A second dilated ResNet is used to calculate a disparity residual given the original input image, low resolution disparity, and matchability. The low resolution disparity is bilinearly upsampled to the full input resolution and added to the disparity residual to produce the final disparity estimate.
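
Glossing over the network details, the refinement step can be sketched as follows; `refine_net` stands in for the second dilated ResNet, and the rescaling of disparity values into full-resolution pixel units is a standard detail assumed here rather than stated above.

```python
import torch
import torch.nn.functional as F

def refine(refine_net, image, disp_lowres, matchability, scale):
    """Upsample coarse disparity, predict a residual, and add it back."""
    disp_up = F.interpolate(disp_lowres.unsqueeze(1), scale_factor=scale,
                            mode="bilinear", align_corners=False) * scale
    conf_up = F.interpolate(matchability.unsqueeze(1), scale_factor=scale,
                            mode="bilinear", align_corners=False)
    residual = refine_net(torch.cat([image, disp_up, conf_up], dim=1))
    return disp_up + residual  # full-resolution disparity estimate
```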

Data, Data, Data …

We have found that using a diverse combination of real and synthetic data gives the best benchmark and real-world performance. Our real data is captured using a number of custom data-collection rigs as shown in Figure 5.

Figure 5: Our first data collection head (left) uses an array of Intel RealSense D415 stereo cameras whose aggregate depth is reprojected onto a pair of hi-res Basler RGB cameras mounted below. Our second data collection head (right) uses a Microsoft Azure Kinect depth camera that has a FOV similar to the Basler RGB pair mounted below.

Using these collection units, we’ve scanned ten homes, resulting in 410 unique scans of 154 scenes. From those scans, we’ve extracted 323,192 useful stereo frames labeled with ground-truth depth. However, because our collection units rely on reprojecting sensed depth onto RGB pixels, we can only obtain a sparse set of ground-truth labels, limited by FOV, the IR reflectivity of the scene, and the overlap of sensing ranges. To fill in the remaining gaps in our real data, we rely on synthetic data.
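
The reprojection that produces those sparse labels is conceptually simple: lift each depth pixel to 3D with the depth camera’s intrinsics, transform it into the RGB camera’s frame, and project it with the RGB intrinsics. The sketch below assumes calibrated intrinsics `K_depth` and `K_rgb` and an extrinsic transform `T_rgb_from_depth`; it is an illustration of the idea, not our collection pipeline.

```python
import numpy as np

def reproject_depth(depth, K_depth, K_rgb, T_rgb_from_depth, out_hw):
    """Reproject a depth image into another camera to form a sparse label image."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    z = depth.reshape(-1)
    valid = z > 0
    pix = np.stack([u.reshape(-1), v.reshape(-1), np.ones(h * w)])[:, valid]
    pts = np.linalg.inv(K_depth) @ pix * z[valid]                    # 3D points, depth frame
    pts = T_rgb_from_depth[:3, :3] @ pts + T_rgb_from_depth[:3, 3:4]  # into RGB frame
    proj = K_rgb @ pts
    uu = np.round(proj[0] / proj[2]).astype(int)
    vv = np.round(proj[1] / proj[2]).astype(int)
    ok = (proj[2] > 0) & (uu >= 0) & (uu < out_hw[1]) & (vv >= 0) & (vv < out_hw[0])
    label = np.zeros(out_hw)
    label[vv[ok], uu[ok]] = proj[2][ok]  # last write wins; a real pipeline keeps the nearest
    return label
```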

Figure 6: A sampling of our training data. Top: a sampling of the real homes we’ve scanned using our collection head. Bottom, from left to right: a real image taken using our collection head, a procedurally generated randomized-texture image, a rendered synthetic “flying things” image, and a Facebook Replica reconstructed sample.

Synthetic data is useful because of the large variety of scenes, objects, materials, and lighting we can generate quickly. Also, because it is in simulation, we’re able to get perfect ground-truth data on non-Lambertian surfaces such as glass and metal, which are missing from the real dataset. Our synthetic data is pulled from a number of freely available datasets (as shown in Figure 6):

  • The Randomized Texture (RT) synthetic dataset consists of procedurally generated indoor manipulation scenes containing a room, the robot, furniture, and objects sampled from the ShapeNet database.
  • The Synthetic Flying Things (SFT) dataset (re-rendered using a higher fidelity renderer to capture the complex lighting effects on shiny materials) consists of randomized objects placed floating about the environment in all types of configurations.
  • The Facebook Replica synthetic dataset is a set of high quality reconstructions of a variety of indoor spaces. Each reconstruction has clean dense geometry, high resolution and high dynamic range textures, glass and mirror surface information. We sample views randomly from all of the available scenes to generate a diverse set of realistic samples.
  • Commercial synthetic datasets from third parties such as http://www.coohom.com and http://www.datagen.tech

How Well Does it Work?

Our described stereo network has been integrated into our general-purpose robot as the primary perception sensor for performing household tasks. Using only our dense stereo depth for localization, mapping, manipulation, and motion planning, our robot is able to complete a variety of household tasks.

Shown here is the robot laundry task, where the robot autonomously navigates a hallway carrying a laundry basket and places the laundry into the dryer. Note the quality of stereo depth on the reflective surfaces of the dryer, the internal drum, and the transparent laundry door.
Shown here is the “robot wiping task,” where the robot autonomously wipes a tiled kitchen. Note the quality of stereo depth on the non-textured tiles and plain cabinets, as well as the detail on the flower in the glass vase.

Robotic manipulation systems require high camera framerates (5 Hz or higher) for effective obstacle avoidance and closed-loop control. Moreover, given the large number of other processes running on a given robot, the processing budget per frame for stereo-depth is limited.

Our learned model is optimized to produce high quality depth maps as efficiently as possible, leveraging operations and sequences of operations amenable to GPU execution. To date, we’re able to achieve stereo at 2560x2048 resolution with 384 disparities in 30ms. To our knowledge, there isn’t any prior work that has produced such high-quality, high-resolution depth maps at these rates. Though other work outperforms us in terms of metric quality, we are at least 15X faster than any approach producing depth maps of similar or better quality at high resolutions.

What’s Next?

You can read more about our learned stereo algorithm here and see how it matches up against other benchmarks in the field. At TRI, we’re continually looking for ways to advance the field of robotics, and learned stereo is just one of the many capabilities we’re fortunate enough to explore and contribute to. If this excites you, join our team!

References

  1. P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE winter conference on applications of computer vision (WACV). IEEE, 2018, pp. 1451–1460.
  2. N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4040–4048.
  3. N. Smolyanskiy, A. Kamenev, and S. Birchfield, “On the importance of stereo for accurate depth estimation: An efficient semi-supervised deep neural network approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1007–1015.
  4. A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75.
  5. J. Zhang, Y. Yao, Z. Luo, S. Li, T. Shen, T. Fang, and L. Quan, “Learning stereo matchability in disparity regression networks,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 1611–1618.
