Self-supervised Learning in Depth — part 2 of 2

Toyota Research Institute
9 min read · May 22, 2021

Beyond self-supervised learning for depth estimation.

By: Vitor Guizilini, Rares Ambrus, Adrien Gaidon

Learning to predict depth from images, using not only geometry but also other sources of information when available.

In our previous post, we discussed how deep neural networks can predict depth from a single image. In particular, we showed that this problem can be self-supervised using only videos and geometric constraints. This approach is highly scalable and can even work on uncalibrated cameras or multi-camera rigs commonly used for autonomous driving.

However, there are some inherent limitations of self-supervised learning for monocular depth estimation, for instance, scale ambiguity. Thankfully, these issues can be addressed by adding minimal additional information. In this second blog post, we will discuss these challenges and recent research we have done at TRI to overcome them, going beyond self-supervised learning.

Weak Supervision: Tell Me How Fast, I Will Tell You How Far

In theory, it is geometrically impossible to estimate the metric scale of a scene from pixels alone. An object could be small and very close, or large and very far away, yet appear identical in a picture. This is why self-supervised monocular depth prediction methods are scale-ambiguous: they can only recover the geometry of a scene up to an unknown multiplicative scale factor. However, for most applications in robotics, we definitely want to know how far away objects are in meters! Thankfully, there are several ways to recover scale in practice.

What is the scale of the world?
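To see this ambiguity concretely, here is a small numerical check (purely illustrative, with made-up intrinsics and pixel values): scaling the depth and the camera translation by the same factor leaves the reprojected pixel unchanged, so image evidence alone cannot resolve the metric scale.

```python
# Illustrative scale-ambiguity check: doubling depth and translation together
# produces exactly the same reprojection, so the photometric loss cannot
# distinguish the two hypotheses. Intrinsics and pixel values are made up.
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])  # example pinhole intrinsics

def reproject(u, v, depth, t):
    """Unproject pixel (u, v) at `depth`, translate the 3D point by `t`, project back."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point = depth * ray + t
    proj = K @ point
    return proj[:2] / proj[2]

p1 = reproject(400, 260, depth=10.0, t=np.array([0.0, 0.0, 1.5]))
p2 = reproject(400, 260, depth=20.0, t=np.array([0.0, 0.0, 3.0]))  # everything twice as big
print(np.allclose(p1, p2))  # True
```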

The first and most common method is to rely on known camera calibration information, for instance the height of the camera above the ground or the distance between camera pairs in a multi-camera rig. Using these parameters, it is easy to metrically scale depth predictions, as we did in our Full Surround Monodepth paper described in part 1. The problem with these static priors is that they are assumed fixed after calibration. Run into a pothole and things move a bit? Now your predictions will be way off until you recalibrate.
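As a rough illustration of this idea, the NumPy sketch below estimates the camera's height above the road in the network's arbitrary units and rescales the depth map so it matches the calibrated height in meters. The function name and the "bottom quarter of the image is road" heuristic are illustrative choices, not our exact procedure.

```python
# Minimal sketch: recover metric scale from a known camera height above the ground.
# Assumes the bottom quarter of the image is mostly road (an illustrative heuristic).
import numpy as np

def rescale_with_camera_height(depth, fy, cy, camera_height_m):
    H, W = depth.shape
    rows = np.arange(int(0.75 * H), H).reshape(-1, 1)  # lower image rows
    z = depth[int(0.75 * H):, :]                       # up-to-scale depth of those rows
    y = (rows - cy) * z / fy                           # camera-frame height (y points down)
    est_height = np.median(y)                          # camera height in arbitrary units
    return depth * (camera_height_m / est_height)      # depth now approximately in meters
```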

A more robust approach is to rely on velocity information as weak supervision during training. Weak supervision, a very broad term used in the Machine Learning community, simply means an inexpensive source of information that gives you some idea of what the correct prediction is. In our case, most cars, robots, and mobile phones have additional inexpensive sensors (e.g., accelerometers or odometers) that can measure speed in meters per second. But speed is how fast you move, not how far everything is from you. So how can that information be used to make our depth network output predictions in meters?

In part 1, we explained that self-supervised learning relies on the motion of the camera to reconstruct the previous image from the current one. This ego-motion has six degrees of freedom (three for rotation and three for translation) and is not measured by a sensor but instead predicted from image pairs by a pose network (e.g., the two-stream PoseNet from our CoRL paper). This pose network is learned jointly with the depth network, and both output unscaled quantities. But what if we could use the aforementioned velocity information to make the pose network learn to output translation in meters? Would this be enough to make depth predictions in meters? As we have shown in our CVPR paper introducing PackNet, the answer is yes!

Surprisingly, adding a weakly-supervised velocity regression objective to the pose network is enough to make the self-supervised learning of the depth network scale-aware. As the figures below illustrate, when using velocity information (red curve), there is a sudden phase transition in the depth estimation quality measured in meters. This corresponds to a key moment during training when the depth network adopts the metric scale, matching the performance of LiDAR-based rescaling!
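For intuition, here is what such a velocity supervision term can look like in PyTorch. The names are illustrative and this is a sketch of the idea, not our exact training code.

```python
# Sketch of weak velocity supervision: the norm of the predicted translation
# between consecutive frames should match the distance actually traveled.
import torch

def velocity_supervision_loss(t_pred, speed, dt):
    # t_pred: (B, 3) predicted translation, speed: (B,) measured speed in m/s,
    # dt: time between frames in seconds.
    traveled = speed * dt
    # Since pose and depth are trained jointly through the photometric loss,
    # making the translation metric also makes the depth predictions metric.
    return torch.abs(t_pred.norm(dim=-1) - traveled).mean()
```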

Transfer Learning: From What to How Far

An effective alternative to weak supervision is to leverage completely different datasets and tasks that might not directly relate to depth prediction. This is a widespread technique in deep learning called transfer learning. In robotics, we care about more than just where things are; we also want to know what they are. Consequently, most robots also use computer vision models for tasks like panoptic segmentation (e.g., trained in simulation, more on that in a future blog post…). Can these pre-trained networks also help depth estimation?

Once again, the answer is yes, as we show in our ICLR paper on Semantically-Guided Representation Learning for Self-Supervised Monocular Depth. As a monocular depth network uses a single image, it has to learn patterns relating how things appear and how far they are. This is only possible if there is a pattern in the first place! Fortunately, there are many such patterns in the real world. Adult humans tend to be less than 2 meters tall. Cars too have a distinctive shape and size. Roads, buildings, and other geometric structures are projected in specific, predictable ways onto camera frames. This is in essence why we can predict 3D properties from images (and also why these predictions suffer from certain biases, as we will discuss below).

Of course, pre-programming this prior knowledge relating semantic categories to size is brittle. It might not account for the possibility of a 2.72m tall human, for instance. In addition, pre-trained semantic segmentation networks might not generalize perfectly. This is why we proposed to only use pre-trained networks to guide the self-supervised learning of depth networks. Our new architecture leverages pixel-adaptive convolutions (a form of self-attention) to produce semantic-aware depth features during self-supervised learning. By using semantic content as guidance, features corresponding to the same category will have similar activations, encouraging both smoothness within objects and sharpness at boundaries.

Our semantically-guided depth network injects features from a pre-trained semantic segmentation network directly into the depth network, to enable the learning of semantic-aware depth features.
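To give a flavor of the mechanism, below is a simplified PyTorch sketch of a pixel-adaptive convolution: a learned kernel that is reweighted at every pixel by how similar the frozen semantic guidance features of its neighbors are to its own. This illustrates the idea rather than reproducing the exact layer used in the paper.

```python
# Simplified pixel-adaptive convolution: a standard learned kernel, modulated
# per pixel by a Gaussian on semantic guidance-feature differences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAdaptiveConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)

    def forward(self, x, guidance):
        B, C, H, W = x.shape
        G = guidance.shape[1]
        pad = self.k // 2
        # Gather k x k neighborhoods of depth features and guidance features.
        x_un = F.unfold(x, self.k, padding=pad).view(B, C, self.k ** 2, H * W)
        g_un = F.unfold(guidance, self.k, padding=pad).view(B, G, self.k ** 2, H * W)
        g_center = guidance.view(B, G, 1, H * W)
        # Neighbors with similar semantic features get larger weights, which
        # encourages smooth depth inside objects and sharp object boundaries.
        kernel = torch.exp(-0.5 * ((g_un - g_center) ** 2).sum(1, keepdim=True))
        w = self.weight.view(self.weight.shape[0], C, self.k ** 2)
        out = torch.einsum('ock,bckn->bon', w, x_un * kernel)
        return out.view(B, -1, H, W)
```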

Because the semantic segmentation network is not modified in any way, it can still be used to generate predictions as an additional task (or for interpretability), and its features can be efficiently reutilized by the depth network. Note that this relation between semantics and depth is not a one-way street, and we are also exploring how self-supervised depth can help semantics too (e.g., in our recent GUDA paper).

Qualitative examples of our semantically-guided depth network (right), relative to the standard PackNet architecture (middle), for the same inputs (left). Using predicted semantic information leads to much sharper boundaries and better definition of objects further away.

Partial Supervision: Use It When You Have It

So far, we have not mentioned the elephant in the room: LiDAR. In some cases, you might indeed be able to use a reliable source of accurate depth measurements. Nonetheless, it is useful to have multiple independent sources of depth information to increase robustness via redundancy, e.g., by also predicting depth from images. In fact, LiDAR sensors that are commercially available at scale tend to be very sparse: because they are mainly used for safety features, they often measure only 100 points at 10 Hz with 4 laser beams. Clearly, this information is not enough to perceive the environment, but can we use it to improve monocular depth prediction?

The first natural question is whether we can use cost-effective LiDARs as an extra source of supervision to complement geometric self-supervised learning as described in part 1. This setting is called semi-supervised learning, because we only have partial ground truth depth measurements. This requires combining heterogeneous supervision: at the 3D point level for LiDAR, but at the 2D pixel level for cameras. In our CoRL paper on Robust Semi-Supervised Monocular Depth Estimation, we proposed a novel loss specifically tailored for the semi-supervised setting, where the depth error from one frame is projected onto the other jointly with the reconstructed point clouds. This leads to substantial improvements in depth estimation, even with as few as 4 LiDAR beams!

Our semi-supervised models with reprojected distances are able to leverage much sparser LiDAR information at training time and still generate highly accurate metric models.
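In simplified form, the semi-supervised objective adds to the photometric loss a supervised term that is evaluated only at the few pixels with LiDAR returns. The sketch below shows this masked supervised term; the reprojected distance loss from the paper additionally measures the error after projection into the other frame.

```python
# Sketch of a semi-supervised objective: self-supervised photometric loss plus
# a supervised term on the sparse pixels where LiDAR depth is available.
import torch

def semi_supervised_loss(photo_loss, depth_pred, depth_lidar, weight=0.1):
    valid = depth_lidar > 0  # LiDAR is sparse: most pixels have no return
    if valid.any():
        supervised = torch.abs(depth_pred[valid] - depth_lidar[valid]).mean()
        return photo_loss + weight * supervised
    return photo_loss
```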

So we’ve demonstrated that you can use lower-cost LiDARs during training, but what about inference? So far we have been focusing on monocular depth prediction, in which a network is trained to produce estimates from a single image at test time. An alternative setting is depth completion, in which partial depth information is also available at test time and needs to be completed into per-pixel estimates. Completion is typically tackled in a different fashion than prediction, but it does not have to be!

In our CVPR paper on Sparse Auxiliary Networks (SAN), we propose an architecture for unified monocular depth prediction and completion. SAN is an additional differentiable module that can be introduced directly into a depth prediction network like PackNet to also enable depth completion and train both jointly. To exploit the high sparsity of these partial input depth maps, the SAN module uses sparse convolutions to only process valid information, which significantly improves computational speed. This work is our first step towards what we call dialable perception: flexibly dialing up or down perception algorithms (here depth completion and prediction) depending on available sensors, sensor failures, dropped packets, or other sources of variation.

PackNet-SAN results on DDAD, for depth prediction and completion (same network, different inputs).
PackNet-SAN results on NYUv2, for depth prediction and completion (same network, different inputs).
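For intuition, here is a minimal sketch of the kind of sparsity-aware convolution SAN builds on: only valid pixels contribute, and the output is renormalized by how many valid pixels fall inside each kernel window. This is illustrative and much simpler than the actual SAN module.

```python
# Sketch of a sparsity-aware convolution for partial depth maps: invalid pixels
# are zeroed out and the response is renormalized by the local count of valid pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x, mask):
        # x: (B, C, H, W) sparse depth features, mask: (B, 1, H, W) validity in {0, 1}.
        out = self.conv(x * mask)
        count = F.avg_pool2d(mask, self.k, stride=1, padding=self.k // 2) * self.k ** 2
        out = out / count.clamp(min=1.0) + self.bias.view(1, -1, 1, 1)
        # Dilate the mask so downstream layers know where information has spread.
        return out, self.pool(mask)
```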

The Biases of Self-Supervised Learning

Learning from raw videos is great for scalability and adaptability. However, being self-supervised also means there is less human oversight in the learning process. As a result, the system might suffer from unchecked biases. The semantic guidance we mentioned above is one way to address this. However, another common source of error remains: objects moving exactly like the camera are predicted as being infinitely far away, because they never get closer! This is quite common in driving data, where following a lead vehicle is frequent. This infinite depth problem is an inherent limitation of self-supervised, motion-based learning from images.

We found a simple solution to this issue, described in our ICLR paper. First, we train on all the data, leading to the undesirable infinite depth predictions on some nearby objects. Second, we run this biased depth network on the training images to automatically detect the ones containing these infinitely deep holes in what otherwise looks like a flat ground plane. Finally, we discard the training images that exhibit this issue and retrain the depth network. We found that this simple and automatic data cleaning and retraining procedure is enough to make the resulting model much more robust to the infinite depth problem.

Our proposed two-stage training methodology is able to eliminate the infinite depth problem in self-supervised depth estimation without introducing any sources of supervision.
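As a rough illustration of the filtering step, the sketch below flags training images whose predicted depth contains an implausibly deep region in the lower part of the image, where we expect road. The thresholds and the road-region heuristic are illustrative choices, not the paper's exact rule.

```python
# Illustrative filter for the "infinite depth hole" bias: flag an image if a
# noticeable fraction of its lower (road) region is predicted far deeper than
# any plausible road point. Thresholds are made up for illustration.
import numpy as np

def has_infinite_depth_hole(depth, max_road_depth=80.0, min_hole_fraction=0.02):
    H, _ = depth.shape
    road_region = depth[int(0.6 * H):, :]
    hole = road_region > max_road_depth
    return hole.mean() > min_hole_fraction

# Keep only the training images that do not exhibit the bias, then retrain:
# clean = [img for img, d in zip(train_images, predicted_depths)
#          if not has_infinite_depth_hole(d)]
```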

Conclusion

In the first post of this series, we showed how self-supervised learning can leverage our prior knowledge about projective geometry to predict depth across a variety of camera configurations. In this second post, we outlined the practical limitations of pure self-supervision and showed how to go beyond them using other scalable sources of information: weak supervision, transfer learning, semi-supervised learning, and data cleaning to mitigate biases.

This is just the beginning for self-supervised learning and 3D computer vision. Although some of this technology is already in use, there are many remaining research challenges in key areas like dataset building (including using synthetic data), dialable perception, safety, and closing the loop with downstream control tasks to make future cars and robots more helpful for us all. This is why we are organizing the Mono3D workshop and DDAD competition at the CVPR 2021 conference, and why our code and datasets are available for further research: we hope you will join us on this exciting scientific journey!
