Computer Vision is a field of Artificial Intelligence that enables computers to represent the visual world. Deep Learning has revolutionized this field thanks to neural networks that can learn from data how to make accurate predictions. Recent progress promises to make cars safer, increase freedom to move through automated vehicles, and eventually provide robotic assistance for those with disabilities and for our rapidly aging global population.
However, there is a catch. Beyond privacy and other ethical issues to carefully consider when designing machine learning systems, all state-of-the-art models in computer vision rely on millions of labels (or more!) to reach the high level of accuracy required for safety-critical applications in the real world. Manual labeling is expensive and time-consuming, taking hours and costing tens of dollars per image. And, sometimes, it is impossible altogether.
This is the case for monocular depth estimation, where the goal is to predict, for each pixel of a single image, how far away the corresponding scene element is.
Although many sensor rigs can measure depth, whether directly (e.g., LiDAR) or indirectly (e.g., stereo systems), single cameras are cheap and ubiquitous. They are in your mobile phone, vehicle dash cameras, internet videos, and so forth. Therefore, being able to generate useful depth information from videos is not only an interesting scientific challenge, but is also of high practical value. We also know that humans can do a pretty good job at it without explicitly measuring everything. We rely instead on strong inductive priors and our ability to — consciously or not — reason about 3D relations. Try closing one eye and reaching for something. You shouldn’t have a problem judging the depth.
This is exactly the approach we are following at TRI. Instead of training deep neural networks by telling them the precise answer (a.k.a. supervised learning), we rely on self-supervised learning, using projective geometry as a teacher! This training paradigm unlocks the use of arbitrarily large amounts of unlabeled videos to continuously improve our models as they get exposed to more data.
In this two-part blog post series, we will dive into our recent research on how to design and efficiently train deep neural networks for depth estimation (and beyond). In this first post, we will cover self-supervised learning using projective geometry across a variety of camera configurations. In the second post, we will discuss the practical limitations of self-supervised learning and how to go beyond them using weak supervision or transfer learning.
Part 1: Self-Supervised Learning for Depth Estimation
Deep Origins: Supervision and Stereo
Eigen et al. were the first to show that a calibrated camera and LiDAR sensor rig can be used to turn monocular depth estimation into a supervised learning problem. The setup is simple: a neural network converts an input image into per-pixel distance estimates (a depth map). Then, we use accurate LiDAR measurements reprojected onto the camera image to supervise the learning of the depth network weights via standard backpropagation of prediction errors, using deep learning libraries like PyTorch.
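In code terms, this supervision signal is just a per-pixel regression loss, masked to the pixels where a LiDAR return actually exists (reprojected LiDAR is sparse). Here is a minimal NumPy sketch with made-up toy values, not our actual training code:

```python
import numpy as np

def supervised_depth_loss(pred_depth, lidar_depth):
    """Masked L1 loss between predicted depth and sparse LiDAR depth.

    LiDAR points reprojected onto the image only cover some pixels;
    pixels without a return are marked 0 and excluded from the loss.
    """
    valid = lidar_depth > 0  # mask of pixels with a LiDAR return
    return np.abs(pred_depth[valid] - lidar_depth[valid]).mean()

# Toy example: 2x2 depth maps, one pixel without ground truth.
pred = np.array([[1.0, 2.0], [3.0, 4.0]])
gt   = np.array([[1.5, 0.0], [3.0, 5.0]])  # 0.0 = no LiDAR return
loss = supervised_depth_loss(pred, gt)     # mean of |1-1.5|, |3-3|, |4-5|
```

In a real pipeline this scalar would be backpropagated through the depth network; here it just illustrates the masking.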
Building on the work of Godard et al., we explored a self-supervised learning approach that uses only images captured by a stereo pair (two cameras next to each other) instead of LiDAR. Images of the same scene captured from different viewpoints are geometrically consistent, and we can exploit this property to learn a depth network. If the network makes a correct prediction on the left image of the stereo pair, simple geometric equations tell us how to reconstruct the left image from pixels of the right image only, a task called view synthesis. If the depth prediction is wrong, the reconstruction will be poor, yielding an error, called the photometric loss, that we minimize via backpropagation.
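For a rectified stereo pair, view synthesis boils down to sampling the right image at horizontally shifted coordinates given by the predicted disparity. A minimal NumPy sketch (function names and toy values are ours, for illustration only):

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left image by sampling the right image at
    x - disparity (in rectified stereo, matches shift horizontally)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity   # source column per pixel
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation between neighbouring columns.
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_loss(left, reconstructed):
    """Mean absolute error between the left image and its reconstruction."""
    return np.abs(left - reconstructed).mean()

# Toy 1x4 "images": the right view is the left view shifted by one column.
right = np.array([[10., 20., 30., 40.]])
disp  = np.ones((1, 4))                       # predicted disparity
recon = warp_right_to_left(right, disp)       # -> [[10., 10., 20., 30.]]
left  = np.array([[10., 10., 20., 30.]])
loss  = photometric_loss(left, recon)         # 0.0: correct depth, good reconstruction
```

If the predicted disparity were wrong, `recon` would drift from `left` and the loss would grow, which is exactly the gradient signal used for training.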
Note that the network is still monocular: it is trained only on left images, whereas the right image, together with prior knowledge about projective geometry, is only used to self-supervise the learning process. This is also different from most self-supervised learning works in Computer Vision, which only learn representations: here we learn the full model for the task of depth estimation without any labels at all!
Self-supervised depth: the devil is in the details
In our SuperDepth ICRA’19 paper, we discovered that a major bottleneck to monocular depth performance is low image resolution. The devil is in the details, and if they get lost in the typical downsampling operations common to most deep convolutional networks, then it is hard to get a precise self-supervised error signal. Inspired by super-resolution methods, we found that sub-pixel convolutions on intermediate depth estimates are capable of recovering some of those fine-grained details to boost prediction performance, especially at the high resolutions critical for self-driving (2 megapixels and above).
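The sub-pixel rearrangement at the heart of this idea is simple: a convolution predicts r² values per low-resolution pixel, and those channels are then folded into an r-times-larger image instead of being interpolated. A NumPy sketch of the rearrangement step (written to match the semantics of PyTorch's `PixelShuffle`):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r).

    Channel depth is turned into spatial resolution: each group of r^2
    channels fills an r x r block of the upsampled output.
    """
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into an r x r block
    x = x.transpose(0, 3, 1, 4, 2)     # -> (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# Toy example: 4 channels at 1x1 resolution become one 2x2 image.
x  = np.arange(4.0).reshape(4, 1, 1)
up = pixel_shuffle(x, 2)               # shape (1, 2, 2): [[0, 1], [2, 3]]
```

In SuperDepth-style networks, this operation replaces naive upsampling of intermediate depth estimates, which is how fine-grained details are recovered.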
While stereo cameras can facilitate self-supervised learning, Zhou et al. surprisingly showed that this approach also works for videos acquired by a single, moving, monocular camera! One can indeed use similar geometric principles with temporally adjacent frames instead of left-right stereo images. This vastly widened the application potential of self-supervised learning, but also made the task much harder. Indeed, the spatial relationship between consecutive frames, also called the camera's ego-motion, is not known and must therefore be estimated as well. Luckily, there is an ample body of research around the ego-motion estimation problem (including our own), and it can be seamlessly integrated with self-supervised depth estimation frameworks, e.g., via the joint learning of a pose network.
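To make the geometry concrete: given a predicted depth and an estimated ego-motion (R, t), each pixel can be lifted to 3D, moved into the adjacent frame, and reprojected with the pinhole model; the photometric loss then compares the two frames at the corresponding pixels. A minimal single-pixel NumPy sketch with toy intrinsics (not our production code):

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Lift pixel (u, v) with predicted depth into 3D, move it by the
    estimated ego-motion (R, t), and project it into the next frame."""
    # Unproject with the inverse pinhole model: p = depth * K^-1 [u, v, 1]^T
    p = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rigid transform into the adjacent frame's coordinates.
    q = R @ p + t
    # Project back with the pinhole model and dehomogenize.
    x = K @ q
    return x[0] / x[2], x[1] / x[2]

# Toy intrinsics: focal length 100, principal point at (50, 50).
K = np.array([[100., 0., 50.],
              [0., 100., 50.],
              [0., 0., 1.]])
# The principal-point pixel at depth 10, with a 1-unit sideways motion,
# lands at (60, 50) in the next frame.
u2, v2 = reproject(50., 50., 10., K, np.eye(3), np.array([1., 0., 0.]))
```

In the monocular setting, `depth` comes from the depth network and `(R, t)` from the pose network, and both are trained jointly from the resulting photometric error.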
Just as in SuperDepth, we found that high-resolution details are key in this setting too, but this time we went further. Instead of trying to recover lost details via super-resolution, we set out to efficiently preserve them throughout the entire deep network. Consequently, in our CVPR’20 paper we introduced PackNet, a neural network architecture specifically tailored for self-supervised monocular depth estimation. We designed novel packing and unpacking layers that preserve spatial resolution at all intermediate feature levels thanks to tensor manipulations and 3D convolutions. These layers serve as substitutes to traditional downsampling and upsampling operations, with the difference that they can learn to compress and uncompress key high-resolution features that help with depth prediction.
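The key primitive behind packing is a space-to-depth fold: unlike pooling, it discards nothing, it only moves spatial detail into the channel dimension, where the learned (3D) convolutions of a packing layer can then compress it. A NumPy sketch of that fold (a simplified illustration, not the actual PackNet code):

```python
import numpy as np

def space_to_depth(x, r):
    """Fold each r x r spatial block into channels:
    (C, H, W) -> (C*r^2, H/r, W/r).

    This is invertible (it is the inverse of a pixel shuffle), so no
    spatial detail is lost during downsampling.
    """
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)     # -> (c, r, r, h/r, w/r)
    return x.reshape(c * r * r, h // r, w // r)

# Toy example: one 4x4 channel becomes four 2x2 channels.
x = np.arange(16.).reshape(1, 4, 4)
packed = space_to_depth(x, 2)          # shape (4, 2, 2)
```

In PackNet, this fold is followed by learned 3D convolutions that compress the expanded channels; unpacking layers do the reverse on the decoder side.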
In our experiments with PackNet, we confirmed that it is possible to preserve these details in real time, which is crucial for robotics applications. As a result, we showed that our self-supervised network can match or even outperform models supervised with LiDAR!
Importantly, we demonstrated that performance improves not just with resolution, but also with model size and amount of data, extending to self-supervised depth estimation the empirical scaling findings made by other researchers for supervised tasks.
The Dense Depth for Automated Driving (DDAD) Benchmark and Competition
Many of the results you have seen above are on data from the TRI fleet that we use for research, development, and testing of our autonomous driving and advanced driver-assistance systems. In order to promote reproducibility and foster further open research, we have released part of that data to form a new challenging benchmark called DDAD (for Dense Depth for Automated Driving). It includes six calibrated cameras time-synchronized at 10 Hz and high-resolution long-range LiDAR sensors used to generate dense ground-truth depth estimates up to 250m. DDAD has a total of 12,650 anonymized training samples in challenging and diverse urban conditions in Japan and the US. We have also released a validation set and are organizing a depth estimation competition on DDAD.
Full Surround Monocular Point Clouds
As mentioned above, DDAD actually includes synchronized data from six cameras, not just one. Why this many? In robotics, and especially in a driving setting, we want to understand what is happening all around the robot, not just in front of it. This is why LiDAR scanners provide full 360° coverage. The same can be achieved with multiple cameras by judiciously placing them to cover the whole scene. However, these camera rigs typically have minimal overlap and very different viewpoints, so as to minimize the number of cameras required, and hence costs. Sadly, this setup breaks standard computer vision methods for multi-view depth estimation, forcing depth to be estimated independently for each camera, and thus potentially inconsistently across cameras.
Even though the overlaps between cameras are small, we can still leverage them if we consider their relations across time. This is what we show in one of our latest works called Full Surround Monodepth (FSM). In a nutshell, our method uses a combination of multi-camera spatio-temporal photometric constraints, self-occlusion masks, and pose averaging to learn, in a self-supervised way again, a single depth network that can reconstruct metrically-scaled point clouds all around the robot, just like a LiDAR.
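One ingredient worth unpacking is pose averaging: since every camera on the rig undergoes the same rigid motion, the per-camera ego-motion estimates (once expressed in a common frame) can be fused into a single, more robust one. A hedged NumPy sketch of one standard way to average rigid poses, the chordal mean (FSM's exact formulation may differ; this is only meant to convey the idea):

```python
import numpy as np

def average_poses(Rs, ts):
    """Chordal average of rigid poses: average the rotation matrices and
    project back onto SO(3) via SVD; average the translations directly."""
    R_mean = np.mean(Rs, axis=0)
    U, _, Vt = np.linalg.svd(R_mean)
    # Sign correction keeps the result a proper rotation (det = +1).
    R = U @ np.diag([1., 1., np.linalg.det(U @ Vt)]) @ Vt
    t = np.mean(ts, axis=0)
    return R, t

# Toy example: two noiseless estimates of the same motion.
Rs = np.stack([np.eye(3), np.eye(3)])
ts = np.array([[1., 0., 0.], [3., 0., 0.]])
R_avg, t_avg = average_poses(Rs, ts)   # identity rotation, t = [2, 0, 0]
```

The averaged pose is then shared across cameras, which is one way multi-camera constraints can regularize an otherwise per-camera estimation problem.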
Alternative Camera Models: Neural Ray Surfaces
The projective geometry that underpins all of the aforementioned works relies on an important assumption: the relation between the 2D image and the 3D world is accurately modeled by the standard pinhole model with known calibration. This enables the projection of information between cameras, which is central to self-supervised depth estimation. However, this convenient assumption does not always hold in practice due to unmodeled distortions, for instance with wide angle cameras (e.g., fisheye, catadioptric), under water, or even dashcams behind a windshield on a rainy day!
How can we solve that issue without having to carefully engineer and calibrate ad hoc camera models for each specific scenario? Can we learn a general projection model directly from raw data without prior knowledge? This is what we set out to do in our 3DV paper on Neural Ray Surfaces (NRS). We showed that we can learn to predict per-pixel projection operators together with depth and pose networks. This is all optimized end-to-end in a self-supervised way, just like before, but without any assumption about the camera model. In other words, NRS is very flexible and works across a wide range of different camera geometries. It has been amazing to see what people can do with these ideas using the open source code we released!
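Conceptually, NRS replaces the single pinhole unprojection with a per-pixel ray field: a 3D point is simply the predicted depth times the predicted ray direction for that pixel. A NumPy sketch in which the pinhole model falls out as a special case (toy intrinsics and illustrative function names, not the released NRS code):

```python
import numpy as np

def unproject_with_rays(rays, depth):
    """Generic unprojection: each pixel carries its own ray direction
    (in NRS, predicted by a network) instead of one implied by a
    pinhole matrix.

    rays:  (H, W, 3) per-pixel ray directions.
    depth: (H, W) predicted depth along each ray.
    Returns an (H, W, 3) point cloud.
    """
    return rays * depth[..., None]

def pinhole_rays(h, w, K):
    """The pinhole camera as a special case: ray(u, v) ∝ K^-1 [u, v, 1]^T."""
    Kinv = np.linalg.inv(K)
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)
    return pix @ Kinv.T

# Toy example: identity intrinsics, constant depth of 2.
rays = pinhole_rays(2, 3, np.eye(3))
pts  = unproject_with_rays(rays, 2.0 * np.ones((2, 3)))
# Pixel (u=2, v=1) maps to the 3D point [4, 2, 2].
```

The point of NRS is that `rays` need not come from any analytic model: a network can learn them per pixel, covering fisheye, catadioptric, or distorted setups within the same self-supervised framework.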
Self-supervision is a powerful tool to learn deep networks for depth estimation using only raw data and our knowledge about 3D geometry. But we can see applications far beyond depth estimation. We believe self-supervised learning has the potential for many applications that could benefit society and increase opportunities for mobility for all.
Nonetheless, roadblocks remain. This is why we have released our code and data to encourage more open research on these important challenges. We are ourselves hard at work on several of them, in particular how to go beyond pure self-supervision towards scalable supervision, not just for performance improvements but also mitigation of some biases of self-supervised learning. We will discuss some of our related research in part 2 of this blog post, so keep your eye (one is enough) peeled!