Self-supervised Learning in Depth — part 1 of 2

Toyota Research Institute
9 min read · May 13, 2021


Self-supervised learning and beyond for depth estimation from images.

By: Vitor Guizilini, Rares Ambrus, Adrien Gaidon

Self-supervised learning enables the prediction of accurate point clouds from a single image using only videos as training data.


Computer Vision is a field of Artificial Intelligence that enables computers to represent the visual world. Deep Learning has revolutionized this field thanks to neural networks that can learn from data how to make accurate predictions. Recent progress promises to make cars safer, increase freedom to move through automated vehicles, and eventually provide robotic assistance for those with disabilities and for our rapidly aging global population.

However, there is a catch. Beyond privacy and other ethical issues to carefully consider when designing machine learning systems, all state-of-the-art models in computer vision rely on millions of labels (or more!) to reach the high level of accuracy required for safety-critical applications in the real world. Manual labeling is expensive and time-consuming, taking hours and costing tens of dollars per image. And, sometimes, it is impossible altogether.

This is the case for monocular depth estimation, where the goal is to predict, for each pixel of a single image, how far the corresponding scene element is from the camera.

In monocular depth estimation, the goal is the generation of pixel-wise estimates (a.k.a. a depth map) of how far each scene element is from the camera.

Although many sensor rigs can measure depth, whether directly (e.g., LiDAR) or indirectly (e.g., stereo systems), single cameras are cheap and ubiquitous. They are in your mobile phone, vehicle dash cameras, internet videos, and so forth. Therefore, being able to generate useful depth information from videos is not only an interesting scientific challenge, but is also of high practical value. We also know that humans can do a pretty good job at it without explicitly measuring everything. We rely instead on strong inductive priors and our ability to — consciously or not — reason about 3D relations. Try closing one eye and reaching for something. You shouldn’t have a problem judging the depth.

This is exactly the approach we are following at TRI. Instead of training deep neural networks by telling them the precise answer (a.k.a. supervised learning), we are trying instead to rely on self-supervised learning by using projective geometry as a teacher! This training paradigm unlocks the use of arbitrarily large amounts of unlabeled videos to continuously improve our models as they get exposed to more data.

In this two-part blog post series, we will dive into our recent research on how to design and efficiently train deep neural networks for depth estimation (and beyond). In this first post, we will cover self-supervised learning using projective geometry across a variety of camera configurations. In the second post, we will discuss the practical limitations of self-supervised learning and how to go beyond them using weak supervision or transfer learning.

Part 1: Self-Supervised Learning for Depth Estimation

Deep Origins: Supervision and Stereo

Eigen et al. were the first to show that a calibrated camera and LiDAR sensor rig can be used to turn monocular depth estimation into a supervised learning problem. The setup is simple: we convert an input image into per-pixel distance estimates (a depth map) using a neural network. Then, we use accurate LiDAR measurements reprojected onto the camera image to supervise the learning of the depth network weights via standard backpropagation of prediction errors using deep learning libraries like PyTorch.
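
As a rough illustration of this supervised setup, here is a minimal sketch of a loss that compares predicted depth to LiDAR depth only at pixels that received a LiDAR return. The function name, the convention that 0 marks a missing measurement, and the choice of a plain L1 error are all assumptions made for this example, not the loss used in the cited work.

```python
def masked_l1_depth_loss(pred, lidar):
    """Mean absolute depth error over pixels that have a LiDAR return.

    pred, lidar: 2D lists (rows of floats) of the same shape, in meters.
    Assumption for this sketch: a value of 0 in `lidar` marks a pixel
    with no projected measurement.
    """
    total, count = 0.0, 0
    for pred_row, lidar_row in zip(pred, lidar):
        for p, g in zip(pred_row, lidar_row):
            if g > 0:  # only supervise where LiDAR actually measured depth
                total += abs(p - g)
                count += 1
    return total / count if count else 0.0


pred = [[10.0, 12.0], [8.0, 30.0]]   # toy network output
lidar = [[0.0, 11.0], [8.5, 0.0]]    # sparse: two of four pixels are valid
loss = masked_l1_depth_loss(pred, lidar)
```

In a real training loop the same masking idea would be applied to tensors, so that the error can be backpropagated through the depth network.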

Building on the work of Godard et al., we explored a self-supervised learning approach that used only images captured by a stereo pair (two cameras next to each other) instead of LiDAR. Images of the same scene captured from different viewpoints are indeed geometrically consistent, and we can use this property to learn a depth network. If the network predicts the correct depth for the left image of the stereo pair, simple geometric equations explain how to reconstruct the left image from pixels in the right image only, a task called view synthesis. If the depth prediction is wrong, then the reconstruction will be poor, giving an error, called the photometric loss, to minimize via backpropagation.
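
To make the view-synthesis idea concrete, here is a minimal sketch on a single image row. It assumes a rectified stereo pair, so that a left pixel and its match in the right image differ only by a horizontal disparity d = f * B / Z (focal length in pixels times baseline, divided by depth). The function names, the toy intensities, and the plain L1 photometric error are illustrative assumptions, not the exact loss from the papers above.

```python
def disparity_from_depth(depth, focal_px, baseline_m):
    """Rectified stereo: disparity (pixels) = focal * baseline / depth."""
    return focal_px * baseline_m / depth


def sample_linear(row, x):
    """Linearly interpolate `row` at a fractional position, clamped to the border."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1 - w) * row[x0] + w * row[x1]


def synthesize_left_row(right_row, depth_row, focal_px, baseline_m):
    """Rebuild the left row by sampling the right row at x - disparity."""
    return [
        sample_linear(right_row, x - disparity_from_depth(z, focal_px, baseline_m))
        for x, z in enumerate(depth_row)
    ]


def photometric_l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Toy scene: a flat wall 25 m away, seen with f = 100 px and B = 0.5 m,
# so the true disparity is 2 px everywhere.
right = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
left = [right[max(x - 2, 0)] for x in range(len(right))]

good = synthesize_left_row(right, [25.0] * 8, 100.0, 0.5)  # correct depth
bad = synthesize_left_row(right, [50.0] * 8, 100.0, 0.5)   # depth too large

loss_good = photometric_l1(good, left)
loss_bad = photometric_l1(bad, left)
```

The correct depth reproduces the left row exactly, while the wrong depth leaves a residual error, which is the signal the network learns from.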

Note that the network is still monocular: it is trained only on left images, whereas the right image and prior knowledge about projective geometry are used only to self-supervise the learning process. This is also different from most self-supervised learning works in Computer Vision that only learn representations: here we learn the full model for the task of depth estimation without any labels at all!

Self-supervised depth: the devil is in the details

In our SuperDepth ICRA’19 paper, we discovered that a major bottleneck to monocular depth performance is low image resolution. The devil is in the details, and if they get lost in the typical downsampling operations common to most deep convolutional networks, then it is hard to get a precise self-supervised error signal. Inspired by super-resolution methods, we found that sub-pixel convolutions on intermediate depth estimates are capable of recovering some of those fine-grained details to boost prediction performance, especially at the high resolutions critical for self-driving (2 megapixels and above).
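
For readers unfamiliar with sub-pixel convolutions, the sketch below shows only the rearrangement step that follows the convolution: r * r low-resolution channels are interleaved into one channel that is r times larger in each spatial dimension. The function name and nested-list layout are assumptions for illustration, and the learned convolution that would precede this step is omitted.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Channel c*r*r + i*r + j of the input fills offset (i, j) inside every
    r x r block of output channel c.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel.
low_res = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
high_res = pixel_shuffle(low_res, 2)
```

Because the upsampled values come from learned channels rather than fixed interpolation, the network can decide what fine detail to place at each sub-pixel position.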

While stereo cameras can facilitate self-supervised learning, Zhou et al. have surprisingly shown that this approach also works for videos acquired by a single, moving, monocular camera! One can indeed use similar geometric principles with temporally adjacent frames instead of left-right stereo images. This vastly widened the application potential of self-supervised learning, but also made the task much harder. Indeed, the spatial relationship between consecutive frames, also called the camera’s ego-motion, is not known and must therefore be estimated as well. Luckily, there is an ample body of research around the ego-motion estimation problem (including our own), and this can be seamlessly integrated with self-supervised depth estimation frameworks, e.g., via the joint learning of a pose network.

Self-supervised learning uses depth and pose networks to synthesize the current frame based on information from an adjacent frame. The photometric loss between original and synthesized images is the objective to be minimized during training.
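
The geometry behind this synthesis step can be sketched for a single pixel: lift it to 3D with its predicted depth, move it into the adjacent frame with the predicted ego-motion, and project it back to the image. The sketch assumes a pinhole camera with known intrinsics; the function names and the hand-set depth and pose (which would normally come from the depth and pose networks) are illustrative assumptions.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with depth Z -> 3D point in the camera frame."""
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def transform(point, R, t):
    """Apply a rigid motion: R @ point + t."""
    return [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]


def project(point, fx, fy, cx, cy):
    """3D point in the camera frame -> pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def warp_pixel(u, v, depth, intrinsics, R, t):
    """Where does target pixel (u, v) land in the adjacent (source) frame?"""
    p_target = unproject(u, v, depth, *intrinsics)
    p_source = transform(p_target, R, t)
    return project(p_source, *intrinsics)


intrinsics = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (made-up values)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # no rotation
t = [0.0, 0.0, -1.0]  # camera moved 1 m forward, so points get 1 m closer

u_src, v_src = warp_pixel(420.0, 240.0, 10.0, intrinsics, R, t)
```

Sampling the adjacent frame at the warped coordinates for every pixel yields the synthesized image, and its difference from the original frame gives the photometric loss.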

Just as in SuperDepth, we found that high-resolution details are key in this setting too, but this time we went further. Instead of trying to recover lost details via super-resolution, we set out to efficiently preserve them throughout the entire deep network. Consequently, in our CVPR’20 paper we introduced PackNet, a neural network architecture specifically tailored for self-supervised monocular depth estimation. We designed novel packing and unpacking layers that preserve spatial resolution at all intermediate feature levels thanks to tensor manipulations and 3D convolutions. These layers serve as substitutes for traditional downsampling and upsampling operations, with the difference that they can learn to compress and uncompress key high-resolution features that help with depth prediction.

PackNet is an encoder-decoder neural network that leverages novel packing and unpacking blocks to learn to preserve important spatial details, leading to high quality depth predictions.
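
To give some intuition for why packing can preserve detail, the sketch below contrasts a space-to-depth rearrangement with max pooling. Both halve the spatial size, but the rearrangement moves every value into extra channels instead of discarding it. This only illustrates the tensor-manipulation part under our own naming; the learned 3D convolutions that compress the folded features are omitted, and `space_to_depth` and `max_pool` are hypothetical helper names.

```python
def space_to_depth(x, r):
    """Fold each r x r spatial block of a (C, H, W) nested list into channels.

    Output shape is (C*r*r, H/r, W/r); no input value is discarded.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    out = []
    for ch in range(c):
        for i in range(r):
            for j in range(r):
                out.append([
                    [x[ch][row * r + i][col * r + j] for col in range(w // r)]
                    for row in range(h // r)
                ])
    return out


def max_pool(x, r):
    """Conventional downsampling: keep only the largest value per block."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [[max(x[ch][row * r + i][col * r + j]
              for i in range(r) for j in range(r))
          for col in range(w // r)]
         for row in range(h // r)]
        for ch in range(c)
    ]


feature = [[[1, 2, 3, 4],
            [5, 6, 7, 8],
            [9, 10, 11, 12],
            [13, 14, 15, 16]]]          # shape (1, 4, 4)

packed = space_to_depth(feature, 2)     # shape (4, 2, 2), all 16 values kept
pooled = max_pool(feature, 2)           # shape (1, 2, 2), only 4 values kept
```

Since the folded tensor still contains every input value, a subsequent learned layer can choose which details to keep rather than having that choice hard-coded by the pooling rule.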

In our experiments with PackNet, we confirmed that it is possible to preserve these details in real time, which is crucial for robotics applications. As a result, we showed that our self-supervised network can match or even outperform models supervised with LiDAR!

Qualitative results comparing PackNet to other state-of-the-art depth estimation models, both supervised and self-supervised.

Importantly, we demonstrated that performance improves not just with resolution, but also with model size and data. This extends to self-supervised depth estimation the empirical findings made by other researchers for other supervised tasks.

PackNet scalability experiments relative to the standard ResNet architecture. We analyse scalability relative to network complexity, image resolution, and depth ranges.

This model is quite powerful in practice, and anyone can easily reproduce our results from our open source codebase packnet-sfm. We also released pre-trained models and a new dataset: DDAD.

The Dense Depth for Automated Driving (DDAD) Benchmark and Competition

Many of the results you have seen above are on data from the TRI fleet that we use for research, development, and testing of our autonomous driving and advanced driver-assistance systems. In order to promote reproducibility and foster further open research, we have released part of that data to form a new challenging benchmark called DDAD (for Dense Depth for Automated Driving). It includes six calibrated cameras time-synchronized at 10 Hz and high-resolution long-range LiDAR sensors used to generate dense ground-truth depth estimates up to 250 m. DDAD has a total of 12,650 anonymized training samples in challenging and diverse urban conditions in Japan and the US. We have also released a validation set and are organizing a depth estimation competition on DDAD.

Example Scenes From DDAD

Full Surround Monocular Point Clouds

As mentioned above, DDAD actually includes synchronized data from six cameras, not just one. Why this many? In robotics, and especially in a driving setting, we want to understand what is happening around the robot, not just in front of it. This is why LiDAR scanners provide full 360° coverage. The same can be achieved with multiple cameras by judiciously placing them around the vehicle. However, to minimize the number of cameras required, and hence costs, these rigs typically have minimal overlap and very different viewpoints. Sadly, this setup breaks standard computer vision methods for multi-view depth estimation, so depth ends up being estimated independently for each camera, and thus potentially inconsistently.

Even though the overlaps between cameras are small, we can still leverage them if we consider their relations across time. This is what we show in one of our latest works called Full Surround Monodepth (FSM). In a nutshell, our method uses a combination of multi-camera spatio-temporal photometric constraints, self-occlusion masks, and pose averaging to learn, in a self-supervised way again, a single depth network that can reconstruct metrically-scaled point clouds all around the robot, just like a LiDAR.

Self-supervised scale-aware 360° point cloud produced using Full Surround Monodepth (FSM) on DDAD.
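
One way to picture a spatio-temporal constraint is as a chain of rigid transforms: from camera A at time t to the vehicle frame, through the vehicle's motion, and into camera B at time t+1. The sketch below composes such a chain with 4x4 matrices. It is generic rigid-body geometry under assumed conventions (extrinsics map camera to vehicle coordinates, and the motion matrix maps vehicle coordinates at t to vehicle coordinates at t+1); the names and values are hypothetical and not taken from the FSM implementation.

```python
def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(T):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    new_t = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [Rt[i] + [new_t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def translation(x, y, z):
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def cross_camera_transform(cam_a_to_vehicle, vehicle_motion, cam_b_to_vehicle):
    """Map points from camera A at time t into camera B at time t+1."""
    return matmul(invert_rigid(cam_b_to_vehicle),
                  matmul(vehicle_motion, cam_a_to_vehicle))


# Toy rig (x points forward): camera A sits 2 m ahead of the vehicle origin,
# camera B sits 1 m to the side, and the vehicle drives 1 m forward.
cam_a = translation(2.0, 0.0, 0.0)
cam_b = translation(0.0, 1.0, 0.0)
motion = translation(-1.0, 0.0, 0.0)  # static points shift 1 m backward

T_ab = cross_camera_transform(cam_a, motion, cam_b)
```

With such a transform, a point seen by one camera at one time step can be reprojected into a different camera at another time step, so regions that never overlap within a single time step can still constrain each other.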

Alternative Camera Models: Neural Ray Surfaces

The projective geometry that underpins all of the aforementioned works relies on an important assumption: the relation between the 2D image and the 3D world is accurately modeled by the standard pinhole model with known calibration. This enables the projection of information between cameras, which is central to self-supervised depth estimation. However, this convenient assumption does not always hold in practice due to unmodeled distortions, for instance with wide angle cameras (e.g., fisheye, catadioptric), under water, or even dashcams behind a windshield on a rainy day!

How can we solve that issue without having to carefully engineer and calibrate ad hoc camera models for each specific scenario? Can we learn a general projection model directly from raw data without prior knowledge? This is what we set out to do in our 3DV paper on Neural Ray Surfaces (NRS). We showed that we can learn to predict per-pixel projection operators together with depth and pose networks. This is all optimized end-to-end in a self-supervised way, just like before, but without any assumption about the camera model. In other words, NRS is very flexible and works across a wide range of different camera geometries. It has been amazing to see what people can do with these ideas using the open source code we released!

Self-supervised depth estimation results using NRS on a wide range of camera models.
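
The idea of a per-pixel projection operator can be sketched as a table that stores one viewing ray per pixel. Lifting a pixel to 3D multiplies its ray by a distance, and projecting a 3D point searches for the pixel whose ray best aligns with the point's direction. In the sketch below the table is filled from a pinhole model purely as a stand-in for what a network would predict; the function names and the hard arg-max search are assumptions for illustration, not the released NRS code.

```python
import math


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def pinhole_ray_table(width, height, fx, fy, cx, cy):
    """Stand-in for a learned ray surface: one unit ray per pixel."""
    return {(u, v): normalize([(u - cx) / fx, (v - cy) / fy, 1.0])
            for v in range(height) for u in range(width)}


def lift(pixel, distance, rays):
    """Pixel plus distance along its ray -> 3D point."""
    return [distance * c for c in rays[pixel]]


def project(point, rays):
    """3D point -> pixel whose ray is most aligned with the point's direction."""
    direction = normalize(point)
    return max(rays,
               key=lambda px: sum(a * b for a, b in zip(rays[px], direction)))


rays = pinhole_ray_table(5, 4, 4.0, 4.0, 2.0, 1.5)
point = lift((3, 1), 7.0, rays)
pixel = project(point, rays)  # recovers the pixel we started from
```

Nothing in `lift` or `project` depends on the pinhole formula, so the same two operations would work unchanged if the ray table described a fisheye or another distorted camera.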


Self-supervision is a powerful tool to learn deep networks for depth estimation using only raw data and our knowledge about 3D geometry. But we can see applications far beyond depth estimation. We believe self-supervised learning has the potential for many applications that could benefit society and increase opportunities for mobility for all.

Nonetheless, roadblocks remain. This is why we have released our code and data to encourage more open research on these important challenges. We are ourselves hard at work on several of them, in particular how to go beyond pure self-supervision towards scalable supervision, not just to improve performance but also to mitigate some biases of self-supervised learning. We will discuss some of our related research in part 2 of this blog post, so keep your eye (one is enough) peeled!




