Self-supervised Learning in Depth — part 1 of 2

Self-supervised learning and beyond for depth estimation from images.

By: Vitor Guizilini, Rares Ambrus, Adrien Gaidon

Self-supervised learning enables the prediction of accurate pointclouds from a single image using only videos as training data.


However, there is a catch. Beyond privacy and other ethical issues to carefully consider when designing machine learning systems, all state-of-the-art models in computer vision rely on millions of labels (or more!) to reach the high level of accuracy required for safety-critical applications in the real world. Manual labeling is expensive and time-consuming, taking hours and costing tens of dollars per image. And, sometimes, it is impossible altogether.

This is the case for monocular depth estimation, where the goal is to predict, for each pixel of a single image, how far the corresponding scene element is from the camera.

In monocular depth estimation, the goal is the generation of pixel-wise estimates (a.k.a. a depth map) of how far each scene element is from the camera.
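To make the link between a depth map and a point cloud concrete, here is a minimal NumPy sketch that lifts every pixel into 3D. It assumes a simple pinhole camera with a known intrinsics matrix `K` and treats depth as the z-coordinate; the function name `unproject_depth` and all numbers are illustrative, not taken from our codebase.

```python
import numpy as np

def unproject_depth(depth, K):
    """Lift an (H, W) depth map to an (H, W, 3) point cloud in camera coordinates.

    Assumes a pinhole camera with intrinsics K and depth measured along the z-axis.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Viewing ray K^-1 p for every pixel (its z-component is 1).
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by the predicted depth of its pixel.
    return rays * depth[..., None]

# Illustrative intrinsics: focal length 100 px, principal point at (2, 1).
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 10.0)  # a flat wall 10 m in front of the camera
points = unproject_depth(depth, K)
```

With a real network, `depth` would be the predicted depth map, and `points` is the per-pixel point cloud.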

Although many sensor rigs can measure depth, whether directly (e.g., LiDAR) or indirectly (e.g., stereo systems), single cameras are cheap and ubiquitous. They are in your mobile phone, vehicle dash cameras, internet videos, and so forth. Therefore, being able to generate useful depth information from videos is not only an interesting scientific challenge, but is also of high practical value. We also know that humans can do a pretty good job at it without explicitly measuring everything. We rely instead on strong inductive priors and our ability to — consciously or not — reason about 3D relations. Try closing one eye and reaching for something. You shouldn’t have a problem judging the depth.

This is exactly the approach we are following at TRI. Instead of training deep neural networks by telling them the precise answer (a.k.a. supervised learning), we are trying instead to rely on self-supervised learning by using projective geometry as a teacher! This training paradigm unlocks the use of arbitrarily large amounts of unlabeled videos to continuously improve our models as they get exposed to more data.

In this two-part blog post series, we will dive into our recent research on how to design and efficiently train deep neural networks for depth estimation (and beyond). In this first post, we will cover self-supervised learning using projective geometry across a variety of camera configurations. In the second post, we will discuss the practical limitations of self-supervised learning and how to go beyond them using weak supervision or transfer learning.

Part 1: Self-Supervised Learning for Depth Estimation

Eigen et al. were the first to show that a calibrated camera and LiDAR sensor rig can be used to turn monocular depth estimation into a supervised learning problem. The setup is simple: we convert an input image into per-pixel distance estimates (a depth map) using a neural network. Then, we use accurate LiDAR measurements reprojected onto the camera image to supervise the learning of the depth network weights via standard backpropagation of prediction errors using deep learning libraries like PyTorch.
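Because LiDAR only returns depth for a sparse subset of pixels, a supervised loss is typically computed only where a measurement exists. The sketch below illustrates that masking idea with a plain L1 error in NumPy. It is a stand-in, not the loss used by Eigen et al., and it assumes that a value of zero marks pixels without a LiDAR return.

```python
import numpy as np

def sparse_l1_loss(pred_depth, lidar_depth):
    """Mean absolute depth error over pixels that have a LiDAR measurement.

    Assumption: lidar_depth == 0 marks pixels with no LiDAR return.
    """
    valid = lidar_depth > 0
    if not valid.any():
        return 0.0
    return float(np.abs(pred_depth[valid] - lidar_depth[valid]).mean())

pred = np.array([[10.0, 20.0], [30.0, 40.0]])   # network prediction
lidar = np.array([[12.0, 0.0], [0.0, 37.0]])    # sparse reprojected LiDAR
loss = sparse_l1_loss(pred, lidar)
```

In a deep learning library, the same masked error would be backpropagated to update the depth network weights.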

Building on the work of Godard et al., we explored a self-supervised learning approach that used only images captured by a stereo pair (two cameras next to each other) instead of LiDAR. Images of the same scene captured from different viewpoints are indeed geometrically consistent, and we can use this property to learn a depth network. If the network makes a correct depth prediction on the left image of the stereo pair, simple geometric equations explain how to reconstruct the left image from pixels in the right image only, a task called view synthesis. If the depth prediction is wrong, then the reconstruction will be poor, giving an error, called the photometric loss, to minimize via backpropagation.
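The view synthesis idea can be sketched in a few lines for the simplest case: a rectified stereo pair, where a pixel at column u in the left image appears at column u - d in the right image, with disparity d = focal * baseline / depth. The NumPy example below is a toy illustration under those assumptions (grayscale images, linear interpolation along rows, plain L1 photometric error); the function names and numbers are hypothetical.

```python
import numpy as np

def synthesize_left(right, depth_left, focal, baseline):
    """Reconstruct the left image by sampling the right image row by row.

    Assumes a rectified stereo pair: left pixel u maps to right pixel u - d,
    with disparity d = focal * baseline / depth.
    """
    h, w = right.shape
    disparity = focal * baseline / depth_left
    u = np.arange(w, dtype=np.float64)
    left_hat = np.empty((h, w), dtype=np.float64)
    for row in range(h):
        src = u - disparity[row]                        # where to look in the right image
        left_hat[row] = np.interp(src, u, right[row])   # linear interpolation, clamped at borders
    return left_hat

def photometric_l1(a, b):
    """Mean absolute intensity difference between two images."""
    return float(np.abs(a - b).mean())

# Toy scene: the left image is the right image shifted by 2 pixels.
w = 8
right = np.tile((np.arange(w) ** 2 % 7).astype(np.float64), (2, 1))
left = np.roll(right, 2, axis=1)
focal, baseline = 100.0, 0.5
good_depth = np.full((2, w), focal * baseline / 2.0)  # yields disparity 2
bad_depth = np.full((2, w), focal * baseline / 3.0)   # yields disparity 3

# Ignore the first two columns, which have no counterpart in the right image.
good_loss = photometric_l1(synthesize_left(right, good_depth, focal, baseline)[:, 2:], left[:, 2:])
bad_loss = photometric_l1(synthesize_left(right, bad_depth, focal, baseline)[:, 2:], left[:, 2:])
```

The correct depth reproduces the left image exactly, while the wrong depth leaves a photometric error, which is the signal the depth network learns from.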

Note that the network is still monocular: it is trained only on left images, whereas the right image and prior knowledge about projective geometry are only used to self-supervise the learning process. This is also different from most self-supervised learning works in computer vision that only learn representations: here we learn the full model for the task of depth estimation without any labels at all!

Self-supervised depth: the devil is in the details

While stereo cameras can facilitate self-supervised learning, Zhou et al. have surprisingly shown that this approach also works for videos acquired by a single, moving, monocular camera! One can indeed use similar geometric principles with temporally adjacent frames instead of left-right stereo images. This vastly widened the application potential of self-supervised learning, but also made the task much harder. Indeed, the spatial relationship between consecutive frames, also called the camera’s ego-motion, is not known and must therefore be estimated as well. Luckily, there is an ample body of research around the ego-motion estimation problem (including our own), and this can be seamlessly integrated with self-supervised depth estimation frameworks, e.g., via the joint learning of a pose network.

Self-supervised learning uses depth and pose networks to synthesize the current frame based on information from an adjacent frame. The photometric loss between original and synthesized images is the objective to be minimized during training.
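The geometry behind this synthesis step can be written down for a single pixel: lift it to 3D with its depth, move it with the relative camera pose, and project it into the adjacent frame. The NumPy sketch below assumes a pinhole camera with intrinsics `K`; the rotation `R` and translation `t` are hand-picked here, whereas in training they would come from the pose network and the depth from the depth network.

```python
import numpy as np

def reproject(u, v, depth, K, R, t):
    """Map pixel (u, v) of the current frame to pixel coordinates in an adjacent frame.

    Assumes a pinhole camera K and a rigid transform (R, t) from the current
    camera frame to the adjacent camera frame.
    """
    point = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # lift to 3D
    point_adj = R @ point + t                                   # apply ego-motion
    proj = K @ point_adj                                        # project
    return proj[:2] / proj[2]

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 40.0],
              [0.0, 0.0, 1.0]])
same = reproject(60.0, 45.0, 10.0, K, np.eye(3), np.zeros(3))
moved = reproject(60.0, 45.0, 10.0, K, np.eye(3), np.array([0.5, 0.0, 0.0]))
```

With no motion the pixel maps to itself; a sideways translation of 0.5 m at 10 m depth shifts it by 100 * 0.5 / 10 = 5 pixels. Sampling the adjacent frame at these coordinates for every pixel yields the synthesized image that is compared with the original.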

Just as in SuperDepth, we found that high-resolution details are key in this setting too, but this time we went further. Instead of trying to recover lost details via super-resolution, we set out to efficiently preserve them throughout the entire deep network. Consequently, in our CVPR’20 paper we introduced PackNet, a neural network architecture specifically tailored for self-supervised monocular depth estimation. We designed novel packing and unpacking layers that preserve spatial resolution at all intermediate feature levels thanks to tensor manipulations and 3D convolutions. These layers serve as substitutes to traditional downsampling and upsampling operations, with the difference that they can learn to compress and uncompress key high-resolution features that help with depth prediction.

PackNet is an encoder-decoder neural network that leverages novel packing and unpacking blocks to learn to preserve important spatial details, leading to high quality depth predictions.
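To give a feel for the tensor manipulation involved, here is a NumPy sketch of a space-to-depth rearrangement and its inverse: spatial resolution is halved, but every value is moved into extra channels rather than discarded. This only illustrates the lossless folding step. The learned convolutions of the actual packing and unpacking blocks are omitted, and the function names are illustrative.

```python
import numpy as np

def space_to_depth(x, r=2):
    """Fold each r x r spatial block into channels: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)           # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)

def depth_to_space(x, r=2):
    """Inverse of space_to_depth: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c, h, w = x.shape
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)           # (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)

features = np.arange(2 * 4 * 6).reshape(2, 4, 6)
packed = space_to_depth(features)
restored = depth_to_space(packed)
```

Unlike max pooling or strided convolution, this downsampling is exactly invertible, so the layers that follow can learn which high-resolution details to keep.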

In our experiments with PackNet, we confirmed that it is possible to preserve these details in real time, which is crucial for robotics applications. As a result, we showed that our self-supervised network can match or even outperform models supervised with LiDAR!

Qualitative results comparing PackNet to other state-of-the-art depth estimation models, both supervised and self-supervised.

Importantly, we demonstrated that performance improves not just with resolution, but also with model size and data. This extends to self-supervised depth estimation the empirical scaling findings that other researchers have reported for supervised tasks.

PackNet scalability experiments relative to the standard ResNet architecture. We analyse scalability relative to network complexity, image resolution, and depth ranges.

This model is quite powerful in practice, and anyone can easily reproduce our results from our open source codebase packnet-sfm. We also released pre-trained models and a new dataset: DDAD.

The Dense Depth for Automated Driving (DDAD) Benchmark and Competition

Example Scenes From DDAD

Full Surround Monocular Point Clouds

Even though the overlaps between cameras are small, we can still leverage them if we consider their relations across time. This is what we show in one of our latest works called Full Surround Monodepth (FSM). In a nutshell, our method uses a combination of multi-camera spatio-temporal photometric constraints, self-occlusion masks, and pose averaging to learn, in a self-supervised way again, a single depth network that can reconstruct metrically-scaled point clouds all around the robot, just like a LiDAR.

Self-supervised scale-aware 360º point cloud produced using Full Surround Monodepth (FSM) on DDAD.
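One ingredient that makes multi-camera reasoning possible is that all cameras are rigidly attached to the same vehicle, so the motion of one camera determines the motion of every other. The sketch below shows this generic rigid-body relation in NumPy. It is not FSM's exact formulation: the extrinsics convention (camera-to-vehicle 4x4 matrices) and the function names are assumptions made for illustration.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 rigid transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def transfer_motion(T_i, X_i, X_j):
    """Express the ego-motion T_i of camera i in the frame of camera j.

    Assumption: X_i and X_j are camera-to-vehicle extrinsics.
    """
    T_vehicle = X_i @ T_i @ np.linalg.inv(X_i)       # motion in the vehicle frame
    return np.linalg.inv(X_j) @ T_vehicle @ X_j      # motion in camera j's frame

X_front = make_pose(0.0, [1.5, 0.0, 1.2])            # hypothetical front camera
X_left = make_pose(np.pi / 2, [0.8, 0.5, 1.2])       # hypothetical left camera
T_front = make_pose(0.05, [0.9, 0.0, 0.0])           # front camera ego-motion

T_left = transfer_motion(T_front, X_front, X_left)
T_back = transfer_motion(T_left, X_left, X_front)
```

Once the per-camera motion estimates are expressed in a common frame like this, they can be compared or combined, which is the intuition behind averaging poses across cameras.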

Alternative Camera Models: Neural Ray Surfaces

How can we solve that issue without having to carefully engineer and calibrate ad hoc camera models for each specific scenario? Can we learn a general projection model directly from raw data without prior knowledge? This is what we set out to do in our 3DV paper on Neural Ray Surfaces (NRS). We showed that we can learn to predict per-pixel projection operators together with depth and pose networks. This is all optimized end-to-end in a self-supervised way, just like before, but without any assumption about the camera model. In other words, NRS is very flexible and works across a wide range of different camera geometries. It has been amazing to see what people can do with these ideas using the open source code we released!

Self-supervised depth estimation results using NRS on a wide range of camera models.
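The idea of a per-pixel projection model can be illustrated without any network: if every pixel carries its own unit viewing ray, lifting a pixel to 3D means scaling its ray by a distance, and projecting a 3D point means finding the pixel whose ray points most directly at it. The NumPy sketch below is a simplification under stated assumptions. The rays are filled in from a pinhole model only so that the result can be checked, whereas in NRS they are predicted, and the hard argmax stands in for a differentiable relaxation that is not shown.

```python
import numpy as np

def pinhole_rays(h, w, K):
    """Unit viewing ray for every pixel of a pinhole camera (stand-in for predicted rays)."""
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

def lift(rays, distance):
    """3D point per pixel: unit ray scaled by the distance along that ray."""
    return rays * distance[..., None]

def project(point, rays):
    """Pixel (u, v) whose ray has the highest cosine similarity with the point's direction."""
    direction = point / np.linalg.norm(point)
    similarity = rays @ direction                     # (H, W) cosine similarities
    v, u = np.unravel_index(np.argmax(similarity), similarity.shape)
    return int(u), int(v)

K = np.array([[50.0, 0.0, 3.0],
              [0.0, 50.0, 2.0],
              [0.0, 0.0, 1.0]])
rays = pinhole_rays(4, 6, K)
points = lift(rays, np.full((4, 6), 5.0))
pixel = project(points[1, 4], rays)
```

Nothing in `lift` or `project` depends on the pinhole model: any table of per-pixel rays works, which is what lets this kind of formulation cover very different camera geometries.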


Nonetheless, roadblocks remain. This is why we have released our code and data to encourage more open research on these important challenges. We are ourselves hard at work on several of them, in particular how to go beyond pure self-supervision towards scalable supervision, not just for performance improvements but also mitigation of some biases of self-supervised learning. We will discuss some of our related research in part 2 of this blog post, so keep your eye (one is enough) peeled!

Toyota Research Institute

Subsidiary of Toyota Motor North America with mission to improve the quality of human life through advances in AI, automated driving and robotics.
