Unsupervised Monocular Depth Estimation

Merantix Momentum
Merantix Momentum Insights
12 min read · Jan 11, 2023

Principles and Recent Developments

Authors: Alexander Koenig and Julien Siems

Motivation

Our first two blog posts (part one, part two) of the Merantix Momentum Research Insights discussed methods that focus on self-supervised learning in the context of representation learning. These methods generate descriptive, low-dimensional representations of their inputs given a large amount of unlabeled data. The representations can then be used in downstream tasks, for instance, to learn to classify images based on a few examples of each class.

Our current blog post views self-supervised learning, a subset of unsupervised learning, from a task-specific angle. We pick out the task of dense monocular depth estimation visualized in Figure 1, where the distance of an object from the camera at each pixel is inferred from one image alone (as opposed to a stereo vision setup with two cameras). Labeling this image with a consistent and accurate pixel-wise depth map would be impossible for a human to do without at least a few reference points with known depth.

Figure 1: Example of an image and its high-resolution depth map (Miangoleh et al. 2021).

The darker a region of the depth map, the more distant the object at that pixel is from the camera, and vice versa. The same convention holds for the black-and-white depth maps shown later.

However, there are options to obtain training labels other than human labeling. For example, one could synthesize pixel-perfect labeled data using a renderer. Unfortunately, such approaches suffer from domain bias since real-world data differs from simulated imagery. Measures to reduce the gap between simulation and reality have been introduced (Atapour-Abarghouei et al. 2018). While achieving impressive results, these methods suffer from artifacts if the real and synthetic data domains differ too much, e.g., in situations with complex shadows or sudden lighting changes (Atapour-Abarghouei et al. 2018).

Another approach to obtain ground-truth data more conveniently is to record a laser scan (e.g., Lidar) alongside the image and reproject the laser measurements into the camera frame. But this comes with its own set of challenges. First, just like a camera cannot see in the dark, a Lidar scanner also has weak spots. As shown in Figure 2, laser light is reflected by fog or even car exhaust gases. Hence, not every Lidar measurement can be trusted equally. Due to high demand from the autonomous driving industry, Lidar has become more affordable and is expected to become cheaper in the future. Nevertheless, the cost and energy consumption of a Lidar sensor may still rule out its deployment on a mass-market autonomous system. Finally, company policies or economic decisions may also constrain access to Lidar depth information.

Figure 2: Smoke-induced artifacts in Lidar measurements (Fritsche et al. 2016).

In the following, we discuss how tremendous advancements over the last few years have leveraged ideas from self-supervised learning to enable robust monocular depth estimation despite the scarcity of labeled depth data. Achieving high-quality and robust monocular depth estimation has many applications, ranging from creating a 3D map of your living room with your smartphone to navigating autonomous cars or robots through complex environments with simple, easy-to-calibrate hardware.

Non-learning-based approaches

Monocular depth estimation was intensively studied even before the advent of deep learning (DL). Early approaches are based on the depth-from-focus or depth-from-defocus principle. They rely on the intuition that objects imaged in focus lie on the focal plane, at an equal distance from the observer. Objects that are not in focus lie off that plane and appear blurry. The farther an object is from the focal plane, the blurrier it appears, depending in particular on the aperture size used. Determining in-focus regions and measuring the degree of blur allows inferring depth from a single image or a focal stack (Chaudhuri et al. 1999). In other words, these methods exploit the property that blurrier regions lie farther from the focal plane, which contains the sharper, higher-frequency content. As shown in Figure 3, the approach can also process multiple images with different focus settings to achieve a more robust depth estimate (Jin et al. 2002).
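
To make the depth-from-focus idea concrete, below is a minimal sketch (our own illustration, not code from the cited works; the function name and parameters are hypothetical) that estimates a coarse, relative depth map from a focal stack: it measures local sharpness with a Laplacian filter and, for every pixel, picks the focus setting with the strongest response.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(focal_stack: np.ndarray, window: int = 9) -> np.ndarray:
    """Coarse depth-from-focus for a stack of grayscale images.

    focal_stack: array of shape (num_focus_settings, height, width),
                 ordered from near-focused to far-focused.
    Returns the per-pixel index of the sharpest focus setting, a relative,
    discretized depth proxy.
    """
    sharpness = []
    for image in focal_stack:
        # A strong absolute Laplacian response indicates in-focus, high-frequency content.
        response = np.abs(laplace(image.astype(np.float64)))
        # Average the response over a local window to suppress noise.
        sharpness.append(uniform_filter(response, size=window))
    # For each pixel, pick the focus setting with the strongest response.
    return np.argmax(np.stack(sharpness, axis=0), axis=0)
```

Mapping the winning index to metric depth would additionally require the focal distance of each setting; this sketch only recovers a relative ordering.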

Figure 3: Depth-from-defocus: the left image is near-focused, the middle image is far-focused, and the right image is the resulting depth map (Jin et al. 2002).

Another classical method is Structure from Motion (SfM), which extracts depth from motion parallax: objects close to the camera move faster in image space than equally fast objects that are farther away. Other approaches do not only passively process the incoming light to infer depth but actively emit light patterns, a technique known as Structured Light. These methods model the distortion of the known pattern on the scene and can thereby infer depth.
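
The parallax intuition can be made quantitative with the classic triangulation relation used in stereo and SfM pipelines: with a focal length f (in pixels) and a baseline b between two viewpoints, an image-space displacement (disparity) d corresponds to a depth of approximately z = f·b/d. A tiny sketch with made-up numbers:

```python
import numpy as np

def depth_from_disparity(disparity_px: np.ndarray, focal_px: float, baseline_m: float) -> np.ndarray:
    """Classic triangulation: depth [m] = focal length [px] * baseline [m] / disparity [px]."""
    # Guard against division by zero where no parallax was measured.
    disparity_px = np.where(disparity_px > 0, disparity_px, np.nan)
    return focal_px * baseline_m / disparity_px

# Example: a 720 px focal length and a 0.5 m camera displacement.
disparity = np.array([40.0, 10.0, 2.0])             # pixel shifts of three tracked points
print(depth_from_disparity(disparity, 720.0, 0.5))  # -> [  9.  36. 180.] meters
```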

The previous methods rely on physical effects such as parallax or the optical properties of the imaging system to accurately measure depth. However, depth can also be estimated using a single image by incorporating domain knowledge. While we humans heavily rely on our stereo vision to perceive depth, inferring relative depth information using just a single eye does not pose significant challenges to us.

Humans integrate a great deal of contextual information to cope with such ambiguities and estimate depth robustly. For example, the presented non-learning-based depth estimation algorithms may have difficulties distinguishing between a gigantic imaginary car at 100 meters distance and a regular-sized car at five meters distance because their apparent size and shape will be almost identical in the resulting image. However, humans know that a car is usually between four and six meters long and can use this prior information to estimate the distance to the car from its apparent size. Such contextual priors are hard to integrate into classical depth estimation algorithms. If attempted, this may harm their generalizability because the introduced domain knowledge can only cover a subset of all possible scenarios. In the next section, we discuss learning systems that extract such contextual constraints from massive amounts of data.

Self-supervised depth estimation 101

Neural nets and depth prediction

Before we dive into the current state of the art (SOTA), we will explain several developments that paved the way for self-supervised monocular depth estimation with deep neural networks. Before the advent of self-supervised learning, the most direct approach to monocular depth estimation using deep learning was to treat it as a supervised regression problem, with a neural network mapping directly from images to (sparse) depth maps. This approach was pioneered by Eigen et al. (2014), as shown in Figure 4. They use the medium-sized NYU (Silberman et al. 2012) and KITTI (Geiger et al. 2013) datasets. Eigen et al. use a global depth network to reconstruct the overall structure and a refinement network for details, resulting in a new SOTA. Figure 5 shows their results on the NYU indoor scenes dataset. Note that in 2014 the DL infrastructure was not as developed as it is now. Even popular deep learning frameworks such as PyTorch or TensorFlow had not been publicly released yet. Hence, what may seem straightforward to implement now was a big engineering effort at the time.

Figure 4: Vanilla supervised monocular depth estimation with a neural network (Eigen et al. 2014).
Figure 5: Results of supervised monocular depth estimation on the NYU indoor scenes dataset. Left: input image. Middle: depth estimation. Right: ground-truth depth (Eigen et al. 2014).
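
Eigen et al. train their networks with a scale-invariant error in log space, which only penalizes relative depth differences within an image. A minimal PyTorch sketch of such a loss, under the assumption of dense, positive ground-truth depth (in practice one would additionally mask out pixels without valid ground truth), could look like this:

```python
import torch

def scale_invariant_log_loss(pred_depth: torch.Tensor,
                             gt_depth: torch.Tensor,
                             lam: float = 0.5) -> torch.Tensor:
    """Scale-invariant error in log space for supervised depth regression.

    pred_depth, gt_depth: tensors of shape (batch, height, width) with positive depths.
    lam: hyperparameter balancing the scale-invariance term.
    """
    # Working in log space makes the error depend on depth ratios rather than absolute values.
    d = torch.log(pred_depth) - torch.log(gt_depth)
    n = d[0].numel()
    mse_term = (d ** 2).sum(dim=(1, 2)) / n
    # This term forgives a global per-image offset in log depth, i.e., a global scale factor.
    scale_term = lam * d.sum(dim=(1, 2)) ** 2 / (n ** 2)
    return (mse_term - scale_term).mean()
```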

Godard’s self-supervised approach

We already know that labeled datasets come with drawbacks. To the rescue come self-supervised approaches, which do not require explicit labels. In the following, we cover two major papers by Godard et al. and Zhou et al., both introduced at CVPR 2017, which lay most of the theoretical groundwork on which today's SOTA approaches are built.

Godard et al.'s key idea is not to treat predicted depth directly as the learning target but to use it to enforce a photometric self-consistency loss. Figure 6 summarizes their approach, which uses stereo images as training data. First, a depth predictor infers the pixel-wise depth map of the left image. Second, a differentiable image sampler uses the predicted depth map and warps the pixel values from the left into the right image. Finally, the loss is calculated as the difference between the proposed right image and the recorded right image. At inference, the approach does not require warping.

Figure 6: Schematic overview of Godard et al.’s approach (Godard et al. 2017). Images from Middlebury Stereo Image Dataset (Baker et al. 2011).
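
To make the photometric self-consistency idea more tangible, here is a simplified PyTorch sketch of differentiable warping for a rectified stereo pair: one view is reconstructed by sampling the other view at horizontally shifted coordinates given by the predicted disparity (which is inversely proportional to depth), and the reconstruction is compared to the recorded image with an L1 loss. This is only an illustration of the mechanism, with our own function names; the full method of Godard et al. additionally uses, e.g., an SSIM term, a smoothness term, and a left-right consistency term.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(source: torch.Tensor, disparity: torch.Tensor) -> torch.Tensor:
    """Reconstruct the target view of a rectified stereo pair by sampling the
    source view at horizontally shifted coordinates.

    source:    (batch, channels, height, width) image to sample from.
    disparity: (batch, height, width) horizontal shift in normalized image coordinates.
    """
    b, _, h, w = source.shape
    # Pixel grid in normalized [-1, 1] coordinates, as expected by grid_sample.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=source.device),
                            torch.linspace(-1, 1, w, device=source.device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    # Shift the horizontal sampling coordinate by the predicted disparity.
    grid[..., 0] = grid[..., 0] - disparity
    return F.grid_sample(source, grid, padding_mode="border", align_corners=True)

def photometric_loss(reconstruction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Per-pixel L1 difference between the synthesized and the recorded image.
    return (reconstruction - target).abs().mean()
```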

The two key advantages of this method are that it is no longer necessary to collect ground-truth depth data, and that the pixel-wise photometric loss provides a denser learning signal than, e.g., a sparse point cloud from a Lidar scan. Despite exceeding all supervised baselines at the time, a critical disadvantage of the method proposed by Godard et al. remained its reliance on accurately calibrated stereo images to train the depth predictor. However, public stereo datasets are scarce. Hence, more data may need to be recorded with a dedicated stereo camera.

Zhou’s self-supervised approach

While stereo videos are expensive to record, monocular videos are more readily available from the internet or smartphone cameras. By extending the ideas of Godard et al., Zhou et al. (2017) were the first to propose a self-supervised learning system that was trainable on monocular video streams from a moving camera. While in a stereo camera setup, the ground-truth relative camera poses are fixed and known, the shift from the current to the next camera frame is unknown in monocular video footage. Hence, Zhou et al. add a second network that predicts the ego motion of the camera between frames.

The training procedure in the seminal work by Zhou et al. is shown in Figure 7. Suppose you have three images Iₜ₋₁, Iₜ, and Iₜ₊₁ from three consecutive time steps t-1, t, t+1. Two networks are jointly optimized: a depth convolutional neural network (CNN) and a pose CNN. The training workflow proceeds as follows:

  1. The center image Iₜ is fed into the depth network to predict the pixel-wise depth map.
  2. All three images are input into the pose CNN, which then outputs the camera’s ego-motion between the previous and the current, and the current and the next frame. More formally, the ego-motion is a camera transform with three translational and three rotational parameters.
  3. Using a differentiable image renderer, known camera intrinsics, and the estimated depth and camera poses, they warp the source image Iₜ into the previous and next camera frames. In principle, this is similar to Godard et al.’s warping process; a sketch of the underlying reprojection follows after this list.
  4. Their loss function is simply the photometric error between the predicted and ground-truth previous as well as the next image.
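
As referenced in step 3, the warping boils down to standard pinhole geometry: every pixel is lifted into 3D using the predicted depth and the inverse camera intrinsics, transformed with the predicted camera motion, and projected into the neighboring frame, where the image is sampled. The following simplified PyTorch sketch shows this reprojection together with a photometric loss (formulated here as inverse warping, i.e., reconstructing one frame by sampling from its neighbor). The function names are our own, and the actual implementation of Zhou et al. adds further components such as multi-scale losses and an explainability mask.

```python
import torch
import torch.nn.functional as F

def reprojection_grid(depth, pose, K, K_inv):
    """For every pixel of the target frame, compute the corresponding sampling
    location in a neighboring (source) frame.

    depth:    (b, h, w) predicted depth of the target frame.
    pose:     (b, 4, 4) predicted transform from the target to the source camera.
    K, K_inv: (b, 3, 3) camera intrinsics and their inverse.
    Returns a sampling grid of shape (b, h, w, 2) in normalized [-1, 1] coordinates.
    """
    b, h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=depth.dtype, device=depth.device),
                            torch.arange(w, dtype=depth.dtype, device=depth.device),
                            indexing="ij")
    pix = torch.stack((xs, ys, torch.ones_like(xs)), dim=0).reshape(1, 3, -1)  # homogeneous pixels
    cam = (K_inv @ pix) * depth.reshape(b, 1, -1)               # lift pixels to 3D camera points
    cam = torch.cat((cam, torch.ones(b, 1, h * w, dtype=depth.dtype, device=depth.device)), dim=1)
    src = (pose @ cam)[:, :3]                                   # move points into the source frame
    proj = K @ src
    xy = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)              # perspective division
    x = 2.0 * xy[:, 0] / (w - 1) - 1.0                          # normalize for grid_sample
    y = 2.0 * xy[:, 1] / (h - 1) - 1.0
    return torch.stack((x, y), dim=-1).reshape(b, h, w, 2)

def view_synthesis_loss(target, source, depth, pose, K, K_inv):
    grid = reprojection_grid(depth, pose, K, K_inv)
    # Reconstruct the target image by sampling the source image at the reprojected locations.
    reconstruction = F.grid_sample(source, grid, padding_mode="border", align_corners=True)
    return (reconstruction - target).abs().mean()
```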

At inference, the pose CNN is not needed — you simply feed a single image into the depth CNN, and, voilà, you get a pixel-wise depth map.

Figure 7: Architecture of the work by Zhou et al. In contrast to Godard et al.’s work, Zhou et al. present a fully monocular training pipeline (Zhou et al. 2017).

Check out their results below in Figure 8. At the time, their self-supervised approach performed comparably to the SOTA supervised (!) baselines, both qualitatively and quantitatively, while using no labels.

Figure 8: Qualitative results comparing supervised and unsupervised depth estimation by Zhou et al. (Zhou et al. 2017).

While impressive at the time, Zhou et al.’s work is limited by assuming a static world and not explicitly reasoning about moving objects. That is, changes in appearance are assumed to be caused solely by the camera’s ego-motion. Hence, the network is incentivized to predict a huge depth value for an object that moves with the same velocity as the observer in the training dataset. This problem can introduce unwanted artifacts at inference time. An example is dashcam footage of a car in a traffic jam where all adjacent vehicles move at the same velocity, at an approximately constant distance. Since the appearance of the other vehicles barely changes, Zhou et al.’s model tends to predict their depth as infinite, because their pixel values don’t change from one frame to the next, just as for objects that are effectively placed at optical infinity, such as clouds or objects on the horizon.

Fast-forward to 2022

The success of the work by Godard et al. and Zhou et al. depended entirely on self-supervision: predicting unknown but desired quantities such as depth or ego-motion and enforcing consistency with the images at hand rather than with measurements of the depth itself. The most natural way to further improve results was to find other quantities for which the same is possible. Let’s fast-forward five years and see what’s possible in 2022.

Zhou et al.’s limiting assumption of a static world is addressed in a recent publication by Guizilini et al. (2022), which incorporates both scene flow and optical flow estimation, besides depth and ego-motion, into a single neural network architecture. Optical flow estimation is the task of predicting the 2D motion of object parts in the image plane between two consecutive video frames. Figure 9 shows an animation of optical flow for an example video. Scene flow is the three-dimensional generalization of optical flow and measures the 3D movement of points in a video. Optical flow can be obtained by projecting scene flow into the camera frame.
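
This last relation can be written down in a few lines: given a 3D point, its scene-flow vector, and the camera intrinsics, the induced optical flow is the difference between the pinhole projections of the displaced and the original point. A small illustrative sketch (our own, not taken from the DRAFT code base):

```python
import torch

def project(points: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of 3D points (n, 3) in the camera frame to pixel coordinates (n, 2)."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)

def optical_flow_from_scene_flow(points: torch.Tensor,
                                 scene_flow: torch.Tensor,
                                 K: torch.Tensor) -> torch.Tensor:
    """2D optical flow induced by 3D scene flow: the shift of each point's projection
    after the point has moved by its scene-flow vector."""
    return project(points + scene_flow, K) - project(points, K)

# Example: a point 10 m ahead and 1 m to the right moves 1 m towards the camera.
K = torch.tensor([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
point = torch.tensor([[1.0, 0.0, 10.0]])
flow_3d = torch.tensor([[0.0, 0.0, -1.0]])
print(optical_flow_from_scene_flow(point, flow_3d, K))  # ~[[7.8, 0.0]] pixels to the right
```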

Figure 9: Visualization of optical flow, modeling the 2D vector field of pixel intensity changes in an image sequence (Source).

The joint estimation of ego-motion, depth, optical flow, and scene flow in the new DRAFT architecture leads to SOTA results on self-supervised monocular depth prediction (Guizilini et al. 2022). By the way: DRAFT also achieves SOTA performance on optical flow and scene flow estimation. In Figure 10, the authors compare their DRAFT architecture against ManyDepth (Watson et al. 2021), another self-supervised algorithm. Their qualitative results indicate that DRAFT produces sharper boundaries and models dynamic objects better (check the middle image and the artifacts generated for the moving car).

Figure 10: SOTA results achieved by modeling dynamic objects through optical and scene flow (Guizilini et al. 2022).

Conclusion

This blog post summarized the achievements in monocular depth estimation, from classical computer vision methods to self-supervised deep learning approaches. While current methods already achieve impressive results, the problem is by no means solved. An exciting challenge still requiring a robust solution is the accurate calibration of predicted depth maps to real-world depth in meters. Some approaches integrate the ground-truth camera velocity (e.g., from wheel encoders on a car) as a prior for the scale (Guizilini et al. 2020). Furthermore, creating models that can handle arbitrary camera models and lenses (e.g., telephoto and fisheye lenses) is necessary; currently, models can only reliably process data from the same camera systems they were trained on. Finally, fusing depth information from multiple cameras (e.g., creating a 360-degree depth map around a vehicle) remains a significant challenge.

References

  • (Miangoleh et al. 2021) Miangoleh, S. Mahdi H., Sebastian Dille, Long Mai, Sylvain Paris, and Yagiz Aksoy. “Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging.” In Conference on Computer Vision and Pattern Recognition. 2021.
  • (Atapour-Abarghouei et al. 2018) Atapour-Abarghouei, Amir, and Toby P. Breckon. “Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer.” In Conference on Computer Vision and Pattern Recognition, 2018.
  • (Fritsche et al. 2016) Fritsche, Paul, Simon Kueppers, Gunnar Briese, and Bernardo Wagner. “Radar and Lidar Sensor Fusion in Low Visibility Environments.” In International Conference on Informatics in Control, Automation and Robotics, 2016.
  • (Chaudhuri et al. 1999) Chaudhuri, Subhasis, and Ambasamudram N. Rajagopalan. Depth from defocus: a real aperture imaging approach. Springer Science & Business Media, 1999.
  • (Jin et al. 2002) Jin, Hailin, and Paolo Favaro. “A variational approach to shape from defocus.” In European Conference on Computer Vision, 2002.
  • (Silberman et al. 2012) Silberman, Nathan, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. “Indoor segmentation and support inference from RGBD images.” In European Conference on Computer Vision, 2012.
  • (Geiger et al. 2013) Geiger, Andreas, Philip Lenz, Christoph Stiller, and Raquel Urtasun. “Vision meets robotics: The KITTI dataset.” The International Journal of Robotics Research 32, no. 11 (2013): 1231–1237.
  • (Eigen et al. 2014) Eigen, David, Christian Puhrsch, and Rob Fergus. “Depth map prediction from a single image using a multi-scale deep network.” Advances in Neural Information Processing Systems, 2014.
  • (Godard et al. 2017) Godard, Clément, Oisin Mac Aodha, and Gabriel J. Brostow. “Unsupervised monocular depth estimation with left-right consistency.” In Conference on Computer Vision and Pattern Recognition, 2017.
  • (Baker et al. 2011) Baker, Simon, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. “A database and evaluation methodology for optical flow.” International Journal of Computer Vision, 2011.
  • (Zhou et al. 2017) Zhou, Tinghui, Matthew Brown, Noah Snavely, and David G. Lowe. “Unsupervised learning of depth and ego-motion from video.” In Conference on Computer Vision and Pattern Recognition, 2017.
  • (Guizilini et al. 2022) Guizilini, Vitor, Kuan-Hui Lee, Rareş Ambruş, and Adrien Gaidon. “Learning Optical Flow, Depth, and Scene Flow Without Real-World Labels.” IEEE Robotics and Automation Letters 7, 2022.
  • (Watson et al. 2021) Watson, Jamie, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. “The temporal opportunist: Self-supervised multi-frame monocular depth.” In Conference on Computer Vision and Pattern Recognition, 2021.
  • (Guizilini et al. 2020) Guizilini, Vitor, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. “3D packing for self-supervised monocular depth estimation.” In Conference on Computer Vision and Pattern Recognition, 2020.
