A comprehensive review of techniques used to estimate depth using Machine Learning and classical methods.
Conventional displays are two dimensional. A picture or a video of the three-dimensional world is encoded for storage in two dimensions. Inevitably, we lose the information corresponding to the third dimension: depth.
A 2D representation is good enough for most applications. However, some applications require information in three dimensions. An important example is robotics, where three-dimensional information is required to move actuators accurately. Clearly, some provisions have to be made to recover the lost depth information, and this blog explores such concepts.
How do we estimate depth?
Our eyes estimate depth by comparing the images obtained by the left and right eye. The minor displacement between the two viewpoints is enough to calculate an approximate depth map. We call the pair of images obtained by our eyes a stereo pair. This, combined with our variable-focal-length lenses and our general experience of “seeing things”, allows us to have seamless 3D vision.
Engineers and researchers have recognized this principle and tried to emulate it to extract depth information from the environment. There are numerous approaches to reach the same outcome. We will explore the hardware and software approaches separately.
1. Dual camera technology
Some devices have two cameras separated by a small distance (usually a few millimeters) to capture images from different viewpoints. These two images form a stereo pair and are used to compute depth information.
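To illustrate how a stereo pair yields depth cues, here is a toy block-matching sketch (all names and values hypothetical, not any production algorithm): for each position along a scanline, we search for the horizontal shift that best aligns the left and right rows. That shift, the disparity, is inversely proportional to depth.

```python
import numpy as np

def disparity_1d(left_row, right_row, block=3, max_disp=5):
    """Naive SAD block matching along one scanline: for each window in
    the left row, find the horizontal shift d into the right row that
    minimizes the sum of absolute differences. Larger d = closer object."""
    disp = np.zeros(len(left_row), dtype=int)
    for x in range(block, len(left_row) - block):
        patch = left_row[x - block:x + block + 1]
        best_cost, best_d = float("inf"), 0
        for d in range(min(max_disp, x - block) + 1):
            cand = right_row[x - d - block:x - d + block + 1]
            cost = np.abs(patch - cand).sum()
            if cost < best_cost:
                best_cost, best_d = cost, d
        disp[x] = best_d
    return disp
```

Real stereo pipelines add sub-pixel refinement, smoothness constraints, and occlusion handling, but the core search is the same.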
2. Dual pixel technology
An alternative solution to the Dual Camera technology is Dual Pixel Autofocus (DPAF) technology.
Each pixel comprises two photodiodes separated by a very small distance (less than a millimeter). Each photodiode captures the image signal separately, and the two signals are then analyzed. This separation is surprisingly sufficient for the images produced by the photodiodes to be treated as a stereo pair. Notably, the Google Pixel 2 uses this technology to compute depth information.
A good alternative to multiple cameras is to use sensors that can infer distance. For instance, the first version of the Kinect used an Infra-Red (IR) projector to achieve this. A pattern of IR dots is projected onto the environment, and a monochrome CMOS sensor (placed a few centimeters away) receives the reflected rays. The difference between the expected and received IR dot positions is then used to produce the depth information.
LIDAR systems fire laser pulses at objects in the environment and measure the time it takes for these pulses to be reflected back (known as time of flight). They can additionally measure the change in wavelength of the pulses. This yields accurate depth information.
An alternative, inexpensive solution is to use ultrasonic sensors. These sensors usually include a transmitter that projects ultrasonic sound waves towards the target. The waves are reflected by the target back to the sensor. By measuring the time the waves take to return to the sensor, we can measure the distance to the target. However, sound-based sensors may perform poorly in noisy environments.
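The distance calculation behind a time-of-flight sensor is simple; a minimal sketch, assuming the pulse travels through air at roughly 343 m/s:

```python
SPEED_OF_SOUND = 343.0  # m/s in air at about 20 degrees C

def ultrasonic_distance(echo_time_s):
    """Distance to target from the round-trip echo time.
    The pulse travels to the target and back, so halve the round trip."""
    return SPEED_OF_SOUND * echo_time_s / 2.0
```

For example, an echo that returns after 10 ms corresponds to a target about 1.7 m away.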
Using additional hardware not only increases the cost of production, but also makes the depth estimation method incompatible with other devices. Fortunately, software-only techniques to estimate depth do exist, and they remain an active research topic. Below are some of the popular methods to estimate depth using software:
1. Multiple image methods
The easiest way to calculate depth information without using additional hardware is to take multiple images of the same scene with slight displacements. By matching keypoints common to the images, we can reconstruct a 3D model of the scene. Algorithms such as Scale-Invariant Feature Transform (SIFT) are excellent at this task.
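The matching step can be sketched without any vision library. Below is a hypothetical nearest-neighbour matcher with Lowe's ratio test, the acceptance criterion commonly paired with SIFT descriptors (toy 2-D descriptors here stand in for real 128-D SIFT vectors):

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.75):
    """For each descriptor in desc_a, keep a match only when the nearest
    descriptor in desc_b is clearly closer than the second nearest
    (Lowe's ratio test). Returns (index_in_a, index_in_b) pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, nearest))
    return matches
```

The ratio test discards ambiguous matches, which is what makes the subsequent 3D reconstruction robust to repeated textures.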
To make this method more robust, we can measure the change in the device's position and orientation to compute the physical distance between the two viewpoints. This can be done using the device's accelerometer and gyroscope data. For instance, Visual-Inertial Odometry is used in Apple’s ARKit to calculate depth and other attributes of the scene. The user experience is seamless, as even slight motions of the device are enough to create stereo image information.
2. Single image methods
There are several single-image depth estimation methods as well. These methods usually involve a neural network trained on pairs of images and their depth maps. Such methods are easy to interpret and construct, and provide decent accuracy. Below are examples of some popular learning based methods.
A. Supervised Learning based methods
Supervised methods require some sort of labels to be trained. Usually, the labels are pixel-wise depth maps, in which case the trained model can directly output the depth map. Commonly used depth datasets include the NYUv2 dataset, which contains RGB-D data for indoor scenes, and the Make3D dataset, which contains depth data for outdoor scenes. You can check out this GitHub repo for information on more datasets.
Target labels need not be pure depth maps; they can also be a function of depth, such as hazy images. Hence, we can train the model on hazy and haze-free image pairs, and then extract depth using the function that relates a hazy image to its depth values. For this discussion, we will concentrate only on methods that use depth maps as target labels.
Autoencoders are among the simplest networks used to extract depth information. Popular variants use U-Nets: convolutional autoencoders with skip connections joining feature maps from the downsampling arm (outputs of convolutions) to the upsampling arm (outputs of transposed convolutions).
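A shape-level sketch of the idea follows. This is not a trainable network; average pooling and nearest-neighbour upsampling stand in for the convolution and transposed-convolution layers, to show how the skip connection fuses encoder and decoder feature maps of the same resolution.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling (stand-in for a strided convolution).
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbour upsampling (stand-in for a transposed convolution).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_skip_pass(image):
    """One U-Net level: encode, decode back to input size, then stack the
    encoder map with the decoder output, as a skip connection would."""
    enc = downsample(image)        # downsampling arm
    dec = upsample(enc)            # upsampling arm, back to input size
    return np.stack([image, dec])  # skip connection: fused feature maps
```

In a real U-Net the fused maps are passed through further convolutions; the key point is that fine spatial detail from the encoder reaches the decoder directly.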
Improvements can be made over this basic structure. For instance, the paper “Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture” uses multiple neural networks, each operating on the input at a different scale, with parameters such as kernel size and stride differing between networks. The authors claim that extracting information at multiple scales yields higher-quality depth maps than single-scale extraction.
An improvement over the above method is presented in “Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation”. Here the authors use a single end-to-end trainable model, but fuse feature maps of different scales using structured attention guided Conditional Random Fields (CRFs) before feeding them to the final convolution operation.
Other methods treat depth extraction as an image-to-image translation problem. Conventional image-translation methods are based on the pix2pix paper; these directly output the depth map for a given input image.
Similarly, improvements can be made over this structure. Performance can be enhanced by improving GAN stability and output quality, using methods like gradient penalty, self-attention, and perceptual loss.
B. Unsupervised Learning based methods
It is hard to obtain high-quality depth datasets that account for all possible background conditions. Unsurprisingly, improving supervised methods beyond a certain point is difficult due to this lack of accurate data. Semi-supervised and unsupervised methods remove the requirement of a target depth map, and hence are not limited by this constraint.
The method introduced in “Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue” involves generating the right image of a stereo pair given the left image (or vice versa). This can be performed by training an autoencoder as in the supervised scenario, so that the trained model can output a right image for any left image. We then calculate the disparity between the two images, which in our case is the displacement of a pixel (or block) in the right image with respect to its location in the left image. From the disparity, we can calculate depth, given the focal length of the camera and the baseline distance between the two viewpoints.
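The final step is the classical pinhole triangulation relation Z = f * B / d, where f is the focal length in pixels, B the baseline, and d the disparity. A minimal sketch with hypothetical values:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Stereo triangulation: depth Z = f * B / d.
    focal_px: focal length in pixels; baseline_m: camera separation in
    meters; disparity_px: pixel shift between the two images."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px
```

For instance, a 50-pixel disparity with a 1000-pixel focal length and 10 cm baseline puts the object 2 m away; note that depth resolution degrades as disparity shrinks for distant objects.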
The above method is considered truly unsupervised when the algorithm can adapt to non-stereo image pairs as well. This can be done by tracking the distance between the two viewpoints using the device's sensor data. Improvements can be made over this method, as in “Unsupervised Monocular Depth Estimation with Left-Right Consistency”, where the disparity is calculated with respect to both the left and the right image, and the depth is computed from both values.
The limitation of learning-based methods, especially supervised ones, is that they may not generalize well to all use cases. Analytical methods, on the other hand, may not have enough information to create a robust depth map from a single image. However, incorporating domain knowledge can aid the extraction of depth information in some cases.
For instance, consider Dark Channel Prior based haze removal. The authors observed that, in most haze-free outdoor images, local patches contain pixels with very low intensity in at least one color channel; haze raises these intensities. Using this observation, they created an analytical haze-removal method. Since haze is a function of depth, by comparing the dehazed image with the original, the depth can be easily recovered.
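The prior itself is straightforward to compute. A sketch of the dark channel (the per-pixel minimum over color channels, followed by a local minimum filter), assuming an H x W x 3 image array with values in [0, 1]:

```python
import numpy as np

def dark_channel(image, patch=3):
    """Dark channel of an RGB image: min over the color axis, then a
    patch x patch local minimum filter. Near zero for haze-free regions;
    haze lifts it in proportion to depth."""
    min_c = image.min(axis=2)                  # per-pixel channel minimum
    pad = patch // 2
    padded = np.pad(min_c, pad, mode="edge")   # replicate borders
    h, w = min_c.shape
    out = np.empty_like(min_c)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out
```

In the full method, the dark channel of the image normalized by atmospheric light gives the transmission map, and transmission decays exponentially with depth, which is how depth is recovered.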
A clear limitation of unsupervised methods is that they require additional information, such as camera focal length and sensor data, to measure image displacement. However, they do offer better generalization than supervised methods, at least in theory.
Applications of depth estimation
1. Augmented reality
One of the key applications of depth estimation is Augmented Reality (AR). A fundamental problem in AR is to place an object in 3D space such that its orientation, scale and perspective are properly calibrated. Depth information is vital for such processes.
One impressive application is IKEA’s demo, where you can visualize products in your home using an AR module before actually purchasing them. Using this method, we can check a product's dimensions as well as view it from multiple angles.
2. Robotics and object trajectory estimation
Objects in real life move in 3D space. However, since our displays are limited to two dimensions, we cannot directly observe motion along the third dimension.
With depth information, we can estimate the trajectory along the third dimension. Moreover, knowing the scale values, we can calculate the distance, velocity and acceleration of the object with reasonable accuracy. This is especially useful for robots that reach for or track objects in 3D space.
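Given a depth-augmented 3D track sampled at a fixed interval, velocity and acceleration follow from finite differences; a sketch with hypothetical sample data:

```python
import numpy as np

def velocity(positions, dt):
    """First finite difference of (N, 3) positions sampled every dt
    seconds; returns (N-1, 3) velocity vectors."""
    return np.diff(positions, axis=0) / dt

def acceleration(positions, dt):
    # Second finite difference: difference of successive velocities.
    return np.diff(velocity(positions, dt), axis=0) / dt
```

A robot controller would feed these estimates into a filter (e.g. a Kalman filter) rather than use the raw differences, which amplify depth noise.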
3. Haze and Fog removal
Haze and fog are natural phenomena whose effect is a function of depth: distant objects are obscured to a greater extent.
Hence, image-processing methods that aim to remove haze must estimate the depth information first. Haze removal is an active research topic, and several analytical and learning-based solutions have been proposed.
4. Portrait mode
Portrait mode on certain smartphones involves focusing on objects of interest and blurring the remaining regions. Blur applied as a function of depth creates a much more appealing image than uniform blur.
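A toy 1-D sketch of depth-dependent blur (all names and values hypothetical): the averaging radius grows with a pixel's depth distance from the plane of focus, so in-focus pixels stay sharp while the background is smoothed.

```python
import numpy as np

def portrait_blur_1d(signal, depth, focus_depth, max_radius=3):
    """Blur each sample with a box filter whose radius grows with the
    sample's depth distance from the plane of focus (radius 0 = sharp)."""
    out = np.empty_like(signal, dtype=float)
    for i in range(len(signal)):
        r = min(max_radius, int(round(abs(depth[i] - focus_depth))))
        lo, hi = max(0, i - r), min(len(signal), i + r + 1)
        out[i] = signal[lo:hi].mean()
    return out
```

Real portrait modes use a disc-shaped ("bokeh") kernel over 2D images and handle edges around the subject carefully, but the depth-to-radius mapping is the core idea.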
Depth estimation is a challenging problem with numerous applications. Through the efforts of the research community, powerful and inexpensive solutions using Machine Learning are becoming more commonplace. These and related advances pave the way for innovative depth estimation applications across many domains.