Depth Estimation Techniques

P-Bhandari
May 27, 2018


Want to estimate depth information from images in real time? Stuck on which algorithm to choose? You might want to start by looking at the hardware requirements first.

Wikipedia didn’t help, did it? Are you pulling your hair out by now? In this post I will explain the methods that can be used for estimating the distance of objects from the camera. I will describe the popular methods and compare their complexities and accuracies.

TL;DR at the bottom of the page.

Depth information is nowadays used by almost every vision algorithm. Why wouldn’t one use it! Almost all state-of-the-art algorithms use some combination of the available data (depth included) and extract as much information as they can from it. RGB-D images are pretty common and, with the boom in autonomous driving technology, a significant boost has been seen in the accuracy of depth estimation techniques.

In this post, I will briefly go through the different types of algorithms to use depending on the available hardware. We will restrict the discussion to a stereo camera as the acquisition device.

Calculating Disparity Maps

Using the stereo images, you can calculate the disparity map by estimating the distance a particular point has moved between the left and right images. An inverse relation exists between a point’s shift and the distance of the object from the camera, i.e. the larger the point’s movement, the smaller the distance of the object from the camera. Since a stereo camera setup assumes epipolar geometry, instead of searching for a point in the whole image we search for it only along the same horizontal line of the other image, usually within an offset and a range. In simpler words, all we have to do is find the correspondence of a point in the left image to its position in the right image along the same row. This drastically reduces the search space of our algorithm and turns a multi-dimensional problem into a single-dimensional one.
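
To make the inverse relation concrete, here is a minimal sketch that converts a disparity map into a depth map. The focal length (in pixels) and the baseline (the distance between the two cameras) come from your stereo calibration; the function name and the numbers in the example are only illustrative.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) into a depth map (metres).

    Depth is inversely proportional to disparity: Z = f * B / d.
    focal_length_px and baseline_m come from stereo calibration.
    """
    disparity = np.asarray(disparity_px, dtype=np.float32)
    depth = np.zeros_like(disparity)
    valid = disparity > 0                    # zero disparity = invalid / infinitely far
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# A point that shifted 40 px between the two views, seen through a 700 px
# focal length and a 12 cm baseline, sits roughly 2.1 m from the camera.
print(disparity_to_depth([40.0], focal_length_px=700.0, baseline_m=0.12))
```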

Even though the problem of finding correspondences between the two images sounds easy, it is quite difficult to find the correspondences of all the points, even when they lie on the same row. Not all points in an image are easily distinguishable, and even if they are, the image size varies anywhere from 100x100 to 640x480 and above. Finding the correspondences of all ~3 x 10⁵ points can pose some computational problems.

Despite these computational challenges, there are three types of algorithms that are most widely used:

Local Methods

These methods are among the simplest approaches to stereo matching. They rely on the reasonable assumption that corresponding pixels in the left and right images have similar colors, which is known as the photo-consistency assumption.

The local methods search for a point by exploiting its surroundings, usually via a block (5x5, 7x7). The whole block is matched against similarly sized blocks along the horizontal axis of the second image, and the block with the greatest similarity gives us the point’s location in the two images. The algorithm is simple, with an approximate time complexity of O(N*W), where N is the number of points to be searched and W is the number of pixels in the window. Usually a lot of optimization and color-correlation functions are used for matching the windows. Some of the popular matching costs are the Hamming distance, computation on Census-transformed images, weighted windows, and adaptive weighted windows. One of the methods with low computation time and higher accuracy is multi-block matching.
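
As an illustration of the window search described above, here is a minimal sketch of SAD (sum of absolute differences) block matching for a single pixel of a rectified pair. The function name and parameters are mine, not from any particular library, and border handling is deliberately naive to keep the sketch short.

```python
import numpy as np

def sad_block_match(left, right, y, x, block=7, max_disp=64):
    """Disparity of pixel (y, x) of the left image, found by sliding a
    block x block window along the same row of the right image and keeping
    the shift with the smallest sum of absolute differences."""
    h = block // 2
    ref = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best_disp, best_cost = 0, np.inf
    for d in range(0, min(max_disp, x - h) + 1):   # search along the same row only
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1].astype(np.float32)
        cost = np.abs(ref - cand).sum()
        if cost < best_cost:
            best_cost, best_disp = cost, d
    return best_disp
```

Real implementations replace the raw intensity difference with more robust costs such as the Census/Hamming measure mentioned above and compute the costs for the whole image at once instead of one pixel at a time.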

Simple block matching approaches are easy to implement on both processors and FPGAs and have low computational time (compared to global methods), but they are relatively less accurate. These methods are very popular, and a lot of research papers are available that use variations of the local methods.

Despite the high research attention that local methods have gained over the last couple of years, they do not render global approaches useless. The problem is that even large support windows are not sufficient for handling highly ambiguous data, that is, very weakly textured image regions.

Global Methods

Global methods differ from local ones in that they express the smoothness assumption in an explicit form via a so-called smoothness term. These methods involve an energy function E(D) that measures the quality of a disparity map D. The energy function is then optimized to find the disparity map with the lowest energy, i.e. the highest quality. The optimization of the energy function can be done in several ways: dynamic programming, graph cuts, message passing, etc. The algorithms themselves have high complexity because the energy function involves a high-dimensional variable, since each pixel’s disparity value is a variable. However, the performance of these methods is usually the best. The peak performance comes with a lot of time-complexity overhead, and it is quite difficult to implement these methods in hardware due to their high memory footprint. The best global matching methods are approximately 100 times slower than local matching. However, there are quite a few hardware implementations of global methods that try to manage the memory problem by using extra memory optimization and hardware.
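
To make the smoothness term concrete, here is a minimal sketch of the kind of energy E(D) a global method scores a candidate disparity map with: a data term measuring photo-consistency plus a smoothness term penalising disparity jumps between neighbouring pixels. The weighting constant and the function shape are illustrative; the actual minimisation (graph cuts, message passing, dynamic programming) is the hard part and is not shown here.

```python
import numpy as np

def energy(disparity, left, right, lambda_smooth=10.0):
    """E(D) = data term + lambda * smoothness term for a disparity map D.

    Data term: photo-consistency error between each left pixel and the right
    pixel it maps to. Smoothness term: absolute disparity differences between
    horizontal and vertical neighbours. A global method searches for the D
    that minimises this; the function below only evaluates one candidate."""
    h, w = disparity.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = np.clip(xs - disparity.astype(int), 0, w - 1)   # matched column in the right image
    data = np.abs(left.astype(np.float32)[ys, xs] -
                  right.astype(np.float32)[ys, xr]).sum()

    smooth = (np.abs(np.diff(disparity, axis=0)).sum() +
              np.abs(np.diff(disparity, axis=1)).sum())
    return data + lambda_smooth * smooth
```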

Semi-Global Methods

Semi-Global Matching (Hirschmüller, 2005 and 2008) successfully combines concepts of global and local stereo methods for accurate, pixel-wise matching at low runtime. The method replaces the full multi-dimensional optimization with one-dimensional optimizations along several paths through the image. This reduces the computational complexity of the global methods manyfold while retaining much of their accuracy. To better understand the optimization, I suggest you read the presentation by Dr. Hirschmüller.

Semi-global matching is one of the best approaches available for hardware implementations. I suggest you use this method due to its low complexity and high accuracy.
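
OpenCV ships a semi-global variant behind cv2.StereoSGBM_create, so you can try the approach without writing the optimization yourself. A minimal sketch, assuming a rectified grayscale pair stored as left.png and right.png (placeholder filenames) and fairly generic parameters:

```python
import cv2

# Load a rectified stereo pair as grayscale (filenames are placeholders).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# P1 and P2 are the smoothness penalties for small and large disparity
# changes between neighbouring pixels along each aggregation path.
block_size = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,              # must be divisible by 16
    blockSize=block_size,
    P1=8 * block_size * block_size,
    P2=32 * block_size * block_size,
    uniquenessRatio=10,
    speckleWindowSize=100,
    speckleRange=2,
)

# compute() returns fixed-point disparities scaled by 16.
disparity = sgbm.compute(left, right).astype("float32") / 16.0
```

The resulting disparity map can then be turned into depth with the Z = f * B / d relation from the beginning of the post.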

TL;DR

To summarize the three methods —

  • Local methods — simple block matching approach; easy to implement on both processors and FPGAs; low computational time (compared to global methods) but relatively less accurate.
  • Global methods — energy-based optimization; uses an energy function and optimizes it via dynamic programming or some other approach. Highly accurate, but computational time is higher than local methods. Difficult to implement in hardware, though there are some papers that provide solutions.
  • Semi-global methods — one of the latest benchmark methods, with accurate depth maps and low computational time. Uses a one-dimensional optimization approach along multiple paths. There have been some papers on its implementation on FPGAs; I would suggest using this method because of its low complexity and high accuracy.

Comparison based on time complexity:

Local < Semi-Global < Global

Comparison based on accuracy of the method:

Local < Semi-Global == Global

Hopefully now you can start building your depth estimation algorithm. In case you need any help regarding any of these methods, quite a few open-source implementations are available on GitHub. Even OpenCV provides APIs for the semi-global matching approach.

To view the benchmarks for the popular Middlebury dataset —

For a slightly enhanced local-method estimation technique, you can check out my work on GitHub. The paper is yet to be accepted; I will post the link after acceptance.

In case you come across any problems or anything interesting, do write it up, share, or comment on the post. Happy coding :)
