Stereo Vision: Depth Estimation between object and camera

Apar Garg
Published in Analytics Vidhya
6 min read · Nov 17, 2021

Problem

It is not possible to estimate the distance (depth) of a point object ‘P’ from the camera using a single camera ‘O’. However close or far ‘P’ lies along the projective ray, it maps to the same point ‘p’ in the image, so a single view cannot disambiguate depth.

(Figure: every point along the projective ray maps to the same image point. Source: [1])

Solution

Stereo vision is a technique used to estimate the depth of a point object ‘P’ from the camera using two cameras. The foundation of stereo vision is similar to 3D perception in human vision and is based on the triangulation of rays from multiple viewpoints.

Popular stereo camera systems

1. Parallel system

(Figure: parallel stereo camera system. Source: [1])

2. General system

(Figure: general stereo camera system. Source: [1])

In this tutorial, we’ll be using the Parallel stereo camera system for depth estimation.

Depth estimation

Overview

(Figure: depth estimation using a parallel stereo camera system, with the following notation)
  • P = target point in the physical world (scene point)
  • PL = (xl, yl) = projection of P in the left camera image
  • PR = (xr, yr) = projection of P in the right camera image
  • n1 = horizontal pixel coordinate of PL in the left image (so xl = n1 × d)
  • n2 = horizontal pixel coordinate of PR in the right image (so xr = n2 × d)
  • T = baseline distance between the centers of the left and right cameras
  • f = focal length of the cameras
  • d = physical size of a pixel on the camera sensor (CMOS/CCD)
  • Z = depth: the distance of point P from the camera baseline

Disparity: D = n1 − n2 (in pixels). In physical units on the sensor, this is xl − xr = d × (n1 − n2) = d × D.

Using similar triangles between the two views, we finally have:

Z = (f × T) / (xl − xr) = (f × T) / (d × D)
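To make the formula concrete, here is a tiny Python sketch (the function name and unit conventions are my own illustration, not from the original article):

```python
def depth_from_disparity(f_cm, T_cm, d_cm_per_px, D_px):
    """Depth Z = (f * T) / (d * D) for a parallel stereo rig.

    f_cm        -- focal length in cm
    T_cm        -- baseline in cm
    d_cm_per_px -- physical pixel size in cm/pixel
    D_px        -- disparity in pixels (must be > 0)
    """
    return (f_cm * T_cm) / (d_cm_per_px * D_px)

# The worked example later in this article:
print(depth_from_disparity(20, 10, 0.1, 1))  # 2000.0 cm = 20 m
```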

What can we learn from this equation? 📚

  • Depth is inversely proportional to disparity. The larger the disparity, the closer the object is to the camera baseline; the smaller the disparity, the farther away it is.
(Figure illustrating the inverse relationship between depth and disparity. Source: [2])
  • Disparity is proportional to the baseline. This is easy to visualize: if the baseline between the two cameras is small, the difference (disparity) between the two images will be small. As we increase the baseline, the disparity scales up.

These are two very important considerations. When we design a stereo system, we want to measure disparity very precisely, because that is what gives us depth. For this, we should use a stereo configuration whose baseline is large enough, because the larger the baseline, the more precisely we can measure disparity.

Stereo Matching

For point PL, how did we get the corresponding PR? 🙄

Template Matching!!

We first take a window/block of pixels in the left image with PL as the center of that window.

Now we need to use this window as a template to find a matching window of the same size in the right image.

But here’s a catch!! 💡 We don’t need to scan the whole right image to find the matching window. The corresponding window must lie on the same horizontal line (called the scanline) in the right image, because a parallel stereo camera system produces no disparity in the vertical direction. In other words, yl = yr.

(Figure: sliding a window along the scanline for template matching)

So we take the template in the left image and compute similarity with all the windows along the scanline in the right image.

How to calculate similarity?? 🤔

There are a number of methods to calculate similarity. A few of them are:

1. Sum of Squared Differences (SSD)

For a template window W centered on PL in the left image IL and a candidate disparity D in the right image IR:

SSD(D) = Σ over (u, v) in W of [ IL(u, v) − IR(u − D, v) ]²

Window with Minimum SSD = Most Similar/Matching Window.

2. Sum of Absolute Differences (SAD)

With the same notation as above:

SAD(D) = Σ over (u, v) in W of | IL(u, v) − IR(u − D, v) |

Window with Minimum SAD = Most Similar/Matching Window.

The center of the most similar window will be called PR.
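To make the procedure concrete, here is a minimal, unoptimized NumPy sketch of SSD matching along a scanline. All names are my own, and bounds handling is kept simple; a real system would use an optimized matcher such as OpenCV’s (used in the code section below):

```python
import numpy as np

def match_along_scanline(left, right, y, x_l, half_w=7, max_disp=64):
    """Return x_r, the column whose window in `right` best matches
    (by minimum SSD) the window centered at (x_l, y) in `left`.
    Assumes the window stays inside both images."""
    template = left[y - half_w:y + half_w + 1,
                    x_l - half_w:x_l + half_w + 1].astype(np.float32)
    best_x, best_ssd = x_l, np.inf
    # In a parallel rig the match can only shift left: x_r <= x_l, y_r = y.
    for x_r in range(max(half_w, x_l - max_disp), x_l + 1):
        window = right[y - half_w:y + half_w + 1,
                       x_r - half_w:x_r + half_w + 1].astype(np.float32)
        ssd = float(np.sum((template - window) ** 2))
        if ssd < best_ssd:
            best_x, best_ssd = x_r, ssd
    return best_x  # disparity D = x_l - best_x
```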

Example

Q. Your vision system consists of two calibrated cameras in a parallel configuration. They have the same focal length of 20 cm and a baseline of 10 cm; the pixel size of the sensor is 0.1 cm/pixel, and each image has a resolution of 1920×1080 pixels. What is the maximum distance the system can measure?

A. Given: f = 20 cm, T = 10 cm, d = 0.1 cm/pixel.

Goal: To maximize Z.

After inserting the given values into the equation, we get:

Z = (f × T) / (d × D) = (20 × 10) / (0.1 × D) = 2000 / D cm

To maximize Z, we need the minimum disparity, and the smallest disparity we can measure is one pixel, so D = 1.

So our vision system can measure a maximum distance of 2000 cm = 20 m.

Depth Map

If we calculate the disparity for every pixel location in the two images, we get a disparity map, which the equation above converts into a depth map/image. If we attach this depth map as the 4th channel of an RGB image, we call the result an RGB-D image.
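For instance, assuming we already have an aligned RGB image and a depth map as NumPy arrays (rgb and depth are hypothetical variable names), stacking them is a one-liner:

```python
import numpy as np

# rgb:   H x W x 3 uint8 array;  depth: H x W array, scaled to uint8
rgbd = np.dstack([rgb, depth])  # H x W x 4 "RGB-D" array
```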

Enough theory. It’s time to code!!👨🏻‍💻

Step 1 — Load necessary packages
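The exact gist isn’t reproduced here, but for the steps below the imports would plausibly be:

```python
import cv2                        # OpenCV: image I/O and stereo matching
import numpy as np                # array handling
import matplotlib.pyplot as plt   # displaying images and disparity maps
```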

Step 2 — Load the left and right images in grayscale

Images have been taken from the Middlebury Stereo Vision Dataset [3].
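A sketch of this step, with placeholder file names standing in for the Middlebury pair:

```python
# 'left.png' and 'right.png' are placeholder names for the
# rectified Middlebury image pair [3].
img_left = cv2.imread('left.png', cv2.IMREAD_GRAYSCALE)
img_right = cv2.imread('right.png', cv2.IMREAD_GRAYSCALE)
assert img_left is not None and img_right is not None, 'check image paths'
```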

Step 3 — Display the two input images
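For example, side by side with Matplotlib:

```python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(img_left, cmap='gray')
axes[0].set_title('Left image')
axes[1].imshow(img_right, cmap='gray')
axes[1].set_title('Right image')
for ax in axes:
    ax.axis('off')
plt.show()
```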

Step 4 — Define a function to calculate and show disparity
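The article’s original function is in the linked gist; as a stand-in, here is a sketch using OpenCV’s StereoBM block matcher, which performs the SAD-style window matching described above (the parameter defaults are my own choices):

```python
def show_disparity(img_left, img_right, bSize, num_disp=64):
    """Compute and display a disparity map with OpenCV's block matcher.

    bSize    -- odd block (window) size, e.g. 5 or 25
    num_disp -- disparity search range; must be a multiple of 16
    """
    stereo = cv2.StereoBM_create(numDisparities=num_disp, blockSize=bSize)
    # compute() returns fixed-point disparities scaled by 16
    disparity = stereo.compute(img_left, img_right).astype(np.float32) / 16.0
    plt.imshow(disparity, cmap='gray')
    plt.title(f'Disparity map (block size = {bSize})')
    plt.axis('off')
    plt.show()
```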

Step 5 — Perform depth estimation
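Using the sketch above:

```python
show_disparity(img_left, img_right, bSize=5)
show_disparity(img_left, img_right, bSize=25)
```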

(Output: disparity map with block size (bSize) = 5)

(Output: disparity map with block size (bSize) = 25)

Notice the trade-off: a smaller block size preserves fine detail but yields a noisier disparity map, while a larger block size yields a smoother but blurrier one.

Code can be found here.

Misnomer “3D”

A real 3D display would present the scene in three full dimensions. What you see in stereo vision is just a 2D image with some depth information, not real 3D. It tricks your brain by sending two different 2D images taken from two different points of view (left and right eye), so you get a kind of false depth perception on the screen. You can’t move aside to see things from different angles or to see what’s behind them. What you are looking at is fake 3D, or 2.5D.

Mobile 3D Cameras 📱

As the cost and size of mobile cameras and 3D sensors continue to fall, consumer electronics manufacturers are finding it increasingly practical to include 3D cameras in the next generation of mobile devices.

One of the earliest consumer mobile 3D devices was the LG Thrill. This Android-powered smartphone had two 5-megapixel cameras that captured 3D images and video using the stereoscopic effect. Despite having a glasses-free 3D display, the device failed to gain traction in the market due to its enormous weight and bulk, as well as its comparatively low photograph quality [4].

Other 3D data representations

  • Volumetric
  • Polygonal mesh
  • Point cloud

References

  1. http://www.cs.toronto.edu/~fidler/slides/2015/CSC420/lecture12_hres.pdf
  2. https://www.oreilly.com/library/view/learning-opencv-3/9781491937983/
  3. http://vision.middlebury.edu/stereo/data/scenes2001/
  4. Kadambi, Achuta, Ayush Bhandari, and Ramesh Raskar. “3d depth cameras in vision: Benefits and limitations of the hardware.” In Computer vision and machine learning with RGB-D sensors, pp. 3–26. Springer, Cham, 2014.

That will be it for this article.

Don’t forget to 👏 if you liked the article.

If you have any questions or would like something clarified, you can find me on LinkedIn.

~happy learning.
