Extract depth from any 3D SBS movie

Pablo Dawson
5 min read · Jul 18, 2023


If you want to try this directly instead of learning about the method, you can use Touchly Renderer Pro and play the results as volumetric video in the Touchly App for Quest.

It’s safe to say that 3D movies had their heyday a while ago, with the release of iconic films like Avatar and Gravity.

However, it is evident that their popularity has waned over time. Despite this decline, I’ve always wanted to find a meaningful way to utilize the depth present in these worlds.

Implementing this in my app Touchly, to turn these movies into volumetric videos, was a big motivation for the same reason.

First, a small introduction to how 3D movies work

Most of this section is based on this presentation; the diagrams were copied directly.

Humans perceive depth with a combination of cues: monocular cues like parallax, overlapping, and shadows, as well as binocular cues.

For the latter, your brain infers the distance of objects by combining the images from each eye.

We can mimic this in film by using two cameras shooting at slightly different viewpoints, then showing one to each eye. These are typically spaced at roughly adult eye ‘interocular’ distance (approx 6cm).

The intuitive solution to creating 3D movies would be to also position the cameras in parallel, similar to our eyes, right?

This is generally desired to simplify stereo vision tasks. But in 3D filmmaking, convergence is generally used instead.

The convergence point determines where the object appears in relation to the screen.

This can be done by recording footage with the cameras at an angle or with post-production (the latter is most common as it is more flexible).

This gives filmmakers control over eye accommodation so that it does not cause discomfort, and lets them adjust the Depth Budget to create effective 3D.

Parallax, the relative difference in position of the same object in both images, now takes positive and negative values for things in front of and behind the screen. We’ll see why this matters afterward.

How do computers sense depth? (the usual way)

The same way our eyes do: we can try to infer depth by using the images from both the left and right cameras. This field is called Stereo Vision.

Let’s say we have an ideal setup, meaning:

  • Cameras are parallel and vertically aligned.
  • Camera distortions are negligible.

Or:

  • We know all necessary camera parameters, so we can “fake” an ideal setup by applying stereo rectification.
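For illustration, here is a minimal sketch of how that rectification step might look with OpenCV, assuming the intrinsics, distortions, and the rotation/translation between the cameras are already known. The parameter values and file names below are placeholders, not real calibration data:

```python
import numpy as np
import cv2

# Placeholder camera parameters; in practice these come from calibration or metadata
image_size = (1920, 1080)                        # (width, height)
K1 = K2 = np.array([[1000.0, 0, 960], [0, 1000.0, 540], [0, 0, 1]])  # intrinsics
D1 = D2 = np.zeros(5)                            # distortion coefficients
R = np.eye(3)                                    # rotation between the two cameras
T = np.array([0.06, 0.0, 0.0])                   # ~6 cm baseline along x

# Compute the rectification transforms that "fake" the ideal parallel setup
R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)

# Remap one of the views into the rectified geometry
map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
left = cv2.imread("left.png")                    # placeholder path
left_rectified = cv2.remap(left, map1x, map1y, cv2.INTER_LINEAR)
```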

Then if we know:

  • The projections of each point P(x,y,z) in the left (xL, yL) and right (xR, yR) image.
  • The disparity (d) (derived from the above), that is, the horizontal pixel difference between xR and xL for each point.
  • The focal length (f) of the camera in pixels.
  • The baseline (B), i.e., the distance between the optical centers of the two cameras.

After doing triangulation and similar triangles (outside the scope of this article), we arrive at this formula for Depth (Z):

Z = B * f / d

In simple terms: If we have an ideal setup, and know the horizontal pixel difference of the same point in space in both cameras (disparity), we can infer its depth (Z).
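As a quick sanity check, here is a minimal NumPy sketch of that conversion (the function name and the example values are just illustrative):

```python
import numpy as np

def disparity_to_depth(disparity_px, baseline_m, focal_px):
    """Z = B * f / d, applied per pixel; zero disparity is left as zero depth."""
    depth = np.zeros_like(disparity_px, dtype=np.float32)
    valid = disparity_px > 0
    depth[valid] = baseline_m * focal_px / disparity_px[valid]
    return depth

# A point with a 32 px disparity, a 6 cm baseline and a 1000 px focal length
# ends up at Z = 0.06 * 1000 / 32 ≈ 1.9 m
example = disparity_to_depth(np.array([[32.0]]), baseline_m=0.06, focal_px=1000.0)
```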

So with this we learn that:

  • Depth is inversely proportional to disparity
  • Disparity is proportional to baseline

This article is a more in-depth dive.

We can then use algorithms to identify matching points in both images and generate a Disparity Map from their horizontal pixel differences.

A common method for this is Stereo Block Matching: it calculates disparities by dividing the image into windows of various sizes and matching them between the left and right images.
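For reference, OpenCV ships a classical block matcher; a minimal sketch, with placeholder file names and parameter values:

```python
import cv2

# Rectified left/right views, loaded as grayscale (placeholder file names)
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# numDisparities must be a multiple of 16; blockSize is the odd matching window size
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)

# compute() returns fixed-point disparities scaled by 16
disparity = stereo.compute(left, right).astype("float32") / 16.0

# Normalize to 0-255 to visualize it as the grayscale maps described below
vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```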

Most modern methods, though, involve using machine learning to extract features and then matching those features.

Disparity maps look like this:

The grayscale intensity represents the disparity at each pixel. It can then be converted to distance using the formula presented above.

Now back to 3D movies

This part is based on the MiDaS paper.

Now let’s just apply these principles to extract depth from 3D movies, right?

There are a few caveats that prevent us from doing that:

  • The camera setup is not like our ideal case. Remember convergence? Other parameters are also unknown.
  • Even if we manage to get the camera parameters and try rectification, they usually change over the duration of the movie to fit the artistic vision of each scene, for example by varying the Depth Budget.
  • As we saw, convergence creates positive and negative parallax. Current stereo vision methods were created to work well with positive disparities only.

The only “easy” solution we have left is to calculate the shift of pixels between both images, regardless of the setup and parameters, and use it as a proxy for disparity.

This will give us relative (non-metric) depth only. But as this is meant to work on volumetric video experiences, metric depth is not necessary.

What we’re looking for, then, is Optical Flow

Optical flow algorithms can handle both positive and negative disparities.

From the OpenCV docs:

“Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of object or camera.”

It is a 2D vector field where each vector is a displacement vector showing the movement of points from first frame to second.
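As a concrete example, OpenCV’s classical Farneback method (not the method used later in this article) produces exactly that per-pixel displacement field. A minimal sketch with placeholder file names:

```python
import cv2

# Two frames (placeholder file names), loaded as grayscale
frame1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
frame2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: flow[y, x] = (dx, dy) displacement of each pixel
flow = cv2.calcOpticalFlowFarneback(frame1, frame2, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

horizontal, vertical = flow[..., 0], flow[..., 1]
```

In our case, the “two consecutive frames” will instead be the left and right views of the same SBS frame.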

Our process would be to:

For each video frame:

  1. Calculate the optical flow between the left and right images.
  2. We want something similar to horizontal disparity (which we used to get depth), so only consider the horizontal flow.
  3. Because of convergence, we will have positive and negative flow, so offset all vectors to be positive only.

The offset in step 3 can be adjusted dynamically to the most negative value per frame if you want all depth information to be preserved.

In my case, I used RAFT for optical flow.
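The exact implementation isn’t shown here, but below is a minimal sketch of the three steps using the pretrained RAFT model from torchvision; the file names, resolution, and pre-processing are assumptions for illustration:

```python
import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).to(device).eval()
preprocess = weights.transforms()

# Left/right views of one SBS frame (placeholder paths);
# RAFT expects height and width divisible by 8, hence the resize
left = TF.resize(read_image("left.png"), [720, 1280]).unsqueeze(0)
right = TF.resize(read_image("right.png"), [720, 1280]).unsqueeze(0)
left, right = preprocess(left, right)

with torch.no_grad():
    # 1. Optical flow from the left to the right image
    #    (RAFT returns a list of iterative refinements; the last one is the best)
    flow = model(left.to(device), right.to(device))[-1]

# 2. Keep only the horizontal component as our disparity proxy
horizontal = flow[0, 0].cpu()            # shape (H, W)

# 3. Offset so every value is positive (convergence produces negative parallax too)
depth_proxy = horizontal - horizontal.min()
```

Since this only uses the horizontal flow as a proxy for disparity, the result is a relative depth map, which is all the volumetric playback needs.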

The results with Avatar 2 look something like this:
