(PPS) Efficient Deep Learning for Stereo Matching

Kevin Shen · Mini Distill · Jun 19, 2018

This is another paper on stereo matching: the task of matching pixels between two images of the same scene to deduce depth information. A common application is autonomous driving, where two cameras simultaneously capture images of a scene and you want to know the depth of the objects in that scene. For their method, the authors assume rectification (roughly, that the two cameras are parallel), which means they only need to check how far a pixel is displaced left/right from one image to the other, never up/down. A left camera and a right camera each capture an image at the same instant, producing a left image and a right image, and the task is to match the pixels of the left image to the pixels of the right image. Once we have done that, a known equation relates the disparity, i.e. the per-pixel mismatch, to the depth of each pixel.
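To make that last step concrete, the standard relation for rectified cameras is depth = focal length × baseline / disparity. Here is a minimal sketch with made-up camera numbers (not values from the paper):

```python
# Standard pinhole relation for rectified stereo:
#   depth = focal_length * baseline / disparity
# The numbers below are made-up examples, not values from the paper.
focal_length_px = 700.0   # focal length, in pixels
baseline_m = 0.54         # distance between the two cameras, in meters
disparity_px = 40.0       # horizontal pixel shift of a matched point

depth_m = focal_length_px * baseline_m / disparity_px
print(f"depth = {depth_m:.2f} m")  # larger disparity -> closer object
```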

A visualization of stereo matching. In a, two cameras capture a scene with a person and a palm tree. The images produced by the two cameras, shown at the bottom, are clearly translated relative to each other. In b, we see the definition of disparity: how much a pixel in one image is translated in the other image. Disparity is a per-pixel value.

The previous state of the art for stereo matching used Siamese networks: twin neural networks that share parameters. The left and right images are passed through the same network and compared in latent space. However, it is often too difficult to learn from entire images at once, so in practice patches of the left and right images are compared instead. In this way, Siamese approaches reduce stereo matching to binary classification: match or don't match.
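As a rough sketch of that idea (my own simplification in PyTorch, not any specific published architecture): both patches go through the same encoder, and the similarity of their embeddings serves as the match/no-match score.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Shared ("Siamese") CNN that maps a small patch to a feature vector."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dimensions
            nn.Flatten(),
        )

    def forward(self, patch):
        return self.net(patch)

encoder = PatchEncoder()
left_patch = torch.randn(1, 3, 9, 9)   # a 9 x 9 RGB patch from the left image
right_patch = torch.randn(1, 3, 9, 9)  # a candidate patch from the right image

# Both branches use the same weights; the dot product of the embeddings acts
# as a match score, trainable with a binary (match / no-match) loss.
score = (encoder(left_patch) * encoder(right_patch)).sum(dim=1)
```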

In their approach, the authors compute a disparity image: an “image” that stores a displacement value at each location instead of color (RGB) information. The displacement value specifies how far that pixel in the left image is shifted in the right image.
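To make the definition concrete, here is a toy sketch (my own, with fabricated numbers): a disparity image is just a 2D array, and the value at (row, col) tells you where that left-image pixel lands in the right image.

```python
import numpy as np

# A tiny 4 x 6 "disparity image": one horizontal displacement per pixel.
# Values are made up purely for illustration.
disparity = np.array([
    [3, 3, 3, 2, 2, 2],
    [3, 3, 2, 2, 2, 1],
    [4, 3, 3, 2, 1, 1],
    [4, 4, 3, 3, 1, 1],
], dtype=np.int32)

row, col = 2, 4
d = disparity[row, col]
# Under rectification the match lies on the same row, shifted horizontally by d:
print(f"left pixel ({row}, {col}) matches right pixel ({row}, {col - d})")
```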

Below is the proposed model:

Visualization of the proposed model.

Here’s the main idea. We’re trying to figure out the disparity value for each pixel in a k⨯k patch of the left image. The patch in question is shown in blue on the bottom left-hand side. We cut out N k⨯k blocks of the right image (along the same row, since the two images can only be displaced left/right), under the assumption that the disparity of any pixel is at most N. This is shown on the bottom right-hand side. The N k⨯k blocks are not explicitly drawn as separate, but we can think of them that way. All the patches are passed through a CNN to get latent representations: the left and right feature volumes. Finally, the left feature volume (think of it as a set of “convolution weights”) is convolved across the right feature volume, outputting a k⨯k score map for each of the N locations. The output is therefore N⨯k⨯k, and each pixel in the left patch undergoes an N-way classification over disparity values.
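Here is a rough NumPy sketch of that last correlation step, based on my reading of the figure; the shapes, variable names, and shift convention are my own assumptions, not taken from the paper. The left feature patch is slid like a convolution kernel over the N candidate offsets of the right feature volume, and the scores are softmaxed into an N-way distribution per pixel.

```python
import numpy as np

C, k, N = 32, 9, 128   # feature channels, patch size, number of candidate disparities
rng = np.random.default_rng(0)

# Left feature volume: one C-dim feature per pixel of the k x k left patch.
left_feat = rng.standard_normal((C, k, k))
# Right feature volume: features of a strip wide enough to cover all N offsets.
right_feat = rng.standard_normal((C, k, k + N - 1))

# For each candidate disparity d, take the per-pixel inner product between the
# left features and the right features at offset d -> one k x k score map.
# (The sign/direction of the offset depends on which image is the reference.)
scores = np.empty((N, k, k))
for d in range(N):
    scores[d] = np.einsum("chw,chw->hw", left_feat, right_feat[:, :, d:d + k])

# Softmax over the N offsets: each left-patch pixel gets an N-way
# classification over candidate disparity values.
probs = np.exp(scores - scores.max(axis=0, keepdims=True))
probs /= probs.sum(axis=0, keepdims=True)
assert probs.shape == (N, k, k)
```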

The authors report 5% error at a 5-pixel tolerance and 10% error at a 2-pixel tolerance on the KITTI 2015 dataset. The model runs at 1 fps.
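The metric here is the usual one for KITTI-style benchmarks: the fraction of pixels whose predicted disparity is off by more than a threshold. A quick sketch of how that is computed (my own code, with toy values):

```python
import numpy as np

def pixel_error(pred, gt, tol_px):
    """Fraction of pixels whose disparity error exceeds tol_px pixels."""
    return np.mean(np.abs(pred - gt) > tol_px)

# Toy example with fabricated disparity maps.
gt = np.full((4, 4), 10.0)
pred = gt + np.array([[0, 0, 3, 0],
                      [0, 6, 0, 0],
                      [0, 0, 0, 0],
                      [0, 0, 1, 0]], dtype=float)

print(pixel_error(pred, gt, tol_px=2))  # 0.125  -> 12.5% of pixels off by > 2 px
print(pixel_error(pred, gt, tol_px=5))  # 0.0625 -> 6.25% of pixels off by > 5 px
```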

As a closing remark, stereo matching remains largely an unsolved problem. Some of the biggest challenges include occlusions (how do we estimate disparity for objects that are visible in one image but blocked in the other?), textureless regions, and repetitive patterns. Furthermore, most applications require depth information in real time.
