(PPS) Cascade Residual Learning: A Two-stage Convolutional Neural Network for Stereo Matching

Kevin Shen
Published in Mini Distill
3 min read · Jun 18, 2018


This paper tackles stereo matching: the task of matching pixels between two images taken by two different cameras in order to deduce depth information, much as the human brain deduces depth from the two eyes. In this paper the two cameras are assumed to be at the same height and facing the same direction; they are separated only left-right. Suppose each camera photographs the same scene at the same time, producing a left image and a right image. Given pixel (x, y) in the left image, we ask: "what is the disparity d such that pixel (x + d, y) in the right image corresponds to pixel (x, y) in the left image?" (As a side note, this is a good paper to read if you know nothing about stereo matching, since the authors go over the basics.)

Suppose we have some way to figure out the disparity d. Then, given d, we can deduce the depth of objects in an image. In particular, if the disparity at pixel (x, y) is d, then the depth is:

depth(x, y) = f · l / d

Pixel-wise equation relating disparity to depth.

Here f is the focal length and l is the baseline (the distance between the two cameras), both parameters of the camera setup. Therefore, to figure out the depth of objects in an image, we simply need to figure out the disparity of each pixel.
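The depth formula above is easy to apply pixel-wise once a disparity map is in hand. A minimal sketch, with made-up camera parameters (the paper does not specify these):

```python
import numpy as np

# Hypothetical camera parameters, not from the paper.
f = 721.5   # focal length, in pixels
l = 0.54    # baseline: distance between the two cameras, in meters

# A toy disparity map: each entry is the disparity d (in pixels) at that pixel.
disparity = np.array([[36.0, 72.0],
                      [18.0,  9.0]])

# depth = f * l / d, applied pixel-wise: larger disparity means closer object.
depth = f * l / disparity
```

Note the inverse relationship: nearby objects shift a lot between the two views (large d, small depth), while distant objects barely move.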

However, stereo matching is a difficult task because we need to deal with the following complications:

  • Occlusion: a pixel appearing in the left image is blocked and doesn’t appear in the right image
  • Repeated patterns: for example, consider an image of a fence with many similar-looking planks; it’s hard to tell which repetition (plank) a pixel belongs to
  • Textureless regions: no landmarks to identify matching pixels

While this paper doesn’t completely solve any of these challenges, the authors propose an end-to-end neural network method that improves on previous state-of-the-art.

Here is their proposed model:

End-to-end neural network model for stereo matching.

The model consists of two stages. In the first stage, the left and right images are passed through a neural network (shown in blue) that outputs a “disparity image”: each pixel holds a scalar disparity value d instead of a color. The second stage, which is the authors’ contribution, passes the disparity image into another neural network (shown in orange) for refinement. Previous works handled this refinement with hand-coded post-processing tricks to clean up the disparity image; this paper replaces those tricks with a neural network.

In particular, the second NN receives:

  • The initial disparity prediction d_1
  • The warped left image Ĩ_L: essentially, “what would the left image I_L look like if we took the disparity d_1 seriously and transformed the right image I_R by d_1?”
  • The error image e_L = I_L − Ĩ_L
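The warping step above can be sketched in a few lines. This is my own simplified nearest-neighbor version (function name and sampling scheme are illustrative; the paper’s network would use a differentiable, bilinear warp), using the article’s convention that left pixel (x, y) corresponds to right pixel (x + d, y):

```python
import numpy as np

def warp_right_to_left(right_img, disparity):
    """Synthesize the left view from the right image plus a disparity map.

    For each left-image pixel (x, y), sample the right image at (x + d, y).
    Nearest-neighbor sampling for clarity; out-of-bounds samples stay zero.
    """
    h, w = disparity.shape
    warped = np.zeros_like(right_img)
    for y in range(h):
        for x in range(w):
            src_x = int(round(x + disparity[y, x]))
            if 0 <= src_x < w:
                warped[y, x] = right_img[y, src_x]
    return warped

# Toy 1x4 "image" with a uniform disparity of 1 pixel.
right = np.array([[0.0, 1.0, 2.0, 3.0]])
disp = np.ones((1, 4))
warped = warp_right_to_left(right, disp)  # → [[1., 2., 3., 0.]]
```

The error image e_L = I_L − Ĩ_L is then just a subtraction: wherever d_1 is wrong, Ĩ_L disagrees with I_L and the error lights up, pointing the second network at exactly the regions that need fixing.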

Intuitively, it makes sense to input both the error image e_L and the hypothetical image Ĩ_L (imagine how much easier it would be for a human to pick out problems with the disparity map given these two things). The authors call the second neural network a “resnet” because it only needs to predict corrections (a.k.a. residuals) to d_1, rather than a completely new disparity map. This is seen in the bottom part of the diagram, where d_1 is added to the output of the NN.
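The residual connection itself is a one-liner. A sketch, with illustrative names (the stand-in `correction` array here plays the role of the second network’s output):

```python
import numpy as np

def refine(d1, correction):
    """Stage 2 predicts a residual correction to the stage-1 disparity d1,
    rather than a whole new disparity map: d2 = d1 + correction."""
    return d1 + correction

d1 = np.array([[10.0, 10.0],
               [10.0, 10.0]])          # initial disparity prediction
correction = np.array([[0.5, -0.2],
                       [0.0,  0.1]])   # small residual from the second NN
d2 = refine(d1, correction)
```

The appeal of this design is that when d_1 is already good, the second network can safely output values near zero, which is an easier target to learn than reproducing the full disparity map from scratch.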

Finally, the authors employ a trick whereby they densely supervise the second network: it makes disparity predictions at multiple resolutions (downsampled by powers of 2), each contributing to the loss. I’m guessing this is to maintain global consistency (pixels of the same object should be at roughly similar depths) while getting local precision (crisp boundaries between neighboring objects).
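Building those multi-resolution supervision targets amounts to downsampling the ground-truth disparity map. A sketch under my own assumptions (average pooling, and rescaling the disparity values along with the resolution so they stay measured in pixels at each scale; the paper may differ in these details):

```python
import numpy as np

def downsample_disparity(d, factor):
    """Downsample a disparity map by an integer factor via average pooling.

    Disparities are measured in pixels, so their values are also divided
    by the factor to stay consistent at the coarser resolution.
    """
    h, w = d.shape
    h2, w2 = (h // factor) * factor, (w // factor) * factor
    d = d[:h2, :w2]  # crop so the map divides evenly
    pooled = d.reshape(h2 // factor, factor, w2 // factor, factor).mean(axis=(1, 3))
    return pooled / factor

# Supervision targets downsampled by powers of 2: factors 1, 2, 4.
full_res = np.full((8, 8), 16.0)   # toy ground-truth disparity map
targets = [downsample_disparity(full_res, 2 ** k) for k in range(3)]
```

Each entry of `targets` then supervises the network’s prediction at the matching resolution, so coarse scales enforce globally consistent depth while the full-resolution loss sharpens object boundaries.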
