How many numbers does it take to compute Optical Flow?

Anurag Ranjan
5 min read · Jun 1, 2017


The Deep Learning story of Optical Flow

Cover image from MPI-Sintel

As the wave of deep learning took control of all the major computer vision problems, one of them still managed to escape its reach for some time: computing optical flow, the motion between frames of a video sequence, a long-standing problem in computer vision. The problem can be stated as follows:

A point moving from (x, y) to (x+u, y+v) in two video frames has optical flow (u, v) [sensblogs]

Given two image frames of a continuous video sequence, estimate the motion of each and every pixel in the first frame. Assuming that pixel intensities are constant from one frame to the next, we can write

I(x, y, t-1) = I(x + u, y + v, t)

where I is the image intensity as a function of space (x, y) and time t, and (u, v) is the velocity field, or optical flow, that we need to estimate. This means that if we take the first image I(x, y, t-1) and move each of its pixels by (u, v), we obtain the next image I(x + u, y + v, t).
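To make the assumption concrete, here is a minimal sketch in NumPy/SciPy; the helper name warp_by_flow and the toy one-pixel shift are illustrative, not from any paper. If we know the true flow, sampling the second frame at the displaced positions reconstructs the first frame.

```python
# A minimal sketch of brightness constancy, assuming NumPy and SciPy.
import numpy as np
from scipy.ndimage import map_coordinates

def warp_by_flow(frame, u, v):
    """Sample frame at (y + v, x + u) with bilinear interpolation."""
    h, w = frame.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    return map_coordinates(frame, [y + v, x + u], order=1, mode='nearest')

# Toy example: the scene shifts one pixel to the right, so u = 1, v = 0.
frame_a = np.random.rand(64, 64)                  # I(x, y, t-1)
frame_b = np.roll(frame_a, 1, axis=1)             # I(x, y, t)
u, v = np.ones((64, 64)), np.zeros((64, 64))      # the true flow field
recon = warp_by_flow(frame_b, u, v)               # I(x + u, y + v, t)
print(np.abs(recon - frame_a)[:, :-1].max())      # 0.0 away from the border
```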

Deep Networks in Optical Flow estimation [2]

Although estimating optical flow (u, v) has historically been treated as an optimization problem, more recent approaches [1, 2, 3] apply deep learning to the task. These methods take two video frames, pass them through a deep neural network, and out comes the optical flow.

f is a neural network that computes optical flow between two consecutive image frames
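As a sketch of this formulation, here is a toy PyTorch network of my own; it is a placeholder standing in for f, not FlowNet's actual architecture. It maps a concatenated pair of RGB frames to a two-channel (u, v) field at the same resolution.

```python
# A toy "two frames in, flow out" network; a stand-in for f, not FlowNet.
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 2, kernel_size=7, padding=3),   # 2 channels: (u, v)
        )

    def forward(self, frame1, frame2):
        # Stack the two RGB frames into a 6-channel input, predict flow.
        return self.layers(torch.cat([frame1, frame2], dim=1))

f = TinyFlowNet()
frame1, frame2 = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
flow = f(frame1, frame2)   # shape (1, 2, 128, 128): one (u, v) per pixel
```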

However, computing optical flow with deep neural networks has been challenging because of the unavailability of training data. There has been no way to generate large sets of realistic training data for optical flow: the labels require the exact motion of each and every point in the image to sub-pixel accuracy. Humans cannot annotate motion that precisely, so they cannot provide labels for training.

To get around the problem of training data, we turn to computer graphics. We can simulate large worlds in which we know the motion of each and every point in the video sequence. One such attempt is the MPI-Sintel dataset [4], which took an open source CGI movie and rendered ground-truth optical flow for various sequences of the movie. However, the movie is not as realistic as the natural world. Moreover, the rendered optical flow sequences total only 1048 frames, an order of magnitude smaller than the typical dataset size needed for deep learning.

A scene from Sintel

Other researchers turned to simpler approaches. The Flying Chairs dataset [2] provides on the order of 20,000 image pairs, enough for training deep neural networks. However, the data consists of nothing but chairs flying against random backgrounds.

The Flying Chairs dataset. Frames on the left, and colour coded optical flow ground truths on the right.

The question was: can neural networks trained on such simple scenes of flying chairs generalize to natural scenes, which are far more diverse?

FlowNet [2] was the first deep network trained on Flying Chairs. Although its performance did not match the state of the art, it was better than most well-known methods of the past. The network consists of several convolutional layers, and a large number of parameters are learned: about 32 million.

The Game of Numbers

It seemed like the 32 million numbers in FlowNet could do a good job of representing motion. With these 32 million numbers, we could compute fairly good optical flow for natural sequences. The question we asked was: are 32 million parameters too many or too few for computing optical flow?

With this in mind, we introduced a new neural network architecture based on some recent deep learning ideas [5] and some very early computer vision ideas [6]. We called it the Spatial Pyramid Network, or SPyNet. SPyNet builds on the idea of spatial image pyramids introduced in the 1980s. A spatial image pyramid mirrors the multiple scales of processing in the human visual system: we use copies of the images at several different resolutions to compute optical flow.

An image pyramid
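A pyramid like the one pictured can be built by repeatedly blurring and downsampling the image. A minimal sketch, using average pooling as a simple stand-in for the Gaussian smoothing of classical pyramids [6]:

```python
# Build a spatial image pyramid by repeated 2x downsampling.
import torch
import torch.nn.functional as F

def image_pyramid(image, levels):
    """Return [full resolution, 1/2, 1/4, ...], finest level first."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(F.avg_pool2d(pyramid[-1], kernel_size=2))
    return pyramid

frames = torch.rand(1, 3, 256, 256)
for level in image_pyramid(frames, levels=4):
    print(level.shape)   # 256, 128, 64 and 32 pixels per side
```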

The architecture of SPyNet is quite simple. Instead of one big deep neural network with millions of parameters, we use a small neural network at each level of the pyramid. Each of these small networks does only part of the job: at its scale, it computes the residual optical flow that could not be estimated at the coarser scales. Propagating these residuals up the pyramid and combining them yields the complete flow.
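Here is a hedged sketch of that coarse-to-fine loop, reusing TinyFlowNet and image_pyramid from the sketches above. One simplification to note: the real SPyNet also feeds the upsampled flow itself into each level's network, which this toy version omits.

```python
# Coarse-to-fine residual flow estimation in the spirit of SPyNet (simplified).
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Bilinearly sample frame at positions displaced by flow (in pixels)."""
    b, _, h, w = frame.shape
    y, x = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                          torch.arange(w, dtype=torch.float32), indexing='ij')
    # grid_sample expects (x, y) sampling positions normalised to [-1, 1]
    grid = torch.stack([2 * (x + flow[:, 0]) / (w - 1) - 1,
                        2 * (y + flow[:, 1]) / (h - 1) - 1], dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)

def coarse_to_fine_flow(nets, frames1, frames2):
    """nets[k] is the small network for pyramid level k, coarsest first."""
    pyr1 = image_pyramid(frames1, len(nets))[::-1]   # coarsest level first
    pyr2 = image_pyramid(frames2, len(nets))[::-1]
    flow = torch.zeros_like(pyr1[0][:, :2])          # start from zero flow
    for net, f1, f2 in zip(nets, pyr1, pyr2):
        if flow.shape[-1] != f1.shape[-1]:           # moving up one level:
            flow = 2 * F.interpolate(flow, scale_factor=2, mode='bilinear',
                                     align_corners=True)
        flow = flow + net(f1, warp(f2, flow))        # add this level's residual
    return flow

nets = [TinyFlowNet() for _ in range(4)]             # one small net per level
flow = coarse_to_fine_flow(nets, torch.rand(1, 3, 256, 256),
                           torch.rand(1, 3, 256, 256))
print(flow.shape)                                    # (1, 2, 256, 256)
```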

The parameter savings are huge. SPyNet learns a total of 1.2 million parameters, a 96 percent reduction from the 32 million parameters of FlowNet. Interestingly, we found that the results are as good as FlowNet's, and in some cases even better. The reduction in parameters does not limit accuracy.
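Counting learned parameters is a one-liner in any framework; applied to the real models rather than the toy network above, it would report roughly 32 million for FlowNet versus about 1.2 million for SPyNet summed over its pyramid levels.

```python
# Count learned parameters; TinyFlowNet here is the toy stand-in from above.
n_params = sum(p.numel() for p in TinyFlowNet().parameters())
print(f"{n_params:,} learned parameters")
```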

The number of parameters each of the neural networks needs to compute optical flow.

Then we went ahead and visualized the learned parameters. We were not very surprised to find that they resemble the Gabor filters observed in the human visual system. SPyNet, like other neural networks, appears to model some primitive parts of human visual processing.

Neural network filters learned by SPyNet resemble V1 cells of Visual Cortex

The Future

We have entered the age of massive data, and that is going to make it easier for machines to learn. At the same time, we need to find ways to get better data to help these programs learn. To this end, the next version of the Flying Chairs dataset was released, containing a large corpus of animated 3D scenes. The AVG group at the Max Planck Institute for Intelligent Systems is working on a dataset of real scenes for optical flow, captured with high-speed cameras. FlowNet's new version, FlowNet 2.0, is a lot better than its predecessor. And SPyNet keeps us asking: are even 1 million parameters still too many?
