What are some fundamental problems in Computer Vision that Deep Learning alone may not be able to solve?

Henry Minsky
6 min read · May 7, 2019


The two spirals above are topologically different. One is a single closed curve, the other is two separate but intertwined curves. Can you tell which is which?

Can you formulate a multilayer neural network which can tell? For any arbitrary shape of spiral, not just the one in the picture?

A child can do it, by tracing the curve with their finger or eye and seeing if they traverse the entire curve. This is an example of a so-called ‘visual routine’ as defined by Shimon Ullman. There are plenty of ways to write a computer program to do this particular task; a flood-fill algorithm, for example, would reveal the connectivity of the components. The point is not at all that this example is a hard problem in general, but that it is an incredibly difficult computation for the kind of multilayer perceptron architectures used by popular Deep Learning vision networks today.
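As a concrete illustration, here is a minimal sketch of the flood-fill approach, assuming the figure is supplied as a binary NumPy array with curve pixels set to 1 (the function name and input format are illustrative, not taken from any particular library):

```python
import numpy as np
from collections import deque

def count_curve_components(img: np.ndarray) -> int:
    """Count 8-connected components of 'on' pixels in a binary image.

    A single closed spiral yields one component; two separate but
    intertwined spirals yield two.
    """
    h, w = img.shape
    visited = np.zeros_like(img, dtype=bool)
    components = 0
    for sy in range(h):
        for sx in range(w):
            if img[sy, sx] and not visited[sy, sx]:
                components += 1
                visited[sy, sx] = True
                queue = deque([(sy, sx)])
                while queue:                      # breadth-first flood fill
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny, nx] and not visited[ny, nx]):
                                visited[ny, nx] = True
                                queue.append((ny, nx))
    return components
```

Note how naturally the task decomposes into a serial, stateful traversal: a few lines of bookkeeping, but exactly the kind of step a fixed-depth feed-forward network has no direct way to express.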

Marvin Minsky’s Perceptrons book proved theorems showing that certain problems are computationally very expensive for a multilayer parallel network to compute, with the cost growing steeply as a function of the number of pixels, and hence very expensive to train such a network to compute using back-propagation. The drawings on the cover of the Perceptrons book illustrate one such problem.

But this is not just a toy problem. Many useful ‘common sense’ tasks require some kind of serial processing, which is not the way things are done with current deep network approaches. These serial methods do not seem individually very complex, but they are fundamentally different kinds of computations from those performed by the popular multilayer networks used for object recognition today. It seems clear we want to combine the strengths of parallel distributed networks with different kinds of serial processing to get something that can do the kinds of tasks humans easily learn to do.

There are many useful pattern recognition tasks that CNNs can do today, but there are also inherent limitations in the current architectures. People depending on such systems are then surprised when these networks cannot be trained to perform tasks that are simple for even young children, or are easily fooled into producing incorrect or nonsensical output. This is not merely academic, as much of what we call ‘common sense’ ends up being out of the scope of what a monolithic convolutional neural network can actually do. It is no crisis if your computer vision system misclassifies a Christmas tree ornament as your Uncle Phil in a family photo album; it is quite another matter if your car drives under a truck because it mistook the middle of the truck for open sky.

Below is a similar problem that is difficult for a multilayer perceptron network, for similar reasons. Which line would you pull to catch the fish? This is not a profound problem for even a young child to solve. However, it is extremely difficult for a convolutional neural network architecture, because such networks work by tabulating visual properties without taking into account (possibly transitive) semantic relations, so vital information is discarded. Again, most programmers will immediately come up with several ways to write a program to solve this particular task, but it is worth thinking hard about what weaknesses of the current multilayer perceptron CNN pattern recognition architectures this example highlights.

Many thought that when Marvin Minsky pointed out the computational complexity of a perceptron network computing the XOR function, he was raising a ‘toy problem’, and some people even came back after they had made a multilayer perceptron network that could compute XOR for a small special case, thinking they had addressed his concern.

However, in this case (the intertwined spirals example) the theorem is related to the computational complexity of computing the parity of an input vector. While theoretically learnable by a Deep Learning network, the parity function is very expensive to train using back-propagation, requiring an exponential number of examples. Again, while parity isn’t a difficult function to compute in general (the humble XOR gate does it with a couple of transistors), it is a worst-case function for this particular architecture to learn. Other important operations (such as counting the number of ‘X’s on a curve in the example below) can also be easily computed serially but are highly inefficient to compute using the image-parallel convolutional techniques which have been so successful in teaching deep learning networks to recognize objects. So the solution clearly lies in some hybrid architecture which can do both serial and distributed parallel tasks.
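For contrast, here is how trivially parity falls to a serial computation; this is just a plain-Python sketch, not tied to any particular network library:

```python
from functools import reduce
from operator import xor

def parity(bits):
    """Serial parity: one pass over the input, one XOR per bit."""
    return reduce(xor, bits, 0)

# Flipping any single input bit flips the output, so no small subset of
# the input carries the answer; that is what makes parity a worst case
# for architectures built around local, parallel feature detection.
assert parity([1, 0, 1, 1]) == 1
assert parity([1, 1, 0, 0]) == 0
```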

Some further examples from Shimon Ullman are shown below of visual tasks he refers to as ‘visual cognition’, which take into account relations that are not spatially local and are not easily learned by parallel distributed convolutional networks. Figuring out that two dots lie on the same line, or counting them, involves marking features and transitively preserving some invariant properties (am I still on the line if I move in this direction?) or enumerating unique events (how many X’s have I seen so far, without duplication? Did I cross a line boundary when getting from one X to the other?), which are not the things that CNNs are designed to compute.
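A toy sketch of such a routine appears below, assuming the curve is a thinned, one-pixel-wide path in a binary NumPy array, a second boolean array marks the X locations, and `start` is one endpoint of the curve (the names and input format are illustrative assumptions):

```python
import numpy as np

def count_marks_along_curve(curve: np.ndarray, marks: np.ndarray, start):
    """Trace a thinned curve from one endpoint, marking visited pixels and
    counting how many marked features (X's) are met along the way.

    Assumes each curve pixel has at most two 8-connected curve neighbours,
    so the traversal is unambiguous. This is the serial 'tracing, marking,
    and counting' step of an Ullman-style visual routine.
    """
    h, w = curve.shape
    visited = np.zeros_like(curve, dtype=bool)
    y, x = start
    visited[y, x] = True
    count = int(marks[y, x])
    while True:
        step = None
        for dy in (-1, 0, 1):               # look for the next unvisited
            for dx in (-1, 0, 1):           # curve pixel around (y, x)
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and curve[ny, nx]
                        and not visited[ny, nx]):
                    step = (ny, nx)
        if step is None:                    # reached the far endpoint
            return count
        y, x = step
        visited[y, x] = True
        count += int(marks[y, x])           # each X is counted at most once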

This is not a purely academic issue; there are numerous real-world computer vision problems which will require other kinds of image processing in addition to convolutional multilayer networks. Is there an architecture where you can train a network as to which visual routine to use for which type of task? Hinton has proposed ‘capsule networks’ as a way to start incorporating more complex feature detectors into networks, performing more specialized vision tasks with a less homogeneous network architecture. His capsules compute an object’s ‘pose’ and transform as an atomic operation. Perhaps some ‘capsules’ could contain serial visual-routine processing units that could attack some of Ullman’s example problems, and the network could learn which ones to combine (and in which order) for different tasks.
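Purely as a speculative sketch (this is not Hinton’s capsule formulation, and the routine names, dimensions, and controller structure are all hypothetical, assuming PyTorch), such a hybrid might pair a learned controller with a library of hand-written serial routines:

```python
import torch
import torch.nn as nn

# Hypothetical library of serial routines (e.g. the flood-fill and
# curve-tracing sketches above) that the controller chooses between.
ROUTINES = ["count_components", "trace_curve_and_count_marks", "check_same_line"]

class RoutineController(nn.Module):
    """Scores each serial visual routine for a given task embedding."""

    def __init__(self, task_dim: int, n_routines: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(task_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_routines),
        )

    def forward(self, task_embedding: torch.Tensor) -> torch.Tensor:
        # Probability of invoking each routine for this task.
        return torch.softmax(self.scorer(task_embedding), dim=-1)

controller = RoutineController(task_dim=32, n_routines=len(ROUTINES))
probs = controller(torch.randn(1, 32))       # untrained, illustrative only
print(dict(zip(ROUTINES, probs.squeeze(0).tolist())))
```

The interesting learning problem then becomes which routines to invoke, and in what order, rather than relearning each routine’s logic from raw pixels.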

In the famous work below by Yarbus, an eye-tracking system shows where a person looked when asked different questions about the same picture, ‘The Unexpected Visitor’:

‘The Unexpected Visitor.’ The same subject inspected the top left image seven times, following different instructions in each viewing: (a) free examination of the picture; (b) estimate the material circumstances of the family in the picture; (c) give the ages of the people; (d) surmise what the family had been doing before the arrival of the ‘unexpected visitor’; (e) remember the clothes worn by the people; (f) remember the position of the people and objects in the room; (g) estimate how long the ‘unexpected visitor’ had been away from the family. Source: Yarbus (1967).

What this says to me is that using vision to parse a scene is an active task, requiring a set of goals for the information to be extracted, and a planning and motor system to carry them out. The motor system consists both of physical eye motion (in the case of humans) and of the application of serial and parallel visual routines, for which the system has learned what information can be expected to be extracted. This goes far beyond today’s concepts of pattern recognition, merging the machinery of sensorimotor planning and action with vision in a continuous, harmonious mechanism.
