Why we should be Deeply Suspicious of BackPropagation
Geoffrey Hinton has finally expressed what many have been uneasy about. At a recent AI conference, Hinton remarked that he was “deeply suspicious” of back-propagation, and said “My view is throw it all away and start again.”
Backpropagation has become the bread-and-butter mechanism of Deep Learning. Researchers discovered that one can employ any computation layer in a solution, with the only requirement being that the layer is differentiable. Said differently, one must be able to calculate the gradient of the layer. In plainer terms: in a game of ‘hotter’ and ‘colder’, the verbal hints must accurately reflect the distance between the blindfolded player and the objective.
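To make the differentiability requirement concrete, here is a minimal sketch in plain Python. The one-unit `tanh_layer`, its weights, and the finite-difference check are all illustrative assumptions, not any framework's API. The analytic gradient is what back-propagation would pass along; the numerical estimate is the ‘hotter’/‘colder’ hint measured directly by nudging each weight.

```python
import math

def tanh_layer(w, x):
    """Forward pass of a hypothetical one-unit layer: y = tanh(w . x)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def tanh_layer_grad(w, x):
    """Analytic gradient dy/dw, using d tanh(z)/dz = 1 - tanh(z)^2."""
    y = tanh_layer(w, x)
    return [(1.0 - y * y) * xi for xi in x]

def numerical_grad(f, w, x, eps=1e-6):
    """Central finite-difference estimate of dy/dw, one weight at a time."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((f(up, x) - f(down, x)) / (2 * eps))
    return grad

# Made-up weights and input, purely for illustration.
w = [0.5, -0.3, 0.8]
x = [1.0, 2.0, -1.5]

analytic = tanh_layer_grad(w, x)
numeric = numerical_grad(tanh_layer, w, x)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
print("largest disagreement:", max_gap)
```

If the two estimates agree, the layer's hints are trustworthy in the sense described above; a layer with no usable derivative would give the blindfolded player nothing to go on.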
There are several questions about backpropagation. The first is whether the gradient that is calculated always points in the correct direction for learning. Intuitively, this is questionable. One can always find problems in which moving in the most obvious direction does not lead to a solution. So it should not be surprising that ignoring the gradient may also lead to a solution. (I don’t think, though, that you can ignore the gradient forever.) I have written previously about the difference between an adaptation perspective and an optimization perspective.
However, let’s step back a little and try to understand where the back-propagation idea comes from. Historically, machine learning originates from the general idea of curve fitting. In the specific case of linear regression (i.e. fitting a line to data), following the gradient amounts to solving the least squares problem. In the field of optimization, there are many ways to find an optimal solution other than using the gradient. As a matter of fact, stochastic gradient descent is likely one of the most rudimentary approaches to optimization. So it is remarkable that one of the simplest algorithms one can think of actually works outstandingly well.
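The link between the gradient and least squares can be checked on a toy example. The sketch below uses made-up data and assumed values for the learning rate and step count; it fits a line once with the textbook closed-form least squares solution and once by plain gradient descent on the mean squared error, and the two should land on essentially the same line.

```python
# Toy data lying near y = 2x + 1 (values invented for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

def closed_form(xs, ys):
    """Ordinary least squares for a line: slope = cov(x, y) / var(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var
    return slope, mean_y - slope * mean_x

def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimise the mean squared error of y = a*x + b by following its gradient."""
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

slope_cf, intercept_cf = closed_form(xs, ys)
slope_gd, intercept_gd = gradient_descent(xs, ys)
print("closed form:     ", slope_cf, intercept_cf)
print("gradient descent:", slope_gd, intercept_gd)
```

For a convex problem like this one the gradient is a reliable guide; the open question in the rest of the article is why it keeps working when the problem is nothing like this.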
Most optimization experts had long believed that the high-dimensional space that deep learning occupies would make for a non-convex problem and would therefore be extremely difficult to optimize. However, for reasons that were not well understood, Deep Learning has worked extremely well using Stochastic Gradient Descent (SGD). Many researchers have since come up with different explanations as to why deep learning optimization is surprisingly easy with SGD. One of the more compelling arguments is that in a high-dimensional space, a point where the gradient vanishes is more likely to be a saddle point than a local valley. There will almost always be enough dimensions with gradients that point to an escape route.
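The saddle-point argument can be illustrated with a rough numerical experiment. As a loud assumption, the sketch below stands in for the Hessian at a critical point with a random symmetric matrix of independent Gaussian entries; real loss surfaces are not random in this way, so treat it as intuition only. A critical point is a local valley only if every direction curves upward (the matrix is positive definite), and the fraction of random matrices with that property collapses as the dimension grows.

```python
import random

def random_symmetric(n, rng):
    """Symmetric n x n matrix with standard-normal entries (a stand-in Hessian)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)
    return m

def is_positive_definite(m):
    """Try a Cholesky factorisation; it succeeds only if all eigenvalues are positive."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot means some direction curves down or is flat
                low[i][j] = s ** 0.5
            else:
                low[i][j] = s / low[j][j]
    return True

def fraction_minima(n, trials=2000, seed=0):
    """Fraction of sampled critical points that curve upward in every direction."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials

for n in (1, 2, 4, 8):
    print(n, fraction_minima(n))
```

In one dimension roughly half the sampled points are valleys; by eight dimensions essentially none are, which is the sense in which there is nearly always an escape route.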
Synthetic Gradients, an approach that decouples layers so that back-propagation is not always needed or the gradient calculation can be delayed, has also been shown to be comparably effective. This finding may be a hint that something more general is going on. It is as if any update that tends to be incremental, regardless of its exact direction (a learned estimate in the case of synthetic gradients), works about as well. I wrote about this in “Biologically Plausible Backpropagation”, which examines a variety of alternative techniques.
There is also the question of the objective function that is typically employed. Backpropagation is calculated with respect to the objective function. Typically, the objective function is a measure of the difference between the predicted distribution and the actual distribution, usually something derived from the Kullback-Leibler divergence or some other measure of distance between distributions, such as Wasserstein. However, it is in these similarity calculations that the “label” in supervised training lives. In the same interview, Hinton said with regard to unsupervised learning: “I suspect that means getting rid of back-propagation.” He said further, “We clearly don’t need all the labeled data.”
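To see where the label enters, here is a minimal sketch of a Kullback-Leibler objective in plain Python. The three-class example and its probabilities are invented for illustration. With a one-hot label as the “actual” distribution, the divergence reduces to the negative log of the probability the model assigned to the correct class, so the loss cannot even be written down without the label.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

label = [0.0, 1.0, 0.0]      # one-hot "actual" distribution: class 1 is the correct one
predicted = [0.2, 0.7, 0.1]  # hypothetical model output (probabilities summing to 1)

loss = kl_divergence(label, predicted)
print("loss:", loss)
print("-log(predicted[1]):", -math.log(predicted[1]))
```

Remove `label` and there is nothing left to differentiate, which is the tension with unsupervised learning that Hinton is pointing at.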
In short, you can’t do back-propagation if you don’t have an objective function. You can’t have an objective function if you don’t have a measure of the difference between a predicted value and a labeled (actual or training data) value. So to achieve “unsupervised learning”, you may have to ditch the ability to calculate a gradient.
However, before we throw the baby out with the bath water, let’s examine the purpose of the objective function from a more general perspective. The objective function is a measure of how accurately an automaton’s internal model predicts its environment. The purpose of any intelligent automaton is to formulate an accurate internal model. However, nothing demands that a measurement between a model and its environment be made at all times or continuously. That is, an automaton does not have to be performing back-propagation to be learning. It could be doing something else that improves its internal model.
That something else, call it imagination or call it dreaming, does not require validation against immediate reality. The closest incarnation we have today is the generative adversarial network (GAN). A GAN consists of two networks, a generator and a discriminator. One can consider the discriminator a neural network that acts in concert with the objective function. That is, it validates an internal generator network against reality. The generator is an automaton that recreates an approximation of reality. A GAN works using back-propagation, and it does perform unsupervised learning. So perhaps unsupervised learning doesn’t require an objective function built on labels, but it may still need back-propagation.
Another way to look at unsupervised learning is as a kind of meta-learning. One possible reason a system may not require supervised training data is that the learning algorithm has already developed its own internal model of how best to proceed. In other words, there is still some kind of supervision; it just happens to be implicit in the learning algorithm. How the learning algorithm came to be endowed with this capability is a big unknown.
In summary, it is still too early to tell whether we can get rid of back-propagation. We could certainly use a less stringent version of it (e.g. synthetic gradients or some other heuristic). However, gradual learning (or hill climbing) still appears to be a requirement. I would of course be very interested in any research that invalidates gradual learning or hill climbing. There is in fact an analogy with how the universe behaves, specifically the second law of thermodynamics: the entropy of an isolated system never decreases. An information engine decreases its own entropy in exchange for an entropy increase in its environment. Therefore, there is no way of avoiding the gradient entirely. Doing so would require some kind of “perpetual motion information machine”.
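Gradual learning without an explicit gradient is easy to sketch. The example below is a bare-bones random hill climber on a made-up two-parameter objective; the step size, iteration count, and seed are arbitrary assumptions. It never computes a derivative, yet it still relies on small incremental moves and on a measurement that says whether each move made things better.

```python
import random

def loss(w):
    """Toy objective with its minimum at w = (3, -2); chosen only for illustration."""
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def hill_climb(loss_fn, w, step=0.1, iters=5000, seed=0):
    """Propose small random moves and keep only those that lower the loss."""
    rng = random.Random(seed)
    best = loss_fn(w)
    for _ in range(iters):
        candidate = [wi + rng.gauss(0.0, step) for wi in w]
        candidate_loss = loss_fn(candidate)
        if candidate_loss < best:
            w, best = candidate, candidate_loss
    return w, best

start = [0.0, 0.0]
w_final, final_loss = hill_climb(loss, start)
print("start loss:", loss(start))
print("final loss:", final_loss, "at", w_final)
```

The climber has dropped the gradient but not the objective or the incremental steps, which is the sense in which the hill climbing requirement seems hard to escape.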
Update: A recent paper from Google reports the discovery of two new optimization methods (named PowerSign and AddSign). Surprisingly, a programmatic search found these methods.
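For the curious, here is a hedged sketch of the two rules as they are commonly described: both scale the gradient up when its sign agrees with a running average of past gradients and scale it down when the signs disagree. The scalar form, the toy objective, the hyperparameter values, and the omission of any decay schedule on the sign term are all simplifying assumptions here, so check the paper before relying on the details.

```python
import math

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def powersign_step(w, g, m, lr=0.1, beta=0.9, base=math.e):
    """PowerSign-style update: scale the gradient by base ** (sign(g) * sign(m))."""
    m = beta * m + (1.0 - beta) * g  # running average of gradients
    scale = base ** (sign(g) * sign(m))
    return w - lr * scale * g, m

def addsign_step(w, g, m, lr=0.1, beta=0.9, alpha=1.0):
    """AddSign-style update: scale the gradient by alpha + sign(g) * sign(m)."""
    m = beta * m + (1.0 - beta) * g
    scale = alpha + sign(g) * sign(m)
    return w - lr * scale * g, m

def minimise(step_fn, w=5.0, steps=100):
    """Run an update rule on the toy objective f(w) = w^2, whose gradient is 2w."""
    m = 0.0
    for _ in range(steps):
        g = 2.0 * w
        w, m = step_fn(w, g, m)
    return w

w_power = minimise(powersign_step)
w_add = minimise(addsign_step)
print("PowerSign:", w_power)
print("AddSign:  ", w_add)
```

Note that both rules still consume a gradient; what the search discovered was a new way of trusting it, not a way of doing without it.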