The paper submission deadline for ICLR 2017 in Toulon, France has arrived, and instead of a trickle of new knowledge about Deep Learning we get a massive deluge. This is a gold mine of research that’s hot off the presses. Many papers are incremental improvements on state-of-the-art algorithms. I had hoped to find more fundamental theoretical and experimental results on the nature of Deep Learning; unfortunately, there were just a few. There were, however, two developments that were mind-boggling, and one paper that confirms, with shocking results, something I’ve been suspecting for a while now. It really is a good news, bad news story.
First let’s talk about the good news. The first is the mind-boggling discovery that you can train a neural network to learn to learn (i.e. meta-learning). More specifically, several research groups have trained neural networks to perform stochastic gradient descent (SGD). Not only have they been able to demonstrate neural networks that have learned SGD, the networks have outperformed the hand-tuned methods they were compared against! The two papers that were submitted were “Deep Reinforcement Learning for Accelerating the Convergence Rate” and “Optimization as a Model for Few-Shot Learning”. Unfortunately though, these two groups had previously been scooped by DeepMind, who showed that you could do this in the paper “Learning to Learn by Gradient Descent by Gradient Descent”. The two latter papers trained an LSTM, while the first one trained via RL. I had thought that it would take a bit longer to implement meta-learning, but it has arrived much sooner than I had expected!
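To make the idea concrete, here is a toy sketch in Python. It is not the method of any of the papers above: instead of training an LSTM or an RL agent, it treats the optimizer as an update rule with two tunable numbers (a step size and a momentum term) and picks them by random search so that the loss along the training trajectory is small across a batch of made-up toy problems (random quadratics). All names here (sample_task, unrolled_loss, phi) are hypothetical and chosen for illustration only.

```python
import math
import random

rng = random.Random(0)

def sample_task(dim=5):
    """A random quadratic f(x) = 0.5 * sum(a_i * x_i**2), standing in for a training problem."""
    a = [rng.uniform(0.1, 1.0) for _ in range(dim)]
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return a, x0

def unrolled_loss(phi, tasks, steps=20):
    """Meta-objective: loss summed along the optimization trajectory, averaged over tasks."""
    lr, beta = phi
    total = 0.0
    for a, x0 in tasks:
        x = list(x0)
        v = [0.0] * len(x)
        for _ in range(steps):
            for i in range(len(x)):
                grad = a[i] * x[i]
                v[i] = beta * v[i] - lr * grad  # the parameterized update rule
                x[i] += v[i]
            total += 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
    return total / len(tasks)

tasks = [sample_task() for _ in range(20)]

# Baseline: plain gradient descent with a guessed step size and no momentum.
hand_tuned = (0.1, 0.0)
baseline_loss = unrolled_loss(hand_tuned, tasks)
best_phi, best_loss = hand_tuned, baseline_loss

# "Meta-training" by random search over the optimizer's own parameters.
for _ in range(200):
    candidate = (math.exp(rng.uniform(-5.0, 0.0)), rng.uniform(0.0, 0.95))
    loss = unrolled_loss(candidate, tasks)
    if loss < best_loss:
        best_phi, best_loss = candidate, loss

print("hand-tuned:  ", hand_tuned, baseline_loss)
print("meta-learned:", best_phi, best_loss)
```

The point is only that the optimizer’s own parameters become something you fit to a distribution of problems; the papers replace the two numbers with a neural network and the random search with gradient descent or RL.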
Not to be outdone, two other groups created machines that could design new Deep Learning networks and do it in such a way as to improve on the state of the art! This is learning to design neural networks. The two papers that were submitted are “Designing Neural Network Architectures using Reinforcement Learning” and “Neural Architecture Search with Reinforcement Learning”. The former paper describes the use of Q-learning, a reinforcement learning method, to discover CNN architectures. You can find some of their generated CNNs in Caffe here: https://bowenbaker.github.io/metaqnn/ . The latter paper is truly astounding (you can’t do this without Google’s compute resources). Not only did they show state-of-the-art CNN networks, the machine actually learned a few more variants of the LSTM node! Here are the LSTM nodes the machine created (left and bottom):
So not only are researchers who hand-optimize gradient descent solutions out of business, so are folks who make a living designing neural architectures! This is actually just the beginning of Deep Learning systems bootstrapping themselves. So I must now share Schmidhuber’s cartoon that aptly describes what is happening:
This is absolutely shocking and there’s really no end in sight as to how quickly Deep Learning algorithms are going to improve. Because this meta capability can be applied to itself, it can recursively create better and better systems.
Permit me now to deal you the bad news. Here is the paper that is the bearer of that news: “Understanding Deep Learning Requires Rethinking Generalization”. I’ve thought about Generalization a lot, and I’ve posted some questions on Quora about Generalization and also about Randomness in the hope that someone could give some good insight. Unfortunately, nobody had enough of an answer or understood the significance of the question until the folks who wrote the above paper performed some interesting experiments. Here is a snippet of what they found:
1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
The shocking truth revealed: Deep Learning networks are just massive associative memory stores! The same networks that generalize well on real data are also capable of fitting completely random labels. This is indeed strange in that many arguments for the validity of Deep Learning rest on the conjecture that ‘natural’ data tends to exist in a very narrow manifold in multi-dimensional space. Random data, however, does not have that sort of tendency.
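The memorization half of this is easy to reproduce at toy scale. The sketch below is a stand-in, not the paper’s experiment: instead of a deep network it uses a single linear unit with more weights (100) than training points (30), trained with perceptron updates on labels drawn by coin flip. The sizes and variable names are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)
n_train, n_test, dim = 30, 200, 100  # more weights than training points

def random_point():
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def random_label():
    return rng.choice([-1, 1])  # labels carry no information about the input

train = [(random_point(), random_label()) for _ in range(n_train)]
test = [(random_point(), random_label()) for _ in range(n_test)]

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def accuracy(w, data):
    return sum(predict(w, x) == y for x, y in data) / len(data)

# Perceptron updates: repeat until every training point is classified correctly.
w = [0.0] * dim
for epoch in range(1000):
    mistakes = 0
    for x, y in train:
        if predict(w, x) != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    if mistakes == 0:
        break

train_acc = accuracy(w, train)
test_acc = accuracy(w, test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Training accuracy reaches 100% because 30 points in general position in 100 dimensions can be separated under any labeling, while accuracy on fresh coin-flip labels stays near chance, since there is nothing to generalize to.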
John Hopfield wrote a paper early this year examining the duality of Neural Networks and Associative Memory. Here’s a figure from his paper:
The authors write:
Our work will inspire more possibilities of using the generative power of CNNs with random weights, which do not need long training time on multi-GPUs. …
Our work using random weights shows a possible way to start developing a theory of deep learning since with well-trained weights, theorems might be impossible.
The “Rethinking Generalization” paper goes even further by examining our tried and true tool for achieving Generalization (i.e. Regularization) and finds that:
Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.
In other words, all our regularization tools may be less effective than we believe! Furthermore, and even more shocking, the unreasonable effectiveness of SGD turns out to be:
Appealing to linear models, we analyze how SGD acts as an implicit regularizer.
just a different kind of regularization that happens to work!
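The linear-model argument can be checked numerically. The sketch below assumes the simplest version of the setting: an underdetermined least-squares problem (2 equations, 6 unknowns, made-up data) and full-batch gradient descent rather than SGD, started from zero weights. Infinitely many weight vectors fit the data exactly, yet gradient descent lands on the one with the smallest norm, without any explicit penalty term.

```python
import random

rng = random.Random(0)
n, d = 2, 6  # fewer equations than unknowns: infinitely many exact solutions
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [1.0, -1.0]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Plain gradient descent on 0.5 * ||Xw - y||^2, started at w = 0.
w = [0.0] * d
lr = 0.02
for _ in range(20000):
    residual = [dot(row, w) - t for row, t in zip(X, y)]
    grad = [sum(residual[i] * X[i][j] for i in range(n)) for j in range(d)]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]

# Closed-form minimum-norm solution w* = X^T (X X^T)^{-1} y,
# with the 2x2 inverse written out by hand.
g00, g01, g11 = dot(X[0], X[0]), dot(X[0], X[1]), dot(X[1], X[1])
det = g00 * g11 - g01 * g01
alpha = [(g11 * y[0] - g01 * y[1]) / det, (g00 * y[1] - g01 * y[0]) / det]
w_min_norm = [alpha[0] * X[0][j] + alpha[1] * X[1][j] for j in range(d)]

# Largest coordinate-wise difference between the two solutions.
gap = max(abs(a - b) for a, b in zip(w, w_min_norm))
fit_error = max(abs(dot(row, w) - t) for row, t in zip(X, y))
print("gap to minimum-norm solution:", gap)
print("largest residual:", fit_error)
```

The reason is that every update is a combination of the rows of X, so starting from zero the iterate never leaves their span, and the only exact solution inside that span is the minimum-norm one.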
In fact, a paper submitted to ICLR 2017 by another group, titled “An Empirical Analysis of Deep Network Loss Surfaces”, confirms that different optimization methods arrive at different local minima:
Our experiments show that different optimization methods find different minima, even when switching from one method to another very late during training. In addition, we found that the minima found by different optimization methods have different shapes, but that these minima are similar in terms of the most important measure, generalization accuracy.
This tells you that your choice of learning algorithm “rigs” how it arrives at a solution. Randomness is ubiquitous, and it does not seem to matter how you regularize your network or which SGD variant you employ: the network just seems to evolve (if you set the right random conditions) towards convergence! What are the properties of SGD that lead to machines that can learn? Are the properties tied to differentiation, or is it something more general? If we can teach a network to perform SGD, can we teach it to perform this unknown generalized learning method?
The effectiveness of this randomness was in fact demonstrated earlier this year in the paper “A Powerful Generative Model Using Random Weights for the Deep Image Representation”, co-authored by John Hopcroft, which showed that you could generate realistic imagery using randomly initialized networks without any training! How could this be possible? (Editor’s Note: Initialization with random weights is certainly better than non-random weights, unless those weights are from a pre-trained network)
Therefore to understand Deep Learning, we must embrace randomness. Randomness arises from maximum entropy, which interestingly enough is not without its own structure! The memory capacity of a neural network seems to be highest the closer to random the weights are. The strangeness here is that Randomness is ubiquitous in the universe. The arrow of time is reflected by the direction towards greater entropy. How then is it that this property is also the basis of learning machines?
Let’s assume that the reasoning (or the intuition) behind hierarchical layers in DL is that the bottom layers consist of primitive recognition components that are built up, layer by layer, into more complex recognition components.
What this implies then is that the bottom components should be ‘searched’ more thoroughly during training than the topmost components. But the way SGD works is that the search is driven from the top and not from the bottom. So the top is searched more thoroughly than the bottom layers.
This tells you that the bottom layers (the ones closest to the inputs) are not optimal in their representation. In fact, they are likely to be the most generalized kind of representation: the kind whose recognizers have an equal probability of matching anything; in short, completely random!
As you move up the layers, the specialization happens because it is actually driven from the top which is designed to fit the data. Fine tuning happens at the top.
Let’s draw an analogy between this process and languages. The bottom components of a language are letters and the top parts are sentences. In between you have syllables, words, parts of speech, etc. However, from a Deep Learning perspective, it is as if there are no letters, but rather fuzzy forms of letters, which build up into fuzzy forms of words and so forth. The final layer is like some projection (some wave collapse) into an interpretation.
Now let’s throw in the Swapout learning procedure, which tells us that if you sample any subnetwork of the entire network, the resulting prediction will be similar to that of any other subnetwork you sample. It is just like holographic memory, where you can slice off pieces and still recreate the whole. It seems that the more random we try to make the procedure, the better our learning. That is definitely counter-intuitive!
The following paper, published last week, “Learning in the Machine: Random Backpropagation and the Learning Channel”, explores the robustness of backpropagating errors through fixed random matrices instead of the transposed forward weights.
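As a rough illustration of the idea, the sketch below trains a tiny two-layer network on a made-up regression task, but sends the output error back to the hidden layer through a fixed random vector B instead of the output weights. It is a minimal sketch under assumed sizes, learning rate, and data (all hypothetical), not a reproduction of the paper’s experiments.

```python
import math
import random

rng = random.Random(0)
n_in, n_hidden, n_samples = 4, 8, 200

# Hypothetical task: targets come from a fixed linear "teacher".
teacher = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
data = []
for _ in range(n_samples):
    x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    data.append((x, sum(t * xi for t, xi in zip(teacher, x))))

W1 = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [0.0] * n_hidden                                  # output weights start at zero
B = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # fixed random feedback weights

def forward(x):
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return h, sum(w * hi for w, hi in zip(W2, h))

def mean_loss():
    return sum(0.5 * (forward(x)[1] - y) ** 2 for x, y in data) / len(data)

initial_loss = mean_loss()
lr = 0.02
for epoch in range(30):
    for x, y in data:
        h, out = forward(x)
        err = out - y
        for j in range(n_hidden):
            # Exact backprop would send err back through W2[j]; B[j] is used instead.
            delta = B[j] * err * (1.0 - h[j] ** 2)
            W2[j] -= lr * err * h[j]  # the output layer still gets its true gradient
            for i in range(n_in):
                W1[j][i] -= lr * delta * x[i]
final_loss = mean_loss()

# Cosine between the learned output weights and the random feedback weights.
alignment = sum(a * b for a, b in zip(W2, B)) / (
    math.sqrt(sum(a * a for a in W2)) * math.sqrt(sum(b * b for b in B)))

print("loss before:", initial_loss, "after:", final_loss)
print("alignment of W2 with B:", alignment)
```

On a task like this the loss typically still falls, and the alignment value gives a way to check whether the forward weights have drifted toward the random feedback weights, which is the usual explanation for why the trick works.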
Please see Design Patterns for Deep Learning: Canonical Patterns for additional insight on this intriguing subject.