Deep Learning Machines are Holographic Memories


My favorite paper among the 500-plus papers submitted to ICLR 2017 is this one from a group at Google: “Understanding Deep Learning Requires Rethinking Generalization.” I’ve thought about Generalization a lot, and I’ve posted queries on Quora about Generalization and also about Randomness in the hope that someone could offer some good insight. Unfortunately, nobody had a real answer, or understood the significance of the question, until the folks who wrote the above paper performed some interesting experiments. Here is a snippet of what they found:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.
2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.
3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

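The first two findings can be reproduced in miniature. The following is a toy sketch (my own, not the paper's experiment): an over-parameterized two-layer network trained with plain gradient descent on completely random labels still reaches near-perfect training accuracy, i.e. it has enough effective capacity to brute-force memorize the data.

```python
import numpy as np

# Toy sketch: fit an over-parameterized two-layer tanh network to RANDOM labels.
rng = np.random.default_rng(0)
N, d, h = 32, 16, 256                      # samples, input dim, hidden units
X = rng.standard_normal((N, d))
y = rng.integers(0, 2, N).astype(float)    # labels are pure noise

W1 = rng.standard_normal((d, h)) / np.sqrt(d)
b1 = np.zeros(h)
w2 = rng.standard_normal(h) / np.sqrt(h)
b2 = 0.0

def forward(X):
    H = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(H @ w2 + b2)))   # sigmoid output
    return H, p

lr = 1.0
for step in range(10000):
    H, p = forward(X)
    dz2 = (p - y) / N                      # gradient of mean cross-entropy
    dz1 = np.outer(dz2, w2) * (1 - H**2)   # backprop through tanh
    W1 -= lr * (X.T @ dz1); b1 -= lr * dz1.sum(axis=0)
    w2 -= lr * (H.T @ dz2); b2 -= lr * dz2.sum()

_, p = forward(X)
train_acc = np.mean((p > 0.5) == y)
print(train_acc)   # memorization: training accuracy at or near 1.0
```

The network has nothing to "explain" here; it simply stores the noise, which is the memorization the paper is pointing at.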
Deep Learning networks are just massive associative memory stores! They achieve good generalization on natural data even though they have the capacity to fit random data outright. This is indeed strange, in that many arguments for the validity of Deep Learning rest on the conjecture that ‘natural’ data tends to exist in a very narrow manifold in multi-dimensional space. Random data, however, has no such tendency.

John Hopcroft, on a paper that examines the effects of random weights:

Our work will inspire more possibilities of using the generative power of CNNs with random weights, which do not need long training time on multi-GPUs. Furthermore, it is very hard to prove why trained deep neural networks work so well. Based on the networks with random weights, we might be able to prove some properties of the deep networks. Our work using random weights shows a possible way to start developing a theory of deep learning since with well-trained weights, theorems might be impossible.
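The random-weights idea can be shown in miniature with random features rather than a CNN (my simplification, not the paper's setup): freeze a randomly initialized hidden layer and train only a linear readout on top. On an XOR-style problem that no linear model can solve in the raw inputs, the untrained random nonlinear features make the classes linearly separable.

```python
import numpy as np

# Sketch: a frozen random hidden layer plus a trained linear readout.
rng = np.random.default_rng(1)

# Four Gaussian clusters in an XOR arrangement: opposite corners share a label.
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
X = np.repeat(centers, 50, axis=0) + 0.2 * rng.standard_normal((200, 2))
y = np.repeat([0.0, 0.0, 1.0, 1.0], 50)

# Random hidden layer -- never trained.
W = rng.standard_normal((2, 200))
b = rng.uniform(-1, 1, 200)
H = np.tanh(X @ W + b)

# Train only the linear readout with logistic-regression gradient descent.
w, c = np.zeros(200), 0.0
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(H @ w + c)))
    g = (p - y) / len(y)
    w -= 0.2 * (H.T @ g); c -= 0.2 * g.sum()

acc = np.mean((1.0 / (1.0 + np.exp(-(H @ w + c))) > 0.5) == y)
print(acc)   # near-perfect, even though the hidden weights are random
```

Only the final linear map is fit to the data; the representational power comes for free from the random layer.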

To understand Deep Learning, we must embrace Randomness. Randomness arises from maximum entropy, which, interestingly enough, is not without its own structure! The memory capacity of a neural network seems to be highest when its weights are closest to random. The strangeness here is that Randomness is ubiquitous in the universe; the arrow of time points in the direction of greater entropy. How, then, is this same property also the basis of learning machines?

Assume that the reasoning (or the intuition) behind hierarchical layers in DL is that the bottom layers consist of primitive recognition components that are built up, layer by layer, into more complex recognition components.

What this implies is that the bottom components should be ‘searched’ more thoroughly during training than the topmost components. But the way SGD works, the search is driven from the top and not from the bottom, so the top is searched more thoroughly than the bottom layers.

This tells you that the bottom layers (the ones closest to the inputs) are not optimal in their representation. In fact, their representation will likely be of the most generalized form: the kind with recognizers that have an equal probability of matching anything; in short, completely random!

As you move up the layers, specialization happens, because the process is actually driven from the top, which is designed to fit the data. Fine-tuning happens at the top.
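This top-driven search can be illustrated directly (a toy sketch, not a proof): backpropagate an error signal through a deep tanh network and watch its norm at each layer. The signal that drives SGD's search is strongest at the top and decays toward the bottom, so the bottom layers are adjusted, "searched", far less.

```python
import numpy as np

# Sketch: the backpropagated error signal shrinks as it descends the layers.
rng = np.random.default_rng(2)
depth, n = 8, 64
Ws = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(depth)]

# Forward pass, saving the activations needed for backprop.
acts = [rng.standard_normal(n)]
for W in Ws:
    acts.append(np.tanh(acts[-1] @ W))

# Backward pass: track the norm of the error signal at each layer.
delta = rng.standard_normal(n)            # some loss gradient at the output
norms = [np.linalg.norm(delta)]
for i in reversed(range(depth)):
    delta = Ws[i] @ (delta * (1 - acts[i + 1] ** 2))   # through tanh, then W
    norms.append(np.linalg.norm(delta))

# norms[0] is the top of the network, norms[-1] is down at the inputs.
print(norms[-1] < norms[0])   # the bottom receives the weakest search signal
```

With the bottom barely nudged away from its initialization, it remains close to the random state it started in, which is exactly the point above.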

Let’s make an analogy between this process and language. The bottom components of a language are letters and the top parts are sentences; in between you have syllables, words, parts of speech, etc. From a Deep Learning perspective, however, it is as if there were no letters, only fuzzy forms of letters, which build up into fuzzy forms of words and so forth. The final layer is like a projection (a kind of wave collapse) into an interpretation.

This notion of fuzzy letters and fuzzy words can be conveyed intuitively by having you read the following:

For emaxlpe, it deson’t mttaer in waht oredr the ltteers in a wrod aepapr, the olny iprmoatnt tihng is taht the frist and lsat ltteer are in the rghit pcale. The rset can be a toatl mses and you can sitll raed it wouthit pobelrm.
S1M1L4RLY, Y0UR M1ND 15 R34D1NG 7H15 4U70M471C4LLY W17H0U7 3V3N 7H1NK1NG 4B0U7 17.

Our brains work similarly to Deep Learning in that we work with primarily fuzzy concepts.

There is a recent paper that discusses a method called Swapout, a generalized form of Dropout that works across many layers. Its findings:

1. Swapout improves results as compared with ResNet.
2. Stochastic inference beats deterministic inference, even with only a few averaged results.
3. Increasing the width of a network greatly improves performance.

The Swapout learning procedure tells us that if you sample any subnetwork of the entire network, the resulting prediction will be similar to that of any other subnetwork you sample. This is just like holographic memory, where you can slice off pieces and still recreate the whole. The pattern seems to be that the more randomness we inject, the better the learning. That is definitely counter-intuitive!
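A minimal numpy sketch of the Swapout rule as I understand it from the paper: each unit independently keeps or drops both the skip connection x and the transformed signal F(x), via y = Θ1 ⊙ x + Θ2 ⊙ F(x) with independent Bernoulli masks. Per unit, this interpolates between dropout, a plain skip, and a residual unit.

```python
import numpy as np

def swapout(x, fx, p_skip, p_fx, rng):
    """y = theta1 * x + theta2 * F(x), with independent Bernoulli masks."""
    t1 = rng.random(x.shape) < p_skip   # keep the identity path?
    t2 = rng.random(x.shape) < p_fx     # keep the transformed path?
    return t1 * x + t2 * fx

rng = np.random.default_rng(3)
x = np.ones(4)
fx = 2 * np.ones(4)

# Deterministic corner cases recover the familiar architectures:
print(swapout(x, fx, 1.0, 1.0, rng))   # residual unit: x + F(x)
print(swapout(x, fx, 1.0, 0.0, rng))   # identity: the layer is skipped
print(swapout(x, fx, 0.0, 1.0, rng))   # plain feedforward layer: F(x)
print(swapout(x, fx, 0.0, 0.0, rng))   # unit fully dropped
```

With intermediate keep probabilities, every forward pass samples a different subnetwork; averaging several such stochastic passes at test time is the stochastic inference from the paper's findings above.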

Please see Design Patterns for Deep Learning: Canonical Patterns for additional insight on this intriguing subject.