Revisiting Deep Learning as a Non-Equilibrium Process
Last year, one of the best paper awards at ICLR 2017 went to “Understanding Deep Learning Requires Rethinking Generalization” by Chiyuan Zhang et al. For a good overview, here is a video of his talk:
The key takeaway of his team's work is that Deep Learning systems behave remarkably differently from classical machine learning systems. One of the biggest misunderstandings about Deep Learning is that it is just a higher-dimensional form of curve fitting, and thus something to be solved with optimization techniques. This incorrect notion may come from the way Artificial Neural Networks (ANNs) are often taught: as just a larger form of logistic regression. Alternatively, for the more experienced machine learning expert, everything can be framed from the viewpoint of an optimization problem.
That last viewpoint has in fact been detrimental to the field for a long time. From the optimization viewpoint, Deep Learning is so high-dimensional and non-convex that convergence to a global minimum should be theoretically impossible. Unfortunately for the optimization gurus, experiments say otherwise. The simplest of methods, stochastic gradient descent (SGD), works surprisingly well. Something else is going on that has eluded the orthodox explanation of how optimization is supposed to work. Zhang's work provided experimental evidence that we have to rethink our current (obviously incomplete) theories.
Despite the thousands of papers submitted to the various Deep Learning conferences this year, very few attempt to explain the true nature of Deep Learning. Deep Learning research is really just pure alchemy, with piss-poor explanations backed by lots of hand waving disguised as mathematics. Everyone in the academic community is so invested in pleasing everyone else that nobody wants to call out the BS. Fortunately, we have some brave souls who work on the real theoretical issues. Papers of this kind are, unfortunately, the kind that usually get rejected. It's just a fact of life that to understand a complex system you have to work with a simpler one. Yet showing results on MNIST is considered not state-of-the-art and therefore to be ignored. The only folks who get a pass are celebrities like Geoffrey Hinton. It's a sad reality where celebrity and alchemy are favored over real science.
Okay, I’m done with my rant. Let’s look at some interesting papers that have been published recently.
Here’s a new paper: “A Bayesian Perspective on Generalization and Stochastic Gradient Descent”. Which raises the question: why are smart people invoking spells like “Bayesian intuition” to obfuscate the fact that they are actually just doing alchemy? I suspect that the use of the terms Bayesian or Gaussian in papers plays to the sentiments of the orthodoxy, at the expense of more precise but equivalent language. Here are some quotes from the paper:
The contribution is often called the “Occam factor”, because it enforces Occam’s razor; when two models describe the data equally well, the simpler model is usually better.
We conclude that Bayesian model comparison is quantitatively consistent with the results of Zhang et al. (2016) in linear models where we can compute the evidence, and qualitatively consistent with their results in deep networks where we cannot.
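To see what the “Occam factor” actually is, it helps that the evidence can be computed in closed form for a Bayesian linear model. Below is a minimal numpy sketch (the data, prior precision `alpha`, and noise precision `beta` are invented for illustration) comparing a linear and a quintic fit to data with a linear ground truth; the log-determinant term is the Occam penalty:

```python
import numpy as np

def log_evidence(Phi, y, alpha=1.0, beta=100.0):
    # Analytic log marginal likelihood of Bayesian linear regression:
    # prior w ~ N(0, I/alpha), likelihood y ~ N(Phi w, I/beta).
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi      # posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ y)        # posterior mean
    E = 0.5 * beta * np.sum((y - Phi @ m) ** 2) + 0.5 * alpha * m @ m
    _, logdetA = np.linalg.slogdet(A)               # source of the Occam factor
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * logdetA - 0.5 * N * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.5 * x + 0.3 + 0.1 * rng.standard_normal(50)   # linear ground truth

ev_linear = log_evidence(np.vander(x, 2, increasing=True), y)
ev_quintic = log_evidence(np.vander(x, 6, increasing=True), y)
print(ev_linear, ev_quintic)
```

Both models describe the data about equally well, but the extra parameters of the quintic enlarge the log-determinant penalty, so the evidence prefers the line, which is exactly the “simpler model is usually better” behavior quoted above.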
I am not impressed by this paper’s attempt to justify ‘Bayesian intuition’. Fortunately, there is a much better paper on exactly the same subject: “Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks”. It also explores SGD as an implicit generalization method and remarks:
We prove that such “out-of-equilibrium” behavior is a consequence of the fact that the gradient noise in SGD is highly non-isotropic; the covariance matrix of mini-batch gradients has a rank as small as 1% of its dimension.
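As a hedged toy illustration of non-isotropic, low-rank gradient noise (synthetic data and a plain logistic model, not the paper's deep-network measurement): if the inputs live on a low-dimensional manifold, the per-example gradients, and hence the mini-batch gradient covariance, are confined to that same subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 50, 5, 500                      # params, intrinsic data dim, examples
B = rng.standard_normal((d, k))
X = rng.standard_normal((n, k)) @ B.T     # inputs confined to a 5-dim subspace
y = (rng.random(n) < 0.5).astype(float)   # arbitrary binary labels
w = rng.standard_normal(d) * 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Per-example logistic-loss gradients: (sigmoid(x.w) - y) * x
G = (sigmoid(X @ w) - y)[:, None] * X     # shape (n, d)
C = np.cov(G, rowvar=False)               # gradient covariance, (d, d)
eigs = np.linalg.eigvalsh(C)
effective_rank = int(np.sum(eigs > 1e-8 * eigs.max()))
print(effective_rank, d)                  # rank 5 in a 50-dim parameter space
```

Here the covariance rank is 10% of the parameter dimension purely because of the structure of the data; in deep networks the architecture contributes to this anisotropy as well.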
The paper also has a gem of an observation:
There is however a disconnect between these two directions due to the fact that while adding external gradient noise helps in theory, it works poorly in practice. Instead, “noise tied to the architecture” works better, e.g., dropout, or small mini-batches.
In short, noise in Deep Learning arises from the diversity of the training data and the architecture. It’s not something you artificially add so you can justify Bayesian intuition.
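A concrete example of “noise tied to the architecture” is dropout: the noise is a multiplicative mask on the activations themselves rather than an isotropic additive term on the gradient. A minimal inverted-dropout sketch in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5, train=True):
    # Inverted dropout: zero each unit with probability p and rescale
    # the survivors so the expected activation is unchanged.
    if not train:
        return h
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.ones((4, 8))            # a batch of hidden activations
h_noisy = dropout(h, p=0.5)    # entries are now 0.0 or 2.0
```

Because the mask multiplies the activations, the induced gradient noise inherits the network's structure, which is the kind of noise the quote above says works well in practice.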
This notion of the importance of architecture is further analyzed in a recent paper, “Intriguing Properties of Adversarial Examples”, where authors from Google Brain use their “neural architecture search” infrastructure to discover new architectures that are less susceptible to adversarial examples. They conclude that the size of the network does not correlate with adversarial robustness. Rather, adversarial robustness is strongly correlated with “clean accuracy”. The principles behind building a high “clean accuracy” architecture appear to be an open question. Their brute-force search found the following network:
It just happens to be longer and thinner than the baseline best NAS architecture. The greater depth perhaps alludes to a larger effect of transient chaos (discussed later). (I don’t have an explanation for the narrowness other than its lower complexity.)
These two papers study the same subject, but their approaches are starkly different. In the first paper, the authors attempt to show that Bayesian inference still holds for Deep Learning. In the second, the authors explain that this is a non-equilibrium phenomenon, and that we can’t know enough because Deep Learning training is truncated after insufficient epochs.
You can see the problem here: satisfying Bayesian inference or Occam’s razor does not signify truth. All it signifies is that the behavior of the inspected system happens to validate one’s own beliefs.
The second paper, in contrast, explores the aspects that make Deep Learning different and attempts analogies with other physical theories. In short, it doesn’t try to force a round peg into a square hole. Reality is what it is, and it is our business as scientists to explore a rich variety of models to explain it. The real problem is that many researchers aren’t skilled in the mathematics of Statistical Mechanics. They use whatever tool is at their disposal; unfortunately, it is some antiquated 18th-century math in the form of Bayes’ Theorem.
In this paper, the authors argue that SGD settles not at a local minimum but rather in a limit cycle:
The paper proposes the use of a ‘local entropy’ to discover these limit cycles. It cites a paper, “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes”, that makes a claim about the smoothness of the local entropy as compared to the original objective function:
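The local entropy can be sketched in a few lines: it replaces the loss f(x) with the negative log of a Gaussian-weighted average of exp(−f) around x, which averages away the fine-scale ruggedness. A hedged Monte-Carlo toy (the 1-D objective and the scope parameter gamma are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    return 0.05 * x ** 2 + 0.5 * np.cos(5 * x)   # rugged toy objective

def local_entropy(x, gamma=2.0, n_samples=20_000):
    # Monte-Carlo estimate of -log E[exp(-loss(x'))] with x' ~ N(x, 1/gamma)
    xp = x + rng.standard_normal(n_samples) / np.sqrt(gamma)
    return -np.log(np.mean(np.exp(-loss(xp))))

xs = np.linspace(-3, 3, 7)
smoothed = [local_entropy(x) for x in xs]
```

The cosine ripples of the raw objective are largely washed out of the smoothed landscape, which is the smoothness claim the cited paper makes; shrinking the Gaussian scope recovers the original objective.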
In a previous blog post, I pointed to recent papers that describe the two phases of gradient descent. The signal-to-noise ratio is extremely low near the minima; it is chaotic down there, and any significant increase in the learning rate can violently kick one out of a minimum. In addition, there are many of these minima, and finding out which of them leads to generalization is a wide-open question. The current consensus is that a wide basin is the preferred choice. I don’t know if this notion should give a researcher a warm fuzzy feeling that it’s the right choice!
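The violent kick is visible even in plain gradient descent on a quadratic basin: the update x ← x − η·f′(x) is stable only when the learning rate η is below 2/curvature, so a step size that rests comfortably in a wide basin is expelled from a sharp one. A toy sketch (curvature values invented for illustration):

```python
def gd(x0, lr, curvature, steps=50):
    # Gradient descent on f(x) = 0.5 * curvature * x**2,
    # so each step multiplies x by (1 - lr * curvature).
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    return x

wide, sharp = 1.0, 100.0                     # curvatures of two basins
x_wide = gd(1.0, lr=0.05, curvature=wide)    # |1 - 0.05| < 1: settles in
x_sharp = gd(1.0, lr=0.05, curvature=sharp)  # |1 - 5.0| > 1: kicked out
```

The same effect, with SGD noise on top, is one reason a raised learning rate ejects the iterate from sharp minima while wide basins survive.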
There are several papers, also from those trained in fields other than statistics, that will likely never see the light of day (that is, be accepted at a conference). Incomprehensibility to a reviewer trained only in statistics is grounds for rejection. Here is one where Charles Martin and Michael Mahoney apply a statistical mechanics approach to further understand the ‘rethinking generalization’ paper (“Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior”). The authors argue that:
In particular, methods that implement regularization by, e.g., adding a capacity control function to the objective and approximating the modified objective, performing dropout, adding noise to the input, and so on, do not substantially improve the situation. Indeed, the only control parameter that has a substantial regularization effect is early stopping.
There is indeed mass confusion today about what kind of regularization leads to generalization. In fact, there is experimental evidence that performing SGD without regularization also leads to good generalization. There are even papers showing that certain kinds of regularization are detrimental to generalization. Here’s a recent survey: “Regularization for Deep Learning: A Taxonomy”. There is also the notion that the learning process in Deep Learning is ‘transient chaos’: convergence occurs in a chaotic regime, and given enough epochs the true chaotic phenomena would be revealed. Compare the output at different depths as a function of the input:
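Since early stopping keeps surfacing as the one control that reliably regularizes, here is a minimal patience-based sketch; the validation-loss history is made up, and a real loop would compute it each epoch and checkpoint the best weights:

```python
# Stop once the validation loss has not improved for `patience` epochs.
best, best_epoch, patience = float("inf"), 0, 5
val_history = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.60, 0.61]  # toy values

for epoch, val in enumerate(val_history):
    if val < best:
        best, best_epoch = val, epoch   # a real loop would save weights here
    elif epoch - best_epoch >= patience:
        break                           # stop: validation performance peaked earlier
```

The knob being tuned here is simply how long training runs, which fits the quote above: halting the process early is a stronger regularizer than anything added to the objective.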
However, one has to at least ask: why is it “chaotic” down there where generalization can be found? Could it be that we have encountered a many-body problem? That is, an intelligent system should have multiple perspectives of reality, and thus the transition between perspectives is of a fluid nature?
There’s no magic measure that achieves generalization without actually looking at the validation set; most researchers seem completely blind to this. A system that generalizes well is one that works well on the validation set. It does not have some kind of mystical precognition that tells it one minimum is better than another because of some Bayesian belief that wider or simpler is better. This is why meta-learning methods are effective: they have seen enough validation sets to essentially learn to be adaptive.
The paper by Martin et al. proposes to simplify regularization by focusing on just two knobs for controlling deep learning:
We propose that the two parameters used by Zhang et al. (and many others), which are control parameters used to control the learning process, are directly analogous to load-like and temperature-like parameters in the traditional SM approach to generalization.
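A hedged way to make the temperature-like knob concrete: the Bayesian SGD paper quoted earlier argues that SGD's noise scale grows roughly as g ≈ η·N/B (learning rate times training-set size over batch size), so batch size and learning rate jointly set an effective temperature. A trivial sketch with invented numbers:

```python
def noise_scale(lr, n_train, batch_size):
    # Rough SGD "temperature" proxy: g ~ lr * N / B
    return lr * n_train / batch_size

hot = noise_scale(lr=0.1, n_train=50_000, batch_size=32)     # small batches run hot
cold = noise_scale(lr=0.1, n_train=50_000, batch_size=1024)  # large batches run cold
```

Under this reading, sweeping batch size at fixed learning rate traces out the temperature axis of the phase diagram below.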
They explored the design space using a simple model of deep learning and propose the following phase diagram:
This indeed is a refreshing idea that needs to be explored further using more complex deep learning architectures.
BTW, if you are lost in this discussion (meaning words like regularization, implicit regularization, and generalization are new to you), then here’s a screenshot of the topics in a new course at Stanford that’ll give you some bearings:
Explore more in this new book: