How to Explain Deep Learning using Chaos and Complexity
I want to talk to you today about Non-Equilibrium Information Dynamics and how an understanding of its features leads us to a better intuition about Deep Learning systems, or learning systems in general.
Allow me to recap my observation from a previous post on “Deep Learning in Non-Equilibrium Dynamics.” In our study of Deep Learning, practitioners derive their intuition from the mathematics of physical systems. However, since what we study are information systems rather than physical systems, we apply information-theoretic principles. Information theory, in turn, also has its origins in the mathematics that describes physics (i.e., Thermodynamics). Both theories are essentially bulk observations of nature. By bulk, I mean that they are aggregate measures of systems with a large number of interacting particles or entities.
Kieran D. Kelly [KELLY], whose writing I recently stumbled upon, has one of the better intuitions out there about non-equilibrium dynamics. His blog is a pleasure to read, and I recommend it highly for anyone interested in this kind of esoteric thing.
Wired has posted an article titled “Move Over Coders — Physicists will soon Rule Silicon Valley” [WIRED]. Now, we might observe that Physicists, in general, have to have a decent IQ to do what they do and thus be able to handle computer science. We can also argue that the mathematics found in Deep Learning isn’t that advanced compared to what’s found in a typical undergraduate physics curriculum (emphasis on undergraduate). However, there is something else that most people do not understand, but it is generally understood by someone studying physics.
What people can’t seem to comprehend, even folks with a technical background such as computer science or mathematics, is the relationship between math and reality. They don’t recognize that the math we use is just an approximation of reality, and that math has limitations beyond certain dimensions. People doing physics know this because, despite using analytic forms, we are constantly performing hand-waving approximations (e.g., use a Taylor series to expand any function and throw out every term beyond the quadratic). So when I write about the limits of math with respect to AI, I get a ton of outrage from math-inclined folk! The ignorance in this world, even among the learned, is really surprising.
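To make the hand-waving concrete, here is a minimal sketch of the "throw out everything beyond the quadratic" move. The choice of cos(x) and the function names are my own assumptions for illustration; any smooth function would do.

```python
import math


def cos_quadratic(x: float) -> float:
    """Second-order Taylor expansion of cos(x) around 0: 1 - x^2 / 2."""
    return 1.0 - 0.5 * x * x


def approximation_error(x: float) -> float:
    """Absolute gap between the exact value and the truncated series."""
    return abs(math.cos(x) - cos_quadratic(x))


if __name__ == "__main__":
    # The approximation is excellent near the expansion point and
    # degrades quickly as we move away from it.
    for x in (0.1, 0.5, 1.0, 2.0, 3.0):
        print(
            f"x={x:>4}: exact={math.cos(x):+.4f} "
            f"approx={cos_quadratic(x):+.4f} "
            f"error={approximation_error(x):.4f}"
        )
```

The truncated form is only trustworthy close to the point we expanded around, which is exactly the kind of limit the physicist keeps in mind.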
Going back to Kelly, he echoes the same sentiment about math and reality:
Physics is, in a sense, a science of linear dynamics, a science of “dynamics without feedback”; such dynamics are indeed easily compressible, but the real world is a world that abounds with feedback, a “nonlinear” world full of “incompressible dynamics” [KEL].
For many, this statement may come as a shock. But it really should not; it is just a basic reality that analytic forms have limits. Another thing that seems to confuse people is the Physicist’s use of the words “linear” and “non-linear.” Most people take “linear” to mean a linear equation, and I suppose “non-linear” to mean anything that is not. So a quadratic equation qualifies as non-linear. The Physicist, however, defines linear and non-linear from the point of view of differential equations. A linear differential equation has a chance of being solvable in closed form.
In contrast, with non-linear differential equations, almost all bets are off. The most classic example is the Navier-Stokes equations for fluids. Smooth solutions are known to exist for all time only in 2 dimensions; in 3 dimensions the question is still open. Yes, two dimensions, which is an unrealistic flatland world.
Think of non-linear systems, though, as systems that have feedback. In other words, most of our reality. So to understand a bit about our reality, we have to understand a bit about the nature of non-linearity. It turns out that, over the years, two features of feedback systems have been studied: chaos and complexity. Kelly has a whole set of articles about these two subjects, and I’ll redirect you there for an introduction.
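For a quick feel of how feedback produces chaos, here is a small sketch using the logistic map, a standard textbook example that I am bringing in for illustration (it does not come from Kelly's articles). The output of each step is fed back in as the next input; the function names and parameter values are my own assumptions.

```python
def logistic_trajectory(r: float, x0: float, steps: int) -> list[float]:
    """Iterate the feedback rule x -> r * x * (1 - x), starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(r: float, x0: float, delta: float, steps: int) -> float:
    """Largest gap between two trajectories whose starts differ by delta."""
    a = logistic_trajectory(r, x0, steps)
    b = logistic_trajectory(r, x0 + delta, steps)
    return max(abs(p - q) for p, q in zip(a, b))


if __name__ == "__main__":
    # Weak feedback (r = 2.5): a tiny difference in the start stays tiny.
    print("r=2.5:", max_divergence(2.5, 0.2, 1e-9, 100))
    # Strong feedback (r = 4.0): the same tiny difference blows up.
    print("r=4.0:", max_divergence(4.0, 0.2, 1e-9, 100))
```

The rule is the same in both cases; only the strength of the feedback changes. At r = 2.5 the system settles onto a fixed point, while at r = 4.0 two nearly identical starting points end up on unrelated trajectories.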
Now what I want to focus on is information systems (not physical systems), so what we are looking for is chaos and complexity in the context of information systems. (side note: Deep Learning systems are information systems despite the poor association with the term Neural Networks). So here’s the very nice table from Kelly:
What drives evolution’s spontaneous and progressive complexity is the interplay of insufficient negative feedback and strong positive feedback; or in other words what drives evolution is The Interplay of Random Innovation and Natural Reinforcement.
Negative feedback here is the natural tendency expressed by the Second Law of Thermodynamics (which really is the law of large numbers). That is, systems tend towards maximum entropy. The positive feedback, however, is a mechanism that can lead to chaos. But in the upper right quadrant, we discover emergent complexity. In other words, one has to embrace the existence of mutual feedback as well as randomness. Unfortunately, our mathematical legacy, that of assuming nice independent Gaussian distributions and favoring sparsity (or parsimony) over randomness, demands an unnatural constraint on the system.
An assumption of IID (i.e., independent and identically distributed) features and an assumption that sparsity is the favored solution are walking every researcher in an entirely wrong direction! These assumptions are the equivalent of physicists making their equations linear. It is all done so that our mathematics becomes convenient. Unfortunately, God did not mandate that reality be conveniently expressed in mathematics. We are pushing our researchers to buy into religion and not reality.
Now, before I completely forget, let me explain how chaos and complexity relate to explaining Deep Learning. Let’s start with randomness, or entropy, which I wrote about in “The Unreasonable Effectiveness of Randomness”. When we study Deep Learning, we simply can’t ignore the presence of randomness. It just seems to be an intrinsic feature of these systems. The simplest intuition I can think of here is that diversity leads to survivability. Monocultures tend toward less adaptability and possible extinction. The most counter-intuitive notion is that randomness leads to information preservation. An example of this in computer science is “Information Dispersal Algorithms.” That is, you take information and scatter it among different storage nodes, and on a massive scale you do it randomly. You build storage that is highly redundant. This is the same mechanism that you find in holographic memories. So here, we establish the value of high entropy.
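Here is a toy sketch of the scatter-it-randomly idea. To be clear about the assumptions: this is plain random replication, which is much cruder than a real information dispersal scheme (those typically rely on erasure coding rather than full copies), and the function names, node count, and replica count are all hypothetical choices of mine.

```python
import random


def disperse(chunks, num_nodes, replicas, rng):
    """Place each chunk on `replicas` distinct, randomly chosen nodes."""
    nodes = [dict() for _ in range(num_nodes)]
    for index, chunk in enumerate(chunks):
        for node_id in rng.sample(range(num_nodes), replicas):
            nodes[node_id][index] = chunk
    return nodes


def recover(nodes, failed, num_chunks):
    """Rebuild the message from surviving nodes; None if any chunk is lost."""
    recovered = {}
    for node_id, store in enumerate(nodes):
        if node_id not in failed:
            recovered.update(store)
    if len(recovered) < num_chunks:
        return None
    return "".join(recovered[i] for i in range(num_chunks))


if __name__ == "__main__":
    message = "randomness leads to information preservation"
    chunks = [message[i:i + 4] for i in range(0, len(message), 4)]
    nodes = disperse(chunks, num_nodes=10, replicas=3, rng=random.Random(0))
    # With 3 replicas per chunk, losing any 2 nodes cannot destroy a chunk.
    print(recover(nodes, failed={0, 1}, num_chunks=len(chunks)))
```

No single node holds the whole message, and no single node is essential to it. The information survives because it has been spread out at random.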
Let’s examine the other axis, that of high mutual information that can lead to unstable feedback and thus chaos. Mutual Information is the antithesis of many probabilistic methods. That’s because the math simply can’t handle it. But should we shoehorn reality to fit the math? I think not. One of the better characterizations of how Deep Learning is able to work well in domains of higher mutual information is the paper “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language”:
Source: http://arxiv.org/abs/1606.06737v2 [LIN/TEG]
How can we know when machines are bad or good? The old answer is to compute the loss function. The new answer is also to compute the mutual information as a function of separation, which can immediately show how well the model is doing at capturing correlations on different scales.
Deep Learning must be able to learn correlations at multiple scales to be of any use. To phrase it in a way that makes more sense: Deep Learning must be able to understand the composition of language, from letters to words, to sentences, and eventually to complete texts. Deep Learning works because it captures language.
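To give a feel for what "mutual information as a function of separation" means, here is a rough sketch that estimates it from raw symbol counts in a string. This is my own simple plug-in estimate, not the procedure used in the paper, and it is known to be biased upward on short texts; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information_at_distance(text: str, d: int) -> float:
    """Empirical mutual information (in bits) between symbols d positions apart."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of raw counts.
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log " * 20
    for d in (1, 2, 4, 8, 16):
        print(d, round(mutual_information_at_distance(sample, d), 4))
```

Run on text generated by a model and on the text it was trained on, a curve like this gives one way to compare how well correlations are captured at short and long range.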
And what exactly is the learning mechanism for this? Jeremy England has a very compelling argument as to how life self-organizes. You can read it at Quanta: “A New Physics Theory of Life” [ENG]. We can take this idea and use it to explain how learning works in Deep Learning. I’ve written earlier about the 3 Ilities, and an explanation of “Trainability” is significant here. A layered DL system builds a representation of language from the lower layers up to the more abstract higher layers. Each layer has its mutual entanglement that is discovered through training. Over time, the entanglement gets reinforced such that breaking it becomes less likely. So, for example, if the network only sees Latin characters, then it never develops the ability to understand Arabic characters. Layers are also interconnected, so there is a constraint at the bottom (more fundamental concepts) and at the top (minimizing relative entropy). So eventually, a language hierarchy is built.
The objection here though is that it should take an infinite amount of time to arrive at a proper representation. That’s where the interplay of entropy comes into the picture. The basic theory is not unlike that of the holographic principle. Randomness begets robustness while mutual information begets self-organization and compression. What begets generalization? Not sure, but something seems to emerge at the upper right-hand quadrant!