Dissipative Adaptation: The Origins of Life and Deep Learning
In a previous post, I wrote about Deep Learning not being Probabilistic Induction and actually being something else entirely. However, I didn’t explain how the mechanism of that something else works. In this piece, I will describe how a theory of the emergence of life could also explain the mechanism of Deep Learning.
The work of Jeremy England first captured my attention in the Quanta article “A New Physics Theory of Life”. The mechanism that England proposes is called “Dissipative Adaptation”. It is a theory based on non-equilibrium statistical mechanics, and it assumes that life resides in a regime that is far from equilibrium. This is the same thinking as in non-equilibrium information dynamics, where Deep Learning is also treated as a non-equilibrium process. In this post, I will explain how this theory can be used as an explanation for the self-organizing behavior of Deep Learning.
Deep Learning can be modeled as a continuous dynamical system. Its driver is Stochastic Gradient Descent (SGD), which is steered by an objective function (typically in the form of the discrepancy between prediction and reality). One can look at SGD as an equation of motion or, alternatively, as an equation of evolution of the parameters (i.e. weights) of a neural network. For example, Tomaso Poggio, in his framework to characterize Deep Learning, uses the dissipative Langevin equation.
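To make the “equation of motion” reading concrete, here is a minimal sketch. It is not Poggio’s actual formulation: it assumes the generic overdamped Langevin form dw = -grad L(w) dt + sqrt(2T) dB and discretizes it with an Euler-Maruyama step, which looks like gradient descent plus Gaussian noise. The quadratic loss, the `temperature` parameter, and the function names are illustrative assumptions.

```python
import numpy as np

def grad_loss(w):
    """Gradient of the toy loss L(w) = 0.5 * ||w||^2 (an assumed stand-in objective)."""
    return w

def langevin_sgd(w0, lr=0.01, temperature=0.01, steps=5000, seed=0):
    """Euler-Maruyama discretization of dw = -grad L(w) dt + sqrt(2 T) dB.

    With temperature = 0 this reduces to plain gradient descent.
    """
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        noise = rng.standard_normal(w.shape)
        # Drift term pulls the weights downhill; noise term plays the role of the thermal bath.
        w = w - lr * grad_loss(w) + np.sqrt(2.0 * temperature * lr) * noise
    return w

w_noisy = langevin_sgd([3.0, -2.0])                 # hovers near the minimum, never exactly on it
w_det = langevin_sgd([3.0, -2.0], temperature=0.0)  # settles onto the minimum
print(np.linalg.norm(w_noisy), np.linalg.norm(w_det))
```

In this toy setting the noisy trajectory keeps fluctuating around the minimum instead of freezing on it, which is the sense in which the dynamics stay away from a static equilibrium point.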
The Bellman equation in Reinforcement Learning is closely related to the Hamilton-Jacobi equation that we find in physics, which describes the evolution of a dynamical system. Its continuous-time form is known as the Hamilton-Jacobi-Bellman equation.
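For reference, the two equations are usually written as follows. This is a sketch in standard textbook notation, not necessarily the form shown in the original post; the symbols V, r, \gamma, P, \ell, and f are assumptions introduced here. The second equation is the continuous-time counterpart of the first, with V read as an optimal cost-to-go rather than a reward.

```latex
% Discrete-time Bellman optimality equation (reinforcement learning)
V(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big]

% Continuous-time Hamilton-Jacobi-Bellman equation for dynamics \dot{x} = f(x,u)
% and running cost \ell(x,u), with V the optimal cost-to-go
-\frac{\partial V}{\partial t}(x,t) = \min_{u} \Big[ \ell(x,u) + \nabla_x V(x,t) \cdot f(x,u) \Big]
```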
So it’s established that there is a relationship between the equations of evolution in physical systems and those of Deep Learning networks. However, the methods of Deep Learning have their origins in optimization methods. Optimization methods arrive at convergence when a global extremum is discovered as a solution to the objective function. DL systems differ from classical optimization in that they are overly parameterized and the goal is not optimization but rather another objective known as generalization. Generalization itself is a complicated subject; however, the ‘theory’ here is that SGD will arrive at a stable minimum and as a consequence generalization will be achieved. The open question is, why does stochastic gradient descent (SGD) even converge?
Classic optimization will tell you that the high-dimensional spaces found in Deep Learning are problematic. Yet for Deep Learning practitioners, stochastic gradient descent works surprisingly well. This is unintuitive for many experts in the optimization field. High dimensional problems are supposed to be non-convex and therefore extremely hard to optimize. An extremely simplistic method like SGD is not expected to be effective in the high complexity and high dimensionality space that deep learning networks find themselves in.
Experimental evidence has shown that in high dimensional spaces, a critical point of the objective is far more likely to be a saddle point than a true local minimum. A saddle point gifts the optimization process with many more directions along which to escape and keep moving forward. This argument explains why large networks don’t often appear to get stuck in a non-optimal state. I therefore propose that rather than think of Deep Learning from the more conventional viewpoint of being optimization, one should think of Deep Learning instead as a physical system residing in a non-equilibrium regime. This approach aligns much better with the experimental evidence. Furthermore, it aligns with another theme: that an approach to understanding complexity should be based on physical motivations and not abstract mathematical ones. I have discussed this earlier in “Chaos and Entanglement in Disguise”.
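The saddle point argument can be illustrated with a rough numerical experiment. The sketch below assumes that the Hessian at a critical point can be modeled as a random symmetric Gaussian matrix, which is a simplification and not a property of any particular network. It counts how often all eigenvalues are positive, i.e. how often the critical point would be a true minimum. The function name and trial counts are illustrative.

```python
import numpy as np

def fraction_of_minima(dim, trials=2000, seed=0):
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all positive."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((dim, dim))
        hessian = (a + a.T) / 2.0  # symmetrize so the eigenvalues are real
        if np.all(np.linalg.eigvalsh(hessian) > 0):
            count += 1
    return count / trials

# Everything that is not counted here has at least one non-positive
# curvature direction, so it is a saddle point or a maximum.
fractions = {dim: fraction_of_minima(dim) for dim in (1, 2, 5, 10)}
for dim, frac in fractions.items():
    print(dim, frac)
```

Under this assumption the share of true minima collapses quickly as the dimension grows, which is consistent with the claim that almost every critical point in a large network offers some direction of escape.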
England’s phenomenon of Dissipative Adaptation is a mechanism found in dynamical systems that may explain how and why deep learning systems converge into stable attractor basins.
Dissipative Adaptation provides an explanation as to why self-replicating structures arise in physical systems. It describes the dynamics of a system that is in contact with a thermal reservoir and is also acted on by an external energy source. In such a system, different configurations are not equally able to absorb energy from that external source. The absorption of energy from the external source allows the system configuration to traverse activation barriers too high to jump rapidly by thermal fluctuations alone. If energy is dissipated after a jump, then this energy is not available for the system to reversibly jump back to where it came from. Even though any given change in configuration of the system is random, the most likely configuration (as a consequence of irreversibility) happens to be the configuration that aligns more efficiently with the absorption and dissipation of external energy.
Artificial neural networks are not physical systems, in that there is no notion of energy. Instead, the relevant measure is the relative entropy or, alternatively, the fitness function. It is a measure of how closely the prediction matches the observation. The self-similarity of a neural network implies that at all components, down to the most basic neuron, there is a function that computes a similarity between observation and prediction.
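As a small illustration of relative entropy as the stand-in for energy, the sketch below computes the Kullback-Leibler divergence between an observed distribution and two predictions. The example distributions and the function name are made up for illustration; a smaller value means the prediction is better aligned with the observation.

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i), in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

observed = [0.7, 0.2, 0.1]
good_prediction = [0.6, 0.3, 0.1]
poor_prediction = [0.1, 0.2, 0.7]

kl_good = relative_entropy(observed, good_prediction)
kl_poor = relative_entropy(observed, poor_prediction)
print(kl_good, kl_poor)
```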
The analogue of the external energy source in the neural network context is the stream of external observations of the system. Through training, the network is subjected to perturbations that drive the system towards minimizing entropy. This propagates down to every neuron such that those neurons that are aligned to the perturbations are those likely to remain aligned. (Note: The mechanism for alignment is the similarity operator.) A neuron’s activation function is equivalent to that energy barrier. (Actually in reverse: if sufficiently not aligned, it gets removed.)
The activation function acts like an irreversible operation once the entropy moves to a lower state. With the passage of training, the memory of these less erasable changes accumulates, and the system increasingly adopts a model that is best adapted to the training data. So from an initial random model, the neural network evolves into a model that is adapted to the stochastic observations (i.e. SGD) that it is trained under. (I personally am deeply suspicious of the activation function and have a hunch that it is not only not necessary but perhaps even detrimental.)
Have you ever wondered what the activation function is for? It is there not to satisfy some vague need for “non-linearity”; rather, it is there to serve as an irreversible selection operator. Deep Learning networks are, from the perspective of physics, a linear system. Non-linearity in the physical sense exists because of feedback, and there is only truncated feedback in these networks. The word non-linear as used in Deep Learning simply means something other than a straight line.
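One way to see the “irreversible” reading is that a rectifying activation is many-to-one: once an input has been cut off, the original value cannot be recovered from the output. The sketch below assumes ReLU as the activation, which the text does not specify, and the sample inputs are arbitrary.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive values, maps everything else to zero."""
    return np.maximum(x, 0.0)

inputs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
outputs = relu(inputs)

# Several distinct inputs collapse onto the same output, so the map has no inverse.
distinct_inputs = len(np.unique(inputs))
distinct_outputs = len(np.unique(outputs))
print(outputs, distinct_inputs, distinct_outputs)
```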
Perhaps the continuity requirements of backpropagation, independent of stochasticity, ensure that there are no big local transitions in model changes. Significant cumulative observations are required to achieve a persistent change in structure. Backpropagation ensures that the random changes due to training are not purely random, but rather constrained to changes that preserve continuity.
Each individual neuron adjusts its weights in the direction of minimum entropy. That is, it adjusts against the gradient and thus in the direction of steepest entropy reduction. Each neuron evolves to its local minimum and is unable to extricate itself unless a sufficient accumulation of observation signal exists in the training data. As more and more neurons arrive at minima, larger collective cliques of neurons are formed. These cliques become more difficult to break up. Only coordinated signals are able to break up cliques, and as these cliques become larger, more observations will need to be in synchronization.
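The per-neuron update described above can be sketched for a single sigmoid neuron. This is an illustrative setup only: it assumes cross-entropy as the entropy-like measure, a synthetic dataset, and full-batch gradient steps instead of stochastic ones; all names are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, x, y):
    """Mean cross-entropy between labels y and the neuron's predictions."""
    p = sigmoid(x @ w)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def gradient_step(w, x, y, lr=0.5):
    """Move the weights against the gradient, i.e. in the direction of steepest decrease."""
    p = sigmoid(x @ w)
    grad = x.T @ (p - y) / len(y)
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 3))   # synthetic observations
y = (x[:, 0] > 0).astype(float)     # label depends only on the first feature
w = np.zeros(3)

losses = [cross_entropy(w, x, y)]
for _ in range(50):
    w = gradient_step(w, x, y)
    losses.append(cross_entropy(w, x, y))
print(losses[0], losses[-1], w)
```

The loss falls step by step while the weight on the informative feature grows, which is the single-neuron version of aligning with a repeated signal in the training data.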
The essence of England’s model is that it explains the persistence of structures that are in tune with the environment. The equivalent of this from the DL perspective is that neurons that are able to match repeating observations of a training set (i.e. the environment) are more likely to persist through the epochs.
There are, however, differences between England’s model and DL architecture. England’s system makes the adjustment only with sufficient alignment. DL systems make an adjustment when alignment is insufficient, that is, when the activation falls below a threshold. So DL systems explicitly favor the replication of aligned neurons. England’s systems favor the persistence of components that accumulate because they become more irreversible over time. DL systems also have built-in mechanisms for memory: they remember by default, and forgetting is irreversible. England’s systems do not have memory, but achieve the equivalent through irreversibility.
The main commonality between England’s system and DL is the alignment of components to the direction of maximum energy dissipation or information gradient. Both are learning systems in that they adapt to the environment. This hints at a reality that (1) one can create learning systems with components that are even simpler than those found in DL systems and (2) the universe is made up of deep learning like systems.
A DL system that mirrors the architecture of England’s Dissipative Adaptation would be different from today’s DL systems. It will only remember what matches its neurons, it will be less volatile with what it matches often, and it will simply ignore non-matching information. It does not require an activation function, since the only neurons that are active are the ones that it remembers. This is indeed radical and interesting at the same time.
This is how Dissipative Adaptation explains trainability. However, Dissipative Adaptation does not explain expressivity or even generalization.
Editor’s Note: The above article is still a bit of a mess; I will work on improving it over time.