Steve here… I just finished reading a fascinating summary of the tie between the power of neural networks / deep learning and the peculiar physics of our universe. The mystery of why they work so well may be resolved by seeing the resonant homology across the information-accumulating substrate of our universe, from the base simplicity of our physics to the constrained nature of the evolved and grown artifacts all around us. The data in our natural world is the product of a hierarchy of iterative algorithms, and the computational simplification embedded within a deep learning network is also a hierarchy of iteration. Since neural networks are symbolic abstractions of how the human cortex works, perhaps it should not be a surprise that the brain has evolved structures that are computationally tuned to tease apart the complexity of our world.
Here is a collection of interesting plain-text points I extracted from the math in Lin & Tegmark’s article: http://arxiv.org/pdf/1608.08225v1.pdf
“The exceptional simplicity of physics-based functions hinges on properties such as symmetry, locality, compositionality and polynomial log-probability, and we explore how these properties translate into exceptionally simple neural networks approximating both natural phenomena such as images and abstract representations thereof such as drawings. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. Various ‘no-flattening theorems’ show when these efficient deep networks cannot be accurately approximated by shallow ones without efficiency loss.”
This last point reminds me of something I wrote in 2006: “Stephen Wolfram’s theory of computational equivalence suggests that simple, formulaic shortcuts for understanding evolution (and neural networks) may never be discovered. We can only run the iterative algorithm forward to see the results, and the various computational steps cannot be skipped. Thus, if we evolve a complex system, it is a black box defined by its interfaces. We cannot easily apply our design intuition to the improvement of its inner workings. We can’t even partition its subsystems without a serious effort at reverse-engineering.” — 2006 MIT Tech Review: https://www.technologyreview.com/s/406033/technology-design-or-evolution/
Back to quotes from the paper:
Neural networks perform a combinatorial swindle, replacing exponentiation by multiplication: if there are say n = 10^6 inputs taking v = 256 values each, this swindle cuts the number of parameters from v^n to v×n times some constant factor. We will show that the success of this swindle depends fundamentally on physics: although neural networks only work well for an exponentially tiny fraction of all possible inputs, the laws of physics are such that the data sets we care about for machine learning (natural images, sounds, drawings, text, etc.) are also drawn from an exponentially tiny fraction of all imaginable data sets. Moreover, we will see that these two tiny subsets are remarkably similar, enabling deep learning to work well in practice.
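To get a feel for the scale of that swindle, here is a minimal Python sketch of my own (not from the paper) comparing the two counts for the quoted figures, n = 10^6 inputs with v = 256 values each. The function names and the constant factor c = 1 are illustrative assumptions. Since v^n is far too large to print, the sketch reports its base-10 logarithm instead.

```python
import math

def lookup_table_log10(n: int, v: int) -> float:
    """Base-10 logarithm of v**n, the number of distinct possible inputs."""
    return n * math.log10(v)

def swindle_count(n: int, v: int, c: int = 1) -> int:
    """v * n times a constant factor c (c = 1 is an assumption for illustration)."""
    return c * v * n

n, v = 10**6, 256
print(f"v^n is roughly 10^{lookup_table_log10(n, v):,.0f}")
print(f"v*n is {swindle_count(n, v):,}")
```

The first number has millions of digits, while the second is an ordinary nine-digit integer, which is the whole point of replacing exponentiation by multiplication.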
Increasing the depth of a neural network can provide polynomial or exponential efficiency gains even though it adds nothing in terms of expressivity.
Both physics and machine learning tend to favor Hamiltonians that are polynomials — indeed, often ones that are sparse, symmetric and low-order.
1. Low polynomial order
For reasons that are still not fully understood, our universe can be accurately described by polynomial Hamiltonians of low order d. At a fundamental level, the Hamiltonian of the standard model of particle physics has d = 4. There are many approximations of this quartic Hamiltonian that are accurate in specific regimes, for example the Maxwell equations governing electromagnetism, the Navier-Stokes equations governing fluid dynamics, the Alfvén equations governing magnetohydrodynamics and various Ising models governing magnetization; all of these approximations have Hamiltonians that are polynomials in the field variables, of degree d ranging from 2 to 4.
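One way to see why low order matters for parameter counts: a general polynomial of total degree at most d in n variables has C(n+d, d) coefficients, which grows only polynomially in n when d is fixed. The sketch below is my own illustration, with a hypothetical helper name and n = 1000 chosen arbitrarily, and it tabulates that count for d = 2 through 4.

```python
import math

def polynomial_coefficient_count(n: int, d: int) -> int:
    """Number of monomials of total degree <= d in n variables: C(n + d, d)."""
    return math.comb(n + d, d)

n = 1000  # arbitrary number of field variables, chosen only for illustration
for d in (2, 3, 4):
    print(f"d = {d}: {polynomial_coefficient_count(n, d):,} coefficients")
```

Even the quartic case stays astronomically smaller than the v^n entries that an unconstrained function over the same variables would need.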
One of the deepest principles of physics is locality: that things directly affect only what is in their immediate vicinity. When physical systems are simulated on a computer by discretizing space onto a rectangular lattice, locality manifests itself by allowing only nearest-neighbor interaction.
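As a small illustration of nearest-neighbor interaction on a rectangular lattice, here is a sketch of an Ising-style energy in plain Python. It assumes periodic boundaries and a single coupling constant J, and the function names are mine rather than anything from the paper. Each site only ever looks at the site to its right and the site below it, so every bond is counted exactly once.

```python
def ising_energy(spins, J=1.0):
    """Energy -J * sum over nearest-neighbor bonds of s_i * s_j (periodic lattice)."""
    rows, cols = len(spins), len(spins[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            right = spins[r][(c + 1) % cols]   # neighbor to the right, wrapping around
            down = spins[(r + 1) % rows][c]    # neighbor below, wrapping around
            energy -= J * spins[r][c] * (right + down)
    return energy

def bond_counts(rows, cols):
    """Couplings needed with locality versus with all-to-all interactions."""
    sites = rows * cols
    return {"nearest_neighbor": 2 * sites, "all_pairs": sites * (sites - 1) // 2}

all_up = [[1] * 4 for _ in range(4)]
checkerboard = [[1 if (r + c) % 2 == 0 else -1 for c in range(4)] for r in range(4)]
print(ising_energy(all_up), ising_energy(checkerboard))
print(bond_counts(100, 100))
```

The bond count grows linearly with the number of sites under locality, rather than quadratically as it would if every site could couple to every other.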
Whenever the Hamiltonian obeys some symmetry (is invariant under some transformation), the number of independent parameters required to describe it is further reduced. For instance, many probability distributions in both physics and machine learning are invariant under translation and rotation.
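To sketch how translation symmetry cuts the parameter count, the toy example below applies one small kernel at every position of a periodic one-dimensional signal, which is the idea behind a convolutional layer. This is my own illustration under simplifying assumptions (one dimension, periodic boundary, integer values), and all names are hypothetical. The assertion checks that shifting the input and then applying the kernel gives the same result as applying the kernel and then shifting.

```python
def circular_conv(x, kernel):
    """Apply the same local kernel at every position of a periodic 1-D signal."""
    n = len(x)
    return [sum(w * x[(i + j) % n] for j, w in enumerate(kernel)) for i in range(n)]

def shift(x, s):
    """Cyclically translate the signal by s positions."""
    s %= len(x)
    return x[-s:] + x[:-s] if s else list(x)

def parameter_counts(n, k):
    """Weights for a linear map on n inputs: unconstrained, local, local plus shared."""
    return {"dense": n * n, "local": n * k, "local_and_shared": k}

signal = [3, 1, 4, 1, 5, 9, 2, 6]
kernel = [1, -2, 1]
# Translation symmetry: the map commutes with shifts of the input.
assert circular_conv(shift(signal, 3), kernel) == shift(circular_conv(signal, kernel), 3)
print(parameter_counts(n=len(signal), k=len(kernel)))
```

Locality takes the count from n^2 down to n times k, and sharing the kernel across positions takes it down to just k.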
What properties of real-world probability distributions cause efficiency to further improve when networks are made deeper? This question has been extensively studied from a mathematical point of view, but mathematics alone cannot fully answer it, because part of the answer involves physics. We will argue that the answer involves the hierarchical/compositional structure of generative processes together with inability to efficiently “flatten” neural networks reflecting this structure.
A. Hierarchical processes
One of the most striking features of the physical world is its hierarchical structure. Spatially, it is an object hierarchy: elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc. Causally, complex structures are frequently created through a distinct sequence of simpler steps.
We can write the combined effect of the entire generative process as a matrix product.
If a given data set is generated by a (classical) statistical physics process, it must be described by an equation in the form of [a matrix product], since dynamics in classical physics is fundamentally Markovian: classical equations of motion are always first order differential equations in the Hamiltonian formalism. This technically covers essentially all data of interest in the machine learning community, although the fundamental Markovian nature of the generative process of the data may be an inefficient description.
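The matrix-product form can be sketched with a toy two-state Markov process. The transition matrices below are made-up numbers, the helper names are mine, and I assume the column convention where entry [i][j] is the probability of moving to state i given state j. The point is only that applying the steps one at a time gives the same distribution as multiplying the matrices first and applying the product once.

```python
from functools import reduce

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, x):
    """Matrix times column vector."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def combined_process(steps):
    """Given [M1, M2, ..., Mn] in the order applied, return Mn ... M2 M1."""
    return reduce(lambda acc, M: matmul(M, acc), steps)

M1 = [[0.9, 0.2], [0.1, 0.8]]  # hypothetical first step; each column sums to 1
M2 = [[0.5, 0.3], [0.5, 0.7]]  # hypothetical second step
p0 = [1.0, 0.0]                # start with certainty in state 0

stepwise = matvec(M2, matvec(M1, p0))
at_once = matvec(combined_process([M1, M2]), p0)
print(stepwise, at_once)
```

Both routes should agree up to floating-point rounding, and the result still sums to 1 because a product of column-stochastic matrices is itself column-stochastic.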
The success of shallow neural networks hinges on symmetry, locality, and polynomial log-probability in data from or inspired by the natural world, which favors sparse low-order polynomial Hamiltonians that can be efficiently approximated. Whereas previous universality theorems guarantee that there exists a neural network that approximates any smooth function to within an error ε, they cannot guarantee that the size of the neural network does not grow to infinity with shrinking ε or that the activation function σ does not become pathological. We show constructively that given a multivariate polynomial and any generic non-linearity, a neural network with a fixed size and a generic smooth activation function can indeed approximate the polynomial highly efficiently.
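As I understand the constructive argument, the basic building block is approximating a product of two numbers with a handful of neurons, since polynomials are built from sums and products. The sketch below is my own reconstruction from a Taylor expansion, not code from the paper: for a smooth activation σ with nonzero second derivative at 0, the combination σ(a+b) + σ(−a−b) − σ(a−b) − σ(−a+b) equals 4σ''(0)·ab plus higher-order terms. I picked softplus as the activation (its second derivative at 0 is 1/4), and the scale factor lam is an assumed knob that shrinks the inputs so the higher-order terms become negligible.

```python
import math

def softplus(x: float) -> float:
    """Smooth activation ln(1 + e^x); its second derivative at 0 is 1/4."""
    return math.log1p(math.exp(x))

SIGMA2 = 0.25  # second derivative of softplus at 0

def approx_product(u: float, v: float, lam: float = 0.01) -> float:
    """Approximate u * v using four softplus neurons and one linear output."""
    a, b = lam * u, lam * v  # shrink inputs so the Taylor expansion is accurate
    hidden = (softplus(a + b), softplus(-a - b), softplus(a - b), softplus(-a + b))
    combo = hidden[0] + hidden[1] - hidden[2] - hidden[3]
    return combo / (4.0 * SIGMA2 * lam ** 2)  # undo the input scaling

for u, v in [(3.0, -2.0), (0.5, 0.5), (1.5, 4.0)]:
    print(u, v, approx_product(u, v), u * v)
```

The network size stays fixed at four hidden neurons no matter how small the error needs to be; accuracy is tuned by making lam smaller (within the limits of floating-point precision), not by adding neurons.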
The success of deep learning depends on the ubiquity of hierarchical and compositional generative processes in physics and other machine-learning applications.
And thanks to Tech Review for the pointer to this article: https://www.technologyreview.com/s/602344/the-extraordinary-link-between-deep-neural-networks-and-the-nature-of-the-universe/