Entropy, Chaos, and Intelligence

The most interesting phenomena in living organisms result from networks of interactions, whether among neurons in the brain, genes in a single cell, or amino acids in a single protein molecule. Scientists have long hoped that these collective biological phenomena could be described within the framework of statistical mechanics.

“Especially in the context of neural networks, there is a long tradition of using ideas from statistical physics to think about the emergence of collective behavior from the microscopic interactions, with the hope that this functional collective behavior will be robust to our ignorance of many details in these systems.”

In my last article, I briefly introduced the concept of ‘intelligence’ in qualitative terms. Here, I want to present attempts made by scientists in different domains to quantify “intelligence”. Definitions of intelligence vary across domains. In cosmology, for example, a variety of threads of evidence suggest that our universe appears to be finely tuned for the development of intelligence, and in particular for the development of universal states that maximize the diversity of possible futures.

In game playing, for example, many people know that IBM’s Deep Blue beat Garry Kasparov at chess in 1997, but fewer are aware that in the past ten years the game of Go, arguably a much more challenging game because of its far higher branching factor, has also started to succumb to computer players. This looks like intelligent action on the computer’s part, and we may find that the best techniques for playing games like Go are techniques that try to maximize future options during game play. In robotic motion planning, a wide variety of recent techniques exploit a robot’s ability to maximize its future freedom of action in order to accomplish complex tasks and behave intelligently.

Imagine that there is an underlying mechanism for intelligence that we can build around: if we can find the right equation for intelligence, we may be able to build an intelligent machine sooner rather than later. Recent advances in fields ranging from cosmology to computer science have hinted at a possible deep connection between intelligence and entropy maximization. In geoscience, entropy production maximization has been proposed as a unifying principle for non-equilibrium processes underlying planetary development and the emergence of life. In computer science, maximum entropy methods have been used for inference in situations with dynamically revealed information.

Even in the field of deep learning, maximum entropy (ME) learning algorithms have been proposed for deep belief networks (DBNs), designed specifically to handle limited training data. Maximizing only the entropy of the parameters in the DBN yields better generalization, less bias towards the data distribution, and more robustness to over-fitting than the regular maximum likelihood method. However, there is no formal physical relationship between the two notions of entropy yet.

So from here, we will examine a well-established concept, the maximum entropy construction, and explore its connection to ‘intelligence’. By digging deep, we may be able to push past religion and philosophy to uncover the truth of intelligence.

Let’s first take a look at the application of entropic force in physical systems. A non-equilibrium physical system’s bias towards maximum instantaneous entropy production is reflected by its evolution toward higher-entropy macroscopic states, a process characterized by the formalism of entropic force. Here, the entropic force F associated with a macrostate partition {X} is given by:

F(X0) = T ∇X S(X) |X0

where T is a temperature-like strength parameter and S(X) is the entropy associated with macrostate X.

A causal macrostate X with time horizon tau, consisting of path microstates x(t) that share a common initial system state x(0), in an open thermodynamic system with initial environment state x*(0)

What you are seeing here is a statement of correspondence: intelligence is a force F that acts to maximize future freedom of action, that is, to keep options open, with some strength T, over the diversity of possible accessible futures S, up to some future time horizon tau. In short, you can think of intelligence as not liking to be trapped, so it tries to maximize its future freedom of action. Note that we are not greedily maximizing instantaneous entropy production; instead, we uniformly maximize entropy production between the present and a future time horizon, i.e. long-term entropy.
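As a rough illustration (this is my own toy sketch, not the causal-entropic-force formalism itself), here is a tiny Python agent on a grid that scores each immediate move by how many distinct states it can still reach within a horizon tau, a crude proxy for “future freedom of action”; the grid, moves, and horizon are all invented for this example.

```python
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def neighbors(state, walls, size):
    """Legal one-step moves from `state` on a size x size grid."""
    x, y = state
    for dx, dy in MOVES:
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in walls:
            yield nxt

def reachable(state, walls, size, tau):
    """All states reachable from `state` within tau steps (breadth-first)."""
    frontier, seen = {state}, {state}
    for _ in range(tau):
        frontier = {n for s in frontier for n in neighbors(s, walls, size)} - seen
        seen |= frontier
    return seen

def best_move(state, walls, size, tau):
    """Choose the move that keeps the most distinct futures open."""
    return max(neighbors(state, walls, size),
               key=lambda s: len(reachable(s, walls, size, tau)))

# On an open 5x5 grid, an agent at the edge cell (1, 0) prefers the interior
# cell (1, 1), which keeps the most states reachable within 2 steps.
choice = best_move((1, 0), set(), 5, 2)
```

The agent heads away from corners and walls without any explicit goal, purely because open space keeps more futures accessible.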

“In statistical mechanics, entropy (usual symbol S) is related to the number of microscopic configurations Ω that a thermodynamic system can have when in a state as specified by some macroscopic variables. Specifically, assuming for simplicity that each of the microscopic configurations is equally probable, the entropy of the system is the natural logarithm of that number of configurations, multiplied by the Boltzmann constant kB.”
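To make the quoted definition concrete, here is a minimal sketch of S = kB ln Ω in Python; the microstate counts are made up for illustration.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact in the 2019 SI)

def boltzmann_entropy(omega):
    """Entropy of a system with `omega` equally probable microstates."""
    return K_B * math.log(omega)

# Doubling the number of accessible microstates always adds k_B * ln(2)
# of entropy, regardless of how many microstates there were to begin with.
delta = boltzmann_entropy(2_000_000) - boltzmann_entropy(1_000_000)
```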

For the deep learning domain, the term ‘entropy’, or ‘information entropy’, is the more general measure used in information theory. First, let’s take a look at information theory: it is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. Information theory was originally invented to study the problem of sending messages from discrete alphabets over a noisy channel, such as communication via radio transmission. Among other things, it tells us how to design optimal codes and how to calculate the expected length of messages sampled from specific probability distributions under various encoding schemes.

“Information theory is based on probability theory and statistics. Information theory often concerns itself with measures of information of the distributions associated with random variables. Important quantities of information are entropy, a measure of information in a single random variable, and mutual information, a measure of information in common between two random variables.”

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. We would like to quantify information in a way that formalizes this intuition:

  1. Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
  2. Less likely events should have higher information content.
  3. Independent events should have additive information.

To satisfy all three of these properties, we define the self-information of an event X = x to be:

I(x) = −log P(x)

The choice of logarithmic base in the above formula determines the unit of information entropy. Our definition of I(x) uses the natural logarithm and is therefore written in units of nats: one nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons, based on the binary logarithm. Information measured in bits is just a rescaling of information measured in nats.
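The three properties above can be sketched in a few lines of Python (the function name here is my own):

```python
import math

def self_information(p, base=math.e):
    """I(x) = -log_base P(x): nats for base e, bits/shannons for base 2."""
    return -math.log(p, base)

# Property 1: a guaranteed event carries no information.
# Property 2: rarer events carry more information.
# Property 3: for independent events, P(a, b) = P(a) * P(b), so taking the
#             log makes information additive: I(a, b) = I(a) + I(b).
# Bits are just nats rescaled by ln(2).
i_bits = self_information(0.5, base=2)
i_nats = self_information(0.5)
```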

In sum, entropy is a measure of the unpredictability of a state, or equivalently, of its average information content. Entropy only takes into account the probability of observing a specific event, so the information it encapsulates is information about the underlying probability distribution, not the meaning of the events themselves. Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(X) = E[I(x)] = −Σx P(x) log P(x)

“Consider the example of a coin toss: assuming the probability of heads is the same as the probability of tails, then the entropy of the coin toss is as high as it could be. This is because there is no way to predict the outcome of the coin toss ahead of time: the best we can do is predict that the coin will come up heads, and our prediction will be correct with probability 1/2. Such a coin toss has one shannon of entropy since there are two possible outcomes that occur with equal probability, and learning the actual outcome contains one shannon of information. Contrarily, a coin toss with a coin that has two heads and no tails has zero entropy since the coin will always come up heads, and the outcome can be predicted perfectly.”

Entropy H(X) (i.e. the expected surprisal) of a coin flip, measured in shannons, graphed versus the bias of the coin Pr(X = 1), where X = 1 represents a result of heads
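The coin-toss example can be checked numerically with a short sketch (the function name is my own):

```python
import math

def shannon_entropy(probs, base=2):
    """H(X) = -sum_x P(x) log P(x), in shannons (bits) by default."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = shannon_entropy([0.5, 0.5])   # maximal uncertainty: 1 shannon
two_headed = shannon_entropy([1.0])       # no uncertainty: 0 shannons
biased = shannon_entropy([0.9, 0.1])      # somewhere between the extremes
```

A biased coin is more predictable than a fair one, so its entropy falls strictly between the two endpoints of the graph above.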

In the deep learning field, some researchers try to raise the ‘sampling temperature’ to make a network more ‘creative’, as in these image recognition neural network and character-based recurrent neural network (RNN) examples. In other words, creativity is traded off against coherence.
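The “sampling temperature” trick can be sketched directly: dividing a network’s output logits by a temperature before the softmax flattens or sharpens the distribution we sample from, and the entropy of the result rises with temperature (the logits here are invented for illustration).

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: T > 1 flattens, T < 1 sharpens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs):
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]                   # made-up output scores
cold = entropy_bits(softmax(logits, temperature=0.5))
hot = entropy_bits(softmax(logits, temperature=2.0))
# Higher temperature -> higher-entropy, more 'creative' samples.
```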

Can we link the concept of creativity to stochasticity?

Actually, the mechanism behind entropy maximization seems to align with this belief, since maximizing future possibilities is, in a sense, similar to stochasticity. Ironically, this idea seems opposite to the views on creativity espoused by deep learning pioneer Juergen Schmidhuber, who suggests that low entropy, in the form of short description length, is a defining characteristic of art. Let’s explore the relationship between stochasticity and pattern, and between simplicity and complexity, by looking back at history.

We have struggled for a long time to reveal the hidden face of nature: its most basic and simplest laws have the power to be unpredictable. It is also a story about the strange relationship between ‘order’ and ‘chaos’. The idea can be traced back to Alan Turing (born in 1912), who might have been the first person to realize that simple mathematical equations could describe aspects of the biological world; of all nature’s mysteries, the one that fascinated Turing most was the idea that there might be a mathematical basis for human intelligence. He tried to use the mathematics of physical processes to describe a living process, and it explained for the first time how a biological system could self-organize, how something smooth and featureless can develop features. This was the start of a mathematical approach to biology.

It has become clearer and clearer that chaos and specific patterns are built into nature’s most basic rules. Fractal geometry describes the shapes of the natural world through a mathematical principle known as self-similarity, in which the same shape is repeated over and over again at smaller and smaller scales. This finding has had a broad and deep impact on the financial industry as well as on attempts to define intelligence.

I recommend “The Misbehavior of Markets: A Fractal View of Financial Turbulence” by Benoit Mandelbrot if you are interested.

The same fractal as above, magnified 2,000-fold, where the Mandelbrot set’s fine detail resembles the detail at low magnification.
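Membership in the Mandelbrot set pictured above can be tested with the standard escape-time iteration z → z² + c:

```python
def in_mandelbrot(c, max_iter=100):
    """c belongs to the Mandelbrot set if z -> z*z + c, starting from z = 0,
    stays bounded (|z| never exceeds 2 within max_iter iterations)."""
    z = 0j
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2:
            return False
    return True

# c = 0 stays at 0 forever; c = -1 cycles 0 -> -1 -> 0 -> -1; c = 1 escapes
# quickly (1 -> 2 -> 5 -> 26 -> ...).
```

The rule itself is as simple as can be; the endlessly self-similar boundary in the figure emerges entirely from iterating it.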

Even evolution is a process built on these patterns: it takes them as raw ingredients, combines them in various ways, experiments to see what works and what doesn’t, keeps the things that do work, and builds on them. It is a completely unconscious process. In other words, evolution uses nature’s self-organizing patterns. It relies on simple rules of replication, plus feedback from the environment, to build up ever more complex systems through continuous small mutations.

Let’s go back to entropy maximization theory: evolution actually exhibits similar behavior. Looking back, we can say that things that exist right now, like ‘intelligence’, could just be an “accident” that emerges from a long-term, unconscious drive of nature to increase future freedom of action, or to avoid constraints on its own future, following simple rules. From the science perspective, the entropy maximization rule does not prevent evolution from happening; in fact, it demands that ordered, self-maintaining structures appear in order to maximize entropy with maximum efficiency:

  1. Ordered systems are more efficient than unordered systems in maximizing entropy.
  2. Self-maintaining, ordered systems will appear spontaneously wherever there is sufficient potential.

In short, the basic concept we can extract from this process is that unthinking, simple rules have the power to create complex systems without any conscious thought. It also implies that design does not need an interfering creator or designer; it is an inherent part of the universe.

“One of the things that makes people so uncomfortable about this idea, that pattern arises spontaneously, is that somehow you don’t need a creator, but perhaps a really clever ‘designer’: you treat the universe like a giant simulation, where you set some initial conditions and just let the whole thing spontaneously happen in all its wonder and all of its beauty.”