An Information-Theoretic Approach to Understand Deep Learning — Part 2 — Deep Dive
In the first part of our research adventure into Information Bottleneck (IB) theory, we discussed the basics of the theory. We saw how the theory delivers a better understanding of Deep Learning (DL) with just two variables, the mutual information measures, instead of the layer weight parameters of a deep neural network (DNN). Lastly, we contemplated an intriguing analogy between the mutual information measures and macroscopic variables in statistical thermodynamics.
In this second part, we will go a bit deeper into the IB theory. This part is a shorter, non-mathematical version of a very thorough post that I wrote some weeks ago. First, I will examine what role the size of the training data plays. We will then see to what extent the two phases of stochastic gradient descent (SGD) optimization are akin to the drift and diffusion phases of the so-called Fokker-Planck equation. This is another compelling similarity to statistical mechanics.
II. The Role of Training Data — How Much Is Enough
One of the major results of the IB theory, which we discussed in the previous part, is the existence of two phases of deep learning: a fitting phase and a compression phase. An exciting observation was that differently randomized networks follow the same trajectories on the information plane, through the same two phases. Averaging over the trajectories of the different networks yields an image like the one shown in Figure 1.
The figure illustrates the training dynamics of randomized DNNs trained on 5%, 45% and 85% of the training data, respectively. Evidently, the fitting phase is identical in all three cases. In contrast, including more data in training changes the characteristics of the compression phase: small training sets lead to stronger compression of the layers' information about the label. This is a sign of over-fitting. Compression therefore aids in simplifying the layer-wise representations, but it can also promote over-simplification. Note that the balance between simplification and not losing too much relevant information is still a matter of investigation by Tishby and colleagues.
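The coordinates of the information plane behind figures like this are mutual information estimates. As a rough illustration of how such estimates can be obtained, here is a minimal sketch of the binning approach (discretizing continuous activations and counting joint frequencies). The function name and bin count are my own choices, not from the original work, and real experiments need far more care with binning and sample sizes.

```python
import numpy as np

def mutual_information(x, y, n_bins=30):
    """Estimate I(X; Y) in bits by discretizing x into equal-width bins.

    x: (n_samples, n_features) continuous activations
    y: (n_samples,) discrete labels
    """
    # Discretize every activation value onto a common grid of bin edges
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    digitized = np.digitize(x, edges[1:-1])          # (n_samples, n_features)

    # Map each binned activation row and each label to one integer symbol,
    # so the joint distribution becomes a simple 2-D count table
    _, x_sym = np.unique(digitized, axis=0, return_inverse=True)
    x_sym = x_sym.reshape(-1)
    _, y_sym = np.unique(y, return_inverse=True)

    joint = np.zeros((x_sym.max() + 1, y_sym.max() + 1))
    for xs, ys in zip(x_sym, y_sym):
        joint[xs, ys] += 1
    joint /= joint.sum()

    # I(X;Y) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) ) over nonzero cells
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))
```

When the binned activations determine the label perfectly, the estimate equals the label entropy; when they are independent of it, the estimate is zero.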
III. A Thought Experiment — An Excursion to Statistical Mechanics
Let us take a step back and talk about something completely different. Or is it perhaps not so different after all? Let us do a thought experiment.
III.1. Drift and Diffusion — Or Putting Color into Water
Our thought experiment goes as follows: we take a glass and fill it with water. Then we dissolve a drop of red dye in the central upper part of the water (Figure 2 A). Due to gravity, the dye droplet drifts downward on average over time. At the same time, through a process called diffusion, the volume occupied by the dye molecules grows (Figure 2 B). As the combined consequence of both processes, drift and diffusion, the dye molecules eventually become homogeneously distributed in the water. It should be noted that drift is already present without any external potential, due to the friction force (for more details, see the longer version of this post). Gravity is considered as an external force here merely for better illustration.
III.2. Individual Particle versus Ensemble of Particles — Two Sides of the Same Coin
Yet how should we understand these two processes? Where do we even start? Should we try to understand what the individual dye molecules do? Or should we rather ask how the dye droplet behaves as a whole over time? These questions lie at the core of statistical mechanics, where we consider the dynamics of systems that may consist of a huge number of particles. Applying statistical mechanics to such systems shows, in essence, that both descriptions, the microscopic and the macroscopic, are equivalent.
Contemplation of this problem dates back even to Roman times. In his first-century BC scientific poem De rerum natura, the Roman poet and philosopher Lucretius described a phenomenon that we now know as Brownian motion. Essentially, Lucretius depicted the apparently random motion of dust particles as the consequence of collisions with the surrounding air molecules. He used this as evidence for the existence of atoms.
To be more precise, his illustration covers not only the random part of the motion but also the deterministic part of the dynamics. The latter, caused by air currents, is comparable to the drift dynamics in our thought experiment. The random part of the motion is known as Brownian motion, in honor of the Scottish botanist Robert Brown. A simulation of Brownian motion is illustrated in Figure 3.
III.3. Single Particle Perspective — Stochastic Differential Equations
The single-particle dynamics of Brownian motion are mathematically described by a stochastic differential equation (SDE). This is, in short, a differential equation in which at least one term is a stochastic process. With the aid of such an SDE, we can simulate thousands to millions of suspended particles moving within a glass containing billions of smaller water molecules. The initial configuration might be the larger particles concentrated in a small region of the water (as in Figure 2 A). An individual SDE then determines the temporal dynamics of each suspended particle. In essence, this is what the Monte Carlo method, one of the most successful computational algorithms in statistical mechanics, accomplishes.
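To make this concrete, here is a minimal sketch (not from the original post) of the simplest such SDE: constant drift plus diffusion, integrated with the Euler-Maruyama scheme for many non-interacting particles. All parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdamped dynamics for N non-interacting dye particles along one axis:
#   dx = mu * dt + sqrt(2 * D * dt) * xi,   xi ~ N(0, 1)
# mu: constant drift (e.g., gravity pulling the dye downward)
# D:  diffusion coefficient
def simulate(n_particles=10_000, n_steps=500, dt=1e-3, mu=-1.0, D=0.5):
    # All particles start concentrated near the top (x = 1), like the droplet
    x = np.full(n_particles, 1.0)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_particles)
        x += mu * dt + np.sqrt(2.0 * D * dt) * noise
    return x

x_final = simulate()
# Over total time T = n_steps * dt, the ensemble mean drifts by mu * T
# while the spread grows like sqrt(2 * D * T):
print(x_final.mean(), x_final.std())  # ≈ 0.5 and ≈ 0.71 here
```

The drift term moves the whole cloud downward deterministically; the noise term makes each individual trajectory random, exactly the two processes of the thought experiment.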
III.4. Ensemble Description — Fokker-Planck Equation
But what if we zoom out from the microscopic view to a much more coarse-grained picture? From this vantage point, the millions of particles are smeared out into a distribution. The mathematical description changes accordingly: instead of millions of SDEs, we have one deterministic partial differential equation that describes the macroscopic distribution of all particles in time and space. This macroscopic evolution equation is widely known as the Fokker-Planck equation. The crucial point is that, although the time evolution of each single particle is completely random, the particle population as a whole evolves deterministically.
The Fokker-Planck equation was first derived to describe Brownian motion from an ensemble perspective by the Dutch physicist and musician Adriaan Fokker and the famous German physicist Max Planck. It is also known as the Kolmogorov forward equation, after the Russian mathematician Andrey Kolmogorov, who developed the concept independently in 1931.
III.5. The Fokker-Planck Equation as a Continuity Equation
There is a visually appealing reformulation of the Fokker-Planck equation as a continuity equation. Without writing the continuity equation down explicitly here, we can visualize it with the schematic picture shown in Figure 4. Here we again have the diffusion situation shown previously in Figure 2 D. For the purpose of illustration, an inner circle is separated from an outer region by the dashed black boundary. If we denote the density of red dye particles inside the inner circle by p(x,t), then a change of this density in time is generated by particle fluxes, j(x,t), through the dashed boundary. A more prosaic statement of the continuity equation is therefore: the density inside changes only by what flows in across the boundary from the outside, and vice versa.
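For readers who do want the formula, the picture corresponds to the standard textbook form with a drift field $\mu(x)$ and, for simplicity, a constant diffusion coefficient $D$ (my notation, kept deliberately schematic):

```latex
% Fokker-Planck equation as a continuity equation:
\frac{\partial p(x,t)}{\partial t} = -\nabla \cdot j(x,t),
\qquad
j(x,t) = \mu(x)\, p(x,t) - D\, \nabla p(x,t)
```

The first term of the flux is the drift contribution, the second the diffusive one; the homogeneous end state of the thought experiment is the stationary solution where the net flux vanishes.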
IV. Back to IB Theory — Constructing the Analogy
So, after our tour de force through the statistical mechanics of drift and diffusion processes, the question is how this might help us understand Deep Learning.
Neural networks can often be trained on small batches of the training data by utilizing stochastic gradient descent (SGD) optimization. In their 2017 work, Tishby and colleagues presented a nice visual picture of the existence of such drift and diffusion phases during SGD optimization. For that, they calculated the means and standard deviations of the weights' stochastic gradients for each layer of the DNN and plotted these as functions of the training epoch (Figure 5).
Notably, the transition from the first phase (the fitting phase) to the second phase (the compression phase) is visible here (the vertical dotted line in Figure 5). At the beginning of the first phase (up to ~100 epochs), the gradient means are about two orders of magnitude larger than the standard deviations. Then, between ~100 and ~350 epochs, the fluctuations grow continuously until, at the transition point, they match the means in magnitude.
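To illustrate what "mean and standard deviation of the stochastic gradients per epoch" means in practice, here is a self-contained toy sketch. It uses a linear model, not the DNN from the paper, and all hyperparameters are invented for illustration; only the bookkeeping of the two gradient statistics mirrors the described procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a linear target with a little observation noise
X = rng.standard_normal((512, 8))
w_true = rng.standard_normal(8)
y = X @ w_true + 0.1 * rng.standard_normal(512)

w = np.zeros(8)
lr, batch = 0.01, 32
grad_means, grad_stds = [], []

for epoch in range(50):
    grads = []
    for i in range(0, len(X), batch):
        xb, yb = X[i:i + batch], y[i:i + batch]
        g = 2.0 * xb.T @ (xb @ w - yb) / len(xb)  # minibatch MSE gradient
        w -= lr * g                               # SGD step
        grads.append(g)
    grads = np.array(grads)
    # Reduce the epoch's gradient statistics to scalars via the norm
    grad_means.append(np.linalg.norm(grads.mean(axis=0)))
    grad_stds.append(np.linalg.norm(grads.std(axis=0)))

# Early in training the mean gradient dominates (drift-like phase);
# as the fit converges the mean shrinks and the minibatch fluctuations
# become relatively more important (diffusion-like phase).
```

Plotting `grad_means` and `grad_stds` against the epoch index reproduces, in caricature, the kind of curves shown in Figure 5.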
In the longer blog post, I provided a derivation scheme for an N-layer Fokker-Planck equation for the layer weights, and showed how the two phases of deep learning are matched with the drift and diffusion terms of the Fokker-Planck equation. It can be shown explicitly that the layer weight gradients enter the Fokker-Planck equation: while the drift term depends on the gradient means, the diffusion term is driven by the standard deviations of the weight gradients, just as suggested by Figure 5.
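Schematically, and in my own simplified single-layer notation (the full N-layer derivation is in the longer post), such an equation for the distribution $P(\mathbf{w},t)$ of a layer's weights reads:

```latex
\frac{\partial P(\mathbf{w},t)}{\partial t}
  = -\nabla_{\mathbf{w}} \cdot \big[ \boldsymbol{\mu}(\mathbf{w})\, P(\mathbf{w},t) \big]
  + \nabla_{\mathbf{w}} \cdot \big[ \mathbf{D}(\mathbf{w})\, \nabla_{\mathbf{w}} P(\mathbf{w},t) \big]
```

with the drift $\boldsymbol{\mu}$ given by the mean of the weight gradients and the diffusion tensor $\mathbf{D}$ stemming from their fluctuations, in line with the two regimes visible in Figure 5.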
V. Summary and Outlook
V.1. The Role of Training Data Size
In this second part of our deep dive into the IB theory, we first discussed what role the overall size of the training data plays for the final training result. According to the IB theory, the fitting phase is always the same, no matter how much data we use in training. In contrast, the size of the training data plays a key role in the success of the compression phase. Essentially, very little training data (e.g., only 5%) yields over-compression of the information about the label, a clear sign of over-fitting.
V.2. Analogy to the Fokker-Planck Equation
Afterwards, we elaborated on an analogy between the two deep learning phases and the drift and diffusion terms of the Fokker-Planck (FP) equation, which is widely utilized within statistical mechanics. This was exemplified in the context of the Brownian motion of larger particles in a fluid. A detailed discussion of how the layer weight gradients enter an N-layer FP equation, and thereby match the two learning phases of the DNN, is provided in the longer blog post.
V.3. What Awaits Us on the Horizon?
Next up, in an upcoming third part, we shall discuss the role of the hidden layers. It has long been known that a single hidden layer is sufficient to map a function of arbitrary complexity onto the underlying training data. Hence, one major research question is why one should consider more and more hidden layers. We will see that, according to the IB theory, the main advantage of additional hidden layers is merely computational: they serve only to reduce computation time during training. Additionally, in the next part we will investigate some shortcomings of the IB theory and then examine a newer approach that resolves some of them. This approach is called the Variational IB theory.