An Information-Theoretic Approach to Understand Deep Learning — Part 2 — Deep Dive

I. Introduction

In the first part of our research adventure into Information Bottleneck (IB) Theory we discussed the basics of the theory. We saw how the theory delivers a better understanding of Deep Learning (DL) with just two mutual information measures per layer, instead of the full set of weight parameters of a deep neural network (DNN). Lastly, we contemplated an intriguing analogy between the mutual information measures and macroscopic variables in statistical thermodynamics.

In this second part, we will go a bit deeper into the IB theory. This blog part is a shorter, non-mathematical version of a longer article that I wrote some weeks ago. First, I will examine what role the size of the training data plays. We will then see to what extent the two phases of stochastic gradient descent (SGD) optimization are akin to the drift and diffusion phases of the so-called Fokker-Planck equation. This is another compelling similarity to statistical mechanics.

II. The Role of Training Data — How Much Is Enough

One of the major results of the IB theory, which we discussed in the previous part, is that deep learning proceeds in two phases, a fitting and a compression phase. An exciting observation was that differently randomized networks follow the same trajectories on the information plane, through the same two phases. Averaging over these trajectories of different networks, one obtains an image like the one shown in Figure 1.

Figure 1: The evolution of the layers with the training epochs in the information plane, for different training samples. On the left — 5% of the data, middle — 45% of the data, and right — 85% of the data. The colors indicate the number of training epochs with Stochastic Gradient Descent from 0 to 10 000. The network architecture was fully connected layers, with widths: input=12–10–8–6–4–2–1=output. The examples were generated by the spherical symmetric rule described in the text. The green paths correspond to the SGD drift-diffusion phase transition. Figure and caption taken from the paper by Shwartz-Ziv and Tishby.

The figure illustrates the training dynamics of randomized DNNs trained on 5%, 45% and 85% of the training data, respectively. Evidently, the fitting phase is the same in all three cases. In contrast, including more data in training changes the character of the compression phase. Small training sets lead to stronger compression, in which the layers also lose information about the label. This is a sign of over-fitting. Compression therefore aids in simplifying the layer-wise representations, but it can also promote over-simplification. Note that the balance between simplification and not losing too much relevant information is still a matter of investigation by Tishby and colleagues.
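To make the information-plane picture more concrete, below is a minimal sketch, in plain NumPy, of how one might estimate the mutual information I(T;Y) between a layer's discretized activations T and the labels Y by simple histogram binning. This is my own illustrative toy, not the exact procedure or code used by Tishby and colleagues; the bin count and the synthetic data are arbitrary choices.

```python
import numpy as np

def discrete_mutual_information(t_binned, y):
    """Estimate I(T; Y) in bits from discretized layer activations and labels."""
    n = len(y)
    joint = {}
    for ti, yi in zip(map(tuple, t_binned), y):
        joint[(ti, yi)] = joint.get((ti, yi), 0) + 1
    p_t, p_y = {}, {}
    for (ti, yi), c in joint.items():
        p_t[ti] = p_t.get(ti, 0) + c
        p_y[yi] = p_y.get(yi, 0) + c
    mi = 0.0
    for (ti, yi), c in joint.items():
        p_joint = c / n
        mi += p_joint * np.log2(p_joint / ((p_t[ti] / n) * (p_y[yi] / n)))
    return mi

# Toy usage: 1000 samples, a 4-unit "layer", binary labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
activations = rng.normal(loc=y[:, None], scale=1.0, size=(1000, 4))  # label-dependent
# Discretize each activation into 30 bins, as in binning-based IB analyses.
bins = np.linspace(activations.min(), activations.max(), 30)
t_binned = np.digitize(activations, bins)
print("I(T;Y) ≈", discrete_mutual_information(t_binned, y), "bits")
```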

III. A Thought Experiment — An Excursion to Statistical Mechanics

Let us take a step back and talk about something completely different. Or is it maybe not so different after all? Let us do a thought experiment.

III.1. Drift and Diffusion — Or putting Color into Water

Figure 2: Scheme of the drift and diffusion process. (A) Red dye dissolved in a glass of water, drifting as a whole (on average) downwards due to the force of gravity. (B) Dye molecules dispersed into the surrounding water, occupying a larger volume, due to a process called diffusion. (C) Diffusion and drift have led to a homogeneous distribution of the red dye in the water. (D) Magnification of a small circular area in the glass after the homogeneous state is reached. The process of diffusion is illustrated by random collision events of single dye molecules with surrounding water molecules.

Our thought experiment goes as follows: We take a glass and fill it with water. Then, we dissolve a drop of red dye in the central upper part of the water (Figure 2 A). Due to the force of gravity, the dye droplet drifts on average towards the lower part of the glass over time. At the same time, by a process called diffusion, the volume occupied by the dye molecules grows (Figure 2 B). As a final consequence of both processes, drift and diffusion, the dye molecules end up homogeneously distributed in the water. It should be noted that drift is already present without any external potential, due to the friction force (for more details see the longer article mentioned in the introduction). The gravity as an external force is considered here only for better illustration.

III.2. Individual Particle versus Ensemble of Particles — Two Sides of the same Coin

Yet, how shall we understand these two processes? Where should we even start? Should we try to understand what the individual dye molecules do? Or should we rather ask how the dye droplet as a whole behaves in time? These questions lie at the core of statistical mechanics. In statistical mechanics, we consider the dynamics of systems that may consist of a huge number of particles. Applying statistical mechanics to these systems shows, in essence, that both descriptions, microscopic and macroscopic, are equivalent.

Contemplation on this problem dates back even to Roman times. In his first-century BC scientific poem De rerum natura (On the Nature of Things), the Roman poet and philosopher Lucretius described a phenomenon that we now know as Brownian motion. Essentially, Lucretius depicted the apparently random motions of dust particles as consequences of collisions with the surrounding air molecules. He used this as a proof of the existence of atoms.

Figure 3: Simulation of the Brownian motion of 5 particles (yellow) that collide with a large set of 800 particles. The yellow particles leave 5 blue trails of random motion and one of them has a red velocity vector (Source: ).

To be more precise, his illustration entails not only the random part of the motion, but also the deterministic part of the dynamics. The latter part, caused by air currents, is comparable to the drift dynamics in our thought experiment. The random part of the motion is known as Brownian motion, in honor of the Scottish botanist Robert Brown. A simulation of Brownian motion is illustrated in Figure 3.

III.3. Single Particle Perspective — Stochastic Differential Equations

The single-particle dynamics within Brownian motion is mathematically described by a stochastic differential equation (SDE). This is, in short, a differential equation in which at least one term is a stochastic process. With the aid of such SDEs we can simulate thousands to millions of suspended particles moving within a glass containing billions of smaller water molecules. The initial configuration might be the larger particles concentrated in a small region of the water (as in Figure 2 A). From there, an individual SDE determines the temporal dynamics of each suspended particle. In essence, this is what Langevin (or Brownian) dynamics simulations achieve as one of the major computational algorithms in molecular dynamics.
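To illustrate the single-particle view of our thought experiment, here is a minimal sketch of an Euler-Maruyama integration of an overdamped Langevin-type SDE for many dye particles, with a constant downward drift and Gaussian diffusion noise. This is my own toy example with arbitrary parameter values, not a production molecular-dynamics code.

```python
import numpy as np

rng = np.random.default_rng(42)

n_particles = 10_000            # dye molecules
n_steps = 2_000
dt = 1e-3                       # time step
drift = np.array([0.0, -1.0])   # constant downward drift (gravity / friction)
D = 0.5                         # diffusion coefficient

# Start all particles near the top centre of the "glass".
positions = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(n_particles, 2))

for _ in range(n_steps):
    # Euler-Maruyama step: deterministic drift + random diffusive kick.
    noise = rng.normal(size=(n_particles, 2))
    positions += drift * dt + np.sqrt(2.0 * D * dt) * noise

print("mean position:", positions.mean(axis=0))  # drifts downward over time
print("spread (std): ", positions.std(axis=0))   # grows due to diffusion
```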

III.4. Ensemble Description — Fokker-Planck Equation

But what if we just zoom out from the microscopic view to a much more coarse-grained picture? From this high-ground view, the millions of particles are smeared out into a distribution. The mathematical description changes accordingly. Instead of millions of SDEs, we have one deterministic partial differential equation that describes the macroscopic distribution of all particles in time and space. This macroscopic evolution equation is widely known as the Fokker-Planck equation. The crucial point is that, although the time evolution of each single particle is completely random, the particle population as a whole evolves deterministically.
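For reference, the standard one-dimensional Fokker-Planck equation for the particle density p(x,t), with drift coefficient μ(x) and diffusion coefficient D(x), takes the textbook form below (added here for orientation, not quoted from this post's sources):

```latex
\frac{\partial p(x,t)}{\partial t}
  = -\frac{\partial}{\partial x}\bigl[\mu(x)\, p(x,t)\bigr]
  + \frac{\partial^2}{\partial x^2}\bigl[D(x)\, p(x,t)\bigr]
```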

The Fokker-Planck equation was first derived to describe Brownian motion from an ensemble perspective by the Dutch physicist and musician Adriaan Fokker and the famous German physicist Max Planck. It is also known under the name Kolmogorov forward equation, after the Russian mathematician Andrey Kolmogorov, who derived it independently in 1931.

III.5. Fokker-Planck Equation as a Continuity Equation

There is a visually appealing reformulation of the Fokker-Planck equation as a continuity equation. Without writing it down explicitly here, we can visualize it with the schematic picture shown in Figure 4. Here we again have the diffusion situation shown previously in Figure 2 D. For the purpose of illustration, an inner circle is separated from an outer circle by the dashed black boundary. If we denote the density of red dye particles inside the inner circle by p(x,t), then a change of this density in time is generated by particle fluxes, j(x,t), through the dashed boundary. Hence, a more prosaic statement of the continuity equation is: the density inside can only change by what flows in from the outside across the boundary, and analogously the other way around.

Figure 4: Schematic illustration of the continuity equation.
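Written compactly, the continuity equation that Figure 4 visualizes takes the standard form below, where p(x,t) is the density and j(x,t) the particle flux (standard notation, added here for reference rather than taken from the original post):

```latex
\frac{\partial p(x,t)}{\partial t} + \nabla \cdot \mathbf{j}(x,t) = 0
```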

IV. Back to IB Theory — Constructing the Analogy

So, after our tour de force through the statistical mechanics of drift and diffusion processes, the question is how this might help us understand Deep Learning.

Neural networks are typically trained on small batches of the training data using stochastic gradient descent (SGD) optimization. Tishby and colleagues presented in their paper a nice visual picture of the existence of such drift and diffusion phases during SGD optimization. For that, they calculated the means and standard deviations of the weights' stochastic gradients for each layer of the DNN and plotted these as functions of the training epoch (Figure 5).

Figure 5: The layers’ Stochastic Gradients distributions during the optimization process. The norm of the means and standard deviations of the weights gradients for each layer, as function of the number of training epochs (in log-log scale). The values are normalized by the L2 norms of the weights for each layer, which significantly increases during the optimization. The grey line (∼ 350 epochs) marks the transition between the first phase, with large gradient means and small variance (drift, high gradient SNR: Signal-to-Noise-Ratio), and the second phase, with large fluctuations and small means (diffusion, low SNR). Note that the gradients log (SNR) (the log differences between the mean and the STD lines) approach a constant for all the layers, reflecting the convergence of the network to a configuration with constant flow of relevant information through the layers. Figure and caption from the paper by Shwartz-Ziv and Tishby.

Notably, the transition from the first phase (the fitting phase) to the second phase (the compression phase) is visible here (the vertical grey line in Figure 5). In the beginning of the first phase (up to ~100 epochs), the gradient means are around two orders of magnitude larger than the standard deviations. Then, between ~100 and ~350 epochs, the fluctuations grow continuously, until at the transition point they match the means in magnitude.
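As a rough illustration of the bookkeeping behind Figure 5, the sketch below trains a toy single-layer logistic model with mini-batch SGD and logs, per epoch, the norms of the mean and standard deviation of the batch gradients, normalized by the weight norm. This is a deliberately simplified stand-in for one layer of a DNN, with my own arbitrary data and hyperparameters, not the setup used by Tishby and colleagues.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and a single weight vector standing in for one layer of a DNN.
X = rng.normal(size=(2048, 12))
true_w = rng.normal(size=12)
y = (X @ true_w + 0.5 * rng.normal(size=2048) > 0).astype(float)

w = np.zeros(12)
lr, batch_size, n_epochs = 0.1, 64, 200

for epoch in range(1, n_epochs + 1):
    grads = []
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))   # sigmoid prediction
        g = xb.T @ (p - yb) / len(xb)          # mini-batch gradient
        grads.append(g)
        w -= lr * g                            # SGD update
    grads = np.array(grads)
    # Per-epoch gradient statistics, normalized by the weight norm as in Figure 5.
    w_norm = np.linalg.norm(w) + 1e-12
    mean_norm = np.linalg.norm(grads.mean(axis=0)) / w_norm
    std_norm = np.linalg.norm(grads.std(axis=0)) / w_norm
    if epoch % 50 == 0:
        print(f"epoch {epoch:4d}  mean {mean_norm:.4f}  std {std_norm:.4f}  "
              f"SNR {mean_norm / std_norm:.3f}")
```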

In the longer article, I provided a derivation scheme of an N-layer Fokker-Planck equation for the layer weights and showed how the two phases of deep learning are matched with the drift and diffusion terms of the Fokker-Planck equation. It can be shown explicitly that the layer weight gradients enter the Fokker-Planck equation: while the drift term depends on the gradient means, the diffusion term is driven by the standard deviations of the weight gradients, just as expected from Figure 5.
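Schematically, the starting point for such a construction is a Langevin-type SDE for each layer's weight vector w_k, in which the drift is set by the mean gradient and the noise amplitude by the gradient fluctuations. The shorthand below is my own sketch of this idea, not the exact equations of the longer article:

```latex
dw_k = -\,\eta\,\bigl\langle \nabla_{w_k} L \bigr\rangle\, dt \;+\; \Sigma_k^{1/2}\, dB_t,
\qquad
\Sigma_k \;\propto\; \eta\,\mathrm{Cov}\!\left[\nabla_{w_k} L\right]
```

The associated Fokker-Planck equation for the distribution of w_k then inherits a drift term from the mean gradient and a diffusion term from the gradient covariance, mirroring the two SGD phases of Figure 5.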

V. Summary and Outlook

V.1. The Role of Training Data Size

In this second part of our deep dive into the IB theory, we first discussed what role the overall size of the training data plays for the final training result. According to the IB theory, the fitting phase is always the same, no matter how much data we use in training. In contrast, the size of the training data plays a key role in the success of the compression phase. Essentially, very little training data (e.g., only 5%) yields over-compression of the information about the label. This is a clear sign of over-fitting.

V.2. Analogy to the Fokker-Planck Equation

Afterwards, we elaborated on an analogy between the two deep learning phases and the drift and diffusion terms of the Fokker-Planck (FP) equation. The FP equation is widely used within statistical mechanics, which we exemplified in the context of the Brownian motion of larger particles in a fluid. A detailed discussion of how the layer weight gradients enter an N-layer FP equation, and thereby match the two learning phases of the DNN, is provided in the longer article.

V.3. What awaits us at the horizon?

Next up, in an upcoming third part, we shall discuss the role of the hidden layers. It has long been known that one hidden layer is sufficient to fit a function of arbitrary complexity to the underlying training data. Hence, one major research question is why one should consider more and more hidden layers. We will see that, according to the IB theory, the main advantage of additional hidden layers is merely computational: they mainly serve to reduce the computation time during training. Additionally, in the next part we will investigate some shortcomings of the IB theory and then examine a newer approach that resolves some of them, the so-called Variational IB theory.

--

Arash Azhand

Research Scientist with PhD in theoretical physics doing research and development of algorithms at Diconium GmbH in Berlin, Germany.