An Information-Theoretic Approach to Understand Deep Learning — Part 1 — Basics

Arash Azhand
5 min read · Jun 4, 2020


The following is a short version of a much longer blog post on the topic that I wrote earlier on our company blog.

"I know that I know nothing"
— Socrates, according to Plato —

"I think, therefore I am"
— René Descartes —

In recent weeks, the allegory of the cave often came to my mind while I was investigating recent approaches to better understand the learning mechanism within Deep Neural Networks (DNNs). In the seventh book of "The Republic", Socrates answers the question of how people can be educated with a parable (see Figure 1 for a visualization sketch): people live as prisoners in a cave, tied up in such a way that they can only look at a cave wall. The only source of light is behind them; they can see neither its origin nor the exit. All they see are the shadows cast on the cave wall. Plato here compares the sensually perceivable world with an underground cave. The aim must be to free oneself from it through education and to ascend into the purely intellectual world. More details about the cave allegory can be found here.

Figure 1: A visualization of the cave allegory (source).

And indeed: as humans, we observe the world through our natural senses, and our brain then makes sense of that information, assembling it into an overall model of the surrounding world that we can understand. As a consequence, all theories and models that we build on our observations are constructed under uncertainty, because what we see, hear, taste and feel of the world is interpreted information.

Information Theory and Information Bottleneck Theory

To measure and quantify uncertainty in a mathematically rigorous way, information theory was proposed by Claude Shannon in 1948. Its key measure is the information entropy, often just called entropy: nothing more than a measure of uncertainty. It should be emphasized that entropy as a measure of disorder had been introduced earlier in statistical thermodynamics, through the work of Ludwig Boltzmann and Josiah Willard Gibbs.
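Shannon's definition is short enough to state in a few lines of code. The sketch below computes the entropy of a discrete distribution in bits; the distributions are illustrative examples, not data from the post:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: exactly 1 bit.
print(entropy([0.5, 0.5]))   # → 1.0
# A heavily biased coin carries less uncertainty than 1 bit.
print(entropy([0.9, 0.1]))
# A certain outcome has no uncertainty at all.
print(entropy([1.0]))        # → 0.0
```

The pattern matches the intuition above: the more predictable the outcome, the lower the entropy, with zero reached only when one outcome is certain.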

In the context of the cave allegory, breaking free from the cave is an attempt to minimize entropy, i.e. the uncertainty of our knowledge, by uncovering the true causes of the observed phenomena.

Information theory is also at the heart of a promising theoretical approach to understanding how deep neural networks learn. This approach, called "Information Bottleneck (IB) Theory", was developed by Naftali Tishby, Professor of Computer Science and Computational Neuroscience at the Hebrew University of Jerusalem, and his colleagues. In summary, the theory describes deep learning as a two-phase process that compresses a huge amount of data, with all its features, into a shorter representation.
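The bottleneck idea can be stated compactly. Writing X for the input, Y for the target, and T for the compressed internal representation, the IB objective introduced by Tishby, Pereira and Bialek seeks an encoding p(t|x) that minimizes

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where I(·;·) denotes mutual information and the multiplier β sets the trade-off: the first term rewards compressing the input, the second rewards retaining the information relevant for predicting the target.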

A Trade-off Between Compression and Prediction

As a visual example, consider a trained DNN as a representation of a large image data set containing several object classes (humans, animals, cars and other objects). Feeding a new image into such a trained DNN should yield as output the decision whether one or more of these object classes are present in the image. In their 2015 paper, Deep Learning and the Information Bottleneck Principle, Tishby and Zaslavsky formulate "the goal of deep learning as an information theoretic trade-off between compression and prediction".

Figure 2: Schematic Representation of a generic DNN, processing an input image and outputting an image together with the object bounding boxes.

A more abstract schematic of such a DNN is visualized in Figure 2. The first part of the DNN, the Encoder, extracts the relevant information from the input data set. The second part, the Decoder, classifies the output (human, animal, other objects) from the features extracted by the Encoder. In essence, the Encoder is a compressed distillate of the essential information about the data set, mathematically encoded into the millions to billions of parameters of a DNN. The Decoder is then able to interpret this information for a new input image in such a way that the output is a decision about what is in the image.
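The encoder/decoder split can be sketched in a few lines. This is a hypothetical minimal example, not the architecture from the figure: all layer sizes, weights and the three-class labeling are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a flattened 28x28 image is compressed into a
# 32-dimensional representation T, then mapped to 3 class scores.
W_enc = rng.normal(size=(784, 32))  # Encoder weights: input → representation
W_dec = rng.normal(size=(32, 3))    # Decoder weights: representation → classes

def forward(x):
    t = np.tanh(x @ W_enc)          # Encoder: compressed representation T
    logits = t @ W_dec              # Decoder: scores for (human, animal, other)
    return t, logits

x = rng.normal(size=(784,))         # stand-in for a flattened input image
t, logits = forward(x)
print(t.shape, logits.shape)        # (32,) (3,)
```

The point of the sketch is only the shape of the computation: everything the Decoder sees about the image has to pass through the narrow representation T, which is exactly the bottleneck the theory studies.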

The Information Plane

To better understand how the IB theory captures the essence of deep learning, Tishby and co-workers utilized the concept of the information plane, a two-dimensional Cartesian plane (see Figure 3). The x-axis shows how much information the network encodes about the input, while the y-axis quantifies how much information it encodes about the output class.
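Both axes of the information plane are mutual informations. For discrete variables with a known joint distribution, mutual information is straightforward to compute; the joint tables below are toy examples chosen for illustration:

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table p(x, y),
    given as a list of rows (x indexes rows, y indexes columns)."""
    px = [sum(row) for row in joint]            # marginal p(x)
    py = [sum(col) for col in zip(*joint)]      # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# Perfectly correlated binary variables share exactly 1 bit …
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))   # → 1.0
# … while independent variables share none.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # → 0.0
```

A layer's coordinates in the plane are obtained in just this spirit: I(input; layer) on the x-axis and I(layer; label) on the y-axis, estimated from such joint statistics.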

Figure 3: Snapshots of layers (different colors) of 50 randomized networks during the training process in the information plane (in bits). Left panel: initial random weights; central panel: after 400 epochs of training; right panel: after 9000 training epochs (full video). Figure and caption taken from R. Schwartz-Ziv and N. Tishby (2017).

The information plane lets us visualize the whole life cycle of tens to hundreds of neural networks simultaneously. All these individual networks can have different architectures and different initial weights while being trained on the same data set (input and output).

A Forgetting Procedure

Generally, after many learning epochs, the information about the input retained within all layers gradually decreases. One can imagine this process as a kind of forgetting procedure. Consider the example of predicting whether an image contains humans or not. In the phase where training is just concerned with fitting input images to the label human or not, the network might treat any type of image feature as relevant for the classification.

For example, the first batches of the training data set may contain mostly images of people on city streets. Hence, the network learns to consider the high-level feature street as relevant for classifying humans in images. Later it might see other samples with people inside houses, people in nature, and so on. It then forgets the specific feature street as relevant for predicting human in the image. In a sense, the network has learned to abstract away the specific surrounding environment to the extent that it generalizes well over the various environments where we expect to observe humans. Conversely, it is also evident that a network trained on very one-sided data will make very one-sided decisions; examples would be negative decisions based on gender, race, age, and so on.

Thanks to the efforts of Tishby and colleagues, we now have in the IB theory a promising candidate for a more rigorous study of deep learning. The parameters of modern DNNs number up to the billions; the IB theory delivers a descriptive picture with just two coordinates, the two axes of the information plane, instead of the millions to billions of connection parameters of a DNN.


In the next part, more details of the theory will be provided. First, we will examine what role the size of the training data plays. We will then see to what extent the two phases of deep learning are akin to the drift and diffusion phases of the Fokker-Planck equation, another intriguing similarity to statistical mechanics.



Arash Azhand

Research Scientist with PhD in theoretical physics doing research and development of algorithms at Diconium GmbH in Berlin, Germany.