A New Look at the Physics of Machine Learning

shervin
8 min read · Feb 1, 2024


If “information is physical”, then the process of learning information must be a physical process.

Thermodynamics of machine learning, by DALL·E 2!

In this post, I delve into the core concept behind my recent paper. This work utilizes a thermodynamic framework to quantify the information learned and stored in neural networks during the training process.

Upfront, here’s the bottom line:

  1. The training process of a neural network is inherently a thermodynamic process.
  2. The learning process is irreversible, with entropy production serving as the source of what has been learned.
  3. The model’s parameters play the role of a heat reservoir, absorbing this learned information.
  4. This acquired information originates from dissipation in the model's degrees of freedom, which we will define as the model-generated samples.

Why is this thermodynamic framework useful?

1. It provides us with a thermodynamic explanation for the benefits of over-parameterized models. An over-parameterized model implies a high heat capacity of the parameter subsystem, a condition necessary for acting as a heat reservoir.

2. It offers a method to measure the information learned during the learning process, addressing a long-standing challenge in the machine learning literature (see, for example, this reference on the importance of this problem).

3. It makes ML a subfield of physics! (okay, I am not serious about this one)

To see how this approach works, I will present two background sections (Part I and Part II). Then, we will put these two together to form the final argument. To ensure accessibility for readers from diverse backgrounds, I will provide a non-technical and simplified overview of the topic, at the cost of not delving rigorously into some aspects.

Part I: Thermodynamics of Learning Information

Historically, the concept of (Shannon) information was discovered (prior to Shannon, by Szilard) while studying an old paradox in thermodynamics known as Maxwell's demon. I won't begin from there, but rather from a more recent discovery that connects the accumulation of mutual information between two subsystems (what I call learning) to thermodynamic quantities. To see this, we need to revisit the second law.

The initial version of the second law of thermodynamics, formulated by Clausius in 1862, can be expressed in modern terms as follows:

In any thermodynamic process, the total change in the entropy of a closed system, which includes both a subsystem and a reservoir, is always greater than or equal to zero.

Let us represent a subsystem as X (e.g., a cup of hot tea) and its reservoir as Θ (e.g., the room surrounding the cup of tea). We can then formally write the second law as follows:

Σ := ΔS(X) + ΔS(Θ) ≥ 0

This equation states that the sum of the change in the entropy of the tea (the subsystem) and the change in the entropy of the room (the reservoir) is always greater than or equal to zero. The quantity Σ, known as the Entropy Production (EP), captures this non-negative change in entropy. EP is zero when the process is reversible and positive when the process is irreversible. For instance, when a cup of tea cools down to room temperature, the EP is positive, indicating an irreversible process.
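To make this concrete, here is a back-of-the-envelope check in Python, with made-up but plausible numbers: 0.3 kg of tea at 350 K cooling in a 293 K room treated as an ideal reservoir. The tea loses entropy, the room gains more, and Σ comes out positive.

```python
import numpy as np

# Hypothetical numbers: 0.3 kg of tea (c ≈ 4186 J/kg·K) cooling from
# 350 K to the temperature of a 293 K room treated as an ideal reservoir.
m, c = 0.3, 4186.0
T_tea, T_room = 350.0, 293.0

dS_tea = m * c * np.log(T_room / T_tea)   # ΔS(X): entropy change of the tea
Q = m * c * (T_tea - T_room)              # heat released into the room
dS_room = Q / T_room                      # ΔS(Θ): entropy change of the room

print(f"ΔS(tea)  = {dS_tea:+.1f} J/K")    # ≈ -223 J/K
print(f"ΔS(room) = {dS_room:+.1f} J/K")   # ≈ +244 J/K
print(f"Σ = {dS_tea + dS_room:+.1f} J/K ≥ 0")
```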

Now, for modern readers familiar with information theory, an intriguing observation can be made in the above formulation. The correct way to compute the entropy change of the joint system (X,Θ) is as follows:

ΔS(X,Θ) = ΔS(X) + ΔS(Θ) − ΔI(X:Θ)

Moreover, for a closed system, one would expect the conservation of joint entropy (Liouville’s theorem), thus ΔS(X,Θ) = 0. Combining this observation with the second law, we arrive at:

Σ = ΔI(X:Θ) = I(X:Θ)[t] − I(X:Θ)[0] ≥ 0

Now, we can see the profound connection between the second law and the accumulation of mutual information between the subsystem and its reservoir. If the subsystem X is brought into contact with the reservoir Θ at t=0, one would expect I(X:Θ)[0] = 0. Then, the second law states that if the subsystem X learns "some bits" about the reservoir Θ, i.e., when I(X:Θ)[t] > 0, then the process is irreversible, and those "some bits" manifest as EP. Thus, our cup of tea learns something about the room surrounding it while it is cooling down!
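Here is a minimal toy illustration of this argument, under the simplifying assumption that the closed composite system (X, Θ) evolves by a fixed permutation of its joint states: a discrete stand-in for reversible Liouville dynamics, which preserves the joint entropy exactly. Starting from a product distribution, I(X:Θ)[0] = 0, so the entropy production Σ = ΔS(X) + ΔS(Θ) equals the accumulated mutual information and stays non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability states."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

nx, nt = 16, 16  # sizes of the X and Θ state spaces

# Start from independent marginals, so I(X:Θ)[0] = 0.
px = rng.dirichlet(np.ones(nx))
pt = rng.dirichlet(np.ones(nt))
joint = np.outer(px, pt).flatten()

# A fixed random permutation of the joint states: reversible dynamics on
# the closed composite system, preserving the joint entropy S(X,Θ).
perm = rng.permutation(nx * nt)

for t in range(1, 6):
    joint = joint[perm]
    P = joint.reshape(nx, nt)
    S_joint = entropy(joint)  # constant under the permutation
    I_xt = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0)) - S_joint
    print(f"t={t}: S(X,Θ)={S_joint:.4f} bits, Σ = ΔS(X)+ΔS(Θ) = I(X:Θ) = {I_xt:.4f} ≥ 0")
```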

Part II: Information Content in Learning a Generative Model

The primary application of the thermodynamic formulation of the learning process is to create a framework for measuring the information content of training a model. In this section, we define two information-theoretic quantities to gauge this information content.

Let B be the ground-truth random variable generating the training dataset samples. We represent the action of the optimizer after n steps with a stochastic map, Λ(Θ|B)[n], that takes the random variable B and outputs the parameters' random variable Θ[n] at time t = n. The map Λ encompasses all hyper-parameters, the loss function, regularization terms, and the choice of optimization algorithm used in training the model. A sample drawn from this random variable after n optimization steps defines the parametric model p(X|θ[n]). Finally, we can sample our model to obtain what is learned with this generative model: X[n] ∼ p(X|θ[n]).

This process of training and sampling can be represented by a Markov chain:
B → Θ[n] → X[n]

The Data Processing Inequality (DPI) associated with this Markov chain tells us:
I(B;Θ)[n] ≥ I(B;X)[n].

Note that before starting the training process, the initial values of I(B;Θ)[0] and I(B;X)[0] are zero. Thus, I(B;Θ)[n] and I(B;X)[n] measure the accumulation of mutual information during the training process.
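To see this inequality in action, here is a small self-contained check on a discrete toy version of the chain B → Θ → X, with randomly chosen distributions standing in for the source p(B), the optimizer map Λ(Θ|B), and the model p(X|θ):

```python
import numpy as np

rng = np.random.default_rng(1)

def mutual_info(pxy):
    """I(X;Y) in bits from a joint probability table pxy."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

nb, nt, nx = 4, 6, 5                               # sizes of B, Θ, X state spaces
p_b = rng.dirichlet(np.ones(nb))                   # ground-truth source p(B)
lam = rng.dirichlet(np.ones(nt), size=nb)          # optimizer map Λ(Θ|B), rows indexed by b
p_x_given_t = rng.dirichlet(np.ones(nx), size=nt)  # model p(X|θ), rows indexed by θ

p_bt = p_b[:, None] * lam                          # joint p(B, Θ)
p_bx = p_bt @ p_x_given_t                          # joint p(B, X), marginalizing over Θ

print(f"I(B;Θ) = {mutual_info(p_bt):.4f} bits")
print(f"I(B;X) = {mutual_info(p_bx):.4f} bits (never exceeds I(B;Θ), by the DPI)")
```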

We are now on the verge of defining two information-theoretic quantities that capture the information content of learning a generative model. The left side of the inequality, I(B;Θ), measures the accumulation of information between the model’s parameters and the source of the training dataset. We call this the Memorized-information (M-info).

The right side of this inequality, I(B;X), measures the performance of the model by quantifying the accumulation of information between the model-generated samples and the ground truth source of the training dataset. We refer to this as Learned-information (L-info).

It is well known that restricting M-info (the information contained in the parameters) improves generalization error and prevents overfitting. On the other hand, maximizing L-info is the very objective of learning (log-likelihood maximization). Thus, the ideal situation is M-info = L-info, like a student who wants to learn what is necessary for an upcoming exam but avoids memorizing any extra information!

Okay, I want to keep this essay short and accessible. So, let me jump ahead and state that under some circumstances (when Λ(Θ|B) approaches a delta function, meaning the learning process is so robust that it does not matter who runs the training algorithm), we can rewrite these two quantities as follows:

  1. M-info := I(B;Θ)[n] = S(Θ).
    This means M-info naturally appears as the entropy of the parameters (a numerical sanity check follows this list).
  2. L-info := I(B;X)[n] ≈ I(Θ;X)[n].
    This means L-info can be approximated as the accumulated mutual information between the model's parameters and the model-generated samples.
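As that sanity check of claim 1, in the same discrete toy setting as before: when Λ(Θ|B) is deterministic (a delta function, i.e., each source value b maps to exactly one parameter value θ), the conditional entropy S(Θ|B) vanishes and I(B;Θ) collapses to S(Θ).

```python
import numpy as np

rng = np.random.default_rng(2)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(pxy):
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

nb, nt = 4, 6
p_b = rng.dirichlet(np.ones(nb))

# Deterministic Λ(Θ|B): each row is one-hot, so θ is a function of b.
lam = np.zeros((nb, nt))
lam[np.arange(nb), rng.choice(nt, size=nb, replace=False)] = 1.0

p_bt = p_b[:, None] * lam                 # joint p(B, Θ)
print(f"I(B;Θ) = {mutual_info(p_bt):.4f} bits")
print(f"S(Θ)   = {entropy(p_bt.sum(axis=0)):.4f} bits (identical: M-info = S(Θ))")
```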

I think by now you know what I want to say:

In Part II, we showed how the task-relevant information learned by the model can be measured by the L-info I(X;Θ), representing the accumulation of information between the model's parameters and the model-generated samples. In Part I, we explored how two subsystems, such as X and Θ, gain mutual information while evolving in an irreversible process in contact with each other under the rule of the second law. Integrating these two parts provides us with a thermodynamic framework for studying the information content of training parametric models.

Part I + Part II: Machine learning as a thermodynamic process?

Here, we are talking about the modern formulation of thermodynamics, not the version we studied in school with its nonsense second law. Let me cut to the chase with this definition:

A distribution like p(X)[t] represents the thermal state (statistical state) of a system with degrees of freedom X at time t. A sample you draw from this distribution represents a possible microstate of the system. A thermodynamic process is then nothing but the time evolution of the thermal state, i.e., p(X)[t=1], p(X)[t=2], …, p(X)[t=n].

This time evolution can be formally described by a master equation, whose transition rates capture the underlying physics of the process. The first and second laws can be derived directly from this master equation (see this reference to learn more about this formulation).
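As a sketch of what such a description looks like in practice (a generic toy, not the specific dynamics of the paper), the snippet below evolves a small thermal state under a continuous-time master equation dp/dt = W p with detailed-balanced (symmetric) rates and evaluates the standard Schnakenberg expression for the entropy production rate, which stays non-negative and decays to zero as the state relaxes:

```python
import numpy as np

rng = np.random.default_rng(3)

n = 4                                  # number of microstates
A = rng.random((n, n))
W = (A + A.T) / 2.0                    # symmetric rates satisfy detailed balance
np.fill_diagonal(W, 0.0)
W -= np.diag(W.sum(axis=0))            # columns sum to zero: probability conserved

p = rng.dirichlet(np.ones(n))          # initial thermal state p(X)[0]
dt, steps = 1e-3, 3001

for step in range(steps):
    # Schnakenberg entropy production rate for the current state:
    # dΣ/dt = Σ_{x≠y} W[y,x] p[x] ln( (W[y,x] p[x]) / (W[x,y] p[y]) ) ≥ 0
    ep_rate = sum(
        W[y, x] * p[x] * np.log((W[y, x] * p[x]) / (W[x, y] * p[y]))
        for x in range(n) for y in range(n) if x != y
    )
    if step % 1000 == 0:
        print(f"t={step * dt:.1f}: dΣ/dt = {ep_rate:.4f}")
    p = p + dt * (W @ p)               # Euler step of the master equation dp/dt = W p
```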

You may now start to see the connection between a thermodynamic process and training a model in the machine learning literature. When a neural network models a distribution (or, more exactly, the log of a distribution), the training process (e.g., the action of the SGD optimizer) renders the time evolution of that distribution. We just saw that a thermodynamic process is nothing but the time evolution of a distribution (i.e., the thermal state). If we can find the transition probabilities that govern this time evolution, then we can study the physics of this process, i.e., compute work, heat dissipation, etc., during the training process.

You may ask: what do you mean by heat dissipation? Do you mean the physical heat that we measure in joules, the one we feel moving between our fingers when we hold a hand above a heater? Or are we just using a mere analogy here? The answer is that I really mean physical heat. Again, we need to look at the modern definition of heat flow: the change in the energy of a system due to a change in its distribution. Say the system is the air molecules in a room; by injecting heat, you are changing the distribution of the air molecules and consequently increasing the average energy of the room. Thus, training a model (evolving a parametric distribution) is associated with physical heat exchange.
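In this modern sense, heat is the part of the energy change that comes purely from reshuffling probability mass over microstates while the energy levels stay fixed. Here is a two-line numerical illustration with arbitrary, made-up energies:

```python
import numpy as np

rng = np.random.default_rng(4)

n = 5
E = rng.random(n) * 10.0               # fixed energies of the n microstates
p_before = rng.dirichlet(np.ones(n))   # distribution before the process
p_after = rng.dirichlet(np.ones(n))    # distribution after the process

# Heat flow in the modern sense: the change in average energy caused purely
# by the change in the distribution, with the energy levels E(x) held fixed.
Q = np.sum(E * (p_after - p_before))
print(f"Heat absorbed by the system: {Q:+.4f} (in the same units as E)")
```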

But heat exchange between what? Two subsystems embody the physical system undergoing this thermodynamic process:

  1. The model subsystem: the microscopic states of this subsystem are the model-generated samples X[t] ∼ p(x|θ[t]) at time t. Note that the model subsystem evolves from generating noise to generating meaningful patterns (like human faces).
  2. The parameter subsystem: the microscopic states of this subsystem are samples drawn from the parameter distribution p(θ[t]). This distribution also evolves during training, in such a way that parameters drawn from it define a model, i.e., p(x|θ[t]), with desirable statistical characteristics as determined by the loss function. (A toy sketch of both subsystems follows this list.)
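As promised, here is a deliberately simple caricature of these two subsystems. Everything in it (the 1-D Gaussian model, the noisy gradient descent, the numbers) is invented for illustration and is not the paper's setup: an ensemble of parameter values stands in for samples from p(θ[t]), and one generated output per ensemble member samples the model subsystem.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical toy: 1-D Gaussian models p(x|θ) = N(θ, 1), trained by noisy
# gradient descent on a squared loss toward a target mean of 3.0.
target, lr, noise, n_models = 3.0, 0.1, 0.05, 1000

# An ensemble of parameter values: microstates of the parameter subsystem,
# i.e., samples from p(θ[t]) at t = 0.
theta = rng.normal(0.0, 1.0, n_models)

for t in range(50):
    grad = theta - target                            # gradient of the loss
    theta = theta - lr * grad + noise * rng.normal(size=n_models)

# Microstates of the model subsystem: one generated sample per member,
# X[t] ~ p(x|θ[t]). The ensemble evolved from "noise" toward the target.
x = rng.normal(theta, 1.0)
print(f"parameter subsystem Θ: mean = {theta.mean():.2f}, std = {theta.std():.3f}")
print(f"model subsystem X:     mean = {x.mean():.2f}, std = {x.std():.3f}")
```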

Now, here is the technical part, which I will only state briefly; for a detailed discussion, I refer you to the paper. We can argue that in the case of an over-parameterized model, when the parameter subsystem is much larger than the model subsystem, the subsystem Θ acts as a heat reservoir for the subsystem X: due to its huge number of degrees of freedom, subsystem Θ has a much higher heat capacity than subsystem X. Moreover, the slow dynamics of over-parameterized models under the SGD optimizer (known as lazy dynamics) can render a quasi-static dynamics for the parameter subsystem, which is one condition of an ideal heat reservoir.

Putting all this together with what we discussed in Parts I and II, we see that the model learning the L-info is an irreversible process, and we can measure this learned information by computing the entropy production of the training process!

Finally, to provide some background on why measuring the information content of a neural network (a parametric distribution) is a relevant and challenging problem in machine learning, I would like to point to a few studies, spanning from the early days of the field to recent times:

  1. G. E. Hinton and D. Van Camp, “Keeping the neural networks simple by minimizing the description length of the weights,” in Proc. 6th Annu. Conf. Computational Learning Theory. ACM, 1993, pp. 5–13.
  2. Shwartz-Ziv, R., & Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
  3. Achille, A., Paolini, G., & Soatto, S. (2019). Where is the information in a deep neural network?. arXiv preprint arXiv:1905.12213.
  4. Shwartz-Ziv, R., & LeCun, Y. (2023). To Compress or Not to Compress — Self-Supervised Learning and Information Theory: A Review. arXiv preprint arXiv:2304.09355.

Thank you! I hope you enjoyed this view. Please let me know what you think.
