
The Principle of Maximum Entropy

Implementing an algorithm that maximizes Shannon Entropy

Nicole Chantzi
Intuition


(Image source: https://en.wikipedia.org/wiki/Kolmogorov_complexity#/media/File:Mandelpart2_red.png)

Problem Statement

When we toss a die, the observed result ranges between 1 and 6. Suppose that we are given a die and told that it was tossed multiple times, yielding an average of 4.7 rather than the 3.5 one would expect from a fair, symmetrical die. Given this information, how should one update their beliefs about the probability of each side occurring in a single toss?

In the case of a regular die with probability 1/6 of each side occurring, we would expect the count histogram to look something like this:
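As an illustration, a quick simulation sketch produces such a histogram; the number of tosses and the random seed below are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)               # arbitrary seed, for reproducibility
tosses = rng.integers(low=1, high=7, size=10_000)  # 10,000 fair die tosses

plt.hist(tosses, bins=np.arange(0.5, 7.5, 1.0), rwidth=0.8)
plt.xlabel("Die side")
plt.ylabel("Count")
plt.title(f"Fair die, sample mean = {tosses.mean():.2f}")
plt.show()
```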

At first glance, it seems that if we were not given any information about the die, then the only reasonable choice is to assume that all sides are equiprobable with a probability of 1 out of 6. This reflects the Principle of Insufficient Reason: “Absent evidence to the contrary, all outcomes of a trial should be assumed to have equal probability”.

However, here we know something more. Our hypothesis is that the sample average is equal to 4.7. This suggests that the larger sides must, on the whole, carry more probability weight than the smaller ones. What we want to find is the six probabilities

$$p_1, p_2, p_3, p_4, p_5, p_6,$$

which fully characterize the probability mass function of the die tossing experiment, where p_i is the probability of side i occurring. Hence, for our yet unknown probability distribution the following constraints apply:

$$\sum_{i=1}^{6} p_i = 1, \qquad \sum_{i=1}^{6} i\, p_i = 4.7, \qquad p_i \geq 0.$$

Amongst all possible probability distributions we need one with an expected value of 4.7. But there are infinitely many die distributions with an expected value of 4.7. We need the one which is in some sense "the least informative" and assumes no more than what our data suggest. For instance, we could come up with a distribution that completely eliminates the probability of side 4 occurring but still has expected value 4.7.

But such a die seems rather unlikely, since we have no reason to believe that it truly exists; we would be implicitly incorporating information we do not actually possess. It is far more plausible that the larger sides each take a progressively bigger share of the total probability. Clearly, we need a formal way to compare the information across different probability distributions. We can imagine even "stranger" dice which produce even more unexpected histograms:

Introduction to the Maximum Entropy Principle

This problem can be solved by a principle which generalizes Laplace’s Principle of Indifference.

Principle of Maximum Entropy: The beliefs which best represent the current state of knowledge about a system are the ones which maximize the entropy.

But what is entropy? I tend to think of entropy as a synonym for uncertainty. The higher the entropy of a system, the harder it is to express and form judgements about it. The principle of maximum entropy is at heart a conservative principle: we try to be the least biased and avoid incorporating into our beliefs information that we do not currently possess.

The principle of maximum entropy is useful when our prior information about a system consists of averages of certain quantities, as in our die toss problem. Most often we are presented with an aggregated result of an experiment, usually a sample mean, and from there we have to infer the probabilities of the underlying phenomenon.

But for the moment, I invite you to suspend judgement and take a little detour with me into the realm of information theory.

Our goal is:
1. to find an epistemologically consistent way of deriving these probabilities, and
2. to implement the algorithm in Python.

We will only scratch the surface of this very interesting topic.

Entropy as a Measure of Surprise

The information theorist regards the informational value contained in a communicated message as proportional to the degree to which its content is surprising. If an event is highly improbable, its actualization carries much more informational value than a message which is entirely anticipated. For instance, think of the following two sentences:

1) It rained in Sahara.
2) It rained in Brussels.

The first proposition is definitely much more surprising than the second one, which is entirely expected, and as such it carries more informational value. It turns out that there is a consistent way to measure the surprise of a probability distribution, which can be achieved via the entropy function.

Claude Shannon, in his famous 1948 paper "A Mathematical Theory of Communication", defined the Shannon entropy of a discrete probability distribution with n possible outcomes as follows:

$$H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log p_i.$$
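As a quick illustration, here is a minimal sketch of this computation for any discrete distribution given as a NumPy array (the function name is merely illustrative):

```python
import numpy as np

def shannon_entropy(p: np.ndarray) -> float:
    """Shannon entropy (in nats) of a discrete probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # by convention, 0 * log(0) = 0
    return float(-np.sum(p * np.log(p)))

fair_die = np.full(6, 1 / 6)
biased_die = np.array([0.05, 0.05, 0.1, 0.1, 0.2, 0.5])

print(shannon_entropy(fair_die))      # ~1.792, the maximum for 6 outcomes (log 6)
print(shannon_entropy(biased_die))    # smaller: the biased die is less surprising
```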

Notice that different probability distributions for the die toss experiment yield different entropies. In that way, Shannon entropy allows us to categorize and order probability distributions as "more entropic" or "less entropic", which lets us choose the one that maximizes Shannon entropy subject to the experimental constraints; in our case, the expected value being equal to 4.7.

We can visualize Shannon entropy in a ternary plot when n = 3. We can see that uncertainty/entropy increases when the probabilities are less extreme, and hence the outcomes vary more.

We can see that Shannon entropy tends to be higher when the probabilities are more evenly distributed across the sample space. Thus, a die for which the probability of side 4 is exactly 0 would have lower entropy, making us more certain about the outcome, than a distribution which does not rule that possibility out but still has an expected value of 4.7.

We have found a way to measure the "information" in a given discrete probability distribution! Since we need to be the "least biased" for our die toss problem, we should find the probability distribution which maximizes the entropy H subject to the given constraints. In effect, we have reduced our epistemological problem to a mathematical optimization problem.

Shannon Entropy of a Single Coin Toss

In this article, we won’t go into further detail why Shannon decided to formulate entropy in that particular way, but it suffices to say that if we express entropy in the base 2 logarithm, then it expresses the expected amount of minimum number of bits required to binary encode a message and transfer it across a communication channel. For our purposes H quantifies our expected surprise from a discrete probability distribution. In the case of the coin toss which with probability p it lands heads, Shannon entropy simplifies to:

Amongst all possible values of the probability p, the one which yields the most surprising distribution is the one produced by a fair coin, i.e. p = 0.5. Why is that? Think about it: when are we the least certain about the result? If the probability is less than or greater than 0.5, the coin is biased; hence, in a way, our judgements would be biased towards one result over the other. The point where the epistemic uncertainty is greatest, or equivalently where the entropy is maximized, is precisely p = 0.5. When the coin is fair, the two outcomes are equiprobable.

Interestingly enough, at p = 0.5 both the variance and the entropy are maximized, which is intuitively appealing: at maximum uncertainty the generated results also exhibit the greatest variance, and vice versa.

In the coin toss experiment the variance & entropy are the highest when the coin is fair.

Notice here that we use the natural logarithm. This is perfectly fine because changing the base of the logarithm only rescales the entropy by a positive constant; such a monotonic transformation preserves the locations of the minima and the maxima.
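A short sketch reproduces such a comparison (the grid resolution and styling are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)                  # avoid log(0) at the endpoints
entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)    # binary entropy, in nats
variance = p * (1 - p)                                # variance of a Bernoulli(p) toss

plt.plot(p, entropy, label="Shannon entropy H(p)")
plt.plot(p, variance, label="Variance p(1 - p)")
plt.axvline(0.5, linestyle="--", color="grey")        # both curves peak at p = 0.5
plt.xlabel("Probability of heads, p")
plt.legend()
plt.show()
```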

Since now we have an algebraic expression which analytically quantifies the underlying uncertainty in a given distribution we may proceed one step further and formulate a systematic way of dealing with such problems.

Maximizing Shannon Entropy

In our die toss example, we know that the sample mean is 4.7. From the Law of Large Numbers we know that the sample mean converges to the expected value of the distribution. Of course, we cannot be sure that this particular sample mean is a good approximation of the actual mean, but it is all we currently have.

We will formulate our problem in more general terms. Suppose that we have a discrete random variable X (in a sense a "generalized die") which may take n different possible values. Analytically, such a random variable is represented as follows:

$$P(X = x_i) = p_i, \qquad i = 1, \ldots, n.$$

We want to estimate the probability mass function of X. Now suppose we are presented with some information retrieved from past experiments. In particular, we were observing the quantities

$$f_1(X),\; f_2(X),\; \ldots,\; f_m(X),$$

for which we are informed only about their sample means. This may be the case in scenarios where we do not have direct access to the true outcome of the random variable, but can only measure these derived quantities.

Since we know that each of these sample means is equal to a fixed value, say F_k for the k-th quantity, we have the following constraints on our problem:

$$\sum_{i=1}^{n} p_i = 1, \qquad \sum_{i=1}^{n} p_i\, f_k(x_i) = F_k, \qquad k = 1, \ldots, m.$$

How can we update our beliefs about the system on the basis of this prior information? According to the Principle of Maximum Entropy, we will maximize the entropy function subject to these constraints. We will use Lagrange multipliers to solve the optimization problem analytically. The Lagrangian of the problem reads:

$$\mathcal{L} = -\sum_{i=1}^{n} p_i \log p_i \;-\; (\lambda_0 - 1)\left(\sum_{i=1}^{n} p_i - 1\right) \;-\; \sum_{k=1}^{m} \lambda_k \left(\sum_{i=1}^{n} p_i\, f_k(x_i) - F_k\right).$$

Setting the partial derivatives with respect to each p_i equal to zero and doing some algebraic manipulation yields the following probabilities:

$$p_i = \exp\!\left(-\lambda_0 - \sum_{k=1}^{m} \lambda_k f_k(x_i)\right).$$

Now we need to find a way to evaluate the Lagrange multipliers. By using the first constraint, i.e. that all probabilities must sum to unity, we can specify the first Lagrange multiplier:

$$e^{\lambda_0} = \sum_{i=1}^{n} \exp\!\left(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\right).$$

Hence, by substituting this back into the expression for p_i, we eventually arrive at the Gibbs distribution:

$$p_i = \frac{\exp\!\left(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\right)}{Z(\lambda_1, \ldots, \lambda_m)},$$

where the denominator is known as the partition function:

$$Z(\lambda_1, \ldots, \lambda_m) = \sum_{i=1}^{n} \exp\!\left(-\sum_{k=1}^{m} \lambda_k f_k(x_i)\right).$$

Implementation in Python

What we want to achieve is to construct a MaxEnt model class which we can fit on a sample mean value (provided by the user); fitting will evaluate the Lagrange multiplier and yield the resulting Gibbs distribution.

In the following code we implement a MaxEnt class which accepts a payoffs argument: a NumPy array that maps each possible event to a monetary gain or loss (the corresponding x_i's). Inside the MaxEnt class we will implement the following methods (a sketch of the class follows the list):

  • trainer: implements the helper function g
  • gradient: implements the derivative dg/dλ of the helper function
  • partition function: implements the partition function Z as a function of the Lagrange multiplier
  • Gibbs distribution: returns a NumPy array with the corresponding probability for each event of our sample space
  • fit: implements the Newton-Raphson algorithm to evaluate the Lagrange multiplier given a sample mean value (retrieved from past data)
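Since the listing is built up in stages, here is a minimal sketch of the first stage, containing only the constructor, the partition function and the Gibbs distribution; the trainer, gradient and fit methods are added further below, and the method names used here are illustrative choices rather than the original ones.

```python
import numpy as np

class MaxEnt:
    """Maximum-entropy (Gibbs) model over a finite set of payoffs."""

    def __init__(self, payoffs: np.ndarray):
        self.payoffs = np.asarray(payoffs, dtype=float)   # the x_i's
        self.lambda_ = None                               # Lagrange multiplier, set by fit()

    def partition_function(self, lambda_: float) -> float:
        """Z(λ) = Σ_i exp(-λ x_i)."""
        return float(np.sum(np.exp(-lambda_ * self.payoffs)))

    def gibbs_distribution(self, lambda_: float) -> np.ndarray:
        """p_i = exp(-λ x_i) / Z(λ)."""
        weights = np.exp(-lambda_ * self.payoffs)
        return weights / weights.sum()

    def expected_value(self, lambda_: float) -> float:
        """E[X] under the Gibbs distribution for a given λ."""
        return float(self.gibbs_distribution(lambda_) @ self.payoffs)
```

For instance, MaxEnt(np.arange(1, 7)).expected_value(0.4) evaluates to roughly 2.43, the value quoted below.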

We can see now that the Lagrange multiplier 0.4 corresponds to expected value 2.43; thus, the Gibbs distribution favours the smaller results rather than the large ones.

The graph above can be reproduced with a snippet along the following lines, using the MaxEnt class sketched above (the plotting details are approximate):
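```python
import numpy as np
import matplotlib.pyplot as plt

die = MaxEnt(np.arange(1, 7))                 # payoffs: the six die sides
lambda_ = 0.4
probs = die.gibbs_distribution(lambda_)

plt.bar(die.payoffs, probs)
plt.xlabel("Die side")
plt.ylabel("Probability")
plt.title(f"Gibbs distribution for λ = {lambda_}, E[X] = {die.expected_value(lambda_):.2f}")
plt.show()
```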

To completely specify the Gibbs distribution we need to evaluate the remaining m Lagrange multipliers; so far we have not used the constraints on the expected values. By doing so we arrive at m non-linear equations:

$$F_k = \frac{\sum_{i=1}^{n} f_k(x_i)\, \exp\!\left(-\sum_{j=1}^{m} \lambda_j f_j(x_i)\right)}{Z(\lambda_1, \ldots, \lambda_m)} = -\frac{\partial}{\partial \lambda_k} \log Z(\lambda_1, \ldots, \lambda_m), \qquad k = 1, \ldots, m.$$

After this long detour we are almost ready to answer our initial die toss problem. To remind you, we are informed that the sample mean is equal to 4.7. In our problem we have only one constraint, f(x) = x, hence only one Lagrange multiplier to determine in order to derive the Gibbs distribution. The equation above therefore simplifies to:

$$4.7 = \frac{\sum_{i=1}^{6} i\, e^{-\lambda i}}{\sum_{i=1}^{6} e^{-\lambda i}}.$$

In order to solve this equation we can specify a new helper function whose root is the Lagrange multiplier we are after:

$$g(\lambda) = \frac{\sum_{i=1}^{n} x_i\, e^{-\lambda x_i}}{Z(\lambda)} - \mu,$$

where μ denotes the observed sample mean (for our die, x_i = i and μ = 4.7); the constraint above is equivalent to g(λ) = 0.

We can solve this non-linear equation by utilizing the Newton-Raphson recursive algorithm. But first, we can visualize g as the sample mean varies.
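A rough sketch of that visualization, using the helper function above (the λ range and the set of sample means are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

payoffs = np.arange(1, 7)                     # the die sides
lambdas = np.linspace(-2.0, 2.0, 400)

for mu in [2.5, 3.5, 4.7, 5.5]:               # a few illustrative sample means
    g = [np.sum(payoffs * np.exp(-lam * payoffs)) / np.sum(np.exp(-lam * payoffs)) - mu
         for lam in lambdas]
    plt.plot(lambdas, g, label=f"sample mean = {mu}")

plt.axhline(0.0, color="grey", linestyle="--")   # the roots g(λ) = 0 lie near λ = 0
plt.xlabel("λ")
plt.ylabel("g(λ)")
plt.legend()
plt.show()
```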

We can see that a solution always exists in the area around 0. By differentiating the helper function we arrive at the following formula:

$$\frac{dg}{d\lambda} = \left(\frac{\sum_{i=1}^{n} x_i\, e^{-\lambda x_i}}{Z(\lambda)}\right)^{2} - \frac{\sum_{i=1}^{n} x_i^{2}\, e^{-\lambda x_i}}{Z(\lambda)},$$

which is simply minus the variance of X under the Gibbs distribution with parameter λ.

We will now add the helper function and its gradient to our MaxEnt class, so we can use them in the fit method where we will train our model.
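A sketch of these two methods, consistent with the class skeleton above (they are attached to the class after the fact so that the earlier snippets keep working; the method names remain illustrative):

```python
def trainer(self, lambda_: float, sample_mean: float) -> float:
    """Helper g(λ) = E_λ[X] - sample mean; its root is the multiplier we seek."""
    return self.expected_value(lambda_) - sample_mean

def gradient(self, lambda_: float) -> float:
    """dg/dλ = E_λ[X]² - E_λ[X²], i.e. minus the variance under the Gibbs distribution."""
    probs = self.gibbs_distribution(lambda_)
    mean = float(probs @ self.payoffs)
    second_moment = float(probs @ self.payoffs ** 2)
    return mean ** 2 - second_moment

MaxEnt.trainer = trainer
MaxEnt.gradient = gradient
```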

The Newton-Raphson recursion can be expressed analytically as follows:

$$\lambda_{t+1} = \lambda_t - \frac{g(\lambda_t)}{g'(\lambda_t)}.$$

Now we are ready to implement the fit method! It takes just one float argument, the sample mean that was observed in the past experiments, plus a few optional arguments.
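A corresponding sketch of the fit method, again attached to the class from above (the tolerance, iteration cap and initial guess are arbitrary defaults):

```python
def fit(self, sample_mean: float, initial_guess: float = 0.0,
        tol: float = 1e-10, max_iter: int = 100) -> float:
    """Newton-Raphson iteration λ_{t+1} = λ_t - g(λ_t) / g'(λ_t)."""
    lambda_ = initial_guess
    for _ in range(max_iter):
        step = self.trainer(lambda_, sample_mean) / self.gradient(lambda_)
        lambda_ = lambda_ - step
        if abs(step) < tol:
            break
    self.lambda_ = lambda_          # store the fitted multiplier on the model
    return lambda_

MaxEnt.fit = fit
```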

Now we are ready to deploy & train our MaxEnt model! In our die problem the payoffs are simply the numbers on the sides of the die. Hence, we can represent it simply with a numpy array ranging from 1 to 6.
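A minimal usage sketch of this final step, with the class and methods sketched above:

```python
import numpy as np

die_model = MaxEnt(np.arange(1, 7))            # payoffs: the six sides of the die
lambda_hat = die_model.fit(sample_mean=4.7)

probs = die_model.gibbs_distribution(lambda_hat)
for side, p in zip(die_model.payoffs, probs):
    print(f"P(X = {int(side)}) = {p:.4f}")

# Sanity check: the fitted distribution reproduces the observed sample mean.
print("Expected value:", round(die_model.expected_value(lambda_hat), 2))   # 4.7
```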

Now our problem has been solved! We have updated our beliefs in accordance with the maximum entropy principle. They should look like this:

One Step Further

Since we have implemented our model, you may be wondering what happens to the Gibbs distribution as we alter the sample mean. To answer this question I built a dataframe of Gibbs distributions for a range of sample means; a snippet along these lines produces such a table (the particular range of means is arbitrary):
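```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

die_model = MaxEnt(np.arange(1, 7))
sample_means = np.arange(1.5, 6.0, 0.5)        # sweep of hypothetical sample means

rows = []
for mu in sample_means:
    lam = die_model.fit(sample_mean=mu)
    rows.append(die_model.gibbs_distribution(lam))

frame = pd.DataFrame(rows,
                     index=np.round(sample_means, 1),
                     columns=[f"P(X={i})" for i in range(1, 7)])
frame.index.name = "sample mean"
print(frame)

# How each side's probability varies as the prior information changes.
ax = frame.plot()
ax.set_xlabel("sample mean")
ax.set_ylabel("probability")
plt.show()
```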

You can clearly see the pattern. As the sample mean varies from 1 to 6, the Gibbs distribution adjusts so as to reflect only the information we were given, maximizing our ignorance about everything else. Furthermore, we can visualize the way the probabilities of the Gibbs distribution vary as we alter our prior information about the sample mean of the die.

Conclusion

We learned how to compute the Shannon entropy of given distributions and maximize our entropy given certain constraints suggested by our prior data. Moreover, we found out how to implement the MaxEnt model in Python and compute the Gibbs Distribution.

However, dear reader, keep in mind that the Principle of Maximum Entropy is a statistical principle; it cannot be proved. Every theory has its proponents and its critics, and as proper data scientists we should explore both sides in order to see the principle in its full potential, alongside its limitations.

You can find the code that was used to create this article in the following Github repo:
https://github.com/NicoleChant/MaximumEntropy


Nicole Chantzi
Intuition

Nicole did her undergraduate degree in mathematics with a strong focus on probability theory. She is a geek, a deeply passionate data scientist and an MMORPG gamer.