# Monday Notes 9–17-18

*I’m a little late. Considered calling this “Tuesday notes” instead, but I’ll stick with the theme.*

**Entropy in Statistics**

Entropy is a quantity introduced by Claude Shannon in 1948. It serves as a measure of the *information content* of a message sent over a channel which may introduce random noise.

For a random variable *x* in the discrete domain *D* with probability function *p* satisfying the standard axioms of probability (∀ *x *∈ *D*) (*p*(*x*)≥0* *and Σ*p*(*x*)=1), the *entropy* of *p* in *D *is defined as

For a continuous domain the entropy of *p* in *D* is defined as

The reason these formulas are used as measures of information is because they both satisfy four intuitive properties:

- Information is an increasing function of the number of messages [size of
*D*] - Information is a non-negative quantity
- Events that always occur do not convey information [if
*p*(*x*)=1 then log(*p*(*x*)) = 0] - The information content of independent events is additive [for non-overlapping domains
*D*and*F*,*H*(*p*|*D*∪*F*)=*H*(*p*|*D*)+*H*(*p*|*F*)]

In Shannon’s original paper, he proves that the formulas given above uniquely satisfy these conditions — apart from using a different base for the logarithm or scaling by a constant.

**Maximum Entropy Models**

In statistics, the principal of maximum entropy is a modern restatement of the principle of indifference. The latter the philosophy that if there are several possible outcomes of a random event and no other information available, one should assign equal probability to each. More concretely, one should assume that all coins are fair unless there is some evidence — such as a sample of several flips — to indicate otherwise.

If we consider experimental data to be the result of some (at least partially) random process, then each new data point can be interpreted as a *message* carrying information about that random process. We would like each new message to provide maximal information about whether we should accept or reject our current *model* of that random process. Thus, by selecting a model which maximizes entropy, we are setting ourselves up to make the most out of our experimental data. It has been said repeatedly in the literature that maximum entropy models are **maximally non-committal on unobserved data.**

The definition of a maximum entropy model is the probability distribution *p* on the domain *D *which maximizes *H*(*p*|*D*). Finding this probability distribution is most commonly done using the method of Lagrange multipliers within the language of the calculus of variations. I won’t give a detailed proof, but the basic outline is that you begin with the case of a discrete domain and by treating probability distributions as vectors, or ordered lists of probabilities for each possible outcome. The probability vector can then be *perturbed* by adding a small vector ε*v* where ε is a small scalar and *v* an arbitrary probability vector. The probability vector *p *where *H*(*p*|*D*) is constant when ε is equal to zero is then an extrema of *H*. It can be shown that this extrema is the unique maximum of *H*. This argument can be carefully extended to continuous domains by replacing probability vectors with their natural extension, probability density functions, and using the fundamental lemma of the calculus of variations.

**Softmax falls right out of Maximum Entropy Models**

The final remark I’d like to make on this topic (for now) is an interesting property of the softmax function. In the 1978 paper by Agmon and Alhassid, the following formula is derived using maximum entropy methods. It gives the probability of the *i*-th of *N *outcomes given *M* empirical moments.

Upon careful examination, one might notice that this is exact form of the softmax function used in logistic regression and deep learning.

The above formulas can also be interpreted as a description of the *i*-th component of the probability vector *p. *Agmon and Alhassid go on to show that there is scalar potential function whose minimum is attained when the the probability vector has maximal entropy and the *M* moment constraints are satisfied.

Going completely into the realm of speculation, I suspect that this algorithm is tightly related why deep learning models are able to generalize from a limited set of data. By using softmax carefully in a deep neural network, one is asymptotically guaranteed to arrive at a maximum entropy model, provided there is enough data.