## Diving into MINE | Towards AI

# MINE: Mutual Information Neural Estimation

## Estimating mutual information using arbitrary neural networks through MINE

# Introduction

Mutual information, also known as information gain, has been successfully used in the context of deep learning (as we will see soon) and deep reinforcement learning (e.g., VIME, EMI) to measure or enhance the coupling between two representations. In this article, we discuss in detail a neural estimator named MINE (Mutual Information Neural Estimation), published by Mohamed Ishmael Belghazi et al. at ICML 2018, that allows us to estimate mutual information directly.

This article consists of three parts: we first introduce the concept of mutual information, building some intuition for what we are dealing with. Then we present the MINE algorithm and discuss some of its applications in deep learning. Finally, we provide all the proofs you need to make sense of the results discussed here.

Note that this is the first article in our series on mutual information; more will follow in the coming weeks. We hope you enjoy the journey!

# Mutual Information

Mathematically, the mutual information between two variables *X* and *Z* is defined as

*I(X; Z) = H(X) - H(X|Z)*

where the Shannon entropy *H(X)=-E_{p(x)}[log(p(x))]* quantifies the uncertainty in *X* and the conditional entropy *H(X|Z)=-E_{p(x, z)}[log(p(x|z))]* measures the uncertainty in *X* given *Z.*

There are three slightly different interpretations for mutual information:

- Formally, mutual information captures the statistical dependence between random variables. (In contrast, the correlation coefficient only captures the linear dependence).
- Intuitively, the mutual information between *X* and *Z* describes the amount of information learned about *X* from knowledge of *Z*, and vice versa.
- The direct interpretation is that it is the decrease of the uncertainty in *X* given *Z*, or the other way around.

Mutual information is also equivalent to the KL-divergence between the joint probability *P(X, Z)* and the product of the marginals *P(X)*⊗*P(Z)*:

*I(X; Z) = D_{KL}(P(X, Z) ‖ P(X)⊗P(Z))*   (Eq.1)

The intuition behind this KL-divergence is that the more different the joint is from the product of the marginals, the stronger the dependence between *X* and *Z* is.
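To make the two formulations concrete, here is a small sketch (using an illustrative toy joint distribution, not from the article) checking that the entropy form and the KL form give the same number:

```python
import numpy as np

# Toy joint distribution of two binary variables X and Z (rows: x, cols: z).
# The numbers are illustrative.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1)          # marginal P(X)
pz = joint.sum(axis=0)          # marginal P(Z)

# Entropy form: I(X; Z) = H(X) - H(X|Z)
h_x = -np.sum(px * np.log(px))
p_x_given_z = joint / pz        # column z holds P(x|z)
h_x_given_z = -np.sum(joint * np.log(p_x_given_z))
mi_entropy = h_x - h_x_given_z

# KL form: I(X; Z) = KL(P(X, Z) || P(X) ⊗ P(Z))
product = np.outer(px, pz)
mi_kl = np.sum(joint * np.log(joint / product))
```

Both expressions evaluate to the same positive number here, reflecting the strong dependence between the two variables.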

# MINE Algorithm

The MINE estimator gives the greatest lower bound of the mutual information *I(X; Z)* by computing

*I(X; Z) ≥ sup_T E_{p(x, z)}[T] - log(E_{p(x)⊗p(z)}[e^T])*   (Eq.2)

where *T* can be any function that takes *x* and *z* as inputs and outputs a real number. This greatest lower bound suggests that we are free to use any neural network to approximate the mutual information, and, thanks to the expressive power of neural networks, such an approximation can achieve arbitrary accuracy. The following algorithm does exactly what we just described, where *θ* denotes the network parameters

The gradient of *V(θ)* is

*∇_θV(θ) = E_{p(x, z)}[∇_θT_θ] - E_{p(x)⊗p(z)}[e^{T_θ}∇_θT_θ] / E_{p(x)⊗p(z)}[e^{T_θ}]*

This estimator is biased when computed on minibatches because of the expectation in the denominator, so the authors suggest replacing that expectation with an exponential moving average across minibatches. For small learning rates, the improved MINE gradient estimator can be made to have an arbitrarily small bias.
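As a minimal sketch of the algorithm, the following trains a single-parameter bilinear critic *T_θ(x, z) = θxz* with plain gradient ascent and the exponential-moving-average correction on a correlated Gaussian pair. The critic, hyperparameters, and data are illustrative assumptions; the paper uses a full neural network, which would tighten the bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian pair; the true mutual information is
# I(X; Z) = -0.5 * log(1 - rho**2) ≈ 0.51 nats.
rho, n = 0.8, 5000
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

theta = 0.0              # single critic parameter: T_theta(x, z) = theta * x * z
lr, alpha, ema = 0.05, 0.01, 1.0
for _ in range(2000):
    z_shuf = rng.permutation(z)      # shuffling z simulates p(x) ⊗ p(z)
    e_marg = np.exp(theta * x * z_shuf)
    # Exponential moving average of E_Q[e^T] to debias the minibatch gradient.
    ema = (1 - alpha) * ema + alpha * e_marg.mean()
    grad = (x * z).mean() - (x * z_shuf * e_marg).mean() / ema
    theta += lr * grad               # gradient ascent on the lower bound

# Donsker-Varadhan lower-bound estimate of I(X; Z)
z_shuf = rng.permutation(z)
mi_estimate = (theta * x * z).mean() - np.log(np.exp(theta * x * z_shuf).mean())
```

With this deliberately weak critic, the estimate stays a strictly positive but loose lower bound on the true value; a richer *T* closes the gap.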

**Equitability**

Let *Y=f(X)+σ⊙ϵ*, where *f* is a deterministic non-linear transformation and *ϵ* is random noise. The *equitability* property of mutual information says that the mutual information between *X* and *Y*, *I(X; Y)*, is invariant to *f* and should depend only on the amount of noise, *σ⊙ϵ*. That is, whether *f(x)=x*, *f(x)=x³*, or something else, *I(X; Y)* stays approximately the same as long as *σ⊙ϵ* remains unchanged. The following figure demonstrates that MINE captures equitability.
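The noiseless case (*σ=0*) is easy to verify directly: for any invertible *f*, *I(X; Y)* equals *H(X)* no matter which *f* is used. A small sketch with an illustrative discrete example:

```python
import numpy as np

def discrete_mi(xs, ys):
    """I(X; Y) in nats from sample pairs, via empirical joint/marginal frequencies."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    mi = 0.0
    for vx in np.unique(xs):
        for vy in np.unique(ys):
            pxy = np.mean((xs == vx) & (ys == vy))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(xs == vx) * np.mean(ys == vy)))
    return mi

x = np.array([-2, -1, 0, 1, 2] * 100)   # X uniform over 5 values, H(X) = log 5
mi_linear = discrete_mi(x, x)            # f(x) = x
mi_cubic = discrete_mi(x, x**3)          # f(x) = x^3
# Both equal H(X) = log 5: the MI is unchanged by the deterministic transform.
```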

## Applications of MINE

**Maximizing Mutual Information to Improve GANs**

GANs commonly face the mode-collapse problem, where the generator fails to capture the diversity of the training data. InfoGAN (Chen et al. [2]) mitigates this issue by defining the latent vector *Z=[ϵ, c]*, the concatenation of noise *ϵ* and code variables *c*, as shown in the figure above. The noise vector *ϵ* is treated as a source of incompressible noise, while the latent code *c* captures the salient structured semantic features of the data distribution. In a vanilla GAN, the generator is free to ignore *c* since nothing enforces this dependency. To enforce it, we add an extra mutual-information term to the generator loss as follows

Notice that the mutual information is unbounded, so *Lₘ* can overwhelm *L_μ*, leading to a failure mode in which the generator puts all its attention on maximizing the mutual information and ignores the adversarial game with the discriminator. The authors propose adaptively clipping the gradient from the mutual information so that its Frobenius norm is at most that of the gradient from the discriminator:
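A sketch of such a clipping rule; the helper name and exact rescaling are illustrative assumptions, and the paper's formulation may differ in detail:

```python
import numpy as np

def clip_mi_gradient(g_mi, g_disc):
    """Rescale the mutual-information gradient so its norm never exceeds
    the norm of the adversarial (discriminator) gradient.
    Illustrative helper, not the paper's exact formulation."""
    n_mi, n_disc = np.linalg.norm(g_mi), np.linalg.norm(g_disc)
    if n_mi > n_disc and n_mi > 0:
        return g_mi * (n_disc / n_mi)
    return g_mi

g_mi = np.array([3.0, 4.0])       # norm 5
g_disc = np.array([0.6, 0.8])     # norm 1
clipped = clip_mi_gradient(g_mi, g_disc)   # rescaled to norm 1, same direction
```

This keeps the direction of the mutual-information gradient while preventing it from dominating the adversarial signal.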

Now we come back to *c*. *c* is sampled from an auxiliary distribution *Q*, a network optimized by *I(G([ϵ, c]); c)* through *c* (as the red dotted curve shows). In most experiments, *Q* shares all convolutional layers with the discriminator *D* and has an additional final fully-connected layer that outputs the parameters of the conditional distribution *Q(c|x)*, e.g., the mean and variance of a normal distribution for a continuous latent code. Furthermore, since *c* is sampled from the conditional distribution *Q(c|x)*, we may also have to resort to the reparameterization trick in order to update *Q*.

**Maximizing Mutual Information to Improve BiGANs**

**A Brief Introduction to BiGANs**

Bidirectional Generative Adversarial Networks, or BiGANs (Jeff Donahue et al. [3]), are, as the name suggests, bidirectional in that the real data is encoded before being passed to the discriminator. The discriminator takes as input both the feature representations (*z* and *E(x)*) and the corresponding data (*G(z)* and *x*), and tries to distinguish the generator pair from the encoder pair. The generator and encoder collaborate to fool the discriminator by driving *E(x)* toward *z* and *G(z)* toward *x*.

**BiGANs Cooperate with MINE**

To reduce the reconstruction error, the authors prove that the reconstruction error *R* is bounded by

*R ≤ D_{KL}(q ‖ p) - I_q(x, z) + H_q(x)*   (Eq.3)

where *p* denotes the joint distribution of the generator part and *q* that of the encoder part. From this inequality, we can easily see that maximizing the mutual information between *x* and *E(x)* minimizes the reconstruction error. Thus, we add a mutual-information term to the encoder loss and obtain

**Information Bottleneck**

The gist of the information bottleneck is to summarize a random variable *X* so as to achieve the best tradeoff between accuracy and complexity. That is, we want to extract the *most concise* representation *Z* that captures the factors of *X* *most relevant* to predicting *Y*. The objective of the information bottleneck method is thus to minimize the Lagrangian

*L = H(Y|Z) + βI(X; Z)*

where *β* is a Lagrange multiplier. The first term minimizes the uncertainty of *Y* given *Z*, while the second term minimizes the dependency between *X* and *Z*.
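As a toy illustration of the tradeoff, the following computes the Lagrangian for a binary *X* with label *Y = X* and a noisy-channel encoder; all numbers are illustrative:

```python
import numpy as np

beta = 0.5                          # Lagrange multiplier (illustrative)
eps = 0.1                           # encoder noise level
px = np.array([0.5, 0.5])           # X uniform on {0, 1}; label Y = X
q_z_given_x = np.array([[1 - eps, eps],
                        [eps, 1 - eps]])   # rows: x, cols: z

p_xz = px[:, None] * q_z_given_x    # joint P(X, Z) under the encoder
pz = p_xz.sum(axis=0)

# I(X; Z) = KL(P(X, Z) || P(X) P(Z))
i_xz = np.sum(p_xz * np.log(p_xz / (px[:, None] * pz[None, :])))

# Since Y = X, the accuracy term is H(Y|Z) = H(X|Z).
p_x_given_z = p_xz / pz
h_y_given_z = -np.sum(p_xz * np.log(p_x_given_z))

lagrangian = h_y_given_z + beta * i_xz
```

Note how the two terms pull against each other: here *H(Y|Z) + I(X; Z) = H(X)*, so a noisier encoder lowers the complexity term exactly as much as it raises the uncertainty term.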

# Supplementary Materials

**Proof for Eq.1**

**Proof for Eq.2 being the greatest lower bound of I(X; Z).**

To show that Eq.2 gives the greatest lower bound of *I(X; Z)*, we just need to prove the following statement: for any distributions *P* and *Q* and any function *T*,

*E_P[T] - log(E_Q[e^T]) ≤ D_{KL}(P ‖ Q)*   (Eq.4)

with equality for an appropriate choice of *T*.

We first define the Gibbs distribution *G* by

*dG = (e^T / E_Q[e^T]) dQ*

then we represent the left side of Eq.4 using the Gibbs distribution *G* we just defined:

*E_P[T] - log(E_Q[e^T]) = E_P[log(dG/dQ)]*

now we compute the difference between the right and left sides of Eq.4:

*Δ = D_{KL}(P ‖ Q) - E_P[log(dG/dQ)] = E_P[log(dP/dQ) - log(dG/dQ)] = E_P[log(dP/dG)] = D_{KL}(P ‖ G)*

The non-negativity of the KL-divergence ensures *Δ ≥ 0*; thus, Eq.4 always holds. Moreover, the bound is attained when *G = P*, i.e., when *T = log(dP/dQ)*, so it is indeed the greatest lower bound.
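The whole argument can be checked numerically on a discrete example (distributions chosen for illustration): no choice of *T* exceeds *D_KL(P ‖ Q)*, and the Gibbs choice *T = log(p/q)* attains it exactly:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two discrete distributions over four outcomes (illustrative numbers).
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
kl_pq = np.sum(p * np.log(p / q))

def dv_value(t):
    """The Donsker-Varadhan functional E_P[T] - log(E_Q[e^T])."""
    return np.sum(p * t) - np.log(np.sum(q * np.exp(t)))

# Random choices of T all stay below KL(P || Q) ...
best_random = max(dv_value(rng.standard_normal(4)) for _ in range(100))
# ... while the Gibbs choice T = log(p/q) attains it exactly.
tight = dv_value(np.log(p / q))
```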

**Proof for Eq.3**

The reconstruction error is defined as

*R = E_{q(x, z)}[-log(p(x|z))]*

where *p* is the generator and *q* the encoder. This roughly measures the loss of fidelity with which the generator recovers *x* from the *z* produced by the encoder. Now we have

## References

1. Mohamed Ishmael Belghazi et al. Mutual Information Neural Estimation. ICML 2018.
2. Xi Chen et al. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets.
3. Jeff Donahue et al. Adversarial Feature Learning.