Deep InfoMax | Explanation of Mutual Information Maximization

An Introduction to the Deep InfoMax Method for Mutual Information Maximization

This post intends to provide an explanation of the deep learning method for maximizing mutual information known as Deep InfoMax, or DIM. The information in this post is based on the February 2019 version of Dr. Devon Hjelm’s 2019 International Conference on Learning Representations (ICLR) paper, Learning Deep Representations by Mutual Information Estimation and Maximization. My aim is to introduce Dr. Hjelm’s method to an audience that may not have the time to analyze the entire conference paper, in a way that allows even readers with limited background in machine learning to understand the concepts.

What is Deep InfoMax and why do we need it?

Deep InfoMax is a deep neural network architecture that improves upon some of today’s leading representation-learning methods. It resembles a traditional Convolutional Neural Network (CNN) with some additional features.

[Image: Convolutional Neural Network. This and all images below were created by the author.]

It improves the performance and accuracy of learning good representations without annotations. It is technically self-supervised, a form of unsupervised learning, which makes it another important tool for learning representations of unlabeled datasets.

What makes Deep InfoMax special?

I will explain each of these in detail in the coming sections, but these are the reasons why you should be learning about Deep InfoMax, and not some other method of machine learning.

▹ First, mutual information maximization. This is definitely not a new concept; Deep InfoMax follows examples like Mutual Information Neural Estimation, known as MINE. Deep InfoMax differs for reasons explained in more detail below, including eliminating the need for a generator and improving on the Kullback-Leibler-based estimation used by MINE.

▹ Second, matching representations to a prior distribution adversarially, which implicitly trains the encoder’s output to resemble samples drawn from that prior.

▹ Third, structure matters. This is a theme consistent through Dr. Hjelm’s paper. The results demonstrate that including knowledge about locality in the input can significantly improve performance.

Supervised vs Unsupervised Learning

No, this does not involve watching your neural network while it trains to ensure it doesn’t become self-aware and take over the world. Supervised learning pertains to the use of “labels,” which are the meaningful or informative descriptions that the neural network is trying to predict.

If the data being trained on contains labels that the machine learning model tries to predict and match, this is what is thought of as supervised learning.

On the other hand, training on unlabeled data is one type of unsupervised learning. Unsupervised learning that involves training the neural network to group items it determines to be similar is considered clustering.

“Self-supervised” Learning

Now I’d like to take a moment to talk about something that might sound like semantics, but is actually a neat trick of Deep InfoMax. It is not supervised, but it’s not unsupervised either. One of the things that makes Deep InfoMax unique is that it generates its own labels. These labels are simpler than what we are used to: after the input samples have passed through the encoder, their feature maps are labeled “real.” The network then takes the feature vector of one image, pairs it with the feature map of a different image, and passes the pair through the discriminator. Because that feature map and feature vector do not belong together, the pair is labeled “fake.” It also creates a new set of images by mixing patches of different images and labels the members of this new set as “fake.”

This self-supervised aspect, and the Deep InfoMax architecture as a whole, is a complicated way of answering the simple question, “Do these things go together?”

It knows which samples don’t go together because it put them there, and by passing the “fake” pairs through a discriminator it maximizes an estimate of mutual information, which I will discuss in the next section. The end goal of this process is to train the encoder to fool the discriminator and the discriminator to know the difference.
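To make this pairing trick concrete, here is a minimal sketch (my own simplification, not code from the paper) of how “real” and “fake” pairs can be built from a single batch: matching feature maps and feature vectors form the real pairs, and shuffling the batch so each feature map is matched with another image’s feature vector forms the fake pairs.

```python
import torch

def make_pairs(feature_maps, feature_vectors):
    """Build "real" and "fake" (feature map, feature vector) pairs from one batch.

    feature_maps:    (N, C, H, W) local feature maps from the encoder
    feature_vectors: (N, D) global feature vectors from the encoder
    """
    # Real pairs: each feature map with the feature vector of the same image.
    real_pairs = (feature_maps, feature_vectors)
    # Fake pairs: roll the batch by one so every feature map is matched
    # with the feature vector of a different image.
    fake_pairs = (feature_maps, torch.roll(feature_vectors, shifts=1, dims=0))
    return real_pairs, fake_pairs
```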

Mutual Information

Self-supervision is a good lead-in to our next topic: mutual information. What is mutual information? Mutual information can be viewed as a divergence, or a difference, between two distributions: the joint distribution of two variables and the product of their marginals. The joint distribution is the probability that the two variables co-occur, and the product of marginals is the probability that they occur independently of each other.
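Written out formally (in standard notation), the mutual information between two variables X and Y is the Kullback-Leibler divergence between their joint distribution and the product of their marginals:

I(X;Y) \;=\; D_{\mathrm{KL}}\!\left(P_{XY}\,\|\,P_X \otimes P_Y\right) \;=\; \mathbb{E}_{p(x,y)}\!\left[\log \frac{p(x,y)}{p(x)\,p(y)}\right]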

So, in practice, maximizing mutual information here amounts to training a classifier to determine which samples go together and which samples don’t. We use mutual information because it reduces our uncertainty about one variable given a known value of the other.

By passing samples from the joint distribution, known as the “real” samples, and samples from the product of marginals, the “fake” samples, through the discriminator, the discriminator is trained to produce a tight estimate of the mutual information, which the encoder is then trained to maximize.
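To give a sense of what “maximizing the estimate” looks like in practice, here is a minimal sketch of the Jensen-Shannon-style objective favored in the paper. It is written against a hypothetical discriminator whose scores for real (matching) and fake (mismatched) pairs are passed in; the network producing those scores is up to you.

```python
import torch.nn.functional as F

def jsd_mi_objective(scores_real, scores_fake):
    """Jensen-Shannon-based lower bound on mutual information.

    scores_real: discriminator scores for matching (joint) pairs
    scores_fake: discriminator scores for mismatched (product-of-marginals) pairs
    """
    # softplus(z) = log(1 + exp(z)); real pairs are pushed toward high
    # scores and fake pairs toward low scores.
    e_joint = -F.softplus(-scores_real).mean()
    e_marginals = F.softplus(scores_fake).mean()
    return e_joint - e_marginals  # maximize this estimate (minimize its negative as a loss)
```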

Deep InfoMax uses gradients from the discriminator to help train the encoder network. Unmitigated, this method of training would yield representations that contain noisy information: information that increases mutual information without providing anything useful for classification. Fortunately, there is a solution for handling this noise, discussed in the next section.

Matching Representations to a Prior Distribution

Pure mutual information has its limitations. Maximizing mutual information between the input image and the output representation globally can result in learning features that are irrelevant, because a collection of unrelated details (for example, the background) carries more unique information than the redundant information shared across locations. Encoding separate but related pieces of information doesn’t increase mutual information as much as encoding details of the background. These irrelevant patches of information can be considered noise. In order to lessen its effect, the discriminator is trained to estimate the divergence between a push-forward distribution of the encoder’s outputs and a prior. The encoder is then implicitly trained to minimize this estimate. This technique of combining mutual information maximization with prior matching is similar to what is performed by adversarial autoencoders.
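For readers who want to see the shape of this regularizer, here is a rough sketch of the adversarial prior-matching step (my own simplification; the layer sizes, the 64-dimensional representation, and the uniform prior are illustrative choices, not the paper’s exact settings): a small discriminator learns to tell encoder outputs from prior samples, and the encoder is trained to fool it, much as in an adversarial autoencoder.

```python
import torch
import torch.nn as nn

# Hypothetical prior discriminator: scores how "prior-like" a representation looks.
prior_disc = nn.Sequential(
    nn.Linear(64, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1),
)
bce = nn.BCEWithLogitsLoss()

def prior_matching_losses(encoded):
    """Return (discriminator loss, encoder loss) for one batch of representations."""
    n, dim = encoded.shape
    prior_samples = torch.rand(n, dim)  # samples from a uniform prior on [0, 1]^dim
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator: prior samples count as "real", encoder outputs as "fake".
    d_loss = bce(prior_disc(prior_samples), ones) + bce(prior_disc(encoded.detach()), zeros)
    # Encoder: try to make its outputs look like prior samples.
    e_loss = bce(prior_disc(encoded), ones)
    return d_loss, e_loss
```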

Structure Matters

Just like any other CNN, Deep InfoMax takes patches across the input image and encodes them. Incorporating the location of each patch and its corresponding features is important because it biases the model to learn things that are expressible structurally. The features of each patch can vary based on location and the network needs to understand why those specific patches are related. It is a trend of CNNs to classify images based on textures that correlate with the class label rather than with shape or size. Intuition tells us that depending on what else is being classified, features like texture alone would not be enough to meaningfully differentiate between classes. For example recognizing the texture of an animal’s fur might distinguish it from the texture of a food or plants, however it would be insufficient for classifying different animal species with similar fur.

Incorporating spatial locality is accomplished by the discriminator. It is the discriminator’s objective to maximize the mutual information estimation either globally or over a local subset.

How it works

As stated before, Deep InfoMax takes patches across the input image and encodes them using a convnet, building a feature map of the feature vectors from each patch. This feature map is encoded further, and the result is a new feature map representation that reflects some structural aspect of the data, e.g. spatial locality. This feature map is then summarized into a global feature vector. From here two different discriminator architectures are described: “concatenate-and-convolve” and “encode-and-dot-product.”

[Image: Feature Map for local and global feature vector]
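To ground the description above, here is a toy encoder along those lines; the layer sizes, kernel sizes, and 32 x 32 input are my own placeholders, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Produces a local feature map (one vector per patch) and a global feature vector."""
    def __init__(self, global_dim=64):
        super().__init__()
        # Convolutional trunk: each spatial position of the output map
        # summarizes a patch of the input image.
        self.local = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Summarize the whole feature map into a single global vector.
        self.to_global = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, global_dim),
        )

    def forward(self, x):                          # x: (N, 3, 32, 32), e.g. CIFAR-sized images
        feature_map = self.local(x)                # (N, 128, 8, 8) local features
        global_vec = self.to_global(feature_map)   # (N, global_dim)
        return feature_map, global_vec
```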

Concatenate-and-convolve: The global feature vector is concatenated with the lower-level feature map at every location and passed through a 1 x 1 convolutional discriminator to produce a “real” score at each location. A fake feature map, taken from a different image, is paired with the same global feature vector and passed through the discriminator to produce the “fake” scores.

[Image: Concatenate-and-convolve]
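A minimal sketch of the concatenate-and-convolve discriminator, reusing the toy shapes from the encoder sketch above (again, placeholder sizes rather than the paper’s exact ones), might look like this:

```python
import torch
import torch.nn as nn

class ConcatAndConvolve(nn.Module):
    """Scores every location of a feature map against a global feature vector."""
    def __init__(self, map_channels=128, global_dim=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Conv2d(map_channels + global_dim, 512, kernel_size=1), nn.ReLU(),
            nn.Conv2d(512, 1, kernel_size=1),  # one score per spatial location
        )

    def forward(self, feature_map, global_vec):
        n, _, h, w = feature_map.shape
        # Copy the global vector to every spatial location, then concatenate channel-wise.
        tiled = global_vec.view(n, -1, 1, 1).expand(-1, -1, h, w)
        return self.score(torch.cat([feature_map, tiled], dim=1))  # (N, 1, H, W) scores
```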

Encode-and-dot-product: This time the global feature vector is encoded using a fully connected network, and the lower-level feature map is encoded using 1 x 1 convolutions. The dot product of the encoded global feature vector and the encoded local feature map is then taken at each location to produce the scores.

[Image: Encode-and-dot-product]
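And a comparable sketch of encode-and-dot-product, with the same placeholder sizes (this is a simplification; the paper’s architecture has more detail than shown here):

```python
import torch
import torch.nn as nn

class EncodeAndDotProduct(nn.Module):
    """Projects both inputs into a shared space and scores each location by a dot product."""
    def __init__(self, map_channels=128, global_dim=64, shared_dim=512):
        super().__init__()
        self.embed_global = nn.Linear(global_dim, shared_dim)      # fully connected network
        self.embed_local = nn.Conv2d(map_channels, shared_dim, 1)  # 1 x 1 convolutions

    def forward(self, feature_map, global_vec):
        g = self.embed_global(global_vec)           # (N, shared_dim)
        l = self.embed_local(feature_map)           # (N, shared_dim, H, W)
        # Dot product between the global embedding and the local embedding at each location.
        return torch.einsum("nd,ndhw->nhw", g, l)   # (N, H, W) scores
```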

Architecture

The Deep InfoMax architecture is difficult to define because there isn’t just one, but several versions described.

In addition to the two versions of the local discriminator outlined in the previous section, four different mutual information configurations were tested. The first used global feature estimation only, and the other three used local feature estimation with different mutual information estimators:

1. Global only

2. Donsker & Varadhan

3. Jensen-Shannon

4. Noise-Contrastive Estimation

Tests were conducted using a variety of datasets of varying sizes and classes.

[Image: Test datasets]

Consistent across all tests were the hidden layers of the evaluation classifiers: they consisted of two 512-unit layers, with the ReLU activation function used for both.
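Following that description, the evaluation classifier could be sketched as below; the input dimension depends on which representation is being evaluated, so it is left as a parameter.

```python
import torch.nn as nn

def make_eval_classifier(rep_dim, num_classes):
    """Small classifier used to evaluate a frozen representation:
    two 512-unit hidden layers with ReLU activations."""
    return nn.Sequential(
        nn.Linear(rep_dim, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, num_classes),
    )
```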

Results

Knowledge of the location of each patch and its corresponding feature vector was shown to greatly improve the representation’s quality.

[Image: Test results]

As shown in the table above, Deep InfoMax outperformed the other networks, with the exception of its global-only architecture. The Bidirectional GAN network was the next-highest performer, actually outperforming one of the Deep InfoMax local architectures on one of the datasets.

Conclusions

Deep InfoMax is an improvement upon existing unsupervised classification architectures. Because it maximizes a mutual information estimate over its self-supervised samples, it is able to train an encoder jointly with the discriminator. By matching representations to a prior distribution, the system is able to lessen the effect of noise. And finally, and crucially, because the joint distribution in the mutual information estimate depends on feature location, we can say that structure in fact matters.

Other Resources

Here is a short video explaining the topic: https://youtu.be/jHt4nNC6ZcA

Dr. Devon Hjelm’s conference paper, Learning Deep Representations by Mutual Information Estimation and Maximization, which this post is based on can be found at: https://arxiv.org/abs/1808.06670

Additional resources on this topic, provided by Microsoft can be found at:

https://www.microsoft.com/en-us/research/blog/deep-infomax-learning-good-representations-through-mutual-information-maximization/

and

https://youtu.be/s-Us47zp-48
