Learning Representations by Maximizing Mutual Information Across Views — A Summary by Kyle Dennis and Ren Hu

Kyle Dennis
Machine Intelligence and Deep Learning
16 min read · Apr 24, 2022

Presentation

A presentation regarding this paper can be seen at the following link: https://www.youtube.com/watch?v=XmiNGRtejMM

Introduction

This paper proposes a model described as an “approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context.” Before continuing, it is worth breaking that sentence down and explaining each of its pieces.

The main example used throughout this paper to describe the model’s functionality is image classification. A representation captures the meaning that can be inferred from some input; for example, an inputted image of a cat would represent a cat, as shown in Fig.1. In the case of a self-supervised model such as AMDIM, the representation generated by evaluating an image of a cat would be an output that translates to images containing cats. The model learns representations in a self-supervised way: it creates and uses its own labels rather than being provided labels defined by a human ahead of time, as in supervised learning (Fig.2). This is similar to the unsupervised learning shown in Fig.3, but the difference between the two learning styles is highlighted in the next part of the sentence: “maximizing mutual information between features”. Mutual information can be thought of as a quantity that expresses how related two different variables, or two different sets of variables, are. Maximizing mutual information between features therefore means training the model to determine, as reliably as possible, whether a pair of things (variables, sets of variables, etc.) belongs together or not. The features are the pair of things between which the model maximizes mutual information.

The shared context consists of a single image. The multiple views are various versions of that shared context (e.g., augmented versions of the original image, portions of the image, etc.). Features extracted from multiple views of a shared context are therefore variables and sets of variables taken from different versions of a single inputted image. Maximizing the mutual information between these pieces of different augmented versions of a single image forces the model to highlight the high-level factors (patterns, objects, etc.) in the inputted image that will allow it to identify images of that same representation more efficiently and successfully in the future (e.g., successfully highlighting the shape of a cat’s ears, the color of a cat’s fur, and the shape and color of a cat’s eyes in an inputted image of a cat will allow the model to better identify other images of cats in the future). This helps address the generalization problem that many models experience when transitioning from training data to testing data.

Fig.1 Example of Representation Learning using Image Data. For instance, the cat image, as the input x, is fed into the representation learning box, which outputs a formulation that represents the cat.
Fig.2 Example of Supervised Learning. For instance, “cat” is the label of this image, and the task is to predict the probability that the image input x belongs to the class “cat”.
Fig.3 Example of Unsupervised Learning. For instance, no “cat” label is provided; instead, the distance between the input and each cluster center is computed to determine which cluster the input belongs to.
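To make the idea of mutual information concrete, here is a small, self-contained sketch (our own illustration, not from the paper) that estimates the mutual information between two discrete variables from a made-up table of joint counts:

```python
import numpy as np

def mutual_information(joint_counts):
    """Estimate I(X; Y) in nats from a table of joint counts."""
    pxy = joint_counts / joint_counts.sum()      # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(y)
    nonzero = pxy > 0                            # avoid log(0)
    return np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero]))

# Hypothetical example: X = "image shows a cat", Y = "pointy ears detected".
# Strongly related variables share information, so the value is clearly above zero.
counts = np.array([[40.0,  5.0],
                   [ 5.0, 50.0]])
print(mutual_information(counts))   # roughly 0.36 nats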

Local DIM vs AMDIM

This paper proposes a model called Augmented Multiscale Deep InfoMax (AMDIM), an improved extension of the local Deep InfoMax (DIM) model. AMDIM has the same purpose as local DIM (a model that learns representations from unlabeled data by maximizing mutual information between the inputted data and the outputs of a deep neural network encoder), but extends the model in four ways:

  1. Rather than maximizing mutual information between features extracted from a single, unaugmented copy of each image, AMDIM differs from local DIM in that mutual information is maximized between features extracted from independently augmented copies of each image. This is to say that rather than analyzing the features from a single image of a cat, features of the cat are analyzed from many different copies of the image of the cat, all uniquely modified.
  2. Rather than maximizing mutual information between a single global and local scale, AMDIM differs from local DIM in that mutual information is maximized between multiple feature scales simultaneously. Essentially, the objective compares features across a wider variety of layer dimensions than local DIM does.
  3. The encoder utilized in AMDIM consists of a more powerful architecture than that of local DIM.
  4. Lastly, AMDIM introduces mixture-based representations.

While DIM was introduced in 2019 as a method that “outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks” with some standard architectures [Hjelm 2019], AMDIM is shown to extend and improve the initial DIM implementation and introduce a new “self-supervised” form of learning. It should be noted that the main difference between the two models is that while DIM aims to learn representations by maximizing mutual information between a single inputted image and its output from the deep neural network encoder, AMDIM aims to maximize mutual information over many different augmented and scaled views of a single image (hence the Augmented and Multiscale (AM) addition to the model’s name).

Local DIM

The local DIM model is proposed by Hjelm et al. [2019]. In this paper, the authors adopt this model but use a pair of augmented images as the input instead of a single, unaugmented copy of each image. The main architecture is shown, simplified, in Fig.4. The objective (cost) function is the noise-contrastive estimation (NCE) loss, represented below in (1).
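The original post renders (1) as an image; based on the definitions given below and in Bachman et al. [1], the loss can be reconstructed roughly as:

```latex
% NCE loss between the global feature f_1(x^1) and a local feature f_7(x^2)_{ij}
\mathcal{L}_{\mathrm{NCE}} =
  -\,\mathbb{E}_{\left(f_1(x^1),\, f_7(x^2)_{ij}\right),\, N_7}
  \left[
    \log
    \frac{\exp\!\big(\Phi'\!\left(f_1(x^1),\, f_7(x^2)_{ij}\right)\big)}
         {\exp\!\big(\Phi'\!\left(f_1(x^1),\, f_7(x^2)_{ij}\right)\big)
          + \sum_{\tilde{f}_7 \in N_7} \exp\!\big(\Phi'\!\big(f_1(x^1),\, \tilde{f}_7\big)\big)}
  \right]
\tag{1}
```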

where (f₁(x¹), f₇(x²)ᵢⱼ) is called the positive sample pair, drawn from the joint probability distribution, and N₇ is the set of negative samples drawn from the marginal probability distribution of f₇(x²)ᵢⱼ. The goal is to maximize the similarity of (f₁(x¹), f₇(x²)ᵢⱼ), i.e., we want different views of the same image to have the most similar representations. This maximization can be implemented by minimizing the contrastive loss, i.e., the NCE loss in (1). Φ denotes the matching score of (f₁(x¹), f₇(x²)ᵢⱼ); for ease of understanding, Φ can be treated as a cosine-similarity-like function measuring the similarity of the two feature vectors. Φ’ is the revised matching score obtained by adding the regularization term λΦ², applying a soft-clipping operation to round off peak values, and applying a non-linear tanh transformation, with λ = 4e-2 and c = 20. To quantify the expectation in (1), Monte Carlo simulation can be used to obtain a deterministic equivalent of the NCE loss based on random samples of (f₁(x¹), f₇(x²)ᵢⱼ) and N₇. Since f₁(x¹) and f₇(x²)ᵢⱼ are representations of the inputs x¹ and x² and of the weight and bias parameters, formulation (1) eventually becomes a function of the inputs x¹, x² and the weight and bias parameters. To solve this optimization problem, a gradient descent-based method with backpropagation can be used to obtain the optimal weight and bias parameters of the encoder.
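For readers who prefer code, the following is a minimal PyTorch-style sketch of this loss under our own simplifying assumptions (dot-product matching scores, and our reading of where the soft clipping and the λΦ² penalty enter); it is illustrative rather than the paper’s implementation:

```python
import torch
import torch.nn.functional as F

def nce_loss(global_feat, local_feat_pos, local_feat_neg, c=20.0, lam=4e-2):
    """Contrastive NCE loss between one global feature and local features.

    global_feat:    (D,)    f1(x^1)
    local_feat_pos: (D,)    f7(x^2)_ij, the positive sample
    local_feat_neg: (N, D)  negative samples from other images/locations
    """
    # Raw matching scores Phi (here: simple dot products).
    pos_score = global_feat @ local_feat_pos              # scalar
    neg_scores = local_feat_neg @ global_feat             # (N,)
    scores = torch.cat([pos_score.view(1), neg_scores])   # positive is index 0

    # Phi': soft-clip the scores with tanh and penalize large magnitudes.
    clipped = c * torch.tanh(scores / c)
    penalty = lam * (scores ** 2).mean()

    # Softmax cross-entropy with the positive pair as the "correct class".
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(clipped.unsqueeze(0), target) + penalty

# Hypothetical usage with random 128-dimensional features.
g, pos, negs = torch.randn(128), torch.randn(128), torch.randn(64, 128)
print(nce_loss(g, pos, negs))
```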

Fig.4 Example of the Local Deep InfoMax (DIM) Model. The inputs are a pair of augmented cat images (x¹, x²) derived from the cat image x and fed into the encoder. The different layers of the encoder contain different levels of feature maps, where f₁ and f₇ denote the global feature and the local features, respectively. In this paradigm of contrastive learning, the objective of local DIM is to maximize the mutual information between f₁ and f₇, i.e., I(f₁(x¹), f₇(x²)). In other words, the encoder yields the set of weight and bias parameters that makes the pair (f₁(x¹), f₇(x²)) most similar.

AMDIM

As stated in the introduction, AMDIM extends local DIM by maximizing the mutual information (similarity) between multiple layers of high-level features simultaneously, using independently augmented copies of each image. The main structure of AMDIM is shown in Fig.5 below. The objective function of AMDIM is represented by the NCE loss function below in (5), which is similar to the loss function of local DIM with augmented image inputs in (1).
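As with (1), equation (5) appears as an image in the original post; a plausible reconstruction, summing the NCE loss of (1) over the three feature-scale pairings described below, is:

```latex
% AMDIM objective: sum the NCE loss over the three feature-scale pairings
\mathcal{L}_{\mathrm{AMDIM}} =
  \sum_{(n,m)} \mathcal{L}_{\mathrm{NCE}}\!\left(f_n(x^1)_{ij},\, f_m(x^2)_{kl}\right),
  \qquad (n,m) \in \{(1,5),\,(1,7),\,(5,5)\}
\tag{5}
```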

where the subscripts n and m denote the top-most nxn and mxm layers of the encoder f. Hence, (n, m) can be (1,5), (1,7), or (5,5), as shown in Fig.5. To convert (5) to a deterministic formulation, Monte Carlo simulation can again be employed to approximate the expectations in (5). Eventually, the NCE loss in (5) is written as a function of the inputs (x¹, x²) and the weight and bias parameters. A gradient descent-based method with backpropagation can be used to solve the optimization problem in (5), yielding the optimal weight and bias parameters when the similarity between f₁(x¹) and f₅(x²)ₖₗ, f₁(x¹) and f₇(x²)ₖₗ, and f₅(x¹)ᵢⱼ and f₅(x²)ₖₗ is maximized.

Fig.5 Example of Contrastive Learning for AMDIM. Multiscale means that contrastive pairs exist across multiple layers of the encoder. A pair of augmented cat images (x¹, x²) derived from the cat image x is fed into the encoder, which contains different layers of feature maps f₁, f₅, f₇. Note that f₁ denotes the global feature, while f₅ and f₇ are local features. In this paradigm of contrastive learning, the objective of AMDIM is to maximize the mutual information between the contrastive pairs f₁(x¹) and f₅(x²), f₁(x¹) and f₇(x²), and f₅(x¹) and f₅(x²). In other words, the encoder yields the set of weight and bias parameters that makes the contrastive pairs most similar.

Data Augmentation

The first of the four listed ways that AMDIM extends local DIM is by maximizing mutual information between features extracted from augmented views of the input, as opposed to the local DIM method of maximizing mutual information between features extracted from a single unaugmented view of the input. This forces the AMDIM model to evaluate an inputted image many times over, in a variety of distorted views, rather than only evaluating a single, undistorted image. Extending the evaluation process in this way gives the model more opportunities to recognize the key factors of an inputted image (e.g., a cat’s ears, eyes, nose, etc.). Intuitively, this requires more resources than the local DIM approach, but it also yields much greater accuracy on the end goal of image classification through maximizing mutual information.

When an image is inputted into the AMDIM model, the image (referred to as x) has a random horizontal flip applied to it. After horizontally flipping the image, randomized distortions and changes are made to it (a process referred to as stochastic data augmentation), and the newly augmented copy of x is appended to a collection of augmented views of x, denoted A(x). An example of image augmentation is shown in Fig.6 below, followed by a code sketch of the pipeline. The methods of data augmentation listed in this paper include:

  • Random resized crop (taking a small subsection of the image, such as the top left corner or small portion in the middle of the image)
  • Random jitter in color space (slightly altering the color values of each pixel in the augmented image)
  • Random conversion to grayscale (random chance that an augmented image’s pixels will have its color values reduced from RGB to grayscale)

It is from this newly generated collection, A(x), containing augmented views of the original image x, that two augmented views (denoted x¹ and x²) are randomly chosen and evaluated by the model for mutual information. This sampling process is repeated until the inputted image has been fully evaluated by the model, according to the specified hyperparameters.

Fig.6 Example of Image Augmentation.
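A minimal torchvision-style version of this pipeline might look as follows; the crop size, jitter strengths, and probabilities are illustrative assumptions rather than the paper’s exact settings:

```python
import torchvision.transforms as T

# Stochastic augmentation used to draw views from A(x); parameters are illustrative.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                     # random horizontal flip of x
    T.RandomResizedCrop(size=128, scale=(0.3, 1.0)),   # random resized crop
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1),            # random jitter in color space
    T.RandomGrayscale(p=0.25),                         # random conversion to grayscale
    T.ToTensor(),
])

# Two independently augmented views of the same image x form a positive pair.
# x1, x2 = augment(x), augment(x)
```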

Multiscale Mutual Information

When the local DIM model evaluates high-level features of an image (a cat’s eyes, ears, nose, etc.), the features are summarized in a global result vector outputted by the encoder; this global result vector can also be referred to as the output vector. In the intermediate layers of the encoder, there are numerous local prediction vectors that contain prediction values for high-level features of the image currently being evaluated. One method of training the DIM model is to maximize the mutual information between the global feature vector, f₁(x), outputted by the encoder f, and a local feature vector, f₇(x)ᵢⱼ, produced by an intermediate layer in f. The subscript 7 represents the layer’s spatial dimension (in this case, 7x7), and the subscripts ij can be thought of as coordinates that specify the local feature vector’s position in the intermediate layer plane. Maximizing the mutual information between the outputted global feature vector and the intermediate local feature vectors trains the model to tune its inner parameters so as to increase the similarity between the high-level features predicted by the intermediate layer vectors and the desired high-level features represented by the global output vector.

In the local DIM model, a single view of the image is evaluated, while in the AMDIM model shown in Fig.5 above, two augmented views of the image are evaluated for mutual information. That is to say: in the AMDIM model, two samples, x¹ and x², are evaluated so that shared features can be recognized by the model (e.g., if x¹ and x² both contain pieces of a cat’s eye, it is desired that the mutual information estimate between the two samples is maximized, indicating that the model successfully recognized this shared feature even though the two samples are augmented differently). Not only does AMDIM extend local DIM by maximizing mutual information between two augmented views of an image, as opposed to one single unaugmented view, but it goes a step further and makes the global-local vector comparisons described above across multiple feature scales within the network.

The global-local vector comparison described for local DIM can be summarized as maximizing mutual information between a fixed global vector in the 1x1 output layer and a local vector in a 7x7 intermediate layer, variably located at position (i,j) within that layer. In AMDIM, rather than only maximizing mutual information at the 1-to-7 feature scale, mutual information is maximized across multiple feature scales: 1-to-7, 1-to-5, and 5-to-5. Since AMDIM works to maximize the mutual information between two samples, x¹ and x², the model is effectively estimating how similar the two samples are, if they are similar at all. To do this, the encoder evaluates the high-level features contained in x¹ and compares them to the feature predictions produced while x² passes through the encoder. Concretely, after x¹ goes through the encoder, its output feature vector (the 1x1-layer vector) is captured, and mutual information is maximized between this vector and vectors in the intermediate 7x7 and 5x5 layers produced while x² is being evaluated.

Additionally, mutual information is maximized between prediction vectors in the intermediate 5x5 layer from the evaluation of x¹, variably located at position (i,j) within that layer plane, and prediction vectors from the evaluation of x², variably located at position (k,l) within its own layer plane. This maximizes the mutual information not only between the output of x¹ and the various intermediate values of x², but also between the intermediate values of x¹ and those of x². This extension of local DIM greatly improves on the generalization problem that models face between training data and testing data; for example, if both x¹ and x² contain a cat’s eye, but x¹ is grayscale, x² is resized, and the two samples show the cat’s eye in different locations within the image plane, the model will learn to recognize the pattern of the cat’s eye regardless of augmentation (position of the eye, color scale, image size, etc.), further reinforcing its ability to recognize a cat’s eye in other images of cats.
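To make the pairings across feature scales concrete, here is a rough sketch (our own, not the paper’s code) of how the global 1x1 feature of one view can be scored against every spatial position of the other view’s 7x7 and 5x5 feature maps, and how the 5x5 maps of the two views can be scored against each other:

```python
import torch

# Hypothetical encoder outputs for the two augmented views (batch dimension omitted).
d = 128
f1_x1 = torch.randn(d)          # global feature of view x^1 (1x1 scale)
f5_x1 = torch.randn(d, 5, 5)    # 5x5 local features of view x^1
f5_x2 = torch.randn(d, 5, 5)    # 5x5 local features of view x^2
f7_x2 = torch.randn(d, 7, 7)    # 7x7 local features of view x^2

# 1-to-7 and 1-to-5: score the global vector against every local position (i, j).
scores_1to7 = torch.einsum('d,dij->ij', f1_x1, f7_x2)      # (7, 7) matching scores
scores_1to5 = torch.einsum('d,dij->ij', f1_x1, f5_x2)      # (5, 5) matching scores

# 5-to-5: score every position (i, j) of view x^1 against every (k, l) of view x^2.
scores_5to5 = torch.einsum('dij,dkl->ijkl', f5_x1, f5_x2)  # (5, 5, 5, 5)

# Each score would feed an NCE loss like the one above, with features drawn from
# other images or locations serving as negatives.
print(scores_1to7.shape, scores_1to5.shape, scores_5to5.shape)
```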

Encoder

The first type of encoder adopted in this paper is revised from the standard ResNet [He et al., 2016a,b], shown in Fig.7, with some changes for DIM. The first change is to use mean pooling (shown in Fig.8) in the first layer of each residual block. Another change is to use a 1x1 convolution layer (shown in Fig.9) before the residual layers.

Fig.7 Example of a Residual Layer. The input x can be the image itself or the output of the previous layer. F(x) is the residual mapping being learned. This structure can be implemented by a feedforward neural network with a connection that skips one or more layers; the connection in Fig.7 is called an identity shortcut connection. A ResNet is built by stacking many residual layers and can include convolution layers. The entire network can be trained by a gradient descent-based method with backpropagation.
Fig.8 Example of Mean Pooling. Mean pooling is a type of down-sampling method for feature maps, widely used after the convolution layer.
Fig.9 Visualization of a 1x1 convolution layer. 1x1 layers can be used to reduce or increase the dimension of the next layer’s output. Overall, in this paper 1x1 layers maintain the shape of the receptive field, i.e., the local region of the feature maps.

Regarding this encoder, what does “1x1 layers” mean, and how does it address the concern of controlling receptive field growth?

In fact, 1x1 layers refer to 1x1 convolutions, i.e., convolutions with 1x1 filters. For example, suppose the input to the convolution has shape 64x64x192 (height = 64, width = 64, channels = 192); then the shape of the receptive field is 64x64. With a single 1x1 filter, the output of the convolution is 64x64x1. The 1x1 convolution therefore maintains the 64x64 height and width of the receptive field. This example is visualized in Fig.9 above.
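This shape bookkeeping can be verified with a couple of lines of PyTorch, using a single 1x1 filter as in the example above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 192, 64, 64)        # input: 192 channels, 64x64 spatial size
conv1x1 = nn.Conv2d(in_channels=192, out_channels=1, kernel_size=1)
y = conv1x1(x)
print(y.shape)                         # torch.Size([1, 1, 64, 64]): 64x64 is preserved
```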

Regarding this encoder, what does “keeping feature distributions stationary” mean?

The paper mentions that the previous encoder is “keeping the feature distributions stationary by avoiding padding”. For ease of explanation, it is necessary to explain the effect of padding. In general, padding adds zeros symmetrically to the border of the input image (matrix). After padding, there is more room for the filter to cover the image, and the output dimension can be kept the same as the input dimension, as shown in Fig.10 below. However, padding significantly changes the spatial distribution of the input image at its border, as shown in Fig.11 (a)-(d) and explained by Nguyen et al. [2019]. Conversely, avoiding padding does not change the feature distributions of the input image, i.e., avoiding padding keeps the feature distributions stationary.

Fig.10 Example of Padding with stride = 1 and a 3x3 filter. The left figure (a) is the padded feature map. In the right figure (b), the feature map (pink matrix) produced by the convolution has the same dimension as the unpadded input feature map (blue matrix).
Fig.11 Feature Distribution Comparison before and after padding. Figure (a) is the original image, and Figure (b) is the image with the padded frame. Figure (c) and (d) are the corresponding distributions (histograms) of the top region in the original and padded images, respectively.
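A quick numpy experiment (our own illustration, in the spirit of Nguyen et al. [2019] but not their method) shows the effect: zero-padding pulls the statistics of the border region toward zero while leaving the interior untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.uniform(0.2, 0.8, size=(8, 8))   # a small "image" with values well above 0

padded = np.pad(img, pad_width=1)          # zero-pad a 1-pixel border (constant zeros)

# Compare the mean of the top rows before and after padding.
print(img[:2].mean())      # around 0.5: the original border statistics
print(padded[:2].mean())   # noticeably lower: zeros shift the border distribution
```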

The second type of encoder, used for the ImageNet and Places205 datasets, is shown in Fig.12 below.

Fig.12 Structure of ImageNet Encoder. This encoder has 2 convolution layers with ReLU activations and 6 Residual blocks.

Mixture-based Representation

Assume the original image x has a random pair of augmented images (x¹, x²). After feeding (x¹, x²) into the encoder, the top-layer features f₁ and f₇ are produced. Here, “mixture” refers to generating multiple mixture features from the top-level features of the augmented images. For example, to generate k mixture features for each feature f₁, such as {f₁¹(x¹), f₁²(x¹), …, f₁^k(x¹)}, this paper uses a simple network comprising a fully-connected layer with a single ReLU layer and a residual connection between f₁ and the mixture features. The structure is shown in Fig.13 (a) below. The objective function of the mixture-based encoder is also an NCE loss, as in local DIM and AMDIM, represented in (6) below.

where s_nce is the NCE score of (f₁^j(x¹), f₇^i(x²)) based on the formulation in (2), and q is the posterior of f₁^j(x¹) conditioned on f₇^i(x²). αH(q) is an entropy-maximization term that acts like a regularizer, and τ and α are tunable hyperparameters. Similarly, formulation (6) is eventually transformed into a function of the inputs (x¹, x²) and the weight and bias parameters. A gradient descent-based method with backpropagation can be used to solve the optimization problem in (6) and yield the optimal weight and bias parameters of this mixture-based encoder.

Fig.13 Structures of the network for Mixture Features and the Mixture-based Encoder. The left figure (a) shows the process of generating the mixture features. The right figure (b) illustrates the structure of the mixture-based encoder under the contrastive learning paradigm.
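A rough PyTorch sketch of the mixture-feature generator in Fig.13 (a), assuming a single fully-connected layer with ReLU and a residual connection back to f₁ (dimensions and names are our own):

```python
import torch
import torch.nn as nn

class MixtureFeatures(nn.Module):
    """Produce k mixture features {f1^1, ..., f1^k} from a single global feature f1."""
    def __init__(self, dim, k):
        super().__init__()
        self.k = k
        self.fc = nn.Linear(dim, k * dim)   # one fully-connected layer
        self.relu = nn.ReLU()

    def forward(self, f1):                  # f1: (batch, dim)
        h = self.relu(self.fc(f1))          # (batch, k * dim)
        h = h.view(f1.size(0), self.k, -1)  # (batch, k, dim)
        return h + f1.unsqueeze(1)          # residual connection between f1 and mixtures

# Hypothetical usage: k = 4 mixture features from a 128-dimensional global feature.
mix = MixtureFeatures(dim=128, k=4)
f1 = torch.randn(8, 128)
print(mix(f1).shape)                        # torch.Size([8, 4, 128])
```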

Experiments and Analysis

Before discussing the experimental results, it is necessary to clarify how self-supervised learning works in this paper and how its performance is evaluated, as shown in Fig.14 below.

Fig.14 Example of Self-supervised Learning and its Supervised Evaluation in this paper. Self-supervised learning yields a high-level feature vector as the representation of the image. To evaluate the quality of this representation, the common method is to feed the high-level feature vector as input to supervised learning models, e.g., a linear logistic regression and a multi-layer perceptron (MLP) network, i.e., linear and MLP evaluation.
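The linear-evaluation protocol of Fig.14 can be sketched as follows: features from the frozen self-supervised encoder are fed to a logistic regression classifier, and test accuracy is reported (the features here are random placeholders standing in for real encoder outputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder features: in practice these would be f1(x) for each image,
# computed by the frozen self-supervised encoder on the train / test splits.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 128)), rng.integers(0, 10, 1000)
test_feats, test_labels = rng.normal(size=(200, 128)), rng.integers(0, 10, 200)

# Linear evaluation: fit a logistic regression on frozen features, report accuracy.
clf = LogisticRegression(max_iter=1000)
clf.fit(train_feats, train_labels)
print(accuracy_score(test_labels, clf.predict(test_feats)))
```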

Classification accuracy is used to measure the performance of AMDIM and the competing self-supervised learning methods in this paper. The performance comparison between AMDIM and the other methods is given in the tables of Fig.15 (a) and (b) below. The effect of different data augmentation methods, the regularization term in the NCE loss function, and the multiscale structure on the performance of AMDIM is shown in the table of Fig.15 (c) below.

Fig. 15 Experiment Results of AMDIM and other Competing Methods.

In the Fig.15(a) table, ‘sup’ means the method is trained in a fully-supervised way, without the self-supervised NCE cost (loss); hence, labels are used to train the encoder’s weight and bias parameters. Except for the supervised results in the first two rows, all other results are based on self-supervised training, with the resulting high-level feature vectors fed into linear logistic classifiers (i.e., linear evaluation). Note that the difference between ‘small’ and ‘large’ for AMDIM refers to the size of the model (encoder). ‘Places205’ is used to test the accuracy of transferring models trained on ImageNet. This table indicates that AMDIM outperforms the seven competing self-supervised learning methods.

In the Fig.15(b) table, all results are based on self-supervised learning methods. After training, each method yields a high-level feature vector that is fed into a linear logistic classifier and a multi-layer perceptron (MLP) classifier, corresponding to linear and MLP evaluation. Based on these classifiers, the performance (accuracy) of the self-supervised learning methods is evaluated in a supervised way. This table shows that AMDIM has performance comparable to the three self-supervised learning baselines on the CIFAR10 and CIFAR100 datasets.

In the Fig.15(c) table, all results are for AMDIM with data augmentations, the regularization term in the NCE cost function, and the multiscale feature structure. The first AMDIM row means the data augmentation techniques (color jitter, random gray, and random crop), the multiscale features, and the NCE cost regularization are all included. ‘+strong aug’ means the Fast AutoAugment policy proposed by Lim et al. [2019] is also included. ‘-color jitter’ means this augmentation is not included; similarly, ‘-random gray’, ‘-random crop’, ‘-multiscale’, and ‘-stabilize’ each denote that one of these techniques is removed. Note that ‘stabilize’ corresponds to the stability regularization, i.e., the NCE cost regularization in formulation (4). Among these results, AMDIM with ‘strong aug’ has the best performance under both the linear logistic and MLP classifiers. Among the individual data augmentations, removing ‘random crop’ hurts AMDIM the most. Overall, data augmentation has the largest effect on the performance of AMDIM, followed by the NCE cost regularization and the multiscale feature structure.

Group Member Contributions

Reference

[1] Bachman, Philip, R. Devon Hjelm, and William Buchwalter. “Learning representations by maximizing mutual information across views.” Advances in neural information processing systems 32, 2019.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.

[3] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast autoaugment. arXiv:1905.00397 [cs.LG], 2019.

[4] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised learning. arXiv:1901.09005, 2019.

[5] Nguyen, Anh-Duc, Seonghwa Choi, Woojae Kim, Sewoong Ahn, Jinwoo Kim, and Sanghoon Lee. “Distribution padding in convolutional neural networks.” In 2019 IEEE International Conference on Image Processing (ICIP), pp. 4275–4279. IEEE, 2019.

[6] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.
