Learning Representations by Maximizing Mutual Information Across Views — A Summary by Kyle Dennis and Ren Hu
Presentation
A presentation regarding this paper can be seen at the following link: https://www.youtube.com/watch?v=XmiNGRtejMM
Introduction
In this paper, the authors propose a model that serves as an “approach to self-supervised representation learning based on maximizing mutual information between features extracted from multiple views of a shared context.” Before continuing, it is worth breaking that sentence down piece by piece.
The main example used to describe the model’s functionality throughout the paper is image classification. A representation captures the meaning that can be inferred from some input (e.g., an input image of a cat would represent a cat, as shown in Fig.1). In the case of a self-supervised model such as AMDIM (the model proposed in the paper), the representation generated by evaluating an image of a cat would be an output that corresponds to images containing cats. The model learns representations in a self-supervised way: it creates and uses its own labels rather than being provided labels defined by a human ahead of time, as in the supervised learning shown in Fig.2. This is similar to the unsupervised learning shown in Fig.3, but the difference between the two learning styles is highlighted in the next part of the sentence: “maximizing mutual information between features”. Mutual information can be thought of as a quantity that expresses how related two variables, or two sets of variables, are. Maximizing mutual information between features can be described as training the model to become as good as possible at determining whether a pair of things (variables, sets of variables, etc.) belongs together or not. The features are the pair of things (variables or sets of variables) between which the model maximizes mutual information.
The shared context consists of a single image. The multiple views are various versions of that shared context (e.g., augmented versions of the original image, portions of the image, etc.). “Features extracted from multiple views of a shared context” therefore translates to variables and sets of variables taken from different versions of a single input image. Maximizing the mutual information between these pieces of different augmented versions of a single image forces the model to highlight the high-level factors (patterns, objects, etc.) of the input image that will allow it to identify images of the same representation more efficiently and successfully in the future (e.g., successfully highlighting the shape of a cat’s ears, the color of a cat’s fur, and the shape and color of a cat’s eyes from an input image of a cat will allow the model to better identify other images of cats in the future). This helps address the generalization problem that many models experience when moving from training data to testing data.
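To make the notion of mutual information concrete, the following minimal sketch computes it for two small discrete joint distributions. The function name `mutual_information` and the toy probability tables are illustrative assumptions; AMDIM itself works with a learned lower bound on mutual information between continuous features, not with exact tables like these.

```python
import math

def mutual_information(joint):
    """Mutual information I(X; Y) in bits for a joint probability table.

    joint[i][j] is P(X = i, Y = j); all entries must sum to 1.
    """
    px = [sum(row) for row in joint]            # marginal P(X)
    py = [sum(col) for col in zip(*joint)]      # marginal P(Y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                           # 0 * log(0) is treated as 0
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi

# Two views that always agree share one full bit of information.
dependent = [[0.5, 0.0],
             [0.0, 0.5]]
# Two views that are unrelated share none.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
print(mi_dependent, mi_independent)
```

The first table plays the role of two views that "belong together", the second of two views that do not; training pushes the features of matching views toward the first situation.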
Local DIM vs AMDIM
This paper proposes a model called Augmented Multiscale Deep InfoMax (AMDIM), an improved extension of the local Deep InfoMax (DIM) model. AMDIM has the same purpose as local DIM (a model that learns representations from unlabeled data by maximizing mutual information between the input data and the output of a deep neural network encoder), but extends the model in four ways:
- Rather than maximizing mutual information between features extracted from a single, unaugmented copy of each image, AMDIM maximizes mutual information between features extracted from independently augmented copies of each image. That is, rather than analyzing the features from a single image of a cat, AMDIM analyzes features from different copies of that image, each modified in its own way.
- Rather than maximizing mutual information between a single global scale and a single local scale, AMDIM maximizes mutual information between multiple feature scales simultaneously. Essentially, features are compared across a wider variety of layer resolutions than in local DIM, and the resulting losses are all backpropagated through the network.
- The encoder utilized in AMDIM consists of a more powerful architecture than that of local DIM.
- Lastly, AMDIM introduces mixture-based representations.
While DIM was introduced in 2019 as a method that “outperforms a number of popular unsupervised learning methods and compares favorably with fully-supervised learning on several classification tasks in with some standard architectures” [Hjelm 2019], AMDIM is shown to extend and improve the initial DIM implementation and introduce a new “self-supervised” form of learning. It should be noted that the main difference between the two models is that while DIM aims to learn representations by maximizing mutual information between a single inputted image and its output from the deep neural network encoder, AMDIM aims to maximize mutual information over many different augmented and scaled views of a single image (hence the Augmented and Multiscale (AM) addition to the name of the model).
Local DIM
The local DIM model was proposed by Hjelm et al. [2019]. In this paper, the authors adopt this model but use a pair of augmented images as the input instead of a single, unaugmented copy of each image. The main architecture is simplified in Fig.4. The objective (cost) function is the noise-contrastive estimation (NCE) loss, represented below in (1).
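Written out in LaTeX under the definitions given in the next paragraph, the loss in (1) has roughly the following form. This is a sketch: the symbol \mathcal{L}_{\Phi} for the per-pair loss and the exact typesetting are assumptions made here, and the per-pair loss is left unnumbered.

```latex
\mathcal{L}_{\mathrm{NCE}}
  = \mathbb{E}_{(f_1(x^1),\, f_7(x^2)_{ij})}
    \Big[ \mathbb{E}_{N_7}
      \big[ \mathcal{L}_{\Phi}\big(f_1(x^1),\, f_7(x^2)_{ij},\, N_7\big) \big]
    \Big] \tag{1}

\mathcal{L}_{\Phi}(f_1, f_7, N_7)
  = -\log \frac{\exp\big(\Phi(f_1, f_7)\big)}
               {\sum_{\tilde{f}_7 \in N_7 \cup \{f_7\}} \exp\big(\Phi(f_1, \tilde{f}_7)\big)}
```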
where (f₁(x¹), f₇(x²)ᵢⱼ) is called the positive sample pair, drawn from the joint probability distribution, and N₇ is the set of negative samples drawn from the marginal probability distribution of f₇(x²)ᵢⱼ. The goal is to maximize the similarity of (f₁(x¹), f₇(x²)ᵢⱼ), i.e., we want different views of the same image to have the most similar representations. This maximization can be implemented by minimizing the contrastive loss, i.e., the NCE loss in (1). Φ denotes the matching score of (f₁(x¹), f₇(x²)ᵢⱼ). For ease of understanding, Φ can be treated as a cosine-similarity-like function that measures the similarity of the two feature vectors. Φ’ is the revised matching score, obtained by soft-clipping Φ with the non-linear tanh function, Φ’ = c·tanh(Φ/c), which rounds off peak values; in addition, a regularization term λΦ² is added to the loss. Here λ = 4e-2 (i.e., 0.04) and c = 20. The expectations in (1) cannot be computed exactly, so they are approximated by Monte Carlo sampling: the loss is averaged over randomly sampled positive pairs (f₁(x¹), f₇(x²)ᵢⱼ) and negative sets N₇. Since f₁(x¹) and f₇(x²)ᵢⱼ are functions of the inputs x¹, x² and of the encoder’s weight and bias parameters, formulation (1) ultimately becomes a function of the inputs and of those parameters. This optimization problem can be solved with a gradient-descent-based method and backpropagation, which yields trained values for the weight and bias parameters of the encoder.
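The pieces described above (matching score, soft clipping, squared-score penalty, contrast against negatives) can be sketched for a single positive pair as follows. This is an illustration under stated assumptions, not the authors' implementation: Φ is taken to be a plain dot product, the penalty is averaged over all raw scores, and the names `nce_loss` and `soft_clip` are hypothetical.

```python
import math

LAMBDA = 4e-2   # weight of the squared-score penalty (value quoted in the text)
CLIP = 20.0     # soft-clipping constant c (value quoted in the text)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_clip(score, c=CLIP):
    """Phi' = c * tanh(Phi / c): squashes scores into the range [-c, c]."""
    return c * math.tanh(score / c)

def nce_loss(global_feat, positive, negatives):
    """Negative log-softmax of the positive score among positive + negatives,
    plus a LAMBDA * Phi^2 penalty on the raw scores (averaging is an assumption)."""
    raw = [dot(global_feat, positive)] + [dot(global_feat, n) for n in negatives]
    clipped = [soft_clip(s) for s in raw]
    m = max(clipped)
    log_sum = m + math.log(sum(math.exp(s - m) for s in clipped))  # stable log-sum-exp
    contrastive = log_sum - clipped[0]
    penalty = LAMBDA * sum(s * s for s in raw) / len(raw)
    return contrastive + penalty

f1 = [1.0, 0.0]                       # stand-in for f1(x1)
pos = [1.0, 0.0]                      # stand-in for f7(x2)_ij from the same image
negs = [[0.0, 1.0], [-1.0, 0.0]]      # stand-ins for features of other images

aligned = nce_loss(f1, pos, negs)                  # positive really matches f1
mismatched = nce_loss(f1, negs[1], [pos, negs[0]]) # "positive" points the wrong way
print(aligned < mismatched)
```

The loss is lower when the designated positive is the feature most similar to the global vector, which is exactly the behavior that minimizing (1) rewards.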
AMDIM
As stated in the introduction, AMDIM extends local DIM by maximizing the mutual information (similarity) between multiple layers of high-level features simultaneously, using augmented copies of each image. The main structure of AMDIM is shown in Fig. 5 below. The objective function of AMDIM is the NCE loss function in (5) below, which is analogous to the loss function of local DIM with augmented image inputs in (1).
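Under the notation explained next, (5) can be written roughly as follows. This is a sketch: \mathcal{L}_{\Phi} is assumed here to denote the per-pair NCE loss (the negative log-softmax of the positive pair's matching score against the negatives in N_m), and the typesetting is not taken from the paper.

```latex
\mathbb{E}_{(f_n(x^1)_{ij},\, f_m(x^2)_{kl})}
  \Big[ \mathbb{E}_{N_m}
    \big[ \mathcal{L}_{\Phi}\big(f_n(x^1)_{ij},\, f_m(x^2)_{kl},\, N_m\big) \big]
  \Big] \tag{5}
```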
where the subscripts n and m denote the top-most layers of the encoder f with spatial dimensions n×n and m×m. Hence, (n, m) can be (1,5), (1,7), or (5,5), as shown in Fig.5. As before, the expectations in (5) are approximated by Monte Carlo sampling, so the NCE loss in (5) ultimately becomes a function of the inputs (x¹, x²) and of the weight and bias parameters. A gradient-descent-based method with backpropagation can be used to solve the optimization problem in (5) and obtain weight and bias parameters that maximize the similarity between f₁(x¹) and f₅(x²)ₖₗ, between f₁(x¹) and f₇(x²)ₖₗ, and between f₅(x¹)ᵢⱼ and f₅(x²)ₖₗ.
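The way the three scale pairs combine into one training signal can be sketched as below. Everything here is an illustrative assumption: random vectors stand in for encoder outputs, the matching score is a plain dot product without clipping or regularization, negatives come from one other image, and `fake_features` and `multiscale_loss` are hypothetical names.

```python
import math
import random

SCALE_PAIRS = [(1, 5), (1, 7), (5, 5)]   # (n, m) pairs named in the text

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nce(anchor, positive, negatives):
    """Negative log-softmax of the positive score among positive + negatives."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[0]

def fake_features(rng, dim=4):
    """Stand-in for an encoder: for each scale n, a list of n*n feature vectors."""
    return {n: [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n * n)]
            for n in (1, 5, 7)}

def multiscale_loss(feats_x1, feats_x2, feats_other):
    """Average the NCE loss over all positions for each (n, m) pair, then sum the pairs."""
    total = 0.0
    for n, m in SCALE_PAIRS:
        losses = [nce(a, p, feats_other[m])
                  for a in feats_x1[n]        # f_n(x1)_ij
                  for p in feats_x2[m]]       # f_m(x2)_kl
        total += sum(losses) / len(losses)
    return total

rng = random.Random(0)
x1, x2, other = fake_features(rng), fake_features(rng), fake_features(rng)
loss = multiscale_loss(x1, x2, other)
print(loss)
```

In a real training loop this scalar would be differentiated with respect to the encoder parameters; the sketch only shows how the 1-to-5, 1-to-7, and 5-to-5 comparisons are enumerated and accumulated.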
Data Augmentation
The first of the four ways in which AMDIM extends local DIM is by maximizing mutual information between features extracted from augmented views of the input, as opposed to local DIM’s approach of maximizing mutual information between features extracted from a single unaugmented view. This forces the AMDIM model to evaluate an input image many times over, in a variety of distorted views, rather than only evaluating a single, non-distorted image. Extending the evaluation process in this way gives the model more opportunities to recognize the key factors of an input image (e.g., a cat’s ears, eyes, nose, etc.). This requires more computation than local DIM, but the ablation results reported later (Fig.15 (c)) indicate that it also yields noticeably higher accuracy on the end goal of image classification.
When an image (referred to as x) is input into the AMDIM model, a random horizontal flip is applied to it first. After the random flip, randomized distortions and changes are made to the image (a process referred to as stochastic data augmentation), and the newly augmented copy of x is added to a collection of augmented views of x, denoted A(x). An example of image augmentation is shown in Fig.6 below. The methods of data augmentation listed in this paper include:
- Random resized crop (taking a small subsection of the image, such as the top left corner or a small portion in the middle of the image, and resizing it to the input size)
- Random jitter in color space (slightly altering the color values of each pixel in the augmented image)
- Random conversion to grayscale (a random chance that an augmented image’s pixels will have their color values reduced from RGB to grayscale)
It is from this newly generated collection, A(x), containing augmented views of the original image x, that two images (denoted x¹ and x²) are randomly chosen and evaluated by the model for mutual information. This sampling process is repeated throughout training, for as long as the training hyper-parameters specify.
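The augmentation pipeline described above can be sketched without any image library by treating an image as a grid of (r, g, b) tuples with values in [0, 1]. All probabilities, strengths, the minimum crop fraction, and the function names are assumptions chosen for illustration; the paper's actual settings are not reproduced here.

```python
import random

def random_flip(img, rng):
    """Mirror the image left-to-right with probability 0.5."""
    return [row[::-1] for row in img] if rng.random() < 0.5 else img

def random_resized_crop(img, rng, min_frac=0.3):
    """Cut a random sub-window and stretch it back to the original size
    using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    ch = rng.randint(max(1, int(h * min_frac)), h)
    cw = rng.randint(max(1, int(w * min_frac)), w)
    top, left = rng.randint(0, h - ch), rng.randint(0, w - cw)
    return [[img[top + (y * ch) // h][left + (x * cw) // w] for x in range(w)]
            for y in range(h)]

def color_jitter(img, rng, strength=0.1):
    """Shift each colour channel by a small random offset, clamped to [0, 1]."""
    shift = [rng.uniform(-strength, strength) for _ in range(3)]
    return [[tuple(min(1.0, max(0.0, c + s)) for c, s in zip(px, shift))
             for px in row] for row in img]

def random_grayscale(img, rng, p=0.25):
    """With probability p, replace RGB with its luminance in all three channels."""
    if rng.random() >= p:
        return img
    def gray(px):
        g = 0.299 * px[0] + 0.587 * px[1] + 0.114 * px[2]
        return (g, g, g)
    return [[gray(px) for px in row] for row in img]

def augment(img, rng):
    """One stochastic augmentation: flip first, then the other distortions."""
    out = random_flip(img, rng)
    out = random_resized_crop(out, rng)
    out = color_jitter(out, rng)
    return random_grayscale(out, rng)

rng = random.Random(0)
image = [[(x / 7, y / 7, 0.5) for x in range(8)] for y in range(8)]  # toy 8x8 image
x1, x2 = augment(image, rng), augment(image, rng)   # two independent views of x
print(len(x1), len(x1[0]), len(x2), len(x2[0]))
```

Calling `augment` twice on the same image corresponds to drawing the pair (x¹, x²) that the model then compares.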
Multiscale Mutual Information
When the local DIM model evaluates high-level features of an image (cat’s eyes, ears, nose, etc.), the features are outlined in a global result vector that is outputted by the encoder. This global result vector can also be referred to as the output vector. In the intermediate layers of the encoder, there are numerous local prediction vectors that contain prediction values of high-level features of the image currently being evaluated. One method of training the DIM model is by maximizing the mutual information between the global feature vector, f₁(x), outputted by the encoder f, and the local feature vector, f₇(x)ᵢⱼ, produced by an intermediate layer in f. The subscript 7 represents the layer’s spatial dimension (in this case, the layer has spatial dimension 7x7), and the subscripts ij can be thought of as coordinates that specify the local feature vector’s position in the intermediate layer plane. Maximizing the mutual information between the outputted global feature vector and the intermediate local feature vectors trains the model to tune its inner parameters accordingly with the goal of increasing the similarity between the high-level features predicted by the intermediate layer vectors and the desired high-level features that are represented by the global output vector.
In the local DIM model, a single view of the image is evaluated, while in the AMDIM model shown in Fig.5 above, two augmented views of the image are evaluated for mutual information. That is to say: in the AMDIM model, two samples, x¹ and x², are evaluated so that the model recognizes any features they share (e.g., if x¹ and x² both contain pieces of a cat’s eye, the mutual information estimate between the two samples should be maximized, indicating that the model successfully recognized this shared feature even though the samples are augmented differently). AMDIM not only extends local DIM by maximizing mutual information between two augmented views of an image rather than within a single unaugmented view, but also goes a step further by making the global-local vector comparisons described above across multiple feature scales within the network.
The global-local vector comparisons explained for local DIM can be summarized as maximizing mutual information between a single global vector in the 1x1 output layer and a local vector in a 7x7 intermediate layer, located at a varying position (i,j) within the layer. In AMDIM, rather than only maximizing mutual information from feature scale 1-to-7 (as described above), mutual information is maximized across multiple feature scales, including 1-to-7, 1-to-5, and 5-to-5. Because AMDIM works to maximize the mutual information between two samples, x¹ and x², this can be restated as: the AMDIM model estimates how similar the two samples are, if they are similar at all. To do this, the encoder evaluates the high-level features contained in x¹ and compares them with the feature predictions for sample x² as it passes through the encoder. Concretely, after x¹ goes through the encoder, its output feature vector (the 1x1 layer vector) is captured, and mutual information is maximized between this vector and the vectors within the intermediate layers (7x7 and 5x5) of the encoder while sample x² is being evaluated. Additionally, mutual information is maximized between the prediction vectors of the intermediate 5x5 layer for sample x¹, located at a varying position (i,j) within the layer plane, and the prediction vectors of the 5x5 layer for sample x², located at a varying position (k,l) within its own layer plane. This maximizes the mutual information not only between the output for sample x¹ and the various intermediate values for sample x², but also between the various intermediate values for sample x¹ and those for sample x².
This extension to the local DIM model substantially eases the generalization problem that models face when moving from training data to testing data (e.g., if both samples x¹ and x² contain a cat’s eye, but x¹ is grayscale, x² is resized, and the eye appears at different locations within the image plane in x¹ and x², the model will learn to recognize the pattern of the cat’s eye regardless of augmentation (position of the eye, color scale, image size, etc.), further reinforcing the model’s capability to recognize a cat’s eye in other images of cats).
Encoder
The first type of encoder adopted in this paper is revised from the standard ResNet [He et al., 2016a,b] shown in Fig.7, with some changes for DIM. The first change is the use of mean pooling (shown in Fig.8) in the first layer of each residual block. Another change is the use of a 1x1 convolution layer (shown in Fig.9) before the residual layers.
Regarding this encoder, what does “1x1 layers” mean, and how do they address the concern of controlling receptive field growth?
1x1 layers refer to 1x1 convolutions, i.e., convolutions whose filters have spatial size 1x1. For example, suppose the input to the convolution has shape 64x64x192 (height = 64, width = 64, channels = 192). A single 1x1 filter then produces an output of shape 64x64x1, so the height and width are preserved. More importantly, each output value depends only on the input values at the same spatial position, so a 1x1 convolution mixes information across channels without enlarging the receptive field, whereas a larger filter such as 3x3 widens the receptive field of every subsequent layer. This is how 1x1 layers help control receptive field growth. The example is visualized in Fig.9 below.
Regarding this encoder, what does “keeping feature distributions stationary” mean?
The paper mentions that the previous encoder is ‘keeping the feature distributions stationary by avoiding padding’. To explain this, it is necessary to first explain the effect of padding. In general, padding adds zeros symmetrically around the border of the input image (matrix). After padding, the filter has more room to cover the image, and the output can keep the same dimensions as the input, as shown in Fig.10 below. However, padding noticeably changes the distribution of values near the border of the input image, as shown in Fig.11 (a)-(d) and explained by Nguyen et al. [2019]. Conversely, avoiding padding leaves the feature distributions of the input image unchanged, i.e., it keeps the feature distributions stationary across spatial positions.
The second type of encoder, used for working with the ImageNet and Places205 datasets, is shown in Fig.12 below.
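A small dependency-free sketch can make both properties of a 1x1 convolution visible: the spatial size is preserved, and each output value depends only on the input pixel at the same position. The function name `conv1x1`, the 4x4x3 toy input (standing in for 64x64x192), and the filter weights are illustrative assumptions.

```python
def conv1x1(feature_map, filters):
    """Apply 1x1 filters: every output pixel is a weighted sum over the
    channels of the input pixel at the same (row, col) position.

    feature_map: H x W x C_in nested lists
    filters:     C_out lists of C_in weights
    """
    return [[[sum(w * c for w, c in zip(f, pixel)) for f in filters]
             for pixel in row]
            for row in feature_map]

H, W = 4, 4                               # small stand-in for 64 x 64
fmap = [[[float(r), float(c), 1.0] for c in range(W)] for r in range(H)]
one_filter = [[0.5, 0.25, 1.0]]           # a single 1x1 filter -> one output channel
out = conv1x1(fmap, one_filter)

# Change one input pixel and see which output positions are affected.
fmap2 = [[list(px) for px in row] for row in fmap]
fmap2[2][1] = [9.0, 9.0, 9.0]
out2 = conv1x1(fmap2, one_filter)
changed = [(r, c) for r in range(H) for c in range(W) if out[r][c] != out2[r][c]]

print(len(out), len(out[0]), len(out[0][0]), changed)   # 4 4 1 [(2, 1)]
```

Only position (2, 1) changes, which is what "no receptive field growth" means in practice; with a 3x3 filter the same edit would affect a 3x3 neighbourhood of outputs.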
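The effect of zero padding on border statistics can be illustrated with a 3x3 averaging filter applied to an image of all ones. This toy filter and the helper name `conv2d_mean3` are assumptions for illustration only; the encoder's real filters are learned.

```python
def conv2d_mean3(img, pad):
    """3x3 averaging filter, stride 1, with `pad` rings of zeros around img."""
    h, w = len(img), len(img[0])
    padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = img[r][c]
    out_h, out_w = h + 2 * pad - 2, w + 2 * pad - 2
    return [[sum(padded[r + dr][c + dc] for dr in range(3) for dc in range(3)) / 9.0
             for c in range(out_w)]
            for r in range(out_h)]

ones = [[1.0] * 5 for _ in range(5)]
valid = conv2d_mean3(ones, pad=0)   # no padding: output shrinks to 3 x 3
same = conv2d_mean3(ones, pad=1)    # zero padding: output stays 5 x 5

print(len(valid), len(same))
print(valid[0][0], same[0][0], same[0][2], same[2][2])
```

Without padding every output equals 1.0, so the statistics are identical at every position. With zero padding the output keeps its 5x5 size, but corner values drop to 4/9 and edge values to 6/9 while the interior stays at 1.0: the border now follows a different distribution from the interior.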
Mixture-based Representation
Assume the original image x has a random pair of augmented images (x¹, x²). After feeding (x¹, x²) into the encoder, the top-layer features f₁ and f₇ are obtained. Here ‘mixture’ refers to replacing each top-level feature with a set of k mixture features. For example, to generate the k mixture features {f₁¹(x¹), f₁²(x¹), … , f₁^k(x¹)} for the feature f₁(x¹), this paper uses a simple network consisting of a fully-connected layer with a single ReLU layer and a residual connection between f₁ and the mixture features. The structure is shown in Fig.13 (a) below. The objective function of the mixture-based encoder is also built on the NCE loss used in local DIM and AMDIM, and is represented in (6) below.
where s_nce is the NCE score of (f₁^j(x¹), f₇^i(x²)) based on the formulation in (2), and q is the posterior of f₁^j(x¹) conditioned on f₇^i(x²). αH(q) is an entropy-maximization term that acts as a regularizer. τ and α are hyperparameters that can be tuned. As before, formulation (6) ultimately becomes a function of the inputs (x¹, x²) and of the weight and bias parameters. A gradient-descent-based method with backpropagation can be used to solve the optimization problem in (6) and obtain trained weight and bias parameters for this mixture-based encoder.
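The two ingredients described above, the small network that produces k mixture features and the posterior-weighted objective, can be sketched as follows. Several points are assumptions rather than details confirmed by this summary: q is taken to be a softmax over the k scores with temperature τ, the objective is shown for a single local feature, the residual is added to every mixture component, and all function names and sizes are hypothetical.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, bias, v):
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def mixture_features(f1, w_hidden, b_hidden, w_out, b_out, k):
    """Map f1 to k mixture components, each with a residual connection to f1."""
    d = len(f1)
    hidden = relu(linear(w_hidden, b_hidden, f1))
    flat = linear(w_out, b_out, hidden)                 # length k * d
    return [[flat[j * d + i] + f1[i] for i in range(d)] for j in range(k)]

def posterior(scores, tau):
    """q_j proportional to exp(tau * s_j): a softmax over the k components (assumed form)."""
    m = max(scores)
    exps = [math.exp(tau * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_objective(scores, tau, alpha):
    """sum_j q_j * s_j + alpha * H(q) for one local feature."""
    q = posterior(scores, tau)
    entropy = -sum(p * math.log(p) for p in q if p > 0)
    return sum(p * s for p, s in zip(q, scores)) + alpha * entropy

rng = random.Random(0)
d, hidden_dim, k = 3, 4, 2
f1 = [0.5, -1.0, 2.0]
w_h = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden_dim)]
w_o = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(k * d)]
components = mixture_features(f1, w_h, [0.0] * hidden_dim, w_o, [0.0] * (k * d), k)

q = posterior([1.0, 3.0], tau=1.0)    # the better-matching component gets more weight
print(len(components), len(components[0]), q[1] > q[0])
```

The entropy term rewards spreading weight over several mixture components, which discourages the model from collapsing onto a single component early in training.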
Experiments and Analysis
Before discussing the experimental results, it is necessary to clarify how self-supervised learning works in this paper and how its performance is evaluated, which is shown in Fig.14 below.
Classification accuracy is used to measure the performance of AMDIM and the competing self-supervised learning methods in this paper. The performance comparison between AMDIM and the competing methods is given in the Fig.15 (a) and (b) tables below. The effects of the different data augmentation methods, the regularization term in the NCE loss function, and the multiscale structure on the performance of AMDIM are shown in the Fig.15 (c) table below.
In the Fig.15(a) table, ‘sup’ means that the method is trained in a fully-supervised way, without the self-supervised NCE cost (loss). Hence, the labels are used to train the encoders’ weight and bias parameters. Except for the supervised results in the first two rows, all other results come from self-supervised training, which yields a high-level feature vector that is fed into a linear logistic classifier (i.e., linear evaluation). Note that the difference between ‘small’ and ‘large’ for AMDIM lies in the number of epochs used during training. ‘Places205’ is used to test the accuracy of transferring the models trained on ImageNet. This table indicates that AMDIM outperforms the seven existing competing self-supervised learning methods.
In the Fig.15(b) table, all results are based on self-supervised learning methods. After training, these methods yield a high-level feature vector that serves as the input to a linear logistic classifier and to a multi-layer perceptron (MLP) classifier, corresponding to the linear and MLP evaluations. Using the linear and MLP classifiers, the performance (accuracy) of these self-supervised learning methods is evaluated in a supervised way. This table implies that AMDIM performs comparably to the three self-supervised baselines on the CIFAR10 and CIFAR100 datasets.
In the Fig.15(c) table, all results are based on AMDIM with data augmentation, the regularization term in the NCE cost function, and the multiscale feature structure. The first row (AMDIM) is the full model, in which the data augmentations (color jitter, random grayscale, and random crop), the multiscale features, and the NCE cost regularization are all used. ‘+strong aug’ means that the Fast AutoAugment policy proposed by Lim et al. [2019] is additionally applied. ‘-color jitter’ means that this augmentation is removed from AMDIM. Similarly, ‘-random gray’, ‘-random crop’, ‘-multiscale’, and ‘-stabilize’ each denote that the corresponding technique is removed. Note that ‘stabilize’ corresponds to the stability regularization, i.e., the NCE cost regularization in formulation (4). Among these results, AMDIM with ‘strong aug’ performs best under both the linear logistic and MLP classifiers. Among the individual data augmentations, removing ‘random crop’ hurts AMDIM’s performance the most. Overall, data augmentation has the largest effect on the performance of AMDIM, followed by the NCE cost regularization and the multiscale feature structure.
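The linear evaluation protocol described above (freeze the encoder, then train only a linear classifier on its features with labels) can be sketched on toy data as follows. The fixed function `frozen_encoder`, the synthetic two-class data, and the training settings are all illustrative assumptions; they stand in for a pre-trained AMDIM encoder and a real labeled dataset.

```python
import math
import random

def frozen_encoder(x):
    """Stand-in for a pre-trained encoder; its parameters are never updated here."""
    return [x[0] + x[1], x[0] - x[1]]

def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Binary logistic regression trained by gradient descent on fixed features."""
    w, b = [0.0] * len(features[0]), 0.0
    n = len(features)
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for i, fi in enumerate(f):
                gw[i] += (p - y) * fi
            gb += p - y
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def accuracy(w, b, features, labels):
    preds = [1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0 for f in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

rng = random.Random(0)

def make_split(n):
    """Toy inputs with labels; only the classifier ever sees the labels."""
    xs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    ys = [1 if x[0] + x[1] > 0 else 0 for x in xs]
    return [frozen_encoder(x) for x in xs], ys

train_f, train_y = make_split(200)
test_f, test_y = make_split(100)
w, b = train_linear_probe(train_f, train_y)
test_acc = accuracy(w, b, test_f, test_y)
print(test_acc)
```

Because only `w` and `b` are trained, the test accuracy reflects how linearly separable the frozen features already are, which is why this protocol is used to compare self-supervised encoders. The MLP evaluation follows the same pattern with a small multi-layer classifier in place of the linear one.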
References
[1] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems (NeurIPS) 32, 2019.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
[3] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, and Sungwoong Kim. Fast AutoAugment. arXiv:1905.00397 [cs.LG], 2019.
[4] Alexander Kolesnikov, Xiaohua Zhai, and Lucas Beyer. Revisiting self-supervised learning. arXiv:1901.09005, 2019.
[5] Anh-Duc Nguyen, Seonghwa Choi, Woojae Kim, Sewoong Ahn, Jinwoo Kim, and Sanghoon Lee. Distribution padding in convolutional neural networks. IEEE International Conference on Image Processing (ICIP), pp. 4275–4279, 2019.
[6] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.