Understanding Gradient-Isolated Learning of Representations and the Intuition Behind Greedy InfoMax

Sumanth S Rao
Published in Analytics Vidhya
7 min read · Jan 25, 2020

Ever since I got involved with data science and machine learning, one question has fascinated me: humans generate a humongous amount of data, yet we are unable to utilize it effectively with the sophisticated algorithms at our disposal. The real shortcoming in all our ideas to apply deep learning and save the world (a bit too late for that) is not the data itself, which we have in abundance, but the effort and time involved in labeling and preprocessing it. Just to get a sense of this, we generate close to 3 quintillion bytes (10¹⁸) of data every day! This includes everything from a single click on a Google search to all the data captured by cameras, weather monitors, and social media. So, if we could learn from and make sense of this data without relying much on labeling or annotation, it could be the next disruptive technology. The paper I happened to stumble upon on my new-paper radar attempts to learn representations of large datasets in an unsupervised manner. I have attempted to capture the main concepts of this research work (https://arxiv.org/abs/1905.11786) by Sindy Löwe, Peter O’Connor, and Bastiaan S. Veeling, titled “Putting An End to End-to-End: Gradient-Isolated Learning of Representations”, without diving too much into the mathematical aspects.

Intuition behind the approach

The paper introduces its approach by pointing out the disadvantages of traditional end-to-end training and its biological implausibility: it is well established that our brain does not process and learn information the way traditional backpropagation does. Despite some evidence for top-down connections in the brain, there does not appear to be a global objective that is optimized by error signals. The paper proposes a new method to learn effectively from unlabelled datasets, capturing useful representations without a supervised learning signal, which can then be used for downstream tasks.

The basic ingredient that enables this approach is the presence of so-called slow features in the data, the very features that typical downstream tasks rely on.

Slow features refer to the inherent similarity that the units of a feature map exhibit with their neighbouring units: a feature has properties more similar to those in its vicinity than to those in any random part of the feature representation.

To put this into perspective, all the pixels belonging to one particular object in an image exhibit local similarity in terms of texture, color, lighting, and gradients. Similarly, in audio data with multiple speakers, the data associated with one person within a single continuous patch of the audio will be similar in pitch, frequency, tone, and so on. The method leverages this sequential ordering in the data to encode and represent it compactly, without subjecting the model to end-to-end loss optimization. The model is organized as a stack of modules: each module independently learns to represent the temporal similarity present in the output of the previous module, and it optimizes itself without needing feedback from the modules that follow it, which also eliminates the vanishing-gradient problem.

The algorithm

This self-supervised learning approach extracts useful representations from sequential inputs by maximizing the mutual information between the extracted representations of temporally nearby patches.

The main principle of the algorithm is to learn, given the representation of the current patch, what the representation of the immediately following patch in the sequence looks like. The algorithm works as follows. Imagine that a sequential data sample (for example, a speech recording) is divided into ‘n’ uniform time units.
In the first step, the data sample x up to time ‘t’ is encoded using a deep encoding model E. Additionally, another representation C(t), which aggregates the information of all patches up to time step ‘t’, is created using an autoregressive model: G(ar)[0:t] = C(t). An autoregressive model, like a recurrent neural network, carries information from previous time steps forward; unlike an RNN, however, an autoregressive model such as WaveNet does not rely on a hidden state and instead conditions directly on its previous outputs.

WaveNet animation. Source: Google DeepMind.
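To make this first step concrete, here is a minimal PyTorch-style sketch of the two-stage representation described above: an encoder E that maps raw input patches to encodings, and an autoregressive aggregator that produces C(t). The layer sizes, the strided convolutional encoder, and the GRU aggregator are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Deep encoding model E: maps raw input patches x to encodings E(t)."""
    def __init__(self, in_channels=1, hidden_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden_dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=8, stride=4), nn.ReLU(),
        )

    def forward(self, x):                        # x: (batch, channels, samples)
        return self.conv(x).permute(0, 2, 1)     # -> (batch, time, hidden_dim)

class Aggregator(nn.Module):
    """Autoregressive model G(ar): summarises the encodings up to t into a context C(t)."""
    def __init__(self, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, z):                        # z: (batch, time, hidden_dim)
        c, _ = self.gru(z)                       # c[:, t] aggregates all patches up to t
        return c

x = torch.randn(4, 1, 20480)                     # a toy batch of raw audio
z = Encoder()(x)                                 # per-patch encodings E(t)
c = Aggregator()(z)                              # context representations C(t)
```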

Now that we have these initial representations, the goal is to maximize the mutual information between the representation of the data up to time step ‘t’ and a nearby future patch, say at time ‘t+k’. This is done by extracting the encoding E(t+k) of the patch that lies k time steps ahead of ‘t’, and training the model to maximize the information shared between C(t) and E(t+k) of temporally nearby patches by employing a specifically designed, global probabilistic loss.

The training and loss

The loss mentioned above is derived from the principles of Noise-Contrastive Estimation (NCE) [Gutmann and Hyvärinen, 2010]. The idea is similar to what we saw in the last paragraph: we take the current representations E(t) and C(t) and train the model so that, among a set of candidate encodings, C(t) is most similar to the true future encoding E(t+k). This is done by taking a bag of inputs as

X = [E(t+k), E(n1), E(n2), E(n3), …], where E(t+k) is the positive encoding that lies k time steps ahead of ‘t’ and all the other encodings are negative samples drawn at random from the data, with no correlation to E(t).

The loss function used here takes pairwise inputs of C(t) and E(ni). Each pair (E(ni), C(t)) is scored using a log-bilinear function (link) that predicts how likely it is that the given encoding E(ni) is the positive sample E(t+k). This loss is used to optimize both the encoding model E and the autoregressive model G(ar) to extract features that are consistent over neighboring patches but diverge between random pairs of patches. At the same time, the scoring model learns to use those features to correctly classify the matching pair.
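The following sketch shows one way this contrastive loss can be implemented. It handles a single prediction step k with one log-bilinear scoring matrix W_k, and it draws the negative samples from the other time steps and sequences in the batch; both of these are simplifying assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseInfoNCE(nn.Module):
    """Contrastive loss for one prediction step k, with a log-bilinear score z^T W_k c."""
    def __init__(self, dim=512, k=3):
        super().__init__()
        self.W_k = nn.Linear(dim, dim, bias=False)   # the scoring matrix W_k
        self.k = k

    def forward(self, z, c):
        # z: (batch, time, dim) encodings E(t); c: (batch, time, dim) contexts C(t)
        c_t = c[:, :-self.k]                         # contexts C(t)
        z_pos = z[:, self.k:]                        # positive encodings E(t+k)
        B, T, D = z_pos.shape
        pred = self.W_k(c_t).reshape(B * T, D)       # W_k C(t)
        cand = z_pos.reshape(B * T, D)               # candidate encodings
        # Score every (prediction, candidate) pair; the other samples in the
        # batch play the role of the randomly drawn negatives.
        logits = pred @ cand.t()                     # (B*T, B*T) similarity scores
        labels = torch.arange(B * T)                 # the diagonal holds the positives
        return F.cross_entropy(logits, labels)

loss = StepwiseInfoNCE()(torch.randn(4, 100, 512), torch.randn(4, 100, 512))
```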

Greedy InfoMax

Intuition

The theory is that the brain learns to process its perceptions by maximally preserving the information of the input activities in each layer. On top of this, neuroscience suggests that the brain predicts its future inputs and learns by minimizing this prediction error, i.e. its “surprise” [Friston, 2010]. Empirical evidence indicates that retinal cells carry significant mutual information between the current and the future state of their own activity, and this process may happen at each layer within the brain. The technique draws motivation from these theories, resulting in a method that learns to preserve the information between the input and the output of each layer by learning representations that are predictive of future inputs.

Let’s look at another implementation aspect of the approach: optimizing the mutual information between representations at each layer of a model in isolation, while enjoying the many practical benefits that greedy training (decoupled, isolated training of parts of a model) provides. To do this, a conventional deep learning architecture is taken and divided into a stack of M modules. This decoupling can happen at the level of individual layers or, for example, at the level of the blocks found in residual networks [He et al., 2016b]. Rather than training this model end-to-end, gradients are prevented from flowing between modules and a local self-supervised loss is employed instead, which additionally reduces the issue of vanishing gradients.

The algorithm in a nutshell (source: the original paper, https://arxiv.org/abs/1905.11786)

The right half of the image above depicts the encodings from time step ‘t’ through ‘t+k’ up to ‘j’. All the encodings are compared using a scoring function ‘f’ and finally passed to the loss function L(n).
The left half shows how the architecture is divided into modules: each encoding module G(enc) maps the output of the previous module to an encoding Z(m,t). No gradients flow between modules, which is enforced by a gradient-blocking operator. Each module G(enc) is therefore trained using the loss function explained in the previous section together with the scoring function ‘f’ that compares the pairwise encodings.
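The gradient-blocking idea is straightforward to express in code: detach each module's output before handing it to the next module, so that every module is updated only by its own local loss. The sketch below is a simplified illustration; the module internals and the local_loss helper are placeholders rather than the paper's exact components.

```python
import torch
import torch.nn as nn

class GreedyInfoMaxStack(nn.Module):
    """A stack of modules with gradient blocking between them."""
    def __init__(self, modules_list):
        super().__init__()
        self.blocks = nn.ModuleList(modules_list)

    def forward(self, x):
        per_module_outputs = []
        for block in self.blocks:
            x = block(x)                  # forward pass through one module
            per_module_outputs.append(x)
            x = x.detach()                # gradient blocking: no gradients flow back
        return per_module_outputs

def train_step(stack, optimizers, local_loss, x):
    outputs = stack(x)
    for out, opt in zip(outputs, optimizers):
        loss = local_loss(out)            # each module's own contrastive loss
        opt.zero_grad()
        loss.backward()                   # gradients stay within this module
        opt.step()
```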

Results and Summary

The algorithm was applied to both audio and image datasets. In both settings, a feature-extraction model is divided by depth into modules and trained without labels using this approach. The representations created by the final module are then used as the input to a linear classifier. In both settings the results came close to those of the state-of-the-art supervised counterparts, which demonstrates the applicability of the algorithm.
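As an illustration of this evaluation protocol, the sketch below freezes the unsupervised module stack and trains only a linear classifier on top of the final module's (pooled) representations. The names, the pooling choice, and the number of classes are assumptions made for the example.

```python
import torch
import torch.nn as nn

def train_linear_probe(stack, classifier, loader, epochs=10):
    """Train a linear classifier on top of the frozen, self-supervised stack."""
    stack.eval()                                        # frozen feature extractor
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                features = stack(x)[-1].mean(dim=1)     # final module's output, pooled over time
            loss = ce(classifier(features), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

classifier = nn.Linear(512, 10)                         # e.g. 10 target classes
```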

The key advantages of the approach, as listed in the paper, are as follows:

When applying GIM to high-dimensional inputs, each module can be optimized in sequence to decrease memory costs during training. In the most memory-constrained scenario, individual modules can be trained one at a time, frozen, and their outputs stored as a dataset for the next module, which effectively removes the depth of the network as a factor in the memory complexity (a sketch of this schedule appears after this list).

Additionally, GIM allows for training models on larger-than-memory input data with architectures that would otherwise exceed memory limitations.

Last but not least, GIM provides a highly flexible framework for the training of neural networks. It enables the training of individual parts of an architecture at varying update frequencies. When a higher level of abstraction is needed, GIM allows for adding new modules on top at any moment of the optimization process without having to fine-tune previous results.
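Here is a minimal sketch of the memory-constrained schedule mentioned in the list above: each module is trained with its local loss, then frozen, and its outputs are stored as the dataset for the next module. The train_module and encode_dataset helpers are hypothetical placeholders for a full training loop and a pass over the data.

```python
import torch

def greedy_sequential_training(modules, raw_dataset, train_module, encode_dataset):
    """Train modules one at a time, freezing each and reusing its outputs as data."""
    current_data = raw_dataset
    for module in modules:
        train_module(module, current_data)          # optimize this module's local loss
        for p in module.parameters():               # freeze the trained module
            p.requires_grad = False
        with torch.no_grad():                       # its outputs become the next "dataset"
            current_data = encode_dataset(module, current_data)
    return modules
```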

With this, I would like to conclude this article by thanking the authors of the original paper, Sindy Löwe, Peter O’Connor, and Bastiaan S. Veeling, for this brilliant work! All credit for the information presented in this article belongs to the authors of the paper titled

Putting An End to End-to-End: Gradient-Isolated Learning of Representations

Link to the original paper: https://arxiv.org/abs/1905.11786
