Learning Discrete Compressed Representation for Noise-Robust Exploration
By Seyeon An
A heuristic, or a heuristic technique, is any approach to problem solving that uses a practical method or various shortcuts in order to produce solutions that may not be optimal but are sufficient given a limited timeframe or deadline.
Humans normally rely on heuristics to solve cognitive tasks. For instance, to identify the image of a kangaroo, we perform a set of checks — does it have round ears? does it have a pouch? does it have four legs? — without even noticing. As a friendly reminder, the goal of deep learning is to train Deep Neural Networks (DNNs) to perform tasks like humans. In this case, the key capability is the ability to produce numerical representations of data suitable for solving a set of tasks. Representation learning aims to provide such capabilities to DNNs.
Then, what can we do about unsuitable information in data, which can heavily mislead our DNNs? Our work starts from this question and provides an answer for the treatment of such irrelevant information. Our method, called Drop-Bottleneck, jointly learns features and feature-wise drop probabilities for discretely discarding irrelevant information via the Information Bottleneck (IB) framework.
Information Bottleneck (IB) Framework
We introduce a new IB method in this work. In this section, we explain the basic concepts of the IB framework and show what prior IB methods have lacked.
The information bottleneck framework formalizes the problem of obtaining a compressed representation Z of X that still preserves information about a target Y, by deriving a prediction term and a compression term.
- The prediction term I(Z; Y) encourages preserving task-relevant information.
- The compression term I(Z; X) penalizes Z for containing information about X.
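Putting the two terms together, the standard IB objective (with a Lagrange multiplier β trading prediction against compression, the same β that appears in our experiments below) can be written as:

```latex
\max_{Z} \; I(Z; Y) - \beta \, I(Z; X)
```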
Motivation
Our motivation is to develop an IB method that can be used for inference tasks (i) without stochasticity and (ii) with improved efficiency as a result of compression, which are properties that prior IB methods, such as VIB, lack.
Drop-Bottleneck (DB) and Its Objective
Drop-Bottleneck (DB)
We approach the problem by proposing Drop-Bottleneck (DB), an IB method that discretely drops irrelevant features while jointly learning the feature representation.
We define the i-th feature of Z as Z_i = X_i · b_i, where b_i ~ Bernoulli(1 − p_i), so that each input feature from X is either preserved or discarded according to the Bernoulli distribution and the learnable feature-wise drop probability p ∈ [0, 1]^d.
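As a minimal sketch of this sampling step (our own illustrative code, not the paper's implementation), each feature is independently kept or zeroed out:

```python
import random

def drop_bottleneck_sample(x, p):
    """Sample the compressed representation Z from input features X.

    Each feature x[i] is kept with probability 1 - p[i] and zeroed out
    otherwise, i.e. Z_i = X_i * b_i with b_i ~ Bernoulli(1 - p_i).
    """
    return [x_i if random.random() < 1.0 - p_i else 0.0
            for x_i, p_i in zip(x, p)]

x = [0.7, -1.2, 3.4, 0.5]
p = [0.0, 1.0, 0.0, 1.0]  # always keep features 0 and 2, always drop 1 and 3
drop_bottleneck_sample(x, p)  # -> [0.7, 0.0, 3.4, 0.0]
```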
Objective and Training of DB
For the training, we derive and minimize an upper bound of the compression term, I(Z; X) ≤ ∑ᵢ I(Zᵢ; Xᵢ) = ∑ᵢ (1 − pᵢ)H(Xᵢ), which is a simple sum over the feature-wise drop probabilities.
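Because the upper bound on the compression term, ∑ᵢ (1 − pᵢ)H(Xᵢ), is just a feature-wise sum, it is cheap to evaluate. A small illustrative helper (function name is ours):

```python
def compression_upper_bound(p, h_x):
    """Evaluate the upper bound on I(Z; X): sum_i (1 - p_i) * H(X_i).

    p:   feature-wise drop probabilities
    h_x: per-feature entropies H(X_i), treated as constants w.r.t. p
    """
    return sum((1.0 - p_i) * h_i for p_i, h_i in zip(p, h_x))

# Fully dropping a feature (p_i = 1) removes its contribution entirely:
compression_upper_bound([0.0, 0.5, 1.0], [2.0, 2.0, 2.0])  # -> 3.0
```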
However, Z is not differentiable with respect to p, which prevents updating p using the prediction term.
We solve this issue by employing the Concrete relaxation of the Bernoulli distribution. Intuitively, p is trained to assign high drop probabilities to irrelevant features.
With this relaxation, each stochastic node is reparameterized into a differentiable function of its parameters and a random variable with a fixed distribution. After the reparameterization, the gradients of the loss propagated through the graph by the chain rule are low-variance unbiased estimators of the gradients of the expected (relaxed) loss.
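A sketch of the binary Concrete (Gumbel-Sigmoid) relaxation for a single feature, assuming b_i ~ Bernoulli(1 − p_i) is relaxed with a temperature hyperparameter (function name and default values are ours):

```python
import math
import random

def concrete_keep(p_i, temperature=0.1, eps=1e-8):
    """Relaxed sample of b_i ~ Bernoulli(1 - p_i) via the binary Concrete
    distribution: b = sigmoid((log((1 - p)/p) + log(u/(1 - u))) / t),
    with u ~ Uniform(0, 1).

    The output lies in (0, 1) and is a differentiable function of p_i,
    so the drop probabilities can be trained by ordinary backpropagation.
    """
    u = random.random()
    logit = (math.log(1.0 - p_i + eps) - math.log(p_i + eps)
             + math.log(u + eps) - math.log(1.0 - u + eps))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

# Low temperatures push samples toward hard 0/1 decisions:
samples = [concrete_keep(0.3) for _ in range(5)]  # each value in (0, 1)
```

In an actual framework such as PyTorch, `p_i` would be a learnable tensor and the same formula would flow gradients through it.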
Note that this also allows joint training with the feature extractor that outputs X, via the prediction term. We estimate the prediction term using Deep InfoMax, whose discriminator is trained to output a large value for joint (likely) input pairs and a small value for marginal (arbitrary) input pairs.
DB’s Deterministic Compressed Representations
Drop-Bottleneck also provides a deterministic compressed representation that, while compressed, maintains the majority of the learned distinctions and can be a reasonable replacement for the original representation; we verify this on the Occluded CIFAR dataset.
We also provide a deterministic version of the compressed representation, obtained by replacing each Bernoulli sample with its most likely value, i.e. feature i is kept only if its learned drop probability satisfies p_i < 0.5. This is useful for inference tasks that require consistent representations, and it enables feature dimensionality reduction at inference time, since the always-dropped dimensions can simply be removed.
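A sketch of this deterministic inference path, assuming a 0.5 threshold on the learned drop probabilities (code and names are ours):

```python
def deterministic_representation(x, p, threshold=0.5):
    """Deterministic compressed representation: keep feature i iff p_i < threshold.

    Because the set of kept indices is fixed after training, the dropped
    dimensions can be removed outright, shrinking the feature vector.
    """
    return [x_i for x_i, p_i in zip(x, p) if p_i < threshold]

x = [0.7, -1.2, 3.4, 0.5]
p = [0.1, 0.9, 0.2, 0.8]  # learned drop probabilities
deterministic_representation(x, p)  # -> [0.7, 3.4]
```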
Robust Exploration with DB
We suggest an exploration method equipped with Drop-Bottleneck, which learns state representations using the information bottleneck framework.
Our exploration maintains an episodic memory and generates intrinsic rewards based on the predictability of new observations from the compressed representations of the ones in the memory.
Starting from an empty episodic memory M, we add the compressed representation of the observation at each step.
For transitions (S, A, S’), where S, A and S’ are the current states, actions and next states respectively, and a feature extractor f_ϕ, we set X = f_ϕ(S’), Z = C_p(f_ϕ(S’)) and Y = C_p(f_ϕ(S)),
where
C_p(x)_i = x_i · b_i with b_i ~ Bernoulli(1 − p_i)
for i = 1, …, d.
- The compression term of the DB, I(Z; X) = I(C_p(f_ϕ(S’)); f_ϕ(S’)), essentially encourages f_ϕ to drop unnecessary (noisy) features.
- The prediction term of the DB, I(Z; Y) = I(C_p(f_ϕ(S’)); C_p(f_ϕ(S))), makes the compressed representations of adjacent states (S and S’) predictive of each other, i.e. it provides noise-robust predictability.
Employing an episodic memory M of these state representations, at each step t we quantify the novelty of s_t using Deep InfoMax's discriminator. For an s_t that is close to a region of the state space already covered by earlier observations in M, the discriminator output, and hence the intrinsic reward, is small, and vice versa. This yields a solid exploration method capable of handling noisy environments with very sparse rewards.
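As a toy illustration of this reward scheme (not the paper's method: we replace Deep InfoMax's learned discriminator with a simple cosine-similarity stand-in), novelty can be scored against the memory as follows:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def intrinsic_reward(z_t, memory):
    """Score the novelty of state representation z_t against episodic memory.

    States close to a region already covered by the memory get a small
    reward; unfamiliar states get a large one.
    """
    if not memory:
        return 1.0
    return 1.0 - max(cosine(z_t, z_m) for z_m in memory)

memory = [[1.0, 0.0], [0.0, 1.0]]
intrinsic_reward([1.0, 0.0], memory)   # ~0.0: already covered
intrinsic_reward([-1.0, 0.0], memory)  # 1.0: novel direction
```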
Experiments
Exploration in Noisy Environments
We evaluate our exploration method in environments with multiple noisy-TV settings. As shown in the pictures above, DB helps the agent focus on the navigation task without being distracted by task-irrelevant images or noise.
- Three noisy-TV settings: ImageAction, Noise, NoiseAction
- Environments: VizDoom, DMLab
Results: DB shows state-of-the-art (SOTA) performance on all the noisy VizDoom and DMLab environments.
Comparison with Variational IB
We compare Drop-Bottleneck with Variational Information Bottleneck on the ImageNet dataset.
- Classification accuracy and feature dimensionality reduction (left): While DB's accuracy develops similarly to VIB's as β increases, DB's deterministic approach is far more effective for feature dimensionality reduction.
- Adversarial robustness (right): On the ImageNet dataset, we also test robustness to adversarial attacks, i.e. perturbations added to an image that are nearly imperceptible to humans but can fool a neural network. The results show that DB, and especially its deterministic version, outperforms VIB (Variational Information Bottleneck) in terms of adversarial robustness.
Removal of Task-Irrelevant Information
Additionally, we perform experiments to demonstrate that DB discards task-irrelevant information.
For this, we use the Occluded CIFAR dataset. Each image of Occluded CIFAR consists of a CIFAR image occluded by an MNIST image, so there are two labels (CIFAR and MNIST) for each image. The training process consists of two phases: (1) training using the primary (CIFAR) labels and (2) training using the nuisance (MNIST) labels.
- Phase 1: We train the feature extractor with Bottleneck and Classifier, using the primary (CIFAR) labels.
- Phase 2: We fix the feature extractor and learn a new classifier, using the nuisance (MNIST) labels.
Results: Increasing β (the Lagrange multiplier) effectively discards the nuisance information in DB. While VIB's deterministic representations perform poorly compared to its original representations, DB performs well with both its deterministic and original representations.
Conclusion
The Drop-Bottleneck method jointly learns features and drops task-irrelevant ones, and additionally provides deterministic representations for inference. Our method can be utilized for inference tasks (i) without stochasticity and (ii) with improved efficiency as a result of compression, which are properties that prior IB methods, such as VIB, lack.
Experiments show that the Drop-Bottleneck method is effective for
- adversarial robustness
- feature dimensionality reduction
- distilling relevant information
and achieves SOTA performance in noisy exploration tasks.
The Drop-Bottleneck is a big step forward for representation learning, toward models that see and infer the way humans do.
Acknowledgements
We thank Jaekyeom Kim and the co-authors of the paper “Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely those of the authors.
This post is based on the following paper:
- Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration, Jaekyeom Kim, Minjung Kim, Dongyeon Woo, Gunhee Kim, International Conference on Learning Representations (ICLR) 2021, arXiv, GitHub.
This post was originally published on our Notion blog on June 19, 2021.