Learning Discrete Compressed Representation for Noise-Robust Exploration
By Seyeon An
A heuristic, or a heuristic technique, is any approach to problem solving that uses a practical method or various shortcuts in order to produce solutions that may not be optimal but are sufficient given a limited timeframe or deadline.
Humans normally rely on heuristics to solve cognitive tasks. For instance, to identify the image of a kangaroo, we perform a set of checks — does it have round ears? does it have a pouch? does it have four legs? — without even noticing. As a friendly reminder, the goal of deep learning is to train Deep Neural Networks (DNNs) to perform tasks like humans. In this case, the key capability is the ability to produce numerical representations of data suitable for solving a set of tasks. Representation learning aims to provide such capabilities to DNNs.
Then, what can we do about unsuitable information in data, which can heavily mislead our DNNs? Our work starts from this question and provides an answer for the treatment of such irrelevant information. Our method, called Drop-Bottleneck, jointly learns features and feature-wise drop probabilities for discretely discarding irrelevant information via the Information Bottleneck (IB) framework.
Information Bottleneck (IB) Framework
We introduce a new IB method in this work. In this section, we explain the basic concepts of the IB framework and show what prior IB methods have lacked.
The information bottleneck framework formalizes the problem of obtaining a compressed representation Z of X that still preserves information about a target Y, by deriving a prediction term and a compression term.
- The prediction term I(Z; Y) encourages preserving task-relevant information.
- The compression term I(Z; X) penalizes Z for containing information about X.
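Putting the two terms together, the standard IB objective (with a Lagrange multiplier β trading prediction against compression, the same β that appears in our experiments below) can be written as:

```latex
\max_{Z} \; I(Z; Y) - \beta \, I(Z; X)
```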
Motivation
Our motivation is to develop an IB method that can be used for inference tasks (i) without stochasticity and (ii) with improved efficiency as a result of compression, which are properties that prior IB methods, such as VIB, lack.
Drop-Bottleneck (DB) and Its Objective
Drop-Bottleneck (DB)
We approach the problem by proposing Drop-Bottleneck (DB), an IB method that discretely drops irrelevant features while jointly learning the feature representation.
We define the i-th feature of Z as Z_i = X_i · b_i, where b_i ~ Bernoulli(1 − p_i), so that each input feature from X is either preserved or discarded according to the Bernoulli distribution and the learnable feature-wise drop probability p ∈ [0, 1]^d.
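As a minimal sketch of this sampling step (our own illustrative code, not the paper's implementation), each feature is independently kept or zeroed out:

```python
import random

def drop_bottleneck_sample(x, p):
    """Sample the compressed representation Z from input features X.

    Each feature x[i] is kept with probability 1 - p[i] and zeroed out
    otherwise, i.e. Z_i = X_i * b_i with b_i ~ Bernoulli(1 - p_i).
    """
    return [x_i if random.random() < 1.0 - p_i else 0.0
            for x_i, p_i in zip(x, p)]

x = [0.7, -1.2, 3.4, 0.5]
p = [0.0, 1.0, 0.0, 1.0]  # always keep features 0 and 2, always drop 1 and 3
drop_bottleneck_sample(x, p)  # -> [0.7, 0.0, 3.4, 0.0]
```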
Objective and Training of DB
For the training, we derive and minimize an upper bound of the compression term, I(Z; X) ≤ ∑ᵢ I(Zᵢ; Xᵢ) = ∑ᵢ (1 − pᵢ)H(Xᵢ), which is a simple sum over the feature-wise drop probabilities.
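Because the upper bound on the compression term, ∑ᵢ (1 − pᵢ)H(Xᵢ), is just a feature-wise sum, it is cheap to evaluate. A small illustrative helper (function name is ours):

```python
def compression_upper_bound(p, h_x):
    """Evaluate the upper bound on I(Z; X): sum_i (1 - p_i) * H(X_i).

    p:   feature-wise drop probabilities
    h_x: per-feature entropies H(X_i), treated as constants w.r.t. p
    """
    return sum((1.0 - p_i) * h_i for p_i, h_i in zip(p, h_x))

# Fully dropping a feature (p_i = 1) removes its contribution entirely:
compression_upper_bound([0.0, 0.5, 1.0], [2.0, 2.0, 2.0])  # -> 3.0
```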
However, Z is not differentiable with respect to p, which prevents updating p using the prediction term.
We solve this issue by employing the Concrete relaxation of the Bernoulli distribution. Intuitively, p is trained to assign high drop probabilities to irrelevant features.
With this relaxation, each stochastic node is reparameterized into a differentiable function of its parameters and a random variable with a fixed distribution. After the reparameterization, the gradients of the loss propagated through the graph by the chain rule are low-variance unbiased estimators of the gradients of the expected (relaxed) loss.
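A sketch of the binary Concrete (Gumbel-Sigmoid) relaxation for a single feature, assuming b_i ~ Bernoulli(1 − p_i) is relaxed with a temperature hyperparameter (function name and default values are ours):

```python
import math
import random

def concrete_keep(p_i, temperature=0.1, eps=1e-8):
    """Relaxed sample of b_i ~ Bernoulli(1 - p_i) via the binary Concrete
    distribution: b = sigmoid((log((1 - p)/p) + log(u/(1 - u))) / t),
    with u ~ Uniform(0, 1).

    The output lies in (0, 1) and is a differentiable function of p_i,
    so the drop probabilities can be trained by ordinary backpropagation.
    """
    u = random.random()
    logit = (math.log(1.0 - p_i + eps) - math.log(p_i + eps)
             + math.log(u + eps) - math.log(1.0 - u + eps))
    return 1.0 / (1.0 + math.exp(-logit / temperature))

# Low temperatures push samples toward hard 0/1 decisions:
samples = [concrete_keep(0.3) for _ in range(5)]  # each value in (0, 1)
```

In an actual framework such as PyTorch, `p_i` would be a learnable tensor and the same formula would flow gradients through it.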
Note that this also allows joint training with the feature extractor that outputs X, via the prediction term. We estimate the prediction term using Deep InfoMax, whose discriminator is trained to output a large value for joint (likely) input pairs and a small value for marginal (arbitrary) input pairs.
DB’s Deterministic Compressed Representations
Drop-Bottleneck also provides a deterministic compressed representation that, while compressed, maintains the majority of the learned distinctions and can be a reasonable replacement for the original representation; we verify this on the Occluded CIFAR dataset.
We also provide a deterministic version of the compressed representation, obtained by replacing each Bernoulli sample with its most likely value, i.e. feature i is kept only if its learned drop probability satisfies p_i < 0.5. This is useful for inference tasks that require consistent representations, and it enables feature dimensionality reduction at inference time, since the always-dropped dimensions can simply be removed.
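A sketch of this deterministic inference path, assuming a 0.5 threshold on the learned drop probabilities (code and names are ours):

```python
def deterministic_representation(x, p, threshold=0.5):
    """Deterministic compressed representation: keep feature i iff p_i < threshold.

    Because the set of kept indices is fixed after training, the dropped
    dimensions can be removed outright, shrinking the feature vector.
    """
    return [x_i for x_i, p_i in zip(x, p) if p_i < threshold]

x = [0.7, -1.2, 3.4, 0.5]
p = [0.1, 0.9, 0.2, 0.8]  # learned drop probabilities
deterministic_representation(x, p)  # -> [0.7, 3.4]
```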
Robust Exploration with DB
We suggest an exploration method equipped with Drop-Bottleneck, which learns state representations using the information bottleneck framework.
Our exploration maintains an episodic memory and generates intrinsic rewards based on the predictability of new observations from the compressed representations of the ones in the memory.
Starting from an empty episodic memory M, we add the compressed representation of the observation at each step.
For transitions (S, A, S’), where S, A and S’ are the current states, actions and next states respectively, and a feature extractor f_ϕ, we set X = f_ϕ(S’), Z = C_p(f_ϕ(S’)) and Y = C_p(f_ϕ(S)),
where
C_p(x)_i = x_i · b_i with b_i ~ Bernoulli(1 − p_i)
for i = 1, …, d.
- The compression term of the DB, I(Z; X) = I(C_p(f_ϕ(S’)); f_ϕ(S’)), essentially encourages f_ϕ to drop unnecessary (noisy) features.
- The prediction term of the DB, I(Z; Y) = I(C_p(f_ϕ(S’)); C_p(f_ϕ(S))), makes the compressed representations of adjacent states (S and S’) predictive of each other, i.e. it provides noise-robust predictability.
Employing an episodic memory M of these state representations, at each step t we quantify the novelty of s_t using Deep InfoMax's discriminator. For an s_t that is close to a region of the state space already covered by earlier observations in M, the discriminator output, and hence the intrinsic reward, is small, and vice versa. This yields a solid exploration method capable of handling noisy environments with very sparse rewards.
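As a toy illustration of this reward scheme (not the paper's method: we replace Deep InfoMax's learned discriminator with a simple cosine-similarity stand-in), novelty can be scored against the memory as follows:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def intrinsic_reward(z_t, memory):
    """Score the novelty of state representation z_t against episodic memory.

    States close to a region already covered by the memory get a small
    reward; unfamiliar states get a large one.
    """
    if not memory:
        return 1.0
    return 1.0 - max(cosine(z_t, z_m) for z_m in memory)

memory = [[1.0, 0.0], [0.0, 1.0]]
intrinsic_reward([1.0, 0.0], memory)   # ~0.0: already covered
intrinsic_reward([-1.0, 0.0], memory)  # 1.0: novel direction
```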
Experiments
Exploration in Noisy Environments
We evaluate our exploration method in environments with multiple noisy-TV settings. As shown in the pictures above, DB helps the agent focus on the navigation task without being distracted by task-irrelevant images or noise.
- Three noisy-TV settings: ImageAction, Noise, NoiseAction
- Environments: VizDoom, DMLab
Results: DB shows state-of-the-art (SOTA) performance on all the noisy VizDoom and DMLab environments.
Comparison with Variational IB
We compare Drop-Bottleneck with Variational Information Bottleneck on the ImageNet dataset.
- Classification accuracy and feature dimensionality reduction (left): While DB's accuracy develops similarly to VIB's as β increases, DB's deterministic approach is far more effective for feature dimensionality reduction.
- Adversarial robustness (right): On the ImageNet dataset, we also test robustness to adversarial attacks, i.e. perturbations added to an image that are nearly imperceptible to humans but can fool a neural network. The results show that DB, and especially its deterministic version, outperforms VIB (Variational Information Bottleneck) in terms of adversarial robustness.
Removal of Task-Irrelevant Information
Additionally, we perform experiments to demonstrate that DB discards task-irrelevant information.
For this, we use the Occluded CIFAR dataset. Each image of Occluded CIFAR consists of a CIFAR image occluded by an MNIST image, so there are two labels (CIFAR and MNIST) for each image. The training process consists of two phases: (1) training using the primary (CIFAR) labels and (2) training using the nuisance (MNIST) labels.
- Phase 1: We train the feature extractor with Bottleneck and Classifier, using the primary (CIFAR) labels.
- Phase 2: We fix the feature extractor and learn a new classifier, using the nuisance (MNIST) labels.
Results: Increasing β (the Lagrange multiplier) effectively discards the nuisance information in DB. While VIB's deterministic representations perform poorly compared to its original representations, DB performs well with both its deterministic and original representations.
Conclusion
The Drop-Bottleneck method jointly learns features and drops task-irrelevant ones, and additionally provides deterministic representations for inference. Our method can be utilized for inference tasks (i) without stochasticity and (ii) with improved efficiency as a result of compression, which are properties that prior IB methods, such as VIB, lack.
Experiments show that the Drop-Bottleneck method is effective for
- adversarial robustness
- feature dimensionality reduction
- distilling relevant information
and achieves SOTA performance in noisy exploration tasks.
The Drop-Bottleneck is a big step forward for representation learning, toward models that see and infer the way humans do.
Acknowledgements
We thank Jaekyeom Kim and the co-authors of the paper “Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely those of the authors.
This post is based on the following paper:
- Drop-Bottleneck: Learning Discrete Compressed Representation for Noise-Robust Exploration, Jaekyeom Kim, Minjung Kim, Dongyeon Woo, Gunhee Kim, International Conference on Learning Representations (ICLR) 2021, arXiv, GitHub.
This post was originally published on our Notion blog on June 19, 2021.