VIME — Variational Information Maximizing Exploration

Bansi Gajera · Published in Clique Community · Feb 17, 2019

Reinforcement Learning received a surge of attention after DeepMind's high-profile successes demonstrated how far artificial intelligence can go in learning new tasks. RL is often dismissed as impractical because of its slow learning, enormous training requirements, and high computational cost. Yet RL is poised to be influential and impactful in human assistance, because it closely mirrors how the human brain learns to make decisions. Deep RL remains an area of active research, and the exploration-exploitation problem in particular has captured many curious minds. Many exploration approaches have been proposed so far: Bayesian RL and PAC-MDP methods (for discrete state and action spaces), acting randomly, Boltzmann exploration, adding Gaussian noise to the controls in policy gradient methods (for state and action spaces where discretization is not feasible), and many more. The curiosity-driven exploration strategy VIME (Variational Information Maximizing Exploration) [1], which uses the information gain about the agent's internal belief of the dynamics model as a driving force, has been shown to outperform these heuristic exploration methods across a variety of continuous control tasks and algorithms, including tasks with very sparse rewards.

The motivation behind VIME is that, while exploring, we want the agent to take actions that lead to states it finds surprising, i.e., states that cause large updates to its dynamics model distribution. The aim is therefore to maximize the reduction in uncertainty about the dynamics. Taking actions that maximize this reduction in entropy tends to lead the agent into unexplored states, which are exactly the states that are maximally informative.

Before diving into VIME, here is a brief recap of the concepts used in the method.

Exploration and Exploitation:

Exploration means trying things that have not been tried before, in the hope of obtaining an even higher reward: the agent experiments with novel strategies that may improve returns in the long run. In exploitation, the agent maximizes reward through behavior that is already known to be successful.
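As a toy illustration of this trade-off (not part of VIME itself), here is a minimal epsilon-greedy sketch for a multi-armed bandit; the arm values and the epsilon of 0.1 are made-up assumptions:

```python
import random

def epsilon_greedy_action(estimated_values, epsilon=0.1):
    """With probability epsilon, pick a random arm (explore);
    otherwise pick the arm with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(estimated_values))      # explore
    return max(range(len(estimated_values)),
               key=lambda a: estimated_values[a])            # exploit

# Toy usage: estimated mean rewards of three bandit arms
print(epsilon_greedy_action([0.2, 0.5, 0.1]))
```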

Bayesian Neural Network (BNN):

A Bayesian neural network is a neural network with a prior distribution on its weights (Neal, 2012). In short, it is a neural network that acts as a conditional model p, parameterised by the weights θ of the network, producing an output y when given some input x [2]. The likelihood of a particular data point D is determined by the parameter θ; in other words, θ expresses how plausible different data samples are under the model. Because the weights are distributions rather than point estimates, the network carries an explicit measure of uncertainty, and this is the primary reason VIME uses a BNN rather than a plain neural network.
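As a rough sketch of the idea (a toy example, not the architecture used in the VIME paper), a Bayesian layer can keep a mean and a standard deviation for every weight and sample the weights on each forward pass; the sizes and initial values below are arbitrary assumptions:

```python
import numpy as np

class BayesianLinear:
    """Toy Bayesian layer: every weight has its own Gaussian posterior N(mu, sigma^2)."""
    def __init__(self, n_in, n_out, seed=0):
        self.rng = np.random.default_rng(seed)
        self.mu = self.rng.normal(0.0, 0.1, size=(n_in, n_out))  # posterior means
        self.log_sigma = np.full((n_in, n_out), -3.0)             # posterior log std-devs

    def forward(self, x):
        # Sample a fresh set of weights from the factorized Gaussian posterior;
        # repeated forward passes differ, and that spread reflects model uncertainty.
        w = self.mu + np.exp(self.log_sigma) * self.rng.normal(size=self.mu.shape)
        return x @ w

layer = BayesianLinear(4, 2)
x = np.ones((1, 4))
print(layer.forward(x))  # two calls give different outputs
print(layer.forward(x))  # because the weights are resampled each time
```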

Variational Bayes:

Variational Bayesian methods are used for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically applied to complex statistical models consisting of observed data as well as unknown parameters and latent variables. Variational Bayesian methods are primarily used for two purposes:

  1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
  2. To derive a lower bound for the marginal likelihood of the observed data.
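These two purposes are tied together by a standard identity: the log marginal likelihood decomposes into the variational lower bound (the ELBO) plus the KL divergence between the approximation and the true posterior,

\log p(D) = \mathcal{L}[q] + D_{KL}\big( q(\theta) \,\|\, p(\theta \mid D) \big),
\qquad
\mathcal{L}[q] = \mathbb{E}_{q(\theta)}\big[ \log p(D \mid \theta) \big] - D_{KL}\big( q(\theta) \,\|\, p(\theta) \big).

Since log p(D) is fixed, maximizing the lower bound L[q] is equivalent to minimizing the KL divergence to the true posterior, which is exactly the trick VIME relies on below.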

KL Divergence:

Kullback-Leibler (KL) divergence is a measure of how one probability distribution differs from another. Classically, in Bayesian theory, there is some true distribution P(X) that we would like to estimate with an approximate distribution Q(X).

For instance, we might have a complicated true distribution P that we would like to approximate with a simpler Gaussian distribution Q [4]. In this context, the KL divergence measures how far the approximate distribution Q is from the true distribution P. It is computed as follows:
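For discrete distributions, the (forward) KL divergence is

D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)},

with the sum replaced by an integral in the continuous case. It is zero only when P and Q are identical, and it is not symmetric: in general D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P).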

In RL, and in variational inference more generally, the reverse KL divergence is typically used. In reverse KL, we sample points from Q(X) and try to maximize the probability of those points under P(X); in other words, wherever Q(⋅) places high probability, P(⋅) must also have high probability.
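Written out, reverse KL simply swaps the roles of the two distributions in the expectation:

D_{KL}(Q \,\|\, P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}.

Because the expectation is taken under Q, the approximation is only penalized where it places probability mass, so minimizing reverse KL drives Q to concentrate on regions where P is large.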

Now getting back to VIME,

As stated earlier, the aim is to maximize the information gain. The information gained after performing an action is the reduction in entropy of the agent's belief over the dynamics model, measured before and after observing the resulting state. Maximizing the information gain over a trajectory can therefore be formalized as maximizing the sum of these reductions in entropy.

Considering the history of the agent up to time t to be ξt = {s0, a0, s1, a1, …, st}, the information gained about the dynamics model by taking an action and observing the resulting state can be expressed as a mutual information and written in terms of the KL divergence as follows:
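I(S_{t+1}; \Theta \mid \xi_t, a_t) = \mathbb{E}_{s_{t+1} \sim p(\cdot \mid \xi_t, a_t)} \Big[ D_{KL}\big( p(\theta \mid \xi_t, a_t, s_{t+1}) \,\|\, p(\theta \mid \xi_t) \big) \Big]

That is, as formulated in [1], the information gain is the expected KL divergence between the agent's belief over the dynamics parameters after observing the new state s_{t+1} and its belief before.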

Here, we are comparing the agent's belief over the dynamics parameters before and after performing an action, by means of the KL divergence, to compute the information gained. The agent models the environment dynamics via a model p(s_{t+1} | s_t, a_t; θ), parametrized by the random variable Θ with values θ ∈ Θ; s denotes the state and a the action performed at time t. The posterior distribution over the parameters of the environment dynamics is computed using variational inference: the exact posterior is intractable, so it is approximated with a variational distribution q.

Here, D denotes the data samples (the observed transitions). The approximating distribution q is represented as a factorized distribution: a Bayesian neural network whose weights follow a fully factorized Gaussian. Using the variational distribution q, we approximate the posterior by minimizing the KL divergence between the two distributions:
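Concretely, with φ denoting the variational parameters (the means and standard deviations of the factorized Gaussian over the BNN weights), the objective is

\min_{\phi} \; D_{KL}\big( q(\theta; \phi) \,\|\, p(\theta \mid D) \big).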

By minimizing this KL divergence, we bring the variational distribution q as close as possible to the posterior distribution p. Equivalently, this is done by maximizing the variational lower bound L[q]:
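\mathcal{L}\big[ q(\theta; \phi), D \big] = \mathbb{E}_{\theta \sim q(\cdot; \phi)}\big[ \log p(D \mid \theta) \big] - D_{KL}\big( q(\theta; \phi) \,\|\, p(\theta) \big)

Here p(θ) is the prior over the BNN weights. Maximizing L[q] with respect to φ trades off fitting the observed transitions (the expected log-likelihood term) against staying close to the prior (the KL term).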

To incentivise the agent to perform actions that yield more information, an additional intrinsic reward is given to the agent. Rather than computing the information gain explicitly, an approximation is used, leading to the following total reward:
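r'(s_t, a_t, s_{t+1}) = r(s_t, a_t) + \eta \, D_{KL}\big( q(\theta; \phi_{t+1}) \,\|\, q(\theta; \phi_t) \big)

Here r(s_t, a_t) is the external reward from the environment, φ_t are the variational parameters before the new transition is taken into account, and φ_{t+1} are the parameters after updating on it [1].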

The hyperparameter η controls the amount of incentive to explore (the curiosity). Since we assume that the variational approximation is a fully factorized Gaussian, the KL divergence from posterior to prior has a particularly simple form:
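D_{KL}\big( q(\theta; \phi) \,\|\, q(\theta; \phi') \big) = \frac{1}{2} \sum_{i=1}^{|\Theta|} \left[ \left( \frac{\sigma_i}{\sigma'_i} \right)^2 + 2 \log \sigma'_i - 2 \log \sigma_i + \frac{(\mu'_i - \mu_i)^2}{(\sigma'_i)^2} \right] - \frac{|\Theta|}{2}

Here φ = (μ, σ) and φ' = (μ', σ') collect the means and standard deviations of the two fully factorized Gaussians, and |Θ| is the number of weights.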

Thus, using the information gain in this learned dynamics model as an intrinsic reward allows the agent to optimize for both external reward and intrinsic surprise simultaneously. Empirical results show that VIME performs significantly better than heuristic exploration methods across various continuous control tasks and algorithms, and that augmenting an RL algorithm with VIME brings significant improvements in the face of sparse reward signals.
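To make the flow concrete, here is a minimal Python sketch of how the intrinsic reward could be computed from the BNN's variational parameters before and after an update. The toy numbers, the eta value, and the array shapes are made-up assumptions; a real implementation would obtain the updated parameters by actually training the BNN on the new transition.

```python
import numpy as np

def factorized_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old):
    """Closed-form KL( N(mu_new, sigma_new^2) || N(mu_old, sigma_old^2) ),
    summed over all independent weight dimensions of the BNN."""
    return 0.5 * np.sum(
        (sigma_new / sigma_old) ** 2
        + 2.0 * np.log(sigma_old)
        - 2.0 * np.log(sigma_new)
        + (mu_old - mu_new) ** 2 / sigma_old ** 2
    ) - 0.5 * mu_new.size

# Toy variational parameters of the dynamics BNN before and after
# updating on a single new transition (values are illustrative only).
mu_old, sigma_old = np.zeros(10), np.full(10, 0.5)
mu_new, sigma_new = mu_old + 0.05, sigma_old * 0.95

eta = 0.1                 # curiosity weight (hyperparameter)
extrinsic_reward = 1.0    # reward returned by the environment
intrinsic_reward = factorized_gaussian_kl(mu_new, sigma_new, mu_old, sigma_old)

total_reward = extrinsic_reward + eta * intrinsic_reward
print(total_reward)
```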
