# HOMER: Provable Exploration in Reinforcement Learning

This week at ICML 2020, Mikael Henaff, Akshay Krishnamurthy, John Langford, and I have a paper presenting a new reinforcement learning (RL) algorithm called HOMER that addresses three main problems in real-world RL: (i) exploration, (ii) decoding latent dynamics, and (iii) optimizing a given reward function. The arXiv version of the paper can be found here, and the ICML version will be released soon.

The paper is mathematically heavy in nature, and this post is an attempt to distill the key findings. We will also be following up soon with a new codebase release (more on that later).

# Rich-observation RL landscape

The environment makes it difficult to reach the big treasure chest in three ways. First, the dynamics are such that from a good state, only 1 out of the 10 possible actions leads to one of the two good states at the next time step (each with equal probability), and the good action changes from state to state. Every other action in a good state, and all actions in bad states, put you into a bad state at the next time step, from which it is impossible to recover. Second, the environment tries to mislead you by giving a small myopic bonus for transitioning from a good state to a bad state (the small treasure chest). This means that a locally optimal policy is to transition to one of the bad states as quickly as possible. Third, the agent never directly observes which state it is in. Instead, it receives a high-dimensional, noisy observation from which it must decode the true underlying state.

It is easy to see that if we take actions uniformly at random, then the probability of reaching the big treasure chest at the end is 1/10^{100}. The number 10^{100} is called a googol and is larger than the current estimate of the number of elementary particles in the universe. Furthermore, since the transitions are stochastic, one can show that no fixed sequence of actions performs well either.

A key aspect of the rich-observation setting is that the agent receives observations instead of the latent state. The observations are stochastically sampled from an infinitely large space, conditioned on the latent state. However, the observations are rich enough to enable decoding the latent state which generates them.
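To make the setting above concrete, here is a minimal sketch of such a combination-lock environment, not the paper's actual code. The class name, reward values, and observation scheme (a one-hot of the latent state plus Gaussian noise) are my own illustrative assumptions:

```python
import numpy as np

class CombinationLock:
    """Sketch of a combination-lock environment with rich observations.

    Latent states per time step: two good states (0, 1) and one bad state (2).
    Exactly one of `num_actions` actions continues from a good state to one
    of the two good states (chosen uniformly); every other action, and every
    action in the bad state, leads to the bad state, which is absorbing.
    The agent only ever sees a noisy high-dimensional observation.
    """

    def __init__(self, horizon=100, num_actions=10, obs_dim=16, seed=0):
        self.rng = np.random.default_rng(seed)
        self.horizon = horizon
        self.num_actions = num_actions
        self.obs_dim = obs_dim
        # The good action changes from state to state and time step to time step.
        self.good_action = self.rng.integers(num_actions, size=(horizon, 2))

    def reset(self):
        self.h, self.state = 0, 0  # start in a good state
        return self._observe()

    def step(self, action):
        prev = self.state
        if prev < 2 and action == self.good_action[self.h, prev]:
            self.state = int(self.rng.integers(2))  # either good state, equally likely
        else:
            self.state = 2  # bad state: impossible to recover
        self.h += 1
        done = self.h == self.horizon
        if prev < 2 and self.state == 2:
            reward = 0.1   # small treasure chest: myopic bonus for leaving good states
        elif done and self.state < 2:
            reward = 1.0   # big treasure chest at the end
        else:
            reward = 0.0
        return self._observe(), reward, done

    def _observe(self):
        # Rich observation: one-hot latent state corrupted by Gaussian noise.
        x = np.zeros(self.obs_dim)
        x[self.state] = 1.0
        return x + 0.1 * self.rng.standard_normal(self.obs_dim)
```

A uniformly random policy on this environment succeeds with probability (1/10)^{horizon}, which for horizon 100 is the 1/10^{100} figure above.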

# What does provable RL mean?

A provable RL algorithm is one that is guaranteed to learn a close-to-optimal policy with high probability (where "high" and "close" can be made arbitrarily precise), provided the assumptions it makes are satisfied. There are other ways to define a provable RL algorithm, and other considerations, but we won't cover them here.

# Why should I care if my algorithm is provable?

1. We can only test an algorithm on a finite number of environments (in practice somewhere between 1 and 20). Without guarantees, we don’t know how they will behave in a new environment. This matters especially if failure in a new environment can result in high real-world costs (e.g., in health or financial domains).
2. If a provable algorithm fails to give the desired result, this can be attributed with high probability to failure of at least one of its assumptions. A developer can then look at the assumptions and try to determine which ones are violated, and either intervene to fix them or determine that the algorithm is not appropriate for the problem.

# HOMER

For each time step, HOMER learns a state decoder function and a set of exploration policies. The state decoder maps high-dimensional observations to a small set of possible latent states, while the exploration policies map observations to actions which will lead the agent to each of the latent states. We describe HOMER below.

• For a given time step, we first learn a decoder that maps observations to a small set of values using contrastive learning. This procedure works as follows. First, we collect a transition by following a randomly sampled exploration policy from the previous time step up to that time step, and then taking a single random action. We collect two transitions this way, as shown below.
• We then flip a coin; if we get heads, we store the real transition (x1, a1, x’1), and otherwise we store the imposter transition (x1, a1, x’2). We train a supervised classifier to predict whether a given transition (x, a, x’) is real. This classifier has a special structure which allows us to recover a decoder for time step h.
• Once we have learned the state decoder, we learn an exploration policy for every possible value of the decoder (which we call an abstract state, as these values are related to the latent state space). This step is standard and can be done using many different approaches such as model-based planning, model-free methods, etc. In the paper, we use an existing model-free algorithm called Policy Search by Dynamic Programming (PSDP) by Bagnell et al., 2004.
• We have now recovered a decoder and a set of exploration policies for this time step. We repeat this procedure for each of the following time steps and learn a decoder and a set of exploration policies for the whole latent state space. Finally, we can easily optimize any given reward function using any provable planner such as PSDP or a model-based algorithm. (The algorithm actually recovers the latent state space up to an inherent ambiguity by combining two different decoders, but I’ll leave that out to avoid overloading this post.)
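The real-versus-imposter labelling in the steps above can be sketched as follows. This is only an illustration of the data-collection scheme, not the paper's implementation: in HOMER the classifier has a special bottleneck structure whose internal representation yields the decoder, which this sketch does not reproduce, and the function name is my own:

```python
import numpy as np

def make_contrastive_dataset(transitions, rng):
    """Sketch of HOMER-style real-vs-imposter labelling.

    `transitions` is a list of tuples (x, a, x_next) collected by rolling in
    with a randomly sampled exploration policy and taking one random action.
    For each example we flip a fair coin: heads keeps the real next
    observation (label 1); tails swaps in the next observation from an
    independently collected transition, producing an imposter (label 0).
    A classifier trained on this data must learn which (x, a) pairs can
    actually lead to which x', which is what makes decoding possible.
    """
    data = []
    for (x, a, x_next) in transitions:
        if rng.random() < 0.5:
            data.append((x, a, x_next, 1))  # real transition
        else:
            j = int(rng.integers(len(transitions)))  # independent imposter x'
            data.append((x, a, transitions[j][2], 0))
    return data
```

Any standard supervised classifier can then be trained on these labelled quadruples; the paper's contribution is the classifier structure that turns this contrastive problem into a state decoder.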

# Key findings

1. The contrastive learning procedure gives us the right state decoding (we recover it up to some inherent ambiguity, but I won’t cover that here).
2. HOMER can learn a set of exploration policies to reach every latent state.
3. HOMER can learn a nearly-optimal policy for any given reward function with high probability. Further, this can be done after the exploration phase has been completed.

# Failure cases of prior RL algorithms

Below we show how PPO+RND fails to solve the above problem while HOMER succeeds. We simplify the visualization by using a grid pattern where rows represent the latent states (the top two rows represent “good” states and the bottom row represents “bad” states), and columns represent time steps.

We present counterexamples for other algorithms in the paper (see Section 6 here). These counterexamples allow us to find the limits of prior work without expensive empirical evaluation on many domains.

# How can I use HOMER?

Machine learning and NLP Researcher at Microsoft Research, New York. https://dipendramisra.com/.