The Representation-driven Option Discovery Cycle

Marlos C. Machado
Jun 30, 2023


I really believe that our reinforcement learning agents should be able to do temporal abstraction, and I particularly like the options framework. However, for the last 25 years we have had this open question of where options should come from. To put it differently, how do we learn temporal abstractions?!

So, for several years now (see my PhD thesis, for example), my pitch has been that the option discovery process should be seen as a cycle that instantiates a constructivist approach, in which options discovered in previous iterations act as a scaffold for more complex behaviors discovered in subsequent iterations. More operationally, the agent’s representation is used to identify useful options, which are then used to further improve that representation. This results in a virtuous, never-ending cycle in which both the representation and the options are constantly refined based on each other. This is depicted in the figure below.

The representation-driven option discovery cycle (ROD cycle)

The question, though, is how to instantiate each one of the boxes above. I recently wrote this very long paper at JMLR summarizing many ideas around this topic. This blog post covers a tiny portion of that paper: the general idea behind the framework, how to instantiate it, and what we can do with it.

Instantiating the ROD cycle

So, when talking about instantiating the ROD cycle, the JMLR paper I mentioned makes a somewhat strong claim: “the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions”. There are reasons for that, and I won’t write about them here, but they matter when instantiating this framework (see the aforementioned JMLR paper, for example). The proposed instantiation is that you interact with the environment following a (roughly) random policy, you learn the successor representation (SR) induced by your experience, and then you learn options that maximize a function of the eigenvectors of the SR. We use the eigenvectors of the SR because they capture large-scale temporal properties of the environment’s dynamics, diffusion properties at different time scales, which are quite useful for “diffusing” through the environment, that is, for exploring.
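To make the representation step concrete, here is a minimal sketch (mine, not code from the paper) of what this looks like in the tabular case: estimate the SR with TD(0) from the agent’s own transitions, then take an eigenvector of the (symmetrized) estimate. The function names, the symmetrization, and returning only the top eigenvector are simplifying assumptions for illustration.

```python
import numpy as np

def estimate_sr(transitions, n_states, gamma=0.95, alpha=0.1):
    """TD(0) estimate of the successor representation from the agent's own
    (s, s') transitions, i.e., the empirical SR induced by whatever policy
    generated the data."""
    psi = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        indicator = np.zeros(n_states)
        indicator[s] = 1.0
        # Psi(s, .) <- Psi(s, .) + alpha * [1_s + gamma * Psi(s', .) - Psi(s, .)]
        psi[s] += alpha * (indicator + gamma * psi[s_next] - psi[s])
    return psi

def top_sr_eigenvector(psi):
    """Eigenvector of the SR with the largest eigenvalue. Symmetrizing first
    is a convenience so the spectrum is real-valued."""
    eigvals, eigvecs = np.linalg.eigh((psi + psi.T) / 2.0)
    return eigvecs[:, -1]
```

In expectation, the tabular SR converges to (I − γP)⁻¹ for the policy that generated the data, so its eigenvectors really are tied to the policy-induced dynamics of the environment, which is exactly the diffusion information I mentioned above.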

If you want to learn options that are useful for planning, you can use the eigenvectors of the SR to identify bottleneck states, which are thought to be pretty good for planning (see, for example, the paper by Solway et al.). If you want to use options to do exploration, I claim you should use the eigenvectors of the SR to maximize diffusion in the environment. This is where my idea of eigenoptions came from.
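Roughly, the eigenoption associated with an eigenvector e is the option whose policy maximizes the intrinsic reward r(s, s′) = e(s′) − e(s), terminating in states where no action yields positive value. Below is a minimal sketch of how one could compute such a policy in the tabular case; the model-based setup (a known transition tensor P) and the function names are simplifications for illustration, and in practice these policies are learned from samples.

```python
import numpy as np

def eigenoption_policy(P, e, gamma=0.95, n_iterations=500):
    """Internal policy of the eigenoption for eigenvector `e`, obtained by
    value iteration on the intrinsic reward r(s, s') = e[s'] - e[s].
    P has shape (n_actions, n_states, n_states); assuming it is known is a
    simplification for this sketch."""
    n_actions, n_states, _ = P.shape
    expected_gain = np.einsum('asn,n->as', P, e) - e[None, :]  # E[e(s') - e(s) | s, a]
    v = np.zeros(n_states)
    for _ in range(n_iterations):
        q = expected_gain + gamma * (P @ v)
        # States where no action has positive value act as termination states (value 0).
        v = np.maximum(q.max(axis=0), 0.0)
    q = expected_gain + gamma * (P @ v)
    return q.argmax(axis=0), q.max(axis=0) <= 0.0  # per-state action, termination set
```

Different eigenvectors give different options, each one “diffusing” along a different time scale of the environment’s dynamics.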

I find the idea of using options for exploration quite interesting, actually; by now we have used them (handcrafted) even to fly balloons in the stratosphere. Nevertheless, for the longest time it was not clear to me how to discover options in a robust and general way, especially when doing it over multiple iterations of a cycle. The intuition was always clear to me, though: at first the agent doesn’t know much about the environment, so it chooses actions uniformly at random. This leads to some coverage, but not much. We then learn options that take the agent to the boundary of where it has been, so that the next time around the agent can start from a different frontier and keep pushing. The figure below, adapted from another paper of mine, captures this intuition quite well:

Intuition behind temporally-extended exploration with options, adapted from Erraqabi et al. (2022).

But this is too vague. How do we instantiate this framework concretely? Well, here’s the proposal: (1) we interact with the world according to a uniform random policy over the primitive actions and the options currently available to the agent; (2) we learn the successor representation from that experience and compute its eigenvectors, and that’s the representation; (3) we then learn something akin to eigenoptions, which we call covering eigenoptions, by maximizing a reward function that incentivizes the agent to reach the state to which the eigenvector of the SR assigns the largest value. This is a state the agent hasn’t visited much, and it sits in a region that hasn’t been visited much either. Notice that the SR we are learning here is the one induced by the agent’s trajectory, the empirical SR if you will, not the true one we would converge to if we waited long enough. (4) The agent then incorporates the learned option into its option set and does it all again.
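To make this less abstract, here is a minimal sketch of what one iteration of this loop could look like in a tabular gridworld. To be clear, this is an illustrative simplification, not the paper’s algorithm: step_fn, the one-step treatment of options during data collection, the plain Q-learning used to learn the new option, and the way the eigenvector’s sign is resolved are all stand-ins I made up for the sketch.

```python
import numpy as np

def rod_cycle_iteration(step_fn, n_states, n_actions, options, n_steps=2000,
                        gamma=0.95, alpha=0.1, seed=0):
    """One iteration of the ROD cycle with a covering-eigenoption-style reward.
    `step_fn(s, a)` returns the next state; `options` is the current list of
    learned option policies (per-state action arrays). Hypothetical sketch."""
    rng = np.random.default_rng(seed)

    # (1) Behave uniformly at random over primitive actions and current options.
    s, transitions = 0, []
    for _ in range(n_steps):
        choice = rng.integers(n_actions + len(options))
        if choice < n_actions:
            a = choice                              # primitive action
        else:
            a = options[choice - n_actions][s]      # one step of the option's policy
                                                    # (a real option runs until termination)
        s_next = step_fn(s, a)
        transitions.append((s, s_next))
        s = s_next

    # (2) Representation: the empirical SR induced by this trajectory, via TD(0).
    psi = np.zeros((n_states, n_states))
    for si, sj in transitions:
        target = np.eye(n_states)[si] + gamma * psi[sj]
        psi[si] += alpha * (target - psi[si])

    # (3) Goal: the state the chosen eigenvector of the SR values the most.
    e = np.linalg.eigh((psi + psi.T) / 2.0)[1][:, -1]
    if e.sum() < 0:
        e = -e                                      # eigenvectors are defined up to sign
    goal = int(np.argmax(e))

    # (4) Options: learn a policy rewarded only for reaching `goal` (Q-learning here
    #     as a stand-in) and add it to the option set for the next iteration.
    q = np.zeros((n_states, n_actions))
    for _ in range(200):
        s = 0
        for _ in range(200):                        # episode cap
            a = rng.integers(n_actions) if rng.random() < 0.1 else int(q[s].argmax())
            s_next = step_fn(s, a)
            reward, done = (1.0, True) if s_next == goal else (0.0, False)
            q[s, a] += alpha * (reward + gamma * (0.0 if done else q[s_next].max()) - q[s, a])
            s = s_next
            if done:
                break
    return options + [q.argmax(axis=1)], psi
```

Each call adds one option to the set, and repeating the loop is what produces the behavior described in the figures below.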

This works! We can learn options, in a very general and domain-agnostic way, that consistently take the agent to the “boundary of its knowledge”: options that do things the agent knows it can do but that at first are hard to do. In our JMLR paper we have shown that in the four-room domain, for example, when compared to the random policy, we can reduce by a whole order of magnitude the expected number of timesteps the agent needs in order to visit every state in the environment. The figure below shows the first four iterations of an individual run to help you see how this actually happens.

Illustration of multiple iterations of the ROD cycle when instantiated with covering eigenoptions. In iteration 1, the agent collects samples with a random walk and learns a representation such that it is incentivized to visit states it has not visited often. This leads to an option that takes the agent closer to the hallway between the top-right and bottom-right rooms. In iteration 2, the agent ends up closer to that hallway once it eventually samples the learned option; a random walk then leads the agent to the bottom-right room. The learned representation still puts a lot of mass on the states often visited in the first iteration, but the reward function is now defined in more states. Aside from the option that takes the agent to the aforementioned hallway, the agent now also discovers an option that, when in the top-right room, takes the agent to the hallway between the top-right and top-left rooms or, when in the bottom-right room, takes the agent close to the hallway between the bottom-right and bottom-left rooms. At the third iteration the agent thus has access to two options. By chance, the agent eventually samples the option that takes it to the state close to the hallway between the top-right and top-left rooms, with the random walk then, also by chance, taking the agent to the top-left room. The SR induced at the end of these three iterations leads to the discovery of a third option that, from most states, takes the agent to the state close to the hallway between the bottom-right and bottom-left rooms. By chance, the agent ends up sampling that option in the fourth iteration, finally visiting the room that had not been visited yet.

Wrapping up

Hopefully I was able to convey why I’m excited about this representation-driven option discovery framework. I also understand if you are a little skeptical about it; the results above, for example, are in the tabular case, and you might be wondering whether we can actually scale this up. Also, I’m talking about the eigenfunctions of the SR, and you might be wondering about the computational cost of such a process; maybe you don’t like the staged way I set up the algorithm, with very separate phases; or maybe you are put off by the fact that we are using a random policy and the SR is conditioned on the policy. I get it. As I said, I have been working on these ideas for quite some time and, one way or another, I have heard different versions of these comments. I mean, the first time I put something out about ideas related to this was in 2016, in a paper showing the benefits of multiple iterations of what I now call the ROD cycle. Over the years we (and others) have been making steady progress on this vision, but at least one of the questions above would always remain.

This year, 2023, has been great for this line of work because we wrote two papers I’m really proud of. The first one is the one I just talked about (pieces of it, anyway); it outlines this research vision and goes back to the basics to make meaningful conceptual contributions. The second one is a paper that was accepted at ICML (see tweet), and it is the paper that, I want to say, provides an answer to all of the questions above! I’ll write about that paper in my next post, in a week or so (hopefully). In fact, when I first started writing this, the goal was to write about the ICML paper, but then I realized I needed to first write about the representation-driven option discovery framework, and it seemed wiser to split the post in two 😅.

Acknowledgments

I want to thank Martin Klissarov for his feedback on an earlier draft of this post.
