Deep Laplacian-based Options for Temporally-Extended Exploration

Marlos C. Machado
Jul 14, 2023

At the end of the month, Martin Klissarov is going to be presenting our paper, Deep Laplacian-based Options for Temporally-Extended Exploration, at ICML in Hawaii. I really wish I could go to ICML this year but apparently I need to wait 500 days to get an interview at the US consulate to renew my visa ¯\_(ツ)_/¯

Anyways, today I want to write about this paper, which I’m extremely proud of. As I said in my tweet when it came out, it fulfills a vision I have had about this line of research for 7+ years! Also, to say it one more time: this work was led by Martin, who is amazing! One more thing before I actually start: this post pretty much assumes you have read my previous blog post. I had actually started writing this one first, but I realized that a post about our ICML paper would be too long with all the preamble, so I turned the preamble into a blog post of its own.

Earlier this year I had this very long paper published at JMLR. When I first started writing it, I planned for it to be the last paper I would write on the topic, leaving it to someone else to pick up where I left off. However, the paper ended up with lots of really cool new results, and I found myself very excited about this line of work again, with several promising research opportunities. One of them was extending the ROD cycle beyond the toy task I used in the JMLR paper. This is what our ICML paper (and this blog post) is about.

Let’s go back to the ROD cycle I talk about in my JMLR paper and in the previous blog post:

The Representation-driven Option Discovery cycle (Machado et al., 2023).

This cycle can potentially be quite impactful (as demonstrated in the simple tabular domain in the previous post). However, as in every cycle where one component receives as input the output of another, you need to get a lot of things right. Each component or arrow in the cycle could be a whole paper by itself, and integration is a major challenge in AI. Implementing this cycle in somewhat challenging environments requires us to deal with many things, such as learning a representation, estimating the eigenfunctions of the successor features (or of the graph Laplacian, which is equivalent), learning a policy with deep reinforcement learning algorithms, defining the termination function of options in the face of approximation, integrating reward maximization, and so on. What this means is that every time I pitched this idea to other people, they would bring up one of these issues or some other concern. In fact, I have been compiling a laundry list of common complaints (sometimes rightly so) that I have heard over the years; to name a few:

  1. This is only tabular, what about general function approximation?
  2. You are not learning features, that is, you are not “deep”.
  3. An eigendecomposition is expensive, it is O(n³), this is not scalable.
  4. You are not maximizing rewards, RL is about reward maximization.
  5. The uniform random policy is not a good baseline, you should use X.
  6. You have special phases, you can’t just wait for a long time for a phase to finish.
  7. Does this work beyond gridworlds / navigational tasks?
  8. What if the environment is not symmetric? You can’t assume symmetry.
  9. The SR is policy dependent, are you dependent on the random policy?

The reason I am very excited about this paper we are presenting at ICML is that it addresses every single one of these complaints. The paper itself is not written this way, of course, but I like to present it in light of these common complaints because it highlights how much Martin did in a single project.

I’ll refrain from talking about the algorithm itself because that’s described in the paper. Instead, I want to use this post to present the paper from a different perspective: how it addresses the common complaints about this line of work. I’ll do that mostly by discussing some of our empirical results. To just hint at the algorithm itself, here’s a figure from the paper that provides a general overview of our approach, which we call Deep Covering Eigenoptions (DCEO).

Deep Covering Eigenoptions (DCEO) algorithm. The agent receives rewards and observations and leverages them to learn the Laplacian representation which encodes the environment’s topology at different timescales (Mahadevan, 2005; Mahadevan & Maggioni, 2007). Using this representation the agent derives a set of intrinsic rewards to learn exploratory options. It then selects actions either by being greedy w.p. 1 − ϵ, or by exploring w.p. ϵ. When taking exploratory actions these may come either from a random policy w.p. 1 − µ, or from the set of options w.p. µ. Once selected, the agent acts according to the option’s policy until termination.
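To make this description a bit more concrete, here is a minimal sketch of the behaviour policy the caption describes, together with an eigenoption-style intrinsic reward (the change along one dimension of the learned Laplacian representation). This is not the paper’s code: the names and the option interface (act, terminates) are hypothetical, and the exact reward and bookkeeping in DCEO may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def intrinsic_reward(rep, s, s_next, i):
    """Eigenoption-style reward for option i: the increase along the i-th
    dimension of the learned Laplacian representation rep(.)."""
    return rep(s_next)[i] - rep(s)[i]

def select_action(q_values, options, state, epsilon, mu, current_option=None):
    """Greedy on the task Q-values w.p. 1 - epsilon; otherwise explore, either with
    a primitive random action (w.p. 1 - mu) or by launching a randomly chosen option
    (w.p. mu), which is then followed until it terminates."""
    if current_option is not None:
        if current_option.terminates(state):
            current_option = None                        # option finished; fall through
        else:
            return current_option.act(state), current_option
    if rng.random() < epsilon:                           # exploratory step
        if options and rng.random() < mu:
            current_option = options[rng.integers(len(options))]
            return current_option.act(state), current_option
        return int(rng.integers(len(q_values(state)))), None   # random primitive action
    return int(np.argmax(q_values(state))), None         # greedy w.r.t. the task reward
```

In the paper, the option policies themselves are trained with standard deep reinforcement learning agents on these intrinsic rewards; the snippet above only captures how the behaviour policy mixes options into exploration.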

This is only tabular / You are not learning a representation / Eigendecomposition is expensive / You need better baselines

These complaints I can totally agree with; potentially they are deal breakers. Obviously, especially when talking about quite novel ideas, sometimes you need to develop things first and scale them up later, so we shouldn’t put the cart before the horse. Still, the concern is totally valid: we don’t want to come up with ideas that don’t scale. This is particularly relevant when talking about eigendecompositions, which have a cubic cost.

All that being said, I was never really concerned about it because back in 2018 I had already started working on this, showing how we could use neural networks to learn successor features [ICLR paper]. The solution for getting the singular vectors was not great at all, but that paper back then already hinted that a solution was possible. Importantly, that paper proved the equivalence between the eigenvectors of the successor representation and those of the graph Laplacian (proto-value functions). This is a big deal because it unlocked other areas of research for us. Lo and behold, other people were also thinking about similar problems; Wu et al. (ICLR 2019) and Wang et al. (ICML 2021), for example, came up with objective functions we could use to train neural networks, incrementally, to approximate the eigenfunctions of the graph Laplacian. We used Wang et al.’s (2021) paper to scale things up, while implicitly learning a representation of the problem we were tackling. We then also used standard deep reinforcement learning algorithms (e.g., double DQN, Rainbow) to learn option policies.
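To give a flavour of what training a neural network to approximate the eigenfunctions of the graph Laplacian looks like, here is a heavily simplified sketch of such an objective. The actual losses in Wu et al. (2019) and Wang et al. (2021) are different (the latter, roughly speaking, weighs the dimensions so that they converge to the individual eigenfunctions rather than just to the subspace they span); the version below, with my own notation and a simplified penalty, only keeps the two essential ingredients: pull together the embeddings of states that are one step apart, and softly enforce orthonormality on independently sampled states.

```python
import torch

def laplacian_rep_loss(phi, s, s_next, u, v, beta=1.0):
    """Simplified Laplacian-representation loss (a stripped-down sketch, not the
    exact objective from either paper).

    phi: network mapping a batch of states to d-dimensional embeddings.
    (s, s_next): states one transition apart; (u, v): independently sampled states.
    """
    # Attractive term: temporally adjacent states should have similar embeddings.
    attract = ((phi(s) - phi(s_next)) ** 2).sum(dim=1).mean()

    # Soft orthonormality: embeddings of independent states should be nearly
    # orthogonal, and each embedding should have roughly unit norm.
    fu, fv = phi(u), phi(v)
    ortho = ((fu * fv).sum(dim=1) ** 2).mean()
    norm = (((fu ** 2).sum(dim=1) - 1.0) ** 2).mean()

    return attract + beta * (ortho + norm)
```

Minimizing an objective of this form pushes the d outputs of phi toward the smallest eigenfunctions of the graph Laplacian induced by the transition data (or at least toward their span), which is exactly the representation the options are built from.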

So there you have it: the first three concerns are addressed. So that each section has a result to go with it, here’s a plot from the appendix of our paper; it shows state coverage under deep function approximation.

Finally, as you can see in the plot, we do have much better baselines than the random walk, baselines that are often said to achieve state-of-the-art performance in different problems.

You are not maximizing rewards

Good, so the first box is checked (we do more in the paper even before the result above, but let’s move on). Now, the real concern is that reinforcement learning is about maximizing rewards, and so far I keep evaluating state coverage. Well, the whole paper is actually about performance in terms of reward maximization, so that’s addressed as well. See the results below, obtained with non-linear function approximation throughout, while learning to maximize rewards. Notice that, as we do throughout the whole paper, we compare DCEO against other very competitive baseline methods.

[A random rant before moving on: notice I’m refraining from saying “we are better”, “we outperform”, “we achieve state-of-the-art”. I find these claims almost meaningless, and most of the time they are written only for reviewers. I’m excited about these results because they show we are competitive, in different environments, with algorithms that have been shown to perform well across different environments as well. You should see this as an alternative approach for exploration that seems to work really well. Radically different exploration approaches don’t appear every day.]

You have special phases

If you look at the previous plot, you’ll notice it has a shaded region denoting the option-discovery phase. It is there because we obviously need to learn options before using them to explore the world, and we didn’t want to assume we start with the options already given. The shaded region emphasizes that our approach’s starting point is offset so the comparison doesn’t favour it. So although we are being fair, it is still kind of wonky. How do we choose the length of that period? What if the problem changes? Does the problem formulation even allow us to do that? I really like talking about the different phases of the ROD cycle for conceptual clarity, but clearly we need them to happen simultaneously, at different timescales. And when we do that, it works!

I have to give props to Martin here. When he told me he was going to run this experiment, as much as I wanted it to work, I told him he was crazy, that it would never work. Guess who was right? One interesting reason this works is that, apparently, we can learn the eigenfunctions of the Laplacian much more quickly than we can learn a policy with our current deep reinforcement learning algorithms; effectively, representation learning happens on a much shorter timescale.
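For concreteness, here is what removing the separate phases amounts to, as a sketch with hypothetical interfaces (replay, rep, options, and task_policy are my names, not the paper’s): a single stream of experience updates the Laplacian representation, every option policy, and the task policy on every step, and the different timescales emerge simply because the representation converges faster than the policies.

```python
def train_step(transition, replay, rep, options, task_policy):
    """One step of fully concurrent training: no separate option-discovery phase."""
    replay.add(transition)
    batch = replay.sample()

    rep.update(batch)                 # Laplacian representation (converges fastest)
    for i, option in enumerate(options):
        # Each option is a standard deep RL learner trained on its own intrinsic
        # reward, derived from the current representation.
        option.update(batch, rewards=rep.intrinsic_rewards(batch, dim=i))
    task_policy.update(batch)         # main policy trained on the extrinsic reward
```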

Does this work beyond gridworlds? / What if the environment is not symmetric?

At this point I was already very excited because we had made so much progress from where we were before. By the way, below is a picture of the environments we were using:

Importantly, in the first two, which are gridworlds, the agent’s observations were pixels. We also experimented with (x, y) coordinates and it worked just as well as pixels, but I think we ended up not putting those results in the paper. Anyways, at this point another common complaint was in the back of my mind: these were still gridworlds.

It was time to try more difficult tasks: tasks with additional complexity such as handling objects, and even harder ones such as Atari 2600 games and 3D environments with partial observability (MiniWorld-FourRooms-v0 and MiniWorld-MyWayHomeSparse-v0). And again, it works and it is quite competitive! Below is DCEO’s performance in the 3D environments and in the Atari 2600 game Montezuma’s Revenge.

Notice that some of these results are in asymmetric environments. In Montezuma’s Revenge, for example, the agent cannot “un-jump”, dying is irreversible, and once the skull is killed the agent cannot bring it back, among other things. But let me show you one more result that I think makes this point even clearer:

In the first task the agent can’t control the red objects, which move up and down on their own; from the agent’s perspective, nothing there is reversible. An even more prominent example is the Obstructed Key environment. Here the agent needs to get to the blue wall and break it so it can collect the key and open the door. The agent cannot put the wall back. The agent needs to change the dynamics/topology of the environment in order to succeed. Also, notice the importance of having all phases happening at the same time; otherwise the agent would have to wait for a whole iteration of the cycle for its representation to capture that the topology of the environment had changed.

The SR is policy dependent, are you dependent on the random policy?

To conclude, I also want to say that, aside from the very first results, we did not rely on a uniform random policy when learning the eigenfunctions of the Laplacian (successor representation). We learned them from the state distribution induced by the agent’s current policy. Clearly we do not need to condition what we are doing on the random policy.

Conclusion

There you have it! One of the reasons I’m excited about this paper we are presenting at ICML is that it fulfills a vision I had more than 7 years ago. It addresses pretty much all of the common complaints I have been hearing for the last several years. I’m sure there will be more, but this is huge progress in this line of research.

I’m also very excited because we now have an option-discovery method that is general, that works, that demonstrates a cycle in which each iteration serves as a scaffold for the next, that is fully experiential, scalable, and amenable to function approximation, that works for different data streams, and that doesn’t make any assumptions about the topology of the environment (no domain knowledge). As someone who has worked on option discovery for a long time, I can’t avoid thinking this is a big deal.

Now, to wrap up, maybe you are still not convinced you need to bother with temporal abstractions. What’s their point? They are certainly not supposed to be seen as just another exploration method. They have several selling points; the one I want to emphasize here is their reusability. If you know me, you know I’ve been excited about continual reinforcement learning. In such a setting, what does exploration look like? Well, having skills that allow you to quickly navigate through the environment seems like kind of a big deal. Importantly, these skills are reusable, while things such as counts or surprise are useless if the world changes. Another result we have in this paper is about using options for continual exploration, and this also seems like a promising research path.

Hopefully I managed to make you at least a little excited about all of this. If you happen to be going to ICML at the end of the month, feel free to pass by our poster and say hi to Martin!

Acknowledgments

I want to thank Martin Klissarov and Craig Sherstan for their feedback on an earlier draft of this post.

