Yet Another Hindsight Experience Replay: Refining the Plan

Francisco Ramos
7 min read · Aug 18, 2020


Image from Ingredients for Robotics Research

In the previous post I pointed out one of the issues in RL concerning the reward function, and laid out the main ideas behind Hindsight Experience Replay that can address it: Sparse Rewards, Learning from Mistakes and Multi-Goal Learning.

In this post I’d like to refine the goal of my research a bit.

This paper is three years old. In the world of Reinforcement Learning that’s actually quite old, and there is already a lot of literature about it — hence the title of this story, “Yet Another Hindsight Experience Replay”. But the idea fascinates me, and it got me hooked the moment I read the paper… twice… or more 😅.

I’ve spent some months researching, working on different implementations, trying various ideas, and banging my head against lots of walls. I wanted to go a bit further and not just reproduce the paper to the letter. What’s the fun if you simply stick to the paper, right? As I was searching the Internet, I realized that this algorithm has mostly been reproduced on the same environments used by the authors: BitFlip and the Robotics environments provided by OpenAI Gym. I think I also saw some attempt on a maze environment, but nothing else.

I found this a bit fishy. Like, maybe this only works on these environments? I was very curious to know whether it could be used, well, anywhere, even in environments that are not goal-conditioned. The environments mentioned above are all goal-conditioned, meaning they provide you with the tools to train an agent this way, such as multiple, random goals the agent needs to achieve… but how about my beloved LunarLander? 🥰 I like this environment for a few reasons:

  1. It’s not the easiest one: if the algorithm is not correctly implemented and tuned, it won’t be able to solve it.
  2. It’s very simple to use and, most importantly for this project, easy to modify and/or extend.
  3. It has a continuous state space and comes in both discrete- and continuous-action versions, so you can try a wide range of algorithms on it.
  4. It’s fun. I mean, how cool is it to safely land a spaceship, right?

Sparse, Goal-conditioned Environment

Let’s start off with the environment: LunarLander

I’m using the discrete version, so the spaceship has three engines: left, right and the main one pointing down. Firing an engine pushes the spaceship in the opposite direction. That gives three actions plus a “do nothing” action. The state consists of the position, angle and velocities (linear and angular) of the spaceship, plus two flags indicating whether each leg has touched the ground. The task is simple: land smoothly between the two flags marking the landing pad.
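If you want to sanity-check those numbers yourself, a quick look with Gym (the classic LunarLander-v2 id) is enough:

```python
import gym

env = gym.make("LunarLander-v2")

# 8-dimensional state: x, y, vx, vy, angle, angular velocity, left leg contact, right leg contact
print(env.observation_space.shape)  # (8,)

# 4 discrete actions: do nothing, fire left engine, fire main engine, fire right engine
print(env.action_space.n)  # 4
```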

Now, I need to tell the agent to reach a specific goal, that is: go to the landing pad, straight up, with zero or almost zero velocity. How do I do that? Let’s have a closer look at the state:

LunarLander source code

Alright, the centre of the landing pad is located at coordinates (0, 0). It should be reached with zero horizontal and vertical speed — it wouldn’t be a smooth landing otherwise — and of course zero angle and angular speed — landing on the side or upside-down? Nope. So this is our target state: [0, 0, 0, 0, 0, 0]. How about those last two state variables, the legs contact? Well, if the spaceship has zero angle and reaches coordinates (x, 0), I guarantee you the two legs have touched the ground, so they’re kind of redundant.

But I found this single goal quite boring. I mean, if we can condition on the goal, i.e. tell the agent which goal to reach, why not experiment with this and see the power of Multi-Goal learning to the fullest? This environment is an episodic task that ends when the spaceship either lands or crashes. Ok then, let’s make it reach any location we want and hover there forever 😜. The environment should provide us with a random goal to reach during training. Since we want the spaceship to hover, it needs to have zero velocity and zero angle at a random location. The goal would then look like this: [x, y, 0, 0, 0, 0], with x and y randomly selected within a valid range of values. During evaluation we can test any goal we want… even [0, 0, 0, 0, 0, 0] to actually solve the environment.
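Just to make it concrete, here’s what a couple of goals could look like (the numbers are purely illustrative):

```python
import numpy as np

# Training: a random hover goal, e.g. somewhere up and to the right of the pad
train_goal = np.array([0.4, 0.8, 0.0, 0.0, 0.0, 0.0])  # [x, y, vx, vy, angle, angular velocity]

# Evaluation: any goal we like, e.g. the landing pad itself to actually solve the environment
solve_goal = np.zeros(6)                                # [0, 0, 0, 0, 0, 0]
```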

Another aspect of the environment we want to change is the infamous reward function. Remember, we’re trying to avoid a dense reward function: highly informative, but complicated and prone to unintended behaviours. Now, have a look at the one from this environment:
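Here it is, condensed and paraphrased from the Gym source (from memory, so check the actual code for the exact details), just to give you an idea of how much is going on in there:

```python
import numpy as np

def dense_reward(state, prev_shaping, m_power, s_power):
    """Rough paraphrase of LunarLander's dense reward, not the literal Gym code."""
    x, y, vx, vy, angle, _, leg1, leg2 = state
    # Potential-based shaping: closer to the pad, slower, more upright, legs down -> higher
    shaping = (
        -100 * np.sqrt(x**2 + y**2)
        - 100 * np.sqrt(vx**2 + vy**2)
        - 100 * abs(angle)
        + 10 * leg1
        + 10 * leg2
    )
    reward = shaping - prev_shaping   # the agent is rewarded for *improving* the shaping
    reward -= 0.30 * m_power          # fuel penalty for the main engine
    reward -= 0.03 * s_power          # fuel penalty for the side engines
    # ...plus a terminal -100 if the lander crashes and +100 if it comes to rest
    return reward, shaping
```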

What if I told you that with HER we just need something along these lines:
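Conceptually, nothing more than a distance check against the goal (this is just a sketch of the idea, not the literal code in my wrapper):

```python
import numpy as np

def sparse_reward(state, goal, eps):
    # 0 if the relevant part of the state is "close enough" to the goal, -1 otherwise
    return 0.0 if np.linalg.norm(state[:len(goal)] - goal) <= eps else -1.0
```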

And that’s enough for the agent to learn. Cool, huh? Alright, I’m then gonna extend the environment with a wrapper that provides the agent with multiple goals to achieve…

…and modify the reward function to something simpler:
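Here’s a minimal sketch of what such a wrapper could look like. The names, the eps default and the goal-sampling ranges are illustrative; the real implementation is in the repo linked below:

```python
import gym
import numpy as np

class GoalConditionedLunarLander(gym.Wrapper):
    """Goal-conditioned LunarLander with a sparse reward (illustrative sketch)."""

    def __init__(self, env, eps=0.05):
        super().__init__(env)
        self.eps = eps        # tolerance used to declare success
        self.goal = None

    def _sample_goal(self):
        # Random hover target: [x, y, 0, 0, 0, 0], ranges are just illustrative
        x = np.random.uniform(-0.5, 0.5)
        y = np.random.uniform(0.1, 1.0)
        return np.array([x, y, 0.0, 0.0, 0.0, 0.0])

    def compute_reward(self, state, goal):
        # Sparse reward: 0 if close enough to the goal, -1 otherwise
        return 0.0 if np.linalg.norm(state[:6] - goal) <= self.eps else -1.0

    def reset(self, goal=None):
        # Random goal during training, or a hand-picked one (e.g. np.zeros(6)) during evaluation
        self.goal = self._sample_goal() if goal is None else goal
        return self.env.reset(), self.goal

    def step(self, action):
        state, _, done, info = self.env.step(action)   # drop the original dense reward
        reward = self.compute_reward(state, self.goal)
        return state, reward, done, info
```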

The idea behind this sparse reward is that we check the distance between the current state and the goal and see if it’s close enough to declare a success. We use the eps parameter as the tolerance.

That’s it, we’re done with the wrapper for this environment. You can find the full code here: https://github.com/jscriptcoder/Hindsight-Experience-Replay/blob/master/common/lunar_lander_wrapper.py

It’s all about HER

So, I have an environment with a continuous state space and a discrete action space. Which algorithm can we use here? Let’s see, continuous state space? We could use tile coding… Naaah, we don’t want to do that, although you’d be surprised how well tile coding works and helps an agent succeed in a continuous state space. Instead, let’s use function approximation. And what are the best function approximators? You guessed it, Neural Networks! Alright, how about dealing with discrete actions? Let’s not go too fancy and use a very old but powerful algorithm that marked a major turning point in the history of RL: Deep Q-Network (a.k.a. DQN). I’m not gonna go into details about this algorithm, since that’s not the purpose of this article, but I will mention, and leave some links for those who are interested, that I’m also using some of the improvements made after DQN was introduced back in 2013: Double DQN and the Dueling Network Architecture.
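For the curious, a goal-conditioned dueling Q-network in PyTorch looks roughly like this (layer sizes and names are illustrative, not necessarily the ones I ended up using):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling architecture: a shared body splits into value and advantage streams."""

    def __init__(self, state_size, goal_size, n_actions, hidden=256):
        super().__init__()
        # Goal-conditioned: the goal is simply concatenated to the state
        self.body = nn.Sequential(
            nn.Linear(state_size + goal_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # V(s, g)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, g, a)

    def forward(self, state, goal):
        x = self.body(torch.cat([state, goal], dim=-1))
        v, a = self.value(x), self.advantage(x)
        # Combine the streams; subtracting the mean advantage keeps Q identifiable
        return v + a - a.mean(dim=-1, keepdim=True)
```

Double DQN then only changes how the target is computed: the online network picks the argmax action for the next state and the target network evaluates it.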

So, let’s go back to HER. Another thing I really like about this idea is how easy it is to combine with any off-policy algorithm, with very few changes. Plug and play, basically. Although the training loop proposed in the paper doesn’t really suggest that:

HER paper

I have to say, at the beginning I spent a lot of time trying to reproduce that very same loop and adapt the DQN algorithm to it. Let’s see what DQN looks like and compare:

DQN paper

If you look closely, the major difference here is that DQN performs one optimization step over a random batch of transitions right after executing an action and storing the transition. This means there is one optimization per time step, whereas in the HER algorithm the optimizations happen after each episode, N times, where N is a hyperparameter. Well, I’d say this shouldn’t really affect the performance of the training, given the off-policy nature of these algorithms, but I didn’t like the idea of having to tune that hyperparam, and I wasn’t happy with changing the DQN algorithm… just because of my OCD 😅.

I also got a bit confused after reading this part:

HER paper

So it turns out that there are actually two other loops that are not shown in the pseudo-code: one running over epochs, and an inner one running over cycles, which in turn runs a certain number of episodes and then performs the optimizations. If you go back to the pseudo-code, those optimizations were happening at the end of each episode… soooo yes, a bit confusing. That’s a few more hyperparameters to tune. Another important detail is when the target networks are updated, which according to the paper happens after every cycle. But it’s not clear whether that means after running all the optimizations, or on every optimization step. Looking at the decay coefficient, 0.95, quite high I’d say, it looks like it should happen after running all the optimizations. But maybe I got this wrong.
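Putting my reading of this into a skeleton, it looks something like the snippet below. Everything here is schematic: the helpers are placeholders and the loop sizes are made-up values, not the ones from the paper.

```python
# Schematic training schedule as I understand it from the paper (placeholders, illustrative sizes)
def run_episode_and_store(replay): ...     # roll out one episode + HER relabelling
def optimization_step(replay): ...         # one gradient step on a sampled minibatch
def soft_update_target(decay): ...         # target <- decay * target + (1 - decay) * online

replay = []
num_epochs, num_cycles, episodes_per_cycle, opt_steps = 100, 50, 16, 40

for epoch in range(num_epochs):
    for cycle in range(num_cycles):
        for episode in range(episodes_per_cycle):
            run_episode_and_store(replay)
        for _ in range(opt_steps):
            optimization_step(replay)
        # Target network update once per cycle; my guess is after all the optimization
        # steps, given the high decay coefficient of 0.95 mentioned in the paper
        soft_update_target(decay=0.95)
```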

Anyway, other experiment details such as network architecture, state-goal distribution or the simulation itself were not relevant to me, since the paper uses a slightly different algorithm, DDPG, and a different environment.

In the next article I’ll talk about my struggles and how I decided to distance myself a bit from the paper’s initial setup… and got away with it.

See you next time.


Francisco Ramos

Machine and Deep Learning obsessive-compulsive. Passionate about Functional Programming. Frontend for a living.