Yet Another Hindsight Experience Replay: Target Reached

Francisco Ramos
9 min read · Aug 18, 2020


Image from https://bit.ly/2Yabc27

In the previous post I developed an action plan for this project, where the idea is to take the LunarLander environment and adapt it in order to use HER on it. The goal: to command the spaceship to hover in different locations.

In this last post I’ll talk about the problems I encountered along the way, and how I finally reached the goal.

I spent over a month (yes, I have a busy life) trying to implement this paper to the letter and make it work the way I imagined it “should” work with the LunarLander environment. But I kept failing. I switched to DDPG and used the continuous version of LunarLander (DDPG is essentially the DQN of continuous action spaces). I spent another month (super busy life) tuning hyperparameters, trying different hyperparameter optimization techniques such as Bayesian Optimization, changing network architectures, etc. But my agent refused to learn. Then I tried a dense reward based on Euclidean distance, and training improved, but I wasn’t happy with the results, above all because the main goal of this research was to convince myself, through my own experiments, that this brilliant idea actually works outside the paper and can be used in environments other than the ones the authors tested.

Training became more stable when I decided to ditch this part:

HER paper

This part should actually help training, as the paper suggests, and the intuition behind it is sound. Input scaling is a very important part of data preparation before training a model. The reasons are beyond the scope of this article, but there is plenty of literature around the topic one can simply google. The thing is that normalizing the input wasn’t helping much, and I also noticed some instability in training. Maybe the reason is that the scales of the state components of this environment don’t differ much from each other: they have similar distributions and mostly range between -1 and 1, or close to that. Maybe that, plus the fact that we’re using a running mean and std, recalculated as the agent encounters more and more states... But I’m just speculating here. I got rid of the logic, but left the scaler in the repo if you’re interested: https://github.com/jscriptcoder/Hindsight-Experience-Replay/blob/master/common/scaler.py
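For reference, this is roughly the kind of running mean/std scaler I’m talking about. It’s just a sketch, not the code from the repo, and the class name is made up:

```python
import numpy as np

class RunningScaler:
    """Tracks a running mean and std of the states seen so far and normalizes inputs.
    Minimal illustration only; the real implementation lives in common/scaler.py."""

    def __init__(self, size, eps=1e-2):
        self.mean = np.zeros(size)
        self.var = np.ones(size)
        self.count = eps  # small initial count avoids division by zero early on

    def update(self, x):
        # Incremental (parallel) update of mean and variance with a batch of states
        x = np.atleast_2d(x)
        batch_mean, batch_var, batch_count = x.mean(0), x.var(0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_mean = self.mean + delta * batch_count / total
        m2 = self.var * self.count + batch_var * batch_count \
             + delta ** 2 * self.count * batch_count / total
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, x, clip=5.0):
        # Standardize and clip, as the paper suggests
        std = np.sqrt(self.var) + 1e-8
        return np.clip((x - self.mean) / std, -clip, clip)
```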

So things didn’t look too good after a few months of experimentation. I decided to get out of my comfort zone (stop sticking to the paper), take what’s really important, the basic idea behind HER, and use it in a different algorithm than the one suggested. That means forgetting most of the setup and using the pure DQN algorithm, simply plugging in the important piece of HER. This is the result:

I’ll explain a bit what’s going on here: we run an episode, collecting the transitions (tuples of (state, action, next state, reward, terminal, goal)), and run an optimization step as the agent explores/exploits the environment using an epsilon-greedy strategy. Normal DQN stuff, except for a new value, the goal, which is used as part of the input to the neural network and conditions the agent. We can visualize the episode as a simple line representing the trajectory, or list of transitions, as follows:

I’m showing in green what a good trajectory might look like, reaching the target, along with the actual trajectory in red, which has the wrong outcome. The agent failed. Now is when the fun begins. After running this episode, we’re gonna go through all the collected transitions, and for each one (remember, this is a tuple (state, action, next state, reward, terminal, goal)) we’re gonna pick a state, or multiple ones, ahead of it in the future along that trajectory, turning them into virtual goals that have actually been reached by the agent, as follows:

The number of goals to pick is controlled by another hyperparameter, which I called future_k, as seen in the code above. Now we loop over these new goals, recalculate the reward based on the next state achieved during the transition being replayed and the new goal, and finally store that “simulated” transition in the replay buffer as (state, action, next state, new reward, False, new goal). This means we’re storing approximately T*future_k new transitions in our buffer, where T is the number of steps taken during the whole episode. I said “approximately” because, as we get closer to the final outcome, we might not be able to pick future_k goals if this value is higher than the number of steps left before the end. Notice something interesting: as we get closer and closer to the final outcome, we’ll also be adding more and more positive signal, since the randomly sampled virtual goals will be closer to the transition currently being replayed. This can be seen here:

This might click if I remind you of the threshold value within which two states are considered similar. It means that in the image above, some virtual goals, if not all of them depending on the tolerance, might yield a +1 reward, filling up the buffer with lots of signal to learn from. Another thing worth mentioning is the terminal value, which I decided to always set to False in the simulated transitions. Reaching these goals doesn’t mean we’re done; remember, the idea is for the spaceship to hover as long as possible.
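Putting the last few paragraphs together, here is a rough sketch of the whole thing in code. This is not the actual code from the repo: env stands for the adapted LunarLander, agent and replay_buffer are stand-ins for the DQN agent and its buffer, the goal is assumed to cover the first six state components, and values like future_k=4 and the tolerance are just illustrative:

```python
import numpy as np

TOLERANCE = 0.1  # illustrative: two states closer than this count as "the same"

def compute_reward(next_state, goal, tol=TOLERANCE):
    # Sparse reward: +1 if the achieved state is within tolerance of the goal, 0 otherwise.
    # Assumption: the goal covers the first len(goal) components of the state vector.
    achieved = np.asarray(next_state)[:len(goal)]
    return 1.0 if np.linalg.norm(achieved - np.asarray(goal)) < tol else 0.0

def run_episode(env, agent, replay_buffer, goal, epsilon, max_steps=1000):
    # Collect one goal-conditioned episode, training as we go (normal DQN stuff).
    # The goal would come from whatever sampling routine you use; not shown here.
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        # The goal is concatenated to the state: the network sees (state, goal)
        if np.random.rand() < epsilon:
            action = env.action_space.sample()                  # explore
        else:
            action = agent.act(np.concatenate([state, goal]))   # exploit
        next_state, _, done, _ = env.step(action)
        reward = compute_reward(next_state, goal)
        transition = (state, action, next_state, reward, done, goal)
        trajectory.append(transition)
        replay_buffer.add(*transition)
        agent.optimize(replay_buffer)                           # one DQN update
        state = next_state
        if done:
            break
    return trajectory

def her_replay(trajectory, replay_buffer, future_k=4):
    # For each transition, sample up to future_k states ahead of it on the same
    # trajectory and store them as virtual goals that were actually reached.
    T = len(trajectory)
    for t, (state, action, next_state, _, _, _) in enumerate(trajectory):
        if t + 1 >= T:
            break
        future_ids = np.random.randint(t + 1, T, size=future_k)
        for f in future_ids:
            new_goal = trajectory[f][2][:6]   # state reached at step f, used as a goal
            new_reward = compute_reward(next_state, new_goal)
            # terminal is always False: reaching a virtual goal doesn't end the task
            replay_buffer.add(state, action, next_state, new_reward, False, new_goal)
```

The training loop would then simply alternate run_episode and her_replay for every episode.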

My failed attempt to change the paper a bit more

At the beginning this didn’t make much sense to me, so I decided to switch the loops around. I instead tried to simulate a whole trajectory from time step 0, the initial transition, up to the new goal. That means the outer loop goes over the virtual goals and the inner loop goes over the transitions up to the current virtual goal, contrary to the paper. In my head this made more sense. But it turns out that simulating trajectories only works well if you use dense rewards with Euclidean distance, since then every transition gives a signal to learn from. With sparse rewards, by simulating trajectories you’re just limiting the amount of good signal, since it will only be given at the end. And you might be wondering: ok, then simulate a lot of trajectories, maybe even as many as there are time steps, and you’ll have enough signal. Sure, but then you’re also kind of polluting the buffer with things the agent cannot make good use of, namely all those transitions in between. I tested this myself and I can confirm these results... Anyway, I’m not sure this whole story came across, but I did go through some questions like this one and experimented with all of them. This is what research is about, no?
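For the curious, that failed variation would look roughly like this. Again just a sketch, reusing compute_reward from the previous snippet as a parameter:

```python
import numpy as np

def simulate_trajectories(trajectory, replay_buffer, compute_reward, future_k=4):
    # Failed variation: outer loop over virtual goals, inner loop replaying the
    # whole trajectory from step 0 up to the step where that goal was reached.
    T = len(trajectory)
    if T < 2:
        return
    goal_steps = np.random.randint(1, T, size=future_k)       # pick future_k random steps
    for goal_step in goal_steps:
        new_goal = trajectory[goal_step][2][:6]                # state reached there, as a goal
        for t in range(goal_step + 1):
            state, action, next_state, *_ = trajectory[t]
            # With sparse rewards, only transitions near goal_step can yield +1;
            # everything earlier gets stored with 0 reward, diluting the buffer.
            new_reward = compute_reward(next_state, new_goal)
            replay_buffer.add(state, action, next_state, new_reward, False, new_goal)
```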

Goal finally reached

I’m gonna spare you all the learning plots (loss, success rate and rewards over time), which you can reproduce yourself by cloning the repo and running the training; I’m logging everything to TensorBoard. I think it’s more interesting, and more fun, to see the agent flying this spaceship after such multi-goal training with sparse rewards. These are some of the results, with their goals:

Goal: [0., 1., 0., 0., 0., 0.]

The red circle delimits the target location, with a radius approximately equal to the tolerance used during training. I found this tolerance to be an important parameter for the agent’s success in reaching the targets. Too small and it’ll have a hard time learning anything. Too big and it’ll just learn garbage.

Goal: [-0.8, 0.5, 0., 0., 0., 0.]

Goal: [0.8, 0.5, 0., 0., 0., 0.]

Goal: [-0.5, 0.2, 0., 0., 0., 0.]

Goal: [0.5, 0.2, 0., 0., 0., 0.]

This one above is somewhat interesting, because of the correction after a wrong manoeuvre.

Goal: [0., 0., 0., 0., 0., 0.]

Here I was trying to actually solve the environment by sending the spaceship to the landing pad, but most of the time it was just hovering there, almost touching the ground, or even touching it and shifting non-stop left and right, but never really resting. So I decided to push the y coordinate a bit further and give it a negative value, below the ground. I thought this could make the lander go idle. And it did: not always, but sometimes these two goals, [0, -0.05, …] and [0, -0.1, …], made the lander solve the environment.

Goal: [0., -0.05, 0., 0., 0., 0.]

The cool thing about these last two goals, with a negative y coordinate, is that the agent was never trained with such a target (only 0 ≤ y < 1.4), and yet it managed to generalize enough to understand that it needs to go lower. The only problem is that, for this agent, solving the environment is not ideal. It gets more reward if the episode doesn’t finish (remember, +1 for each time step while hovering), so most of the time it will simply keep firing the left and right engines so as not to go idle, while keeping its location as close as possible to the target.

A potential solution I just thought about while writing these last words is to add a new component to the goal indicating whether or not the lander is firing an engine: 1 if it’s firing (left, right or down), 0 otherwise. So, in order to solve the environment, the goal would look like this: [0, 0, 0, 0, 0, 0, 0], where the last value tells the agent to stop firing, and during training the virtual goals would all have a 1 or a 0 depending on the action taken in the current transition... or... hmmm... maybe even simpler: change the sparse reward to 0 => success, -1 => fail. Would be worth a try 😉.
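That second idea is literally a one-line change to the sparse reward from the earlier sketch. I haven’t tried it, so this is just the thought written down:

```python
import numpy as np

def compute_reward_v2(next_state, goal, tol=0.1):
    # Untested idea: 0 on success, -1 otherwise. Every step spent away from the
    # goal now costs something, so endless hovering stops being more attractive
    # than settling down and going idle.
    achieved = np.asarray(next_state)[:len(goal)]
    return 0.0 if np.linalg.norm(achieved - np.asarray(goal)) < tol else -1.0
```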

Conclusion

The results are not amazing, but I’m quite happy with them, because the main goal was to understand the idea, build it and make it work in an environment of my choice... Besides, I only trained the agent for a couple of hours on a single, very old GPU, unlike the authors of the paper, who trained their agent for 6 hours on 8 CPU cores, and probably very powerful ones.

Even though I bumped into some walls in this project, it’s been one of the most fun I’ve worked on, and it reinforced my passion for Reinforcement Learning. Hindsight Experience Replay is just one of those awesome ideas the world of RL is full of. There are tons of other mind-blowing ideas out there waiting to be explored. One of them, which I recently started working on and which also resembles our own human nature, is Curiosity-driven Exploration and Intrinsic Motivation... but I’ll leave that for another article.

If you’re curious about this project and the code, I invite you to clone or fork the repo and experiment with it. You can find it here: https://github.com/jscriptcoder/Hindsight-Experience-Replay.

Thanks for reading 😊
