Yet Another Hindsight Experience Replay: Backstory

Francisco Ramos
6 min read · Aug 18, 2020


Image from https://bit.ly/3a351l6

A couple of months ago I decided to embark on a journey when I stumbled upon a paper whose ideas I found fascinating. This article is about Reinforcement Learning and is the first of a series of three articles:

  1. Backstory
  2. Refining the plan
  3. Target Reached

I remember the time I finally got to learn about Reinforcement Learning, after a couple of years completely devoted to Supervised/Unsupervised Learning. It had never really drawn my attention, mainly because most of the material you find about Machine Learning (articles, courses, etc.) barely touches the subject, or skips it altogether, so it doesn't do much to pique your curiosity. Then, while studying the Machine Learning Engineer Nanodegree from Udacity (and I need to point out here, the good old, long Nanodegree; the new one leaves much to be desired), there was a whole section with tons of lessons covering (Deep) Reinforcement Learning in detail: from the basics, Dynamic Programming, Monte Carlo methods and Temporal Difference learning, to Value-based, Policy-based and Actor-Critic methods. Since then I haven't stopped digging into the subject, and it has become one of my favorite hobbies. I find RL way more satisfying and fulfilling, in many ways, than the other two branches.

The Problem

I remember the lectures about the Reward Hypothesis, Goals and Rewards, where it was explained how Google DeepMind managed to make a humanoid learn how to walk. The image below represents the basic idea and elements involved in a reinforcement learning system where an agent, with different sensors, tries to learn how to walk by applying different forces to its joints:

Image from ML Engineer ND at Udacity, RL Framework: The Problem, Goals and Rewards, Part 2

So, what DeepMind did was to carefully engineer a reward function so the agent could successfully learn how to walk. What does this reward function look like? Like this:

Image from ML Engineer ND at Udacity, RL Framework: The Problem, Goals and Rewards, Part 2

Not super complicated, and kind of intuitive once each component is explained. But from that moment on I realized there was something really wrong with RL: the need for these handcrafted reward functions, prone to all kinds of issues, in order for an agent to achieve a specific goal. After watching the lecture I tried to think about tasks where engineering such a reward would be almost impossible due to its complexity… Who likes ironing? I personally hate it. How cool would it be to have a robot at home doing the ironing, right? Question is, what would a reward function look like so a robot could learn how to iron things? Think about it. It's not only about grasping, or the positions and velocities of the joints needed to handle the iron, but also about lots of state components associated with the piece of clothing it's trying to iron, wrinkle-free!! I mean, that could end up being a crazy reward function.
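I can't reproduce the exact expression from the lecture here, but just to give a flavour of what "carefully engineer" means, here is a purely illustrative Python sketch of a shaped walking reward. None of these terms or weights are DeepMind's; they're made up to show the kind of hand-tuning involved:

```python
import numpy as np

def shaped_walking_reward(forward_velocity, lateral_velocity, torso_height,
                          target_height, joint_torques, alive):
    """Illustrative shaped reward for "learn to walk": reward forward progress,
    penalise drifting, crouching and wasted torque, and punish falling over.
    Every term and weight here is made up for illustration."""
    reward = forward_velocity                                  # move forward
    reward -= 0.1 * abs(lateral_velocity)                      # don't drift sideways
    reward -= 0.5 * abs(torso_height - target_height)          # stay upright
    reward -= 0.01 * float(np.sum(np.square(joint_torques)))   # don't waste energy
    reward += 0.0 if alive else -10.0                          # and please don't fall
    return reward
```

Every one of those weights has to be tuned by hand, and every new task needs a new function like this.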

In an ideal AI world, my ideal world 😅, an agent wouldn't need such a shaped and highly informative reward function, which in the end is generally problem-specific. It's much easier, for the developer of the agent, to have what's called a Sparse Reward:

you reached the goal ⇒ +1
you didn't ⇒ -1

… and nothing in between, because how do you tell the robot that it's doing well or badly during the process of ironing a shirt? Bear with me, I'll be connecting things soon.
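To make that concrete, here is a minimal sketch of what such a goal-conditioned sparse reward could look like in Python. The distance tolerance and the exact +1/-1 values are my own illustrative choices:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    """+1 if we ended up close enough to the desired goal, -1 otherwise.
    No shaping, no hints about how well the attempt is going."""
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 1.0 if distance < tolerance else -1.0

print(sparse_reward([0.90, 0.40], [0.20, 0.70]))  # -1.0: the puck missed the target
print(sparse_reward([0.21, 0.69], [0.20, 0.70]))  # +1.0: close enough
```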

Human nature: Learning from Mistakes

This is one of the aspects of RL that made me fall in love with it: how we can apply concepts found in human nature to an AI agent so it learns how to perform a task… or multiple tasks, as we'll see soon.

We have this amazing ability to learn from our mistakes. We try to reach a goal and we fail miserably. But we can still learn from the decisions we took that led us to the wrong outcome. We now know what actions to take to reach that “wrong” outcome. Yes, sure, it wasn’t what we wanted, but we definitely learnt something: how to reach this “other” outcome. Let me explain this with some illustrations. Imagine we have this robotic arm that’s trying to push this puck to reach that goal:

Image from Ingredients for Robotics Research

It pushes it and the puck ends up in the wrong place…

Video from Ingredients for Robotics Research

It failed to reach its initial goal, but it did reach some other — call it imaginary or virtual — goal.

Image from Ingredients for Robotics Research

Well, the arm still doesn’t know how to push the puck to the original goal, but it definitely knows how to push it to reach this other one, and it can learn from that. And now you’re probably wondering: still, how can it use those mistakes to learn to reach the initial target? The answer is: thanks to the magic of function approximation. I’ll elaborate on this in the next section.

Multi-Goal RL and the power of Neural Networks

Neural Networks are universal function approximators with a powerful property responsible for the success of Deep Learning: Generalization. This is the ability to apply what’s learnt to data never seen during training. That means, in our context, that not only has the robotic arm learnt how to reach that virtual goal, but it now has a pretty good idea of how to reach its surroundings too. How cool is that? You can now imagine how it might be able to learn to reach the real target: by getting closer and closer to it. There is still a missing piece here though, and that is multi-goal learning. We need to tell the agent to reach a state, let’s call it G. The agent reaches another one by mistake, the virtual one, let’s call it V. We then ask the agent to learn what happened when trying to reach G:

sorry, but you failed here ⇒ -1

But we also ask the agent to learn what happened when trying to reach V:

great job! you reached the target ⇒ +1

We keep doing this throughout the whole training, so the agent always has a signal to learn something useful. If we didn’t ask the agent to learn from this mistake, then it would always receive a -1 reward and would be super confused. That sparse reward says nothing about the value of an action, all of them yielding -1… and believe me, the chances that the agent reaches the goal “by accident” are extremely low in a continuous state and action space. I mean, what are the chances that a robot, randomly moving all its joints, manages to iron a shirt? Actually, what are the chances that it even picks up the iron? Close to zero.
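This goal-relabelling trick is the core of the idea. Below is a rough Python sketch of how a single failed episode could be stored twice in a replay buffer: once against the original goal G and once against the virtual goal V that was actually reached. The episode format and the `sparse_reward` helper are my own simplification, not the paper's exact procedure:

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, tolerance=0.05):
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 1.0 if distance < tolerance else -1.0

def store_with_hindsight(episode, original_goal, replay_buffer):
    """episode: list of dicts with "state", "action" and "achieved_goal" keys."""
    # V = wherever the puck actually ended up at the end of the episode
    virtual_goal = episode[-1]["achieved_goal"]
    for step in episode:
        # 1) the honest transition: "you were trying to reach G" -> mostly -1
        replay_buffer.append((step["state"], original_goal, step["action"],
                              sparse_reward(step["achieved_goal"], original_goal)))
        # 2) the hindsight transition: "pretend V was the goal all along" -> +1 at the end
        replay_buffer.append((step["state"], virtual_goal, step["action"],
                              sparse_reward(step["achieved_goal"], virtual_goal)))

# usage with a hypothetical two-step episode
buffer = []
episode = [
    {"state": [0.1, 0.2], "action": [0.0, 1.0], "achieved_goal": [0.5, 0.5]},
    {"state": [0.5, 0.5], "action": [1.0, 0.0], "achieved_goal": [0.9, 0.4]},
]
store_with_hindsight(episode, original_goal=[0.2, 0.7], replay_buffer=buffer)
# buffer now holds 4 transitions: 2 with goal G (both -1)
# and 2 with goal V (the last one rewarded +1)
```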

The learning is conditioned on a goal, and this goal is part of the input of the Neural Network along with the current state. With this we’re creating an agent that’s learning to reach not just one goal, but multiple ones and all their surroundings. After training our expert ironing robot, we can ask him/her/it to iron just the sleeves of the shirt and leave the rest all wrinkled up… yeah, that’s fashionable today, in my ideal world 😂.
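And here is a minimal PyTorch sketch of that input: the goal is simply concatenated to the state and fed through the network, so the same weights produce different behaviours depending on which goal we ask for. The sizes and layers are arbitrary, just for illustration:

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """A tiny policy whose input is the current state *and* the desired goal."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
            nn.Tanh(),  # continuous actions in [-1, 1]
        )

    def forward(self, state, goal):
        # The goal travels alongside the state, so changing the goal
        # changes the behaviour without retraining the network.
        return self.net(torch.cat([state, goal], dim=-1))

policy = GoalConditionedPolicy(state_dim=10, goal_dim=3, action_dim=4)
state, goal = torch.randn(1, 10), torch.randn(1, 3)  # "push the puck here"
action = policy(state, goal)
```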

So, now we have the different components that this extraordinary idea comprises: Sparse Rewards, Learning from Mistakes and Multi-Goal Learning. The idea is called Hindsight Experience Replay (a.k.a. HER), and it’s one of my favorite papers.

In the next article I’ll talk about HER and how I adapted one of my most loved environments to try this out.

Hope to see you there.


Francisco Ramos

Machine and Deep Learning obsessive compulsive. Functional Programming passionate. Frontend for a living