Demystifying Upside-Down Reinforcement Learning (a.k.a ⅂ꓤ)

Francisco Ramos
Feb 27, 2020 · 12 min read


Image from https://neilpatel.com/blog/robots-txt/

Reinforcement Learning (RL) is a fascinating branch of Machine Learning, which, in my humble opinion, is the closest we can get to what we could imagine as being Artificial General Intelligence. It’s definitely still quite far from that, but when you think about what a futuristic robot would be like, how it would learn things and make smart decisions, that idea resembles a lot the way we humans learn, and this is what Reinforcement Learning is about.

RL is actually quite old. It can be traced back to around the 1950s, when Richard Bellman and others tried to solve Optimal Control problems with techniques such as Dynamic Programming. Optimal Control problems are closely related to Reinforcement Learning, particularly those formulated as Markov Decision Processes (MDPs). If you’re interested in the history, there is an interesting chapter on it in the book by Richard S. Sutton and Andrew G. Barto.

What is Reinforcement Learning?

Before we continue, I’m gonna give a rather brief introduction to RL, since that’s not the purpose of this article, but if you’d like a more extended one, I highly recommend reading this excellent series by Jonathan Hui. Believe me, this guy writes great stuff.

Image from https://nervanasystems.github.io/coach/index.html

The main idea is quite simple: there is this agent (an algorithm) interacting with an environment through actions. These actions make the environment respond with a new state and a signal, the reward, which the agent will use as feedback. The learning happens through trial and error. If the signal is negative, the agent learns it shouldn’t be doing that. If the signal is positive, then it learns that it might be a good idea to actually do that next time. This reward is key for the agent to learn which sequences of decisions are the best.

Don’t predict rewards. Just map them to actions

On December 5th, 2019, famous AI researcher Jürgen Schmidhuber and his team came up with what I consider a revolutionary idea for solving Reinforcement Learning problems: Upside-Down Reinforcement Learning. They published two papers about this idea:

  1. Reinforcement Learning Upside Down: Don’t Predict Rewards — Just Map Them to Actions
  2. Training Agents using Upside-Down Reinforcement Learning

The first one is more theoretical, and outlines many of its main principles. The second one is more practical, giving an overview of the setup, algorithms and experiments. My work, and this article, is based on the second one.

Let’s turn Reinforcement Learning on its head

Traditional RL algorithms either predict rewards with value functions (e.g. Q-Learning) or maximize them using policy search (e.g. Policy Gradient methods). Schmidhuber and his team propose something quite unique and unexpected: instead of predicting the reward, use it as an input, along with the state, to map to actions, literally turning RL upside down.

Image from the paper https://arxiv.org/abs/1912.02877

In ⅂ꓤ (short for Upside-Down RL), instead of having a function approximator that maps a state-action pair to a value (expected return), we have this Behavior function (B) that maps a state-command pair to an action. And what’s a command, you might be wondering? Here is where things get interesting. A command consists of a desired return and a desired horizon. I’ll try to explain this idea. Let’s imagine we have this toy environment with only 4 states:

Image from the paper https://arxiv.org/abs/1912.02877

We have two terminal states, s2 and s3, which we reach from either s0 or s1. Now, according to this Markov Decision Process (a.k.a. MDP), if we were to explore the environment, we could end up building the table below. Let me explain what it means:

If we were in s0 and wanted to get a reward of 2 in 1 step, which action should the agent take? Have a look at the MDP again. It should be action a1. That one gives a return of 2 in 1 step, landing in s1. How about getting 1 in 1 step? Remember, we’re still in state s0. Definitely not action a1, as we just saw that action gives us a return of 2 in 1 step. We can take action a2, landing in state s2. Now, still in state s0, I’m gonna “command” the agent to get a reward of 1, but this time in 2 steps. Which action would get the agent there? Let’s see: if we take a1 we get a reward of 2, landing in s1. But we’re not done yet. From there, if we take a3, we get a reward of -1, landing in s3 (terminal state). Total reward: 2-1=1, in 2 steps. This is exactly what I commanded my agent to do. Let’s go back then to that last command: my desired return was 1 and my desired horizon was 2 steps. This maps to the initial action the agent must take to comply with the command: a1.
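To make this concrete, here is that little table written as a plain Python lookup. This is just a toy sketch I’m adding for illustration: the entries are simply read off the MDP above, and in ⅂ꓤ this table is of course replaced by a learned function.

```python
# Toy "Behavior table" for the MDP above:
# (state, desired_return, desired_horizon) -> action
behavior_table = {
    ("s0", 2, 1): "a1",   # a1 gives reward 2 and lands in s1
    ("s0", 1, 1): "a2",   # a2 gives reward 1 and lands in s2 (terminal)
    ("s0", 1, 2): "a1",   # a1 (reward 2) then a3 (reward -1): total 1 in 2 steps
    ("s1", -1, 1): "a3",  # a3 gives reward -1 and lands in s3 (terminal)
}

def behavior(state, desired_return, desired_horizon):
    """Answer the command: which action gets this return within this horizon?"""
    return behavior_table[(state, desired_return, desired_horizon)]

print(behavior("s0", 1, 2))  # -> a1
```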

I hope all that wasn’t very confusing. In any case, if you take a closer look at that simple table, can you see something familiar? Does it not remind you of another Machine Learning branch? Features (state, desired_return, desired_horizon) and labels (actions)… Exactly, Supervised Learning!! We have a clear Supervised Learning problem here. And SL has proven to be quite stable and successful. ⅂ꓤ turns Reinforcement Learning into a Supervised Learning problem, training an agent on all past experiences in a supervised way. I mean, how cool is that?

Using SL in RL problems isn’t new at all. It’s been tried before, for example in Imitation Learning, but with little success. The paper also explains this in its introduction:

While there is a rich history of techniques that incorporate supervised learning (SL) into reinforcement learning (RL) algorithms, it is believed that fully solving RL problems using SL is not possible, because feedback from the environment provides error signals in SL but evaluation signals in RL. Put simply, an agent gets feedback about how useful its actions are, but not about which actions are the best to take in any situation. On the possibility of turning an RL problem into an SL problem, Barto and Dietterich surmised: “In general, there is no way to do this.”

What’s proposed in this paper bridges this gap between SL and RL, avoiding the issues arising from the combination of function approximation, bootstrapping and off-policy training.

Enough talking. Show me the code!!

What we’re gonna do is build this table and train our Behavior function on it in a supervised way. In theory we could learn from any policy (e.g. a random policy) by generating all possible trajectories and selecting the action that leads to any desired return in any desired horizon, but this wouldn’t really be practical. Our goal is not to get just any return, but high ones. Also, imagine how long that could take, even in a relatively small MDP, when the exploration is not directed and completely random. The paper proposes a few additions so that the exploration is instead guided towards achieving ever higher returns.

Let’s begin with the hyperparameters we’re gonna be using for this algorithm. There is a comment explaining each of them, but I’ll be introducing them as we progress with the algorithm:
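A minimal sketch of that configuration could look like this. The names are the ones used throughout this article; the values are just illustrative, and replay_size as well as the exact value of max_steps_reward are my own picks:

```python
# Number of warm-up episodes generated with a random policy to seed the buffer
n_warm_up_episodes = 10
# Number of iterations of the main training loop
n_main_iter = 700
# Gradient updates of the Behavior function per main iteration
n_updates_per_iter = 100
# Number of (state, command) -> action examples per gradient update
batch_size = 768
# Maximum number of episodes kept in the replay buffer (my own addition)
replay_size = 500
# Number of best episodes used to sample exploratory commands
last_few = 75
# New episodes generated with the Behavior function per main iteration
n_episodes_per_iter = 20
# Evaluation episodes at the end of each main iteration
n_evals = 1
# LunarLander tweak (explained later): punish the agent with a big negative
# reward if it doesn't land within max_steps
max_steps = 300
max_steps_reward = -50
```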

Our entry point and main training loop:

We’re gonna make use of a replay buffer to store previous experiences. The idea is to have this buffer always sorted by total return seen so far and train our agent always on the trajectories that got the highest returns. We initialize this buffer with warm-up trajectories (regulated by n_warm_up_episodes) from a random policy, something to get us started. We instantiate our agent’s Behavior function, which is, as you can guess, a Neural Network that’s gonna map state-command to action. You can get as fancy as you want with the architecture. If you’d like to see the details of the components involved in this algorithm, you can take a look at the Jupyter notebook (html version).
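To give you an idea of how these pieces fit together, here is a rough sketch of the Behavior network and the main loop. The architecture, the episode format (a dictionary of states, actions and rewards) and the helper generate_warm_up_episodes (a hypothetical function that just runs a random policy) are assumptions of mine, and train_behavior, sample_command and generate_episode are sketched in the next sections:

```python
import torch
import torch.nn as nn

class Behavior(nn.Module):
    """Neural network mapping (state, command) -> action logits."""
    def __init__(self, state_size, n_actions, hidden_size=64):
        super().__init__()
        # The command adds two extra inputs: desired_return and desired_horizon
        self.net = nn.Sequential(
            nn.Linear(state_size + 2, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, state, command):
        return self.net(torch.cat([state, command], dim=-1))


def main(env):
    # Replay buffer: a list of episodes kept sorted (ascending) by total return.
    # Warm-up episodes come from a random policy, just to get us started.
    buffer = generate_warm_up_episodes(env, n_warm_up_episodes)
    buffer.sort(key=lambda e: sum(e["rewards"]))

    behavior = Behavior(env.observation_space.shape[0], env.action_space.n)
    optimizer = torch.optim.Adam(behavior.parameters(), lr=1e-3)

    for _ in range(n_main_iter):
        # 1. Supervised training on the trajectories with the highest returns
        train_behavior(behavior, optimizer, buffer)
        # 2. Sample an exploratory command from the top `last_few` episodes
        command = sample_command(buffer, last_few)
        # 3. Generate new episodes with that command and keep the buffer sorted
        for _ in range(n_episodes_per_iter):
            buffer.append(generate_episode(env, behavior, list(command)))
        buffer.sort(key=lambda e: sum(e["rewards"]))
        buffer = buffer[-replay_size:]   # keep only the best episodes
        # 4. Evaluate how well the agent follows commands
        evaluate_agent(env, behavior, command)
```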

The main loop is controlled by n_main_iter. Inside the loop we train and improve our agent based on what we have in our buffer. We’re gonna backpropagate and update our Neural Network n_updates_per_iter times:

What’s happening here? We’re sampling random episodes in batches of size batch_size. From each episode, we draw a random training example. Now, what exactly does a training example look like? Let’s see. Imagine we have an episode with this trajectory:

Drawn using https://www.draw.io/

A trajectory is nothing more than a sequence of state, action, reward like this: [(s0,a0,r0), (s1,a1,r1), ..., (st,at,rt), ...], which is what we store in our buffer. From this trajectory we cut out a random section such that it follows this rule: 0 ≤ t1 < t2 ≤ T, where t1 is the initial time step, t2 is the final time step and T is the length of the episode. We take the state and the action at step t1. The desired return is the sum of all rewards from t1 to t2, and the desired horizon is t2-t1. These two form our command. With this information we have the inputs for the Behavior function, B[state, command], and we have the action, our target label, to build the loss function for the supervised learning part:

B*(τ) is the optimal Behavior function for a set of trajectories τ
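In code, one of those n_updates_per_iter supervised updates could look roughly like this. It assumes the same episode format and hyperparameters as in the sketches above; details such as scaling the command inputs are left out:

```python
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_behavior(behavior, optimizer, buffer):
    """Supervised updates of the Behavior function on random trajectory segments."""
    for _ in range(n_updates_per_iter):
        states, commands, actions = [], [], []

        # Sample batch_size random episodes, and one training example from each
        for episode in random.choices(buffer, k=batch_size):
            T = len(episode["rewards"])
            t1 = random.randint(0, T - 1)      # 0 <= t1 < t2 <= T
            t2 = random.randint(t1 + 1, T)

            desired_return = sum(episode["rewards"][t1:t2])
            desired_horizon = t2 - t1

            states.append(episode["states"][t1])
            commands.append([desired_return, desired_horizon])
            actions.append(episode["actions"][t1])

        states = torch.tensor(np.array(states), dtype=torch.float32)
        commands = torch.tensor(commands, dtype=torch.float32)
        actions = torch.tensor(actions, dtype=torch.long)

        # Cross-entropy between the predicted action distribution and the
        # action that was actually taken at t1
        loss = F.cross_entropy(behavior(states, commands), actions)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```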

After the training step we sample a new command and generate episodes based on that initial command:

The way we sample this command is quite clever. The idea is to try to generate new, previously infeasible behaviour, potentially achieving higher returns. How can we even do that? These are the steps:

  1. First we select from the buffer the last_few episodes with the highest returns. Remember, our buffer is sorted in ascending order by total return, so the highest returns are at the end of the buffer.
  2. We work out the desired horizon by calculating the mean of the lengths of those last_few episodes.
  3. The desired return is sampled from the uniform distribution U[M, M+S], where M is the mean and S is the standard deviation of the returns of those last_few episodes. This tries to set the desired return always a bit higher than average, kind of pushing the agent to figure out the actions that could attain such a return.

Below we have the function performing these steps and taking care of the exploratory part of ⅂ꓤ:
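Here is a sketch of what that function could look like, with the episodes stored as dictionaries of states, actions and rewards as in the previous sketches:

```python
import numpy as np

def sample_command(buffer, last_few):
    """Sample an exploratory command from the `last_few` best episodes."""
    # The buffer is sorted in ascending order, so the best episodes are at the end
    best = buffer[-last_few:]
    returns = [sum(e["rewards"]) for e in best]
    lengths = [len(e["rewards"]) for e in best]

    # Desired horizon: mean length of those episodes
    desired_horizon = int(np.mean(lengths))

    # Desired return: sampled from U[M, M + S], i.e. a bit above the average
    M, S = np.mean(returns), np.std(returns)
    desired_return = np.random.uniform(M, M + S)

    return [desired_return, desired_horizon]
```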

The last important piece of the loop is the generation of episodes using our Behavior function. We generate n_episodes_per_iter each time and add them to the buffer. This is how we go about this last algorithm:

So, what’s happening here? Using an initial exploratory command for the very first action (remember the previous function, sample_command?), we run a whole episode until reaching a terminal state, done=True. For each step we need to calculate a new command in order to find the best action to take. The way we calculate the desired return is by subtracting the reward obtained in that step from the previous desired return, clipping the result so that it’s upper-bounded by the maximum return achievable in the environment. This clipping is important, otherwise we might end up passing a command with a desired return that’s not realistic, not achievable from any state. As for the desired horizon, we simply decrease the previous horizon by one. Makes sense, right? One step per step. We also need to clip this one, in this case lower-bounding it to one, otherwise we could end up passing a negative horizon: “Hey, agent, I want the job done by yesterday”. Not very realistic either, is it?
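Here is a sketch of that function under the same assumptions as before. It uses the classic gym step/reset API, and max_return (the clipping bound) is a value I made up for illustration; in practice it would be whatever maximum return the environment allows:

```python
import torch

def generate_episode(env, behavior, command, max_return=300):
    """Run one episode with the Behavior function, updating the command each step."""
    desired_return, desired_horizon = command
    states, actions, rewards = [], [], []

    state, done = env.reset(), False
    while not done:
        state_t = torch.tensor(state, dtype=torch.float32)
        command_t = torch.tensor([desired_return, desired_horizon],
                                 dtype=torch.float32)

        # Sample an action from the distribution given by the Behavior function
        logits = behavior(state_t, command_t)
        action = torch.distributions.Categorical(logits=logits).sample().item()

        next_state, reward, done, _ = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)

        # New desired return: subtract the reward we just got, upper-bounded by
        # the maximum return achievable in the environment
        desired_return = min(desired_return - reward, max_return)
        # New desired horizon: one step per step, lower-bounded by 1
        desired_horizon = max(desired_horizon - 1, 1)

        state = next_state

    return {"states": states, "actions": actions, "rewards": rewards}
```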

And that’s it. Apart from some small details, this is mostly the whole story. Well, we should also add an evaluation step at the end of the training loop to check how well the agent is doing. This last piece looks quite similar to our function generate_episode; the only differences are that we switch our NN to evaluation mode, in case we use dropout and/or batchnorm, and that we log the total return at the end of the episode. We can also perform the evaluation multiple times, controlled by n_evals, to make sure the agent hasn’t solved the environment just by chance.
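A sketch of that evaluation step, simply reusing generate_episode from above, could be:

```python
import torch

def evaluate_agent(env, behavior, command, n_evals=1):
    """Run a few evaluation episodes and log the average total return."""
    behavior.eval()                      # eval mode, in case of dropout/batchnorm
    returns = []
    with torch.no_grad():
        for _ in range(n_evals):
            episode = generate_episode(env, behavior, list(command))
            returns.append(sum(episode["rewards"]))
    behavior.train()
    mean_return = sum(returns) / n_evals
    print(f"Mean return over {n_evals} evaluation episode(s): {mean_return:.2f}")
    return mean_return
```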

Landing Spaceships

In the paper they ran some experiments in different environments. One of them is the famous OpenAI LunarLander. I’ve always found this environment a better benchmark than the even more famous CartPole. The problem with the latter is that even broken algorithms can solve it (I’ve experienced this myself), so I’d say it’s not a very reliable one, at least the discrete version (discrete actions). Now, with the LunarLander, if the algorithm is able to solve it, you know you’ve done quite a decent job. It’s definitely not a difficult environment, but it’s not the easiest either, and some algos might struggle to solve it. So not only did I decide to use it, I also tried to outperform the results presented in the paper. The gif below shows an agent that hasn’t been properly trained.

Untrained agent

While training I noticed something interesting: the agent was gathering points by landing the spaceship and then going in and out of the landing area (between the two flags) over and over again, moving horizontally and never switching off the engines. Later on in the training it realized that it can actually get more points by turning off the engines, but it was taking way more epochs to reach that conclusion. This was causing the exploratory horizon to grow up to 1000 steps, which seems to be the maximum number of steps allowed by the environment before reaching a terminal state. I decided then to modify the reward function a bit, punishing the agent with max_steps_reward (a big negative reward) if it didn’t land the spaceship within max_steps. I also noticed, after running some experiments for a very long time, that the algorithm was decreasing the desired horizon from 1000 down to 250, so I decided to start punishing the agent after 300 steps. This sped up the learning quite a lot, to the point that it started to solve the environment after about 500 epochs, compared to the initial 6000 epochs or so.
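The tweak itself can be expressed as a small environment wrapper. This is a hypothetical version of it (the original notebook might implement it differently, and the value of max_steps_reward is my own pick):

```python
import gym

class MaxStepsPenaltyWrapper(gym.Wrapper):
    """Ends the episode with a big negative reward if the agent takes too long."""
    def __init__(self, env, max_steps=300, max_steps_reward=-50):
        super().__init__(env)
        self.max_steps = max_steps
        self.max_steps_reward = max_steps_reward
        self.steps = 0

    def reset(self, **kwargs):
        self.steps = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        self.steps += 1
        if not done and self.steps >= self.max_steps:
            reward = self.max_steps_reward   # punish the agent for not landing in time
            done = True
        return state, reward, done, info

# env = MaxStepsPenaltyWrapper(gym.make("LunarLander-v2"), max_steps, max_steps_reward)
```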

The following plots are in my opinion quite insightful, because you can see a few interesting things going on during the training:

X axis = epochs / 10

Desired return rockets around epoch 300, which is where the desired horizon reaches the punishing ceiling set by max_steps. From that point on the agent probably realizes that it should start turning off the engines, which is more rewarding than getting this big punishment.

Desired horizon seems to be converging to something around 220 steps or so. Maybe those are the minimum steps needed to land the spaceship smoothly, without smashing it on the ground? I didn’t leave the training running long enough to see how fast the agent could eventually land. It would be worth trying.

Training loss is consistently going down, with a fast drop also around 300 epochs and then slowing down.

Let’s compare the desired return against the actual return:

Desired returns vs Actual return

Seems to be sort of consistent, which means the training is actually working and the agent is learning to obey the commands. It’s a bit biased though, kind of shifted, since the actual return is always a bit less than what’s desired. Again, I didn’t let it train long enough to see if these two eventually converge, but it’s clear that this algorithm is doing a good job.

These two gifs show a trained agent that’s given two commands with the same desired horizon of 230 steps. The synchronization is excellent:

Trained agent

Conclusion

This is what I find really amazing about this algorithm. You can decide how well and how fast the agent should do the job. I don’t know of any other Deep Reinforcement Learning algorithm that’s able to do that.

It was really fun to read, understand and work on this paper. The algorithm is definitely quite original and full of possibilities. I’ve only scratched the surface of what’s possible with it. I didn’t play long enough with the hyperparameters or the NN architecture: dropout, batchnorm, different layers and units, types of cells such as LSTMs, etc. I haven’t yet tried continuous environments either. The paper also shows good performance in sparse-reward environments, which is something else worth trying.

I believe this paper is not getting the attention it deserves, but I’m sure there will be more work based on it, and I really can’t wait for that.

This project can be found here: https://github.com/jscriptcoder/Upside-Down-Reinforcement-Learning
