Curious Agents II: Solving MountainCar without Rewards

Dries Smit
InstaDeep
Jul 1, 2023

Welcome back to our series where we investigate some promising self-supervised learning methods that might help alleviate some of the issues current reinforcement learning algorithms face. If you missed it, the first post in this series can be found here.

In this post, we will implement our first curiosity-based learning agent to solve Gymnax’s MountainCar-v0 environment. All the code used in this series is available here for you to play around with. The code for this post is found here.

As mentioned previously, self-supervised learning seems to be a promising area of research that could help agents autonomously learn in large open-world environments. In these environments, it is typically difficult and/or time-consuming to design rewards by hand, e.g. it might be quite challenging to specify a reward for a robot cleaner or an agent browsing the internet. Therefore many researchers have proposed methods with which agents can generate intrinsic rewards for themselves.

The goal of self-supervised learning is to find ways to pre-train agents without using explicit rewards from the environment. This might seem like a daunting task. How can an agent possibly learn anything useful if there is no reward signal to guide it?

To answer this, we first need to go back to the typical agent-environment interaction diagram, shown below.

The agent-environment interaction loop.

If we remove the reward signal, we are left with only observations and actions. The only data that is provided to the agent is the sequence of observations given by the environment and the knowledge of which actions it took. Therefore a natural starting point is to try and predict future observations from previous observations and actions. This dynamics model, that the agent learns, is referred to as a world model. Its objective is to try and decrease its prediction error on future observations. In its simplest form, the world model ( f )’s loss function can be defined as

f_{loss} = ||o_{t+1} - f(o_t, a_t)||²,

where o_t and a_t are the observation and action at timestep t, and ||x||² denotes the squared L2 norm, so the loss is the squared distance between the true next observation and the predicted next observation.
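
To make the notation concrete, here is a minimal sketch of this loss in JAX for a single transition. The name world_model_fn is a stand-in for whatever network we end up using and is purely illustrative:

import jax.numpy as jnp

def world_model_loss(world_model_fn, o_t, a_t, o_tp1):
    # Predict the next observation from the current observation and action.
    pred_o_tp1 = world_model_fn(o_t, a_t)
    # Squared L2 distance between the true and predicted next observation.
    return jnp.sum(jnp.square(o_tp1 - pred_o_tp1), axis=-1)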

Learning a world model is interesting in itself, but it is still not clear how this aligns with our goal of learning a policy that can easily be fine-tuned to downstream tasks. We also want to train a policy that explores useful parts of the state space. This idea has resurfaced many times in the past, and some of the most noteworthy work in this space was done by Jürgen Schmidhuber. The basic idea is to train a policy that attempts to maximise the loss function defined above. By setting the policy's reward equal to f_{loss}, we incentivise it to seek out observation sequences that the world model cannot yet predict well. This is a pretty neat trick. Initially, the policy gets a lot of reward simply for varying its actions, which changes the observations. But as the world model gets better at predicting, the policy must explore further into the environment to find novel experiences. The policy therefore learns to efficiently navigate and manipulate the environment, which can make it valuable for downstream tasks.

To illustrate this point, we will now create our own agent to solve Gymnax's MountainCar environment. The code used in all the posts in this series can be found here. Throughout this series, we will be using Python and environments written in JAX. JAX is a numerical computing library that, among other things, uses XLA compilation to massively accelerate machine learning code. Reinforcement learning and self-supervised learning typically require a lot of simulation to learn effective policies, so JAX allows us to train our agents in a reasonable time.
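
As a quick illustration of what this looks like in practice, here is a minimal sketch of interacting with Gymnax's MountainCar-v0 using Gymnax's functional reset/step API. Because everything is pure JAX, these calls can be jit-compiled and vmapped over thousands of parallel environments:

import jax
import gymnax

rng = jax.random.PRNGKey(0)
rng, key_reset, key_act, key_step = jax.random.split(rng, 4)

# Instantiate the environment and its default parameters.
env, env_params = gymnax.make("MountainCar-v0")

# Reset, sample a random action and take a single step.
obs, state = env.reset(key_reset, env_params)
action = env.action_space(env_params).sample(key_act)
n_obs, n_state, reward, done, _ = env.step(key_step, state, action, env_params)

# Many environments can be stepped in parallel, e.g. with
# jax.vmap(env.step, in_axes=(0, 0, 0, None)).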

Let us now take a look at the MountainCar example.

Random policy on Gymnax’s MountainCar-v0.

In the above image, we can see a random policy acting in the environment. The agent can move forwards, backwards or do nothing, and its goal is to reach the flag at the top of the rightmost mountain. In this environment, the agent receives a reward of -1 per step, for a maximum of 200 timesteps. When the agent reaches the flag the episode ends. Therefore, if the agent solves the environment in fewer than 200 steps, its total reward will be greater than -200. Typical RL algorithms such as PPO struggle with this environment, as the agent must first reach the flag to achieve an episode reward greater than -200. It is highly unlikely to reach the flag with random actions, so PPO usually never learns to solve it without introducing additional rewards.

Luckily, we will not be using external rewards. So let us dive in. To save some time, we will not code an RL algorithm from scratch, but will instead build on the source code written by Chris Lu and Andrei Lupu in this repo. They implemented PPO, among other algorithms, entirely in JAX. Please give their code a try!

We start by modifying it to include a world model. Let us set up the world model as follows:

from typing import Sequence

import jax
import jax.numpy as jnp
import numpy as np
import flax.linen as nn
from flax.linen.initializers import constant, orthogonal


class WorldModel(nn.Module):
    action_dim: Sequence[int]
    activation: str = "tanh"

    @nn.compact
    def __call__(self, x, action):
        if self.activation == "relu":
            activation = nn.relu
        else:
            activation = nn.tanh

        # One-hot encode the action and concatenate it with the observation.
        one_hot_action = jax.nn.one_hot(action, self.action_dim)
        inp = jnp.concatenate([x, one_hot_action], axis=-1)

        # Two hidden layers of width 64.
        layer_out = nn.Dense(
            64, kernel_init=orthogonal(np.sqrt(2)), bias_init=constant(0.0)
        )(inp)
        layer_out = activation(layer_out)
        layer_out = nn.Dense(
            64, kernel_init=orthogonal(np.sqrt(2)), bias_init=constant(0.0)
        )(layer_out)
        layer_out = activation(layer_out)
        # The output layer predicts the next observation, so it matches x's shape.
        layer_out = nn.Dense(
            x.shape[-1], kernel_init=orthogonal(1.0), bias_init=constant(0.0)
        )(layer_out)
        return layer_out

The model takes in an observation (x) and an action, and produces an output of the same shape as the input observation.
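
As a quick usage sketch (not from the original repo): MountainCar-v0 has a two-dimensional observation, [position, velocity], and three discrete actions, so initialising and calling the model could look like this:

import jax
import jax.numpy as jnp

model = WorldModel(action_dim=3)

rng = jax.random.PRNGKey(0)
dummy_obs = jnp.zeros((1, 2))                    # (batch, [position, velocity])
dummy_action = jnp.zeros((1,), dtype=jnp.int32)  # discrete action indices

params = model.init(rng, dummy_obs, dummy_action)
pred_next_obs = model.apply(params, dummy_obs, dummy_action)  # shape (1, 2)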

Now we simply set the policy’s reward to be equal to the loss as specified above:

def l2_norm_squared(arr, axis=-1):
    return jnp.sum(jnp.square(arr), axis=axis)

# Calculate the squared distance between the predicted and the actual observation
pred_o_t = self._world_model.apply(wm_train_state.params, o_tm1, action)
reward = l2_norm_squared(o_t - pred_o_t).mean(axis=-1)

First, the world model predicts the next observation (pred_o_t) from the previous observation (o_tm1) and the action. The policy's reward is then set to the squared distance between the true next observation (o_t) and the predicted next observation (pred_o_t). The world model's own training loss is this same squared L2 distance.
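
For completeness, the world model's own update could look roughly like the sketch below. It assumes wm_train_state is a standard Flax TrainState wrapping the WorldModel (for example the model and params from the earlier sketch) with an optax optimiser; the exact training loop in the repo may differ:

import jax
import optax
from flax.training.train_state import TrainState

# Wrap the world model's parameters and optimiser in a train state.
wm_train_state = TrainState.create(
    apply_fn=model.apply, params=params, tx=optax.adam(3e-4)
)

def world_model_update(wm_train_state, o_tm1, action, o_t):
    # One gradient step on the world model's squared prediction error.
    def loss_fn(params):
        pred_o_t = wm_train_state.apply_fn(params, o_tm1, action)
        # The same squared L2 distance that is handed to the policy as its reward.
        return l2_norm_squared(o_t - pred_o_t).mean()

    loss, grads = jax.value_and_grad(loss_fn)(wm_train_state.params)
    return wm_train_state.apply_gradients(grads=grads), loss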

Doing this, we get the training plots below. We track the reward the agent receives from the environment, even though we are not training it to maximise this reward directly.

It seems that our agent learns to reach the top of the right hill after about 800k training steps. Below we visualise some environment runs after training.

As can be seen, our agent successfully learns to solve the environment without ever receiving external rewards. This is a nice result given that the sparse external rewards typically stop RL agents from learning to solve the environment.

And there you go! We have created our first self-supervised agent that can solve MountainCar. Feel free to try it out for yourself.

There are a few questions that you might still have. How did our agent learn to solve this environment? What in this intrinsic reward setup actually aligns it with the external reward? To answer this, we need to look at what values are encoded in each observation. The observation is a vector of size two containing the car's position and velocity. As the car moves faster, its next position and velocity become slightly harder to predict. The policy therefore learns to accelerate the car as much as possible, which it can only do by rocking back and forth to build momentum, and that is exactly the behaviour needed to reach the flag.

There are a few open questions remaining. What about extremely large observation spaces, or observations that contain noise? Surely this method will not scale to those environments? Furthermore, there is no guarantee that maximising this intrinsic reward also maximises the environment's external reward. These are all issues that we will try to address in the next posts.

In the next post, we will investigate a recently proposed algorithm from DeepMind called BYOL-Explore and apply it to Jumanji's Maze environment. Specifically, we will look at how to perform predictions directly in latent space instead of the noisy observation space. Stay tuned for more!
