Curious Agents V: Large Sparse-Reward Environments

David de Villiers
InstaDeep
Nov 7, 2023

Welcome back to our series investigating promising self-supervised reinforcement learning (RL) methods. In previous posts, we applied curiosity-driven RL algorithms to several environments; the earlier posts are linked below:

Curious Agents: An Introduction
Curious Agents II: Solving MountainCar without Rewards
Curious Agents III: BYOL-Explore
Curious Agents IV: BYOL-Hindsight

In the last post, we applied the BYOL-Hindsight algorithm to a custom, simplified JAX-based Minecraft environment, and showed that the agent is robust against stochastic transitions. Now, we investigate whether curiosity-driven RL techniques are feasible in a large, open-world environment.

To accomplish this, we modified the existing Minecraft2D environment to mimic the terrain generation rules of Crafter, an open-world survival game designed to evaluate a wide range of agent abilities within a single environment. This results in randomly generated 64×64 terrain like the example below. The collectable resources and level progression remain unchanged; however, cosmetic blocks (stone, dirt, grass and sand) are introduced, along with lava and water. Water cannot be traversed, while lava immediately ends a run.
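
To give a feel for what this involves, below is a minimal, hypothetical terrain-generation sketch in JAX. The block IDs, noise scheme and thresholds are illustrative assumptions on our part, not the actual Minecraft2D generator, which follows Crafter's rules more closely.

import jax
import jax.numpy as jnp

GRID = 64
WATER, SAND, GRASS, DIRT, STONE, LAVA, TREE, IRON, DIAMOND = range(9)

def generate_terrain(key: jax.Array) -> jax.Array:
    """Return a (64, 64) grid of block IDs for one randomly generated world."""
    k_elev, k_res = jax.random.split(key)
    # Low-frequency "elevation" map: coarse noise upsampled to the full grid.
    coarse = jax.random.uniform(k_elev, (8, 8))
    elevation = jax.image.resize(coarse, (GRID, GRID), method="bilinear")
    # Threshold the elevation into terrain bands, with lava in the highest band.
    terrain = jnp.select(
        [elevation < 0.20, elevation < 0.25, elevation < 0.60,
         elevation < 0.80, elevation < 0.97],
        [WATER, SAND, GRASS, DIRT, STONE],
        default=LAVA,
    )
    # Scatter resources with decreasing probability: trees (wooden logs) are
    # common, iron ore is rare, and only a block or two of diamond ore appear.
    res = jax.random.uniform(k_res, (GRID, GRID))
    terrain = jnp.where((terrain == GRASS) & (res < 0.05), TREE, terrain)
    terrain = jnp.where((terrain == STONE) & (res < 0.01), IRON, terrain)
    terrain = jnp.where((terrain == STONE) & (res > 0.998), DIAMOND, terrain)
    return terrain

world = generate_terrain(jax.random.PRNGKey(0))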

An example of the randomly generated Minecraft2D environment

The wooden log is the most common resource, while cobblestone and iron ore are rare, and every environment includes only one or two diamond ore blocks. Overall, this creates an exciting challenge for the agent, requiring careful, deep exploration to complete all of the environment's tasks. As a reminder, the table below displays the tasks the agent needs to complete in order to solve the environment.

A table showing the agent’s various tasks in the Minecraft2D environment, and their corresponding levels.
For the agent to progress through the environment and obtain the diamond, it must first complete several tasks, such as collecting resources or creating equipment. In order to reach any level, all previous levels’ tasks must have been completed.
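
This prerequisite structure is simple to express in code. The sketch below is an illustrative assumption of how the current level could be derived from a vector of completed tasks; it is not the repository's actual implementation.

import jax.numpy as jnp

def current_level(tasks_done: jnp.ndarray) -> jnp.ndarray:
    """tasks_done[i] is True iff level i's task has been completed.

    The agent's level is the length of the leading run of completed tasks,
    so later tasks only count once every earlier one is done.
    """
    # cumprod turns [1, 1, 0, 1] into [1, 1, 0, 0].
    return jnp.sum(jnp.cumprod(tasks_done.astype(jnp.int32)))

print(current_level(jnp.array([True, True, False, True])))  # -> 2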

To determine whether curiosity-driven RL agents can solve this environment, we first show that an extrinsically rewarded recurrent proximal policy optimization (PPO-RNN) agent breaks down as rewards are made increasingly sparse. We first ran PPO-RNN in the Minecraft2D environment in a reward-rich configuration, where it receives a reward every time it progresses a level. We then made the reward increasingly sparse by rewarding the agent only every second, third or fourth level, then only upon reaching the 10th and final level, and finally not at all.
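
The sparsification itself amounts to masking the level-up reward. The sketch below is a hedged illustration (the function name and signature are our own); zeroing every reward recovers the reward-free setting.

import jax.numpy as jnp

def sparsify_reward(level_reached: jnp.ndarray, reward: jnp.ndarray,
                    every_n: int) -> jnp.ndarray:
    """Keep the level-up reward only when the newly reached level is a
    multiple of every_n: every_n=1 is the reward-rich setting, while
    every_n=10 rewards only the final level."""
    keep = (level_reached % every_n) == 0
    return jnp.where(keep, reward, 0.0)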

Although rewarding only every second, third or fourth level did not affect PPO-RNN's performance, it degraded significantly as the rewards were made sparser still. The agent was unable to learn any useful behaviours when rewarded only upon reaching the 10th level, or when placed in a reward-free environment. This shows that a different approach is needed in sparse-reward environments.

Graphs showing the maximum and average returns per episode of the PPO-RNN agent in increasingly sparse-reward environments
The PPO-RNN agent’s performance declined as the reward was made increasingly sparse

We adapted the BYOL-Hindsight code from the previous post to include the architecture from the original paper. This involved adding closed- and open-loop RNNs to update the agent's current belief about the environment and to simulate future actions and beliefs. For the complete implementation, visit the project's GitHub page.

As a brief overview, BYOL-Hindsight's architecture is displayed in the figure below. It consists of a closed-loop RNN that takes in previously sampled actions and online observation embeddings to form a belief of the current timestep, b_t. This belief, along with the next timestep's target embedding and the current action, is passed into a generator p_θ (not pictured here), which produces a hindsight vector Z. After computing the open-loop belief b_{t, i} from the current action and the previous timestep's belief, this belief and the hindsight vector are used to predict the embedding of the agent's next observation, ŵ_{t, i}. This prediction is compared to the target embedding to obtain the agent's prediction loss, which is used as a source of intrinsic motivation, allowing the agent to learn with no extrinsic rewards from the environment.
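
To make the data flow concrete, here is a minimal, single-step sketch of this pipeline in JAX/Flax. The GRU and dense layers are stand-ins and the names are our own assumptions; the repository contains the full multi-step implementation.

import jax.numpy as jnp
import flax.linen as nn

class ByolHindsightStep(nn.Module):
    """One timestep of the belief / hindsight / prediction pipeline."""
    belief_dim: int = 128

    @nn.compact
    def __call__(self, belief, prev_action, action, online_emb, target_emb_next):
        # Closed-loop belief update: previous belief, last action and the
        # current online observation embedding produce b_t.
        belief, _ = nn.GRUCell(features=self.belief_dim)(
            belief, jnp.concatenate([prev_action, online_emb], axis=-1))
        # Hindsight generator p_theta: conditioned on the *future* target
        # embedding, it produces the hindsight vector Z.
        z = nn.Dense(self.belief_dim)(
            jnp.concatenate([belief, action, target_emb_next], axis=-1))
        # Open-loop belief computed from the current action and the belief only.
        open_belief = nn.Dense(self.belief_dim)(
            jnp.concatenate([belief, action], axis=-1))
        # Predict the next target embedding from the open-loop belief and Z.
        pred = nn.Dense(target_emb_next.shape[-1])(
            jnp.concatenate([open_belief, z], axis=-1))
        # Normalised prediction error: the world-model loss, used as the
        # intrinsic reward.
        pred_n = pred / (jnp.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
        targ_n = target_emb_next / (
            jnp.linalg.norm(target_emb_next, axis=-1, keepdims=True) + 1e-8)
        loss = jnp.sum((pred_n - targ_n) ** 2, axis=-1)
        return belief, loss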

BYOL-Hindsight architecture
BYOL-Hindsight architecture. Figure adapted from the original paper.

After upgrading the BYOL-Hindsight implementation, we run it in the reward-free configuration of Minecraft2D alongside two curious baselines: Random Network Distillation (RND) and BYOL-Explore. None of these agents receives any extrinsic reward; they explore the environment purely through intrinsic motivation. The agents' performance is displayed in the figure below, where PPO-RNN's reward-rich run is also included to indicate the performance achievable within the environment. While no curious agent matches PPO-RNN, the maximum episodic returns of BYOL-Hindsight and RND are promising, although the average episodic return shows that these agents still lag behind in most cases.
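
For reference, the RND baseline's intrinsic reward is conceptually simple: a trained predictor tries to match a fixed, randomly initialised target network, and the prediction error is the exploration bonus. The sketch below is an illustrative, simplified version with assumed names and network sizes, not the exact code used in our runs.

import jax
import jax.numpy as jnp
import flax.linen as nn

class RNDNet(nn.Module):
    """Small MLP used for both the frozen target and the trained predictor."""
    out_dim: int = 64

    @nn.compact
    def __call__(self, obs):
        x = nn.relu(nn.Dense(256)(obs))
        return nn.Dense(self.out_dim)(x)

def rnd_bonus(net: RNDNet, predictor_params, target_params, obs):
    """Intrinsic reward: squared error between the trained predictor and the
    frozen target network. Rarely seen observations are poorly predicted and
    therefore receive a larger bonus."""
    pred = net.apply(predictor_params, obs)
    target = jax.lax.stop_gradient(net.apply(target_params, obs))
    return jnp.sum((pred - target) ** 2, axis=-1)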

A graph showing the performance of curious agents in a reward-free Minecraft2D environment
Over 50M timesteps, BYOL-Hindsight outperforms other curiosity-based alternatives, although it fails to match PPO-RNN’s performance in a reward-rich environment

The most important part of this result, however, is that BYOL-Hindsight is able to reach the maximum level of the environment, which is achieved by mining a diamond and crafting a diamond pickaxe! This shows that curiosity-driven RL is a viable option in sparse-reward or reward-free settings such as this one.

Being stochastic, the Minecraft2D environment includes aspects that can be learned, referred to as epistemic knowledge, as well as aspects that cannot be learned, referred to as aleatoric noise. In theory, BYOL-Hindsight should be able to distinguish between these sources of intrinsic reward; Curious Agents IV explored this concept with the previous version of Minecraft2D. We want to determine whether BYOL-Hindsight is able to extract all epistemic knowledge from the environment, and if so, at which point during training this happens. At that point, the agent would stop trying to obtain the maximum reward from the environment and would instead search for novel experiences to satisfy its curiosity.

The figures below show BYOL-Hindsight's maximum and average episodic returns and average step length per episode over 450M timesteps. The agent's episodic return starts to decline after around 100M timesteps. Towards the end of training, the episodic return consistently remains close to zero, while the agent's average step length remains around 128, indicating that the agent still searches for novel experiences to learn from while avoiding actions that immediately end the current run. This is an important result, as it shows that BYOL-Hindsight does not prioritise random transitions, as seen in Curious Agents III.

Graphs showing the maximum and average episodic returns of BYOL-Hindsight over 450M training steps
BYOL-Hindsight achieves consistently high episodic return levels for around 100M timesteps, after which these returns decline, perhaps indicating that the agent has started looking for novel experiences.
BYOL-Hindsight’s average step length per episode across all timesteps

Next, we inspect the loss value that the world model assigns to the different level transitions, which occur when the agent progresses from one level to the next by mining a resource or crafting a tool. BYOL-Hindsight might allocate higher intrinsic rewards, i.e. higher loss values, to later levels' transitions. However, as seen in the figure below, the world model loss converges to a similar value for every level transition. It initially struggles to model later transitions, from level seven onwards, and level 10 is not reached often enough to draw meaningful conclusions.
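
The bookkeeping behind this analysis is straightforward: each recorded world-model loss is bucketed by the level transition at which it occurred. The helper below is a hedged sketch with assumed names and shapes, not the project's actual logging code.

import jax
import jax.numpy as jnp

def mean_loss_per_level(losses: jnp.ndarray, levels: jnp.ndarray,
                        num_levels: int = 10) -> jnp.ndarray:
    """Average world-model loss for each level transition in a batch.

    losses: (batch,) world-model loss per transition.
    levels: (batch,) integer level reached, in 0..num_levels-1.
    """
    one_hot = jax.nn.one_hot(levels, num_levels)       # (batch, num_levels)
    counts = one_hot.sum(axis=0)
    sums = (one_hot * losses[:, None]).sum(axis=0)
    return sums / jnp.maximum(counts, 1.0)             # avoid division by zero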

Although the agent initially struggles to model certain level transitions, such as an environment reset, it eventually allocates an approximately equal world model loss value to all levels. Loss value spikes in later runs might be due to the agent’s underlying policy architecture.

After conducting a thorough analysis of the agent’s performance in the Minecraft2D environment, we can now visualise the behaviour of the agent in the gif below. Training the agent and obtaining the visualisation can be done by modifying the main.yaml file to run the BYOL-Hindsight agent:

# === PPO BYOL HINDSIGHT RNN (MINECRAFT)
defaults:
- agent: ppo_byol_hindsight_rnn_mc
name: CuriousAgents
env_name: Minecraft2D
train: True
load_state: False
visualise: True
visualisation_steps: 500
log_dir: "./logs"
training_steps: 10e6
seed: 1234

The agent does struggle to make progress at times, moving back and forth between tiles, but eventually manages to mine the diamond and complete the final level. This behaviour could be explained by the fact that the current implementation uses a feed-forward rather than a recurrent policy.

A visualisation of the BYOL-Hindsight agent obtaining a diamond and completing the last challenge of the Minecraft2D environment.

This concludes this post on curiosity-driven RL in a large open-world environment! Feel free to try the code out for yourself. Happy coding!
