Reinforcement Learning: Dealing with Sparse Reward Environments

Karam Daaboul
12 min read, Aug 27, 2020


Reinforcement Learning (RL) is a machine learning method in which an agent learns, through interaction with its environment, a strategy that maximizes the rewards it receives. The agent is not given a policy; it is guided only by positive and negative rewards and optimizes its behaviour based on them. Deep Reinforcement Learning (DRL) uses deep neural networks so that the agent can handle large action spaces as well as large state spaces, or the observations from which the states are derived. DRL methods have proven effective on a wide range of tasks, from playing Atari [1] to learning complex robotic manipulation tasks [2].

In many real-world scenarios, an agent faces sparse extrinsic rewards, which makes the learning objective hard to optimize. A sparse reward task is typically characterized by a very small number of states in the state space that return a feedback signal. A typical example is an agent that has to reach a goal and only receives a positive reward signal when it is close enough to the target.

Several methods have been proposed to deal with sparse reward environments. In this article, we will divide these methods into three classes and give a short description of them.

  1. Curiosity-Driven Methods:
    a. Intrinsic Curiosity Module [3,4]
    b. Curiosity in model-based RL [5]
  2. Curriculum Learning
    a. Automatic generation of easy goals [6]
    b. Learning to select easy tasks [7]
  3. Auxiliary Tasks [8]

The idea of curiosity is that an agent is intrinsically motivated to explore its environment. Curriculum learning describes the concept of building a curriculum of tasks that are simpler or easier for the agent to achieve. Auxiliary tasks are tasks solved by the agent that differ from the initial sparse reward task but improve the agent’s performance on it.

Sparse Reward Task

A simple formalization of the sparse reward task is to define the target g as solved if the distance d(s, s_g) between the agent state s and the target state s_g is less than a threshold ε. For a binary reward signal, we could describe this reward as r(s, g) = 1 if d(s, s_g) < ε, and r(s, g) = 0 otherwise.
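
As a minimal sketch of this binary reward, the function below returns 1 only when the agent is within the threshold of the target state. The Euclidean distance, the default threshold of 0.05, and the name `sparse_reward` are illustrative assumptions, not part of the original formulation.

```python
import math


def sparse_reward(state, goal_state, epsilon=0.05):
    """Binary sparse reward: 1.0 inside the goal region, 0.0 everywhere else."""
    # Euclidean distance is an assumption; any distance measure could be used.
    distance = math.dist(state, goal_state)
    return 1.0 if distance < epsilon else 0.0


# Far from the goal there is no feedback at all.
print(sparse_reward((0.0, 0.0), (1.0, 1.0)))
# Only states very close to the goal return a signal.
print(sparse_reward((1.0, 0.99), (1.0, 1.0)))
```

Almost every state the agent visits yields a reward of zero, which is exactly what makes the task hard to learn from.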

To receive the reward, the agent has to explore the environment, starting from its initial state s_0. At the same time, the agent should not lose focus while exploring and must exploit the rewards it has already collected to update its policy. This trade-off between exploration and exploitation is a long-standing problem in reinforcement learning. Mnih et al. [1] use an ε-greedy approach that switches between the greedy policy of choosing the action with the highest value and taking a random action to explore the environment.
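
The ε-greedy rule can be sketched in a few lines. The function name, the list-of-values representation, and the injectable `rng` argument are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: uniformly random action index.
        return rng.randrange(len(q_values))
    # Exploit: index of the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda action: q_values[action])


# With epsilon = 0 the choice is purely greedy (index of the largest value).
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

In a sparse reward task the value estimates stay uninformative until the first reward is found, so the random branch does most of the exploration work at the start.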

Reward Shaping

The most intuitive solution to sparse reward problems is reward shaping. Mataric formulated the idea back in 1994 [12], and it has been used widely ever since, even in more recent works like the OpenAI agent playing the advanced game Dota 2. Reward shaping means that we enhance the primary reward of the environment with additional reward features. These features shape the primary reward so that interactions are appropriately rewarded or punished, filling the gaps left by the original sparse reward.

While the approach has a few upsides, it also comes with several problems. The additional reward functions are usually hand-crafted and require human expertise to work well. Beyond the expertise needed, hand-crafted reward functions also introduce a human bias into the policies the agent can find. For a complex game like Go, it is not easy to find experts who can design reward functions that vary appropriately with the state of the game. Moreover, with hand-crafted reward functions, the agent might fail to discover new policies that humans have not found yet.
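
To make the idea concrete, here is a small sketch that adds one hand-crafted shaping term to a binary goal reward. The distance penalty, its weight, and all names are hypothetical choices for illustration; real shaping terms depend on the task.

```python
import math


def shaped_reward(state, goal_state, epsilon=0.05, distance_weight=0.1):
    """Primary sparse reward plus a hand-crafted, dense distance penalty."""
    distance = math.dist(state, goal_state)
    # Primary (sparse) reward: only given inside the goal region.
    primary = 1.0 if distance < epsilon else 0.0
    # Shaping term (an assumption): the farther from the goal, the larger the penalty.
    shaping = -distance_weight * distance
    return primary + shaping


far = shaped_reward((0.0, 0.0), (1.0, 0.0))
near = shaped_reward((0.5, 0.0), (1.0, 0.0))
print(near > far)  # moving closer is now rewarded even before the goal is reached
```

The shaping term gives the agent a gradient to follow, but it also encodes the designer's belief that moving straight toward the goal is always good, which is one form of the human bias the approach can introduce.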

Curiosity-Driven Methods

The idea behind curiosity-driven methods is that the agent is encouraged to explore the environment and visit unseen states that may eventually help solve the sparse reward task. The real-world analogy is a baby who, through curiosity, learns a task by exploring its environment. This exploration is not random but directed. In the beginning, a baby is very interested in its own body and plays with its hands and feet. As the baby grows older and more experienced, its body becomes less exciting, and it focuses more on objects in its environment. We want an agent to be curious in the same way, so that it chooses the actions that lead it to the states that are most novel to it.

Intrinsic curiosity-driven exploration by self-supervised prediction

Pathak et al. [3] address the exploration problem by encouraging the agent to seek out new states and to choose actions that reduce the error of its predictions about the consequences of its own actions. The authors propose an Intrinsic Curiosity Module (ICM) to implement curiosity in an agent. The ICM consists of two neural networks that share their early hidden layers, which embed the (pixel) observations.

  1. The dynamics (forward) model predicts the next state s_{t+1} from the current state s_t and the selected action a_t. It cannot predict the next state correctly if the current state or the chosen action is unfamiliar to it. The deviation between the model’s prediction and the actual next state is therefore used as a measure of novelty. If the agent keeps improving its predictions while at the same time seeking out states where its predictions are wrong, it will continuously take actions that lead to new states.
  2. The inverse model predicts the action a_t that caused the transition from state s_t to s_{t+1}, taking both states as network inputs. The idea is that the ICM is encouraged to embed only those features of the observations that are relevant for predicting the corresponding action. In this way, the agent does not focus on information in the input space that does not influence its choice of action.
    Note: If you have direct access to the full state space, the inverse model is no longer required.
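
The novelty measure from the first item can be sketched as a scaled squared error between the predicted and the actual next-state features. The scaling factor `eta` and the function name are assumptions for illustration; the feature vectors stand in for the outputs of the shared embedding layers.

```python
def intrinsic_reward(predicted_next_features, actual_next_features, eta=1.0):
    """Curiosity reward: scaled squared prediction error of the forward model."""
    squared_error = sum(
        (predicted - actual) ** 2
        for predicted, actual in zip(predicted_next_features, actual_next_features)
    )
    # eta (an assumed hyperparameter) scales the intrinsic reward.
    return 0.5 * eta * squared_error


# A well-predicted transition is not novel and yields no curiosity reward.
print(intrinsic_reward([1.0, 2.0], [1.0, 2.0]))
# A badly predicted transition yields a large curiosity reward.
print(intrinsic_reward([0.0, 0.0], [3.0, 4.0]))
```

As the forward model improves on frequently visited transitions, their reward shrinks, which pushes the agent toward transitions it cannot yet predict.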

The agent optimizes an objective function that consists of several parts at once.

  1. L_I: measures the discrepancy between the action predicted by the inverse dynamics model and the real action a_t. We try to minimize this term of the objective function.
  2. L_F: measures the prediction error of the forward dynamics model. Decreasing this term improves the model’s predictions.
  3. R: the expected cumulative extrinsic reward.

Here, 0 ≤ β ≤ 1 weighs the inverse model loss against the forward model loss, and λ > 0 weighs the importance of the extrinsic against the intrinsic reward signal.
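
Following the formulation in Pathak et al. [3], the three terms can be combined into a single minimization problem. The form below is a reconstruction in this article's symbols, and the parameter names θ_P, θ_I and θ_F (policy, inverse model, forward model) are notational assumptions:

```latex
\min_{\theta_P,\,\theta_I,\,\theta_F}
\Big[\, -\lambda\, R \;+\; (1-\beta)\, L_I \;+\; \beta\, L_F \,\Big]
```

The minus sign in front of R turns reward maximization into a loss, so all three terms can be minimized jointly.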

Planning to Explore via self-supervised World Models

While the previous approach used a model-free agent to solve the environments, it is also possible to use curiosity for model-based agents. Sekar et al. [5] combined model-based RL with curiosity to create an agent that explores and successfully solves sparse reward tasks.
In the first phase, the exploration phase, the agent interacts with its environment in a self-supervised way, without any extrinsic reward, to build a global world model. In the second phase, the agent receives reward functions for various specific tasks and adapts to them in a zero-shot way.
Besides using a model-based agent, Planning to Explore (Plan2Explore) addresses a few shortcomings of earlier curiosity methods.

  1. The authors argue that methods like the Intrinsic Curiosity Module rely on a model-free exploration policy, which requires a large amount of data when it is adapted to a specific task.
  2. Another downside of previous curiosity methods is that novelty is computed retrospectively, for a state the agent has just visited. The agent will then seek out states it has already seen instead of new states.
  3. Instead of seeking actions for which the agent’s prediction of the next state has a high error compared to the real next state, Sekar et al. use an ensemble of dynamics models and calculate the disagreement between their predictions of the next state.

The (high-dimensional) observations of the environment at each time step, o_t, are first encoded into features h_t. The features are then used as input for a recurrent latent state s_t. The exploration policy then returns the action that leads to a new state where the agent is currently most uncertain.
In the exploration phase, the agent continuously collects data and learns a global world model, which is then used to choose the agent’s actions for further exploration of the environment. As already mentioned, the exploration policy inside the world model estimates the novelty of a state from the disagreement of multiple dynamics models, called Latent Disagreement.
To be precise, the authors use an ensemble of one-step predictive models that each predict the next feature state h_{t+1}. The ensemble uncertainty is quantified as the variance over the predicted means of the one-step models. This variance, or disagreement, over the predicted feature states is then used as the intrinsic reward for the exploration policy.
To decide on an optimal action, Plan2Explore uses the latent dynamics model of PlaNet [10] and Dreamer [11] to efficiently learn a parametric policy inside the world model. The learned world model is then used to predict future latent states, starting from the latent states obtained by encoding images from the replay buffer.
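
The latent disagreement described above can be sketched with plain Python lists. Each ensemble member contributes one predicted feature vector for h_{t+1}; the reward is the variance across members. Using the sample variance and averaging over feature dimensions are assumptions of this sketch, as is the function name.

```python
from statistics import variance


def latent_disagreement(ensemble_predictions):
    """Intrinsic reward: variance of the ensemble's predicted next features.

    ensemble_predictions holds one predicted feature vector per ensemble member.
    """
    # Regroup so that each tuple holds all members' predictions for one dimension.
    per_dimension = zip(*ensemble_predictions)
    variances = [variance(values) for values in per_dimension]
    # Averaging over dimensions is an assumption made for this sketch.
    return sum(variances) / len(variances)


# All models agree: the state is well understood, so there is no reward.
print(latent_disagreement([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]))
# The models disagree on the first feature: the state is worth exploring.
print(latent_disagreement([[0.0, 0.0], [2.0, 0.0]]))
```

Because the disagreement is computed from predictions rather than from an observed outcome, it can be evaluated for imagined future states, which is what lets the agent plan toward novelty instead of noticing it only in retrospect.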

Curriculum learning

A different approach to solving sparse reward tasks is curriculum learning. The idea of curriculum learning in RL is to present an agent with numerous tasks in a meaningful sequence, so that the tasks get more complex over time until the agent can solve the initially given task. While several algorithms use curriculum learning in RL, we will focus on two approaches here.

Automatic Goal Generation for Reinforcement Learning

For curriculum learning, it is not only necessary to provide an agent with a range of tasks to solve, but also to provide these tasks in a meaningful order. The agent may start with a straightforward task and then solve increasingly harder tasks over the training period until it can solve the initial task. One technique for creating such an order is GoalGAN [6]. As the name indicates, the framework uses a Generative Adversarial Network (GAN) to generate goals that are solvable for the agent.
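
One way to decide which goals are currently suitable, in the spirit of [6], is to estimate the agent's success rate per goal and keep only goals of intermediate difficulty. The sketch below shows that labelling step only; the thresholds, the dictionary format, and the function name are hypothetical, and the GAN that would be trained on these labels is omitted.

```python
def label_goals(success_rates, r_min=0.1, r_max=0.9):
    """Mark goals that are neither too hard nor already mastered.

    success_rates maps a goal identifier to the fraction of successful rollouts.
    r_min and r_max are assumed thresholds, not values taken from the article.
    """
    return {goal: r_min <= rate <= r_max for goal, rate in success_rates.items()}


labels = label_goals({"near": 0.95, "medium": 0.5, "far": 0.0})
print(labels)  # only the "medium" goal is labelled as suitable
```

Goals labelled as suitable could then serve as positive examples for a goal generator, so the curriculum moves outward as the agent's abilities grow.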

Auxiliary Tasks

Jaderberg et al. [8] proposed an extension of the reward tasks with auxiliary tasks that the agent has to solve during training. While this sounds similar to the approach of Riedmiller et al. [7], the auxiliary tasks selected here are not based on the main task in the form of a curriculum. Instead, the tasks can be differentiated into auxiliary control and auxiliary reward prediction tasks. Auxiliary control tasks can be of two types:

  1. Pixel Changes: The idea is that rapidly changing pixels are an indicator of significant events. The agent will try to control the change of pixels by choosing the right actions.
  2. Network Features: Here, the agent will try to control the activation of the hidden layers of its value and policy network. Because those networks typically can extract high-level features, it could be useful if the agent can control their activation.

The auxiliary control and reward prediction tasks are combined with an agent that uses the A3C algorithm [9] to optimize a shared objective function. Since the layers of the neural network used to solve the main and auxiliary tasks are shared, the agent improves on all tasks.

The environment the authors are trying to solve is a labyrinth through which the agent has to navigate, receiving rewards only when it reaches the goal. For this specific task, the authors propose three auxiliary tasks that help the agent navigate the sparse environment successfully:

  1. Pixel Control: An auxiliary policy is trained to maximize the change of pixel intensity in different regions of the input image.
  2. Reward Prediction: Given three frames from the replay buffer, the network tries to predict the reward for the next, unseen frame. Because the reward is sparse, skewed sampling is applied so that sequences containing a reward are sampled more often. The purpose of the reward predictor is solely to shape the feature layers of the agent that convert the high-dimensional input space into a low-dimensional latent space.
  3. Value Function Replay: The agent’s value function is additionally trained on samples from the replay buffer, besides the on-policy value function training in A3C. The value iteration is performed on sequences of varying length and exploits the newly discovered features shaped by the reward predictor. (It is unclear whether Jaderberg et al. see Value Function Replay as an additional task or rather as an optimization strategy for training.)
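
The pseudo-reward behind pixel control can be sketched as the average absolute intensity change per image region. Frames are represented here as nested lists of grey values; the grid layout, the cell size, and the function name are assumptions for illustration.

```python
def pixel_change_rewards(prev_frame, next_frame, cell_size):
    """Average absolute pixel change for each cell of a grid over the image."""
    height, width = len(prev_frame), len(prev_frame[0])
    rewards = []
    for top in range(0, height, cell_size):
        row = []
        for left in range(0, width, cell_size):
            # Collect the absolute changes of all pixels inside this cell.
            diffs = [
                abs(next_frame[y][x] - prev_frame[y][x])
                for y in range(top, min(top + cell_size, height))
                for x in range(left, min(left + cell_size, width))
            ]
            row.append(sum(diffs) / len(diffs))
        rewards.append(row)
    return rewards


prev = [[0.0] * 4 for _ in range(4)]
nxt = [line[:] for line in prev]
nxt[0][0] = 1.0  # a single pixel changes in the top-left region
print(pixel_change_rewards(prev, nxt, cell_size=2))
```

An auxiliary policy trained on these per-cell values learns which actions cause visible change in which part of the image, a signal that is available at every step even when the extrinsic reward is not.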

Though they share the same layers, the tasks are not solved at the same time on the same data. Instead, the authors propose a replay buffer that stores the observations the base A3C agent has already visited. The resulting UNREAL (UNsupervised REinforcement and Auxiliary Learning) agent combines two separate deep RL techniques. The primary policy is trained with A3C, which means it is updated online using policy gradient methods. Its network is a recurrent neural network, which enables it to encode a history of states. The auxiliary tasks, on the other hand, are trained on sequences of experience that are stored in the replay buffer and sampled explicitly for each task. These tasks are trained off-policy by Q-learning and use simple feed-forward architectures to ensure maximum efficiency.
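
Task-specific sampling from the replay buffer, such as the skewed sampling used for reward prediction, could look like the sketch below. The dictionary format of the stored transitions, the 50% target ratio, and the function name are assumptions; the function expects a non-empty buffer.

```python
import random


def sample_skewed(replay_buffer, rewarding_probability=0.5, rng=random):
    """Sample a transition, over-representing the rare rewarding ones."""
    rewarding = [t for t in replay_buffer if t["reward"] != 0]
    non_rewarding = [t for t in replay_buffer if t["reward"] == 0]
    # Decide first which group to draw from, then draw uniformly inside it.
    if rewarding and rng.random() < rewarding_probability:
        return rng.choice(rewarding)
    if non_rewarding:
        return rng.choice(non_rewarding)
    return rng.choice(rewarding)


# Only 1% of the stored transitions carry a reward ...
buffer = [{"reward": 0.0} for _ in range(99)] + [{"reward": 1.0}]
rng = random.Random(0)
samples = [sample_skewed(buffer, 0.5, rng) for _ in range(2000)]
# ... but roughly half of the sampled ones do.
print(sum(s["reward"] != 0 for s in samples) / len(samples))
```

Without the skew, a reward predictor trained on a sparse environment would see almost only zero-reward examples and could reach low error by always predicting zero.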


This article has examined five approaches to solving sparse reward environments effectively. Almost all of these methods need little or no human interaction to improve the agent’s performance on sparse reward environments, which avoids the main problem of reward shaping described earlier.
Curiosity-driven methods do not need any prior design of tasks, reward functions, or anything similar. They work very well in environments where the main goal is to explore continuously or to stay alive for a long time while the state keeps changing. In such environments, an agent can even be trained entirely on the intrinsic reward signal, with no need for an external reward function. However, curiosity-based methods show weaknesses in environments that are not based on these principles, where strictly following the intrinsic reward signal can lead to worse performance than a random policy.
Curriculum learning, on the other hand, either needs a human to design specific auxiliary tasks or places restrictive requirements on the task to make human interaction unnecessary.
For auxiliary tasks, there are no restrictions on the environments in which they can be used.
Nevertheless, the tasks used for a specific environment have to be selected by hand before training, because a poorly chosen task can be useless or even harmful to the agent’s performance.
While several approaches exist, no single approach seems to apply to every setting. However, Burda et al. [4] show that curiosity applies to a wide range of environments. So when we evaluate the methods on overall performance and user-friendliness, curiosity-based methods seem to have a slight advantage over the others, as they need no prior human knowledge apart from the main goal and the design of the environment. Beyond that, each approach has a specific field where it performs very well, so before deciding on a method, the user should always consider several of them to reach the best possible performance.


[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning.”
[2] A. Singh, L. Yang, K. Hartikainen, C. Finn, and S. Levine, “End-to-end robotic reinforcement learning without reward engineering.”
[3] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction.”
[4] Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros, “Large-scale study of curiosity-driven learning.”
[5] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models.”
[6] C. Florensa, D. Held, X. Geng, and P. Abbeel, “Automatic goal generation for reinforcement learning agents.”
[7] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing: solving sparse reward tasks from scratch.”
[8] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks.”
[9] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning.”
[10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels.”
[11] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviours by latent imagination.”
[12] M. J. Mataric, “Reward functions for accelerated learning.”