Curiosity-Driven Exploration in Reinforcement Learning

Solving the sparse reward problem in reinforcement learning environments

Mehul Gupta
Data Science in your pocket

By now, we have covered many Reinforcement Learning concepts, from the basic methodologies to some algorithms from the Deep Reinforcement Learning side, all of which you can read in the ‘Reinforcement Learning’ section below.

Moving on, in this post we will try to solve the sparse reward problem, which we are likely to face in many Reinforcement Learning environments.

What is Sparse Reward?

So, in some environments, we won’t get rewarded for every action we take. Take Mario: if the agent is rewarded only once, after it completes a level, training becomes much harder. For most of the actions taken there is no reward (i.e. no guiding force), and the agent has no clue whether it has taken the right step or not.

Now assume the agent is naive, i.e. it has no training at all. In that case, it has to take a long sequence of random steps to reach the end of the level, and only then will it finally get some reward for guidance. Gathering experience that contains any reward at all is challenging in such scenarios.

You really need to pray hard for such an agent to learn, as it is highly unlikely that, on its own, the agent will take a sequence of random steps that leads all the way to the end of the level. In short,

The probability of such an event is very low

So, what do we do when the environment is designed in such a way that rewards are rare? Can we find some other source of reward?

Curiosity-driven learning

The whole idea of curiosity-driven learning is to generate a secondary reward, based on curiosity, for the agent in environments where rewards aren’t frequent.

What is Curiosity reward in this context?

So, in curiosity-driven learning, we divide the reward into two parts:

  • Primary reward (Extrinsic reward): The reward the agent gets on completing the episode or reaching some checkpoint (hence sparse). You are not rewarded for every action taken. This reward comes from the environment, hence it is called extrinsic.
  • Curiosity reward (Intrinsic reward): The reward the agent gets for exploring newer areas, hence rewarding curiosity. The more the actual next state differs from the expected one, the higher the reward. This reward is given for every action, and its magnitude depends on the newness of the next state. It comes from the agent’s own motivation to explore new things, hence intrinsic.
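
To make this concrete, here is a tiny sketch of the per-step reward the agent ends up optimizing. The scaling factor beta below is just an illustrative knob, not something fixed by this post.

```python
# Per-step reward under curiosity-driven learning (sketch).
# `beta` controls how much the curiosity bonus matters; 0.2 is arbitrary here.
def total_reward(extrinsic_reward: float, intrinsic_reward: float, beta: float = 0.2) -> float:
    # The extrinsic part is often 0 (sparse); the intrinsic part exists at every step.
    return extrinsic_reward + beta * intrinsic_reward
```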

How to calculate this curiosity reward?

For this, we will have a model that takes the current state S and the action A taken as input and outputs a predicted next state. The difference between the predicted next state and the actual next state (the one we get from the environment after taking action A in state S) becomes the curiosity reward. We will call this neural network the Forward Prediction model. It takes an encoded version of the state, generated by an Encoder model discussed later in this post. Why is the encoded version required? We will find out soon.
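
To make this a bit more concrete, here is a minimal sketch of what such a Forward Prediction model and the curiosity reward derived from it could look like. I am assuming PyTorch, a discrete one-hot encoded action, and illustrative values for feature_dim and action_dim; none of these specifics are prescribed above.

```python
import torch
import torch.nn as nn

# Forward Prediction model (sketch): works on the *encoded* state produced by
# the Encoder discussed later in the post.
class ForwardModel(nn.Module):
    def __init__(self, feature_dim=256, action_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, state_feat, action_onehot):
        # Predict the encoded next state from the encoded current state + action.
        return self.net(torch.cat([state_feat, action_onehot], dim=-1))


def curiosity_reward(forward_model, state_feat, action_onehot, next_state_feat):
    # Intrinsic reward = prediction error between the predicted and the actual
    # encoded next state; no gradients are needed for the reward itself.
    with torch.no_grad():
        pred_next = forward_model(state_feat, action_onehot)
        return 0.5 * (pred_next - next_state_feat).pow(2).mean(dim=-1)
```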

This looks easy

There are two major issues that need to be considered:

  • Noisy TV problem
  • Trivial randomness

Noisy TV problem

As you saw, a curiosity reward is calculated, and the whole idea of any reinforcement learning setup is to maximize the reward. Right? Since the intrinsic reward is the prediction error between the predicted and the actual next_state, the agent can maximize it by seeking out parts of the environment whose next states are inherently random and hence impossible to predict, similar to a noisy TV where the whole screen is filled with random pixels. The prediction error there never goes down, so the agent can end up staring at the noise forever instead of making progress!

Trivial randomness

Another issue with using curiosity rewards is that a complex environment like Mario contains multiple elements that have nothing to do with the agent: birds and clouds, for instance, are present purely for aesthetic purposes. A high prediction error on these elements is not something we are interested in, and it can distract the agent from focusing on the major elements of the environment. Hence, we somehow want to account only for the elements that matter in the context of the task, like the villain, the hero, etc.

These two issues are bigger than they appear and require some workaround.

Inverse Dynamics

So, here we introduce two more neural networks into curiosity-driven reinforcement learning, apart from the Forward Prediction model:

  • Inverse Dynamics model: This model takes the state S and the next state S' as input and predicts the action taken. We won’t feed the raw states directly, though, but their encoded versions.
  • Encoder model: Encodes states into a lower-dimensional embedding so that trivial information gets compressed away.

Note 1: We won’t train the Encoder model separately but together with the Inverse Dynamics model; the error of the Inverse Dynamics model is backpropagated into the Encoder as well.

Note 2: We use the same Encoder model in the Forward Prediction model as well for encoding states, but the Encoder is not trained there.
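
For illustration, here is what these two networks might look like, continuing the PyTorch sketch above. The single-channel image input, the convolutional stack, and the layer sizes are my own illustrative assumptions.

```python
import torch
import torch.nn as nn

# Encoder (sketch): compresses a raw game frame into a small feature vector,
# so trivial details (clouds, birds, ...) carry little weight downstream.
class Encoder(nn.Module):
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1), nn.ELU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(feature_dim)  # infers the flattened size on first call

    def forward(self, state):
        return self.fc(self.conv(state))


# Inverse Dynamics model (sketch): given phi(S) and phi(S'), predict the action taken.
class InverseModel(nn.Module):
    def __init__(self, feature_dim=256, action_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),  # logits over the discrete actions
        )

    def forward(self, state_feat, next_state_feat):
        return self.net(torch.cat([state_feat, next_state_feat], dim=-1))
```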

It’s time to wrap up all the segments and form our final Intrinsic Curiosity Module (ICM), which works something like this:

  • The Encoder model generates embeddings for the current state S and the next state S'.
  • Feed these embeddings to both 1) the Inverse Dynamics model and 2) the Forward Prediction model.
  • Train the Encoder model using the error from the Inverse Dynamics model only. Let the Forward Prediction model and the Inverse Dynamics model learn from their respective losses.

So, if you notice, the Inverse Dynamics network is used just to train the Encoder. It’s actually the Forward Prediction network and the Encoder that we need in the curiosity module, and together they help us overcome the two issues above.
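
Putting the pieces together, one ICM update could look roughly like this, continuing the sketches above. Using a single optimizer over all three networks and weighting the two losses equally are illustrative choices, not requirements.

```python
import torch.nn.functional as F

def icm_step(encoder, inverse_model, forward_model, optimizer,
             state, next_state, action_onehot, action_idx):
    # `action_onehot`: one-hot encoded actions; `action_idx`: the same actions
    # as integer class indices (batched tensors assumed throughout).
    phi_s = encoder(state)
    phi_s_next = encoder(next_state)

    # Inverse dynamics loss: its gradients flow back into the Encoder too (Note 1).
    action_logits = inverse_model(phi_s, phi_s_next)
    inverse_loss = F.cross_entropy(action_logits, action_idx)

    # Forward prediction loss: features are detached so the Encoder is NOT
    # trained through this loss (Note 2).
    pred_next = forward_model(phi_s.detach(), action_onehot)
    forward_loss = 0.5 * (pred_next - phi_s_next.detach()).pow(2).mean()

    # The per-sample prediction error doubles as the curiosity (intrinsic) reward.
    intrinsic_reward = 0.5 * (pred_next.detach() - phi_s_next.detach()).pow(2).mean(dim=-1)

    optimizer.zero_grad()
    (inverse_loss + forward_loss).backward()
    optimizer.step()
    return intrinsic_reward
```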

This module, called the ICM (Intrinsic Curiosity Module), can be integrated with any of the existing deep reinforcement learning algorithms like DQN, A2C, DDPG, REINFORCE, etc. that we have already discussed in my previous posts, and it can be of great help in resolving the sparse reward problem. Assuming you are integrating it with DQN, REINFORCE, or A2C to train Mario using ICM, the overall setup looks something like this.
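
In code, the integration might look roughly like the sketch below. Here env, agent, and the icm callable are placeholders for whatever Gym-style environment, existing deep RL agent, and curiosity module (built, for example, from icm_step above) you happen to use; the names act and observe and the scale eta are assumptions for illustration only, with icm returning the intrinsic reward as a float.

```python
def run_episode(env, agent, icm, eta=0.01):
    # One episode of any Gym-style agent, with the ICM bonus added to the reward.
    state = env.reset()
    done = False
    while not done:
        action = agent.act(state)                        # RL algorithm unchanged
        next_state, extrinsic_reward, done, _ = env.step(action)

        # The ICM only changes the reward the agent sees; nothing else.
        intrinsic_reward = icm(state, action, next_state)
        reward = extrinsic_reward + eta * intrinsic_reward

        agent.observe(state, action, reward, next_state, done)  # store / learn as usual
        state = next_state
```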

Here you can observe that:

  • ICM can be integrated with any existing deep RL algorithm.
  • ICM affects the final reward only; everything else remains the same.
  • The output of the Inverse Dynamics model isn’t used anywhere to train the actual deep RL model.

With this, I will wrap up this post. See you soon. Until then, you can explore the YouTube playlists below.
