Curiosity-Driven Learning with OpenAI and Keras

A guide for building an ICM agent using Keras and OpenAI Gym

Gili Karni
The Startup
Apr 29, 2020 · 8 min read


Curiosity lands on Mars (picture from here)

This tutorial will:

  • Provide a theoretical review of key features of the Curiosity-Driven Learning algorithm;
  • Describe, in steps, how to implement an ICM agent.

You can find the full version of the code here.

Who this is for: This tutorial is for anyone interested in advanced RL algorithms. It assumes prior knowledge of RL, but no knowledge of the ICM. To follow the implementation, you should be comfortable with Python and Keras.

Curiosity-Driven Learning

The benefits of Reinforcement Learning (RL) go without saying these days. The reward, i.e. the feedback given for different actions, is a crucial property of RL. Recall the reward hypothesis, which states that maximizing cumulative reward is what drives the agent toward a better action-taking policy. Thus, you can imagine that an RL agent would struggle in an environment where external feedback is sparse or nonexistent. Briefly, agents use rewards to adjust their behavior and would be stuck without them. Recent models propose to overcome this problem using an intrinsic reward mechanism. The idea of intrinsically-driven learning, better known as curiosity, enables the agent to teach itself by being rewarded for discovering new states.

Why is it important? First, some environments by definition generate very little reward, e.g., the snake game. Second (and probably more important), an intrinsic reward lets RL scale to big and complex environments without depending on a human-made reward function.

The Intrinsic Curiosity Module (ICM)

A few approaches to implementing curiosity in RL have been proposed recently. Common to them all is the integration of an intrinsic reward function that guides the agent toward choosing better actions. This intrinsic reward (IR) reflects the prediction error of the next state, s_{t+1}, given the current state, s_t. Thus, maximizing this reward pushes the agent toward states that are harder to predict, i.e., toward exploring unknown trajectories.

According to the Intrinsic Curiosity Module (ICM), the intrinsic reward at time t equals the L2-normed difference between the predicted feature vector of the next state and the actual feature vector of the next state, i.e., the gap between the agent's prediction and what actually happens.

The intrinsic reward function from Pathak et al. (2017), annotated by Simonini (2019)

To obtain these two quantities, the predicted feature vector of the next state and the actual one, the ICM implements two sub-models: the forward model and the inverse model.

(1) The forward model (f) aims to predict the feature representation of the next state, s_{t+1}, given the feature vector of the current state, s_t, and the action, a_t.

The forward model from Pathak et al. (2017), annotated by Simonini (2019)
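In the paper's notation, where \phi denotes the learned feature encoding and \theta_F the forward model's parameters, the forward model and its loss read:

\hat{\phi}(s_{t+1}) = f(\phi(s_t), a_t; \theta_F), \qquad L_F = \frac{1}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2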

(2) The inverse model (g) tries to predict the action taken, \hat{a}_t, from the current state, s_t, and the next state, s_{t+1}.

The inverse model from Pathak et al. (2017), annotated by Simonini (2019)
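In the same notation, with \theta_I the inverse model's parameters, the inverse model predicts

\hat{a}_t = g(\phi(s_t), \phi(s_{t+1}); \theta_I)

and its loss L_I is the cross-entropy between \hat{a}_t and the true action a_t, since the action space is discrete.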

In mathematical terms, curiosity is the difference between the predicted feature vector of the next state and the real feature vector of the next state.

The intrinsic reward from Pathak et al. (2017), annotated by Simonini (2019)
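Written out, with \eta > 0 a scaling factor, the intrinsic reward is:

r_t^i = \frac{\eta}{2} \left\| \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \right\|_2^2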

Next, let us review the overall optimization problem of the ICM agent. It encompasses four elements.

  • The inverse loss, which is the cross-entropy between the predicted action and the true action;
  • The forward loss, which measures the L2 normed difference between the predicted next state and the true next state;
  • A policy gradient loss;
  • and the intrinsic reward.

Notice the two parameters:

  • β (beta) weighs the forward-model loss against the inverse-model loss.
  • λ (lambda) weighs the importance of the policy gradient loss against the importance of learning the intrinsic reward signal.

The loss function from Pathak et al. (2017), annotated by Simonini (2019)
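Putting the four elements together, the overall optimization problem from the paper (with \theta_P, \theta_I, \theta_F the parameters of the policy, the inverse model, and the forward model, and r_t the sum of the extrinsic and intrinsic rewards) reads:

\min_{\theta_P, \theta_I, \theta_F} \left[ -\lambda \, \mathbb{E}_{\pi(s_t; \theta_P)}\left[ \sum_t r_t \right] + (1 - \beta) L_I + \beta L_F \right]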

Bringing It Together

The agent in the current state, s_t, interacts with the environment by executing an action, a_t, which is sampled from the agent’s current policy, π. This action leads to the next state, s_{t+1}. The policy optimizes the total reward, given extrinsically by the environment, r_t^e, and intrinsically by the ICM model, r_t^i.

A schematic overview of the ICM model. Figure from Pathak et al. (2017)

Using curiosity, the agent favors actions with a high prediction error (i.e., less-visited or more complex transitions) and therefore explores the environment better.

Implementing the ICM Algorithm

I use Keras to build the model and an environment from the OpenAI Gym.

Setting the Environment Configuration

First, we set the configuration parameters. Importantly, this step ensures the reproducibility of the code below by fixing a random seed.

At initialization, the ICM agent object sets a few parameters. First is the environment in which the model learns, together with its properties. Second are the two parameters of the ICM algorithm, lambda (λ) and beta (β): lambda, as explained above, weighs the policy gradient loss against the intrinsic reward, and beta weighs the forward loss against the inverse loss. Third, it sets the learning parameters (total games, steps per game, and the training batch size). Fourth, it builds and compiles the ICM model. Lastly, it allocates positions and rewards arrays for storing the learning progress for later plotting.
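The full constructor lives in the repository; the sketch below only shows its general shape, using illustrative names and values (build_icm_model stands in for the model-building step described in the next section).

```python
import numpy as np
import gym


class ICMAgent:
    # A minimal sketch of the agent's initialization; names are illustrative.
    def __init__(self, env_name="MountainCar-v0", lmd=0.1, beta=0.2,
                 total_games=200, steps_per_game=1000, batch_size=32, seed=42):
        # reproducibility: fix the random seeds for numpy and the environment
        np.random.seed(seed)
        self.env = gym.make(env_name)
        self.env.seed(seed)

        # environment properties
        self.n_actions = self.env.action_space.n               # 3 for MountainCar
        self.state_dim = self.env.observation_space.shape[0]   # 2: position, velocity

        # ICM parameters
        self.lmd = lmd    # lambda: policy gradient loss vs. intrinsic reward
        self.beta = beta  # beta: forward loss vs. inverse loss

        # learning parameters
        self.total_games = total_games
        self.steps_per_game = steps_per_game
        self.batch_size = batch_size

        # build and compile the ICM model (see the next section)
        self.icm = self.build_icm_model()

        # storage for plotting the learning progress later
        self.positions = []
        self.rewards = []
```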

The ICM agent object includes a one_hot_encode_action utility function that encodes actions into a one-hot format (read more about what one-hot encoding is and why it is a good idea here).
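Such a utility is only a few lines; here is a sketch of what it typically looks like:

```python
def one_hot_encode_action(self, action):
    # e.g. in MountainCar, action 2 ("push right") becomes [0., 0., 1.]
    encoded = np.zeros(self.n_actions, dtype=np.float32)
    encoded[action] = 1.0
    return encoded
```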

Building the ICM Neural Network Model

The ICM model includes three neural networks, as explained above.

  • The inverse_model provides a sense of self-supervision by predicting the agent's action given the current and next states.
  • The forward_model predicts the feature vector of the next state given the current state and the action.
  • The ICM model brings both sub-models together. It uses the create_feature_vector function to turn raw states into feature vectors; this representation step ensures the model only attends to elements that the agent can control or that affect it. A sketch of these building blocks appears below.
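The exact architecture is in the repository; the sketch below shows one plausible way to build these pieces in Keras (layer sizes and the feature dimension are illustrative, not the post's exact values). The combined ICM model then shares a single feature encoder between the two sub-models and is trained with the weighted loss (1 - β)·L_I + β·L_F, exposing that weighted prediction error as its output so it can also serve as the intrinsic reward below.

```python
from keras.layers import Input, Dense, concatenate
from keras.models import Model


def create_feature_vector(state_dim, feature_dim=8):
    # phi(s): encode a raw state into a compact feature vector
    state = Input(shape=(state_dim,))
    x = Dense(16, activation="relu")(state)
    phi = Dense(feature_dim, activation="linear")(x)
    return Model(state, phi, name="feature_encoder")


def build_inverse_model(feature_dim, n_actions):
    # g(phi(s_t), phi(s_t+1)) -> a_hat_t, a distribution over the discrete actions
    phi_t = Input(shape=(feature_dim,))
    phi_t1 = Input(shape=(feature_dim,))
    x = Dense(24, activation="relu")(concatenate([phi_t, phi_t1]))
    a_hat = Dense(n_actions, activation="softmax")(x)
    return Model([phi_t, phi_t1], a_hat, name="inverse_model")


def build_forward_model(feature_dim, n_actions):
    # f(phi(s_t), a_t) -> phi_hat(s_t+1), the predicted next feature vector
    phi_t = Input(shape=(feature_dim,))
    action = Input(shape=(n_actions,))
    x = Dense(24, activation="relu")(concatenate([phi_t, action]))
    phi_hat_t1 = Dense(feature_dim, activation="linear")(x)
    return Model([phi_t, action], phi_hat_t1, name="forward_model")
```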

Action Selection Mechanism

Ideally, I would employ a training network and a policy network (in addition to the ICM model), which would co-train to update the policy (see the make_train_policy_net_model function for an example of how that would look). However, for simplicity, I present the act function, which chooses an action by querying the ICM model directly. It samples actions from the action space and performs an offline interaction with the environment to extract the next state and the reward; using these, it predicts the loss of each candidate and chooses the action that minimizes it. This solution will not scale, but since the mountain car problem has a small action space, it works well enough.
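Here is a sketch of such an act function, assuming the combined ICM model takes (s_t, s_{t+1}, one-hot a_t) as inputs and returns its prediction loss as output; copying the environment is a cheap way to fake the offline interaction for a classic-control task like MountainCar:

```python
import copy

import numpy as np


def act(self, state):
    # try each candidate action offline and keep the one with the lowest predicted loss
    losses = []
    for action in range(self.n_actions):
        env_copy = copy.deepcopy(self.env)            # offline interaction
        next_state, reward, done, _ = env_copy.step(action)
        one_hot = self.one_hot_encode_action(action)
        loss = self.icm.predict([state.reshape(1, -1),
                                 next_state.reshape(1, -1),
                                 one_hot.reshape(1, -1)])
        losses.append(float(loss))
    return int(np.argmin(losses))
```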

Training the Agent

Here I use batch training to fit the ICM model. The learn function fits the model on the provided observations.

The batch_train function manages training by running the full training sequence. Additionally, it uses the get_intrinsic_reward function to extract the intrinsic reward (the internal loss) calculated by the ICM model. When it finishes running, it plots the training progress using the show_training_data function.
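Under the same assumption as above (the ICM model's scalar output is its own prediction loss, so it can be fit against a zero target), the training loop might look roughly like this; the bookkeeping for positions and rewards is omitted for brevity:

```python
def get_intrinsic_reward(self, state, next_state, action_one_hot):
    # the intrinsic reward is the ICM's prediction error for this transition
    return float(self.icm.predict([state.reshape(1, -1),
                                   next_state.reshape(1, -1),
                                   action_one_hot.reshape(1, -1)]))


def learn(self, states, next_states, actions_one_hot):
    # one fit call on a batch of stored transitions; the target is zero because
    # the model's output is the quantity we want to drive down
    self.icm.fit([states, next_states, actions_one_hot],
                 np.zeros(len(states)),
                 batch_size=self.batch_size, verbose=0)


def batch_train(self):
    for game in range(self.total_games):
        state = self.env.reset()
        states, next_states, actions = [], [], []
        for _ in range(self.steps_per_game):
            action = self.act(state)
            next_state, ext_reward, done, _ = self.env.step(action)
            states.append(state)
            next_states.append(next_state)
            actions.append(self.one_hot_encode_action(action))
            state = next_state
            if done:
                break
        self.learn(np.array(states), np.array(next_states), np.array(actions))
    self.show_training_data()  # plot positions and rewards over training
```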

The MountainCar Problem

An illustration of the Mountain Car environment. From TANKALA (2018)

Briefly, it describes a scenario in which an under-powered car must drive up a steep hill (from Wikipedia). The car starts in a valley, with the goal of reaching the top of the rightmost hill. See the video at the top of the post.

In this problem, the agent receives a reward of -1 at every time step until the goal is reached; thus, it has no information about the goal until an initial success.

This environment has three actions (push left (0), no push (1), and push right (2)) and a two-dimensional state space (position and velocity).
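A quick inspection of the environment confirms this (the bounds in the comments come from the MountainCar-v0 specification):

```python
import gym

env = gym.make("MountainCar-v0")
print(env.action_space)        # Discrete(3): 0 = push left, 1 = no push, 2 = push right
print(env.observation_space)   # Box(2,): position in [-1.2, 0.6], velocity in [-0.07, 0.07]
print(env.reset())             # e.g. [-0.52, 0.], the car starts near the valley floor
```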

Running a DQN agent on the MountainCar problem highlights the sparseness of the rewards and, thus, the challenge of solving the task with them alone. To demonstrate this, I borrowed the code put forward by Patel (2017).

The top figure shows the best position of the car over time and the bottom figure demonstrates the dearth of reward.

How Does the ICM Perform At the MountainCar Problem?

Using the ICM produces the results below. Note that the agent reaches the maximum position much faster when integrating the intrinsic reward than when relying solely on the external one (given by the environment). This is reasonable given how little reward the environment provides (as seen in the bottom graph).

The top figure shows the best position of the car over time and the bottom figure demonstrates the dearth of reward.

Notice that the results above were obtained with the beta parameter set to 0.2, which means the inverse loss and the forward loss are weighted 0.8:0.2. Experimenting with a sample of lower and higher values resulted in a worse maximum position; however, I did not run a full grid search (you can see the full experimentation in the GitHub notebook).

Final Thoughts

Learning is a complex task. Oftentimes, as discussed above, relying solely on the environment for feedback is not sufficient to make progress; when reward is sparse or absent, it can be almost impossible. This post presents the Intrinsic Curiosity Module as a possible resolution to this problem. In the curiosity approach, the agent is rewarded for exploring new parts of the environment, or, practically speaking, for finding areas of high surprise (i.e., areas where its predictive ability is poor).

Note that the implementation presented here is a simplified version of the ICM agent. I will delve into the full implementation (using an independent policy network) in the next post.

References

Mountain car problem. (2020). Retrieved 10 March 2020, from https://en.wikipedia.org/wiki/Mountain_car_problem#Reward

Patel, Y. (2017). Reinforcement Learning w/ Keras + OpenAI: DQNs. Retrieved 10 March 2020, from https://towardsdatascience.com/reinforcement-learning-w-keras-openai-dqns-1eed3a5338c

Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 16–17).

Simonini, T. (2019). Curiosity-Driven Learning through Next State Prediction. From https://medium.com/data-from-the-trenches/curiosity-driven-learning-through-next-state-prediction-f7f4e2f592fa

TANKALA (2018). Solving Curious case of MountainCar reward problem using OpenAI Gym, Keras, TensorFlow in Python. A Software Engineer's Journal. Retrieved 10 March 2020, from https://blog.tanka.la/2018/10/19/solving-curious-case-of-mountaincar-reward-problem-using-openai-gym-keras-tensorflow-in-python/
