Policy Network With Tensorlayer & Curiosity Learning

Ngoc Minh Tu Nguyen
Published in The Startup · 14 min read · Mar 6, 2020
Popular Game Environments In Reinforcement Learning (source: OpenAI, ICLR 2019)

[I] RL: A Brief Introduction

Reinforcement learning is one of the main learning paradigms in machine learning, alongside supervised and unsupervised learning. Unlike supervised learning, where we have a set of fixed labels of the true, or nearly true, values we want to approximate, a reinforcement learner interacts with the environment and incrementally learns what it should or shouldn't do through a signal commonly referred to as the reward. Unlike unsupervised learning, where we want to learn a representation of the input and extract underlying patterns, reinforcement learning aims to retrieve a strategy for effectively navigating the environment it is trained in, known as the policy. These distinctions differentiate reinforcement learning from the other paradigms and also tell us when reinforcement learning is most useful. It is incremental, so when supervised learning fails because labels are not readily available, we can use reinforcement learning instead. It learns strategies, so where the representations produced by unsupervised learning are less than useful, we can employ reinforcement learning.

It is also good to separate this learning paradigm from genetic algorithms. Reinforcement learning differs from genetic algorithms in that it depends heavily on the environment: it uses the full feedback from the environment at every step rather than randomly mutating each generation. If the available time is ample and the search space is sufficiently small, then evolutionary approaches like genetic algorithms, or related optimization techniques like simulated annealing, work well, especially when the state of the environment cannot be fully sensed. In massive state spaces, however, such as those of complex games, reinforcement learning can explore efficiently because of the guidance of reward signals.

Reinforcement learning is inspired by conditioning in the brain, which is familiar to us humans: we touch a hot object and get burned, so we do it less in the future. Because reinforcement learning learns "strategies", it can be used in various decision-making processes, or in problems that can be phrased as decision-making processes. Since the recent revitalization of neural networks, reinforcement learning has enjoyed many successes across various domains, from robotics to game-playing, with famous examples that shook the world such as AlphaGo becoming super-human at the notoriously hard game of Go, or OpenAI cracking cooperative AI in the strategy game Dota 2.

In this article, you will:

  • 1) learn about the basics of reinforcement learning (from here on referred to as RL), including its main approaches;
  • 2) build a neural network to play the game of Pong using RL;
  • 3) understand a recent advancement in the field named curiosity learning, with a walkthrough of the original paper's code.

[II] A Crash Course In RL

In (very) short, RL uses a reward signal to modify its behaviors and adapt the strategy to the environment. Let's break this down.

Figure from: https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html

There are three main components of any RL setting:

  1. State: to let our RL learner interact with the environment, it first needs to see the environment. We need to provide the learner with a state: some sort of representation of the environment at the current time. This state can be imperfect, noisy or even corrupted, but this will pose different challenges to our learners.
  2. Action: the learner needs to interact with the environment. Every RL problem requires a specification of the valid actions that can be undertaken at the current state. RL is a series of learning experiences between the learner and the environment, as the figure above shows, so a list of valid interactions has to be available.
  3. Reward: the learner also needs to know whether it is modifying its environment the way it wants or not. A reward is a numerical value we assign to let the learner know if it is successful in obtaining whatever goal we give it. To reinforce our learner’s good behaviors (and thus discourage bad behaviors), the environment will let the learner know, once in a while, that it is doing well by rewarding it, or not well by taking the reward away. The short loop sketched after this list shows how these three pieces fit together.
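Here is a minimal sketch of that interaction loop in Python, using a Gym-style env and a hypothetical agent object (its act and learn methods are placeholders for whatever RL learner we plug in):

observation = env.reset()
done = False
while not done:
    action = agent.act(observation)                     # pick an action given the current state
    observation, reward, done, info = env.step(action)  # the environment returns the next state and a reward
    agent.learn(observation, reward)                    # use the reward signal to adjust future behavior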

Formally, an RL algorithm wants to retrieve the optimal policy: the strategy specifying which action to take in each state so as to maximize reward. We can learn this policy either directly or indirectly, which constitutes the two main approaches to RL:

  • Policy learning: These approaches directly find the policy, the most suitable action in a given state, in an environment. This can be done by parameterizing the policy, for example with a neural network, and optimizing over the reward to retrieve the optimal parameterized policy. This is usually referred to as the control problem of RL: we want to learn to control an agent in the environment, without caring much about learning the environment itself. Notable algorithms include REINFORCE, neural policy gradients, etc.
  • Value-function learning: These approaches aim to learn the value of states or state-action pairs: the expected reward if we were to move into a given state. This still implicitly learns a policy, but in a subtler way: if we want to maximize reward, we move toward states with higher expected values. Two different value functions are often the objectives to retrieve: the V-function (state value function) and the Q-function (state-action value function), defined just below. This is usually referred to as the prediction problem of RL: we want to learn to predict the expected reward in the environment and use this information to decide how to act. Notable examples of these approaches include SARSA, Q-learning, etc.
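For reference, these two value functions have standard definitions as expected discounted returns under the policy π, with discount factor γ (written here in LaTeX):

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\, A_t = a \right]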

Since we are using policy learning in the next section, let's dive a bit deeper into it. The simplest form of policy learning is the REINFORCE algorithm. Let's say we parameterize our policy by a vector theta, giving a policy pi, and we take action A in state S at time t. The update rule for our policy is then:

Source: Sutton & Barto (2018)

Intuitively, the update rule points in the direction of maximum reward, by differentiating the reward signal with respect to our current policy. This makes sense: if we receive a large positive reward, we would like to increase the chance of taking action A in state S. The gradient of the probability of taking action A in S is the direction that either increases (if the reward is positive) or decreases (if the reward is negative) the chance that we take that action in the future. The denominator controls the degree to which this gradient has an effect: if the action already has a high probability, we update it less. This prevents a bias: we should not keep choosing an action just because it has yielded some reward, since there might be other actions with higher rewards. For example, choosing the familiar pizza while traveling abroad is certainly fine, but if you always go with pizza you might miss a local cuisine option that can blow your mind. This is often referred to as the dilemma between exploration (traversing the state-action space more or less randomly to find a better policy) and exploitation (going with the best policy at the moment).

The REINFORCE idea carries over to other, more recent policy learning algorithms: we parameterize the policy somehow, then take the gradient of the reward and move in the direction that maximizes it. As with everything these days, this parameterization can be done with neural networks. Neural networks, being universal function approximators, can take the state as input and spit out the best action as output. In fact, we can learn very complex policies this way, since neural networks are very powerful. In the next section, we will go through an example of using a neural network to learn a good policy for Pong, an Atari game.

[III] Let's Play Pong!

Pong (source)

Pong is a simple Atari game that very much resembles table tennis. The objective of the game is to bounce the ball off a paddle whenever it is coming toward us. Let's go through the core components of the RL problem in Gym, the popular simulated environment we use for this example:

  1. State: The state of the environment is given to us as an image of the current game screen; Gym returns the raw 210x160 RGB frame, which we preprocess into an 80x80 image. This image contains everything from where the paddle is to the current position of the ball on the screen. In short, this is what a human would see while playing the game.
  2. Action: There are three possible actions, just as a human player would have: UP, which moves the paddle upward; DOWN, which is the opposite; and STAY, which retains the current position of the paddle. With these actions, we can fully control the paddle.
  3. Reward: There is a positive +1 reward whenever the opponent misses the ball (we score a point) and a negative -1 reward whenever the ball gets past our paddle defense and goes off-screen. With this, we can determine whether the learner is successful at playing Pong or not. The snippet after this list shows how to inspect these pieces in Gym.
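A quick way to inspect these pieces directly in Gym (note that the observation Gym actually returns is the raw frame; the 80x80 state above is obtained by the preprocessing step shown later):

import gym

env = gym.make("Pong-v0")
print(env.observation_space)  # Box(210, 160, 3): the raw RGB game screen
print(env.action_space)       # Discrete(6): Gym exposes six actions, of which we only use three

observation = env.reset()
observation, reward, done, info = env.step(env.action_space.sample())  # take a random action
print(reward)  # 0.0 most of the time; +1.0 or -1.0 whenever a point is scored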

In this example, we will use Tensorlayer, a popular framework for deep learning. Tensorlayer abstracts many of the popular architectures in deep learning, from simple ones like the convolutional layers in CNNs to complex ones like an entire seq2seq model in RNNs. With this level of abstraction, we can play around with complicated architectures quickly for prototyping. Better yet, Tensorlayer also has an extensive model zoo containing popular, pre-trained models in CV, NLP and, recently, RL. With these models, we can reproduce results, do transfer learning, and get access to research-grade trained models.

First thing first, we need to install the relevant packages:

%%bash
pip install tensorlayer gym
pip install --upgrade tensorflow-gpu==2.0.0
pip install --upgrade tensorflow==2.0.0

The last two lines pin TensorFlow to version 2.0.0, because Tensorlayer (which takes Tensorflow as its backend) currently only works with a matching Tensorflow release. Let's import the relevant packages and define the environment in Gym:

import time, gym
import numpy as np
import tensorflow as tf
import tensorlayer as tl
import matplotlib.pyplot as plt
env = gym.make("Pong-v0")
observation = env.reset()

Now, we can go through the parameters for the learner:

image_size = 80  # Side length of the (preprocessed) state image from the Gym environment
D = image_size * image_size  # The dimension of our input layer
H = 200  # The dimension of the latent layer(s)
batch_size = 10
learning_rate = 10**(-3)
gamma = 0.95  # Reward discount hyperparameter
decay_rate = 0.95  # Optimizer learning rate decay hyperparameter
model_file_name = "pong_PNN"
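The training loop below also calls a prepro helper that turns the raw 210x160x3 Atari frame into the flattened 80x80 input described above; a minimal version, following the standard Pong preprocessing (crop, downsample, binarize) used in the Tensorlayer tutorial, could look like this:

def prepro(I):
    """Preprocess a 210x160x3 uint8 frame into a 6400-dimensional (80x80) float vector."""
    I = I[35:195]       # crop out the scoreboard and borders
    I = I[::2, ::2, 0]  # downsample by a factor of 2 and keep a single color channel
    I[I == 144] = 0     # erase background (type 1)
    I[I == 109] = 0     # erase background (type 2)
    I[I != 0] = 1       # set paddles and ball to 1
    return I.astype(np.float32).ravel()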

To start off with a simple example, let's pretend we don't know there is a type of neural network that can process images directly, and parameterize the policy with a vanilla feed-forward network:

inputs_shape = [None, D]
ni = tl.layers.Input(inputs_shape)
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_1')(ni)
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_2')(nn)
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_3')(nn)
nn = tl.layers.Dense(n_units=3, name='output')(nn)  # one logit per action; softmax is applied later
model = tl.models.Model(inputs=ni, outputs=nn, name="pong_PNN")

The network takes in the state representation, not as an image, but as a flattened-out NumPy array of the image, then feeds this information through three hidden layers and outputs the action it thinks works best for the input state. This is a very simple policy network, and a very shallow one indeed, so we don't expect great performance from the model. The training procedure involves interacting with the environment:

# Keep some useful stats
prev_x = None
running_reward = None
reward_sum = 0
episode_number = 0
xs, ys, rs = [], [], []  # states, actions and rewards collected over the current batch
# Define our optimizer
train_weights = model.trainable_weights
optimizer = tf.optimizers.RMSprop(lr=learning_rate, decay=decay_rate)
model.train()
# Main training loop
while True:
    cur_x = prepro(observation)
    # Feed the difference between consecutive frames so the network can perceive motion
    x = cur_x - prev_x if prev_x is not None else np.zeros(D, dtype=np.float32)
    x = x.reshape(1, D)
    prev_x = cur_x

    # Forward pass: the policy network outputs logits, softmax turns them into probabilities
    _prob = model(x)
    prob = tf.nn.softmax(_prob)
    # Sample an action according to the policy (actions 1, 2, 3 in Gym's Pong)
    action = tl.rein.choice_action_by_probs(prob[0].numpy(), [1, 2, 3])

    observation, reward, done, _ = env.step(action)
    reward_sum += reward
    xs.append(x)           # record the state ...
    ys.append(action - 1)  # ... the action taken (as an index 0..2) ...
    rs.append(reward)      # ... and the reward received

    if done:
        episode_number += 1
        if episode_number % batch_size == 0:
            # Stack the batch and compute normalized, discounted rewards
            epx = np.vstack(xs)
            epy = np.asarray(ys)
            epr = np.asarray(rs)
            xs, ys, rs = [], [], []
            disR = tl.rein.discount_episode_rewards(epr, gamma)
            disR -= np.mean(disR)
            disR /= np.std(disR)
            # Policy gradient step: cross-entropy weighted by the discounted reward
            with tf.GradientTape() as tape:
                _prob = model(epx)
                _loss = tl.rein.cross_entropy_reward_loss(_prob, epy, disR)
            grad = tape.gradient(_loss, train_weights)
            optimizer.apply_gradients(zip(grad, train_weights))

        # Track a running average of the episode reward and stop once it is good enough
        running_reward = reward_sum if running_reward is None \
            else running_reward * 0.99 + reward_sum * 0.01
        if running_reward > -10:
            break
        reward_sum = 0
        observation = env.reset()
        prev_x = None

In the above code, we first define the optimizer with its hyperparameters. The advantage of using a neural network policy is that we can optimize it using well-known techniques, like automatic differentiation, very easily. Usually we would need to take the gradient with respect to the parameters of the policy ourselves, but here we can simply offload this work to the optimizer (and to Tensorflow). In the main training loop, we sample the state from the environment, feed it to the network, receive the policy network's best guess of the optimal next action, try that action in the environment and observe the reward signal. We then use the reward signal to weight the gradient and update the parameters of the policy network with backpropagation (optionally with gradient clipping, although that is not very necessary for such a shallow network). After every episode, we reset the environment to get a fresh start and continue the update process until the policy converges.
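Gradient clipping is not included in the loop above; if you want to add it, a minimal way (using TensorFlow's tf.clip_by_global_norm, with an arbitrarily chosen clipping norm of 5.0) would be to adjust the update step like so:

with tf.GradientTape() as tape:
    _prob = model(epx)
    _loss = tl.rein.cross_entropy_reward_loss(_prob, epy, disR)
grad = tape.gradient(_loss, train_weights)
grad, _ = tf.clip_by_global_norm(grad, clip_norm=5.0)  # rescale gradients whose global norm exceeds 5.0
optimizer.apply_gradients(zip(grad, train_weights))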

If you have followed the code above, you will notice that the policy network is too simple to approximate a good policy for a complex game like Pong (and yes, Pong is considered complex, at least for a computer). We can increase the capacity of the policy network by using CNNs, so that the network is deeper and can see the screen directly:

num_f = 32       # number of convolutional filters
f_size = (3, 3)  # filter size
stride = (2, 2)
inputs_shape = [None, image_size, image_size, 1]
ni = tl.layers.Input(inputs_shape)
nn = tl.layers.Conv2d(n_filter=num_f, filter_size=f_size, strides=stride, act=tf.nn.relu)(ni)
nn = tl.layers.Conv2d(n_filter=num_f, filter_size=f_size, strides=stride, act=tf.nn.relu)(nn)
nn = tl.layers.Flatten()(nn)  # flatten the feature maps before the dense layers
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_1')(nn)
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_2')(nn)
nn = tl.layers.Dense(n_units=H, act=tf.nn.relu, name='hidden_3')(nn)
nn = tl.layers.Dense(n_units=3, name='output')(nn)
model = tl.models.Model(inputs=ni, outputs=nn, name="pong_PNN_CNN")
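Note that with this CNN version the preprocessed frame should be fed to the network as an image rather than a flat vector, so the reshaping in the training loop would change along these lines (a small assumed adjustment, not spelled out above):

x = cur_x - prev_x if prev_x is not None else np.zeros(D, dtype=np.float32)
x = x.reshape(1, image_size, image_size, 1)  # a batch of one 80x80 single-channel image
prev_x = cur_x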

With proper hyperparameters and enough training, this policy network will converge to a local optimum in the policy space, and it can play Pong with a decent chance of winning. The question is, can we do better? For games specifically, the action space is large in terms of possible combinations in any given state. For example, in the classic Mario Bros. game, you can go forward and jump at the same time, or even shoot a fireball. How do we effectively explore our options in such scenarios? The answer: with curiosity, we take the leap of faith.

[IV] Curiosity Kills The Cat, But Helps The Robot

Curiosity is one of the most important characteristics of the human learning experience, especially at the early stages of development (Smith, L. & Gasser, M., 2005). Unlike traditional reward-oriented behaviors, curiosity is motivated by something different: sometimes we choose to satisfy our curiosity just for its own sake, not for potential rewards in the future, although the latter can further the motivation for the former. Examples of curiosity in humans are prevalent, from traveling to an unknown country to trying different cuisines. In game-playing, some games encourage exploration behaviors, such as hidden bosses in RPGs like Diablo, while others are entirely built on the premise of exploration and curiosity, like Minecraft or No Man's Sky. Surely, like many other great implementations inspired by biological learning systems, curiosity can be of use in RL.

To understand how curiosity can be used in RL algorithms, it is important to first understand two polarities in RL. The first concerns exploration/exploitation, which was mentioned briefly above: exploration involves taking actions that have no obvious rewards, at least in the short term, to discover better states; exploitation involves repeatedly using the current best knowledge to take actions that maximize rewards. Any good RL algorithm has to balance the two, since too much of either results in near-random behaviors (too much exploration) or convergence to a sub-optimal policy (too much exploitation). Curiosity is mostly concerned with exploration. The second polarity is that of extrinsic/intrinsic reward. An extrinsic reward is the reward given to the learner by the environment, i.e. it signifies that we have done something advantageous in the environment, whereas an intrinsic reward is a reward the learner gives to itself, i.e. when the learner believes that it has done something of value. Curiosity, as you might have guessed, is an intrinsic reward signal.

Usually it is enough to be guided by reward signals from the environment, but there are scenarios where this extrinsic reward is very sparse, or even non-existent. For example, the goal of a chess game is to checkmate the opponent's king, but if we use this as the reward then we must wait for the entire game to get a single reward signal from the environment. Sure, we can handcraft intermediate reward signals, like capturing the opponent's queen, but this raises the problem of designing the reward function for the environment, and it injects biases into our modeling attempt. One solution is to use an entirely intrinsic reward signal, curiosity, to guide the policy search. This approach operates on a simple principle: always explore the regions you are most uncertain about. Imagine you are on a ship mapping the coastal shape of a continent, and you are given a map made by someone before you covering some regions of the coast. Would you rather go to regions the old map already describes in detail, or would you attempt regions with no substantial mapping? Uncertainty is certainly scary, but a leap of faith will always reduce uncertainty, and this usually leads to more understanding of the world around us.

Formally, pure curiosity-driven learning foregoes the extrinsic reward signal altogether and instead uses prediction error, which quantifies the degree of surprise/uncertainty of the learner about a particular state/action, to guide the policy search. If, say, as in our last section, we use a policy network to predict the next best action and we end up in a completely unexpected state, under curiosity we will be more inclined to take this action again just to see what might happen. Burda, Y. et al. (2018) try this approach in multiple RL environments, ranging from classic Atari games to complex physical dynamics environments, and discover that even without extrinsic reward the learner can still be taught to perform well. When used as a pre-training technique, curiosity-driven training can increase performance for traditional RL algorithms down the line. Let's take a look at how this learning approach differs from our previous learning loop:

Code excerpt from the curiosity-driven learning source code (Burda et al., 2018).

We can see that the main difference is in the way we choose the next action to try, in other words, the way we do exploration. With the code in the previous section, we follow the actions suggested by the current policy, leaving only a small chance of choosing something else to explore (a scheme related to ε-greedy exploration). In the excerpt above, curiosity-driven learning instead uses the amount of uncertainty, i.e. it keeps track of how many times we have chosen these actions before and how much prediction error we have accumulated for them. We then choose the action that 1) was not chosen many times before and 2) has a lot of prediction error. You can check out the game-play performance of the learners trained on curiosity only here. Quite impressive, indeed!
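To make the idea concrete, here is a minimal sketch of a prediction-error intrinsic reward (this is not the paper's actual code; the feature size, action count and forward_model names are illustrative assumptions). A small forward model predicts the next state feature from the current feature and the action taken, and its squared prediction error serves as the curiosity bonus:

import numpy as np
import tensorflow as tf
import tensorlayer as tl

feat_dim, n_actions = 128, 3  # hypothetical feature and action dimensions

# Forward model: predict the next state feature from the current feature plus a one-hot action
ni = tl.layers.Input([None, feat_dim + n_actions])
nn = tl.layers.Dense(n_units=256, act=tf.nn.relu)(ni)
nn = tl.layers.Dense(n_units=feat_dim)(nn)
forward_model = tl.models.Model(inputs=ni, outputs=nn, name="forward_model")
forward_model.eval()

def intrinsic_reward(feat, action_onehot, next_feat):
    """Curiosity bonus: the worse we predict the next state feature, the larger the reward."""
    inp = np.concatenate([feat, action_onehot], axis=1).astype(np.float32)
    pred = forward_model(inp)
    return float(tf.reduce_mean(tf.square(pred - next_feat.astype(np.float32))))

The forward model itself would be trained on the same prediction error, so frequently visited states become predictable and stop being rewarding, which pushes the policy toward unexplored regions.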

[Conclusion]

In this article, we have learned the basics of RL and how to create a policy neural network with Tensorlayer. We also briefly dabbled in curiosity-driven learning, one of the most recent developments in the field.

References:

Burda, Y. et al. (2018). Large-Scale Study of Curiosity-Driven Learning. Retrieved from: https://pathak22.github.io/large-scale-curiosity/

Smith, L. & Gasser, M. (2005). The development of embodied cognition: six lessons from babies. Retrieved from: https://www.ncbi.nlm.nih.gov/pubmed/15811218

Tensorlayer. (n.d.). Reinforcement Learning Tutorial with Tensorlayer. Retrieved from: https://github.com/tensorlayer/tensorlayer/tree/master/examples/reinforcement_learning
