Curious Actor Critic Network

D.W.
7 min read · Jun 17, 2017

Learning always works best when you get your hands dirty & tinker your way through.

Introduction

To jump-start my deep learning journey, I started from scratch to implement DQN [1] and the Actor-Critic structure from the A3C paper [2], then evolved it with an intrinsic curiosity reward [3], in order to learn how the RL foundations work. Squeezing time out of weekends is difficult, but at least the project is fun, and of course, worth a short documentation as a record.

TLDR version: the AC losses plateau after just a few iterations, and there seems to be no further convergence afterwards… (oops) Maybe some day later I will fix this. :P If you’re interested, you can still find the code here on GitHub: https://github.com/skelneko/-CuriousActorCritic

What to expect from this post

As stated in the Introduction, this is more of a self-exploratory project: the code might not be ideal (it isn’t), nor is the network fine-tuned (it isn’t either). Plus, should you want to learn RL in a more structured way, there are already plenty of well-done tutorials online, including here, here, and here. Nevertheless, this post focuses on the random tips and caveats that a beginner (like me) might need in order to get into the field. So, should you find anything in the post or the code contradictory, do feel free to raise it in a comment. Critique is great for learning, and of course, so are rewards and appreciation. :)

The Beginning

Start by learning all the basic concepts through Goodfellow’s deep learning book. It’s free online here, and do buy one on Amazon if you’re a retro book reader like me.

Once you get into the CNN part, you may want to start tinkering. The key to implementing RL, or Q-learning to be specific, is to understand how the entire framework works. And to be frank, the Network part (be it DQN or A3C) is in fact only one third of the story, which will be covered in the next section.

To begin with, we can start by understanding the Markov Decision Process, and try building a CartPole agent using OpenAI Gym with the following pseudocode:

class Agent:
    network = DQN()

    def get_action(self, current_state):
        if random() < epsilon_threshold:
            return random_action
        else:
            return self.network.prediction(current_state)

    def play(self, tf_session):  # main loop
        while True:
            action = self.get_action(current_state)
            # apply action to the Environment
            self.save_memory(state, action, reward, next_state)
            if memory_full:
                train(self.network)

if __name__ == "__main__":
    a = Agent()
    a.play(new_tf_session)
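
For concreteness, here is a minimal runnable sketch of that loop, assuming the classic gym API (env.reset() returns an observation; env.step() returns observation, reward, done, info). A random action stands in for DQN.prediction, which is exactly the part you would swap for your own network.

import random
import gym

env = gym.make("CartPole-v0")
epsilon_threshold = 0.1

def get_action(state):
    # epsilon-greedy: explore with probability epsilon_threshold
    if random.random() < epsilon_threshold:
        return env.action_space.sample()
    # placeholder for DQN.prediction(state); a trained network goes here
    return env.action_space.sample()

state = env.reset()
for _ in range(1000):
    action = get_action(state)
    next_state, reward, done, _ = env.step(action)
    # here you would save_memory(state, action, reward, next_state) and train
    state = env.reset() if done else next_state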

The interactions between Network, Agent, and Environment might be difficult to grasp for people without a programming background, especially once we want to build an A3C, so it’s indeed better to take it slowly and solidify our learning at this stage.

Confident to move on? Good. Now, we’re ready to level up.

The Mid-Game

CartPole demonstrates the basics, but the Observation from its game Environment is far from ideal compared to DQN, which takes raw Atari game frames. We thus pick a game, and my choice was Ms. Pac-Man due to its spatial game environment.

A big leap from a CartPole toy — Deep Q-Network

We need to implement a CNN, so better to finish that chapter from Goodfellow if we haven’t done so.

If we feed the Python gym observations in their raw state, the first bottleneck will likely be very slow processing: pushing the entire 210x160x3 (H, W, RGB channels) screen image through our CNN layers is expensive, so we grayscale it and also scale it down to 64x64 (DQN uses 84x84). I thus created an Environment class, as a wrapper around the Python gym, to handle this requirement with functions that I found across the internet (similar to the DQN code, in fact):

### Utilities ###
def rgb2gray(self, rgb):
    # standard luminance weights for RGB-to-grayscale conversion
    r, g, b = rgb[:, :, 0], rgb[:, :, 1], rgb[:, :, 2]
    gray = 0.2989 * r + 0.5870 * g + 0.1140 * b
    return gray

def resizeScreen(self, state, shape):
    # downsample the grayscale frame to the target shape, e.g. (64, 64)
    img = PIL.Image.fromarray(state, mode=None)
    img = img.resize(shape, PIL.Image.LANCZOS)
    arr = list(img.getdata())
    return np.reshape(arr, shape)
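
In practice the two utilities chain together into a single preprocessing step. A hypothetical helper (the Environment class in the repo may organize it differently) could look like this:

def preprocess(self, raw_frame):
    # 210x160x3 RGB frame -> 210x160 grayscale -> 64x64 network input
    gray = self.rgb2gray(raw_frame)
    return self.resizeScreen(gray, (64, 64))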

Then adding an LSTM layer at the end of the CNN should be an easy task (nope, it took me two weeks from understanding its structure to implementing the code…). This should help the Network learn based on the memory of states it has (recently) encountered.

# adding lstm layer
state_size = self.num_cnn3_out
batch_size = 1  # further study required
with tf.variable_scope("LSTM") as scope:
    self.lstm_in = [self.fc_out]
    self.lstm_in = tf.transpose(self.lstm_in, [0, 1, 2])  # h_in
    self.cell = tf.contrib.rnn.BasicLSTMCell(num_units=state_size)
    states = self.cell.zero_state(batch_size, tf.float32)
    h_out, states = tf.nn.dynamic_rnn(cell=self.cell, inputs=self.lstm_in, initial_state=states)
    self.h_out_unpacked = tf.unstack(h_out, axis=0)
    self.lstm_out = self.h_out_unpacked[0]

Last but not least, implement the Experience Replay memory as a standalone class, built to handle reward discounting and frame buffering. I shall leave the details to you for further investigation.
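
Still, here is a rough sketch of what such a class might look like, assuming transitions are stored as simple tuples (the version in the repo also buffers frames and discounts rewards):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        # old transitions fall off the front once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def save(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random minibatch, as in DQN [1]
        return random.sample(self.buffer, batch_size)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen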

Watching Ms. Pac-Man run through the grid is getting fun at this stage.

Enable a better mind — Actor Critic Network

Personally I think it’s important to learn any new concept by comparing it with concepts we have already learnt. Therefore, the implementation of the ACNetwork rides on the QNetwork developed in the last phase. I used inheritance to achieve this:

class ActorCriticCNN(QValueCNN):
    def __init__(self, input_tensor_shape, output_tensor_shape):
        QValueCNN.__init__(self, input_tensor_shape, output_tensor_shape)

The beauty of this approach is that it reuses the QNetwork foundation and extends its structure further to fit the Actor-Critic concept. Hence we ride on the output of the last layer and branch out into a value subnet (Critic) and a policy subnet (Actor).
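
As a rough sketch (not the exact repo code), the branching could look like this in TensorFlow 1.x, riding on the self.lstm_out tensor from the base class; num_actions is assumed to be the size of the game’s action space:

with tf.variable_scope("actor_critic"):
    # Critic head: a single scalar estimate of the state value V(s)
    self.value = tf.layers.dense(self.lstm_out, 1, activation=None)
    # Actor head: a probability distribution over the discrete actions
    self.policy = tf.layers.dense(self.lstm_out, num_actions,
                                  activation=tf.nn.softmax)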

At the end we implement the loss function that we would like to optimize. For the details, it’s highly recommended to have a look at Arthur’s A3C tutorial.
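
For reference, a sketch of the standard Actor-Critic loss terms from the A3C paper [2] (and Arthur’s tutorial) looks roughly like the lines below; actions_onehot, advantage and discounted_return are placeholders for tensors your training loop would feed in, and the 0.5 and 0.01 coefficients are typical choices, not the repo’s exact values.

# probability the Actor assigned to the action that was actually taken
responsible_prob = tf.reduce_sum(self.policy * actions_onehot, axis=1)
policy_loss = -tf.reduce_sum(tf.log(responsible_prob + 1e-8) * advantage)
# Critic regression towards the discounted return
value_loss = 0.5 * tf.reduce_sum(tf.square(discounted_return - tf.reshape(self.value, [-1])))
# entropy bonus encourages exploration
entropy = -tf.reduce_sum(self.policy * tf.log(self.policy + 1e-8))
self.loss = policy_loss + 0.5 * value_loss - 0.01 * entropy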

Implementing asynchronous threads for A3C would be a hassle given the time investment it needs. Instead, I decided to stop short of full A3C and continue in a different direction. Maybe I can blend a new model into our Pac-Man agent and play around with it?

Mix & Blend ideas — Curiosity-Driven Network

Ms. Pac-Man is rewarded when she eats a pellet or a ghost, but could we also reward the act of exploration? The concept of applying human-like intrinsic motivation to a deep learning network is thus an interesting area to explore. While the paper from Deepak Pathak et al. 2017 [3] is a very good read, they haven’t released their own code (yet). So, why not try it out myself?

Their ICM model turns out not to be difficult to implement, and its elegance lies in its simple design: the forward losses of a prediction network are looped back into the A3C network as an intrinsic reward. For my implementation, you can take a look at my code on GitHub. And of course, I used inheritance again to quickly build the ICM part.
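
The core of the idea, stripped down, is just a couple of lines; here phi_next is the encoded next state, phi_next_pred the forward model’s prediction of it, and eta a scaling hyperparameter (all names are placeholders for illustration, not the repo’s):

# the forward model's prediction error in feature space
forward_loss = 0.5 * tf.reduce_sum(tf.square(phi_next_pred - phi_next), axis=1)
# the prediction error doubles as the curiosity signal fed back to the A3C network
intrinsic_reward = eta * forward_loss
total_reward = extrinsic_reward + intrinsic_reward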

The End-Game

Arriving at this stage is not easy (oh my weekends), but at least it gives me some grounded knowledge to think about how to squeeze more joy out of my new toy. My conclusion: tune it the meta way.

The Config class has thus been set up to support multiple profile settings, and the __main__ script allows the program to loop through the profiles, and even to simulate the behavior of A3C by running profiles with short episodes.
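
A hypothetical sketch of that profile loop (the real Config class and __main__ script in the repo differ in the details):

profiles = [Config("extrinsic_only"), Config("intrinsic_only"), Config("hybrid")]
for profile in profiles:
    agent = Agent(profile)
    # short episode budgets per profile roughly mimic A3C's parallel workers
    agent.play(num_episodes=profile.num_episodes)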

Now it’s time to analyze the models by their reward trends.

The overall graph does show that the Intrinsic reward played an essential role in the learning. The Hybrid Reward (Intrinsic + Extrinsic) does help to deliver a higher game score (i.e. extrinsic reward) after all, while the Intrinsic-Reward-only Agent shows the lowest scoring, which probably resulted from the minimal association between Intrinsic & Extrinsic reward within its model.

Now the final question is: why does the reward curve seem not to improve at all??

By plotting the losses, we can see that while Agents with Intrinsic motivation do converge to a certain level of loss, they plateau at a similar level as the Extrinsic-only Agents. Furthermore, the Extrinsic-only Agents are found to have their losses continue without any proper convergence…

So the million-dollar answer? The AC Network itself wasn’t working somehow… (or was it?)

oh my… :’(

Final Note

Despite seeming like a miserable failure with a tragic ending, the project has been a lot of fun. Given my time in reality, I am afraid I have to bring the project to a closure for the moment. Who knows, maybe some day I will come back and fix it? :)

It’s also my pleasure to share my journey with you, my dear reader.

Farewell, and see you around soon.

Cheers,

D.W.

References:

[1] Playing Atari with Deep Reinforcement Learning https://arxiv.org/abs/1312.5602

[2] Asynchronous Methods for Deep Reinforcement Learning https://arxiv.org/abs/1602.01783

[3] Curiosity-driven Exploration by Self-supervised Prediction https://arxiv.org/abs/1705.05363
