Days 9–10 of the OpenAI Retro Contest

Getting about 8 lines into the Rainbow DQN baseline implementation.

Tristan Sokol
Apr 16, 2018

Team Bobcats’ previous work for days 1, 3, 4 & 5, and 6 of the OpenAI Retro Contest had been mostly focused on getting some of the different baseline implementations up and running. The jerk agent wasn’t too tricky to understand, but the Rainbow agent was definitely outside of immediate grokability, so it was time to dig deep and see what was going on. I am going to try to go through the code line by line to see what is happening where, so hopefully by the end of this post I will have a decent understanding.

Part one, the imports:

This will probably be the easiest section.

  • #!/usr/bin/env python Honestly, I thought that this was just something you put on Python scripts, and this helpful StackOverflow answer told me that I was not alone, but also that the shebang does a little more than that (it tells the shell which interpreter to run the script with). It doesn’t seem to be anything special for this use case.
  • import tensorflow as tf TensorFlow is very cool right now. What does it do? I am not totally sure, but I look forward to finding out.
  • from anyrl.algos import DQN This is the first in a set of imports that bring in a variety of things from a library that apparently provides “APIs for Reinforcement Learning” but has pretty little in the way of documentation. This line adds an implementation of Deep Q-Learning. DQN seems to be just at my level of understanding, based on a handy post I read from Intel, and more generally it seems to be the basic structure of the type of learning we are trying to do.
  • from anyrl.envs import BatchedGymEnv This functionality looks to be just an optimization for running your environments in batches.
  • from anyrl.envs.wrappers import BatchedFrameStack Here we get some code to handle the stacks of images in a batched manner; it’s the batched version of FrameStackEnv.
  • from anyrl.models import rainbow_models The “Rainbow” refers to this paper, where a group combined a few different Deep Q-Learning techniques into a single agent. (All of these imports are collected into one block just below.)
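For reference, here is that import block reassembled, with my running commentary as comments. The wrapper wiring at the bottom is only a sketch of how I currently understand the baseline script gluing these pieces together, so treat those exact argument values as my assumption rather than something I’ve verified.

#!/usr/bin/env python
# The shebang: tells the shell which interpreter to run this script with.

import tensorflow as tf  # the deep learning framework everything sits on

from anyrl.algos import DQN  # the Deep Q-Learning training algorithm
from anyrl.envs import BatchedGymEnv  # run several Gym environments as one batch
from anyrl.envs.wrappers import BatchedFrameStack  # batched version of FrameStackEnv
from anyrl.models import rainbow_models  # the Rainbow combination of DQN improvements

# My rough sketch of how the baseline wires the batched wrappers together
# (argument values here are an assumption on my part, not verified):
# env = BatchedFrameStack(BatchedGymEnv([[env]]), num_images=4, concat=False)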

Reading that paper reminded me of the PPO2 baseline agent, so I took a little trip into reading about it on the OpenAI site.

We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune.

Wow! Sounds like PPO is much easier to understand and work through, so with my limited internet access this week I might suspend these efforts and focus on trying to get the PPO2 implementation to work.

--

Tristan Sokol

Software Lead at NorthPoint Development. When I’m not helping automate a real estate company, I’m growing succulents in my back yard. https://tristansokol.com/