Off-Policy Q-learning in OpenAI Universe: Part 1 — Setting up OpenAI’s Baseline DQN

Introduction:

Florian Hoidn
8 min read · Jul 2, 2017

OpenAI’s Gym and Universe toolkits allow users to run video games and other tasks from within a Python program. Both toolkits are designed to make it easy to apply reinforcement learning algorithms to those tasks. Basically, OpenAI’s toolkits provide you with information about what’s happening in the game — for instance, by giving you an array of RGB values for the pixels on the screen, together with a reward signal that tells you how many points were scored. You feed this information into a learning algorithm of your choice — probably some sort of neural network — so that it can decide which action to play next and learn how to maximize rewards in this situation. Once the algorithm has chosen an action, you can use OpenAI’s toolkit again to input the action back into the game and receive information about the game’s new state. Typically, you’ll have this cycle repeat until your learning algorithm is making sufficiently decent choices in the given game.
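
To make this cycle concrete, here is roughly what it looks like in code. This is a minimal sketch along the lines of the standard Universe example, with a placeholder game ID and an "agent" that simply holds the up arrow:

```python
import gym
import universe  # importing universe registers its environments with gym

# Placeholder game ID; any Universe environment works the same way.
env = gym.make('flashgames.NeonRace-v0')
env.configure(remotes=1)  # spins up a local Docker container that hosts the game via VNC

observation_n = env.reset()
while True:
    # A real agent would choose its actions here; this one just holds the up arrow.
    action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()
```

Note that Universe environments are vectorized: observations, rewards, and actions all come as lists, one entry per remote.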

While OpenAI Gym comes with a collection of games that work really well with reinforcement learning (for instance, it gives you access to a variety of classic Atari 2600 games), the more recently published OpenAI Universe greatly enlarges the collection of available tasks. In Universe, each game runs in a Docker container — viz. a sort of minimalistic virtual machine that exists solely for the purpose of hosting the game (or other task) via VNC. What's neat about this is that one could theoretically run any game (or any program whatsoever, really) within this framework. For instance, support for games like Minecraft and Portal is currently planned (even though we'll probably have to wait and see whether OpenAI actually manages to make this happen; after all, support for GTA V was announced and then removed without a trace — my guess being that this might have had something to do with publisher Take 2 Interactive's recent lawsuits against modders). What's also nice about Universe is that each game is rendered within a fixed-size 1024 × 768 panel and takes actual key and mouse events as inputs. This means that one doesn't have to adjust the architecture of one's algorithm for each new task merely to cope with different frame sizes or other format choices specific to that task. To give you an impression, this is what a typical frame from an OpenAI Universe game looks like:

A screenshot from the game NeonRacer that is available in OpenAI Universe’s library of games. Notice that the frame not only contains the game screen, but also the complete user interface of a functioning web browser that is running within OpenAI’s Docker container (theoretically, the AI could use this browser just like a person would).

For this blog series, I decided to play with OpenAI Universe — or rather have a suitable deep Q-network (DQN) play with it — and document the process. A DQN essentially consists of a function approximator for the so-called action value function, Q, to which it applies an argmax operation to determine which action it should take in a given state. The Q-function takes the state, s, of a game along with an action, a, as inputs and outputs, intuitively speaking, how many points one will score in the rest of the game if one plays a in s and then continues to play optimally from there onwards. In our case, the available actions are (a subset of) the possible button and mouse events that OpenAI Universe can input to the games. The states are, basically, determined by what is visible on the screen — viz. by the frames. This isn't entirely true, though, as one can easily grasp by looking at the screenshot above: one frame isn't enough to assess everything about the game's current state. For instance, the screenshot above doesn't tell you (or the DQN) how fast the car is going. However, if one inputs a sequence of frames to the DQN, it may be able to learn at least a decent approximation of the actual Q-function.
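
In pseudocode, the argmax policy and the one-step target that Q-learning uses to train the approximator look roughly like this (purely illustrative, not the Baselines implementation):

```python
import numpy as np

# q_values[a] approximates Q(s, a) for each available action a in state s.

def greedy_action(q_values):
    """Pick the action that the network currently believes is best."""
    return int(np.argmax(q_values))

def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """One-step target: points scored now, plus the discounted best case from the next state on."""
    return reward + (0.0 if done else gamma * float(np.max(next_q_values)))
```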

In the remainder of this blog post, I’ll introduce the DQN that I ended up using, explain how I got it to work in OpenAI Universe, and provide a couple of code snippets that I implemented in order to get everything running. In subsequent blog posts of this series, I intend to dive deeper into Universe’s gaming library, experiment with potentially interesting modifications of the base DQN, and look into the process of creating new Docker containers for your AIs to interact with.

OpenAI’s Baseline DQN:

My initial idea was to create a Q-learning agent myself, ideally one that uses LSTM units to store information about past frames dynamically — thereby eliminating the need to manually stack a fixed number of frames in order to provide the network with information about what has happened in the past. While such deep recurrent Q-learning networks (DRQNs) have been successfully implemented in the past, I have to admit that I struggled quite a bit with getting them to run at all, let alone stably and with a real chance of beating non-trivial games. And frankly, even implementing a more conventional DQN is certainly not an easy task (especially if you are like me and think that you can get around implementing some of the more tedious building blocks that make state-of-the-art DQNs as powerful as they are — I’m looking at you, prioritized experience replay buffer).

For now, I've therefore decided to play it safe and use the DQN that OpenAI recently published as part of their Baselines project — and I absolutely do not regret this choice, as their DQN seems to work really well, judging from what I've seen so far. The baseline DQN does come with a caveat, though: it doesn't currently (officially) work with OpenAI Universe environments, but only with tasks from OpenAI Gym. Also, while the baseline DQN is training, one doesn't really get to see very much of the action, apart from the occasional average-reward statistics printed to standard out at the end of an episode. Now, this may not be a big problem in the typical Gym game, in which the algorithm blazes through dozens of episodes in a matter of minutes and the whole training process is often over in a few hours at most (assuming that you have a decent GPU to train the DQN on). In Universe games, however, it helps a lot if you can actually see what's currently going on, in order to check, for instance, whether your DQN is stuck somewhere in the often complex 3D environments. So, in an effort to remedy these issues, I came up with a few lines of code that I'll post below, so that you can easily copy and paste the snippets into a Jupyter Notebook (or a simple Python file). If you do, you can start training the baseline DQN within a Universe environment of your choice and see exactly what the DQN sees, rendered in an extra window.

Besides having the AI interact with a Universe environment and rendering what it sees, there was one more thing that I desperately wanted to implement — especially after I'd watched Sentdex's awesome blog on training a self-driving car in GTA V. What really intrigued me about the way Sentdex presented his AI was how he could seamlessly take control of the action if the algorithm got stuck, get it to a clear location, and return control to the algorithm. Now, this is something that one can do in OpenAI Universe as well — even out of the box, simply by connecting a VNC viewer to the Docker container and starting to input commands via one's mouse and keyboard. However, if one does this, it looks to the AI as if things are being controlled by an external force, so to speak, and it doesn't learn anything from that. One great opportunity that Q-learning provides us with is that the algorithm works off-policy as well as on-policy. Thus, intuitively speaking, it doesn't matter to the algorithm whether it watches someone else play and has to learn off-policy, or whether it plays by itself and learns on-policy. So, I added a couple of key event listeners to the window that displays what the algorithm sees, which allow you to take control of the game at any time and then return control to the algorithm by hitting "return". That way, the algorithm actually sees what buttons you're pressing, stores the information in a prioritized experience replay buffer (yes, Baseline's DQN has it), and learns from that live, while the game is running. So, when it gets stuck, you can not only get it "unstuck", but it can even learn how to do so itself when it faces a similar situation in the future.

Now, without further ado, let's take a look at the two classes that I implemented to get this running. The first class wraps a Universe environment in a way that makes it resemble a simple Gym environment, so that Baseline's DQN is able to work with it. It also takes care of scaling down the frames and converting them to grayscale (training a DQN — or any interesting neural network, really — is very resource-intensive, both in terms of memory and computation, so this is definitely required):
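
In outline, the wrapper does something like the following (a simplified sketch with illustrative names; the full version is in the repository linked further below):

```python
import gym
import gym.spaces
import numpy as np


class UniverseToGymWrapper(object):
    """Simplified sketch: expose a Universe (VNC) env through a plain Gym-style interface.

    Assumes a single remote and a small, fixed set of arrow-key actions; the class
    and attribute names here are illustrative.
    """

    KEYS = ['ArrowUp', 'ArrowLeft', 'ArrowRight', 'ArrowDown']

    def __init__(self, env):
        self.env = env
        self.action_space = gym.spaces.Discrete(len(self.KEYS))
        # 768 x 1024 screen, every 3rd row/column kept, single grayscale channel.
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(256, 342, 1))

    def _preprocess(self, obs):
        frame = obs['vision']          # raw RGB screen from the VNC session
        frame = frame[::3, ::3]        # keep only every 3rd row and column
        gray = frame.mean(axis=2).astype(np.uint8)  # convert to grayscale
        return gray[:, :, None]        # add a channel axis for the conv net

    def _to_vnc_action(self, action_idx):
        # Press the chosen key, release all others.
        return [('KeyEvent', key, i == action_idx) for i, key in enumerate(self.KEYS)]

    def reset(self):
        obs_n = self.env.reset()
        # Universe may return None observations while the remote is still booting.
        while obs_n[0] is None:
            obs_n, _, _, _ = self.env.step([[] for _ in obs_n])
        return self._preprocess(obs_n[0])

    def step(self, action_idx):
        obs_n, reward_n, done_n, info = self.env.step([self._to_vnc_action(action_idx)])
        return self._preprocess(obs_n[0]), reward_n[0], done_n[0], info
```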

Notice how the preprocessing step radically downsizes the individual frames, simply by passing only every 3rd row and column of pixels to the DQN. Thanks to preprocessing, this is what a typical frame really looks like for the DQN:

A frame from NeonRacer, after it has been preprocessed. It is rendered inside the extra window that I've mentioned in the text. This is its original size, so it's really pretty small — however, it's enough for the DQN to at least learn some basic maneuvers.

Apart from that, I pretty much just cherry-picked what I thought might be useful normalization steps by looking at how OpenAI handles Atari 2600 games. The result can be seen in the function wrap_openai_universe_game: after applying my own wrapper, two wrappers from OpenAI's Atari module are used. The first one controls the frame rate; the other controls how many frames are stacked, so that the DQN can discover temporal dependencies (e.g., deduce how fast something is moving).
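
Put together, the wrapping function looks roughly like this (treat it as a sketch; the exact wrapper names and import path in Baselines' Atari module may differ between versions):

```python
# Sketch; the wrapper class names follow Baselines' Atari wrappers as of mid-2017
# and may differ in other versions of the library.
from baselines.common.atari_wrappers import MaxAndSkipEnv, FrameStack


def wrap_openai_universe_game(env):
    env = UniverseToGymWrapper(env)   # the custom Universe-to-Gym wrapper sketched above
    env = MaxAndSkipEnv(env, skip=4)  # repeat each action for a few frames (controls the effective frame rate)
    env = FrameStack(env, 4)          # stack the last 4 frames so the net can see motion
    return env
```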

The only other class that required some creativity on my part basically serves as a wrapper for a core component of OpenAI's DQN, namely the part that takes a given state of the game and uses a Q-function approximation to choose an action. My wrapper class, called PygletController, intercepts this process. It takes the given game state, renders it in a tiny window, checks whether there is any human input, and only delegates the decision to the DQN if there is no such input from a human player. Here's the code snippet for that:
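
What follows is a simplified sketch (the key mapping and window size are illustrative; the full version is in the repository linked below). The important design choice is that the controller is a drop-in replacement for the act callable that Baselines builds internally, so nothing else in the training loop has to change:

```python
import numpy as np
import pyglet
from pyglet.window import key


class PygletController(object):
    """Simplified sketch: render what the DQN sees and let a human take over."""

    # Map keyboard keys to indices in the discrete action set (assumed ordering).
    KEY_TO_ACTION = {key.UP: 0, key.LEFT: 1, key.RIGHT: 2, key.DOWN: 3}

    def __init__(self, dqn_act, width=342, height=256):
        self.dqn_act = dqn_act           # the act callable built by Baselines
        self.human_action = None
        self.human_in_control = False
        self.window = pyglet.window.Window(width=width, height=height)
        self.window.push_handlers(on_key_press=self._on_key_press)

    def _on_key_press(self, symbol, modifiers):
        if symbol == key.RETURN:
            self.human_in_control = False      # hand control back to the DQN
        elif symbol in self.KEY_TO_ACTION:
            self.human_in_control = True       # human takes over
            self.human_action = self.KEY_TO_ACTION[symbol]

    def _render(self, frame):
        # frame: 2D uint8 grayscale array of shape (height, width).
        self.window.switch_to()
        self.window.dispatch_events()
        image = pyglet.image.ImageData(frame.shape[1], frame.shape[0], 'L',
                                       frame.tobytes(), pitch=-frame.shape[1])
        self.window.clear()
        image.blit(0, 0)
        self.window.flip()

    def __call__(self, obs, **kwargs):
        # Show the most recent frame of the stacked observation.
        self._render(np.asarray(obs)[0][:, :, -1])
        if self.human_in_control and self.human_action is not None:
            return np.array([self.human_action])
        return self.dqn_act(obs, **kwargs)
```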

In order to actually put this to use, I created two more Python files, each of which contains a slight modification of a function from OpenAI's Baselines project: the first one is the main function (viz. the function that starts everything else that needs to be started); the other is a variant of baselines.deepq.simple.learn in which I just added one line of code to apply my PygletController wrapper. All relevant files can be found in my GitHub repository.
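
For orientation, a stripped-down sketch of those two modifications might look like this (the hyperparameters are placeholders, and the direct call to deepq.learn stands in for the modified copy of learn used in the actual setup):

```python
import gym
import universe  # noqa: F401 -- registers the Universe environments
from baselines import deepq

# Minimal, illustrative main: build the wrapped environment and start training.
env = gym.make('flashgames.NeonRace-v0')
env.configure(remotes=1)
env = wrap_openai_universe_game(env)

model = deepq.models.cnn_to_mlp(
    convs=[(32, 8, 4), (64, 4, 2), (64, 3, 1)],  # standard DQN conv stack
    hiddens=[256],
)
deepq.learn(env, q_func=model, prioritized_replay=True)

# Inside the copied learn() function, the only addition is one line that wraps
# the action-selection callable right after it has been built, e.g.:
#
#     act = PygletController(act)
```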

So, does it work? Well, it does at least look kind of promising, as you can see in the short clip below. At that point, the DQN had trained for around fourteen hours, I'd say, during which I occasionally played a round myself or helped the network get back on track so that it could learn off-policy from that (in the clip, the net is, of course, playing on-policy — so it's the DQN that steers the racing car):

OpenAI Baseline’s DQN playing NeonRacer after having trained overnight. The DQN is still actively training at this point, which explains the relatively low frame rate. There is also still some epsilon-greedy exploration (viz. randomization) going on, which might explain why the net turns right at the end of the clip and crashes into the wall. The rendering in the background is running in a VNC viewer in the browser. The tiny window in the foreground is the PygletController that renders exactly what the DQN sees and allows the net to learn off-policy from human inputs.


Florian Hoidn

I studied philosophy and computer science at LMU Munich and worked on a PhD in the former field (game theory, formal logics, and philosophy of language).