Reinforcement Learning Adventures

Pranav Nutalapati
8 min read · Feb 13, 2020


My fascination with reinforcement learning started when I came across an article titled “Emergent Tool Use from Multi-Agent Interaction” on OpenAI’s blog. Long story short, it talked about agents learning to play hide-and-seek. It was amazing to see the agents come up with strategies to perform better.

For context, reinforcement learning is the same method that was used by DeepMind in AlphaGo and by OpenAI in their Dota2 bots. The idea is similar to the way we learn: observe, act, evaluate. The agent is in a continuous cycle of observing its environment, taking an action, and determining whether that action was good or bad.

Introduction to Deep Q-Networks

All the quests I talk about in this post are built on DQNs, so it makes sense to cover what exactly they are. Although I won’t go into too much detail, DQNs were created by DeepMind (arXiv:1312.5602 [cs.LG]) in 2013 to play old-school Atari games. They worked by trying to learn something called the Q-value (or quality value) of being in a certain state and taking a certain action. It’s built on top of the Bellman equation and follows this formula.

The quality of being in a certain state and taking a certain action is the sum of the immediate reward and all future rewards.
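Written out in standard Q-learning notation (the textbook form, not a reproduction of the original figure), that's:

```latex
Q^{*}(s, a) \;=\; \mathbb{E}\!\left[\, r \;+\; \gamma \max_{a'} Q^{*}(s', a') \,\right]
```

Here r is the immediate reward, s′ is the state you end up in, and γ is a discount factor between 0 and 1 that decides how heavily future rewards count.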

By learning to estimate the Q-value, the network would be able to pick the optimal action to make at every stage of a game.

Quest: Chrome Dinosaur Game

My introductory project was a bot to play the Chrome Dino game. I thought the game was perfect because of its simplicity and repeatability. I slapped a Deep Q-Network onto it to see what would happen. Initially, the bot was just randomly divebombing into cacti, but after 15 hours of training, it reached a record of 10,000!

A few seconds of gameplay from the bot.

After the success of my first experiment, I thought I'd try another game. This time, I wanted a game where I could see the development of strategies.

Quest: Connect 4

I thought Connect 4 could be a good step-up, so I settled on it. The most substantial difference between Connect 4 and the dinosaur game is that Connect 4 is a two-player game. However, Connect 4 is what I’m calling a symmetric game: there’s no substantial difference between player 1 and player 2. So, I wanted to use the same network to play both sides of the game, which would allow it to effectively learn twice as fast (double the experiences) as two different networks.

When I set about making the Connect 4 environment, it was set up so that both players would think they’re playing as red. That way, the network would only need to learn how to play to make red win. When it was black’s turn, the environment would flip the colors behind the scenes.

In addition, since DQNs weren't intended for two-player games, I made the environment effectively conceal the agents from each other, letting each network chalk up its opponent's behavior as the environment's behavior.
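Stripped down, the color-flipping trick looks something like this (the function and variable names here are illustrative, not lifted from my actual code):

```python
import numpy as np

def observation_for(board: np.ndarray, player: int) -> np.ndarray:
    """Return the Connect 4 board from the current player's point of view.

    `board` is a 6x7 array with 1 = red, -1 = black, 0 = empty.
    Red (player = +1) sees the board as-is; black (player = -1) sees the
    colors flipped, so both sides only ever learn to "make red win".
    """
    return board * player

# Example: black's own piece shows up as +1 ("red") from black's perspective.
board = np.zeros((6, 7), dtype=np.int8)
board[5, 3] = -1                          # black drops a piece in the middle column
print(observation_for(board, -1)[5, 3])   # prints 1
```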

Network structure for the Connect 4 AI (left) and TensorBoard logs (right)

The network was once again a Deep Q-Network, but with a twist. DQNs suffer from being very shaky in their performance because what they are trying to learn keeps changing. You can think of this as a scenario of moving goalposts: every time you get closer, the goalpost itself moves further away.

To help alleviate the shakiness, van Hasselt et al. (arXiv:1509.06461 [cs.LG]) came up with a solution in 2015: have two networks, one online and one "target". The idea was that the online network would train using a fixed target network as the goalpost. At intervals of n moves, the target network would be updated to match the online one.
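A single Double DQN training step then looks roughly like this (sketched in TensorFlow; the discount factor, sync interval, and batch layout are illustrative):

```python
import tensorflow as tf

GAMMA = 0.99        # discount factor (illustrative)
SYNC_EVERY = 1000   # steps between target-network updates (illustrative)

def train_step(online, target, optimizer, batch, step):
    """One Double DQN update on a batch of (state, action, reward, next_state, done)."""
    states, actions, rewards, next_states, dones = batch

    # The online network picks the best next action...
    next_actions = tf.argmax(online(next_states), axis=1)
    # ...but the frozen target network judges how good that action is.
    next_q = tf.gather(target(next_states), next_actions, axis=1, batch_dims=1)
    targets = rewards + GAMMA * (1.0 - dones) * next_q

    with tf.GradientTape() as tape:
        q_taken = tf.gather(online(states), actions, axis=1, batch_dims=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, online.trainable_variables)
    optimizer.apply_gradients(zip(grads, online.trainable_variables))

    # Every n steps, move the goalposts: copy the online weights into the target.
    if step % SYNC_EVERY == 0:
        target.set_weights(online.get_weights())
    return loss
```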

The network was two convolutional layers followed by three dense layers. The input was the current game state as a 6-by-7 array, with zero meaning empty, -1 meaning black, and 1 meaning red. It had 7 outputs, one for each column. At the end, an action mask was applied to prevent the network from playing illegal moves (like dropping a piece into a full column).
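For reference, that structure translates to something like the following in Keras (the filter counts and dense-layer widths below are representative choices, not the exact ones):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_q_network():
    """Two conv layers + three dense layers mapping a 6x7 board to 7 Q-values."""
    board = layers.Input(shape=(6, 7, 1), name="board")    # one channel: -1/0/1 per cell
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(board)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    q_values = layers.Dense(7, name="q_values")(x)          # one output per column
    return tf.keras.Model(board, q_values)

def mask_illegal(q_values, legal_moves):
    """Push full columns to a huge negative value so argmax never picks them."""
    return tf.where(legal_moves > 0, q_values, tf.ones_like(q_values) * -1e9)
```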

When training, the network plays against itself repeatedly, getting a positive reward when it wins a game and a negative reward when it loses. Every 1000 moves, it also plays 100 games against a random player (one that picks moves randomly) to calculate its win-rate. (More about self-play here and here.)
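The evaluation step is conceptually simple; something along these lines (the environment and agent method names below are placeholders, not a real API):

```python
import random

EVAL_GAMES = 100   # games against the random player per evaluation

def evaluate_vs_random(env, agent, games=EVAL_GAMES):
    """Estimate win-rate against a player that picks legal moves uniformly at random."""
    wins = 0
    for _ in range(games):
        obs = env.reset()
        agent_plays_red = random.choice([True, False])   # alternate sides
        done = False
        while not done:
            if env.red_to_move() == agent_plays_red:
                action = agent.act(obs, greedy=True)     # no exploration during eval
            else:
                action = random.choice(env.legal_moves())
            obs, reward, done, _ = env.step(action)
        if env.red_won() == agent_plays_red:
            wins += 1
    return wins / games
```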

A brief clip of the network learning.

Despite my best efforts, however, the network couldn’t exceed an 80-ish% win-rate. It also made some seemingly glaring mistakes, like leaving three coins in a row instead of blocking them. My best guesses are that the network wasn’t complicated enough to learn the game or that it just hadn’t trained for long enough.

Quest: Othello

Connect 4 was cool, but there still wasn’t enough room for the network to explore and innovate. Sticking with the board-games idea, my mind first went to chess. I discarded it, however, because I thought it was too complicated for my naïve bot to learn in a reasonable time period. The game I settled on was Othello. The game was easy enough rule-wise, but had a lot of room for mastery. Heck, the game’s tagline was “A minute to learn, a lifetime to master.”

Because Othello was still symmetric, I wanted the same network to play both sides of the game. So once again, the environment was set up to make both players think they were playing to make white win.

One of the tricky things with Othello was figuring out how to do rewards. In chess and Connect 4, the player who makes the last move is generally the winner, but in Othello, it doesn't really matter who plays the finishing move. What matters is how much of the board you control, so I originally thought the reward could just be how many pieces you turn over with one move.

But that wouldn’t work, because it is possible that your move sets up the opponent for many more points. I decided that the best reward would dependent on the pieces that your move got as well as the pieces that your opponent got because of it.

How I decide the reward of a move (crudely drawn).
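In code, one way to write that down (counting flipped discs directly; I'm simplifying to a single helper):

```python
def move_reward(flipped_by_my_move: int, flipped_by_opponent_reply: int) -> float:
    """Reward for a single Othello move.

    Credit for the discs my move flipped, minus the discs the opponent's
    immediate reply flipped because of the position I left behind.
    """
    return float(flipped_by_my_move - flipped_by_opponent_reply)
```

So flipping 4 discs but handing the opponent a 6-disc reply nets a reward of -2.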

Initially, I applied the same Double DQN strategy to Othello, just with more (three) convolutional layers and more (five) dense layers. But even after 12 hours of training, there was no substantial improvement in win-rate. It was still idling around 50%, which is pretty disappointing against a player who literally decides randomly.

So I researched ways to improve this. One of the first things I came across was Dueling DQNs (arXiv:1511.06581 [cs.LG]). In these, the Q-value is split into two different parts: value and advantage. Value talks about the value of being in a certain state, like your current score in Super Mario, or the number of discs you control in Othello. Advantage focuses on how much better one particular action, like jumping or playing in a specific spot, is expected to be compared to the alternatives. By splitting these up, the network should in theory be able to better model the Q function, making it perform better. I modified my network structure to include the dueling architecture in addition to the already-existing Double DQN, making it a (comical) DDDQN.
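Concretely, the dueling head just splits the final features into two streams and recombines them into Q-values; in Keras it looks something like this (the stream widths are illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

def dueling_head(features, num_actions):
    """Combine a state-value stream and an advantage stream into Q-values."""
    value = layers.Dense(1)(layers.Dense(64, activation="relu")(features))
    advantage = layers.Dense(num_actions)(layers.Dense(64, activation="relu")(features))
    # Subtracting the mean advantage keeps the two streams identifiable,
    # which is the aggregation used in the Dueling DQN paper.
    return value + (advantage - tf.reduce_mean(advantage, axis=1, keepdims=True))
```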

Left to right: initial structure, Double Dueling DQN structure, final TensorBoard logs.

Another solution I implemented was what I called the "Champions System". Earlier, the network would always train against itself, resulting in most matches being tied or near-tied. This time, I changed the training system a bit so the network would train against the best version of its past self. That is, it would play against the past version of itself with the highest win-rate against the random player.
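Mechanically, the champion is just the snapshot of the network with the best win-rate so far; a bare-bones version of the idea (class and method names are illustrative):

```python
class ChampionPool:
    """Track the past version of the network with the best win-rate vs random."""

    def __init__(self, model):
        self.best_win_rate = 0.0
        self.champion_weights = model.get_weights()

    def maybe_promote(self, model, win_rate):
        """Crown a new champion if this version beats the old record."""
        if win_rate > self.best_win_rate:
            self.best_win_rate = win_rate
            self.champion_weights = model.get_weights()

    def load_champion_into(self, opponent_model):
        """Load the reigning champion's weights into the opponent network."""
        opponent_model.set_weights(self.champion_weights)
```

After every evaluation, the latest win-rate goes through maybe_promote, and the opponent network gets the champion's weights before the next block of training games.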

Finally, instead of just giving the current game-state to the network, I took a page out of AlphaGo’s playbook and gave the network the past three moves as well. This gives it a better sense of the moves being played over time and would hopefully allow it to predict strategies and stop them.
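Treating "the past three moves" as the past three board positions, this amounts to stacking boards as channels; a small sketch:

```python
from collections import deque
import numpy as np

class HistoryStacker:
    """Stack the current Othello board with the previous three, AlphaGo-style."""

    def __init__(self, depth: int = 4, shape=(8, 8)):
        self.frames = deque([np.zeros(shape) for _ in range(depth)], maxlen=depth)

    def push(self, board: np.ndarray) -> np.ndarray:
        self.frames.append(board.copy())
        # Shape (8, 8, 4): the newest board sits in the last channel.
        return np.stack(list(self.frames), axis=-1)
```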

With these changes in place, I set it to train and went to sleep. The next morning, I woke up and the results were… disappointing. As expected, the network was losing against the champion by 6 points on average, but the win-rate had gone down! The current champion had an 81% win-rate and was crowned at around 217k steps (~10 hours), but even at 412k steps (~20 hours), there still wasn't a better one. I decided to pause and investigate what was going on before getting back to training.

To take a peek under the hood, I learned how to visualize CNN filters and created these maps.
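The basic trick is to build a second Keras model that exposes the convolutional layers' outputs and run a single observation through it; roughly:

```python
import tensorflow as tf

def conv_feature_maps(model, observation):
    """Return the activations of every Conv2D layer for a single observation."""
    conv_outputs = [layer.output for layer in model.layers
                    if isinstance(layer, tf.keras.layers.Conv2D)]
    probe = tf.keras.Model(inputs=model.input, outputs=conv_outputs)
    # Add a batch dimension, run a forward pass, then drop the batch dimension.
    return [maps[0].numpy() for maps in probe(observation[None, ...])]
```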

The first row is for our reference, showing the current game state and the action taken. The next row contains the observation provided by the environment. Notice how the white agent sees it as it is, while the black agent sees the board inverted. That way, both agents try to make “white” win, but their definitions of white are opposite. Also, each observation contains the past 3 moves along with the current state.

The remaining rows contain the feature maps from each move. The first four rows are from the first convolutional layer, the next eight from the second and the last rows from the third one. Generally, as we progress into deeper layers of the network, we lose the ability to interpret what each map really represents.

Sub-Quest: What’s wrong?

The predominantly empty maps, especially in the deeper layers, lead me to believe that the network was either too complicated for this problem (unlikely) or simply hadn't been trained for long enough (likely). However, because the network's win-rate against the random agent was decreasing even though its performance against the champion remained stable, it's also possible that it was overfitting to the champion, forgetting how to play against a random agent.

Future Quests

First, I want to continue training the Othello network as it is for maybe a week, just to see if its performance (or lack thereof) was because of insufficient training. Also, I'd like to see how it would perform if it trained against a random agent rather than a champion.
