Asymptotic Labs
Published in

Asymptotic Labs

Battlesnake Post Mortem

Using a desktop GPU to top the global arena in under a week.

Our brief time at the top using only a Reinforcement Learning Policy Network

Let me start by thanking the wonderful folks at Battlesnake for putting together the Stay Home and Code competition, which raised money for Food Banks Canada. You can find the event details and replays here: https://play.battlesnake.com/events/stay-home-and-code/

Battlesnake Primer

Battlesnake is an online multiplayer board game, played entirely through coded agents in groups of 2-8. To compete, users write web servers compatible with the Battlesnake API.

The rules are simple: eat food to grow, don't run out of life, and survive the longest.

Each snake starts with 100 life and loses 1 life per turn. Eating 1 food restores the snake's life back to 100 and increases the snake's size by one space (growing from the tail).

A snake will die if it leaves the board (ie. hits the wall), runs out of life, or hits another snake. Side case: if two snakes collide head-on, the larger snake survives.

Strategy Landscape Primer

Many strategies are viable in Battlesnake, and our agent needed to be able to compete against a wide array of strategies. These include rule-based solutions, tree search variants, and machine learning based snakes.

Rule Based Agents are the most common implementation strategy for newcomers to Battlesnake. These generally consist of a set of if-then-else logic branches, choosing their respective moves based on the rule set. They may include some search heuristics, but generally lack look-ahead behaviour.

Tree Search Agents are another popular choice as they provide insights into future game states. Common choices for tree searches including Minimax or its more efficient variant Alpha-Beta Pruning. These type of strategies have a long history in computer game playing as they can be relatively simple to implement (depends on the game) and have great performance with limited resources. A more recent strategy (2006) is Monte-Carlo Tree Search. MCTS is the backbone strategy for more recent advances in game-playing AIs including AlphaGo/AlphaZero.

Machine Learning Agents are a less popular method for competing in Battlesnake, due to their relatively high complexity, necessity for computing resources, and a large amount of training data required. Candidates here include supervised learning on preexisting game replays, or generating new games via reinforcement learning, among others.

Our Strategy

Our agent for the Stay Home and Code competition was a hybrid snake, which combined a Reinforcement Learning (RL) policy and an Alpha-Beta (AB) tree search.

The RL policy is used for general game playing, effectively choosing the move to take at each turn. The AB search is a fallback which will override the RL policy, only if it determines a move to be a definite loser or a definite winner. For example, if the AB search finds a winning move in the future, it will always override the RL policy. Similarly, if the AB search finds that the RL policy choice will lead to a definite loss in the future, it will override the RL policy, and choose the best move it finds.

The following pseudocode describes this logic:

policy_action = policy.compute_action(game_state)
tree_scores = tree.compute_scores(game_state)
# Check if the policy action is a loser in the future
if tree_scores[policy_action] == -Infinity:
return best_action(tree_scores)
# Check if tree search found a winning move
if any tree_scores is Infinity:
return best_action(tree_scores)
# Base case: use policy network move
return policy_action

In practice, we found that this helps slightly at the higher end of competition where tree search snakes dominate. Interestingly, it was not needed to briefly top the global leaderboard, as that was done without the tree search overrides.

Machine Learning Primer

Feel free to skip this part if you have a background in machine learning.

Machine Learning is found in applications people use every day. From Google Search to your bank, it is near impossible to avoid it’s presence in today’s technology heavy society. In a nutshell, it is the idea of creating models from data. A well known saying in machine learning is as follows: no model is perfect, but some models are useful.

Three branches of machine learning commonly described are Supervised Learning, Unsupervised Learning, and Reinforcement Learning.

Supervised Learning is about learning models to map previously unseen inputs to outputs. Well-known examples here would be creating a model to identify cats in images, detect stop-signs in a self-driving car, or a bank wanting to analyse how risky it is to give you a mortgage. The key here is that supervised models require existing datasets with labelled outputs to learn from.

Unsupervised Learning is about learning structure from data. A common application is a recommender system such as those found in streaming services or online shopping sites.

Reinforcement Learning is about learning to interact with an environment to maximise a reward. Reinforcement learning agents find uses in a wide variety of tasks including robotic control systems, financial portfolio management, games, and more.

Reinforcement Learning Primer

Feel free to skip this part if you have a background in reinforcement learning.

Reinforcement Learning is the branch of machine learning that really gets us excited. The core idea of reinforcement learning is create a decision making agent through trial and error. This is achieved by learning a policy (think of this like a strategy or rules), that chooses actions given the current environment’s observation (the state of the environment) that maximises cumulative future rewards (how good was the outcome of taking the chosen action in the prior state). This is modelled after Markov Decision Processes.

A toy reinforcement learning scenario that we encounter as children would be learning not to touch a hot frying pan. The environment would be the frying pan, with it’s possible observations as {Red Hot, Cold}. Possible actions would be {Touch, Avoid}. For each action taken in the environment, one could get a reward of either {Pain, No Pain}. Clearly, if one observed the frying pan as Red Hot and they Touch it, one would receive a reward of Pain. If the goal of the person is to avoid Pain, then the policy would optimally learn to not touch Red Hot frying pans. Learning policies through trial and error is what reinforcement learning is all about.

There are many amazing resources online to learn about RL, and one that we found very enlightening was the Sutton and Barto book on Reinforcement Learning.

Technical Details: Inputs and Policy Architecture

The following sections assume some background knowledge about neural networks. We’re avoiding a lot of implementation details since they’re out of scope for this article, and there are great existing learning materials via your favourite search engine.

In that past few years of Battlesnake, we had success creating agents using both the Minimax and Alpha-Beta pruning algorithms. However, given the 500ms per turn limit, we felt we had pushed our solution to the max, and after spending far too many hours tweaking our hand-coded heuristic, we decided this year to try something radically different.

At first, we were exploring the idea of creating a plain supervised model, trained on examples generated by the Alpha-Beta snake games. We quickly realised this was not practical, as it would take far too long to generate enough quality games, and also that since it was learning on examples of the Alpha-Beta snake, it would likely just converge to the same strategy. Since training on examples from our Alpha-Beta snake would just cause the model to learn the same strategies and biases, we scrapped that idea.

Recently, OpenAI released a paper about how they reached world champion performance in the computer game Dota using Proximal Policy Optimization (PPO). Inspired by this, we decided to try building our own agent for Battlesnake — we would train it from scratch, against nothing but itself via self-play.

In order to train a reinforcement learning agent, we needed to have the following components: a gym environment that we could use to simulate games (the observation, action, reward loop), and a PPO algorithm that we could tailor to our gym and policy network.

For our gym, we used https://github.com/cbinners/gym-battlesnake (this fork contains the up-to-date Battlesnake rule set and custom input layers. Thank you Arthur for the original work!)

For our PPO implementation, we used a modified version of the code found here: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

The gym environment we used creates 17-channel 23x23 images as game state observations, with the player head centred at {12, 12}. This facilitates the current global arena setup of having 11x11 board sizes. The following is a comment from our gym implementation, describing what each channel is encoding:

/*        
layer0: snake health on heads {0,...,100}
layer1: snake bodies {0,1}
layer2: body segment numbers {0,...,255}
layer3: snake length >= player {0,1}
layer4: food {0,1}
layer5: gameboard area {0,1}
layer6: head_mask {0,1}
layer7: double_tail_mask {0,1}
layer8: snake bodies >= us {0,1}
layer9: snake bodies < us {0,1}
layer10-16: alive count mask
*/

If you feel inclined to dive into the game state generation, it’s here.

Our agents policy network is a non-recurrent deep CNN with 4 convolutional layers and a multi-head output. Below is a Pytorch snippet of our policy network setup.

# 4-stack CNN with leaky relu activations
self.base = nn.Sequential(
nn.Conv2d(17, 32, 3),
nn.LeakyReLU(),
nn.Conv2d(32, 64, 3),
nn.LeakyReLU(),
nn.Conv2d(64, 128, 3),
nn.LeakyReLU(),
nn.Conv2d(128, 256, 3),
nn.LeakyReLU(),
Flatten(),
nn.Linear(in_features=225*256, out_features=512),
nn.LeakyReLU(),
)

# Value head predicts how good the current board is
self.value_head = nn.Linear(in_features=512, out_features=1)

# Policy network gives action probabilities
self.policy_head = nn.Linear(in_features=512, out_features=4)

Reward Shaping

Since reinforcement learning algorithms are trying to maximise cumulative future rewards, we can see that choosing different reward shapes will dramatically change the behaviour of the learned policies.

For example, in Battlesnake, one might think that eating as much food as possible is a great strategy. A possible reward structure for this could be to reward the agent with +1 every time it eats food, and a reward of 0 if it does not eat. With this reward shape, then the reinforcement learning algorithm will learn a policy that tries to maximise the length of snake, with complete disregard for winning.

One could easily extend this reward shape to account for winning and losing, by rewarding the agent for +100 on a win, and -100 on a loss.

While this is a completely viable reward shape, we chose a much simpler one for a number of reasons. Most importantly, the above reward shape introduces human bias into the behaviour. How do we know that always growing as big as possible is the optimal strategy? We don’t.

Our rewards shape is -1 for a loss or tie, +1 for a win, 0 otherwise.

The intuition behind choosing this reward shape is simple: we don’t pretend to know the optimal strategy, but we know that winning is good, and losing is bad. Let the reinforcement learning algorithm find out a strategy that just wins.

In practice, this has worked very well thus far.

Reward Shaping for Specific Behaviours

Since reward shaping has such dramatic effects on the learned behaviour, we have some ideas that would be interesting to explore in the future. These may not be “winners”, but could be fun to watch.

Aggro Agent: provide an additional reward for head-on-head collision kills.

Short Agent: provide a slight negative reward for eating.

Training Parameters

Proximal Policy Optimization has many parameters to play with. For the most part, we chose the recommended values from the PPO paper.

agent = PPO(policy,
value_loss_coef=0.5,
entropy_coef=0.01,
max_grad_norm=0.5,
clip_param=0.2,
ppo_epoch=4,
num_mini_batch=16,
eps=1e-5,
lr=5e-5)

PPO gamma was set to 0.999, which gives a reward half life of 692 steps. This is one of the most important parameters to set properly, as it can have drastic effects on behaviour. This is the discount factor, which changes how much the agent values long term rewards vs short term rewards.

For our setup, we used 208 parallel environments, each simulating 600 steps per iteration. This gives each training batch 124,800 game turns, which we split into 16 mini-batches each of size of 7,800 turns. Each iteration, we do 4 epochs over the iteration batch.

For this competition, we trained our network for 4 days, with random weight initialization. We ran 4200 training iterations, which is ~524 million game turns. This equates to simulating and training on ~1,500 turns per second.

These parameters worked well for us, but they are unlikely to be optimal.

Updating our Agent

Reinforcement learning is notoriously difficult to get working properly. When doing supervised training, it’s easy to verify that you are improving your model — just ensure your validation error is decreasing. For reinforcement learning, specifically self-play training, this is not so simple.

Yes, during training you’ll observe your loss function decreasing, but that doesn’t necessarily mean your agent’s behaviour is improving. As your agent learns, it’s opponents get stronger as well (since the agent is the opponent)! This means that despite our agent improving, the observed training losses actually fluctuate up and down — they are non-stationary. Since we cannot directly use our loss function as an indicator of improvement, we use an auxiliary function to check for behaviour improvement.

Every 50 iterations, or roughly 6 million turns, we run a performance check pitting the current policy against 3 of the prior best policies, in a 4-way match. 3000 games are played to completion, and if the win rate of the current policy is > 30%, then we update the prior best policy to be the current policy. In a 4-way match with equally skilled players, the expected win rate is 25%, given a large enough sample size. This ensures that when we’re updating our agent, it is actually winning more often than the opponents, which is the metric we actually care about.

Preventing Strategy Collapse

A potential pitfall of training via self-play is something known as strategy collapse. Think of the modified kids game rock-paper-scissors-dynamite. If we assume that there are four strategies, Rock, Paper, Scissors, and Dynamite, with the classical rules, but Dynamite beats the other 3, then it’s clear we would like our strategy to learn Dynamite.

Here, we will present a scenario which clearly shows strategy collapse. Assume we have two policies: current (call it C), and prior (call it P). C and P are both initially running strategy Scissors.

As we start training, C eventually learns that playing strategy Rock wins against agent P. Now, we have a better policy (since clearly Rock beats Scissors), so we update the prior policy P to be the same as our current policy C. Then, we start the training procedure again, and C learns to play Paper (since Paper beats Rock), and we update P to Paper. Again, C learns to play Scissors (since Scissors beats Paper), and we update P to Scissors. Now, despite our policy updating multiple times, we’re back at the beginning and don’t actually have a better strategy!

As it turns out, there is a great way to combat this issue.

If, during the training between updates, we sometimes play against older policies, then the current policy will be forced to “remember” how to play against the older strategies, and will be less likely to learn cyclical strategies as presented above. Over time, as our agent learns to beat each of Rock, Paper, and Scissors, it should learn how to play a strategy like Dynamite.

In our training setup, to attempt to mitigate or damped strategy collapse, we use an opponent selection similar to, but not the same as OpenAI Dota 5: 80% of the time we train against the current best model, and 20% randomly select one of the prior best models as the opponent.

Comparison with Alpha Beta

Although our PPO snake has yet to dominate our Alpha-Beta strategy, it’s execution performance is vastly superior. On a single core compute instance, we can execute our PPO strategy on the CPU in < 15ms, and get very close to the same performance as our 500ms Alpha-Beta snake. This gives us hope that with more training time, and a slightly larger network, we’ll be able to finally have a policy that beats the tree search snakes consistently.

Google Colab Notebook Example

We’ve created an example notebook to play with some reinforcement learning at this Google Colab. This notebook is an example (it’s not configured as we have used to train our Battlesnake), and it will take a good amount of coding effort to take it from the notebook to a production instance. This could serve as a starting example for those wishing to dive in… we encourage it!

Hardware

All of our experiments and training were done on a single desktop machine with the following specs:

CPU: Ryzen 2700x (8 physical cores)

GPU: 24GB Titan RTX

RAM: 64GB DDR4

Conclusion

This years competition was an incredible learning experience, and very fun. Our final submission reached the top-8 in the Expert Division for Stay Home and Code. We will definitely be competing in future competitions, and will work hard to bring strong competition. We’re always happy to chat about machine learning and opportunities to collaborate, so please reach out to us at Asymptotic Labs if you’re so inclined!

Update

Since the first SHAC competition, we’ve won our first competition: https://play.battlesnake.com/events/communitech/communitech-veteran/brackets/

We’ll be doing a follow up post describing the changes we made to make this happen. Stay tuned!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store