An AI agent playing the game ‘Butterfly’, trained using SIMPLE

How To Build Your Own AI To Play Any Board Game

NEW reinforcement learning Python package SIMPLE — Self-play In MultiPlayer Environments

David Foster
10 min read · Jan 25, 2021

✏️ The Plan

In November, I set out to write a Python package that can train AI agents to play any board game…🤖 🎲

To be successful, the package had to meet the following objectives:

  1. It will work with any custom board game logic
  2. It will handle multiplayer games
  3. It will start tabula-rasa and learn through self-play

🎯 The Outcome

The output from this project that meets these objectives is called SIMPLE — Self-play In MultiPlayer Environments.

You simply plug in a game file that handles the game logic, hit ‘train’ and wait for it to become superhuman! 🚀

🎲 What SIMPLE does in 148 words

If you’ve tried reinforcement learning (RL) before, you might have hit a massive barrier when expanding your horizons beyond single-player games (e.g. Cartpole, Atari).

That’s because with multiplayer games, you are trying to beat an opponent, rather than the environment itself.

And it really matters who you are playing against. It’s easy to beat a player who plays randomly, but that doesn’t mean the agent has mastered the game.

A technique called self-play is the answer:

Train the current version of the network against previous versions of itself as opponents. The agent will keep finding novel strategies to overcome its latest opponent and gradually become a stronger overall player over time.

This technique has been applied with incredible success to agents such as AlphaZero (Go, Shogi, Chess) and OpenAI Five (Dota2).

SIMPLE is a package that makes implementing self-play RL on custom multiplayer games…simple!

SIMPLE allows you to implement self-play RL for your own custom multiplayer games

To get started right away, check out the package README.

The rest of this blog post explains how SIMPLE works — at the bottom of the post is a detailed code walkthrough and analysis of the training logs in Tensorboard.

🎨 A Diagram

Below is a diagram of the training loop that trains the agent to play a multiplayer game through self-play PPO. The package uses the Stable Baselines implementation of PPO.

A high-level diagram of the SIMPLE training loop.

There are four main components:

  1. The Network Bank. This stores previous versions of the agents to pull into the environment as opponents.
  2. The Proximal Policy Optimisation (PPO) engine. PPO, developed by OpenAI in 2017, is a state-of-the-art RL algorithm. It updates the current version of the network being trained and fires a callback that saves the network to the bank if the current version has outperformed previous versions.
  3. The Environment. The base environment is any multiplayer game logic that you can dream up.
  4. The Self-play Wrapper. This converts the multiplayer base environment into a 1-player environment that can be learnt by the PPO engine.
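As a rough, self-contained sketch of how these four components fit together (the names below, such as NetworkBank and ppo_update, are illustrative stand-ins rather than the actual SIMPLE API), the flow looks something like this:

```python
import random

# Toy sketch of the training loop in the diagram. 'Policies' here are just
# numbers and 'training' is a stub; the point is the control flow:
# bank -> wrapper pulls opponents -> PPO update -> threshold callback -> bank.

class NetworkBank:
    def __init__(self):
        self.versions = []

    def save(self, policy):
        self.versions.append(policy)

    def sample_opponent(self):
        # the self-play wrapper draws opponents from here on each reset
        return random.choice(self.versions)


def ppo_update(current_policy, opponent):
    # stand-in for a round of PPO training against the sampled opponent
    return current_policy + random.random()


def beats_bank(policy, bank, margin=0.5):
    # stand-in for the callback that checks whether the current network
    # has outperformed the previously saved versions
    return policy > max(bank.versions) + margin


bank = NetworkBank()
bank.save(0.0)      # seed the bank with a random 'base' agent
current = 0.0       # the network currently being trained

for generation in range(20):
    opponent = bank.sample_opponent()
    current = ppo_update(current, opponent)
    if beats_bank(current, bank):
        bank.save(current)

print(f"{len(bank.versions)} network versions saved to the bank")
```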

The self-play wrapper is the ‘secret sauce’ of SIMPLE! Let’s explore that in more detail now.

🍬 The Self-play Wrapper

The self-play wrapper performs two key functions:

🎲 Handling opponents

On each reset of the environment, it loads random previous versions of the network as the next opponents for the agent to play against. It takes the opponents’ turns by sampling from their policy networks.

We can see this in action below — the base opponent is loaded initially as Player 1 and Player 3 and the agent plays as Player 2. Player 1’s turn is taken automatically by sampling from the policy network output before handing back to the agent.

Output from the SIMPLE training log for the game ‘Butterfly’

⏱ Delayed Rewards

It delays the return of the reward to the PPO agent until the other agents have taken their turns.

Suppose you are playing a 3-player card game such as whist. If you are the first player to play in the trick, then your action (playing a card) doesn’t immediately create any reward. You need to wait to see what the other two players play in order to know whether you won the trick or not.

The self-play wrapper handles this delay by following the PPO agent’s action with all other required opponent actions before returning any reward to the agent being trained.

🤺 Tournament Results

We can evaluate agents in the network bank in a tournament, where every agent plays every other agent a set number of times. This creates an average reward matrix — for example as shown below, for a set of agents trained on the multiplayer card game Sushi Go.

Each cell shows the average reward for the P1 agent in the game Sushi Go, when pitted against two copies of the P2 agent, averaged across 50 games. Rewards were 1st place = 1, 2nd place = 0, 3rd place = -1. All agents were trained using self-play PPO.

We can clearly see the agent improving as training progresses — model_n can, on average, triumph over all model_i where i<n. This is demonstrated by the diagonal blue / red split in the matrix.
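For illustration, a round-robin evaluation of this kind can be produced with a loop of roughly the following shape. The play_game function here is a stub standing in for loading the saved models and playing one 3-player game of Sushi Go; it is not part of the SIMPLE API.

```python
import numpy as np

# Sketch of the round-robin tournament evaluation described above.
# `play_game` is a stub: it should return the reward earned by the P1 agent
# (1 = 1st place, 0 = 2nd, -1 = 3rd) for one game against two copies of P2.

def play_game(p1_name, p2_name, rng):
    return rng.choice([1, 0, -1])

def tournament(model_names, n_games=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(model_names)
    results = np.zeros((n, n))
    for i, p1 in enumerate(model_names):
        for j, p2 in enumerate(model_names):
            rewards = [play_game(p1, p2, rng) for _ in range(n_games)]
            results[i, j] = np.mean(rewards)  # average P1 reward vs two copies of P2
    return results

print(tournament([f"model_{k}" for k in range(1, 6)]))
```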

Having played against the later models myself, I can attest that the agent has developed complex, human-like strategies, such as blocking other players, using the chopsticks to select the best two cards in a hand, and not picking certain cards when it has little chance of winning the hand. It beats me over a series of games of Sushi Go, so it has definitely reached ‘superhuman’ territory. Very exciting!

🎮 Training On Custom Environments

The fun really starts when you train SIMPLE on your own custom games.

You can use the existing environments in the repository as a template for building your own. The general structure follows the standard OpenAI Gym framework, with a few extra methods and attributes that are required to ensure it works with the self-play environment wrapper. So far, only Discrete action spaces are implemented (e.g. choosing one of n possible actions on your turn).
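As a rough skeleton of what such an environment looks like (the attribute names here are based on the description above; treat the example environments in the repository as the definitive reference):

```python
import gym
import numpy as np

# Illustrative skeleton only -- attribute names follow the description above;
# check the example environments in the repo for the exact interface.

class MyGameEnv(gym.Env):
    def __init__(self, verbose=False):
        super().__init__()
        self.n_players = 3                                    # multiplayer!
        self.action_space = gym.spaces.Discrete(10)           # only Discrete is supported
        self.observation_space = gym.spaces.Box(-1, 1, (24,))

    @property
    def observation(self):
        # game state from the point of view of the current player
        return np.zeros(24, dtype=np.float32)

    @property
    def legal_actions(self):
        # binary mask over the action space, used to mask the policy head
        return np.ones(10, dtype=np.float32)

    def reset(self):
        self.current_player_num = 0
        return self.observation

    def step(self, action):
        # apply `action` for the current player, advance current_player_num,
        # and return the reward for *every* player as a list
        reward = [0] * self.n_players
        done = False
        self.current_player_num = (self.current_player_num + 1) % self.n_players
        return self.observation, reward, done, {}

    def render(self, mode='human'):
        pass
```

Note that step returns a reward list with one entry per player; the self-play wrapper later picks out the entry belonging to the agent being trained.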

Follow the steps in the Github README to start writing your own custom multiplayer environment!

🧠 Building Custom Networks

You’ll also need to define a custom structure for the policy network that takes into account the size of the environment’s action space and the dimensionality of the observation input.

For example, for games played on a grid, like Tic Tac Toe, you can input the observation as a 3-D array (height, width, features) and write convolutional layers to operate on this input. For card games like Sushi Go (network shown below), dense layers can be used instead.

The network is completely customisable: for PPO, you just need to output a policy head (pi) and a value head (vf). The policy head can also be masked to zero out invalid actions, by passing the legal_actions vector in as an additional input (see the repository for how this is done).

The Sushi Go custom policy network, with invalid action masking.
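To make the masking idea concrete, here is a minimal Keras sketch of a dense policy/value network with invalid-action masking. This is an illustration of the idea only: the actual implementation is written against the Stable Baselines policy interface, so the real code looks different, and the layer sizes below are not the actual Sushi Go architecture.

```python
from tensorflow.keras import layers, Model

# Minimal sketch: dense trunk, a masked policy head (pi) and a value head (vf).
# Sizes are placeholders, not the real Sushi Go network.

N_ACTIONS = 10
OBS_SIZE = 24

obs = layers.Input(shape=(OBS_SIZE,), name='observation')
legal_actions = layers.Input(shape=(N_ACTIONS,), name='legal_actions')

x = layers.Dense(128, activation='relu')(obs)   # dense layers suit card games
x = layers.Dense(128, activation='relu')(x)

logits = layers.Dense(N_ACTIONS, name='logits')(x)
# push the logits of illegal actions towards -inf before the softmax,
# so they receive (near) zero probability
masked_logits = layers.Lambda(
    lambda t: t[0] + (1.0 - t[1]) * -1e8)([logits, legal_actions])
pi = layers.Softmax(name='pi')(masked_logits)   # policy head

vf = layers.Dense(1, name='vf')(x)              # value head

model = Model(inputs=[obs, legal_actions], outputs=[pi, vf])
model.summary()
```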

👏 Conclusion

In this post, we’ve explored how to use SIMPLE to train agents on custom multiplayer games through self-play reinforcement learning.

Feel free to contribute to the SIMPLE project and start using it to come up with RL agents for your own games. I’m excited to see what you produce!

APPENDIX

📈 The Training Log

Let’s now walk through the output of the training log.

We can inspect the Tensorboard charts that are created automatically by the Stable Baselines library, to understand the status of the training process.

Tensorboard output, created during the training process.

Some of the most important charts to inspect are as follows:

episode_reward / discounted_rewards

These charts show the average reward achieved by the agent across the training process. In general, you should see this climb gradually until the threshold score is reached, then drop suddenly every time a new policy network is saved out, because the agent now has to play in an environment with more difficult opponents to overcome!

loss / policy_gradient_loss / value_function_loss / entropy_loss

These terms relate to the PPO network optimisation, which involves trying to maximise the objective function below:
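This is the combined objective from the PPO paper (Schulman et al., 2017):

```latex
L_t^{CLIP+VF+S}(\theta)
  = \hat{\mathbb{E}}_t\!\left[\, L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \,\right]

\text{where}\quad
L_t^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad
L_t^{VF}(\theta) = \big(V_\theta(s_t) - V_t^{\mathrm{targ}}\big)^2,

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\ \text{is the probability ratio and}\ \hat{A}_t\ \text{is the advantage estimate.}
```

Here c_1 and c_2 weight the value-function and entropy terms respectively, and S is the entropy of the policy.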

In Tensorboard, you’re actually seeing the negation of each of these terms: the neural network is trained by minimising a loss, so each term is flipped in sign and becomes a ‘loss’.

policy_gradient_loss is the negation of the CLIP term. The PPO algorithm tries to push the CLIP term higher by making the policy more likely to choose actions that are more favourable than predicted by the value function, and less likely to choose actions that are less favourable than predicted by the value function. However, the term is clipped, to prevent the new policy from drifting too far away from the current policy.

value_function_loss is the VF term. The VF term measures the accuracy of the value function; PPO tries to make this as small as possible, hence the minus sign in front of it in the objective.

entropy_loss is the negation of the S term. The S term relates to the entropy of the policy — the higher the entropy the more random the policy. The PPO algorithm tries to push this term higher to encourage exploration, but of course, the CLIP term will counterbalance this to ultimately prefer actions that produce beneficial outcomes.

Key things to look out for!

In general, you should see the value_function_loss gradually fall over time, though it may jump up every time a new network is saved out, because its previous estimates are no longer valid against the new opponents. The entropy_loss should drift gradually upwards over time (i.e. entropy falls) as the agent becomes more sure about its actions and less random.

Over time, a good strategy is to anneal the entropy coefficient towards zero, allowing the policy_gradient_loss to dominate. If there isn’t enough randomness in the actions initially, there is a risk the model will collapse: it finds what it thinks is a winning strategy, but isn’t exploring enough possible opponent moves to see that it has actually learnt a weak policy.

🐍 Detailed Code Walkthrough

Let’s now look at the three functions within the self-play environment wrapper that make the magic happen. These are code snippets from the full selfplay.py file that contains the SelfPlayEnv class. We assume here that we are playing a multiplayer game where the ultimate reward is given at the end of each episode (e.g. 1 for winning the game, otherwise 0). Line numbers are in brackets for reference.

reset

First we reset the game as usual (5), then set up the opponents for the next game (6). By default, this chooses the current best network in the network bank with 80% probability and a random previous network with 20% probability. This reduces the chance that the agent will start to overfit to the latest best model and perform badly against earlier models.

Then we check if it’s the turn of the agent we are training (8) — after all, the agent may not start the game! If not, then we call the continue_game method (9)…

continue_game

This method runs the turns of the opponents until it is once again the turn of the agent that we are training. First we ask the opponent whose turn it is to choose an action (7). This is a stochastic choice from the policy distribution, rather than choosing the action judged to be ‘best’, because we want our agent to face a range of scenarios and learn to exploit weak moves as well as defend against good ones.

We pass the chosen action to the multiplayer environment (8). This handles the change of game state: for example, updating the current_player_num, returning the done boolean and returning the reward for each player as a list.

If done is true (12), then we exit the while loop early (13) — an opponent has taken an action that has finished the game. Once it is the turn of the agent we are training or the game is over, we exit the function.

step

Lastly, we need to wrap the step function of the multiplayer environment.

We first pass the chosen action to the multiplayer environment (6) that handles the change of game state.

Then we check if the game is done (11). If not, then we hand over to the continue_game method (12) that, as we saw above, handles the turns of the opponents until it is once again the turn of the agent.

Finally, we pick out the reward from the list that is assigned to the agent (14). As the final reward value (1 = win, 0 = lose) may only be determined after an opponent’s turn, it’s important that the step method waits for all opponents to take their turns before passing the reward value back to the agent (20).
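Pulling the three methods together, here is a simplified sketch of the wrapper logic described above. This is not the actual selfplay.py: names such as agent_player_num, choose_action and the network-bank helpers are illustrative, and the real implementation differs in its details.

```python
import random

# Simplified sketch of the self-play wrapper logic described above.
# NOT the actual selfplay.py: `agent_player_num`, `choose_action` and the
# network-bank helpers are illustrative placeholders.

class SelfPlayWrapper:
    def __init__(self, base_env, network_bank):
        self.env = base_env
        self.network_bank = network_bank

    def setup_opponents(self):
        # choose the current best network 80% of the time and a random earlier
        # version 20% of the time, for each opponent seat
        self.opponents = [
            self.network_bank.best() if random.random() < 0.8
            else self.network_bank.random_previous()
            for _ in range(self.env.n_players - 1)
        ]
        self.agent_player_num = random.randrange(self.env.n_players)

    def opponent_for(self, player_num):
        # map a seat number to one of the loaded opponents (skipping the agent's seat)
        idx = player_num if player_num < self.agent_player_num else player_num - 1
        return self.opponents[idx]

    def reset(self):
        obs = self.env.reset()
        self.setup_opponents()
        # the agent may not start the game, so let the opponents play first
        if self.env.current_player_num != self.agent_player_num:
            obs, _, _, _ = self.continue_game()
        return obs

    def continue_game(self):
        # play the opponents' turns until it is the training agent's turn again,
        # or the game ends
        obs, reward, done, info = None, None, False, {}
        while self.env.current_player_num != self.agent_player_num:
            opponent = self.opponent_for(self.env.current_player_num)
            action = opponent.choose_action(self.env)   # sample from its policy
            obs, reward, done, info = self.env.step(action)
            if done:
                break
        return obs, reward, done, info

    def step(self, action):
        # the training agent's move
        obs, reward, done, info = self.env.step(action)
        if not done:
            result = self.continue_game()
            if result[0] is not None:       # opponents actually took turns
                obs, reward, done, info = result
        # only now is the agent's reward known, so only now is it returned
        return obs, reward[self.agent_player_num], done, info
```

The key point is in step: the agent’s action and all of the opponents’ responses happen inside a single call, so the reward handed back to PPO already reflects the full round of play.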

Before you go…

  • 👏 Please leave some claps!
  • ✏️ Comment below! — I’d love to hear about your experiments!
  • 🐦 Follow me on Twitter

Thanks for reading!

Applied Data Science Partners is a London based consultancy that implements end-to-end data science solutions for businesses, delivering measurable value. If you’re looking to do more with your data, please get in touch via our website. Follow us on LinkedIn for more AI and data science stories!


David Foster

Author of the Generative Deep Learning book :: Founding Partner of Applied Data Science Partners