5 Ways NOT to Build a Catan AI

Bryan Collazo
9 min read · Aug 25, 2021


During the pandemic, I started playing online Settlers of Catan (shoutout to https://colonist.io!). I quickly realized there is more skill involved than one may think and this made the game beautiful for me.

At the same time, I was amazed at the recent success of DeepMind (the team behind AlphaGo) in building AlphaZero, a superhuman player of Chess, Shogi, and Go, with a seemingly simple algorithm.

These two interests prompted me to take a shot at making a superhuman artificial intelligence player for Catan.

The purpose of this post is to share these attempts so that others can take these findings further.

Basic Idea and Implementation

There were two main ideas I wanted to explore: Reinforcement Learning (like AlphaZero) and to use Supervised Learning to build a data-driven “value function” (a function that tells us how good the position of a given player is). For both ideas, I needed to implement the core game to simulate millions of games.

I chose to implement the core game logic in Python, because it’s my favorite language and it has a rich ecosystem of machine learning libraries (e.g. TensorFlow, NumPy, Pandas, Scikit-learn).

I also built a GUI (graphical user interface) to help debug the game logic. It’s browser-based and built with React; this is what it looks like:

Catanatron Web UI

Benchmarks

Before diving into fancy approaches, it was important to have solid benchmarks to measure strength against. I implemented three bots, outlined below.

1. Random

The most basic benchmark is a bot that takes actions completely at random (i.e. random.choice(playable_actions)); I call it RandomPlayer. It plays pretty poorly, so losing to this bot would represent huge misfortune.
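As a rough sketch, the whole bot is essentially the following (assuming a Player interface with a decide(game, playable_actions) method, which approximates but may not exactly match Catanatron’s):

import random

class RandomPlayer:
    def decide(self, game, playable_actions):
        # Pick uniformly among whatever the game says is legal right now.
        return random.choice(playable_actions)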

2. Weighted Random

The second (slightly better) benchmark bot is what I call the WeightedRandomPlayer. This bot also takes actions at random, except when “build city”, “build settlement”, or “buy development card” are possible actions. In those cases it skews its random selection to heavily favor them (in that order). It is basically a strict, slight improvement over RandomPlayer that says “if you can build a city, do it”.
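In code, the idea looks roughly like this (the action-type names and weights here are illustrative assumptions; the implementation linked below has the real values):

import random

# Heavily favor cities, then settlements, then development cards.
PREFERRED_WEIGHTS = {
    "BUILD_CITY": 10000,
    "BUILD_SETTLEMENT": 1000,
    "BUY_DEVELOPMENT_CARD": 100,
}

class WeightedRandomPlayer:
    def decide(self, game, playable_actions):
        weights = [
            PREFERRED_WEIGHTS.get(action.action_type, 1)
            for action in playable_actions
        ]
        # random.choices samples one action proportionally to its weight.
        return random.choices(playable_actions, weights=weights, k=1)[0]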

It was inspired by this paper, and it indeed played considerably better. You can see its implementation here and see how it wins ~61% of the games against RandomPlayer:

Result of 10,000 1v1 games between WeightedRandomPlayer and RandomPlayer.

3. Value Function

Lastly, another easy-to-implement yet decent benchmark would be a bot that calculates a hand-crafted “score” of the state of the game, and selects actions so as to greedily maximize this value.

Simply put, we can make this score be the sum of quantities like the number of active knights, the total production points, the number of victory points, the length of the longest road, and other quantities that characterize how well-positioned a player is. We also get to choose the proportions in which these are combined (i.e. if we think “number of active knights” is 2.5x more valuable than “number of production points”, we can make it so).

This bot was much better than RandomPlayer and WeightedRandomPlayer. Although not superhuman, this was already an encouraging finding, since this approach can be continuously improved with more informative "features" (the quantities or metrics that characterize how good a player's position is) and/or further tuning of their respective "weights" (the proportions; the 2.5x in the example above).
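In pseudocode-ish Python, the player looks roughly like this (extract_features, game.copy(), and game.execute() are stand-ins for the real API, and the feature names and weights are illustrative; the file linked below has the actual values):

# Illustrative features and proportions, not the real ones.
WEIGHTS = {
    "active_knights": 2.5,
    "production_points": 1.0,
    "victory_points": 10.0,
    "longest_road_length": 0.5,
}

def value_of(game, color):
    features = extract_features(game, color)  # hypothetical helper
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

class ValueFunctionPlayer:
    def __init__(self, color):
        self.color = color

    def decide(self, game, playable_actions):
        # Greedily pick the action whose resulting state scores highest.
        def score(action):
            result = game.copy()
            result.execute(action)  # assumed API for applying an action
            return value_of(result, self.color)

        return max(playable_actions, key=score)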

Results of 3-player Catan between RandomPlayer, WeightedRandomPlayer and ValueFunctionPlayer.

You can find the “features” and “weights” I am using for this bot here.

Superhuman Attempts

With these benchmarks in place, we started the search for better bots with sophisticated machine learning approaches. Anything beating ValueFunctionPlayer would be considered promising.

For all Machine Learning approaches, we needed to encode the game state and actions as numerical vectors. Here's a preview of how these datasets looked:

samples.csv.gzip: A 1-D raw representation of game state. For more information see here.
board_tensors.csv.gzip: A 3-D tensor representation of the board state. Intended to be used for CNN-based approaches. Similar to the one described in this paper.
actions.csv.gzip: One-hot-encoded actions taken by players in that state. Honestly, this could have been a single column of integer action codes. You can see what each integer means here.
rewards.csv.gzip: Several labels we tried for each (state, action) pair. RETURN for example is 1 if player ultimately won, 0 otherwise. Some of the others are Reinforcement Learning inspired.
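As a rough sketch, these gzipped CSVs can be loaded with Pandas like so (the paths assume the data/blog output folder used in the Appendix):

import pandas as pd

samples = pd.read_csv("data/blog/samples.csv.gzip", compression="gzip")
board_tensors = pd.read_csv("data/blog/board_tensors.csv.gzip", compression="gzip")
actions = pd.read_csv("data/blog/actions.csv.gzip", compression="gzip")
rewards = pd.read_csv("data/blog/rewards.csv.gzip", compression="gzip")

print(samples.shape, board_tensors.shape, actions.shape, rewards.shape)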

Attempt #1: Reinforcement Learning

I basically followed the tutorials here and here, to implement the Cross Entropy Method and an Online DQN, respectively.

These approaches failed, and I suspect it was because:

  • Bugs in the model training code.
  • I didn’t know how to filter to the top N games without loading the complete dataset into memory, so I ran the Cross-Entropy Method with small datasets.
  • Class imbalance made the bots predict “END TURN” far too often. These datasets are flooded with “END TURN” samples, since it is the most common action in a game; in particular, ~30% of the time bots had nothing to do but end their turn.
  • The dataset generated is pretty much “garbage data” since they were randomly-played games.
  • The action space was too big. At the time, I had an action space of more than 5,000 actions. This was due to me flattening the state-action tree with actions like “ROAD BUILDING AT EDGE i AND j” for every i and j, and separating initial building actions from normal building actions (one could argue building the first settlement is similar to building a settlement in general). This was later reduced to 289 as in the image; it might be worth revisiting now.

Attempt #2: Supervised Learning

From the RL approaches, I was inspired to learn a Value Function using RL-inspired labels. In particular, I found some success when labeling states with what I called “VP_DISCOUNTED_RETURN”. This is basically the total VPs eventually achieved in that game, geometrically discounted by the number of turns it took. The formula is: (0.9999 ^ num_turns) x vps_achieved.

So if a bot won with 10 points in 1,000 turns, all of those samples would get labeled with ((0.9999)^1000) * 10 ≈ 9.048. A bot that won with 10 victory points in 300 turns, however, would have ((0.9999)^300) * 10 ≈ 9.704. The quicker you win, the higher the number.
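The label is a direct translation of the formula:

def vp_discounted_return(vps_achieved, num_turns, gamma=0.9999):
    # Geometrically discount the final VPs by the number of turns taken.
    return (gamma ** num_turns) * vps_achieved

print(vp_discounted_return(10, 1000))  # ~9.048
print(vp_discounted_return(10, 300))   # ~9.704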

I tried this with both the “1-D raw representation” and the “board tensor” representation, and although the resulting bots showed improvement, RandomPlayer still won too many games against them to indicate any promising strength.

Results of 100 4-player Catan games.

My suspicion here again was that the datasets are hard to learn from since they come from random play. It might be interesting to integrate some sort of self-play iteration framework on top of this, where what we just did is considered one “iteration” or “generation”, and from here we generate a dataset of “better” games, learn from those, and iterate.

Attempt #3: Monte Carlo Tree Search

The MCTS algorithm is nicely explained here, and it is an important component of the AlphaZero algorithm.

Promisingly, the “playouts property” (the idea that the fraction of random playouts a player wins from a position is a decent estimate of how good that position is) seems to hold in Catan. For example, in this board, who do you think is winning?

Example intermediate game state

By this measure, Orange’s position can be valued at 0.691 (and White’s at 0.309), since Orange wins 691 of 1,000 games played at random from here. I’d say it’s an OK assessment (more roads, one knight up, and less blockable?).
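A sketch of how such a playout-based estimate can be computed (game.copy(), game.execute(), game.winning_color(), and legal_actions(game) are assumed stand-ins for the real API; a real implementation would also cap the number of turns):

import random

def playout_value(game, color, num_playouts=1000):
    wins = 0
    for _ in range(num_playouts):
        copy = game.copy()
        # Play completely at random until someone wins.
        while copy.winning_color() is None:
            copy.execute(random.choice(legal_actions(copy)))
        if copy.winning_color() == color:
            wins += 1
    return wins / num_playouts  # e.g. 691 / 1000 = 0.691 for Orange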

Not so promisingly, however, MCTS by itself doesn’t seem to be a path to “superhuman” play. In this paper they found that even with 10,000 simulations per turn, the author could fairly easily beat it. We confirmed this, and in our Python implementation the situation is worse, since running just 100 simulations takes ~3 seconds per turn. Too slow to suggest it’s worth exploring further.

100 games of MCTS vs Weighted Random 1v1s. MCTS used 25 simulations per turn.

Better than WeightedRandomPlayer, but still worse than ValueFunctionPlayer.

Attempt #4: TensorForce Framework

It seemed that if anything ML-related was going to work, it would have to have some sort of self-playing / self-improving component to it. Thus, I decided to pick up the Reinforcement Learning approaches again, but this time using off-the-shelf implementations of the algorithms, to rule out any possible bugs on my part. 😀

After considering Stable Baselines, TensorFlow Agents, and others, I landed on TensorForce, because it seemed approachable and had good documentation.
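For reference, training follows the standard Tensorforce act/observe loop, roughly like this (the environment construction is omitted and assumed to be a tensorforce.Environment wrapping Catanatron; the agent type and hyperparameters here are illustrative):

from tensorforce import Agent

def train(environment, num_episodes=10_000):
    agent = Agent.create(agent="ppo", environment=environment, batch_size=10)
    for _ in range(num_episodes):
        states = environment.reset()
        terminal = False
        while not terminal:
            # Agent picks an action, environment advances, agent learns.
            actions = agent.act(states=states)
            states, terminal, reward = environment.execute(actions=actions)
            agent.observe(terminal=terminal, reward=reward)
    return agent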

To my surprise, it showed some improvement! After around 8 hours of training, the bot was beating RandomPlayer 80% of the time. But then it would hit a dip in the learning curve, and in general this was still pretty far from ValueFunctionPlayer.

Tensorforce Results: The left is reward over time (1 for winning, -1 for losing, 0 otherwise). The right is episode length. Strong players finish games in around ~100 turns.

Maybe I am just impatient and this just needs to run for longer.

Attempt #5: Alpha Beta Search

The last tool in the arsenal was to try adding “search” to the ValueFunctionPlayer. There was only one slight modification I made to the standard Alpha-Beta Pruning algorithm: taking the expected value of the children when propagating values up the tree. This was to account for the uncertainty in dice rolls, development card draws, and robber steals.
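A sketch of the resulting search (pruning itself is omitted for brevity; value_of(), game.winning_color(), game.current_color(), and legal_actions() are the same assumed stand-ins as above, and outcomes_of(game, action) is a hypothetical helper yielding (probability, resulting_game) pairs for the chance events):

def expected_value_search(game, color, depth):
    if depth == 0 or game.winning_color() is not None:
        return value_of(game, color)

    maximizing = game.current_color() == color
    best = float("-inf") if maximizing else float("inf")
    for action in legal_actions(game):
        # Propagate the *expected* value over the stochastic outcomes.
        value = sum(
            prob * expected_value_search(outcome, color, depth - 1)
            for prob, outcome in outcomes_of(game, action)
        )
        best = max(best, value) if maximizing else min(best, value)
    return best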

It showed some improvement. This AlphaBetaPlayer, with depth=2, increased the performance of ValueFunctionPlayer but only slightly. Here are the results:

AlphaBetaPlayer improves on ValueFunctionPlayer but only slightly.

Curiously, increasing the search to 3 layers deep made the bot play worse than when searching 2 layers deep. I think it’s either a bug in the implementation, or maybe looking too far ahead makes the bot frequently chase high-risk, high-reward plays that don’t pan out.

This bot plays pretty well, but it is definitely not superhuman, and it suffers from a lack of long-term planning (e.g. it plays its Monopoly cards too promptly).

Conclusion

Although we didn’t see superhuman play, this last approach showed strong play and looks ripe to be improved incrementally. It’s also particularly promising because this is, at a high level, how the classical version of the Stockfish Chess Engine played, and it was superhuman (after many open-source contributions).

That is why I’ve made all this code open-source here: https://github.com/bcollazo/catanatron. I am sure there are brilliant minds out there that can take this further, and just like Stockfish is for Chess, together we can make Catanatron the strongest open-source engine for Catan.

You can clone the repository, install the dependencies, and test your own ideas for improving the best bot. If you find a change that beats the current best bot with statistical significance, please submit a Pull Request!

Future Work

I also still think Reinforcement Learning and Supervised Learning approaches could work with more refinement.

For Reinforcement Learning, I’ve been interested in getting either https://github.com/suragnair/alpha-zero-general or https://github.com/werner-duvaud/muzero-general working with Catanatron. I have implemented a Gym-like interface here, but it is not clear to me how to adapt those repositories to use it.
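For what it’s worth, driving that interface should look like any other Gym environment, roughly like this (the environment id below is an assumption; check the repository for the actual registration name):

import gym

env = gym.make("catanatron_gym:catanatron-v0")
observation = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # a real agent would choose here
    observation, reward, done, info = env.step(action)
env.close()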

On the other hand, there’s the very interesting idea of applying the Supervised Learning approaches described here to a dataset of human games.

Appendix

All results in this blog post were taken with the catanatron-play script provided in the repo. It can simulate any number of games between any set of players. For example, the following runs 10 4-player Catan games between two RandomPlayers, one WeightedRandomPlayer, and one ValueFunctionPlayer:

catanatron-play --players=R,R,W,F --num=10

Results are shown like this:

Output from catanatron-play shows basic statistics of the games.

The command-line tool also creates the datasets like so:

catanatron-play --players=R,R,R,R --num=5 --out=data/blog

This would execute five 4-player RandomPlayer games and output the 4 GZIP CSVs described above to the data/blog folder.
