How I trained a self-supervised neural network to beat GnuGo on small (7x7) boards

Darin Straus · Published in Analytics Vidhya · Jan 25, 2020

This article is a follow-up to an article I wrote about a year ago on my work implementing something closely resembling the AlphaGo Zero algorithm and running it on my modestly-powered desktop computer. That code produced networks that I would (and did) summarize as “not that great” in terms of their performance.

So, over the past year I’ve tried a few different approaches (like adding a few auxiliary loss functions) and, incidentally, found a problem, as described in more detail below, that was resulting in my networks being evaluated (but not trained) in a sub-optimal way. As a result, networks I was training (even without any auxiliary loss functions or other bells and whistles) were much better performing than I had realized.

So how good are the networks? Well, when I play against them (I’m an extreme novice, so take this as you will) and I have the network make the first move, I almost always lose on 7x7 boards (the one time I’ve beaten it is in the gameplay video I’ve posted). When I play the first move and the network plays the second, I can beat it occasionally. You can watch me play a few games against it in this YouTube video.

The neural network vs GNU Go

How well does it do against GNU Go? Here are some statistics when I run the network without using any tree searches (i.e., I define the network’s move as the move location where it has the highest probability output):

  • Network plays first (GNU Go plays second opening move):
    The network wins 93% (119 wins out of 128)
  • Network plays second (GNU Go plays first opening move):
    The network wins 9.4% (12 wins out of 128)

The average of these two conditions would be the network winning 51% of the time. Note that the above statistics can be replicated with the net_vs_gui.py script in the GitHub repository.
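
To make the “no tree search” evaluation concrete, below is a minimal sketch of greedy move selection, assuming the policy head outputs one probability per board location and that a legal-move mask is available; the function and variable names are my own illustration and are not taken from the repository.

    # Illustrative sketch only: pick the network's move as the legal board
    # location with the highest policy probability (no tree search).
    import numpy as np

    def greedy_move(policy_probs, legal_mask):
        """policy_probs: (7, 7) array of move probabilities from the policy head.
        legal_mask: (7, 7) boolean array, True where a move is legal.
        Returns (row, col) of the highest-probability legal move, or None
        if no legal move exists."""
        if not legal_mask.any():
            return None
        masked = np.where(legal_mask, policy_probs, -np.inf)
        return np.unravel_index(np.argmax(masked), masked.shape)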

Training and model details

Training occurred over the course of approximately two months on two GPU cards (see the sections below for more details on my setup) and is generally very similar to what I described previously. All self-play training games lasted 32 “turns” (where a “turn” means each player moving once, so each player gets the opportunity to place a stone 32 times). Self-play games were rolled out 800 times. I did not implement any resign mechanism for the network, so every game ran the full 32 turns (if the network could not make a move, the game progressed to the next turn anyway, with no move being made).

After every 128*5 self-play games were generated from the “generator” model, I ran backpropagation on the “current” model for 32*5*5 training steps over the pool of self-play batches (which consisted of 35*128 games, each containing 32 “turns” [as defined above]). After the gradient updates, the “current” and “generator” models were evaluated against each other to see if the “current” model could be promoted to the next “generator” model (see the second figure below).
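
As a rough sketch of that schedule (only the constants come from the description above; the callables, the pool management, and the promotion threshold are placeholders rather than the repository’s actual code):

    # Rough sketch of the outer training loop described above. The four callables
    # and the promotion threshold are placeholders; only the constants are taken
    # from the article. Models are assumed to be Keras-style objects here.
    GAMES_PER_ITER = 128 * 5        # new self-play games from the "generator" model
    TRAIN_STEPS_PER_ITER = 32 * 5 * 5
    POOL_SIZE = 35 * 128            # games kept in the self-play pool
    PROMOTION_THRESHOLD = 0.5       # assumed; the article only says "high enough"

    def training_iteration(generator, current, pool,
                           generate_games, sample_batch, train_step, win_rate):
        # 1) Generate new self-play games with the "generator" model.
        pool.extend(generate_games(generator, GAMES_PER_ITER))
        del pool[:-POOL_SIZE]                  # keep only the most recent games

        # 2) Run gradient updates on the "current" model over the pool.
        for _ in range(TRAIN_STEPS_PER_ITER):
            train_step(current, sample_batch(pool))

        # 3) Pit "current" against "generator"; promote if it wins often enough.
        if win_rate(current, generator) > PROMOTION_THRESHOLD:
            generator.set_weights(current.get_weights())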

I’ve committed the model weights and definition to the repository (see models/). The model has 5 convolutional layers, each with 128 filters. The value and policy heads each have a separate fully connected layer with 128 units, followed by one last fully connected layer that matches the respective output dimension (1 and 7x7, respectively). Otherwise, model details (batch norm, etc.) are generally similar to the AlphaGo Zero and related papers.
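
For illustration, here is a minimal sketch of a network with that general shape (the number of input feature planes, the 3x3 kernels, and the activations are my assumptions; see models/ in the repository for the actual definition):

    # Minimal architectural sketch matching the description above; the number of
    # input planes, kernel size, and activations are assumptions, not the
    # repository's exact definition.
    import tensorflow as tf

    def build_model(board_size=7, in_planes=2, n_filters=128, n_layers=5):
        inp = tf.keras.Input(shape=(board_size, board_size, in_planes))
        x = inp
        for _ in range(n_layers):
            x = tf.keras.layers.Conv2D(n_filters, 3, padding="same", use_bias=False)(x)
            x = tf.keras.layers.BatchNormalization()(x)
            x = tf.keras.layers.ReLU()(x)
        flat = tf.keras.layers.Flatten()(x)

        # Policy head: one fully connected layer, then one output per board location.
        p = tf.keras.layers.Dense(n_filters, activation="relu")(flat)
        policy = tf.keras.layers.Dense(board_size * board_size, activation="softmax")(p)

        # Value head: one fully connected layer, then a single scalar output.
        v = tf.keras.layers.Dense(n_filters, activation="relu")(flat)
        value = tf.keras.layers.Dense(1, activation="tanh")(v)

        return tf.keras.Model(inp, [policy, value])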

The first set of plots below shows training curves, and the second plot shows the number of times the network was “promoted” (the currently trained model was set as the new “generator” model for generating new self-play training examples). The vertical lines indicate when a promotion happened. All figures in this article can be replicated with the notebook I included in the repository (notebooks/training_visualizations.ipynb).

Training of the model. The light red lines are statistics when the network is evaluated against an opponent that moves randomly; the dark red lines are against GNU Go. All evaluations used no tree search (the network’s move is chosen as the location it rates with the highest probability). Note that all network training data came from self-play; the games played against GNU Go and the random opponent were only used to visualize training progress. Also note that the evaluations against GNU Go in these plots were affected by the evaluation bug described below.
Evaluating the “current” model (the current version of the weights being trained via backprop) vs. the “generator” model (the model used to generate self-play training data) throughout training. The vertical lines indicate when the “current” model was promoted to become the new “generator” model.

Example games

You can also see some examples of me playing the game in this video. Below are examples of the network playing against GNU Go, a random opponent, and itself (self-play).

The neural network plays as black in each of the two games above. Note that the random opponent is simply making random moves.
The neural network playing against itself. Games such as these were used to train the neural network.

The bug

The main and, as far as I can tell, only major problem with what I was doing before was that I was not setting the training flag in the Tensorflow batch norm functions (it defaulted to always being in training mode, where the mean and variance statistics were updated with each and every network evaluation). As a consequence, I believe the networks were always ending up in poorly conditioned states when evaluated at test time. I noticed this perceptually too when I played against the networks: they always seemed to make decent opening moves, and then things devolved into a lot of careless mistakes as the game progressed.
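
For reference, the distinction the flag controls looks like this (shown with the Keras batch norm layer for brevity; the repository uses the TF 1.x graph-mode API, so the exact call differs):

    import tensorflow as tf

    bn = tf.keras.layers.BatchNormalization()
    x = tf.random.normal([8, 7, 7, 128])

    # training=True: normalize with the current batch's statistics and update the
    # running mean/variance. This is what my networks were doing on every
    # evaluation because the flag was never set.
    y_train = bn(x, training=True)

    # training=False: normalize with the stored running statistics, which is what
    # you want at test time.
    y_eval = bn(x, training=False)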

A remaining mystery

While the networks I’m training perform well, as described in the sections above, one anomaly related to this bug remains that I still haven’t figured out. Let me first summarize a few details of how I’m performing training. The code keeps three models in memory (the mixed-precision float16 copies are used because they speed up training):

  • “main”: this model is used to generate new self-play training batches. “main” is stored as float16s.
  • “eval32”: this model is trained using the self-play training batches created by “main”. “eval32” is stored as float32s.
  • “eval”: the purpose of this model is to be evaluated in Go matches against the “main” model (it is a copy of the “eval32” model converted to float16). Once it can win against “main” with a high enough probability, it is promoted: it is copied over and overwrites the “main” model.
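
As a toy illustration of the precision relationship between the three copies (the real code manages this inside Tensorflow rather than with numpy arrays, and the shapes here are arbitrary):

    # Toy illustration only: "eval32" holds the float32 weights that are trained,
    # "eval" is a float16 copy used for evaluation matches, and on promotion that
    # float16 copy overwrites "main" (the self-play generator).
    import numpy as np

    eval32_weights = [np.zeros((3, 3, 128, 128), dtype=np.float32)]  # trained in float32

    # "eval": float16 copy of "eval32", used for matches against "main".
    eval_weights = [w.astype(np.float16) for w in eval32_weights]

    # On promotion, the float16 copy replaces the "main" weights.
    main_weights = [w.copy() for w in eval_weights]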

Backpropagation never directly occurs on the “main” and “eval” models — they are downstream from the “eval32” model. For this reason, it would seem that you would never want to run these models in training mode — the statistics should be set and held fixed by the “eval32” model as it trains. Therefore, I would think the following configuration of training flags would make most sense:

  • “main”: training=False
  • “eval32”: training=True
  • “eval”: training=False

However, I find that training with the above configuration results in poorly performing models. It is only when I set all flags to True during training that I get decently performing models (when they are played with the training flag set to False; if I play them with the training flag set to True, the performance remains poor). If anyone has any ideas about why this is happening, please do let me know! I would have thought training in this configuration would result in models that are much worse, not much better.

The code

The code with the bug fixed is available on GitHub; I again release it into the public domain. Aside from the bugfix, I’ve added the ability to train on two GPUs simultaneously, which is how I trained the model I discuss in this article.

My setup

All code has been written and tested on Centos 8 using Python 2.7.16, Tensorflow v1.15.0, and compiled with NVCC v10.2.89 (the Nvidia CUDA compiler). I’ve run and tested all code on a dual-GPU setup (an Nvidia 2080 Ti and an Nvidia Titan X card), a quad-core Intel i5-6600K CPU @ 3.50GHz, and 48 GB of RAM (the code itself generally peaks at around 35 GB). I have not tested it on alternative configurations (although for some time I was running Ubuntu 18.04 instead of Centos 8 and everything worked there too). If you were to run this setup with less RAM, I’d recommend running only one GPU (cutting RAM needs roughly in half) rather than cutting the tree search depth (which would be an alternative way to reduce RAM usage).

Going further

An obvious next step would simply be increasing the board size and seeing how far I can take this on my setup. However, others on the Computer Go mailing list have suggested that I should be able to get even higher-performing models with this type of training setup. Their ideas have included setting komi (I currently use a value of 0, which might deprive the network of learning signal because black [who plays first] wins most of the self-play games), using techniques to ensure a sufficient amount of gameplay variety, and decreasing the initial buffer size of self-play games to get training started sooner. You can read more in the mailing list thread.

Framework complaints

While I would like to go forward with some of the ideas above, I unfortunately recently updated my system (a general distro package update with “yum update”; no updates were made to Tensorflow), and now the multi-GPU part of my code crashes when it launches on the second GPU, despite this same code having worked perfectly fine for about a year. The issue seems to be Tensorflow 1.15 crashing when I try to run the model.

I’m undecided whether I will try to patch up what seems to be a sinking ship or change frameworks entirely. Tensorflow has been great to use in many ways, and I do appreciate the great work all the developers have done. However, over the years it has felt increasingly like a chore to keep any code consistently running on it when the names and semantics of functions needlessly change and get deprecated, or things just stop working for no apparent reason, like I’m experiencing now.

Before and during the initial release of Tensorflow in 2015 (and before I was aware of Tensorflow), I was working on my own neural network framework similar in purpose to Tensorflow (although narrower in scope), in which I called cuDNN directly along with some custom CUDA kernels I had written. The point of mentioning this is that this code still compiles and runs today, despite my not having touched it in the several years since I switched over to TF.

So, for me the next step will probably be to move my code over to my old framework. On a longer-term basis, maybe I’ll eventually clean it up and release the framework. I know I’m not the only one out there who is tired of software frameworks changing out from under them for no reason. Fortunately, this instability doesn’t seem to permeate all the way down the deep learning stack: cuDNN appears to have remained fairly stable (as evidenced by my code from 2015 compiling and running as-is), so there’s no reason anything built on top of it shouldn’t approach the same level of stability.

Beyond Go

On an even longer-term basis, I’d like to get a network to play a game I’m working on, or subsets of it. The approach there will probably need to mix human-supervised and self-supervised learning (similar to AlphaStar, perhaps). It may well be that I’ll never run network trainings at the scale labs like DeepMind do, but the progress I’ve seen with Go using minimal hardware is definitely encouraging to me. The networks might not be at Lee Sedol’s level, but they can still be good enough to be good adversaries for many of the rest of us :)

Game developer and machine-learning enthusiast. Visit my website at: https://arcanefortune.com