Leela Chess — Test40, Test50, and beyond

Veedrac
11 min read · Feb 13, 2019


If you don’t know what lc0 is, you should probably stop reading. For everyone else, I hope this is informative.

For those not involved in development, there is a lack of well-structured information about recent changes and about plans for the future. This post is my attempt, as an outsider, to produce a resource summarizing what we know. I'll warn that this is not going to be purely objective; I'll give my opinions where and when I want to. Do not take what I'm saying as definitive or certain.

Test40

Test40 is meant to be a consolidation of all the experimentation that preceded it, proving its value by running to completion with a single, stable set of parameters and, hopefully, producing the strongest net so far. Current results are promising.

1. Seeded with Test35 games

Test35 games were used to train Test40 at the start.

2. Squeeze-and-Excitation network

Test35 introduced SE networks, a new neural network architecture. The increase in strength is expected to be significant. The change only affects the network, not the search or the training strategy, so it is most similar to an increase in network size without the severe performance penalty.

Paper: https://arxiv.org/abs/1709.01507
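For the curious, the core of an SE block is easy to sketch. Below is a minimal NumPy rendition of the standard block from the paper above; lc0's actual implementation differs in its details, and the weight shapes here are purely illustrative. The idea: squeeze the board down to one number per channel, run a small bottleneck MLP, and use the result to re-weight the channels.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    # x: (channels, 8, 8) activations from a residual block.
    # Squeeze: average each channel over the board into a single number.
    s = x.mean(axis=(1, 2))                      # shape (channels,)
    # Excitation: a small bottleneck MLP turns those numbers into per-channel
    # scaling factors between 0 and 1.
    s = np.maximum(w1 @ s + b1, 0.0)             # ReLU, shape (channels // ratio,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s + b2)))     # sigmoid, shape (channels,)
    # Re-weight the block's output channel by channel.
    return x * s[:, None, None]
```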

3. Matching the AlphaZero paper

DeepMind released an update to their paper during Test30, and Test40 takes some fixes from it. This is discussed on the blog: https://blog.lczero.org/2018/12/alphazero-paper-and-lc0-v0191.html. The most interesting changes made for Test40 are listed below.

3.1 Endgame temperature reduced

Temperature is a tendency for Leela to play moves she thinks are suboptimal during training games. It was added to increase variation and let her learn about moves she otherwise wouldn't have tried. The issue is that she then learns to expect both herself and her opponent to make errors during play, which taints the expected outcome of a game.

AlphaZero only uses temperature in the first 15 moves. Test40 doesn't remove it; instead, the temperature is only reduced, from 1.2 to 0.45, as the game reaches the endgame.
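For reference, temperature is typically applied to the visit counts produced by the search when picking the move to actually play in a training game; a rough sketch (not lc0's actual code):

```python
import numpy as np

def pick_training_move(moves, visit_counts, temperature):
    # Probability of a move is proportional to visit_count ** (1 / temperature).
    # As temperature -> 0 this approaches always playing the most-visited move;
    # higher temperatures add more randomness (and more training-game errors).
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature <= 0:
        return moves[int(np.argmax(counts))]
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return moves[np.random.choice(len(moves), p=probs)]
```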

3.2 cpuct formula changed

cpuct determines how much the Monte Carlo search should value exploration of new possible lines relative to analysis of the current best lines. The new formula says that early on Leela should mostly look at promising nodes, but as more time is spent on search, Leela will become more and more eager to explore new possible lines. This helps by allowing the search tree to change its mind faster.

Since this is purely an aspect of search, it applies to all networks.
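The formula from the updated AlphaZero paper looks roughly like this (the constants are the paper's; lc0 exposes its own tunable equivalents):

```python
import math

def cpuct(parent_visits, c_base=19652.0, c_init=1.25):
    # Exploration constant that grows slowly with the number of visits to the
    # parent node, so large searches become progressively more exploratory.
    return math.log((parent_visits + c_base + 1.0) / c_base) + c_init
```

With these constants the value starts at about 1.25, is still only around 1.29 after 800 visits, and only approaches 2 after some 20,000 visits, so the extra exploration mostly matters on long searches.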

3.3 First Play Urgency (FPU)

Another search-only change, FPU determines how the search values unvisited nodes. The change means that unvisited nodes are simply counted as losses. The advantage was best described by the paper that introduced it, Exploration exploitation in Go: UCT for Monte-Carlo Go:

UCT works very well when the node is frequently visited as the trade-off between exploration and exploitation is well handled by UCB1 formula. However, for the nodes far from the root, which are visited rarely, UCT tends to be too much exploratory. This is due to the fact that all the possible moves in one position are supposed to be explored before using the UCB1 formula. This assumption is good when the number of options is low, but this is definitely not the case in Go. Thus, the values associated to moves in deep nodes are not meaningful, since the child-nodes of these nodes are not all explored yet and, sometimes even worse, the visited ones are selected in fixed order.

The new assumption used is simply that unvisited nodes should be treated as losses.
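Here is a minimal sketch of how that assumption enters child selection, using generic PUCT with values stored from the parent's point of view; this is an illustration, not lc0's exact code:

```python
import math

def puct_score(child, parent_visits, c_puct, fpu_value=-1.0):
    if child.visits == 0:
        # First Play Urgency: an unvisited move is assumed to be a loss until a
        # real evaluation says otherwise.
        q = fpu_value
    else:
        q = child.value_sum / child.visits       # average result, parent's POV
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

def select_child(node, c_puct):
    # Pick the child with the best exploitation + exploration score.
    return max(node.children, key=lambda c: puct_score(c, node.visits, c_puct))
```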

4. Stochastic Weight Averaging

Test30 introduced SWA to network training, a technique that averages the weights along the training trajectory to find flatter, better-generalizing minima. This reduces overfitting and increases both strength and training speed.
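The heart of SWA is just a running average of weight snapshots taken at regular intervals. A minimal sketch, assuming the weights arrive as a list of NumPy arrays and using an arbitrary sampling interval:

```python
class StochasticWeightAverager:
    def __init__(self, cycle=100):
        self.cycle = cycle          # how often to sample the weights
        self.count = 0
        self.averaged = None        # the weights actually published

    def maybe_update(self, step, weights):
        if step % self.cycle != 0:
            return
        if self.averaged is None:
            self.averaged = [w.copy() for w in weights]
            self.count = 1
        else:
            self.count += 1
            for avg, w in zip(self.averaged, weights):
                avg += (w - avg) / self.count    # incremental mean, per tensor
```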

5. Tablebase rescoring

Endgame tablebases give provably correct answers for endgame positions. Unfortunately, tablebases are so large that it is impractical to ask all contributors to download them, so they cannot be used during self-play. The key idea of this approach, introduced in Test30, is to run self-play normally but then have the server rescore the game as if a tablebase had been used: when a winning tablebase position is entered, the previous moves are all considered winning, and similarly for lost and drawn positions. When Leela makes a mistake and plays a losing move inside a tablebase position, only that small sequence of losing moves is rescored. (There are some specifics here, but I'll ignore them for now.)

Test40 increases the tablebases used for rescoring from 5-man to 6-man.
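A rough sketch of the rescoring idea as described above, assuming a hypothetical tb_probe(position) that returns the exact result from a fixed perspective, or None when the position still has too many pieces; the real rescorer handles more edge cases:

```python
def rescore_game(positions, game_result, tb_probe):
    tb_values = [tb_probe(p) for p in positions]   # None outside the tablebase
    first_tb = next((i for i, v in enumerate(tb_values) if v is not None), None)
    targets = []
    for i in range(len(positions)):
        if tb_values[i] is not None:
            # Inside the tablebase every position gets its exact value, so a
            # blunder only rescores the moves played after it.
            targets.append(tb_values[i])
        elif first_tb is not None and i < first_tb:
            # Moves before entering the tablebase inherit the value at entry.
            targets.append(tb_values[first_tb])
        else:
            targets.append(game_result)            # untouched by the tablebase
    return targets
```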

Test50, and beyond

Here we get to the ideas you are less likely to have heard about. As far as I know, Test50 isn't fully decided yet, but it is likely we will get a smaller network for more rapid iteration on some ideas, with some potentially huge impacts. This list is in no particular order, but I will try to note which ideas seem likely for inclusion in Test50 and which will probably wait. I will also provide links to issues and pull requests for the curious.

Seed with random play

AlphaZero starts by playing random games with a zero-knowledge fake network, one that always thinks it's drawing and has no move preferences. As far as I know, the only advantage here is that this is faster than getting an untrained network to play, so it should slightly increase learning speed early on. This is likely to be used for Test50 and beyond.
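The "fake network" amounts to something like the sketch below: a draw evaluation everywhere and uniform priors over the legal moves, so the earliest games need no GPU work at all.

```python
import numpy as np

class ZeroKnowledgeNet:
    # Stand-in network for bootstrapping: it has no opinions whatsoever.
    def evaluate(self, position, legal_moves):
        policy = np.full(len(legal_moves), 1.0 / len(legal_moves))
        value = 0.0                 # every position looks like a dead draw
        return policy, value
```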

Implementation: https://github.com/LeelaChessZero/lc0/pull/725

Convolutional Policy Head

The policy head, the part of the network that tells you which moves it likes, previously had no spatial structure: every move was just an unrelated output. AlphaZero instead lays the policy out as stacks of 8x8 planes, which improves how quickly the network learns. The final result is not obviously stronger, but it learns faster, so training time should fall. I would expect this to be used for Test50, since AFAICT there is no downside.
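To make the difference concrete, here is a sketch of both heads in tf.keras. The filter counts are illustrative, 1858 is the size of lc0's flat move list, and the 73 move planes follow the AlphaZero paper's encoding; lc0's actual layout differs in detail.

```python
import tensorflow as tf

def fc_policy_head(tower_out, num_moves=1858):
    # Old style: flatten the board and predict every move as an unrelated output.
    p = tf.keras.layers.Conv2D(32, 1, activation="relu")(tower_out)
    p = tf.keras.layers.Flatten()(p)
    return tf.keras.layers.Dense(num_moves)(p)

def conv_policy_head(tower_out, move_planes=73):
    # New style: keep the 8x8 structure and emit one plane of logits per class
    # of move (queen-like moves, knight moves, underpromotions) from each square.
    p = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(tower_out)
    p = tf.keras.layers.Conv2D(move_planes, 3, padding="same")(p)
    return tf.keras.layers.Flatten()(p)
```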

Discussion: https://github.com/LeelaChessZero/lc0/issues/637

Implementation: https://github.com/LeelaChessZero/lc0/pull/712

Contempt and trade penalties

Sometimes Leela plays against an opponent that she's significantly better than. In these situations aiming for risky wins costs less than it normally would, and doing so will result in a higher overall score, even at the cost of an occasional loss. Contempt and trade penalties lower the value of drawing lines and raise the value of positions with more pieces active on the board, leading to positions with more opportunities to outplay the opponent. Since this is mostly about beating other engines, and only affects the search, it isn't Test50-specific and won't be used for training. The long wait to get contempt merged also makes me sceptical that it will be an accepted feature particularly soon.
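As a purely illustrative sketch of the kind of adjustment being discussed (the formula and numbers below are mine, not the patch's), contempt and trade penalties might nudge the search's evaluations like this, for an evaluation q in [-1, 1]:

```python
def adjust_for_contempt(q, pieces_on_board, contempt=0.1, trade_penalty=0.004):
    drawishness = 1.0 - abs(q)                    # near 1 for draw-like evaluations
    q -= contempt * drawishness                   # make drawing lines look worse
    q -= trade_penalty * (32 - pieces_on_board)   # prefer keeping pieces on the board
    return max(-1.0, min(1.0, q))
```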

Implementation: https://github.com/LeelaChessZero/lc0/pull/703

Implementation (old): #702, #577, #536, #466, #550, #436

Value head estimates draw probability

Here we get to the first change I'm excited by. The value head tells you what Leela thinks her win probability is: 1.0 for a certain win, 0.0 for a certain loss, and numbers in between for everything else. An issue with this representation is that 0.5 could represent both a forced draw and an evaluation of 50% win, 50% loss, 0% draw. This isn't ideal for players, since it's nice to know what outcome to expect, but even more significantly it hurts Leela, since she has to guess the outcomes of games. The 'WDL' (win, draw, loss) patch should have these impacts:

  1. Leela’s training will tell her whether games were draws, which should help her recognize draws.
  2. Self-play games will be able to offer draws, which should make training faster by cutting uninformative games short.
  3. Contempt based on WDL should be more robust than other approaches; by changing the search’s relative value of draws, Leela should be able to enter more decisive games against weaker opponents and thus score much higher.

This addition has been all but confirmed for Test50.
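For reference, a WDL head looks something like the sketch below in tf.keras (layer sizes are illustrative); the familiar scalar evaluation is then just the expected score, P(win) minus P(loss), while P(draw) is available separately for draw offers and WDL-based contempt.

```python
import tensorflow as tf

def wdl_value_head(tower_out):
    v = tf.keras.layers.Conv2D(32, 1, activation="relu")(tower_out)
    v = tf.keras.layers.Flatten()(v)
    v = tf.keras.layers.Dense(128, activation="relu")(v)
    # Three-way softmax over (win, draw, loss) instead of a single scalar.
    return tf.keras.layers.Dense(3, activation="softmax", name="wdl")(v)
```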

Implementation: https://github.com/LeelaChessZero/lc0/pull/635

Implementation (draw offers in training): https://github.com/LeelaChessZero/lc0/pull/724

kldgain thresholding in training

This is a complex change, but man is it promising. Right now self-play games run exactly 800 playouts per move. There is a good argument for sticking to a constant rather than using traditional time controls: the network needs to be able to correct itself when it is going wrong. If some part of the search only gets a few visits, then tactics beyond its horizon may never be seen, and the search may never fix its blind spot.

On the other hand, the only way the network learns anything is by being told whether a position is winning. If those answers are wrong, the network will be trained incorrectly! It is thus vitally important that the training matches are as strong as possible. This is the same exploration-exploitation conflict we had with temperature in training matches, and kldgain thresholding is a clever approach to fixing this in a way that improves both at once!

Kullback–Leibler divergence is a measure of distance between probability distributions. Monte Carlo search assigns a probability distribution over the moves in a position, and constantly refines it through playouts. The idea here is to measure the KLD from one point in the search to the next, to figure out how informative those playouts are. If the divergence is small, this implies the network largely understands the situation, and more playouts are unlikely to be cost-effective or to add much exploratory value. If the divergence is large, this implies that the network is confused about the current situation and needs more work to figure out the right value. Thus, instead of a fixed 800-node limit, the search stops whenever the KLD over the last 100 nodes falls below a given threshold.

Testing a limit of 0.000005 against fixed 800 nodes resulted in +120 elo (!!) for training matches, with an average of only 766 nodes per move! The minimum number of visits for unforced moves was 200, but for tricky positions this strategy searched all the way up to 5300 nodes. This indicates that training using this technique will discover complex tactics faster than normal, and will have significantly stronger self-play games to work from. Initial Test50 sanity checks have been using this (with a limit of 0.000009, presumably to be decreased over time), and I would expect it to remain that way.
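A rough sketch of the stopping rule; the helper callables and the safety cap are mine, not part of the proposal:

```python
import numpy as np

def kld(p, q, eps=1e-12):
    # Kullback–Leibler divergence between two visit-count distributions.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def search_until_settled(root, run_one_playout, visit_distribution,
                         threshold=0.000005, interval=100, max_nodes=20000):
    # Keep searching until the root's move distribution stops changing.
    prev = visit_distribution(root)
    nodes = 0
    while nodes < max_nodes:                 # safety cap, illustration only
        for _ in range(interval):
            run_one_playout(root)
        nodes += interval
        cur = visit_distribution(root)
        if kld(prev, cur) < threshold:
            break                            # the last 100 nodes barely changed our mind
        prev = cur
    return nodes
```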

Proposal: https://github.com/LeelaChessZero/lc0/issues/684

Implementation: https://github.com/LeelaChessZero/lc0/pull/721

Certainty Propagation

When you visit a node and that node is checkmate, you know more than just that the move leading to it looks bad: you know the opponent will always play that move, or one at least as good. Therefore you can simplify the search by not searching any of the other candidate moves there. If a position has no moves that don't lose, it is also lost. If a position has a move that leads to a win, the position itself is won.

Further, if a move repeats a position once, you can treat it as drawn: if there were a better, non-drawing move, you could simply have played it earlier, so scoring the repetition as a draw doesn't remove any freedom from the search tree. If your opponent doesn't want to let you take the draw, the result could be worse than the other non-drawing moves, but if that were the case, why didn't you take one of those moves earlier?

The advantage of coding in this logic is that it reduces the size of the search tree, and propagates terminal states (wins, draws and losses) faster up the tree. This is exciting, with a +20 elo gain, faster mates, and mate-in-N displays as bonuses. However, it’s a complex patch and had performance bugs in initial revisions, so these things can take a while. It looks like there has been progress recently, so although Test50 is an unlikely candidate at the start, it may join in later and future networks could well use it.
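The core of the propagation looks something like this sketch, with results stored from the side to move's point of view (an illustration, not the actual patch):

```python
from dataclasses import dataclass, field
from typing import List, Optional

WIN, DRAW, LOSS = 1, 0, -1            # from the side to move's point of view

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    certain_result: Optional[int] = None      # None means "not proven yet"

def update_certainty(node: Node) -> None:
    results = [c.certain_result for c in node.children]
    if any(r == LOSS for r in results):
        # Some reply is a proven loss for the opponent, so this position is a
        # proven win; the other candidates never need to be searched again.
        node.certain_result = WIN
    elif results and all(r is not None for r in results):
        # Every reply is proven, so take the best result we can force. A child's
        # result is from the opponent's point of view, hence the negation.
        node.certain_result = max(-r for r in results)
    # Otherwise the position stays uncertain and normal search continues.
```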

Implementation: https://github.com/LeelaChessZero/lc0/pull/700

Implementation (old, more features): https://github.com/LeelaChessZero/lc0/pull/487

Value Pessimization

Here’s an interesting but idea with uncertain payoff, that’s mostly technical minutia. Perhaps the simplest way to explain this is by analogy to ‘burden of proof’. With FPU, unvisited nodes are considered to be losses, but as soon as a visit is made this assumption is thrown away and the result of that visit is considered definitive. With value pessimization, the calculation acts as if there exist some fractional number of other children (0.6 visits by default in the patch) which evaluated to losses, and you need enough visits in order to overcome this initial skepticism. This is in some sense the opposite side of UCT, in that UCT estimates the potential of a node from its uncertainty by increasing it for exploration, whereas value pessimization lowers the value of nodes that haven’t been explored sufficiently during analysis of existing lines.

This produces a small +20 elo gain, but details are unclear and it’s not an urgent candidate for a run, so don’t expect it to be in Test50.
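In effect, a node's value is computed as if a fraction of a visit had already scored a loss; something like this sketch (the parameter names are mine):

```python
def pessimized_q(value_sum, visits, prior_visits=0.6, prior_value=-1.0):
    # Blend the observed average value with `prior_visits` imaginary visits that
    # scored a loss, so a lightly-visited node has to earn the search's trust
    # before its (possibly lucky) early evaluations are believed.
    return (value_sum + prior_visits * prior_value) / (visits + prior_visits)
```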

Implementation: https://github.com/LeelaChessZero/lc0/pull/707

Q learning

Oracle published a blog series on AlphaZero where they discussed their approach to the task. Their fourth post discusses the target the value head is trained on. One option is to train the value head on z, the final result of the game. This is great because it captures extremely long-term effects, and we want wins to be recorded as wins. The issue is that z is only guaranteed to be the right value when the network did not make any mistakes in the game, which is extremely unlikely. One alternative is training on q, the value produced by the Monte Carlo search from that position. This is more likely to correctly resolve and strengthen uncertainties in the network on local problems like tactics. However, if the network is systematically wrong about a position, it might never resolve its confusion, because it will never see the actual result; the correction has to propagate slowly up through the generations of networks.

The interesting solution to this is to train on an average of q and z, providing a mix of both. This makes the training target more learnable and less random-seeming while still allowing nonlocal effects to be visible. There has been some noise about using this for Test50, but it's not clear whether that will happen.
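The blended target is as simple as it sounds; a sketch, with the mixing ratio as a hypothetical knob:

```python
def value_target(q, z, q_ratio=0.5):
    # Blend the search evaluation q with the final game result z.
    # q_ratio = 0 reproduces the old pure-z training; 1 would train purely on q.
    return q_ratio * q + (1.0 - q_ratio) * z
```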

Implementation (partial): https://github.com/LeelaChessZero/lc0/pull/722

Simple Regret-based search (root cpuct, SR+CR MCTS, VOI, or other)

So this is confusing: we have solid theoretical arguments that standard MCTS is suboptimal at the root, there is a simple mitigation (increasing cpuct at the root), testing of that mitigation shows 100 elo improvements at 800 nodes, the technique is based around increasing exploration (which is appropriate for self-play)… and there's almost silence about it. As far as I'm aware, information on this is only on Discord for now, so I will link to the papers instead.

The idea is best summarised by the paper:

At the root node, the sampling in MCTS is usually aimed at finding the first move to perform. Search is re-started, either from scratch or using some previously collected information, after observing the actual outcome (in MDPs) or the opponent’s move (in adversarial games). Once one move is shown to be the best choice with high confidence, the value of information of additional samples of the best move (or, in fact, of any other samples) is low. Therefore, one should be able to do better than UCT by optimizing simple regret, rather than cumulative regret, at the root node.

I have no idea when, or even if, any of these ideas will be adopted, but they look extremely promising to me. The second paper shows results on par with doubling (!!) the number of searched nodes.
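The simple mitigation mentioned above amounts to nothing more than using a larger exploration constant at the root than elsewhere; a sketch with purely illustrative numbers:

```python
def effective_cpuct(is_root, base_cpuct=3.0, root_multiplier=2.0):
    # At the root we only care about identifying the single best move (simple
    # regret), so exploration is boosted there; deeper nodes, whose values feed
    # their parents' averages, keep the normal constant (cumulative regret).
    return base_cpuct * root_multiplier if is_root else base_cpuct
```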

Papers: MCTS Based on Simple Regret, VOI-aware MCTS

Game splitting

Perhaps the most interesting idea of the lot, though certainly very experimental and uncertain: a technique that rethinks the idea of temperature altogether. Consider: instead of randomizing moves within a game to encourage exploration, at the cost of clouding the expected value of the game, what if we used randomization to find new starting positions for games, and played from each starting position with a temperature of zero?

The exciting part of this is that the split completely removes the damaging aspect of temperature, whilst still allowing exploration and training on those randomly generated positions. This technique also makes it possible to produce a training set with more middlegames and endgames, which is great since late middlegames and endgames are currently underrepresented. There is a lot still to discuss about this, and it's unclear when or even if it will get resolved, so it's not a hugely likely candidate for Test50, but it's worth keeping an eye on.
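Schematically, and with entirely hypothetical helpers (play_game and sample_branch_position are mine), the self-play loop would change to something like:

```python
def generate_split_games(start, play_game, sample_branch_position, n_branches=4):
    # play_game(position, temperature) plays out a full game and returns it;
    # sample_branch_position(game, temperature) picks a position from the game
    # and makes one temperature-randomized move to create a fresh starting point.
    games = [play_game(start, temperature=0.0)]
    for _ in range(n_branches):
        # Randomness is only used to find new starting positions; every training
        # game itself is played with temperature 0.
        branch = sample_branch_position(games[0], temperature=1.0)
        games.append(play_game(branch, temperature=0.0))
    return games
```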

Implementation: https://github.com/LeelaChessZero/lc0/pull/720

Discussion: https://github.com/LeelaChessZero/lc0/issues/342, https://github.com/LeelaChessZero/lc0/issues/237
