The Anatomy of a Chess AI
Chess-playing programs made their grand debut in the 1950s. They were, unsurprisingly, fairly weak; both technological and theoretical limitations kept engines from playing at the level of a master until around 1990. It was only in 1997 that the program Deep Blue was able to defeat a reigning world champion, Garry Kasparov.
Since then they have surpassed humans completely. The top players have Elo ratings of around 2,800, while the top engines are rated around 3,500. Under the Elo model, that 700-point difference translates to the top human players scoring only about 1.7% against these engines.
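That percentage comes straight from the Elo model's expected-score formula, which is easy to check:

```python
# Expected score (wins plus half of draws) for a player rated
# `diff` points below their opponent, under the Elo logistic model.
def expected_score(diff: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (diff / 400.0))

print(round(expected_score(700), 3))  # 0.017 -- about 1.7%
```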
The strongest modern chess engine is Lc0, an open-source project inspired by DeepMind's AlphaZero algorithm. Unlike ordinary chess engines, Lc0 and AlphaZero are based on neural networks and a search algorithm known as Monte Carlo tree search. This approach comes at a cost: building one of these engines from scratch requires a huge amount of resources. DeepMind trained AlphaZero using thousands of TPUs (the finished engine played its matches on a machine with four of them), and Lc0 was trained over months of distributed computing.
Despite the sheer intuitive power neural networks have over them, traditional engines are still not obsolete. Stockfish, Lc0's main competitor, relies on hard-coded heuristics and a significantly faster but rougher search algorithm known as minimax search. These engines remain competitive thanks to the sheer power of brute-force computation, enabled by clever optimizations and heuristic tricks.
The evaluation function
How do you get a computer to play chess? A simple way would be to program it to make random moves. The harder way, of course, requires finding out which moves are actually good. That is, which moves, if any, will lead to checkmate the quickest?
Theoretically we could take a given board and exhaustively search every move from that position until the end of the game. If we had all of this data available, we could trace out which moves guarantee a checkmate or, if winning is impossible, which moves delay being checkmated the longest. Unfortunately this is not feasible: the number of positions to keep track of grows exponentially, outpacing any machine's memory capacity. Instead, the search is typically limited to a certain depth, e.g. looking four moves ahead. Since checkmates rarely happen within only four moves, we need to introduce some intermediate measure that estimates how likely it is that a given side will win.
This measure is known as the evaluation function. It can be as simple as an indicator of whether a given side has been checkmated, but for limited-depth search this isn't useful. A slightly more mature estimate comes from counting the pieces on each side in a weighted way. If white has no queen but black does, then the position is unbalanced and white is in trouble. If white is three pawns down but has an extra bishop, the position is likely to be balanced. This kind of evaluation function scores pieces in terms of how many pawns they're equivalent to.
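As a concrete sketch (here and below I'll use Python with the python-chess library; the values are the conventional pawn-equivalents, not anything tuned):

```python
import chess

# Classical piece values in pawns; the king is never captured, so it isn't counted.
PIECE_VALUES = {
    chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
    chess.ROOK: 5, chess.QUEEN: 9,
}

def material_eval(board: chess.Board) -> int:
    """Material balance from White's point of view, in pawns."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES.get(piece.piece_type, 0)
        score += value if piece.color == chess.WHITE else -value
    return score

print(material_eval(chess.Board()))  # 0 in the starting position
```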
An algorithm playing with this evaluation function will have the general rules down. Taking pieces, especially rooks and queens, is typically good, and sacrificing important pieces to take out pawns is typically bad. Sadly, its positional play will be abysmal.
A slightly more advanced method of scoring takes not only material into account but also the position of each piece on the board. This gives algorithms a surprising amount of positional intuition: pawns and knights should be in the center, rooks on the enemy's ranks are powerful, kings should stay tucked in the corners, and so on. Scoring can be made even more sophisticated by accounting for subtler tactical and positional patterns like pawn structure, connected rooks, and king mobility.
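One common way to encode this intuition is a piece-square table: a per-square bonus added on top of each piece's material value. Here's a sketch for knights, with made-up illustrative numbers (real engines tune these carefully):

```python
import chess

# Bonus (in tenths of a pawn) for a white knight on each square;
# index 0 is a1 and index 63 is h8. Central squares are rewarded,
# rim squares penalized ("a knight on the rim is dim").
KNIGHT_TABLE = [
    -5, -4, -3, -3, -3, -3, -4, -5,
    -4, -2,  0,  0,  0,  0, -2, -4,
    -3,  0,  1,  2,  2,  1,  0, -3,
    -3,  0,  2,  3,  3,  2,  0, -3,
    -3,  0,  2,  3,  3,  2,  0, -3,
    -3,  0,  1,  2,  2,  1,  0, -3,
    -4, -2,  0,  0,  0,  0, -2, -4,
    -5, -4, -3, -3, -3, -3, -4, -5,
]

def knight_placement(board: chess.Board) -> int:
    """Positional knight bonus from White's point of view."""
    bonus = sum(KNIGHT_TABLE[sq] for sq in board.pieces(chess.KNIGHT, chess.WHITE))
    # Mirror the board vertically to score Black's knights with the same table.
    bonus -= sum(KNIGHT_TABLE[chess.square_mirror(sq)]
                 for sq in board.pieces(chess.KNIGHT, chess.BLACK))
    return bonus
```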
From here on we won't need to worry about what kind of evaluation function we use. In chess engines, evaluation and search are usually independent of each other's implementations. This fact was recently exploited in developing Stockfish NNUE, a Stockfish derivative that uses an efficiently updatable neural network as its evaluation function.
Minimax search
A search algorithm boils down to the way a chess engine compares possible moves. Thanks to the use of evaluation functions, search algorithms only need to search to a fixed depth to make reasonable decisions. How do they actually decide which moves to play?
Search algorithms involve a search tree, or a way of representing a board and all the possible moves to be made from that position on the board. The root node of the tree is the original board, and all of the nodes branching out from it are reachable positions.
The minimax algorithm takes advantage of the fact that chess is a zero-sum game. Maximizing your chances of winning is the same as minimizing the opponent's chances of winning. Each turn can be seen as one player making a move to maximize the evaluation function while the other tries to minimize it. In terms of a search tree, this means starting at a given node and choosing the child nodes with the best (or worst) scores.
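In code, minimax is a short recursion. A minimal sketch, assuming some `evaluate` function like the material counter above (scores from White's point of view):

```python
import chess

def minimax(board: chess.Board, depth: int, maximizing: bool) -> float:
    """White picks the child with the highest score, Black the lowest."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)  # any White-relative evaluation function
    best = -float("inf") if maximizing else float("inf")
    # Materialize the move list before mutating the board with push/pop.
    for move in list(board.legal_moves):
        board.push(move)              # descend into the child node
        score = minimax(board, depth - 1, not maximizing)
        board.pop()                   # undo the move on the way back up
        best = max(best, score) if maximizing else min(best, score)
    return best
```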
Minimax is the "correct" way to do this, in the sense that if we let it run indefinitely it would play the game perfectly. That said, it is incredibly slow and impractical, thanks to the high branching factor of the search tree: a typical position has around 30 to 40 legal moves, each of those new positions has around the same amount, and so on. The number of nodes in the search tree grows exponentially with depth.
An early way to get around this was the "Shannon Type B" variation of minimax. Instead of searching every possible move at a node (the Type A approach), only a handful of the top moves are examined. The candidate moves are chosen either by the evaluation function or by some other heuristic, like "checks, captures, threats". These algorithms keep the branching factor of the search tree low, so they can search to a greater depth. However, since they discard most moves, they are prone to missing important tactics and falling into traps.
By the 1970s, an optimization of minimax known as alpha-beta pruning made the Type A approach viable. Selecting candidate moves for a Type B algorithm turned out to be too computationally expensive in comparison, and the approach fell out of favor.
Alpha-Beta pruning
If you've ever played chess, you know that some moves are just bad. These are typically moves that let the other player gain a clear upper hand within the next few turns, such as blundering a piece. To standard minimax, these moves are just as important to think about as any other, so the algorithm gets weighed down analyzing bad positions.
Alpha-beta pruning speeds up minimax by skipping the "irrelevant" nodes of a search tree. It does so by passing two extra values down the tree, "alpha" and "beta", which track the best score each player is already guaranteed elsewhere in the tree. Since the maximizing player knows that the minimizing player will pick the response that minimizes the evaluation, they can avoid thinking about moves that allow the minimizing player to make things worse than they already are. These moves are "pruned" from the search tree and skipped.
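Here's the same recursion as before with pruning added. Called with `alpha = -inf` and `beta = +inf` at the root, it returns exactly the minimax value (again assuming a White-relative `evaluate`):

```python
import chess

def alphabeta(board: chess.Board, depth: int,
              alpha: float, beta: float, maximizing: bool) -> float:
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    if maximizing:
        best = -float("inf")
        for move in list(board.legal_moves):
            board.push(move)
            best = max(best, alphabeta(board, depth - 1, alpha, beta, False))
            board.pop()
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # prune: the minimizer already has a better option elsewhere
        return best
    else:
        best = float("inf")
        for move in list(board.legal_moves):
            board.push(move)
            best = min(best, alphabeta(board, depth - 1, alpha, beta, True))
            board.pop()
            beta = min(beta, best)
            if alpha >= beta:
                break  # prune: the maximizer already has a better option elsewhere
        return best
```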
Alpha-beta pruning makes brute-force search possible. Even better, it yields exactly the same results as naive minimax would, so it’s also the “correct” algorithm for tree search. However, it is still limited.
In the best case, alpha-beta pruning reduces the computational time by a "square root". For example, if minimax takes 100 seconds to determine the best move, then alpha-beta pruning will take only 10. In most cases, however, alpha-beta pruning will not perform that well and will take some intermediate amount of time to run.
The reason is the order in which the algorithm visits nodes. So far there's no system for deciding which moves to look at first, so the algorithm just takes them in whatever order they happen to be generated. This can lead to situations where the program analyzes the worst move before it knows it's the worst move, instead of skipping it once a better move has already been found.
Move ordering
The trick to approaching the ideal case for alpha-beta search is move ordering. The idea is to look at the most promising moves first, so that bad moves can be pruned quickly. Since it's impossible to tell which moves are best without actually searching them, this method has to be guided by heuristics, and there are several ways to implement it in a chess engine.
This technique is similar to the Type B approach to minimax, except that instead of choosing a subset of moves, all moves are kept but preferentially ordered. This can be as simple as ordering captures first and considering everything else after all the captures have been analyzed. Finer-grained heuristics may also be used, such as first looking at captures of the last piece that moved.
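A sketch of captures-first ordering, refined with the common MVV-LVA rule (most valuable victim, least valuable attacker; a standard trick, though not the one described above):

```python
import chess

def ordered_moves(board: chess.Board):
    """Captures first; among captures, prefer taking a valuable piece
    with a cheap one (a crude MVV-LVA score)."""
    def score(move: chess.Move) -> int:
        if not board.is_capture(move):
            return 0  # quiet moves come last
        # En passant captures land on an empty square, hence the pawn fallback.
        victim = board.piece_type_at(move.to_square) or chess.PAWN
        attacker = board.piece_type_at(move.from_square)
        return 10 * victim + (6 - attacker)
    return sorted(board.legal_moves, key=score, reverse=True)
```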
Non-capturing moves can be ordered too. The most common heuristic for these is the "killer" heuristic. Within the alpha-beta search tree, two sibling nodes (nodes descended from the same parent) will have very similar positions. This means that a move which causes an alpha-beta cutoff in one node is likely to be just as strong in its siblings, so it is analyzed first once the algorithm reaches those nodes.
Another heuristic method for move ordering is known as relative history. It's actually a combination of the history and butterfly heuristics, which are inadequate by themselves. The history heuristic orders moves by the number of times they've caused an alpha-beta cutoff in the search tree. The butterfly heuristic orders moves by how many times they've been searched overall. Relative history orders moves by the ratio of these two scores, essentially picking out the moves that were most effective when tried.
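A sketch of the bookkeeping, here keyed by (from-square, to-square) pairs; real engines often key on the moving piece and destination instead:

```python
from collections import defaultdict

history = defaultdict(int)    # times a move caused a cutoff
butterfly = defaultdict(int)  # times a move was searched at all

def relative_history(move_key) -> float:
    """Higher means the move caused cutoffs more reliably when tried."""
    return history[move_key] / max(1, butterfly[move_key])

# Inside the alpha-beta move loop, one would do roughly:
#   butterfly[(move.from_square, move.to_square)] += 1
# and, whenever the move causes a cutoff:
#   history[(move.from_square, move.to_square)] += 1
```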
If we allow the search algorithm to keep information about its search tree from previous moves, then we can also employ a form of move-ordering based on each node’s preferred moves from the previous search. This way, a chess AI searching a position to a depth of 8 doesn’t need to throw away all of that analysis and start anew on the next turn.
Since this relies on storing nodes and information about them, we need a memory-efficient way to keep that data around. For remembering just the principal variation we can get away with storing boards as-is, since there are only a few of them. In general, though, there is a better way.
Transposition tables
The speed of alpha-beta pruning is vastly improved by move ordering. It might seem we've optimized as far as possible, but we can go even further through the use of transposition tables. A transposition table stores data about nodes encountered during the search, allowing the algorithm to skip nodes it has already seen elsewhere.
In a game like chess, the same position can be reached through different move orders, so transposed nodes show up all across the search tree. With a transposition table, a given transposition only needs to be analyzed once; afterwards the algorithm remembers it and doesn't waste time re-analyzing it elsewhere in the tree.
Keeping a transposition table requires a large amount of memory. Storing entire board positions becomes impractical here, so programmers got creative. Since the program only needs to check whether two positions are the same, we can borrow an idea from cryptography and compare their hashes instead.
The idea behind hashing is that most of the information in an object is thrown away, while equality can still (almost always) be checked. In our case, "throwing away information" means compressing the object to save memory. The other important property of a hash is that two similar objects should have completely dissimilar hashes, which means the hashing function should behave as randomly and chaotically as possible.
How do we hash a chess board? The answer comes from the Zobrist hashing method. It treats the 8x8 board as 12 bitboards of 64 bits each (one for each of the 6 piece types in 2 colors), along with a few extra bits to track castling and en passant rights. It then generates a key table: an 8x8x12 array of random bitstrings (64-bit in modern engines), with similar random strings for the extra bits. Finally, it starts from zero and XORs in the "key" of each piece on the board, i.e. the random strings in the key table corresponding to the 1s in the 12 bitboards.
What makes this hash useful is that the XOR operation is its own inverse, which means we can effectively add and remove pieces from the hash. So, instead of recalculating the whole hash from scratch after a move, we simply XOR out the moving piece's key on its old square and XOR in its key on the new square. This makes hashing efficient enough for a chess engine, since updating a board's hash after a move requires only two XOR operations.
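A bare-bones sketch of Zobrist hashing in pure Python (the castling, en passant, and side-to-move keys are omitted for brevity):

```python
import random

random.seed(0)  # fixed seed so the keys are reproducible between runs

# One random 64-bit key per (piece type, color, square) combination.
ZOBRIST = [[[random.getrandbits(64) for _ in range(64)]
            for _ in range(2)] for _ in range(6)]

def zobrist_hash(pieces) -> int:
    """`pieces` is an iterable of (piece_type 0-5, color 0-1, square 0-63)."""
    h = 0
    for piece_type, color, square in pieces:
        h ^= ZOBRIST[piece_type][color][square]
    return h

def update_after_move(h: int, piece_type: int, color: int,
                      from_sq: int, to_sq: int) -> int:
    """XOR the piece out of its old square and into its new one:
    two XORs instead of a full recomputation."""
    h ^= ZOBRIST[piece_type][color][from_sq]
    h ^= ZOBRIST[piece_type][color][to_sq]
    return h
```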
Information about a given node can then be stored in a hash table. When the algorithm encounters a new node, it checks whether that node's hash is already in the table. If so, it reuses the stored data without having to search the node again.
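A minimal table might look like the sketch below; real engines also record whether the stored score was exact or just a bound, plus the best move found:

```python
# hash -> (search depth, score)
transposition_table = {}

def tt_probe(h: int, depth: int):
    """Reuse a stored score only if it came from an equal-or-deeper search."""
    entry = transposition_table.get(h)
    if entry is not None and entry[0] >= depth:
        return entry[1]
    return None

def tt_store(h: int, depth: int, score: float) -> None:
    transposition_table[h] = (depth, score)
```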
The use of hashing has its downsides, however. Since hashing reduces high-information objects to low-information ones, collisions can occur, where two totally different positions end up with the same hash value. This can be mitigated by making the transposition table large; a size of around one to ten million entries is generally good enough.
Quiescence search
Until this point we've been dealing with fixed-depth searches. Since these programs are limited in how far they can look ahead, they can easily miss multi-move sequences that begin near the terminal nodes and resolve beyond them. This is known as the horizon effect.
An example of the horizon effect in chess is declining a favorable exchange. If the search ends just after the AI gives up a piece, the engine sees only that it is down material, not that it can recapture on the next move with a good position, so it is unlikely to ever enter the exchange.
The way around this is to perform a second limited search on the “unstable” terminal nodes. This is a quiescence search, as it is meant to resolve a dynamic position into a stable one where the evaluation function will be more accurate. This mitigates the horizon effect for the nodes that are the most susceptible to it.
In chess, a position may be called unstable if any piece can be captured. The quiescence search considers only capturing moves, so its branching factor is much smaller than the main search's. Thanks to this, it can afford to run to a greater depth than the original search.
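A sketch in negamax style, where `evaluate` is assumed to score the position from the side to move's point of view:

```python
import chess

def quiescence(board: chess.Board, alpha: float, beta: float) -> float:
    # "Stand pat": the side to move can always decline every capture,
    # so the static evaluation is a lower bound on this node's score.
    stand_pat = evaluate(board)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in list(board.legal_moves):
        if not board.is_capture(move):
            continue  # only captures, which keeps the tree narrow
        board.push(move)
        score = -quiescence(board, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```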
How computers play chess
Depth-limited searches with an imperfect evaluation function will always lead to errors in play. Computers can only approximate a perfect strategy, so they all have weaknesses. The horizon effect was one of these. A more unusual one is search pathology, in which a deeper search actually produces worse decisions; luckily, this doesn't appear to happen in chess.
Chess computers are extremely good at direct attacks and sharp lines, while "quieter" positions based on positional play have historically given them trouble. Over the past 20 years, however, they've risen above human players in skill; better optimizations and knowledge-based heuristics have closed the gap.
Follow me:
- WordPress: https://austeretriceratops.wordpress.com/