AlphaZero and the Curse of Human Knowledge

DeepMind has shown us that statistical learners are better suited to playing chess and Go than humans

In December 2017, Google DeepMind’s new chess engine, AlphaZero, soundly defeated Stockfish 8, one of the strongest computer chess engines in the world. Although AlphaZero wasn’t playing the strongest version of Stockfish and the system is still under peer review (this piece from Jose Camacho Collados does a great job of tempering some of the claims), it appears that AlphaZero will soon be the strongest chess playing entity in existence, if it isn’t already. DeepMind’s paper is on arXiv here.

AlphaZero’s ascendance is particularly remarkable because the core software behind it wasn’t originally created to play chess. AlphaZero is a tweaked version of AlphaGoZero, the program DeepMind recently used to beat the best Go player in the world. To create the new computer chess champion, the DeepMind researchers essentially took AlphaGoZero, removed its Go-specific features (for example, some aspects of AlphaGoZero depended on the rotational symmetry of the Go board), added the rules of chess, and trained the new system on 5,000 of Google’s custom machine learning chips for four hours.

Lee Sedol, a champion Go player, struggling to decipher one of AlphaGo’s moves. Photo by Google.

How Computers Play Chess

How did DeepMind become competitive with the world’s best chess programs so quickly? The same two major components power AlphaZero, Stockfish, and most other chess engines: a value function and a tree search. The value function takes a game state as input (for chess, this is just the position of each piece plus a bit indicating whether or not each player can still castle) and outputs a single number estimating how “good” that state is for the computer player. If the value function works well, it assigns high ratings to board positions that favor the computer and low ratings to positions that are likely to lead the computer to lose.
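The interface can be sketched in a few lines of Python. The state encoding and the material-only scoring below are invented for illustration; real engines use far richer representations:

```python
# A toy value function: score a state by counting material.
# Uppercase pieces belong to the computer, lowercase to the opponent.
# (This encoding and the point values are illustrative only.)
PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

def value(state):
    """Return a single number: higher is better for the computer."""
    score = 0
    for piece in state.values():
        if piece.isupper():
            score += PIECE_VALUES[piece]          # our piece
        else:
            score -= PIECE_VALUES[piece.upper()]  # opponent's piece
    return score

# The computer is up a rook for a knight in this toy position.
print(value({"e1": "K", "a1": "R", "e8": "k", "b8": "n"}))  # 5 - 3 = 2
```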

The tree search uses the value function to actually select a move. The simplest tree search simulates every possible sequence of future moves: for each move the computer can make, it looks at every move its opponent could respond with, then at every possible reply to those responses, and so on. Ideally, the computer would keep simulating until it reached the end of the game, but the exponential growth in possibilities makes this intractable. Instead, most algorithms explore as much of the tree as they can within a predetermined time limit.
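The exhaustive search just described is the classic minimax algorithm. Here is a minimal sketch over an abstract game tree; the tree, move generator, and leaf values are invented for illustration:

```python
def minimax(state, depth, moves, value, maximizing=True):
    """Depth-limited minimax: simulate every move sequence, alternating
    between the computer (maximizing) and its opponent (minimizing),
    and fall back to the value function when the depth limit is hit."""
    children = moves(state)
    if depth == 0 or not children:
        return value(state)
    scores = (minimax(c, depth - 1, moves, value, not maximizing)
              for c in children)
    return max(scores) if maximizing else min(scores)

# A tiny hand-built tree: the opponent always picks its best reply.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaves = {"a1": 3, "a2": 5, "b1": -2, "b2": 9}
best = minimax("root", 2, lambda s: tree.get(s, []), lambda s: leaves.get(s, 0))
print(best)  # "a" guarantees 3; "b" risks -2, so the search returns 3
```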

An illustrated tree search for a tic-tac-toe program (https://www.ics.uci.edu/~eppstein/180a/970417.html)

It’s important to focus the search on the most promising parts of the tree given what’s going on in the game. The best tree search algorithms use additional rules to avoid spending too much time exploring the consequences of moves that are obviously bad. For example, an engine might not bother to work out all the possible implications of sacrificing its queen, because a move like that is very likely to be a dud unless it leads to checkmate or a capture of the opponent’s queen within a few rounds of simulation. The value function makes these judgments — if a move leads exclusively to low-value states after one round of simulation, it won’t receive much additional consideration.
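One classic way to formalize “don’t waste time on obviously bad branches” is alpha-beta pruning, the technique engines in the Stockfish family rely on (AlphaZero instead uses a Monte Carlo tree search). A sketch over the same kind of abstract tree as before, with invented moves and values:

```python
def alphabeta(state, depth, moves, value,
              alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax with alpha-beta pruning: stop exploring a branch as soon
    as we know the opponent would never let the game reach it."""
    children = moves(state)
    if depth == 0 or not children:
        return value(state)
    if maximizing:
        best = float("-inf")
        for c in children:
            best = max(best, alphabeta(c, depth - 1, moves, value,
                                       alpha, beta, False))
            alpha = max(alpha, best)
            if beta <= alpha:
                break  # prune: this branch is already a dud
        return best
    best = float("inf")
    for c in children:
        best = min(best, alphabeta(c, depth - 1, moves, value,
                                   alpha, beta, True))
        beta = min(beta, best)
        if beta <= alpha:
            break  # prune from the minimizing side
    return best

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaves = {"a1": 3, "a2": 5, "b1": -2, "b2": 9}
result = alphabeta("root", 2, lambda s: tree.get(s, []),
                   lambda s: leaves.get(s, 0))
print(result)  # same answer as plain minimax: 3, but "b2" is never visited
```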

While both AlphaZero and Stockfish work according to this general method, they use very different value functions. Stockfish’s value function is a formula based on the value of the pieces on the board and a series of rules such as the following:

Piece Placement
1. Encourage knights to occupy the center of the board.
2. Encourage queens and rooks to defend each other and attack.
3. Encourage 7th rank attacks for rooks.

Pawn Structure
1. Penalize doubled, backward and blocked pawns.
2. Encourage pawn advancement where adequately defended.
3. Encourage control of the center of the board.

Etc.

(these examples are taken from a blog post by Catherine Ray on how Stockfish works)
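A few rules of this flavor can be sketched as code. The weights below are invented for illustration; Stockfish’s real evaluation has many more terms, each carefully tuned:

```python
# Illustrative hand-written evaluation: material plus a few positional
# rules, in the spirit of the list above. All weights are made up.
PIECE_VALUES = {"P": 100, "N": 300, "B": 300, "R": 500, "Q": 900, "K": 0}
CENTER = {"d4", "d5", "e4", "e5"}

def evaluate(board):
    """`board` maps squares like 'e4' to the computer's pieces."""
    score = 0
    pawns_per_file = {}
    for square, piece in board.items():
        score += PIECE_VALUES[piece]
        if piece == "N" and square in CENTER:
            score += 30                   # rule: knights in the center
        if piece == "R" and square[1] == "7":
            score += 25                   # rule: rooks on the 7th rank
        if piece == "P":
            pawns_per_file[square[0]] = pawns_per_file.get(square[0], 0) + 1
    for count in pawns_per_file.values():
        if count > 1:
            score -= 20 * (count - 1)     # rule: penalize doubled pawns
    return score

# Centralized knight, rook on the 7th, but doubled e-pawns.
print(evaluate({"d4": "N", "a7": "R", "e2": "P", "e3": "P"}))  # 1035
```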

Stockfish’s developers must write each of these rules by hand and decide on their relative importance. Even though everything is hand-crafted, the rules are far from arbitrary. The Stockfish community has created a process for testing changes to the value function and the tree search: each potential addition a developer thinks of is run through an extensive test suite to determine its impact on the program’s performance. If the tests go well, the tweak is added to the project and Stockfish gets a tiny bit better. Through years of these iterations, contributed by dozens of people, the open-source engine has steadily grown stronger. You can see the recent tests the team has run online here.

How AlphaZero works

AlphaZero’s developers took the opposite approach. AlphaZero doesn’t have any heuristics built into its value function. Instead, the value function is learned entirely from scratch by a neural network. The network starts out assigning essentially random values to each state. Then two copies of the whole engine (the neural network value function plus a tree search) play against each other. After each game, the neural network is updated with the new data: its value function learns to assign slightly higher values to the positions that led to victory and slightly lower values to the positions that led to defeat. This process is repeated hundreds of thousands of times: two copies of AlphaZero play each other, the value function is updated, two new copies (with slightly improved value functions) play each other, the value function is updated further, and so on. After 300,000 generations of neural network improvements and over a billion games of self-play (AlphaZero played itself around 4,000 times between each update), AlphaZero was good enough to beat Stockfish.
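The loop can be sketched schematically. To stay self-contained, the “network” below is just a lookup table of state values nudged toward each game’s outcome; the real system updates neural network weights by gradient descent and plays thousands of games between updates:

```python
def self_play_update(values, play_game, learning_rate=0.5):
    """Play one game, then shift each visited state's value toward the
    final outcome (+1.0 for a win, -1.0 for a loss)."""
    visited_states, outcome = play_game(values)
    for state in visited_states:
        old = values.get(state, 0.0)
        values[state] = old + learning_rate * (outcome - old)
    return values

# A stub game in which player one always wins through states s1 and s2.
def fake_game(values):
    return ["s1", "s2"], 1.0

values = {}
for _ in range(3):
    values = self_play_update(values, fake_game)
print(round(values["s1"], 3))  # moves toward 1.0: 0.5, 0.75, 0.875
```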

Using self-play to recursively improve an agent’s ability to play a game isn’t new. So why hadn’t this method yielded a champion chess or Go engine before 2017? Historically, systems that improve via self-play have been very unstable. Previous attempts often ended up in cycles, forgetting and relearning strategies over and over rather than progressing to superhuman levels. Or sometimes the agent would get stuck, failing to improve after achieving moderate success.

AlphaZero’s main contribution was solving these problems. After many experiments, DeepMind developed a series of new tricks and discovered a value function and tree search that reliably improved through self-play alone. They then leveraged their engineering talent and infrastructure to demonstrate that the system could work at the massive scale required to master complicated games such as chess and Go (the version that played Stockfish ran on 5,000 custom machine learning chips). Interestingly, it appears that researchers from University College London came up with a similar algorithm around the same time but lacked the engineering resources to apply it to a game as combinatorially rich as chess. Instead, they settled for Hex, a simpler and lesser-known game, and as a result their paper attracted much less attention than DeepMind’s.

If you are interested in the specific tricks that make this work, I recommend this blog post by Seth Weidman. For more technical depth, I liked this post by Tim Wheeler on AlphaGoZero’s tree search and the paper from the University College London Researchers, which has more details than the AlphaZero paper and includes a comparison between the two approaches.

How AlphaZero is different

Chess is a very complicated game, so human players rely on a variety of mental models to play effectively. One of the most popular is the relative value point system, which most players learn after mastering the basics of the game. In this model, each pawn is worth one point, each bishop and knight three points, each rook five points, and the queen nine points. There’s a lot going on during any chess game, and the large number of possible moves makes it very difficult to plan more than a couple of steps ahead. Beginning players often use heuristics like the point system to guide their play. Should I trade a bishop and a knight for a rook? Five points is less than six, so according to the point system you should avoid the trade. Developing a long-term strategy in the midst of a chess game is hard, but optimizing for points over the next few moves is feasible. Although it’s not a winning strategy against stronger players, the relative value point system allows beginners to play reasonably well in a short amount of time.
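The arithmetic a beginner does in their head is simple enough to write down directly (a toy sketch of the point system above):

```python
# The relative value point system as a trade calculator.
POINTS = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def trade_delta(given, received):
    """Positive means the trade gains material by the point system."""
    return sum(POINTS[p] for p in received) - sum(POINTS[p] for p in given)

# Bishop and knight for a rook: 5 - 6 = -1, so avoid the trade.
print(trade_delta(["bishop", "knight"], ["rook"]))  # -1
```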

As players gain experience, they refine and expand their models. For example, a more experienced player knows that rooks are more useful in the endgame and as a result might trade a bishop and a knight for a rook under the right circumstances. They learn to take position into account, favoring pieces in the middle of the board, pawn chains, and a well-protected king. However, humans don’t learn by simply memorizing rules. Instead, we learn in “chunks.” Once we have internalized a model, decisions become intuitive and no longer require substantial cognitive effort. That intuition can then be a foundation to build higher-level models. (see “Cognitive Architecture and Instructional Design”, pdf available here). A mathematician solving an equation doesn’t have to think about adding together two terms — after enough practice, the arithmetic becomes automatic and he or she can focus conscious thought on the bigger picture. Similarly, an expert chess player weighing the impact of capturing a particular piece can “feel” how important it is without focusing attention on the point value system or the importance of rooks in the endgame. They can focus their attention on higher level strategy (like what could happen three or four moves down the line).

Because new models are built by referencing and tweaking prior models, there is path dependence in how we think about chess. Once the “arithmetic” of chess (basics including openings, how to value pieces, etc.) becomes intuitive, those models shape the way we think about the game. Since most people learn about the game in the same way, the models affect the way we play in a way that’s hard for us or others to detect or change.

AlphaZero learns chess in an entirely different way. Human experts build a rich, hierarchical model of how chess works by playing tens of thousands of games. In contrast, AlphaZero’s neural network leverages raw computational power to figure out which moves are most likely to lead to victory by playing a billion games. Our ability to reason about the game allows us to become competent quickly. For example, once you learn the rules and watch a couple of games, it’s evident that the queen is a very valuable piece. By contrast, AlphaZero is worthless until it has played thousands of games. The value of the queen, a concept we grasp almost immediately, becomes apparent to AlphaZero only after it has established a reliable correlation between the queen’s survival and the outcome of the game.

Since AlphaZero’s understanding of chess is fundamentally statistical, it requires about 10,000 times the experience of a human to reach an expert level of play. But, unlike writing a novel or holding a conversation, discarding semantics in favor of number-crunching has turned out to be a winning strategy for chess and Go. In fact, AlphaZero may have exposed flaws in our ability to play these games. For example, in one of the publicly released chess games, AlphaZero sacrifices a knight for a subtle positional advantage that ends up winning it the game many moves later (a video of the game with analysis is on YouTube). Our heuristics (for example, the idea that a knight is worth three points, which blinds us to the value of a small positional change) and our inability to look as far ahead as a computer have prevented us from discovering these types of moves. (It’s not all bad news for human knowledge, though: AlphaZero did rediscover and use several familiar openings.) Stockfish, despite its ability to explore many more possible moves than AlphaZero can, remains limited by its value function: a series of heuristics written by humans and drawn from our best understanding of the game.

In 1997, IBM created Deep Blue, the first chess engine that could defeat the best human players. Twenty years later, it appears that DeepMind has created an engine that can defeat the best chess engine developers. AlphaZero achieved world-class performance in chess and Go without specialized knowledge of either game. DeepMind now faces an even bigger test — can they bring their game-playing algorithms into the real world?

I’m working on a new post on the difficulty of applying the AlphaZero algorithm to finance and other “real world” domains. Follow me on Twitter or Medium to receive the latest updates!