A very non-technical explanation of how AlphaGo Zero can teach itself to play Go so darn well

Neal Donnelly
Published in BuzzRobot · 6 min read · Nov 26, 2017

For many years, the ancient Chinese game of Go was seen as the last major board game at which computer programs could not outcompete (or even challenge) top human players. Google subsidiary DeepMind made headlines in March 2016 when their AlphaGo program defeated Lee Sedol, one of the greatest Go players in the world. It was an impressive achievement, but one that relied on enormous supercomputers and a massive training dataset of professional Go games.

In late 2017, DeepMind made waves again with AlphaGo Zero, a successor that surpassed the old system “without human knowledge.” Rather than studying expert games, the new system learned everything about Go simply by playing against itself. The new algorithm is much simpler, much faster, much more efficient, and much more effective than the version that defeated the human world champion.

The paper that the team published explaining the new program is a bit puzzling, since it spends more time on the simplifications they made than on why the simplified version is so much more effective. The key change, however, is actually quite clever, and it requires no technical background to understand.

How to play Go

On every turn, the player must choose between hundreds of options to place their stone.

Go is simple. Players take turns placing their stones on the board, attempting to surround their opponent’s stones and to encircle territory, thereby claiming it as their own. If a group of a player’s stones is entirely encircled by opposing stones, that group is captured by the opponent. The game has always been very difficult for computers because each turn presents so many choices of where to place the stone, and because it can be tough to tell just who is encircling whom.

How the first AlphaGo worked

Imagine that AlphaGo is actually a team of three people, whom we’ll call Nora, Valerie, and Monty. Nora has studied lots and lots of professional games of Go and learned to guess where a master Go player would place their stone in any given board situation. When shown the board, she can immediately identify the most promising spots to place a stone. Valerie has also studied many professional games of Go, and she’s really good at telling who’s winning just by looking at the board. Monty doesn’t know much about strategy, but he’s really good at remembering and imagining different configurations of Go stones on the board.

On board A, Nora recommends the spots marked in red and blue as the best options for placing their black stone. Monty remembers the two middle boards, then asks Nora where she thinks the opponent will play the white stone on hypothetical board B, and he remembers the two hypothetical boards her suggestions would land them on. By evaluating many of these hypotheticals, Monty can come up with his own recommendations for where to play on board A.

Together, Nora, Valerie, and Monty make a good team. When it’s their turn, Monty asks Nora where a pro would most likely play. For each of those spots, he then shows her the board as if they had played that stone, and asks which moves their opponent is most likely to respond with. For each of those, Monty asks Nora for the most likely responses, and so on. Monty is really fast at showing Nora the boards, and really good at remembering her responses, so they can explore a lot of possibilities very quickly.

After Monty and Nora imagine a few steps down the tree of possible moves, Monty asks Valerie to assess how likely they are to win from each possible board that they could reach. He then does some math to synthesize Valerie’s assessments of the possible future situations, figure out which option is most likely to help them win, and plays the stone for their turn.
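To make the teamwork a bit more concrete, here is a toy sketch in Python. It is not DeepMind’s actual Monte Carlo tree search, just a depth-limited lookahead in the same spirit, and every function in it (legal_moves, play, policy, value) is a made-up stand-in: the “networks” return random numbers and the “board” is a stub.

```python
import random

def legal_moves(board):
    """Stub: pretend every position offers the same ten numbered moves."""
    return range(10)

def play(board, move):
    """Stub: the 'board' is just the move history; the opponent moves next."""
    return board + (move,)

def policy(board):
    """Nora: score every legal move. Random numbers stand in for her network."""
    return {move: random.random() for move in legal_moves(board)}

def value(board):
    """Valerie: estimate the chance that the player to move wins."""
    return random.random()

def lookahead(board, depth, top_k=3):
    """Monty: expand Nora's top suggestions a few moves deep, score the
    leaf boards with Valerie, and back the results up the tree."""
    if depth == 0:
        return value(board)
    scores = policy(board)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    # After we play a move, it is the opponent's turn, so our winning
    # chance from here is one minus theirs after each candidate move.
    return max(1.0 - lookahead(play(board, m), depth - 1, top_k) for m in best)

def choose_move(board, depth=3, top_k=3):
    """Pick the candidate move whose resulting position looks best."""
    scores = policy(board)
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return max(candidates,
               key=lambda m: 1.0 - lookahead(play(board, m), depth - 1, top_k))

print(choose_move(board=()))  # prints one of the ten toy move numbers
```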

The team does this process before each of their turns. Every day, they play many games like this. Every night, Nora and Valerie go home with photos of each move they played that day, labeled with whether or not they ended up winning that game. They study each move from their winning games, remembering it as a good move, and each move from their losing games, remembering it as a bad move. Studying makes Nora less likely to play the losing moves and more likely to play the winning moves.
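As a cartoon of that nightly study, imagine Nora’s intuition as a simple table of move preferences rather than the real neural network. Every move from a won game gets nudged up and every move from a lost game gets nudged down; the learning_rate value here is an arbitrary illustration, not anything from the paper.

```python
from collections import defaultdict

# Nora's "intuition": a preference score for each (board, move) pair.
preferences = defaultdict(float)

def study_game(moves, won, learning_rate=0.1):
    """Nudge every move from the game up if we won, down if we lost."""
    nudge = learning_rate if won else -learning_rate
    for board, move in moves:
        preferences[(board, move)] += nudge

# One imaginary game of two (board, move) pairs, which the team lost.
study_game([("empty board", "corner"), ("mid game", "center")], won=False)
print(preferences)
```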

How the new AlphaGo Zero works

For AlphaGo Zero, the change that received the most focus is that Nora never studies any professional games of Go. In fact, when she starts, she doesn’t even know the rules of the game, and just places stones at random. What’s more, Valerie isn’t around anymore, and Nora is asked both to predict promising moves and to assess who’s winning. And instead of three months of practice to master the game, Nora and Monty only have three days. With just these changes, it’s hard to see how AlphaGo Zero does so much better.

The new key to success is very subtle. When Nora goes home, she doesn’t study whether the moves they played led to wins or losses. In fact, she doesn’t study the moves they played at all. Instead, for each turn, she compares the initial suggestions that she gave to Monty with the recommendations that he came up with after exploring the tree of possibilities with her help.

On the left are Nora’s recommendations, highlighted in yellow. Monty uses those recommendations to explore the tree of future board states and generate his own recommendations, shown on the right in green. Of those, they decide to play the best one, marked as Location A, but they end up losing the game. In the original AlphaGo, the lesson Nora would learn is that Location A is a bad place to move because it made them lose. In the new version, Nora just studies to make her yellow predictions match Monty’s green predictions.
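A rough sketch of that new study target, with invented numbers: Nora’s error is measured by how far her pre-search move probabilities sit from the probabilities Monty’s search settles on, regardless of who won. (The real system’s training also pushes her win estimate toward the game’s actual outcome, which this sketch leaves out.)

```python
import math

def cross_entropy(search_probs, network_probs):
    """Smaller when Nora's snap judgments match Monty's search results."""
    return -sum(p * math.log(q + 1e-12)
                for p, q in zip(search_probs, network_probs))

# Monty's green recommendations (from search) vs. Nora's yellow ones,
# as probabilities over four candidate moves. The numbers are invented.
monty = [0.60, 0.25, 0.10, 0.05]
nora = [0.40, 0.30, 0.20, 0.10]

print(cross_entropy(monty, nora))  # Nora studies to push this number down.
```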

As fast as Monty is, it takes him a while to imagine so many different possible futures. If Nora could intuit those same recommendations just by glancing at the board, her suggestions would make Monty’s search more useful, and would help him come up with even better recommendations. And if every night she studies better recommendations from Monty, her intuitions will become better, helping Monty make better recommendations, which she can study to build better intuition… a virtuous cycle.

It turns out that this virtuous cycle between Nora and Monty makes Nora much more effective than when she was simply studying each of their moves as good or bad. It also gives her far more to study. Because the board is mostly the same after each move, back when she studied good and bad moves she couldn’t study every move from each game, or she’d become overly convinced that anything resembling that game was good or bad. Instead, she would study just one move from each game, so she, Valerie, and Monty had to play a ton of games to generate enough study material. Now that she’s just learning to predict Monty’s recommendations, she can study the recommendation for every move, and doesn’t need to play nearly as many games to get lots of study material.
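A back-of-envelope comparison makes the difference vivid, assuming a typical game runs a couple hundred moves (the exact figures are just an illustration):

```python
moves_per_game = 200   # an assumed typical game length, for illustration
games_played = 1_000

# Old scheme: sample one move per game to avoid correlated examples.
old_examples = games_played * 1
# New scheme: every move's search recommendation is a study example.
new_examples = games_played * moves_per_game

print(old_examples, new_examples)  # 1,000 vs. 200,000 study examples
```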

Of course, Nora and Monty aren’t real people. Nora is a neural network, and Monty is a Monte Carlo tree search. (Valerie was a second, value-estimating neural network.) These are well-known algorithms that people have been working with for years, and both are used in the original AlphaGo. A simple but brilliant change to the implementation of these well-known algorithms makes the program vastly more effective. And that sort of ingenuity, understanding a problem and suggesting a clever conceptual solution, is exactly what we still have no idea how to make a computer do.
