What future for the AI behind Google DeepMind’s AlphaGo?

Louis Dorard
Own Machine Learning
8 min read · Feb 15, 2016
“Easy come, easy go” Flickr © Linh Nguyen

One thing that caught my eye in the announcement that the new DeepMind AI had defeated European Go champion Fan Hui 5 times in a row was the following statement: “Because the methods we’ve used are general-purpose, our hope is that one day they could be extended to help us address some of society’s toughest and most pressing problems.” The superiority of an AI in Go is already a significant breakthrough (the news made the front page of Nature), but could the future impact of this discovery be even bigger?

AlphaGo is to Go what Deep Blue is to Chess, but after reading the above statement, I couldn’t help comparing AlphaGo with Watson, the AI that won Jeopardy! in 2011 and that IBM has tried for a long time to use in other domains. One important difference for me, though, is that the Google DeepMind team published a research paper that explains in detail how AlphaGo works (you can read the paper online for free, but Nature charges $32 to download a pdf copy). Four years after graduating from UCL, it was fun to see that the first author of the paper is David Silver, who was my internal PhD thesis examiner! Some of the techniques explained in the paper looked familiar (tree search and reinforcement learning), but their combination with deep learning was new to me, so I turned to my friend Gabriel Synnaeve (a postdoc at Facebook working on deep learning) for help.

Once I had a high-level understanding of the functioning of AlphaGo, I explained it to a few people around me, but I noticed from their reactions that it wasn’t clear at all how to extend any of that outside of Go. Here are some thoughts on that—but first, let’s look at how AlphaGo works.

Look-ahead search (a.k.a. tree search)

Go is a board game that involves two players, who take turns placing stones on a board (see picture above). Points are given for capturing the opponent’s stones or for surrounding empty space (points of territory). One notable difference from a game like Chess is the number of possible moves available at each turn, which in Go is on the order of 10 times that in Chess.

At the heart of Chess and Go programs is an algorithm that examines possible sequences of moves (player A plays this move, then player B may play these moves, then player A may play those moves, etc.) and evaluates who has the advantage at the end of these sequences. This is referred to as tree search, where nodes of the tree contain representations of the board and each branch corresponds to a possible move (which makes a lot of branches per node). For this approach to work, one needs a good way of evaluating positions at leaf nodes, and for the search to be efficient, one needs a good way of prioritizing branches to explore.
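To make this more concrete, here is a minimal sketch of depth-limited look-ahead search in Python (negamax-style, with hypothetical helpers legal_moves, play and evaluate that you would supply for your game). It is only an illustration of the idea above, not AlphaGo’s actual search, which is a Monte Carlo tree search guided by neural networks:

```python
def search(board, depth, player):
    """Best (score, move) for `player`, looking `depth` moves ahead.

    Minimal negamax sketch: scores are from the point of view of the player
    to move, so a child's score is negated when backed up. Assumes
    hypothetical helpers legal_moves(board, player),
    play(board, move, player) -> new board, and evaluate(board, player) -> float.
    """
    if depth == 0:
        # Leaf node: rely on a position-evaluation function.
        return evaluate(board, player), None

    best_score, best_move = float("-inf"), None
    for move in legal_moves(board, player):           # one branch per legal move
        child = play(board, move, player)
        score, _ = search(child, depth - 1, -player)  # opponent to move
        if -score > best_score:                       # flip the sign back to our view
            best_score, best_move = -score, move
    return best_score, best_move
```

In Go, exploring every branch this way is hopeless (there are simply too many), which is exactly why a good leaf evaluation and a good way of prioritizing branches matter so much.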

Deep Learning

One approach that seems essential to understanding the state of a Go board and evaluating positions is to look for visual patterns. The board represents the world, and the game is about splitting the world into territories—therefore, it is quite visual.

Estimating score before endgame on a small Go board (from Sensei’s library)

Deep neural networks have proved to work really well at picking up visual patterns, and they have had a lot of success at detecting objects in images. The process of training (i.e. creating) these networks is called deep learning, and it typically requires large amounts of training data: examples of objects (images, here) and of the aspects of these objects that we want to predict (e.g. the fact that the image contains a dog, a chair, a person, etc.).

One particular type of network is used in computer vision: the convolutional network. It takes an image as input, represented as a matrix of pixels. Layers of the network correspond to representations at increasing levels of abstraction. In the face recognition example below, these are contours, face parts (eyes, nose, etc.), and face “shapes” learnt from example faces. The prediction for a new face image given as input is a function of the representation of the image produced by the last layer of the network.

Schematic representation of a deep neural network, showing how more complex visual features are captured in deeper layers (by Larry Brown at NVIDIA)
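To give a flavour of what such a network looks like in code, here is a minimal sketch in PyTorch (my choice of library for illustration, not necessarily DeepMind’s): a few convolutional layers that turn a pixel matrix into increasingly abstract feature maps, and a final layer that turns the last representation into a prediction.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Toy convolutional classifier: each convolutional layer builds a more
    abstract representation of the image, and the final linear layer maps the
    last representation to class scores (dog, chair, person, ...)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # low level: edges, contours
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # mid level: object parts
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # high level: whole shapes
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                # x: a batch of images, shape (N, 3, H, W)
        h = self.features(x).flatten(1)  # last-layer representation
        return self.classifier(h)        # the prediction is a function of that representation

# Example: class scores for a batch of 4 random 32x32 RGB "images".
scores = TinyConvNet()(torch.randn(4, 3, 32, 32))
print(scores.shape)  # torch.Size([4, 10])
```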

If you want to learn more, I recommend Mike Wang’s introduction to deep learning at PAPIs ‘15.

In AlphaGo, deep learning is used to train two networks that take a basic representation of the Go board as input:

  • The value network predicts the expected outcome of the game from a given position. It is used when the tree search reaches a leaf, so the program can evaluate the quality of the position associated with that leaf (how likely it is to be winning or losing).
  • The policy network outputs a probability distribution over all legal moves. In the tree search, it serves as a short-sighted ranking of moves and makes it possible to restrict exploration to the branches that look good enough, rather than all possible ones.

The basic representation of the Go board would be a 19 × 19 matrix with ternary values: no stone, white stone, or black stone (which could be encoded as 0, 1 and 2 respectively). The networks then use the more complex features learnt from data, found in their last layers, to make predictions and estimate probability distributions.
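As an illustration of what these two networks might look like (a heavily simplified sketch: the networks in the paper are separate, much deeper, and use richer hand-crafted input features than the bare ternary board), here is a PyTorch version with a shared convolutional trunk, a policy head and a value head:

```python
import torch
import torch.nn as nn

BOARD = 19  # board size

class PolicyValueNets(nn.Module):
    """Simplified policy and value networks over a Go board.

    Input: the ternary board encoded as three binary planes
    (empty / white / black), shape (N, 3, 19, 19). Only a sketch of the idea,
    not the architecture from the AlphaGo paper.
    """

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: one score per board point, turned into a distribution over moves.
        self.policy_head = nn.Conv2d(64, 1, kernel_size=1)
        # Value head: a single number in (-1, 1) estimating the expected outcome.
        self.value_head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * BOARD * BOARD, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, board):
        h = self.trunk(board)
        policy = torch.softmax(self.policy_head(h).flatten(1), dim=1)  # (N, 361)
        value = self.value_head(h)                                     # (N, 1)
        return policy, value

# Example: an empty board, i.e. every point marked in the "empty" plane.
planes = torch.zeros(1, 3, BOARD, BOARD)
planes[:, 0] = 1.0
move_probs, expected_outcome = PolicyValueNets()(planes)
```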

Reinforcement learning

Learning deep neural network models requires a lot of training data, i.e. examples of states and moves, which can be found in transcripts of previous Go games between experts. Unfortunately, the amount of available historical data wasn’t sufficient to train good-quality networks (i.e. networks that make sufficiently good predictions), so another type of learning was also used to improve the networks: reinforcement learning. The general idea is to have the algorithm play and continuously update its policy model: it chooses a move to make based on this model, it gets a “reward” (positive or negative), it learns from that reward and adjusts its policy model accordingly; then, after the opponent has played, it chooses a move to make with the new policy, gets a reward, and so on. When reaching the end of a game, the reward is +1 for winning and -1 for losing, and a new game is automatically started. The rest of the time, the reward is simply 0.

A version of the algorithm running with limited computational resources was made to play against another version that had more computational power. This “self-play” provided more examples of Go moves and positions, and the game results provided rewards to learn from. One of the novelties of DeepMind’s work is the scale of this self-play training (millions of games). The team also used 1,920 CPUs and 280 GPUs, and training the networks took weeks.
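Putting the two previous paragraphs together, here is a minimal REINFORCE-style self-play sketch in PyTorch, reusing the policy network from the earlier sketch and assuming hypothetical game helpers new_game, game_over, legal_mask, play and winner_of. AlphaGo’s self-play policy training is a policy-gradient method in this spirit, though the details in the paper differ:

```python
import torch

def self_play_episode(policy_net, optimizer):
    """Play one game by sampling moves from the policy, then reinforce the
    winner's moves (+1) and discourage the loser's (-1); intermediate
    rewards are 0. Illustrative sketch only, with hypothetical helpers
    new_game(), game_over(board), legal_mask(board), play(board, move, player)
    and winner_of(board) -> +1 or -1.
    """
    board, player = new_game()    # player is +1 or -1
    log_probs = {+1: [], -1: []}  # log-probability of each chosen move, per player

    while not game_over(board):
        probs, _ = policy_net(board)           # distribution over the 19*19 points
        probs = probs * legal_mask(board)      # zero out illegal moves...
        dist = torch.distributions.Categorical(probs / probs.sum())  # ...and renormalize
        move = dist.sample()
        log_probs[player].append(dist.log_prob(move))
        board = play(board, move, player)
        player = -player                       # the opponent plays next

    w = winner_of(board)
    # REINFORCE update: raise the probability of the winner's moves, lower the loser's.
    loss = -w * (torch.stack(log_probs[+1]).sum() - torch.stack(log_probs[-1]).sum())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```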

Long-term decision making

Go is a game where players practice long-term decision making. It involves choosing from lots of options based on an understanding of complex situations, and it involves long-term planning (long sequences of actions that impact the situation) in an adversarial setting. According to Troy Anderson, author of The Way of Go, many leaders in Asian business, politics and the military apply the game metaphorically to “maximize their time and resources, seize the initiative, adapt to change and compete in an established market.” Go is also part of the curriculum in some Western business schools.

Demis Hassabis gives two examples of areas where AlphaGo’s techniques could be used: climate modeling, and using medical images to make diagnoses or treatment plans of complex diseases. Unfortunately, he doesn’t expand on them yet. Another area that comes to mind is investment. Recent progress in AI suggests that machines can already do a better job than humans in choosing early-stage companies to invest in among a given set of candidates (see the AI Startup Battle at PAPIs Connect and my previous article on Trusting AI with important decisions). But what about long-term investment strategies? One could think of investment decisions as game moves, and investment portfolios along with representations of the economics of the world as the equivalent of a game board. The game never actually ends, but your aim as a player is to make investment decisions that result in a “winning” portfolio in the long term. For that, you would use tree search to simulate sequences of investment decisions by you and other companies, and deep learning to extract useful representations of the world.

Challenges in applying AlphaGo’s technology in the real world

Advancing the technology is crucial, but another part of the difficulty in tackling real-world problems lies in modelling the world into a format that the machine can work with in an unsupervised way. This involves listing what makes up the state of the world and which actions can be carried out, and then representing states and actions (the equivalents of Go boards and moves) in the machine’s memory.

These representations can be at a very low level and poorly structured, as they are in Go (i.e. our matrix with ternary values). Deep learning can be used to extract useful representations, but:

  • You have to choose a type of neural network to work with (e.g. for Go they chose convolutional networks), an interconnection pattern between the layers of neurons, and a size (number of layers, number of neurons).
  • You need training data, i.e. examples of previous states of the world and of the actions/decisions taken by humans. Deep learning needs a lot of data to produce good-quality networks. Part of the data may come from self-play, but you still need some data to get started.
  • In reality, you’re often not sure of the state you’ll arrive in after taking an action (or, to put it differently, of the consequences of your decisions). You would need to account for (and model) that uncertainty in the look-ahead search, as sketched after this list.
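As a hypothetical sketch of that last point (not something AlphaGo needs, since Go itself is deterministic), if you can model the uncertainty as a probability distribution over next states, the look-ahead search backs up an expected value over outcomes instead of a single one:

```python
def expecti_search(state, depth):
    """Depth-limited look-ahead when action outcomes are uncertain (sketch).

    Assumes hypothetical helpers actions(state), outcomes(state, action)
    yielding (probability, next_state) pairs, and evaluate(state) -> float.
    Instead of one child per action, each action is scored by averaging
    over its possible next states.
    """
    if depth == 0:
        return evaluate(state)
    best = float("-inf")
    for action in actions(state):
        expected = sum(p * expecti_search(next_state, depth - 1)
                       for p, next_state in outcomes(state, action))
        best = max(best, expected)
    return best
```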

Conclusion

AlphaGo is amazing and I’m really impressed by what the DeepMind team has achieved.

If you don’t look behind the curtain, AI will invariably look magical, and any advances in AI will reinforce the feeling that it can do anything. It’s important to take the time to understand how AI works when considering its implications. We’ve seen here that extending what has been accomplished to other areas would require significant work and human intelligence. That’s why we speak of narrow AI, and it’s the only kind that exists today.
