What do games tell us about intelligence?

On measuring human intelligence from comparisons

Over the weekend Google DeepMind’s AlphaGo program defeated one of the world’s leading professional Go players, Lee Sedol, in a best-of-five unhandicapped Go match. The final tally was 4–1 in favor of AlphaGo, and a profound reality is upon us: a major stronghold of superior human intelligence has fallen.

This defeat raises important questions for research on human intelligence. What can we learn from continued advances in gameplay artificial intelligence? What role can games play in measuring continued progress in research on intelligence more generally? Is there an “endgame” for the role of games in AI research?

AI and games: a storied history

The use of competitive gameplay to study artificial intelligence dates to the early days of modern AI, when Arthur Samuel developed a Checkers program in 1956 that trained itself using reinforcement learning. In 1962 Samuel’s program defeated a relatively strong amateur American player (only in a single game; it lost all the other games in the match), and this small but widely celebrated victory became perhaps the first “machine defeats man” cultural moment for artificial intelligence.

Arthur Samuel demonstrating his Checkers program on the IBM 701 computer in 1956.

As Checkers programs became more advanced, they eventually began defeating top human players in the late 1980s. The final Checkers human-machine title match was organized in 1996, a blow-out win for the Chinook program. In 2007, the developers of Chinook published a paper in the journal Science announcing that Chinook had completely “solved” Checkers: an exhaustive search revealed that their program could no longer be defeated by any Checkers opponent, human or otherwise. Of note: the greatest human ever to play the game, Marion Tinsley, unfortunately passed away shortly before Chinook achieved full strength. It is an open question whether Tinsley would have been capable of drawing Chinook consistently.

As computer Checkers advanced, so did Backgammon: in 1979 Hans Berliner’s BKG 9.8 program defeated reigning Backgammon world champion Luigi Villa, winning the matchup 7–1. Berliner and Villa both felt that the program got lucky, but that didn’t stop newscasters from proclaiming, “I hope the robot doesn’t get into newscasting too — I bet he works cheap” — another early “machine defeats man” moment. Since BKG 9.8, TD-Gammon and later programs learned to play at and above human levels.

TD-Gammon was implemented using a neural network model and applied a reinforcement learning approach similar to that of Samuel’s Checkers program; AlphaGo does the same. Reinforcement learning allows a game program to learn techniques (for example, by playing against itself) that can in principle reach beyond what a human instructor can teach it. As a result, by studying TD-Gammon’s gameplay, Backgammon enthusiasts have in fact learned a great deal about the game of Backgammon.

There is an important lesson here: professional Go players are likely to gain new insights about the game of Go from studying AlphaGo’s gameplay, and it’s very likely that we will see human Go play improve over the next few years. Bill Robertie’s 1993 short essay Learning from the Machine, about his experience playing against TD-Gammon and his thoughts on neural networks and machine learning in games at the time, is well worth a read if you can get your hands on it. Meanwhile Petter Holme has an interesting recent commentary on this idea as it applies to AlphaGo.

And then there’s Chess. Before even Arthur Samuel built his Checkers program, Claude Shannon opined in 1950 that Chess was an excellent challenge problem for research in artificial intelligence: “a solution of this problem will force us to either admit the possibility of mechanized thinking, or to further restrict our concept of ‘thinking.’”

But Chess too gave way in a grand series of spectacles. When IBM’s Deep Blue defeated World Chess Champion Garry Kasparov in 1997, Go became the proverbial Hornburg (of Tolkien’s Helm’s Deep): the final stronghold of superior human intelligence that humanity fell back upon. And now, in a seemingly rapid chain of developments, it too has fallen.

So where does that leave us? Games may have outlived their usefulness for providing “human vs. machine” moments, but they have a more important future that is far from over: games provide a key way of measuring reasoning skills comparatively.

Games as comparisons

The lasting importance of games in AI research, beyond serving as a source of well-defined and widely understood challenge problems, is that they provide a unique means of measuring intelligence through task-based comparisons. Intelligence is notoriously difficult to measure, even in humans (cf. issues with standardized tests). Games offer simple and useful comparisons of skills, and in the case of board games, reasoning skills.

Psychometrician Louis Leon Thurstone pioneered the field of comparative judgement in the 1920s, motivated by observations that pairwise preferences were easier to elicit than absolute measures across many domains: “Which painter do you like more, Kandinsky or Rothko?” (Kandinsky.) is easier to answer than “How much do you like Kandinsky?” (A lot?) and “How much do you like Rothko?” (A little?). Thurstone showed that after asking a judge to make many pairwise comparisons between, say, different paintings, one can then seek out (via an optimization procedure) an ordering of the paintings that is maximally consistent with the gathered pairwise preferences. This approach is sometimes also called “ranking from comparisons,” and is used widely across machine learning.

Kandinsky or Rothko? Comparison judgements can often be easier to elicit than absolute judgements.
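
To make “ranking from comparisons” concrete, here is a minimal sketch in Python of a simplified Thurstone-style scaling: each item’s score is the average of the inverse-normal-transformed rates at which it was preferred over the other items. This is only an illustration under stated assumptions, not the procedure from Thurstone’s papers or from any production ranking system. The function name thurstone_scores, the clipping parameter eps, the third painter (Klee), and all of the preference counts are invented for the example.

```python
from statistics import NormalDist


def thurstone_scores(items, wins, eps=0.01):
    """Simplified Thurstone-style scaling from pairwise preference counts.

    wins[(a, b)] is the number of times a was preferred over b.
    eps clips preference rates away from 0 and 1 so that the inverse
    normal CDF stays finite (an assumption made for this sketch).
    """
    inv_cdf = NormalDist().inv_cdf
    scores = {}
    for a in items:
        z_values = []
        for b in items:
            if a == b:
                continue
            n_ab = wins.get((a, b), 0)
            n_ba = wins.get((b, a), 0)
            total = n_ab + n_ba
            if total == 0:
                continue  # this pair was never compared
            p = min(max(n_ab / total, eps), 1 - eps)
            z_values.append(inv_cdf(p))
        # Average z-score against all compared opponents
        scores[a] = sum(z_values) / len(z_values) if z_values else 0.0
    return scores


# Made-up preference counts, purely for illustration
painters = ["Kandinsky", "Rothko", "Klee"]
wins = {
    ("Kandinsky", "Rothko"): 7, ("Rothko", "Kandinsky"): 3,
    ("Kandinsky", "Klee"): 6, ("Klee", "Kandinsky"): 4,
    ("Klee", "Rothko"): 6, ("Rothko", "Klee"): 4,
}

scores = thurstone_scores(painters, wins)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these invented counts the recovered ordering puts Kandinsky first, Klee second, and Rothko third, which is consistent with every pairwise majority in the data.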

Rather than considering a human judging paintings, we can instead have a game “judge” players (both humans and computers). Thurstone’s model is the basic idea (ignoring several technical details like tie games and how to handle new players) behind the Elo rating system developed for Chess by Arpad Elo, where players are assigned numerical ratings based on their history of games against other rated Chess players. Elo ratings are calibrated so that a player who is 200 points above another player is predicted to be “chosen” by Chess 75% of the time in a match-up. As a result, if the world’s top ranked player Magnus Carlsen (Elo rating: 2851) played the 100th ranked player Loek van Wely (Elo rating: 2653) tomorrow in a game, a large-scale analysis of historical gameplay predicts that Carlsen has about a 75% chance of beating Van Wely. Elo ratings are also widely applied to predict the outcomes of other sports. Moreover, it was recently revealed that Tinder uses Elo ratings to predict dating matches.
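
As a rough illustration of this calibration, the sketch below uses the logistic expected-score formula that is commonly used with Elo ratings, where a rating gap of d points maps to a predicted score of 1 / (1 + 10^(-d/400)). The function names and the K-factor of 20 are assumptions chosen for the example, not details taken from this essay or from any particular federation’s rules. Note also that the predicted “score” counts a draw as half a win, so it is not strictly a win probability.

```python
def expected_score(rating_a, rating_b):
    """Predicted score for player A against player B (logistic Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def updated_rating(rating, expected, actual, k=20):
    """Move a rating toward the observed result.

    actual is 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    k (the K-factor) controls the step size; 20 is an arbitrary choice here.
    """
    return rating + k * (actual - expected)


# Ratings quoted in the text above
carlsen, van_wely = 2851, 2653

p = expected_score(carlsen, van_wely)
print(f"Predicted score for the higher-rated player: {p:.2f}")

# If the higher-rated player wins, the rating rises only slightly,
# because the win was already the expected outcome.
print(updated_rating(carlsen, p, actual=1.0))
```

Under this logistic curve a 198-point gap gives a predicted score of roughly 0.76, in line with the 75% rule of thumb for a 200-point gap.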

Elo ratings as a measure of “Chess skill” don’t just apply to humans: since Deep Blue beat Kasparov, several computer programs have gone on to achieve Chess Elo ratings well over 3300. These scores mean that they are predicted to almost always defeat essentially any human adversary. They also mean that there exists a measuring stick for comparisons beyond human ability in Chess: we can compare high-end programs against each other to define the notion of “skill” beyond human gameplay, in Chess and elsewhere.

In a series of excellent blog posts and research papers, computer scientist and International Master-level Chess player Ken Regan has explored the concept of a ratings horizon in Elo ratings for Chess: modern computer programs increasingly draw against each other, and Regan notes that we are steadily approaching the point where Chess programs may not lose to each other, or to any human.

Chess is close to running its course as a yardstick of reasoning. And now computer programs are passing the human horizon of Go. Go likely has a great deal of additional “measurement” to offer us as a comparator, but the natural question looking ahead is: what games can evaluate more advanced reasoning skills than Go? Can Go on a larger playing board fill that role?

An interesting contender for further research on reasoning skills is Poker (and related card games). Poker AI is very different from Chess or Go. First, there’s nothing computationally demanding about computing odds in Poker. Second, Poker does not have “perfect information”: players have different information about the state of the game, paving the way for deception to play a much larger role. Checkers, Chess, and Go, meanwhile, all test reasoning skills with perfect information; Poker reasoning is a different beast, and a contender for novel progress in AI research on reasoning.

A Nao robot taking a shot on goal during the 2013 Robocup in Eindhoven, Netherlands.

What other games test reasoning skills? Competitions such as RoboCup, an all-robotic soccer tournament, have long tested the mechanical capabilities of machines. As roboticists continue to advance the state of the mechanical art, such tournaments form an interesting venue for testing reasoning as well. Car racing, rich with strategy, is another possibility. Perhaps it’s time to enter self-driving cars in a NASCAR or Formula 1 race? Cruise control is, after all, one of the great early successes of artificial intelligence. To state the obvious: a successful NASCAR AI program would be very different from a successful autonomous vehicle AI, but it nonetheless has potential as a comparison game for reasoning skills.

Going back also to absolute (as opposed to comparative) evaluations of reasoning skills: even if standardized intelligence and competency tests have flaws, it’s worth considering how computer programs fare. Researchers at the Allen Institute for AI are working on that, and it’s certainly an exciting effort.

That leaves us with one comparison game that I’ve intentionally avoided thus far, even though it pre-dated all the others as a measure for human intelligence: a game famously conceived of by Alan Turing.

Turing’s Imitation Game: a comparison of humanity

The original Imitation Game (see Turing’s 1950 paper, “Computing Machinery and Intelligence”) went as follows: a female and a male subject are seated out of view, and an interrogator (of either gender) is tasked with discerning which of the two subjects is female through written communication with both. The challenge, for the male subject, is to “imitate” female behavior. Thus far no computer programs are involved. Turing then goes on to ask: can a computer program imitate a woman as well as a man can imitate a woman? When a human interrogator asks questions in an attempt to discern which of the two subjects is female (the woman or the computer program), are there programs for which a reasonable interrogator would choose the program?

Alan Turing, ca. 1935.

The above differs slightly from the popularized version of the Imitation Game that asks an interrogator to try to identify which of two subjects is human and which is machine. In the gendered version, the interrogator doesn’t know that there’s a computer program playing the game, while in the de-gendered and popularized “Turing test” version, the interrogator knows. But either way, Turing’s key insight was to define human intelligence as comparative, as opposed to attempting an absolute measure.

What I find most interesting about the de-gendered Turing test (“which subject is human, which is machine?”) is that it suggests the possibility that one day we may lose to computers at our own game. To be sure, in order to win such an Imitation Game a machine must be able to replicate all the fallible traits of human intelligence: from written typos to bounded analytic reasoning to the nuances of human cognitive and behavioral biases. But given all the ongoing advances, emulation of human intelligence may not be that far off. And comparisons give us a quantitative way to measure progress.

Imagine organizing a “Turing tournament” where all the subjects were human, but an interrogator was told that half of the subjects were machines. Tasked to determine which subjects were human and which were machine, the interrogator would be forced to choose which subject was “more human.” As a result, it is possible to measure “how human” each human is. Or at least: how well each human performs human intelligence.

Taking this a step further, there’s no reason to believe that computer programs can’t “out-human” us, achieving Elo ratings in the imitation game much higher than any human. This observation is particularly true if the interrogator in the game is human; the natural next step would be to put in place a machine interrogator, which would probably be able to discern the difference between subjects better than any human. As a first step in this direction, research on CAPTCHAs targets precisely this task of discriminating between machines and humans.

But beyond CAPTCHAs, at what point can a machine no longer tell the difference between a human and a machine? Essentially (cf. the omnipotence paradox): will a computer program ever be able to design a query protocol so discerning that even the program can’t deceive it?

Pulling back from the brink, there’s something else that separates an analysis of a Turing tournament from Chess or Go or really any other game. In most games, the value for AI research is in pushing the development of certain narrow reasoning skills. In Turing’s task, however, the skill itself is human intelligence broadly considered, and it therefore becomes relevant to ask questions about the game itself that are arguably of less general interest for other games.

For any comparison-based game, an intriguing measure of the complexity of that game itself is the so-called “depth” of the gameplay, as measured by the range of the Elo ratings for a player population. A depth of 1 is typically defined as a 200 Elo point range, meaning there exist two players where one is predicted to beat the other with modest certainty (75%). This basic range suggests there’s at least some skill involved (as opposed to pure chance). A depth of 2 means there is a chain of three players each 200 points apart, spanning a total of 400 points. The idea of depth is that a wide range in Elo ratings suggests the top player has skills that a player 200 points below them does not, who in turn has skills that a player 200 points below them does not, and so on. The wider the range, the more skill is arguably inherent to the game.
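
The arithmetic behind depth is simple enough to sketch in a few lines of Python. The function name game_depth, its class_width parameter, and the sample ratings below are all hypothetical; the only thing taken from the text is the convention that one unit of depth corresponds to a 200-point Elo range.

```python
def game_depth(ratings, class_width=200):
    """Rating range of a player population, in units of class_width.

    With the 200-point convention described above, a population spanning
    400 Elo points has depth 2.0.
    """
    if len(ratings) < 2:
        return 0.0  # a single player (or none) spans no range
    return (max(ratings) - min(ratings)) / class_width


# A chain of three players, each 200 points apart
print(game_depth([1000, 1200, 1400]))

# A hypothetical population with a wider spread of skill
print(game_depth([1000, 1450, 1900, 2400, 2851]))
```

This sketch uses only the extremes of the population, so it assumes that intermediate players actually exist along the chain; a more careful estimate would check that the gaps between neighboring ratings can be bridged.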

Robertie’s table on the depth complexity of various games.

Bill Robertie described, in a short letter to Inside Backgammon, his finding that Backgammon had a depth of 8, Chess 14, and Go 40. So what about the Imitation Game? What is the depth of the human game? Here, all of a sudden, the depth of the game itself says something directly about human intelligence, not just the complexity (or lack thereof) of a given board game. Beyond what we can learn about intelligence from the computer play of board games: could studying comparisons made by the Imitation Game provide insights about the nature of human intelligence as well? Computer programs have taught us new things about Backgammon, Chess, and Go, but can they also teach us about ourselves?

I’ll close by emphasizing that artificial intelligence research isn’t all just for fun and games: as machines continue to excel at an increasing number of tasks, it will be increasingly important to consider how we can integrate them usefully into our lives. What decisions do machines make better than us, and what decisions should we continue to keep for ourselves? And where can we work together as human-machine teams to achieve new kinds of complementary intelligence? For more on the plentiful potential for positive societal impacts from AI research, see Eric Horvitz’s thoughts framing the ambitious One Hundred Year Study on AI convened at Stanford last year.

Overall, games provide a rich framework for measuring progress in machine reasoning capabilities through competitive comparisons. Computer Go will likely continue to be relevant to AI researchers for quite some time, and it will be exciting to see how the related wide range of challenges is met by the broad AI research community.

In the meantime, if you need me, I’ll be playing Calvinball.

Posted 3/11/2016; updated 3/15/2016. I am grateful to Eric Horvitz, Jon Kleinberg, Stephanie Safdi, and Sean Taylor for many helpful discussions.

Johan Ugander is an Assistant Professor of Management Science & Engineering at Stanford University. He has recently spent a lot of time thinking about comparisons.