Understanding OpenAI Five

In this blog post (my first on Medium o_o), I will explain the challenges in making a Dota bot, appreciate how OpenAI addressed these challenges, and lay out its fundamental flaws. I hope that by explaining how it works in layman’s terms, we can watch its upcoming matches with the right mindset.

The OpenAI bot is trained like a pigeon: It is conditioned to associate short-term objectives such as last hits and surviving with positive reinforcements. By accomplishing a sufficiently large number of these short-term goals, the bot wins the game by coincidence, without planning its victory from the start.

The OpenAI Five Problem

Every project needs the right problem statement. I believe OpenAI Five’s problem statement is as follows: beat a team of humans at Dota, in any way possible, with a program. This view is both powerful and liberating.

It is powerful as it creates a tangible spectacle and H Y P E. Like how AlphaGo pwnt Go, OpenAI would like to pwn Dota. Anyone would be very proud of an achievement of the form: we are the first ___ that beat humans in ___ .

It is liberating in that it frees OpenAI from any “moral principles”: it can attempt to reach its goal by any means. Should the computer use a mouse and a keyboard? No, let’s give it game APIs. Is there a limit on the total number of games the bot can train on? Sure, how about 180 years of Dota per day.

The key is that we need to evaluate what problem OpenAI actually managed to solve, rather than be misled into thinking OpenAI solved a more challenging one. From reading their blog, OpenAI is indeed very careful not to overstate its achievements. Yet it has every reason to hope the public will hype those achievements out of proportion.


Learning Is Better Than Programming

Simply put, an AI agent is playing Dota well if it acts rationally in every game state it encounters. As humans, we understand intuitively that taking last-hits is in general a good action, but taking last-hits while a fight is happening is generally a bad action. However, transferring this intuition to an AI agent has been notoriously challenging since the dawn of AI.

The forefathers of AI erroneously equated AI with programming. “We’ll just hard-code every single behaviour of the agent”, they had thought. The result was mammoth efforts to program every possible interaction the agent might encounter. However, there were always some interactions that went unanticipated, and these hard-coded agents failed in spectacular fashion.

Rather than explicitly programming the agent’s behaviours, Reinforcement Learning (RL), a sub-field of AI, opts for a different approach: let the AI interact with the game environment on its own and learn what the best actions are. In recent years, RL has shown indisputable results, such as defeating the best Go player and besting a wide range of Atari games. The roots of RL stem from behaviourist psychology, which states that all behaviours can be encouraged or discouraged with the proper stimulus (reward / punishment). Indeed, you can teach pigeons how to play ping pong using RL.


Challenges of RL

Applying RL to Dota, however, has some considerable challenges:

Long horizon — A key challenge in RL is that you often only obtain a reward signal after executing a long and complex sequence of actions. In Dota, you need to last hit, use the gold to buy items, pwn scrubs, and make a push before finally destroying the ancient, thereby obtaining a reward from winning. However, at the start the agent knows nothing about Dota and acts at random. The chance of it randomly winning the game is effectively 0. As a result, the agent never observes any positive reinforcement and learns nothing.

In Dota, the game horizon is long. The chance of winning by acting randomly is infinitesimally small

Credit assignment — Even when the ancient is destroyed, which actions are actually responsible for it? Was it hitting the tower, or using a truckload of mangoes with full mana? Judging which specific actions (out of a long sequence of executed actions) are responsible for your victory is the credit assignment problem. Without any prior knowledge, your best bet is a uniform assignment scheme: credit all actions that resulted in a victory, and discredit all actions that resulted in a defeat, hoping the right actions are credited more often on average. This approach works on short games like Pong and 1v1 Dota, and is the optimal approach if you can afford the computation and the patience. Indeed, AlphaGo Zero was entirely trained in this fashion, with only a +1 or -1 reward signal for winning or losing the game. For Dota though, there are simply too many actions to account for, and OpenAI decided it is best to coach our pigeon more directly.

Which of the agent’s actions actually contributed to winning the game? Without any game knowledge, the best one can do is credit all the actions evenly. Here, the agent will erroneously associate both last-hitting and dying with winning the game.
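To make the uniform assignment scheme concrete, here is a minimal Python sketch. The action names, the toy games, and the `uniform_credit` helper are all made up for illustration; this is not how OpenAI's code looks, just the idea of "every action in a won game gets +1, every action in a lost game gets -1, then average".

```python
from collections import defaultdict

def uniform_credit(episodes):
    """Average the final outcome (+1 win, -1 loss) over every action seen in each game."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for actions, outcome in episodes:
        for action in actions:
            # Every action in the game receives the same credit: the final outcome.
            totals[action] += outcome
            counts[action] += 1
    return {action: totals[action] / counts[action] for action in totals}

# Hypothetical toy games: (actions taken, +1 for a win / -1 for a loss).
episodes = [
    (["last_hit", "die", "hit_tower"], +1),
    (["last_hit", "hit_tower"], +1),
    (["die", "die", "last_hit"], -1),
]
credit = uniform_credit(episodes)
```

In the first game alone, "die" earns exactly the same +1 as "last_hit". Only after averaging over many games does dying start to look worse than last-hitting, which is why this scheme needs so much data.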

The OpenAI Solution — Reward Shaping

One pragmatic way of addressing the challenges of long horizons and credit assignment is reward shaping, where one breaks down the eventual reward into small pieces to directly encourage the right behaviours at each step. The best way of explaining reward shaping is by watching this pigeon training video. The pigeon would never spontaneously spin around, but by rewarding each small step of a turn, the trainer slowly coaxes the pigeon into the correct behaviour. In OpenAI Five, rather than learning that a last-hit is indirectly responsible for the ultimate victory, a last-hit is directly rewarded with a score of 0.16, whereas dying is punished with a score of -1. The agent would immediately learn that last-hitting is good while dying is bad, irrespective of the ultimate outcome of the game. Here is the full list of shaped reward values.

By associating short-term goals like last-hitting and dying with immediate reward and punishment, our pigeon can be coaxed into the right behaviour.

The challenge of reward shaping is that only certain behaviours can be effectively shaped. Killing and last-hitting have immediate benefits, so intuitive scores can be assigned to each when these events occur. However, compared to last-hitting, the right scores for ward and smoke usage are very nebulous, something even the OpenAI researchers do not have a good answer to.
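As a rough sketch of what a shaped reward looks like in code, the snippet below uses the two values quoted above (0.16 for a last-hit, -1 for dying). The event names and the `shaped_reward` function are hypothetical, and treating unlisted events as worth 0 is just a choice made for this sketch.

```python
# Shaped reward values quoted in the text; the event names are hypothetical labels.
SHAPED_REWARDS = {
    "last_hit": 0.16,
    "death": -1.0,
}

def shaped_reward(events):
    """Sum the immediate reward for the events observed in one time step.

    Events without an entry (for example "place_ward") contribute 0 in this
    sketch, because there is no obvious per-event score to give them.
    """
    return sum(SHAPED_REWARDS.get(event, 0.0) for event in events)

# Two last-hits and one death in the same step: roughly 0.16 + 0.16 - 1.
step_events = ["last_hit", "last_hit", "death"]
reward = shaped_reward(step_events)
```

The agent gets this number right away, at the step where the events happen, instead of waiting until the end of the game to find out whether any of it mattered.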


The OpenAI Muscle — Self Play

Everyone has catch phrases. My adviser tends to say “sounds like a plan!” at the end of our meetings. OpenAI too has catch phrases, and one of them is this: “How can we frame a problem in such a way that, by simply throwing more and more computers at it, the solution gets better and better?” One of the answers they have settled on is self-play; you can watch an explanation by Ilya Sutskever here. The two take-aways from the talk are that self-play turns compute into data and that self-play induces the right curriculum.

By playing against itself in the task of maximising short-term rewards, the pigeon learns how to last-hit and how not to die.

Turning compute into data — With self-play, one can spawn thousands of copies of the game environment and dynamically generate the training data by interacting with them. Self-play is bound purely by the amount of compute one can muster. And if compute is muscle, OpenAI is on steroids.

Inducing the right curriculum — A baby in a college-level class will learn nothing. Training an agent is often no different: it is easier to train an agent by first allowing it to accomplish a set of simple tasks, then gradually increasing the complexity (a curriculum) until it finally learns the set of complex tasks. Competitive self-play naturally induces a curriculum of increasing difficulty: in the beginning, the task of beating yourself is easy, as you are bad at the game. But as you get better, it gets harder and harder to beat yourself.
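The "compute into data" point can be illustrated with a toy sketch. `ToyEnv` below is a made-up stand-in for a game environment (it has nothing to do with the real Dota API), and the agent just acts at random; the only thing the sketch shows is that the amount of training data grows with the number of environment copies you can afford to run.

```python
import random

class ToyEnv:
    """Made-up stand-in for a game environment; NOT the Dota API."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        # Toy dynamics: the action shifts the state, the reward is random.
        self.state += action
        reward = self.rng.choice([0.0, 0.16])
        return self.state, reward

def collect(num_envs, steps_per_env):
    """Run every environment copy and gather (state, action, reward, next_state) tuples."""
    envs = [ToyEnv(seed) for seed in range(num_envs)]
    data = []
    for env in envs:
        for _ in range(steps_per_env):
            previous_state = env.state
            action = env.rng.choice([-1, 1])  # random policy, like an untrained agent
            next_state, reward = env.step(action)
            data.append((previous_state, action, reward, next_state))
    return data

small_batch = collect(num_envs=4, steps_per_env=10)
large_batch = collect(num_envs=8, steps_per_env=10)
```

Doubling the number of environment copies doubles the number of collected transitions; no human has to label or record anything.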

Since June 9th, at 180 years of Dota games played per day, OpenAI has played 10000 years of Dota using self-play, which is longer than the existence of human civilisation. Just let that sink in for a second.


The Deceit of Breadcrumbs

To recap, OpenAI carefully constructed a trail of breadcrumbs of short-term rewards that the pigeon, evolved through 10000 years of Dota playing, is an expert at obtaining: Pigeon sees creep, nom nom; Pigeon sees you, kills you, nom nom; Pigeon is at your base after killing you, sees your buildings, nom nom and you lose the game. — TL;DR of OpenAI bot

Since the OpenAI agents are trained to maximise short-term rewards, the concept of winning is literally under a smoke, and won’t become visible until the AI is sufficiently close to it. This makes the agent oblivious to long term strategic maneuvers such as forming a push around an important item timing.

The pigeon, occupied with maximising short-term rewards, does not see the victory far in the distance

The Dilemma Of RL

We started by explaining the challenges of building a Dota AI using RL: long horizons and credit assignment. We explained that for a game like Dota, a uniform assignment scheme like that used for AlphaGo Zero and the previous Dota 1v1 bot would not be enough. We went through OpenAI’s decision to explicitly shape the rewards during learning, and the resulting danger of short-sightedness. It turns out that you can settle for a middle ground between uniform assignment and shaped rewards through the use of a discount factor, which I can explain in detail in a future blog post. For now, we can think of the discount factor as a knob one can tune between 0 and 1: close to 1, distant rewards count almost as much as immediate ones (pure uniform assignment); close to 0, only the immediate shaped rewards matter.
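For the curious, here is a minimal sketch of how a discount factor works, using the standard discounted-return recursion G_t = r_t + gamma * G_{t+1}. The reward sequence is a made-up toy example and the function name is my own.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step t, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Toy game: nothing happens for three steps, then a reward of 1 at the end.
rewards = [0.0, 0.0, 0.0, 1.0]

far_sighted = discounted_returns(rewards, gamma=1.0)    # every step sees the full final reward
middle = discounted_returns(rewards, gamma=0.5)         # earlier steps see a fading share of it
short_sighted = discounted_returns(rewards, gamma=0.0)  # each step sees only its own reward
```

With gamma at 1, every action is credited with the final reward, which is the uniform assignment scheme again. With gamma at 0, an action is judged only by the reward it receives immediately, so the shaped rewards are all the agent ever sees.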

You can’t have both the fish and the bear’s paw (鱼与熊掌不可兼得) / You can’t have your cake and eat it too

This is a fundamental trade-off: the more you shape the rewards, the more near-sighted your bot. On the other hand, the less you shape the rewards, the more opportunity your agent has to explore and discover long-term strategies, but the greater the danger that it gets lost and confused. The current OpenAI bot is trained using a discount factor of 0.9997, which seems very close to 1, but even then only allows for learning strategies roughly 5 minutes long. If the bot loses a game against a late-game hero that managed to farm up an expensive item for 20 minutes, the bot would have no idea why it lost.
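One way to sanity-check the "0.9997 means roughly 5 minutes" figure is to compute the half-life of a reward: how far into the future a reward can be before it counts for only half as much. The sketch below assumes the bot takes 7.5 steps per second (a 30 fps game, acting on every fourth frame); that rate is my assumption and is not stated anywhere in this post.

```python
import math

def half_life_seconds(gamma, steps_per_second):
    """Seconds until a future reward is discounted to half its value."""
    # Solve gamma ** steps == 0.5 for steps.
    steps = math.log(0.5) / math.log(gamma)
    return steps / steps_per_second

STEPS_PER_SECOND = 7.5  # assumption: 30 frames per second, one action every 4th frame

half_life_minutes = half_life_seconds(0.9997, STEPS_PER_SECOND) / 60
```

Under that assumption the half-life comes out at a little over 5 minutes, which is tiny next to a 20-minute farming plan: a reward 20 minutes away has been halved about four times before the bot gets to it.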


I’m pretty tired so I’ll stop now. To summarise, the OpenAI bot is a pigeon. Given enough time, it can discover optimal strategies and movements about 5 minutes in length, but ultimately it cannot formulate a winning strategy from the beginning of the game to the end. Its behaviours are strongly incentivised by the rewards OpenAI has crafted, and it may forsake winning the game in favour of obtaining more and more short-term benefits. For the most part, that is also how humans play Dota: we too must focus intensely on short-term benefits, but that’s not to say we cannot formulate long-term plans.

If you read this far, thank you so much, and give me a high-five! yeah!

— evan


Follow Ups: We had a nice discussion about this piece on Reddit, in /r/Dota2. One neat idea is to use the shaped rewards as a “coach” that gives the bot context, so that altering this context makes the bot behave differently at test time. Another is about how one should judge whether the bots have truly surpassed humans.