OpenAI Five DotA: Next Challenges

Wah Loon Keng
Aug 7, 2018 · 10 min read


If you don’t already know, OpenAI Five just played against some pro players during its benchmark event. It was streamed on Twitch; more details here:

As someone who has played DotA/DotA 2 since high school and works in reinforcement learning, below are some thoughts on the challenges OpenAI Five must solve before going up against world-class teams at TI 2018.

As a quick recap, Five won the first two matches against the humans with great confidence and seamless teamwork. The gameplay was definitely interesting and felt like 6–7k MMR level games. Interestingly, in the third match Five had to play heroes drafted adversarially by the audience, and it lost due to the handicap.

A crucial thing to notice is that the game is still quite restricted, so the human players were also handicapped in the matches:

  • the five human players were individuals gathered separately to play, like in a pub match. Since they do not usually play together like a clan does, there was less team synergy, as observed. Considering teamwork alone, they were far less coordinated than Five, and this was the major difference in strength between the teams. DotA requires teamwork, so the humans, who were not used to playing together, were at a disadvantage.
  • both teams were restricted to just 18 heroes to pick from (Earthshaker, Razor, Shadow Fiend, Viper, Tide, Sniper, Witch Doctor, CM, Gyrocopter, Necrolyte, Lich, Prophet, Sven, Axe, QOP, Lion, Riki, Slark). This severely restricts the draft strategy for humans. For example, humans may have some intuitive counters to the heroes picked by Five, but those were not available. Moreover, the available heroes are not necessarily the strongest or the players’ favorites, nor can they always be combined into an ideal draft. There were not a lot of strong combo heroes. To get an idea, out of the 20 most picked and banned heroes at TI 7, only 5 (Earthshaker, Lich, Prophet, Sven, QOP) were available.
  • Pressure and surprises. Okay, this is not strictly a handicap. Players mentioned that they constantly felt pressured because they could not get a break from Five. How much of this is due to the knowledge that they were playing against machines is unknown. Normal expectations of human opponents could not be applied either, so the humans were somewhat surprised by Five’s style.
The 18 heroes available, with the win probability predicted by Five updated progressively. I know… either I’m old, or my TV doesn’t have a screenshot function.

I was watching the event with my friend Laura, who also works with me on deep RL. We were discussing the gameplay while thinking about how certain features may be implemented, and how certain behaviors may arise from the algorithm (vanilla policy gradient, with smart feature engineering and a lot of compute). Of course, besides the strengths, a few shortcomings became apparent as well.

For now, I can say that the main strengths of Five are strong teamwork, simple but well-executed strategies, and the ability to maintain peak performance throughout the game. With these, they were able to beat the humans. However, world-class teams have those strengths too, and more. There are several problems to be solved if Five wants to defeat the world-class teams. If the version of Five that goes up against the TI finalist teams is not too different from the current version, I would bet my money on the humans. These challenges are, not coincidentally, the well-known hard problems in deep RL. We will break them into several topics.

If you need some introduction to how RL is applied to DotA, here’s a nice and humorous introduction by Evan Pu: Understanding OpenAI Five.

Reward Design

We now know that the reward signal used by Five is not sparse, but we do not know the exact details. A good guess is that the dense intermediate reward signal is a function of some of these obvious quantities:

  • gold advantage: broadcasts always use this to compare team/hero advantage; the more gold earned, the better. Since gold is spent on items, consider the cumulative gold earned, not the current amount.
  • experience advantage: likewise, a bigger experience lead means the team is further ahead in the game and is stronger.
  • creeps, heroes, towers killed: these correspond to experience, but counting kills serves as a more direct positive signal, and getting killed as a negative one.
  • ability levels: a higher ability level is more powerful, and this would also encourage heroes to level up. However, some heroes may want to delay upgrading certain abilities, such as Doombringer’s LVL? Death, which deals bonus damage when the target’s level is a multiple of a specific number.
  • damage, regen, armor, and other attributes amplified by items or skill effects: these indicate the good/bad effects of items, e.g. Assault Cuirass adds attack speed and armor; getting hit by Desolator reduces armor.
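
To make this concrete, here is a purely speculative sketch of such a dense reward: a weighted sum of per-tick changes in the team’s advantage on each quantity. Every weight and field name below is my own guess, not OpenAI’s actual design.

```python
# Hypothetical dense reward: a weighted sum of per-tick changes in the
# team's advantage (our value minus the enemy's) for each quantity.
# All weights and field names are invented for illustration.
WEIGHTS = {
    "gold_earned": 0.002,      # cumulative earned gold, not current gold
    "experience": 0.001,
    "hero_kills": 1.0,
    "tower_kills": 2.0,
    "creep_kills": 0.05,
    "ability_levels": 0.2,
}

def dense_reward(prev_state: dict, curr_state: dict) -> float:
    """Reward = weighted sum of changes in (our team - enemy team) advantage."""
    reward = 0.0
    for key, w in WEIGHTS.items():
        prev_adv = prev_state[key] - prev_state["enemy_" + key]
        curr_adv = curr_state[key] - curr_state["enemy_" + key]
        reward += w * (curr_adv - prev_adv)
    return reward
```

Using advantage deltas rather than raw values keeps the reward roughly zero-sum between teams, which matches how the commentary compares teams by gold and experience leads.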

Even with these reward signals, there are some behaviors that may seem to harm rewards but are actually proper, strategic play. These behaviors will be tricky for the reward signal to encourage, unless one relaxes the reward signal, but that would mean longer training; there is a tradeoff. Some of the tricky cases are:

  • Techies’ Suicide: the hero chooses to commit suicide and loses a life, and it is not guaranteed to grant Techies any kills. While waiting to respawn, Techies also loses the farming time, and the team may lose some map control. The reward signal may actually encourage Techies to just farm like a normal support and never suicide.
  • Bloodstone: this item also kills its carrier, although it heals surrounding allies. A proper usage is to deny the enemy a kill, or to sacrifice yourself to heal crucial ally heroes. A great replay is this Tinker played by EE, titled “EESama Died for Your Sins”. During a push, Tinker used Bloodstone to commit suicide while healing his allies, who were taking heavy damage from Shadow Fiend.
  • Armlet toggle: Armlet is an item that adds strength and hp when toggled on, but drains hp over time while active. When you toggle it off, the lost strength translates into lost hp. It is a pretty bad item to use if the reward signal prefers higher hp. But here’s this masterful play by Dr. Lee using Armlet on Sniper (which is also an unusual build). Look at how Sniper is constantly one hit from death but keeps getting kills. This is one of my favorites.
  • Oracle’s ultimate, False Promise: this skill delays all damage and healing received until the end of a long duration, then applies the summed damage and doubled healing. This can give a dying ally more time to keep fighting, and if the healing is sufficient, the hero will end up surviving. However, the reward is only observed after the spell effect ends, which can take 8–12 seconds.
  • Morphling’s Attribute Shift, Terrorblade’s Sunder, and Soul Ring all trade hp in some way, so they are tricky cases for reward design as well.
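
A back-of-envelope calculation (my numbers, purely illustrative) shows why an 8-second delayed payoff like False Promise is hard to credit: with a per-tick discount factor gamma, a reward observed T ticks later contributes only gamma^T as much as an immediate one.

```python
# Toy illustration of how much a delayed reward is discounted.
# The tick rate and gamma are illustrative, not OpenAI's actual settings.
gamma = 0.99
ticks_per_second = 30
delay_ticks = 8 * ticks_per_second   # 240 ticks for an 8-second delay
discount = gamma ** delay_ticks
print(f"an 8-second delayed reward is worth {discount:.2%} of an immediate one")
```

So the heal at the end of False Promise is worth under a tenth of an equal, immediate heal to the learner, even before accounting for the hp lost in the meantime looking like pure damage.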

Exploration/Exploitation Problem

We know that with sparse signals and a huge state/action space, exploration is difficult. This is something common to many RL algorithms, and it is even harder for those with low sample efficiency. Arthur Juliani from Unity wrote a great article illustrating a sparse reward problem PPO can’t solve, and an augmentation called Curiosity which addresses the issue:
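
The gist of that curiosity augmentation, as I understand it: train a forward model to predict the next state, and pay the agent its prediction error as an intrinsic bonus, so novel (poorly modeled) states become rewarding in themselves. A minimal sketch, with a linear model standing in for the neural networks used in practice:

```python
import numpy as np

# Minimal curiosity sketch: intrinsic reward = forward-model prediction error.
# A linear model stands in for the real networks; dimensions are arbitrary.
rng = np.random.default_rng(0)
state_dim, action_dim = 8, 4
W = rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))

def intrinsic_reward(state, action, next_state, lr=0.01):
    """Return the forward model's prediction error, then update the model."""
    global W
    x = np.concatenate([state, action])
    pred = W @ x                        # predicted next state
    err = next_state - pred
    W += lr * np.outer(err, x)          # one SGD step on squared error
    return 0.5 * float(err @ err)       # novelty bonus: large where model is wrong
```

Because the model improves on whatever it sees, revisiting the same transitions pays less and less, pushing the agent toward unexplored parts of the state space.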

Coming back to DotA, even though Five has a dense reward signal, it may still be insufficient. That reward signal may be enough to encourage easy-to-reach scenarios, and we do see a lot of common human strategies in the plays, such as last-hitting, creep-blocking, roaming, ganking, taking towers, and baiting. However, there are also common human behaviors that we do not see:

  • creep pulling: pulling lane creeps into the Neutrals to control the lane and get extra gold and experience
  • creep stacking: pulling the neutral creeps away at the right moment so that a new camp spawns, and repeating this to stack many camps to farm later for a lot of gold and experience
  • Roshan: OpenAI has acknowledged this as well. Through random exploration, it is hard to have all heroes gather at the Roshan pit at the right time and spend a long time and a lot of hp to kill it for the prize.

These strategies are hard for the algorithm to discover because it explores via random actions, whereas humans explore via deliberate planning. These strategies also involve much longer time horizons. The chance of doing these things randomly is astronomically small. Humans know what outcome they can get, then imagine and try out plans executed over a long time horizon to obtain the prize.
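
To put “astronomically small” in perspective, here is a crude estimate with numbers I made up: suppose there are on the order of 1,000 valid discrete actions per tick, a generous 10 of them count as roughly correct for the maneuver, and a camp stack needs about 60 consecutive roughly-correct ticks.

```python
# Crude estimate of the chance that pure random exploration pulls off
# one specific multi-step maneuver. All numbers are invented assumptions.
actions_per_tick = 1000
correct_actions = 10        # generously, how many actions count as "on plan"
required_ticks = 60         # consecutive roughly-correct ticks needed
p = (correct_actions / actions_per_tick) ** required_ticks
print(f"chance of a random camp stack: {p:.0e}")
```

Even with these generous assumptions the probability is around 10^-120, which is why dense-reward random exploration alone is unlikely to ever stumble on such plans.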

Then there are things that are slightly less difficult, but I suspect are nonetheless possible:

  • Invoker spells: this hero can combine the 3 orbs he has to create 10 different spells. He is quite hard for humans to master, and figuring out the proper combo sequence also takes a lot of trials for humans (imagination) and machines (simulation). There are a lot of things to control, especially if you buy a Refresher and an Aghanim’s. So, by random exploration, what are the chances of successfully pulling off those long combos? Here are some mad ones:
  • terrain exploitation — path blocking: we have seen games where Venomancer’s ward blocks a path to prevent an enemy from escaping. A narrow lane is easy to block, like in the replay below, but a harder trick is to block paths in between trees (unfortunately I could not find the replay). A newcomer may think the trees in DotA are just cosmetic, but a pro player knows where certain trees are, where paths exist among the trees, whether some lanes are narrow enough to block, etc. These can be explored by the algorithm with sufficient simulation.
  • terrain exploitation — river: the river is narrow, with banks on both sides. You can hook people over to your side and gank them, or corner your enemies in the river. An interesting double-hook replay below:
  • terrain exploitation — Roshan pit: the pit can be extremely dangerous, and may win or lose your team 6 million USD. The combo below made EG the champions of TI 5. Combos like these can of course be learned by the AI.

Creative (One-shot) Strategies

DotA TI events are so much fun to watch because of all the crazy, inventive things those top players can pull off. A lot of these plays have never been done before, but came as a split-second lightbulb moment for the players. They are extremely rare events, but it is precisely this rarity that makes them so memorable, and separates TI from normal pro matches.

  • Wisp juke: in the replay below, Wisp is about to be teleported back to its original place, where enemy heroes are waiting to slaughter it. While in the base, Wisp buys a Shadow Blade and triggers it just in time to become invisible and escape from the enemies. This strategy is uncommon, and it changes the hero’s item build as well, but trading gold for a life can make a huge difference at TI.
  • Fountain hook abuse: Na’Vi is one of the most creative teams out there. In a TI game they were losing, a Pudge–Chen fountain-hook combo helped turn the tide and secure Na’Vi a comeback. The combo was later deemed a glitch and subsequently removed from the game. We have seen deep RL discover various game exploits in the Atari games. DotA has had quite a lot of these discovered over time as well, so I’m really eager to see if the agents reproduce/invent these unusual scenarios, sometimes known as abuses. However, they can be so unusual that the AI cannot readily exploit them repeatedly without knowing their significance and how to reproduce them; humans, by contrast, notice and abuse them after seeing them only once. One-shot learning is crucial here.
  • gambles: Invoker’s Sunstrike, Pudge’s Hook, Mirana’s Arrow, etc. can be targeted even without vision of enemy heroes, so they are often used to finish off heroes that would otherwise have escaped. Because they are done blind, it is hard for random actions to explore and exploit them. First the agent needs to know where the escaping heroes may go, then where to place its bets. Only sometimes does such an action produce a good outcome, so over the course of training these actions may seem bad (e.g. wasting mana and cooldowns) to the algorithm. However, from just a few experiences, humans know that these actions can be promising, and will gamble on them. Some are shown below:
  • Dendi is the world’s best Pudge player, and it’s clear why. His hooks are calculated (a lot of mind games), and he hooks blind sometimes too. With training, of course, one can learn those intuitions as well, although they are not going to be perfect:
  • Invoker’s Sunstrike has global range, and the most amazing kills happen when the enemy hero is already nearly back at its base to heal:

Plain Creative Styles

Creativity combined with mastery is really what sets some players apart. There are ways to play a hero in such an unorthodox manner that some may not consider them common or optimal. I suspect these are just really narrow bands in the strategy space for the algorithm to come across and exploit. Some of the most creative, uncommon, masterful players play like no one else does:

  • AR1SE has the most unusual Magnus style, with skill combos enemies can’t run from:
  • Dendi is just plain fun to watch:
  • Mushi is brutal and has no mercy:
  • Sing Sing is one of the most unserious players. If you watch his pub stream, he usually just tries out stupid things, but they result in really interesting strategies. I bet he learns a lot from his fooling around too, although those plays are usually non-optimal, i.e. any algorithm that has nearly converged will not try such stupid things. The first video is how he plays seriously, the second is when he trolls in pubs. WARNING: he cusses a lot.

Conclusion

OpenAI acknowledges there are still gaps to be addressed. However, I’m not so sure that all of the challenges laid out above can be tackled with vanilla PPO and massive compute. If their goal is just to see how far they can push it, then some of the problems above are inherently out of the algorithm’s reach. Given the timeframe, it seems like they will stick with PPO plus some more feature engineering. In that case, I’d say the TI champions will win against Five.

Perhaps next year when better techniques which address exploration, reward design, planning, one-shot learning, etc. are combined, they’ll match TI-level play more closely. Either way, I’m very excited for their next game at TI 2018. Good luck OpenAI!

P/S: I might also write about hero drafting and item builds, with links to multitask and transfer learning, if time permits.

update: the next post is now available:
