A technical perspective on our machine learning-based bots for video games

Alex Borghi
WildMeta
Jan 19, 2021

While we have already released a blog post with a video demonstration of our bots playing Dota 2, we haven't yet had the opportunity to go into the details of the technology we've been working on and how we use machine learning to build AIs for video games.

Dota 2 as an RL environment

We implemented machine learning-based bots for Dota 2, a MOBA game developed by Valve, to demonstrate our reinforcement learning and imitation learning systems. We chose Dota 2 because it is available on Linux, features a Lua bot API, can send each team's state as a protocol buffer, has easily parsable replay files that can be used as demonstrations, and allows games to run at increased speed without graphics. Additionally, previous work from OpenAI provided valuable insights. One drawback of Dota 2 as an RL environment is that the game does not wait for the agent between time steps, which means our code has to be fast enough to keep up with the game. This becomes a problem when the game is sped up so much that the game step time drops below the neural network inference time. Another detail to take into account is that Dota 2 doesn't allow multiple instances of the game to run simultaneously on a single computer; we worked around this simply by using Docker.
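To make the timing constraint concrete, here is a small, self-contained sketch of how one can check whether policy inference keeps up with a sped-up game. The numbers and the dummy policy are purely illustrative placeholders, not our actual tick rate, speed-up factor or model:

```python
import time

def fits_step_budget(infer_fn, obs, tick_rate=30.0, speedup=4.0, trials=100):
    # At `tick_rate` steps per second and a `speedup`-times faster game,
    # each step leaves 1 / (tick_rate * speedup) seconds for inference.
    budget = 1.0 / (tick_rate * speedup)
    start = time.perf_counter()
    for _ in range(trials):
        infer_fn(obs)
    mean_latency = (time.perf_counter() - start) / trials
    return mean_latency <= budget, mean_latency, budget

def dummy_policy(obs):
    # Stand-in for a neural network forward pass.
    return [x * 0.5 for x in obs]

ok, latency, budget = fits_step_budget(dummy_policy, obs=[0.0] * 512)
print(f"inference {latency * 1e3:.3f} ms vs budget {budget * 1e3:.3f} ms "
      f"-> {'ok' if ok else 'too slow'}")
```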

Dota 2 is an interesting game for machine learning because of its rich game mechanics, its cooperative and competitive multi-agent nature, its partial observability and long horizon, as well as its rich observation and action spaces. In addition, it is possible to consider restricted versions of the game (e.g. 1v1, a subset of heroes, no courier) or to use small subsets of the original observation and action spaces to train agents. The game also supports modding and has a huge community. This makes Dota 2 a very interesting environment for RL that can be adapted to your needs.
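As an illustration of what such a restricted setup can look like, here is a hypothetical configuration sketch; the field names and values are made up for this post and are not our actual configuration format:

```python
from dataclasses import dataclass

@dataclass
class ScenarioConfig:
    mode: str = "1v1"                     # e.g. "1v1" or "5v5"
    heroes: tuple = ("shadow_fiend",)     # subset of heroes to train on
    use_courier: bool = False             # disable mechanics to simplify the task
    observation_keys: tuple = (           # small subset of the full observation space
        "self_hp", "self_mana", "enemy_hp", "creep_positions", "tower_hp",
    )
    actions: tuple = (                    # reduced action space
        "move", "attack_unit", "use_ability", "stop",
    )

cfg = ScenarioConfig()
print(cfg.mode, len(cfg.observation_keys), "observation features,",
      len(cfg.actions), "actions")
```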

Algorithms, system, implementation details

We designed our own framework to train agents on video games. Our system is generic and can be applied with minimal changes to many games (MOBA games but also other genres). Our implementation supports the PPG (Phasic Policy Gradient) framework in combination with the PPO (Proximal Policy Optimization) and RND (Random Network Distillation) algorithms, as well as SAC (Soft Actor-Critic) and SIL (Self-Imitation Learning). We support multiple actors running on a single machine or distributed across several nodes via Docker Swarm.
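For reference, the PPO part of this combination is built around the standard clipped-surrogate objective. The sketch below shows the textbook form of that loss in PyTorch, not our exact implementation:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Maximise the clipped surrogate, i.e. minimise its negative.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors.
logp_new = torch.randn(8, requires_grad=True)
logp_old = torch.randn(8)
advantages = torch.randn(8)
loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()
print(float(loss))
```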

Because the actors are the main bottleneck of the system even when the game is sped up to its limit, it is important for the actor code to have minimal overhead so that most of the execution time is actually spent on the game and on inference. We designed the system for scalability, not only to be able to increase the number of actors but also to support multi-GPU training. Communication between nodes, memory consumption and mixed-precision computations were also important topics, as many components are involved during training and each of them can impact performance. In the multi-agent case (2 teams of up to 5 agents), we use batched inference as well as optimised preprocessing to avoid redundant computations. We tried many variants of the different approaches we use (reward shaping, features, neural network architectures, action heads, masks for valid actions, initialisations, normalisations, …) to find the best trade-off between training speed, numerical stability and agent performance.
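One of those details worth spelling out is action masking: actions that are currently illegal get their logits set to minus infinity before sampling, so the policy can never pick them. The snippet below is a generic PyTorch version of that idea; how the masks are wired into our specific action heads is not shown:

```python
import torch

def masked_categorical(logits, valid_mask):
    # valid_mask: 1 for currently legal actions, 0 for illegal ones.
    masked_logits = logits.masked_fill(valid_mask == 0, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

logits = torch.randn(4, 6)                     # batch of 4 agents, 6 discrete actions
mask = torch.tensor([[1, 1, 0, 1, 0, 1]] * 4)  # actions 2 and 4 are illegal right now
dist = masked_categorical(logits, mask)
print(dist.sample())                           # masked actions are never sampled
```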

Making sure training works correctly and being able to diagnose any potential problem are key in machine learning. Therefore, our system generates many statistics during training (policy entropy, explained variance, approximate KL divergence, …) and during evaluation phases (for instance, win rate). Finally, we developed Minimoba, a simplified MOBA written in C that runs thousands of times faster than real time, enabling us to debug, test hyper-parameters and prepare larger-scale experiments.
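The training diagnostics mentioned above have simple, standard definitions; the snippet below gives generic NumPy versions of them (not our actual logging code):

```python
import numpy as np

def policy_entropy(probs):
    # Mean entropy of a batch of categorical action distributions.
    return float(np.mean(-np.sum(probs * np.log(probs + 1e-8), axis=-1)))

def explained_variance(values_pred, returns):
    # 1 - Var[returns - predictions] / Var[returns]; 1 is perfect,
    # <= 0 means the value function is no better than predicting the mean return.
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")
    return float(1.0 - np.var(returns - values_pred) / var_returns)

def approx_kl(logp_old, logp_new):
    # Cheap estimate of KL(pi_old || pi_new) from log-probabilities of sampled actions.
    return float(np.mean(logp_old - logp_new))

rng = np.random.default_rng(0)
returns = rng.normal(size=256)
print(explained_variance(returns + 0.1 * rng.normal(size=256), returns))
```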

Our bots in action

We compiled short videos after training our RL agents in a simple scenario: small-scale training in 1v1 against a hardcoded bot (provided with the game), without demonstrations, using the PPG framework with PPO and a simple 2.7-million-parameter neural network. We tuned hyper-parameters so that training is as sample-efficient as possible for the fixed number of actors we planned to use, while still allowing the agent to acquire interesting skills. Our agent quickly learned to stay in range of other units to gain experience while avoiding drawing aggro. After several hours of small-scale training (64 CPU cores and 1 GPU), our 1v1 agents started to learn different interesting behaviours, as can be seen in the following video (more details on this in our previous blog post).

What we didn’t show…

In the video, you can see what our agent learned by playing against a hardcoded bot in 1v1, but we also support self-play and population training for larger-scale scenarios. As mentioned previously, replay files can be used as demonstrations, which provides an opportunity for better sample efficiency and faster training than purely online RL. This makes it possible to guide the agent towards specific behaviours, for example techniques relevant to 5v5. That's why our system also supports RL with demonstrations and offline learning via supervised learning on expert demonstrations.
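The supervised part of that pipeline boils down to behaviour cloning on (observation, action) pairs extracted from replays. Here is a generic sketch of one such training step; the network, dimensions and batch below are toy placeholders, and parsing replays into tensors is the game-specific part that is not shown:

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 64, 12
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(demo_obs, demo_actions):
    logits = policy(demo_obs)
    # Cross-entropy against the demonstrator's actions: plain behaviour cloning.
    loss = nn.functional.cross_entropy(logits, demo_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Toy batch standing in for parsed replay data.
obs = torch.randn(32, obs_dim)
actions = torch.randint(0, n_actions, (32,))
print(bc_step(obs, actions))
```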

Want to hear more? Be sure to follow us on Medium, Twitter or LinkedIn.

Do not hesitate to reach out at contact@wildmeta.com.

WildMeta, AI for video games.
