For The Win: An AI Agent Achieves Human-Level Performance in a 3D Video Game

A detailed explanation for the FTW agent from DeepMind

Sherwin Chen
Jan 18 · 8 min read


In this article, we discuss the For The Win (FTW) agent from DeepMind, which achieves human-level performance in a popular 3D team-based multiplayer first-person video game. The FTW agent uses a novel two-tier optimization process in which a population of independent RL agents is trained concurrently from thousands of parallel matches, with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables it to reason at multiple timescales.

Task Description

The FTW agent is trained on the Capture The Flag (CTF) environment, in which two opposing teams of individual players compete to capture each other’s flags by strategically navigating, tagging, and evading opponents (the authors train only with 2-vs-2 games but find the agents generalize to different team sizes). The team with the greatest number of flag captures after five minutes wins.

Environment Observation

The observation consists of 84 ⨉ 84 pixels. Each pixel is a triple of three bytes (RGB), which we scale by 1/255 to produce an observation x ∈ [0, 1]^{84 ⨉ 84 ⨉ 3}, as is done for Atari games. In addition, certain game point signals 𝜌_t, such as “I picked up the flag,” are also available.
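As a minimal sketch of this preprocessing (the frame here is randomly generated rather than taken from the actual environment):

```python
import numpy as np

# Hypothetical raw frame: 84 x 84 RGB pixels, one byte per channel.
raw_frame = np.random.randint(0, 256, size=(84, 84, 3), dtype=np.uint8)

def preprocess(frame):
    """Scale byte-valued pixels into [0, 1], as done for Atari-style observations."""
    return frame.astype(np.float32) / 255.0

x = preprocess(raw_frame)  # shape (84, 84, 3), values in [0, 1]
```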

Action Space

The action space consists of six types of discrete partial actions:

  • Change in yaw with five values (-60, -10, 0, 10, 60)
  • Change in pitch with three values (-5, 0, 5)
  • Strafing left or right (ternary)
  • Moving forward or backward (ternary)
  • Tagging or not (binary)
  • Jumping or not (binary)
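The composite action space is the Cartesian product of these partial actions. A small sketch, assuming the five-way change in yaw reported in the paper’s supplement alongside the values above (ternary values encoded here as -1/0/+1):

```python
from itertools import product

# Partial action heads; yaw values follow the paper's supplement,
# the rest mirror the list above.
yaw    = (-60, -10, 0, 10, 60)
pitch  = (-5, 0, 5)
strafe = (-1, 0, 1)
move   = (-1, 0, 1)
tag    = (0, 1)
jump   = (0, 1)

# Every composite action picks one value from each head:
# 5 * 3 * 3 * 3 * 2 * 2 = 540 composite actions in total.
composite_actions = list(product(yaw, pitch, strafe, move, tag, jump))
```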


Here we list some notation used later for reference:

  • 𝜋: agent policy
  • 𝛺: CTF map space
  • r=w(𝜌_t): intrinsic reward
  • p: player p
  • m_p(𝜋): a stochastic matchmaking scheme that biases co-players to be of similar skill to player p (see Elo Scores in the Supplementary Materials for details of how agent performance is scored)
  • 𝜄 ∼ m_p(𝜋): co-players of p

FTW Agent


Capture the Flag (CTF) presents three challenges: agents must reason at multiple timescales, learn from a sparse delayed win/loss signal, and remain robust to a constantly changing set of teammates and opponents. The three components of the FTW agent described below address these challenges in turn.

Temporally Hierarchical Reinforcement Learning


The FTW agent selects actions with a hierarchical RNN built from two LSTM cores operating at different timescales: a fast-ticking core g_q that updates every environment step, and a slow-ticking core g_p that updates only every 𝜏 steps. The slow core outputs a prior over a shared latent variable, the fast core outputs a variational posterior over the same latent, and a KL-divergence term between posterior and prior couples the two timescales in the loss.

[Figures and equations from the paper, not reproduced here: the variational posterior from the fast-ticking LSTM; the latent prior from the slow-ticking LSTM; the hierarchical RNN structure; the slow and fast LSTM updates, where g_p and g_q are the slow- and fast-timescale LSTM cores, respectively; and Equation 1, the loss function of FTW.]
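The two-timescale ticking can be sketched as follows; the cores here are stand-ins (simple counters rather than LSTMs), and the period 𝜏 is an illustrative choice, not the paper’s value:

```python
class HierarchicalCore:
    """Toy two-timescale recurrence: a fast core ticks every step,
    a slow core ticks once every `tau` steps (stand-ins for the
    fast/slow LSTM cores g_q and g_p)."""

    def __init__(self, tau=10):
        self.tau = tau
        self.fast_updates = 0
        self.slow_updates = 0

    def step(self, t):
        self.fast_updates += 1       # fast core: updates every step
        if t % self.tau == 0:
            self.slow_updates += 1   # slow core: updates every tau steps

core = HierarchicalCore(tau=10)
for t in range(100):
    core.step(t)
```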

Intrinsic Reward

The extrinsic reward is only given at the end of a game to indicate winning (+1), losing (-1), or tying (0). This delayed reward poses a prohibitively hard credit assignment problem for learning. To ease this problem, a dense intrinsic reward is defined based on the game point signal 𝜌_t. Specifically, for each game point signal, the agent’s intrinsic reward mapping w(𝜌_t) is initially sampled independently from Uniform(-1, 1). These internal rewards are then evolved through population-based training (PBT), together with other hyperparameters such as the 𝝀s in Equation 1 and the learning rates.
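A minimal sketch of this initialization, with a hypothetical set of game point signals (the actual set comes from the game):

```python
import random

# Hypothetical game point signal names, for illustration only.
game_point_signals = ["picked_up_flag", "captured_flag", "tagged_opponent"]

def init_intrinsic_rewards(signals):
    """Sample each signal's internal reward w(rho) i.i.d. from Uniform(-1, 1);
    PBT later evolves these values rather than learning them by gradient."""
    return {s: random.uniform(-1.0, 1.0) for s in signals}

w = init_intrinsic_rewards(game_point_signals)
```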

Population-Based Training

Population-based training (PBT) is an evolutionary method that trains a population of models in parallel and periodically replaces worse models with copies of better models plus minor modifications. In the case of the FTW agent, PBT can be summarized by repeating the following steps:

  • Train: each agent in the population is trained with RL in parallel matches.
  • Eval: agents’ relative performance is estimated with Elo scores computed from match outcomes.
  • Select: if an agent’s estimated probability of beating another agent falls below a threshold, it inherits the better agent’s network weights, internal reward mapping, and hyperparameters.
  • Mutate: the inherited internal rewards and hyperparameters are perturbed to continue exploration.
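A bare-bones sketch of one PBT exploit/explore step over a single hyperparameter (agent names, ratings, and the perturbation factor are illustrative, not the paper’s values):

```python
import random

# Toy population: each agent carries a hyperparameter and an Elo-style rating.
population = [
    {"name": "agent_a", "lr": 1e-3, "elo": 1100.0},
    {"name": "agent_b", "lr": 5e-4, "elo": 900.0},
]

def pbt_step(population, perturb=1.2):
    """Worst agent inherits the best agent's hyperparameters (exploit),
    then multiplies them by a random perturbation (explore)."""
    best = max(population, key=lambda a: a["elo"])
    worst = min(population, key=lambda a: a["elo"])
    if worst is not best:
        worst["lr"] = best["lr"] * random.choice([perturb, 1.0 / perturb])
    return population

pbt_step(population)
```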


We can summarize the policy training and PBT as a joint optimization of two objectives: the inner optimization, solved by RL, maximizes each agent’s expected discounted internal reward, J_inner(𝜋_p) = E[Σ_t 𝛾^t w_p(𝜌_{p,t})], while the outer optimization, solved by PBT, selects the internal reward mapping w_p and hyperparameters to maximize the probability that player p’s team wins its matches.


That’s it. It was a long journey; hopefully, you enjoyed it. If you spot any mistakes or have concerns, feel free to leave a note or comment below. Thanks for reading :-)


Reference

Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, et al. 2019. “Human-Level Performance in 3D Multiplayer Games with Population-Based Reinforcement Learning.” Science 364 (6443): 859–65.

Supplementary Materials

Network Architecture

(The network architecture figure from the paper is not reproduced here.)

Elo Scores

Given a population of M agents, let the trainable variable 𝜓_i ∈ R be the rating of agent i. We describe a given match between two players (i, j) on the blue and red teams with a vector m ∈ Z^M, where m_i is the number of times agent i appears on the blue team less the number of times it appears on the red team. In the Eval step of PBT, where we use two players with 𝜋_i on the blue team and two with 𝜋_j on the red team, we have m_i = 2 and m_j = -2. The standard Elo formula then gives the probability of the blue team winning as P(blue wins | m) = 1 / (1 + 10^{-𝜓^T m / 400}).
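Under this parameterization, the blue team’s win probability is the standard Elo logistic in 𝜓ᵀm. A small sketch with the 2-vs-2 Eval setup (ratings here are illustrative):

```python
def blue_win_prob(psi, m):
    """Standard Elo win probability for the blue team:
    1 / (1 + 10^(-(psi . m) / 400))."""
    score = sum(p_i * m_i for p_i, m_i in zip(psi, m))
    return 1.0 / (1.0 + 10.0 ** (-score / 400.0))

# Eval step: two copies of agent i on blue, two of agent j on red.
psi = [1000.0, 1000.0]            # equal ratings
m   = [2, -2]
p_blue = blue_win_prob(psi, m)    # equal ratings give 0.5
```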

Towards AI

Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.

Written by

Sherwin Chen

A learner, interested in deep learning and reinforcement learning.
