# For The Win: An AI Agent Achieves Human-Level Performance in a 3D Video Game

## A detailed explanation of the FTW agent from DeepMind

# Introduction

In this article, we discuss the For The Win (FTW) agent from DeepMind, which achieves human-level performance in a popular 3D team-based multiplayer first-person video game. The FTW agent uses a novel two-tier optimization process in which a population of independent RL agents is trained concurrently from thousands of parallel matches, with agents playing in teams together and against each other on randomly generated environments. Each agent in the population learns its own internal reward signal to complement the sparse, delayed reward from winning, and selects actions using a novel temporally hierarchical representation that enables it to reason at multiple timescales.

# Task Description

The FTW agent is trained on the Capture the Flag (CTF) environment, where two opposing teams of multiple individual players (the authors only train with 2-vs-2 games but find the agents generalize to different team sizes) compete to capture each other’s flags by strategically navigating, tagging, and evading opponents. The team with the greater number of flag captures after five minutes wins.

## Environment Observation

The observation consists of *84 ⨉ 84* pixels. Each pixel is represented by a triple of three bytes, which we scale by *1/255* to produce an observation *x ∈ [0, 1]^{84 ⨉ 84 ⨉ 3}*, as is done for Atari games. In addition, certain game point signals *𝜌_t*, such as “I picked up the flag,” are also available.
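As a minimal sketch, this preprocessing might look like the following (the `preprocess` helper is illustrative, not from the paper’s codebase):

```python
import numpy as np

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Scale an 84x84 RGB frame of bytes into floats in [0, 1]."""
    assert frame.shape == (84, 84, 3) and frame.dtype == np.uint8
    return frame.astype(np.float32) / 255.0

# Example: a random raw frame standing in for a rendered game observation
raw = np.random.randint(0, 256, size=(84, 84, 3), dtype=np.uint8)
x = preprocess(raw)
assert x.min() >= 0.0 and x.max() <= 1.0
```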

## Action Space

The action space consists of six types of discrete partial actions:

- Change in yaw rotation with five values *(-60, -10, 0, 10, 60)*
- Change in pitch with three values *(-5, 0, 5)*
- Strafing left or right (ternary)
- Moving forward or backward (ternary)
- Tagging or not (binary)
- Jumping or not (binary)

This gives a total of *5⨉3⨉3⨉3⨉2⨉2=540* composite actions that an agent can produce.
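The composite action space can be enumerated directly as a Cartesian product; the tuples below just illustrate the combinatorics:

```python
import itertools

# The six partial-action groups
yaw    = (-60, -10, 0, 10, 60)   # change in yaw
pitch  = (-5, 0, 5)              # change in pitch
strafe = (-1, 0, 1)              # left / none / right
move   = (-1, 0, 1)              # backward / none / forward
tag    = (0, 1)                  # tag or not
jump   = (0, 1)                  # jump or not

composite_actions = list(itertools.product(yaw, pitch, strafe, move, tag, jump))
assert len(composite_actions) == 540  # 5*3*3*3*2*2
```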

## Notations

Here we list some notations used later for reference:

- 𝜙: hyperparameters
- 𝜋: agent policy
- 𝛺: CTF map space
- *r = w(𝜌_t)*: intrinsic reward
- *p*: player
- *m_p(𝜋)*: a stochastic matchmaking scheme that biases co-players to be of similar skill to player *p* — see Elo scores in Supplementary Materials for details of scoring an agent’s performance
- *𝜄 ∼ m_p(𝜋)*: co-players of *p*

# FTW Agent

## Overall

Capture the Flag (CTF) presents three challenges:

1. CTF play requires memory and long-term temporal reasoning to carry out high-level strategy.

2. The reward is sparse — it is given only at the end of the game. The credit assignment problem is therefore hard.

3. The environment setup varies from match to match. Besides different maps, there may be different numbers of players, different skill levels of teammates, etc.

The FTW agent meets the first challenge by introducing an architecture that features a multi-timescale representation, reminiscent of what has been observed in the primate cerebral cortex, and an external working memory module, broadly inspired by human episodic memory. The credit assignment problem is eased by enabling agents to evolve an internal reward signal based on the game points signal *𝜌_t*. Finally, to develop diverse, generalizable skills that are robust to different environment setups, a large, diverse population of agents is trained concurrently, learning by playing with each other on different maps. This diverse agent population also paves the way for population-based training to optimize the internal rewards and hyperparameters.

In the rest of this section, we first address the architecture and objective used by FTW. Then we cover the intrinsic reward and population-based training.

## Temporally Hierarchical Reinforcement Learning

**Architecture**

The FTW agent uses a hierarchical RNN with two LSTMs operating at different timescales (although this architecture can be extended to more layers, the authors found that in practice more than two layers made little difference on the task). The fast-ticking LSTM, which evolves at each timestep, outputs the variational posterior *N(𝜇_t^q, 𝛴_t^q)*.

The *z_t* sampled from *N(𝜇_t^q, 𝛴_t^q)* is then used as the latent variable for the policy, value function, and auxiliary tasks. The slow-timescale LSTM, which updates every 𝜏 timesteps, outputs the latent prior *N(𝜇_t^p, 𝛴_t^p)*.

This prior distribution, as we will explain soon, serves as a regularization for the variational posterior. Intuitively, the slow LSTM generates a prior on *z* that predicts the evolution of *z* over the subsequent 𝜏 steps, while the fast LSTM generates a variational posterior on *z* that incorporates new observations but adheres to the predictions made by the prior. The following figure summarizes this process at timestep *t*; a more detailed figure from the paper appears in the Supplementary Materials.

We can also express the evolution of the hidden states of both the fast and slow RNNs mathematically as follows
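Since the exact update equations come from the paper, the following is only a toy sketch of the two-timescale idea, with stand-in `slow_core`/`fast_core` functions in place of real LSTMs (all dimensions and functions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8        # latent dimension (illustrative)
TAU = 5      # slow-LSTM update period

def slow_core(h_slow, z_prev):
    # Stand-in for the slow LSTM: emits the prior N(mu_p, sigma_p) on z
    h_slow = np.tanh(h_slow + z_prev)
    return h_slow, h_slow, np.full(D, 0.1)        # (state, mu_p, sigma_p)

def fast_core(h_fast, obs, mu_p):
    # Stand-in for the fast LSTM: emits the posterior N(mu_q, sigma_q),
    # conditioned on the current prior mean
    h_fast = np.tanh(h_fast + obs + mu_p)
    return h_fast, h_fast, np.full(D, 0.1)        # (state, mu_q, sigma_q)

h_slow, h_fast = np.zeros(D), np.zeros(D)
mu_p, sigma_p = np.zeros(D), np.ones(D)
z = np.zeros(D)
for t in range(20):
    if t % TAU == 0:                              # slow core ticks every TAU steps
        h_slow, mu_p, sigma_p = slow_core(h_slow, z)
    obs = rng.standard_normal(D)                  # placeholder observation embedding
    h_fast, mu_q, sigma_q = fast_core(h_fast, obs, mu_p)
    z = mu_q + sigma_q * rng.standard_normal(D)   # sample z_t ~ N(mu_q, diag(sigma_q^2))
    # z would feed the policy, value function, and auxiliary heads
```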

**Objectives**

FTW uses almost the same objective as UNREAL, with V-trace for off-policy correction and additional KL terms for regularization.

Most parts of the objective have previously been covered by UNREAL and IMPALA. The first newly introduced KL term regularizes the policy *Q* against a prior policy *P*, following the idea of RL as probabilistic inference. Unlike traditional probabilistic RL methods that directly regularize the policy, Equation 1 introduces an intermediate latent variable *z*, which models the dependence on past observations. By regularizing the latent variable, the policy and the prior now differ only in the way this dependence on past observations is modeled. The second KL penalty regularizes the prior policy *P* against a multivariate Gaussian with mean *0* and standard deviation *0.1* (see *Section 5.4* in the paper).
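Both KL terms are divergences between diagonal Gaussians, which have a closed form; the means and standard deviations below are made-up values for illustration:

```python
import numpy as np

def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dims."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Illustrative posterior and prior parameters for a 2-D latent
mu_q, sigma_q = np.array([0.2, -0.1]), np.array([0.3, 0.2])
mu_p, sigma_p = np.zeros(2), np.full(2, 0.1)

kl_posterior_prior = kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p)          # first KL term
kl_prior_ref = kl_diag_gauss(mu_p, sigma_p, np.zeros(2), np.full(2, 0.1))  # second KL term
assert kl_prior_ref == 0.0  # the prior here already equals the N(0, 0.1) reference
```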

Another subtle difference from previous methods lies in the pixel control policy, denoted *L_{PC}* in Equation 1. In light of the composite nature of the action space, the authors propose training independent pixel control policies for each of the six action groups (see *Figure S10(i)* in Supplementary Materials for details).

All the coefficients 𝝀 used in Equation 1 are first sampled from a given range and then optimized by population-based training.

## Intrinsic Reward

The extrinsic reward is only given at the end of a game to indicate winning (+1), losing (-1), or a tie (0). This delayed reward poses a prohibitively hard credit assignment problem for learning. To ease this problem, a dense intrinsic reward is defined based on the game point signal *𝜌_t*. Specifically, for each game point signal, each agent’s intrinsic reward mapping *w(𝜌_t)* is initially sampled independently from *Uniform(-1, 1)*. These internal rewards are then evolved through population-based training (PBT), along with other hyperparameters such as the 𝝀s in Equation 1 and the learning rate.
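A minimal sketch of this intrinsic reward mapping; the signal names here are illustrative placeholders, not the game’s actual event list:

```python
import random

# Illustrative subset of game point signals (placeholder names)
GAME_POINT_SIGNALS = [
    "I picked up the flag",
    "I captured the flag",
    "I tagged an opponent",
    "My flag was taken",
]

def sample_reward_mapping(rng):
    """Initial intrinsic-reward mapping w: one weight per game point signal,
    drawn independently from Uniform(-1, 1)."""
    return {signal: rng.uniform(-1.0, 1.0) for signal in GAME_POINT_SIGNALS}

def intrinsic_reward(w, events):
    """r_t = w(rho_t): sum the weights of the signals that fired this step."""
    return sum(w[e] for e in events)

rng = random.Random(0)
w = sample_reward_mapping(rng)
r = intrinsic_reward(w, ["I picked up the flag"])
assert -1.0 <= r <= 1.0
```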

## Population-Based Training

Population-based training (PBT) is an evolutionary method that trains a population of models in parallel and continually replaces worse-performing models with copies of better ones plus minor modifications. In the case of the FTW agent, PBT can be summarized by repeating the following steps:

1. **Step**: Each model in the population of *P=30* is trained with its hyperparameters for some number of steps (e.g., *1K* games). For each game, we first randomly sample an agent *𝜋_p*, and then select its teammate and opponents based on their Elo scores — see Supplementary Materials for a brief introduction to Elo scores.

2. **Eval**: After the step requirement is met, we evaluate the performance of each model. In the case of FTW, the ready agent competes against another randomly sampled agent, and we estimate their Elo scores.

3. **Exploit**: If the estimated win probability of an agent is found to be less than *70%*, the losing agent copies the policy, the internal reward transformation, and the hyperparameters of the better agent.

4. **Explore**: We perturb each inherited internal reward and hyperparameter by *±20%* with a probability of *5%*, except for the slow LSTM timescale 𝜏, which is uniformly sampled from the integer range *[5, 20)*.
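The Exploit and Explore steps can be sketched as follows (the agent ids, hyperparameter names, and `win_prob` stub are all hypothetical):

```python
import random

def explore(value, rng, perturb_prob=0.05, scale=0.2):
    """Explore: perturb an inherited value by +/-20% with probability 5%."""
    if rng.random() < perturb_prob:
        return value * (1.0 + scale if rng.random() < 0.5 else 1.0 - scale)
    return value

def pbt_exploit_explore(population, win_prob, rng, threshold=0.7):
    """One Exploit/Explore pass. `population` maps agent id -> hyperparameters;
    `win_prob(a, b)` is an Elo-based estimate of P(a beats b)."""
    agents = list(population)
    for agent in agents:
        rival = rng.choice([a for a in agents if a != agent])
        if win_prob(agent, rival) < threshold:
            inherited = dict(population[rival])          # Exploit: copy the better agent
            population[agent] = {k: explore(v, rng)      # Explore: perturb each value
                                 for k, v in inherited.items()}

rng = random.Random(0)
population = {"a": {"lr": 1e-3, "kl_coef": 0.5}, "b": {"lr": 2e-3, "kl_coef": 0.1}}
pbt_exploit_explore(population, win_prob=lambda a, b: 0.5, rng=rng)
assert set(population) == {"a", "b"}
```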

# Summary

We can summarize the policy training and PBT as a joint optimization of the following objectives

This can be viewed as a two-tier reinforcement learning problem. The inner optimization, solved with temporally hierarchical RL, maximizes *J_{inner}*, the agents’ expected future discounted internal rewards. The outer optimization of *J_{outer}*, solved with PBT, can be viewed as a meta-game in which the meta-reward of winning the match is maximized w.r.t. the internal reward transformation *w* and hyperparameters 𝜙, with the inner optimization providing the meta transition dynamics.

# End

That’s it. It was a long journey; hopefully, you enjoyed it. If you spot any mistakes or have any concerns, feel free to leave a note or comment below. Thanks for reading :-)

# References

Jaderberg, Max, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, et al. 2019. “Human-Level Performance in 3D Multiplayer Games with Population-Based Reinforcement Learning.” *Science* 364 (6443): 859–65. https://doi.org/10.1126/science.aau6249.

Espeholt, Lasse, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, et al. 2018. “IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures.” In *35th International Conference on Machine Learning, ICML 2018*.

Jaderberg, Max, Volodymyr Mnih, et al. 2017. “Reinforcement Learning with Unsupervised Auxiliary Tasks.” In *International Conference on Learning Representations, ICLR 2017*.

# Supplementary Materials

## Network Architecture

The architecture uses a DNC as its external working memory module; the DNC has been explained in our previous post.

## Elo scores

Given a population of *M* agents, let the trainable variable *𝜓_i ∈ R* be the rating for agent *i*. We describe a given match between two players *(i, j)* on the blue and red teams with a vector *m ∈ Z^M*, where *m_i* is the number of times agent *i* appears on the blue team minus the number of times the agent appears on the red team — in the Eval step of PBT, where we use two players with *𝜋_i* on the blue team and two with *𝜋_j* on the red team, we have *m_i = 2* and *m_j = -2*. The standard Elo formula gives the win probability as *P(blue beats red | m) = (1 + 10^{-𝜓^T m / 400})^{-1}*,

where we optimize the rating *𝜓* to maximize the likelihood of the data (labeled *y_i = 1* for ‘blue beats red,’ *y_i = 1/2* for a draw, and *y_i = 0* for ‘red beats blue’). Since the win probability is determined only by the absolute difference in ratings, we typically anchor a particular agent to a rating of *1000* for ease of interpretation.
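Assuming the standard base-10 Elo logistic with scale 400, the win probability for the 2-vs-2 Eval match can be computed as:

```python
import numpy as np

def elo_win_prob(psi, m):
    """P(blue beats red) under the standard Elo logistic (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** (-(psi @ m) / 400.0))

psi = np.array([1100.0, 1000.0])   # ratings of pi_i and pi_j (illustrative)
m = np.array([2.0, -2.0])          # 2-vs-2 Eval match: pi_i on blue, pi_j on red
p = elo_win_prob(psi, m)
assert p > 0.5                     # the higher-rated team is favored
```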

The sampling distribution for a particular agent’s teammates and opponents is defined as

which is a normal distribution over Elo-based probabilities of winning, centered on agents of the same skill (*𝜇 = 0.5*).
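A hedged sketch of such a matchmaking weighting (the value of σ, the helper names, and the lack of normalization are assumptions for illustration, not taken from the paper):

```python
import math

def matchmaking_weights(ratings, p, sigma=0.25):
    """Weight each candidate co-player q by a Gaussian over the Elo win
    probability of q against player p, centered at an even match (0.5)."""
    def win_prob(a, b):
        return 1.0 / (1.0 + 10.0 ** (-(ratings[a] - ratings[b]) / 400.0))
    return {q: math.exp(-(win_prob(q, p) - 0.5) ** 2 / (2 * sigma ** 2))
            for q in ratings if q != p}

# Illustrative ratings: q1 is a near-even match for p, q2 is far stronger
ratings = {"p": 1000.0, "q1": 1010.0, "q2": 1500.0}
w = matchmaking_weights(ratings, "p")
assert w["q1"] > w["q2"]   # similar-skill co-players get more weight
```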