PART 1: Deep Reinforcement Learning Systems

Gursifath Bhasin
Nov 12, 2021

This blog is published as part of a seminar series for COMSE 6998 (Practical Deep Learning Systems Performance), taught by Prof. Parijat Dube at Columbia University. My friend Gaurav and I went through four research papers on Deep Reinforcement Learning Systems and summarized each of them.

Introduction

As the name suggests,

Deep Reinforcement Learning = Deep Learning + Reinforcement Learning

Reinforcement Learning is a subfield of machine learning that teaches an agent how to choose an action from its action space, within a particular environment, in order to maximize rewards over time. The goal of the RL agent is then to compute a policy (a mapping between environment states and actions) that maximizes a long-term reward.
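
To make this concrete, here is a minimal sketch of the agent-environment loop in plain Python. The toy environment and the random policy are purely illustrative placeholders, not taken from any of the papers discussed below.

import random

# A toy environment: the agent starts in state 0 and moves LEFT (0) or RIGHT (1);
# reaching state 3 yields a reward of +1 and ends the episode.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state = max(0, self.state + (1 if action == 1 else -1))
        done = self.state == 3
        reward = 1.0 if done else 0.0
        return self.state, reward, done

# A policy maps states to actions; here it simply picks an action at random.
def policy(state):
    return random.choice([0, 1])

env = ToyEnv()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = policy(state)                  # choose an action from the action space
    state, reward, done = env.step(action)  # the environment returns the next state and a reward
    total_reward += reward                  # the agent's goal is to maximize this over time
print("episode return:", total_reward)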

Reinforcement learning (RL) algorithms combined with deep neural networks have emerged as an active area of research due to promising results in complex control tasks. Deep reinforcement learning agents have defeated world champions in Go (AlphaGo) and have shown strong results in robotic manipulation and autonomous driving.

Courtesy: https://intellabs.github.io/coach/

Problems with training and optimization of Deep Reinforcement learning:

  1. Interactions with real systems can be slow, and faster simulated environments may not generalize well to the real world.
  2. Sparse or poorly defined rewards can lead to instability and high variance during training.
  3. It is difficult to reproduce results due to restricted access to the resources, code, and workloads used.
  4. The lack of generalization is an issue that deep RL solutions often suffer from.

All these issues hint towards the need for sophisticated Deep Reinforcement Learning Systems.

Below we discuss two important research papers that aim to democratize the use of RL by a larger community of researchers and enthusiasts.

ELF: An Extensive, Lightweight & Flexible Research Platform for Real-time Strategy Games

Because it is difficult to get sufficient real-world data to train intelligent RL agents, researchers often turn to game environments, where training data is:

  1. Almost infinite
  2. Low cost
  3. Replicable
  4. Easily obtainable

However, despite these benefits, it can still be difficult for individuals to conduct AI research in a game environment: running thousands of rounds of gameplay requires a large amount of computational resources, and the relevant algorithms are complex and delicate to tune.

These problems compound as the complexity of the training environment increases and multiple AI agents are introduced.

ELF is a deep reinforcement learning platform with a focus on real-time strategy (RTS) games.

It allows researchers to test their algorithms in various game environments, including board games, Atari games, and custom-made real-time strategy games. Not only does it run on a laptop with a GPU, it also supports training AI in more complicated environments, such as real-time strategy games, in just one day using only six CPUs and one GPU.

The paper works with three game environments: Mini-RTS, Capture the Flag, and Tower Defense.

Architecture of ELF

ELF works with any game with a C++ interface, and automatically handles concurrency issues like multithreading/multiprocessing. ELF follows a simple producer-consumer type architecture. Several concurrent game instances run at the same time on the C++ side, while simultaneously communicating with the AI models on the Python side. Unlike other RL environments such as OpenAI Gym, ELF wraps a batch of games into one Python interface. This enables models and RL algorithms to obtain a batch of game states in every iteration, which decreases the amount of time needed to train models.

KEY FEATURES:

  • Parallelism using C++ threads
  • Flexible Environment-Model Configurations
  • Highly customizable and unified interface
  • Reinforcement Learning backend

ELF performs faster simulation and delivers nearly 30 percent faster training speed than OpenAI Gym on Atari games.

IMPLEMENTATION:

  1. For all games, initial game states are randomized.
  2. A3C (the Asynchronous Advantage Actor-Critic algorithm) is used to train the agents to play the full game; a simplified sketch of the actor-critic update appears after this list.
  3. Experiments are run 5 times, and the mean and standard deviation are reported.
  4. Because the input is sparse, a CNN with Batch Normalization and Leaky ReLU is used to improve and stabilize performance.
  5. A frame skip of 10 is used for the trained AI and 50 for the opponent, giving the trained AI a slight advantage.
  6. All models are trained from scratch with curriculum training.

Asynchronous Advantage Actor-Critic Algorithm
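
To give a flavor of what the actor-critic update looks like, here is a simplified, single-worker sketch in PyTorch. The network sizes, loss coefficients, and fake batch are illustrative assumptions; the actual A3C setup in the paper runs many asynchronous workers that compute gradients in parallel against shared parameters.

import torch
import torch.nn as nn

# A tiny actor-critic network: a shared body, a policy head (action logits)
# and a value head (state-value estimate).
class ActorCritic(nn.Module):
    def __init__(self, obs_dim=16, n_actions=4):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.body(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

net = ActorCritic()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# Fake batch of observations, actions taken, and discounted returns.
obs = torch.randn(32, 16)
actions = torch.randint(0, 4, (32,))
returns = torch.randn(32)

logits, values = net(obs)
dist = torch.distributions.Categorical(logits=logits)
advantage = returns - values                      # how much better the return was than predicted

policy_loss = -(dist.log_prob(actions) * advantage.detach()).mean()
value_loss = advantage.pow(2).mean()              # critic regression toward the observed return
entropy_bonus = dist.entropy().mean()             # encourages exploration

loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
optimizer.zero_grad()
loss.backward()
optimizer.step()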

The paper simulates 1-on-1 full games between trained AIs and built-in AIs. Built-in AIs have access to full information about the game (e.g., the number of the opponent’s tanks), while trained AIs see only partial information: they know only the part of the game environment within the sight of their own units. Both players gather resources, build facilities, explore unknown territory (terrain that is out of sight of the player), and attempt to control regions on the map. In addition, the engine has characteristics that facilitate AI research: perfect save/load/replay, full access to its internal game state, multiple built-in rule-based AIs, visualization for debugging, and a human-AI interface, among others.

USAGE:

The initialization and usage of ELF is quite easy as mentioned below:

# We run 1024 games concurrently.
num_games = 1024

# Wait for a batch of 256 games.
batchsize = 256

# The return states contain key 's', 'r' and 'terminal'
# The reply contains key 'a' to be filled from the Python side.
# The definitions of the keys are in the wrapper of the game.
input_spec = dict(s='', r='', terminal='')
reply_spec = dict(a='')

context = Init(num_games, batchsize, input_spec, reply_spec)
# Start all game threads and enter main loop.
context.Start()
while True:
    # Wait for a batch of game states to be ready.
    # These games will be blocked, waiting for replies.
    batch = context.Wait()

    # Apply a model to the game state. The output has key 'pi'.
    # You can do whatever you want here, e.g., apply your favorite RL algorithm.
    output = model(batch)

    # Sample from the output to get the actions for this batch.
    reply['a'][:] = SampleFromDistribution(output)

    # Resume games.
    context.Steps()

# Stop all game threads.
context.Stop()

RESULTS: As a baseline, AIs trained on Mini-RTS have shown promising results, beating the built-in AI agent 70 percent of the time. These results show that it is possible to train AI to accomplish tasks and prioritize actions in relatively complex strategy environments.

The ELF paper demonstrates the performance of the platform on three relatively simple games. Whether it can handle more complex games remains to be seen.

Horizon: Facebook’s Open Sourced Applied Reinforcement Learning Platform

Horizon is the first open-source end-to-end platform that uses applied reinforcement learning to optimize systems in large-scale production environments, and it is used at Facebook. It handles large datasets with hundreds or thousands of feature types with varying distributions, and it works with high-dimensional discrete and continuous action spaces. Horizon is built in Python and uses PyTorch for modeling and training and Caffe2 for model serving.

Comparison between Horizon and related frameworks

Horizon’s pipeline is divided into three components: (1) timeline generation, which runs across thousands of CPUs; (2) training, which runs across many GPUs; and then (3) serving, which also spans thousands of machines. This workflow allows Horizon to scale to Facebook data sets. For on-policy learning, Horizon can optionally feed data directly to training in a closed loop.

Horizon’s architecture

Now, let’s dig deeper into the various features that Horizon has:

  1. Data Preprocessing: Production systems often log data as it comes in, so some logic is required to convert it into a format suitable for RL. A Spark pipeline transforms the data into a meaningful row format (MDP ID, state features, action, etc.) for the various deep RL models.
  2. Feature Normalization: Neural networks learn faster and better on normalized data. Horizon automatically analyzes the training dataset and determines the best transformation function and corresponding normalization parameters for each feature; developers can override this.
  3. Data Understanding Tool: Formulating a problem so that it conforms to an MDP can be tricky in real-world environments. This tool quickly checks the formulation using data and heuristics, e.g., feature importance (a feature’s importance is the increase in model loss caused by masking that feature).
  4. Model Implementations: Horizon contains implementations of several deep RL algorithms that cover discrete, very large discrete, and continuous action domains: Deep Q-Networks (DQN), Deep Q-Networks with double Q-learning (DDQN), and Deep Q-Networks with a dueling architecture (Dueling DQN). A sketch contrasting the DQN and Double DQN targets follows this list.
  5. Training: Training is conducted on many GPUs distributed over numerous machines, enabling fast model iteration and high utilization of industry-sized clusters. Horizon supports CPU, GPU, multi-GPU, and multi-node training.
  6. Model Understanding and Evaluation: Unlike in research settings, production systems rarely have access to a simulator, which makes offline model evaluation important. Horizon uses counterfactual policy evaluation (CPE), a set of methods that predict the performance of a newly learned policy without having to deploy it online, to estimate the expected performance of the newly trained RL model. Results are viewed with the TensorBoard web visualization tool.
  7. Model Serving: After training, models are exported from PyTorch to a Caffe2 network and a set of parameters via ONNX. Caffe2 is optimized for performance and portability, allowing models to be deployed to thousands of machines.
  8. Testing: Testing RL systems is a new area with no established best practices. Horizon uses unit tests and an integration test, and RL model evaluation is run on every pull request.
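
As a side note on point 4, the sketch below contrasts the standard DQN target with the Double DQN target using random placeholder tensors. This is a generic illustration of the two update rules, not Horizon’s actual implementation.

import torch

gamma = 0.99

# Fake batch: Q-values of the next states from the online and target networks,
# plus rewards and a flag that is 0 for terminal transitions.
q_next_online = torch.randn(32, 5)   # Q_online(s', .)
q_next_target = torch.randn(32, 5)   # Q_target(s', .)
rewards = torch.randn(32)
not_done = torch.randint(0, 2, (32,)).float()

# Standard DQN target: bootstrap from the max of the target network.
dqn_target = rewards + gamma * not_done * q_next_target.max(dim=1).values

# Double DQN target: the online network picks the action, the target network
# evaluates it, which reduces the overestimation bias of plain DQN.
best_actions = q_next_online.argmax(dim=1, keepdim=True)
ddqn_target = rewards + gamma * not_done * q_next_target.gather(1, best_actions).squeeze(1)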

Counterfactual policy performance vs. training time. As the number of training epochs increases, the CPE estimates improve.

Counterfactual policy performance vs. number of epochs. A score of 1.0 means that the RL policy and the logged policy match in performance.

The first graph shows the expected performance of the newly trained policy relative to the policy that generated the training data, on a real Facebook dataset. The y-axis shows relative value estimates for several CPE methods, and the x-axis shows training time. A score of 1.0 means that the RL policy and the logged policy match in performance. These results show that the RL model should achieve roughly 1.5x to 1.8x as much cumulative reward as the logged system. As the number of training epochs increases, the CPE estimates improve.

The second graph shows TensorBoardX counterfactual policy evaluation results. The x-axis of each plot shows the number of training epochs, and the y-axis shows the CPE estimate. Here, the RL model should achieve roughly 1.2x to 1.5x as much cumulative reward as the logged policy.
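
For intuition about how such relative scores can be computed, here is a minimal sketch of one of the simplest CPE estimators, inverse propensity scoring, in a one-step (bandit-style) setting with synthetic data. Horizon’s CPE suite relies on more sophisticated sequential estimators, so treat this only as an illustration of the idea.

import numpy as np

# Synthetic logged data: for each decision we have the observed reward, the
# probability the logged policy assigned to the action it took, and the
# probability the new RL policy would assign to that same action.
rewards = np.random.rand(1000)
logged_prob = np.random.uniform(0.1, 0.9, size=1000)   # pi_logged(a | s)
new_prob = np.random.uniform(0.1, 0.9, size=1000)      # pi_new(a | s)

# Inverse propensity scoring: reweight each logged reward by how much more
# (or less) likely the new policy is to take the logged action.
weights = new_prob / logged_prob
ips_value = np.mean(weights * rewards)

# Relative score, as in the plots above: 1.0 means the new policy is estimated
# to match the logged policy's performance.
logged_value = np.mean(rewards)
print("relative CPE score:", ips_value / logged_value)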

REAL WORLD DEPLOYMENT OF HORIZON: NOTIFICATIONS @ FB

Notifications are personalized and time-sensitive updates. Historically, supervised learning models were used to predict click-through rate (CTR) and the likelihood that a notification leads to meaningful interactions.

Limitation of this approach: it does not capture the long-term or incremental value of sending notifications.

New Policy by Horizon: Horizon was used to train a discrete-action DQN model for sending push notifications.

The MDP is based on a sequence of notification candidates for a particular person. The actions are sending or dropping the notification, and the state describes a set of features about the person and the notification candidate. There are rewards for interactions and activity, and a penalty for sending the notification in order to control the volume of notifications sent. The policy optimizes for long-term value and is able to capture the incremental effects of sending a notification by comparing the Q-values of the send and drop actions.

The difference in Q-values is computed and passed through a sigmoid function to create an RL-based policy.

If the difference between Q(send) and Q(drop) is large, this means there is significant value in sending the notification. If this difference is small, it means that sending a notification is not much better than not sending a notification.
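
A minimal sketch of how such a policy could look is shown below. The temperature parameter and the example Q-values are illustrative assumptions; the text above only states that the Q-value difference is passed through a sigmoid.

import math

def send_probability(q_send, q_drop, temperature=1.0):
    # Map the Q-value gap to a send probability with a sigmoid.
    return 1.0 / (1.0 + math.exp(-(q_send - q_drop) / temperature))

# Large gap: sending has clear long-term value, so the policy almost always sends.
print(send_probability(q_send=2.5, q_drop=0.5))   # ~0.88
# Small gap: sending adds little value, so the probability stays near 0.5.
print(send_probability(q_send=1.05, q_drop=1.0))  # ~0.51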

Outcome: An improvement was observed in the relevance of notifications, without increasing the total number of notifications sent out.

Horizon supports only one installation configuration, namely Anaconda packages; to increase user adoption, it should be easier to install and manage. Horizon has also been applied by Facebook’s 360-degree video team to reduce bitrate consumption without harming people’s watching experience.

Details about the other two research papers can be found on Gaurav’s Medium blog at https://bit.ly/3CcHU47

References:

  1. ELF: An Extensive, Lightweight & Flexible Research Platform for Real-time Strategy Games. https://arxiv.org/pdf/1707.01067.pdf
  2. Ray: A Distributed Framework for Emerging AI Applications. https://www.usenix.org/system/files/osdi18-moritz.pdf
  3. A View on Deep Reinforcement Learning in System Optimization. https://arxiv.org/pdf/1908.01275.pdf
