Google Open Sourced this Architecture for Massively Scalable Reinforcement Learning Models
The new architecture improves upon the IMPALA model to achieve massive levels of scalability.
I recently started a new newsletter focused on AI education. TheSequence is a no-BS (meaning no hype, no news, etc.) AI-focused newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers and concepts. Please give it a try by subscribing below:
Deep reinforcement learning (DRL) is one of the fastest-growing areas of research in the deep learning space. Responsible for some of the top AI milestones of recent years, such as AlphaGo, OpenAI Five (Dota 2) and AlphaStar, DRL seems to be the discipline that most closely approximates human intelligence. However, despite all the progress, real-world implementations of DRL methods remain constrained to the big artificial intelligence (AI) labs. This is partly because DRL architectures rely on disproportionately large amounts of training, which makes them computationally expensive and impractical for most organizations. Recently, Google Research published a paper proposing SEED RL, a new architecture for massively scalable DRL models.
The challenges of implementing DRL models in the real world are directly tied to their architecture. Intrinsically, DRL comprises heterogeneous tasks such as running environments, model inference, model training and replay buffers. Most modern DRL architectures fail to distribute compute resources efficiently across these tasks, making them unreasonably expensive to implement. Components such as AI hardware accelerators have helped with some of these limitations, but they can only go so far. In recent years, new architectures have emerged that have been adopted by many of the most successful DRL implementations in the market.
Drawing Inspiration from IMPALA
Among the current generation of DRL architectures, IMPALA set a new standard for the space. Originally proposed by DeepMind in a 2018 research paper, IMPALA introduced a model that made use of accelerators specialized for numerical calculations, taking advantage of the speed and efficiency from which supervised learning has benefited for years. At the center of IMPALA is an actor-based model of the kind commonly used to maximize concurrency and parallelization.
The architecture of an IMPALA-based DRL agent is separated into two main components: actors and learners. In this model, the actors typically run on CPUs and iterate between taking steps in the environment and running inference on the model to predict the next action. Frequently the actor will update the parameters of the inference model, and after collecting a sufficient amount of observations, will send a trajectory of observations and actions to the learner, which then optimizes the model. In this architecture, the learner trains the model on GPUs using input from distributed inference on hundreds of machines. From a computational standpoint, the IMPALA architecture enables the acceleration of learners using GPUs while actors can be scaled across many machines.
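The actor loop described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not the actual DeepMind implementation: `PolicyModel`, `run_actor`, and the `learner` interface are hypothetical stand-ins, and the policy is a toy linear layer. The key point it shows is that in IMPALA-style designs, inference runs locally on the actor's CPU and whole trajectories (observations plus actions) are shipped to the learner.

```python
import numpy as np

class PolicyModel:
    """Toy linear policy evaluated on the actor's CPU (illustrative only)."""
    def __init__(self, obs_dim, num_actions):
        self.weights = np.zeros((obs_dim, num_actions))

    def predict(self, obs):
        # CPU-side inference: pick the highest-scoring action.
        return int(np.argmax(obs @ self.weights))

    def set_weights(self, weights):
        self.weights = weights


def run_actor(env, model, learner, unroll_length=20, num_unrolls=100):
    """Alternate env steps and local inference; ship trajectories to the learner."""
    obs = env.reset()
    for _ in range(num_unrolls):
        # Periodically pull fresh parameters from the learner (a bandwidth cost).
        model.set_weights(learner.get_weights())
        trajectory = []
        for _ in range(unroll_length):
            action = model.predict(obs)            # inference on the actor's CPU
            next_obs, reward, done = env.step(action)
            trajectory.append((obs, action, reward, done))
            obs = env.reset() if done else next_obs
        # Observations *and* actions travel to the learner, which trains on GPU.
        learner.send_trajectory(trajectory)
```

Note how each actor carries its own copy of the model, which is exactly what the limitations below stem from.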
IMPALA set new standards in DRL architectures. However, the model has some intrinsic limitations:
· Using CPUs for neural network inference: The actor machines are usually CPU-based. As the computational needs of a model increase, the time spent on inference starts to outweigh the environment step computation. The workaround is to increase the number of actors, which increases cost and affects convergence.
· Inefficient resource utilization: Actors alternate between two tasks: environment steps and inference steps. The compute requirements of the two tasks are often dissimilar, which leads to poor utilization or slow actors.
· Bandwidth requirements: Model parameters, recurrent state and observations are transferred between actors and learners. Furthermore, memory-based models send large states, increasing bandwidth requirements.
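A back-of-envelope calculation shows how quickly the bandwidth cost adds up when every actor must receive a full copy of the parameters. The model size, actor count, and sync frequency below are assumed round numbers for illustration, not figures from the paper.

```python
# Illustrative bandwidth estimate for IMPALA-style parameter syncing.
# All inputs are assumptions, not measurements from the paper.
PARAMS = 30_000_000          # assumed model size (parameters)
BYTES_PER_PARAM = 4          # float32
NUM_ACTORS = 500             # assumed actor fleet size
SYNCS_PER_SECOND = 1         # each actor refreshes parameters once per second

per_sync_mb = PARAMS * BYTES_PER_PARAM / 1e6
aggregate_gb_s = per_sync_mb * NUM_ACTORS * SYNCS_PER_SECOND / 1e3

print(f"{per_sync_mb:.0f} MB per parameter sync")     # 120 MB
print(f"{aggregate_gb_s:.0f} GB/s across the fleet")  # 60 GB/s
```

Even under these modest assumptions, the aggregate transfer dwarfs typical datacenter network capacity, which is why keeping parameters local to the learner matters.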
Using the IMPALA actor model as an inspiration, Google worked on a new architecture that addresses some of the limitations of its predecessors for the scaling of DRL models.
At a high level, Google's SEED RL architecture looks remarkably similar to IMPALA, but it introduces a few variations that address some of the main limitations of the DeepMind model. In SEED RL, neural network inference is done centrally by the learner on specialized hardware (GPUs or TPUs), enabling accelerated inference and avoiding the data-transfer bottleneck by keeping the model parameters and state local to the learner. For every single environment step, the observations are sent to the learner, which runs inference and sends actions back to the actors. This clever solution addresses the inference limitations of models like IMPALA but might introduce latency challenges.
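The inverted division of labor can be sketched as follows. Again, the names (`CentralLearner`, `run_seed_actor`) and the toy linear policy are illustrative assumptions, not the real SEED RL API; the point is that the parameters never leave the learner, and the actor does nothing but step the environment.

```python
import numpy as np

class CentralLearner:
    """Keeps the policy parameters local; actors never receive them."""
    def __init__(self, obs_dim, num_actions):
        self.weights = np.zeros((obs_dim, num_actions))  # lives with the learner
        self.inference_calls = 0

    def inference(self, actor_id, obs):
        # In SEED RL this runs batched on GPUs/TPUs; here it is a toy argmax.
        self.inference_calls += 1
        return int(np.argmax(obs @ self.weights))


def run_seed_actor(env, learner, actor_id, num_steps):
    """The actor only steps the environment; each observation is shipped out."""
    obs = env.reset()
    for _ in range(num_steps):
        action = learner.inference(actor_id, obs)  # one round trip per env step
        obs, reward, done = env.step(action)
        if done:
            obs = env.reset()
```

The cost of this design is the per-step round trip, which is exactly the latency concern the next section addresses.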
To minimize the latency impact, SEED RL relies on gRPC for messaging and streaming. Specifically, SEED RL leverages streaming RPCs, in which the connection from actor to learner is kept open and metadata is sent only once. Furthermore, the framework includes a batching module that efficiently batches multiple actor inference calls together.
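The core idea of the batching module can be shown in a small sketch. This is a simplified, synchronous stand-in, assuming a hypothetical `policy_fn`; the real SEED RL batcher works asynchronously over streaming gRPC with TensorFlow, not a Python class like this. What it illustrates is that many per-actor requests are buffered and answered by a single batched forward pass.

```python
import numpy as np

class InferenceBatcher:
    """Buffers per-actor inference requests and runs them as one batch."""
    def __init__(self, policy_fn, batch_size):
        self.policy_fn = policy_fn   # maps a [B, obs_dim] batch to B actions
        self.batch_size = batch_size
        self.pending = []            # queued (actor_id, observation) pairs

    def submit(self, actor_id, obs):
        """Queue one actor's observation; flush when the batch is full."""
        self.pending.append((actor_id, obs))
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return {}

    def flush(self):
        """One batched call amortizes accelerator overhead across actors."""
        if not self.pending:
            return {}
        ids, observations = zip(*self.pending)
        actions = self.policy_fn(np.stack(observations))
        self.pending = []
        return dict(zip(ids, actions))  # per-actor results routed back
```

A production version would also flush on a timeout so a partially filled batch never stalls slow actors; that detail is omitted here for brevity.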
Diving deeper into the SEED RL learner architecture, there are three fundamental types of threads running:
1. Inference
2. Data prefetching
3. Training
Inference threads receive a batch of observations, rewards and episode-termination flags. They load the recurrent states and send the data to the inference TPU core. The sampled actions and new recurrent states are received, and the actions are sent back to the actors while the latest recurrent states are stored.

When a trajectory is fully unrolled, it is added to a FIFO queue or replay buffer and later sampled by data-prefetching threads. The trajectories are then pushed to a device buffer for each of the TPU cores taking part in training.

Finally, the training thread (the main Python thread) takes the prefetched trajectories, computes gradients using the training TPU cores and applies the gradients to the models on all TPU cores (inference and training) synchronously. The ratio of inference and training cores can be adjusted for maximum throughput and utilization.
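The thread pipeline above can be wired together with plain queues, as in this hedged sketch. The function names and the queue-based hand-off are assumptions for illustration; the real learner drives TPU cores through TensorFlow rather than passing Python lists around.

```python
import queue
import threading

def inference_thread(obs_queue, trajectory_queue, unroll_length):
    """Consume per-step observations; emit fully unrolled trajectories."""
    trajectory = []
    while True:
        obs = obs_queue.get()
        if obs is None:                       # shutdown sentinel
            trajectory_queue.put(None)
            return
        trajectory.append(obs)                # (plus sampled action, in reality)
        if len(trajectory) == unroll_length:  # fully unrolled -> FIFO queue
            trajectory_queue.put(trajectory)
            trajectory = []

def prefetch_thread(trajectory_queue, device_buffer):
    """Move completed trajectories into a per-core device buffer for training."""
    while True:
        trajectory = trajectory_queue.get()
        if trajectory is None:
            return
        device_buffer.put(trajectory)
```

The training thread (not shown) would sit in the main thread, draining `device_buffer`, computing gradients, and applying them to all cores synchronously.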
The SEED RL architecture allows learners to be scaled to thousands of cores, and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. SEED RL is based on the TensorFlow 2 API, and its performance is accelerated by TPUs.
To evaluate SEED RL, Google used common DRL benchmark environments such as the Arcade Learning Environment, DeepMind Lab environments, and the recently released Google Research Football environment. The results across all environments were remarkable. For instance, on the DeepMind Lab environment, SEED RL achieved 2.4 million frames per second with 64 Cloud TPU cores, an 80x improvement over the previous state-of-the-art distributed agent, IMPALA. Improvements in speed and CPU utilization were also observed.
SEED RL represents a major step forward in massively scalable DRL models. Google Research open-sourced the initial SEED RL implementation on GitHub. I can imagine it becoming the underlying model for many practical DRL implementations in the foreseeable future.