Writing Fast Deep Q Learning Pipelines on Commodity Hardware
The state of the art in Deep Reinforcement Learning has been ramping up in scale over time, and it's becoming ever more difficult to reproduce those results on commodity hardware.
Previous works have shown that with enough optimization, patience, and time, we can get pretty close. To this end, I began to study how to write efficient training pipelines with the goal of zero downtime: the GPU must never stall waiting for data (on both the training and inference ends), and the pipeline must be able to sustain the full throughput the system is capable of.
How NOT to Write Pipelines
Most Deep RL baselines are written in a synchronous style.
There are benefits to this:
1) Simplicity: the baseline is just that, a baseline. They typically try to keep the code as simple as possible to get the salient points across. Writing fast code often means making code less readable and more error-prone.
2) Benchmarks: Keeping things sequential makes it easy to do apples-to-apples comparisons of how many [samples | episodes | training-steps | <insert-metric-here> ] algorithms take compared to one another. When the stages are instead left to run as fast as they can, disparities can simply be down to how fast one stage processes data relative to another.
The problem is performance: inference, learning, and environment rollouts all block each other because of the data dependencies between them. Prior art has shown, however, that these dependencies can be weakened enough to run all three as separate processes, each simply executing as fast as it can.
Isolating the Learner Process: APEX
One exemplary way of isolating the training process is Horgan et al.'s APEX DQN (sketched in code after the list below):
- The replay memory lives in its own asynchronous process. It exposes an API that other processes use to add transitions and to sample minibatches.
- Inference and environment stepping still run sequentially in their own process (the "Actor" process), which queues transitions into the replay process.
- The Training process is asynchronous too: it runs an infinite loop, pulling minibatches from the Replay as fast as it can and training on them.
- The Training (or "Learner") process periodically pipes new network parameters back to the Actor process.
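Here's a minimal sketch of that process layout using Python's multiprocessing, just to make the wiring concrete. The make_env / make_policy factories and the policy's act, train_step, state_dict, and load_state_dict methods are stand-ins for whatever environment and Q-network implementation you use, not APEX's actual API:

```python
import itertools
import multiprocessing as mp
import random
from collections import deque

def actor_proc(sample_q, param_q, make_env, make_policy):
    # Sequential inference + environment stepping; transitions are shipped to the replay process.
    env, policy = make_env(), make_policy()
    obs = env.reset()
    while True:
        action = policy.act(obs)                    # placeholder single-observation inference
        next_obs, reward, done, _ = env.step(action)
        sample_q.put((obs, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs
        while not param_q.empty():                  # pick up weights whenever the learner publishes them
            policy.load_state_dict(param_q.get())

def replay_proc(sample_q, batch_q, capacity=1_000_000, batch_size=512):
    # Owns the replay buffer: drains incoming transitions, serves uniform random minibatches.
    buffer = deque(maxlen=capacity)
    while True:
        while not sample_q.empty():
            buffer.append(sample_q.get())
        if len(buffer) >= batch_size:
            batch_q.put(random.sample(buffer, batch_size))

def learner_proc(batch_q, param_q, make_policy, publish_every=100):
    # Trains on minibatches as fast as they arrive; periodically pipes weights back to the actor.
    policy = make_policy()
    for step in itertools.count():
        policy.train_step(batch_q.get())            # placeholder DQN update
        if step % publish_every == 0:
            param_q.put(policy.state_dict())
```

Each stage is launched as its own mp.Process, with queues as the only coupling between them: the actor never waits on the learner, and the learner only waits when the replay can't produce minibatches fast enough.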
This allows us to treat training the same way we treat ordinary deep learning, and the same pipelining tricks that are standard there apply here too:
To ensure the GPU never stalls waiting for data, we parallelize data loading in the Trainer process: minibatches from the Replay are asynchronously copied to the GPU while the training step runs, so they are always ready for use. We use the same trick for getting parameters off the GPU without slowing training down.
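In PyTorch terms, that prefetching looks something like the sketch below: host-to-device copies are issued on a side CUDA stream with non_blocking=True so they overlap with compute on the default stream. It assumes batch_iter yields tuples of pinned-memory CPU tensors pulled off the replay queue (without pinned memory the "asynchronous" copies silently become synchronous):

```python
import torch

class CudaPrefetcher:
    """Keeps the next minibatch in flight to the GPU while the current training step runs."""

    def __init__(self, batch_iter, device="cuda"):
        self.batch_iter = batch_iter
        self.device = device
        self.copy_stream = torch.cuda.Stream()
        self.next_batch = None
        self._preload()

    def _preload(self):
        try:
            cpu_batch = next(self.batch_iter)
        except StopIteration:
            self.next_batch = None
            return
        # Issue the copies on a separate stream so they run concurrently with training compute.
        with torch.cuda.stream(self.copy_stream):
            self.next_batch = [t.to(self.device, non_blocking=True) for t in cpu_batch]

    def next(self):
        # Make the compute stream wait for the in-flight copy, hand the batch over,
        # and immediately start copying the one after it.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        batch = self.next_batch
        if batch is not None:
            for t in batch:
                t.record_stream(torch.cuda.current_stream())  # tensor is now consumed on the compute stream
        self._preload()
        return batch
```

Getting updated parameters off the GPU for the actors is the same trick in reverse: a non-blocking device-to-host copy into pinned memory on a side stream.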
[Future Note]: We can also use tricks like virtualization to expand the capacity of the replay buffer, and pipeline it aggressively so we don't lose throughput, but that's a story for another day.
While this is a huge faff, it leads to a significant speedup in wall-clock time.
The speedup comes both from faster training and from how quickly we gather data. With aggressive prefetching and tuned batch sizes, we can already max out GPU utilization for training, but there are still improvements to be made on the inference side.
SampleFactory and SeedRL: Isolating Inference
The main flaw with APEX is that it still needs an instance of the network for every environment, and each actor still runs inference and environment stepping sequentially. This causes three main issues:
- Memory Waste: it requires multiple instances of the same network, which leaves less room in RAM for the replay buffer.
- Compute Waste: inference can't be vectorized across environments, even though in principle the computation should allow for it.
- Latency: if inference runs on the GPU, we incur further latency from synchronously copying observations to the GPU and from all the redundant kernel launches.
To get around this, prior art batches the environments synchronously (i.e. waits for all environments to return observations, then concatenates them into a batch) with so-called vectorized environments.
These are widely used in on-policy deep RL and get past the first two problems (memory and compute waste), since they only require a single network instance operating on the batch of observations.
But they do nothing to alleviate the latency problem, and in fact worsen it: while they are canonically called vectorized environments, the environments themselves typically aren't vectorized at all. They run asynchronously, then stop and synchronize when they return observations. Since environments complete their steps at different times, the fast ones just stall waiting for the slow ones to catch up.
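Here's a toy sketch of what such a "vectorized" environment actually does. Real implementations usually run each sub-environment in its own worker process rather than a plain loop, but the synchronization point is the same:

```python
import numpy as np

class SyncVecEnv:
    """Steps every sub-environment, then stacks the results into one batch.
    The batch is only ready once the slowest environment has finished its step."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return np.stack([env.reset() for env in self.envs])

    def step(self, actions):
        results = []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)  # fast environments have already finished and sit idle
            if done:
                obs = env.reset()                       # auto-reset so the batch shape stays fixed
            results.append((obs, reward, done, info))
        obs, rewards, dones, infos = zip(*results)
        # A single network instance can now act on the whole observation batch,
        # but only after every environment has returned.
        return np.stack(obs), np.array(rewards), np.array(dones), infos
```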
Two new works (SeedRL by Espeholt et al. and Petrenko et al.'s Sample Factory) concurrently address this issue with the same general idea:
Split the "Actor worker" into two: the "Policy workers", which are responsible for model inference, and the "Rollout workers", which are responsible for running the environment simulation.
These all run asynchronously, simply processing data as fast as they get it.
The Rollout workers send observations to the Policy workers through a shared queue.
The Policy workers grab all queued observations and process them as a batch, then pass the predicted actions back to the Rollout workers.
This allows the environments that finished later to queue their data in parallel while the policy workers are busy with the earlier environments.
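Here's a sketch of the two worker loops, again with illustrative names rather than SampleFactory's or SeedRL's actual APIs: request_q is the shared observation queue, action_qs[worker_id] is each rollout worker's private return queue, and policy.act_batch stands in for a batched forward pass.

```python
import queue

def policy_worker(request_q, action_qs, policy, max_batch=256, straggler_wait=0.005):
    # Drain whatever observations have accumulated, run one batched forward pass,
    # and route each action back to the rollout worker that asked for it.
    while True:
        requests = [request_q.get()]                    # block until at least one observation arrives
        while len(requests) < max_batch:
            try:
                requests.append(request_q.get(timeout=straggler_wait))
            except queue.Empty:
                break                                   # no more stragglers; run with what we have
        worker_ids, observations = zip(*requests)
        actions = policy.act_batch(observations)        # placeholder batched inference
        for worker_id, action in zip(worker_ids, actions):
            action_qs[worker_id].put(action)

def rollout_worker(worker_id, request_q, action_q, make_env):
    # Pure environment simulation: no network instance lives in this process at all.
    env = make_env()
    obs = env.reset()
    while True:
        request_q.put((worker_id, obs))                 # queue the observation for batched inference
        action = action_q.get()                         # wait only for this environment's own action
        obs, reward, done, _ = env.step(action)
        if done:
            obs = env.reset()
```

How long the policy worker waits for stragglers (straggler_wait here) is a tunable: a slightly longer wait buys bigger, more efficient batches at the cost of a little latency.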
Further, this lets us hide data-loading latency the same way we do for the trainer: by running it as another asynchronous pipeline stage.
We can then tune the number of environments running simultaneously and the maximum batch size the model can take, with much better scalability, and max out inference throughput as well (… at least until we exceed the capacity of some other pipeline stage).