Lightweight and Parallel GPU Task Scheduling for Deep Learning

Published in SNU AIIS Blog · Mar 26, 2022

By Sue Hyun Park

Using GPUs for deep learning (DL) is standard practice, as they can perform computation concurrently. Recent DL frameworks like TensorFlow, PyTorch, and MXNet run models on GPUs to improve DL inference and training speed. Such frameworks automatically launch DL operators on GPUs, freeing users from manually handling GPU intricacies.

A DL operator represents numerical computations, like convolution and batch normalization, and consists of one or more GPU tasks: GPU kernels and GPU memory operations (e.g., memory copy).

To ultimately submit tasks to the GPU, a DL framework first represents a neural network as a computation graph of DL operators. Then it goes through a series of preparation steps, in which it selects the next operator to run and dispatches the proper GPU kernel(s) based on the shape of the input tensor. We call this series of steps GPU task scheduling.

How DL frameworks use GPUs
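To make this concrete, here is a highly simplified Python sketch of what such a run-time loop might look like. The graph traversal and the helper functions (select_kernel, allocate_output, launch) are purely hypothetical and do not correspond to any framework's actual API; they only stand in for the preparation steps described above.

```python
# Hypothetical sketch of a framework's run-time loop (illustration only).
# Every call repeats the same preparation steps before any GPU work is submitted.
def run_model(graph, inputs):
    tensors = dict(inputs)
    for op in graph.topological_order():            # select the next operator to run
        in_tensors = [tensors[name] for name in op.inputs]
        kernel = select_kernel(op, in_tensors)      # pick a GPU kernel for these tensor shapes
        out = allocate_output(op, in_tensors)       # reserve GPU memory for the result
        launch(kernel, in_tensors, out)             # finally, submit the GPU task
        tensors[op.output] = out
    return tensors[graph.output]
```

Everything before the launch call is GPU task scheduling; only the launch itself puts work on the GPU.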

Existing DL frameworks conduct GPU task scheduling at run time, and this can significantly limit framework performance: the execution of a neural network can take much longer than the amount of computation assigned to the GPU would warrant.

Current DL frameworks may encounter suboptimal situations where they cannot fully utilize the computation power of GPUs.

This post introduces Nimble, a DL execution engine that solves the inefficiencies of existing DL frameworks. Our research paper appeared at NeurIPS 2020. First, we point out two important problems in run-time GPU task scheduling. Next, we describe two novel techniques the system employs to improve framework performance.

Why Run-Time GPU Task Scheduling Is a Problem

1. High Scheduling Overhead Makes GPUs Idle

We experimentally observe that in existing DL frameworks, GPU idle time dominates the overall running time of DL execution, as the left graph shows. Both TensorFlow and PyTorch leave their GPUs idle for a substantial portion of the running time, up to 71% and 91%, respectively.

How do we know that the performance bottleneck lies specifically in the framework's scheduling procedure? We write a C++ program that performs inference for one specific neural network only: it launches the same GPU tasks as PyTorch, but some of the scheduling overhead is removed by hardcoding. This lightweight version achieves up to 2.37 times speedup over PyTorch inference, as the right graph shows. The result confirms that the main source of GPU idle time is the prohibitive overhead of the scheduling procedure.

Left: Ratio of GPU idle time to the overall running time on DL inference (batch size: 1). Right: Inference latencies of PyTorch and its scheduling-minimized version

2. Non-Parallel GPU Task Execution

GPUs have a powerful capability to run thousands of threads in parallel. There are two ways to fully utilize the GPU power:

  1. Making the most of GPU cores in a single GPU stream 🤔 This means making every GPU kernel maintain a sufficient level of intra-kernel parallelism, so that each kernel can take full advantage of the GPU’s computation power on its own.
  2. Running multiple GPU tasks in parallel using multiple GPU streams ✅ A single GPU kernel may not be enough! This method takes advantage of the fact that, unlike kernels on the same stream, kernels on different streams can be computed concurrently.

A GPU stream is a queue of GPU tasks in which the tasks are executed sequentially, in FIFO (First-In-First-Out) order.

Depending on the implementation of a kernel and the size of the tensors it computes on, the level of intra-kernel parallelism can fall short, so the first method does not guarantee stable efficiency. The second method is a more reliable way to fully utilize GPU computing power in all cases. Still, implementing it is challenging because existing DL frameworks are designed and optimized to submit GPU kernels to a single GPU stream, so they miss the opportunity for parallelization.
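For reference, multiple streams can be used by hand from PyTorch. The snippet below is a minimal illustration of the second method (our own example, not Nimble's code): two independent matrix multiplications are submitted to two different streams so that the GPU may overlap them. Nimble automates exactly this kind of stream management.

```python
import torch

# Two independent GPU tasks submitted to two different streams.
# Tasks in the same stream run in FIFO order; tasks in different
# streams may run concurrently if the GPU has spare resources.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")

s1 = torch.cuda.Stream()
s2 = torch.cuda.Stream()

with torch.cuda.stream(s1):
    x = a @ a                 # submitted to stream s1
with torch.cuda.stream(s2):
    y = b @ b                 # submitted to stream s2, independent of x

torch.cuda.synchronize()      # wait for both streams before using x and y
```

Doing this by hand for a whole neural network, while keeping dependent kernels correctly ordered, is what makes the approach hard in practice.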

Current frameworks schedule GPU tasks to be executed one at a time, but parallel execution can make better use of the GPU.

In fact, if GPU tasks are fully parallelized and executed concurrently on a powerful GPU, inference latency (represented by critical path time in the left graph) can be reduced by up to 3 times. However, the scheduling overhead of DL frameworks, largely rooted in run time, impedes this potential parallel execution so much that the GPU ends up executing tasks in serial order.

Left: Ratio of critical path time to the GPU active time on DL inference. Right: Even if GPU tasks A and B are scheduled in different streams, the scheduling overhead creates a wide gap between the start time of the tasks.

The “Nimble” DL Execution Engine

Motivated by the observations on existing DL frameworks, we present Nimble, a DL execution engine that aims to schedule GPU tasks:

  • with minimal scheduling overhead
  • to be executed in parallel
  • without redesigning the framework runtime

In resolving the inefficiencies of GPU task scheduling, we focus on one observation: for static neural networks, DL frameworks run the same set of GPU tasks over and over across iterations. So, how can we exploit this repetition instead of redoing the same work for every execution? This question drives the design of Nimble, shown in the figure below. The two key techniques that realize the idea are explained in depth in the following sections.

System overview of Nimble

1. Ahead-of-time (AoT) Scheduling

Nimble schedules GPU tasks only once, ahead of time. Later, when Nimble is given an input, it skips scheduling and proceeds immediately to GPU task submission, simply reusing the scheduling result obtained before run time.

The idea is that the scheduling procedure can be hoisted out of the run-time execution loop, much like loop-invariant code motion in compilers. Doing so significantly shortens the run-time loop, speeding up execution.

We hoist the scheduling work out of the run-time execution loop, placing the corresponding steps ahead of time.
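In terms of the hypothetical run-time loop sketched earlier, the transformation looks roughly like this. The names are again illustrative only, not Nimble's actual API.

```python
# Before: scheduling repeated inside the run-time loop (hypothetical sketch).
def run(graph, inputs):
    for op in graph.topological_order():
        kernel = select_kernel(op, ...)        # repeated for every input
        launch(kernel, ...)

# After: the loop-invariant scheduling work is hoisted out and done once.
schedule = [select_kernel(op, ...) for op in graph.topological_order()]

def run_fast(inputs):
    for kernel in schedule:                    # run-time loop only submits GPU tasks
        launch(kernel, ...)
```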

How do we move the scheduling work out of run time? Unfortunately, GPU task scheduling is composed of many small preparation steps, and it is hard to spot and extract each one. Instead, Nimble identifies the non-scheduling work in the run-time execution loop: the GPU tasks themselves. Nimble takes advantage of the fact that the computation of a given static neural network is fully described by a fixed set of GPU tasks. In other words, once the execution engine has completely identified those GPU tasks, it does not have to schedule tasks for every input given at run time.

This calls for recording an execution trace that contains all the essential information produced by scheduling. Nimble pre-runs the given neural network once, at the initialization phase, performing the usual task scheduling and submission steps; the figure below gives more details. Along with the execution trace, Nimble’s AoT scheduler reserves the memory allocated during the pre-run and packs both into a task schedule. Nimble’s runtime then executes the neural network by replaying the recorded tasks according to the task schedule. Thus, the GPU tasks can be executed independently of the base DL framework, without scheduling overhead during DL execution.

AoT GPU task scheduler and Runtime of Nimble. Dashed arrows represent the interception of GPU tasks and memory requests by the AoT scheduler.
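Nimble's capture-and-replay machinery is built into the engine itself, but PyTorch users can get a feel for the pattern from the CUDA Graphs API that PyTorch later exposed: it likewise records the GPU tasks of one execution and replays them without per-operator work on the CPU. The snippet below is only an analogy to illustrate the pattern, not Nimble's implementation.

```python
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
static_input = torch.randn(1, 3, 224, 224, device="cuda")

# Warm-up iterations on a side stream, as recommended before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# "Pre-run": capture the fixed set of GPU tasks once.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Run time: copy new data into the captured input buffer and replay the tasks.
static_input.copy_(torch.randn(1, 3, 224, 224, device="cuda"))
g.replay()                       # no per-operator scheduling on the CPU
print(static_output.argmax().item())
```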

2. Automatic Multi-Stream Execution

Nimble schedules GPU tasks to run in parallel by submitting them to multiple GPU streams on a single GPU. To do so, we devise an efficient algorithm that assigns GPU tasks to streams and determines the proper synchronizations across streams.

Given a computation graph as input, Nimble automatically rewrites the graph by marking each operator with a GPU stream and embedding stream synchronization routines into the graph, as the figure below shows (a toy illustration follows the figure). To elaborate, Nimble first finds a stream assignment, a mapping from the node set of the computation graph to the stream set of the GPU. Then, to guarantee the execution order of streams containing dependent GPU tasks, Nimble issues and activates a special type of DL operator that acts as a barrier across specific streams. The stream assignment algorithm used here meets two goals:

  • Maximum logical concurrency: parallelize as many operators as possible.
  • Minimum number of synchronizations: launch tasks as fast as possible.

Rewriting the computation graph using the stream assignment algorithm (a detailed explanation is in our paper). The output graph contains the optimized execution order.
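The exact algorithm and its optimality guarantees are in the paper. As a rough intuition only (this is not the paper's algorithm), the two ingredients can be sketched as follows: walk the graph in topological order, keep a chain of dependent operators on the same stream, open a new stream for each independent branch, and insert a synchronization wherever an edge crosses streams.

```python
def assign_streams(nodes, preds):
    """Toy stream assignment for illustration (not the paper's algorithm).
    nodes: op ids in topological order; preds: op id -> list of parent op ids."""
    stream_of = {}
    next_stream = 0
    claimed = set()   # parents whose stream has already been passed on to a child
    syncs = []        # (op, parent) pairs where op's stream must wait for the parent
    for n in nodes:
        inherit = next((p for p in preds[n] if p not in claimed), None)
        if inherit is None:
            stream_of[n] = next_stream        # independent branch: open a new stream
            next_stream += 1
        else:
            stream_of[n] = stream_of[inherit] # stay on this parent's stream (no sync for this edge)
            claimed.add(inherit)
        for p in preds[n]:
            if stream_of[p] != stream_of[n]:
                syncs.append((n, p))          # cross-stream edge: insert a barrier
    return stream_of, syncs

# Example: a diamond-shaped graph  A -> {B, C} -> D
streams, syncs = assign_streams(
    ["A", "B", "C", "D"],
    {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]},
)
print(streams)   # {'A': 0, 'B': 0, 'C': 1, 'D': 0}
print(syncs)     # [('C', 'A'), ('D', 'C')]
```

In Nimble, such cross-stream dependencies are enforced by the barrier-like operators described above.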

Given the modified graph as input, Nimble’s AoT scheduler embeds the stream mapping and synchronization information in the task schedule, which allows every run-time GPU task execution to be parallelized without additional overhead.

Evaluation on DL Inference and DL Training

We implement Nimble on top of PyTorch, which is used for the pre-run of AoT task scheduling, and use an NVIDIA V100 GPU for the main evaluation.

With PyTorch as the baseline, Nimble’s inference speed outperforms PyTorch by up to 22.3 times and TensorRT by up to 2.8 times.

Relative inference speedup of Nimble and other systems on the baseline of PyTorch (batch size 1, NVIDIA V100). Various neural networks were trained on ImageNet.

We also observe that Nimble’s multi-stream execution provides a significant DL inference speedup compared to its single-stream counterpart. Additionally, Nimble exploits logical concurrency up to a degree of 15, a level that would be impractical to reach with manual stream assignment and synchronization.

Impact of the multi-stream execution of Nimble on DL inference, compared to its single-stream counterpart (NVIDIA V100). Deg. stands for the maximum degree of logical concurrency of each architecture. #MACs stands for the number of multiply-and-accumulate operations, i.e. the amount of computation.

Since DL training is commonly conducted with large batch sizes, GPU scheduling overhead and single-stream execution have a less pronounced effect on framework performance, so Nimble’s improvement on training is limited. In contrast, Nimble’s techniques do pay off for DL training on small inputs, such as the CIFAR-10 dataset used in computer vision.

Relative training speedup of Nimble and TorchScript on the baseline of PyTorch (batch size 32)

Summary and Broader Impact

Thanks to GPUs’ parallel computing capabilities, the burden of running billions of computations and adjusting the weights of neural networks is lifted. A variety of modern DL frameworks abstract away the complexities of GPU computing and have contributed to many advances in deep learning. That said, it is natural to expect DL frameworks to utilize the computation power of GPUs to the fullest. However, many popular frameworks fall short of this and exhibit performance slowdowns due to their inefficient GPU task scheduling architecture.

Therefore, we design an execution engine atop PyTorch to resolve the underlying problems of GPU task scheduling. Rather than removing the framework overhead, Nimble captures the core DL computations (i.e., GPU kernel execution trace) and prepares an environment for executing the captured trace. Nimble also automatically parallelizes and synchronizes multiple GPU streams.

Nimble can remedy the existing problems, but it is not a fundamental solution. One fact concerns us: DL frameworks still leave GPUs largely idle and forgo parallel execution in the thousands of neural networks running right now. One might argue that keeping every GPU core busy is a low priority in many situations, and thus that run-time scheduling in existing DL frameworks is a minor issue. We recognize that such situations are common; nevertheless, we deem it harmful to ignore the cases that do need more efficient GPU task scheduling for accelerated neural network execution.

It is time to rethink the DL framework design associated with GPU processing.

We think that DL frameworks should address these holes in their GPU processing. Run-time GPU task scheduling hampers the best utilization of GPU power, and many DL models miss out on potential performance gains. We hope our approach is adopted to further improve the design and optimization principles of DL frameworks.

Acknowledgement

This blog post is based on the following paper:

  • Woosuk Kwon*, Gyeong-In Yu*, Eunji Jeong, and Byung-Gon Chun (* equal contribution), Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning, 34th Conference on Neural Information Processing Systems (NeurIPS), Spotlight, 4 Dec 2020. (arXiv, NeurIPS Proceedings, code)

To learn more about the stream assignment algorithm, see the four theorems and their proofs that the authors give in the paper.

We would like to thank Gyeong-In Yu for providing valuable insights for this blog post. The views and opinions expressed in this blog are solely those of the authors.

Originally posted on our Notion blog on May 18, 2021.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research across the university.