DeepCuts: a Deep Learning Optimization Framework for Versatile GPU Workloads

SNU AI · Published in SNU AIIS Blog · Apr 1, 2022

By Seyeon An

Battle of Deep Learning Frameworks

Widely used deep learning optimization frameworks

If you are a deep learning engineer, or more generally if you work in the field of artificial intelligence, you have almost certainly used one of these deep learning frameworks: TensorRT by NVIDIA, PyTorch by Facebook, or TensorFlow by Google. Deep learning technology is essential everywhere, from artificial intelligence to big data. That is why these frameworks, which allow engineers to build optimized deep learning models quickly without digging into the details of the underlying algorithms, matter so much. It is also why most Big Tech companies are investing in building the best deep learning framework: companies like Google and Facebook have put enormous effort into developing state-of-the-art deep learning software.

DeepCuts

So which software has won this battle of deep learning frameworks? The question is hard to answer, since even the most popular and most extensively used DL frameworks have their own problems. Yet DeepCuts, developed by researchers in the Thunder Research Group at Seoul National University, is definitely one worth noting.

The Problem with Existing Deep Learning Frameworks

The Performance-Flexibility Trade-off Dilemma that DeepCuts Aims to Solve

GPUs are the de facto standard for running DL applications. Almost every widely used DL framework, such as TensorFlow, PyTorch, and MXNet, supports GPU acceleration via cuDNN, provided by NVIDIA. cuDNN is the state-of-the-art DL primitive library: it supplies highly optimized implementations of the primitive operations, the smallest units of processing, that accelerate DL computations.

However, using a primitive library such as cuDNN does not guarantee the best performance. Primitive libraries show poor performance as the convolutions of the deep learning network and the underlying hardware become more diverse. Moreover, they lack general kernel fusion functionality (kernel fusion is a well-known optimization that reduces GPU global memory accesses between consecutively executed kernels by merging them into a single kernel). cuDNN supports kernel fusion only for a few DL workload patterns, such as a sequence of a convolution, a bias addition, and a ReLU activation, which is not sufficient to handle the various operation patterns found in emerging DL workloads.
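
To make kernel fusion concrete, here is a minimal NumPy sketch (illustrative only, not DeepCuts code): the unfused version performs a bias addition and a ReLU in two passes, materializing an intermediate array that stands in for the tensor a GPU would write to and re-read from global memory between two kernel launches, while the fused version does both in a single pass.

```python
import numpy as np

def bias_relu_unfused(x, bias):
    # "Kernel" 1: bias addition writes an intermediate tensor
    # (standing in for a round trip through GPU global memory).
    tmp = np.empty_like(x)
    for i in range(x.shape[0]):
        tmp[i] = x[i] + bias
    # "Kernel" 2: ReLU re-reads that intermediate tensor.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = np.maximum(tmp[i], 0.0)
    return out

def bias_relu_fused(x, bias):
    # A single "kernel": each row is read once, and the intermediate
    # value never needs to be written back before the ReLU.
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = np.maximum(x[i] + bias, 0.0)
    return out

x = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8).astype(np.float32)
assert np.allclose(bias_relu_unfused(x, b), bias_relu_fused(x, b))
```

On a real GPU, the benefit comes from eliminating the extra round trip to global memory and the second kernel launch.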

There have also been DL frameworks that do not rely on hand-tuned GPU kernels, but their performance has been relatively poor. In other words, previous DL optimization frameworks, or DL compilers, faced a trade-off dilemma between speed (performance) and flexibility:

Category 1: frameworks that heavily rely on hand-tuned GPU kernels

  • TensorFlow XLA and TensorRT
  • Still use hand-tuned kernels for core routines (e.g., convolution)
  • Fast, but not very flexible

Category 2: frameworks that use an ML-based GPU kernel optimizer

  • TVM and Tensor Comprehensions
  • Optimize kernels using ML-based performance estimation models
  • Flexible, but the performance is relatively poor

Then, How about DeepCuts?

DeepCuts, a DL optimization framework like the ones above, considers both kernel implementation parameters and GPU architecture parameters to generate optimized code. In DeepCuts, a flexible code generator supports versatile types of DL operations, and the kernel optimizer uses architectural information about the target GPU. As a result, DeepCuts achieves higher performance than existing state-of-the-art DL optimization frameworks (Apache TVM, Google TensorFlow XLA, and NVIDIA TensorRT).

Overall Structure of DeepCuts

Then how does DeepCuts achieve such a difference?

DeepCuts takes the whole computational graph of a given workload and generates a corresponding set of GPU kernels.

The input graph describes the computation and data flow of a DNN model. An edge of the graph represents a tensor of data, and a node represents a tensor operation such as a convolution. When the graph is executed on the GPU, each node corresponds to a GPU kernel call or a call to a DNN library function such as cuDNN. The input graph is similar to the computational graph of PyTorch or TensorFlow.
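
For intuition, a toy Python encoding of such an input graph for a convolution, bias-add, ReLU block might look like the following (the names and dictionary layout are purely illustrative, not DeepCuts' actual data structures):

```python
# A toy input graph for a conv -> bias-add -> ReLU block.
# Each node is a tensor operation; each edge names the tensor it carries.
nodes = {
    "conv1": {"op": "conv2d",   "inputs": ["input", "conv1_weights"]},
    "bias1": {"op": "bias_add", "inputs": ["conv1_out", "conv1_bias"]},
    "relu1": {"op": "relu",     "inputs": ["bias1_out"]},
}
edges = [
    ("conv1", "bias1", "conv1_out"),  # tensor produced by conv1, consumed by bias1
    ("bias1", "relu1", "bias1_out"),
]

# Without fusion, each node maps to one GPU kernel call
# (or one cuDNN library call) when the graph is executed.
for name, node in nodes.items():
    print(f"{name}: launch a kernel (or cuDNN call) for '{node['op']}'")
```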

The diagram below shows the overall workflow of DeepCuts:

Overall Workflow of DeepCuts

As shown in the figure above, DeepCuts consists of four modules: candidate generator, performance estimator, code generator, and code selector. Let’s take a look at each of those:

1) Candidate Generator: Finds the Best Way of Fusion

Workflow of the Candidate Generator

When it comes to generating code from a given deep learning workload, there are many options for fusing operations. The candidate generator generates multiple partitions of the graph as code generation candidates, evaluates each of them, and identifies the partition with the best performance.

It enumerates the ways of fusion and passes them to the following modules. It does not evaluate all cases, though: using a greedy search algorithm, it filters out the cases that are likely to show low performance. This novel performance-model-driven code generation algorithm considers the two critical factors for performance improvement, fusion and parameter search, simultaneously. Thanks to a simple but powerful performance model, the algorithm finds the best-performing kernel fusion method, together with the best-performing implementation parameters, in a relatively small amount of time.

The numbers displayed on the right of the candidate partitions in the candidate generator module are the execution times of the given partitions, obtained as the output of the three other modules. To check the performance of each subset in a partition, the candidate generator uses the other modules: the performance estimator, the code generator, and the code selector.
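
The sketch below illustrates, in simplified Python, what a greedy, performance-model-driven partition search can look like. The function names, the toy cost model, and the fusion-legality check are hypothetical stand-ins; the real candidate generator works on the full computational graph together with the other three modules.

```python
def greedy_partition(op_sequence, estimate_time, can_fuse):
    """Greedily grow fused groups over a linear operation sequence.

    `estimate_time(group)` returns an estimated execution time for a
    candidate fused group, and `can_fuse(group)` says whether a single
    kernel could be generated for it. Both are stand-ins for DeepCuts'
    performance estimator and code generator.
    """
    partition, current = [], [op_sequence[0]]
    for op in op_sequence[1:]:
        fused = current + [op]
        # Keep fusing while it is legal and the estimate says fusion is
        # no slower than launching the next op as a separate kernel.
        if can_fuse(fused) and estimate_time(fused) <= (
            estimate_time(current) + estimate_time([op])
        ):
            current = fused
        else:
            partition.append(current)
            current = [op]
    partition.append(current)
    return partition

# Toy usage: elementwise ops fuse cheaply; at most one conv per group.
ops = ["conv", "bias_add", "relu", "conv", "bias_add", "relu"]
est = lambda group: 1.0 + 0.1 * sum(op == "conv" for op in group)
fuse_ok = lambda group: sum(op == "conv" for op in group) <= 1
print(greedy_partition(ops, est, fuse_ok))
# [['conv', 'bias_add', 'relu'], ['conv', 'bias_add', 'relu']]
```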

2) Performance Estimator: Predicts the Performance

Workflow of the Performance Estimator

For each subset of the given partition, the performance estimator searches for kernel implementation parameters, such as the number of total kernel threads, the thread block size, and the number of output features per thread.

It uses a simple performance estimation model to search for the combinations of implementation parameters that are most likely to yield the best performance for the subset, also taking into account the information given by the GPU architecture parameters.

DeepCuts’ performance estimator does not estimate the exact performance. Instead, it estimates a performance upper bound and prunes the cases that are definitely slow. This difference in approach makes DeepCuts more accurate than previous works: unlike DeepCuts, which uses GPU architecture parameters to estimate the upper bound, previous works tried to estimate the performance itself, which lowered accuracy.
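
A simplified, roofline-style version of this idea looks as follows. Estimating a performance upper bound is equivalent to estimating a lower bound on execution time from GPU architecture parameters such as peak FLOP/s and peak memory bandwidth; any candidate whose time lower bound already exceeds the best time found so far can never win and is pruned. The numbers and formulas below are illustrative assumptions, not DeepCuts' exact model.

```python
def time_lower_bound_seconds(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline-style bound: a kernel can never run faster than either its
    compute time at peak FLOP/s or its memory time at peak bandwidth."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Hypothetical V100-like architecture parameters.
PEAK_FP32_FLOPS = 15.7e12   # FLOP/s
PEAK_MEM_BW     = 900e9     # bytes/s

def prune_candidates(candidates, best_measured_time):
    """Drop parameter combinations whose upper bound on performance
    (i.e., lower bound on time) already loses to the best kernel so far."""
    survivors = []
    for c in candidates:
        bound = time_lower_bound_seconds(
            c["flops"], c["bytes"], PEAK_FP32_FLOPS, PEAK_MEM_BW
        )
        if bound < best_measured_time:  # only keep candidates that *could* win
            survivors.append(c)
    return survivors

candidates = [
    {"name": "tile_32x32", "flops": 2.4e9, "bytes": 40e6},
    {"name": "tile_8x8",   "flops": 2.4e9, "bytes": 400e6},  # memory-bound layout
]
print([c["name"] for c in prune_candidates(candidates, best_measured_time=3e-4)])
# ['tile_32x32']
```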

3) Code Generator: Generates Possible Codes

Workflow of the Code Generator

For each group of kernel implementation parameters found for the subset, the code generator generates a GPU kernel. Of course, subsets with very low estimated performance have already been filtered out by the performance estimator and are excluded from this process.

The code generator produces code for versatile DL operations, including small-batch inference, large-batch inference, and training, as well as their fusions. Yet it is difficult to implement a code generator without an intermediate representation, since there are too many types of operations (individual deep learning operations and their fusion patterns) to represent directly as code.

Therefore, DeepCuts relies on data-flow graphs (DFGs), each of which represents the computation of a tensor operation, as its intermediate representation. A DFG is a graph in which each node represents an operation (addition, subtraction, etc.) and each edge represents a dependence between operations. Each deep learning operation (e.g., a convolution) can be represented as a DFG, and the fusion of operations can be represented as a connection between two DFGs. DeepCuts therefore performs code generation in two stages: first, it generates DFGs for the target operations; second, it generates code from those DFGs.
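
As a rough illustration, a toy DFG encoding and a fusion step that connects a producer DFG to a consumer DFG might look like this (the representation is hypothetical; DeepCuts' actual IR and fusion rules are more elaborate):

```python
# A toy data-flow-graph (DFG) encoding: each node is a fine-grained
# operation, and edges record which node each node depends on.
bias_add_dfg = {
    "nodes": {"load_x": "load", "load_b": "load", "add": "add", "store_y": "store"},
    "edges": [("load_x", "add"), ("load_b", "add"), ("add", "store_y")],
}
relu_dfg = {
    "nodes": {"load_y": "load", "max0": "max_with_zero", "store_z": "store"},
    "edges": [("load_y", "max0"), ("max0", "store_z")],
}

def fuse(producer, consumer, producer_out, consumer_in):
    """Fuse two DFGs by connecting the producer's output directly to the
    consumer's input, eliminating the intermediate store/load pair."""
    nodes = {**producer["nodes"], **consumer["nodes"]}
    edges = [e for e in producer["edges"] if e[1] != producer_out]
    edges += [e for e in consumer["edges"] if e[0] != consumer_in]
    edges.append((
        next(src for src, dst in producer["edges"] if dst == producer_out),
        next(dst for src, dst in consumer["edges"] if src == consumer_in),
    ))
    for dead in (producer_out, consumer_in):
        nodes.pop(dead)
    return {"nodes": nodes, "edges": edges}

fused = fuse(bias_add_dfg, relu_dfg, "store_y", "load_y")
print(fused["edges"])
# [('load_x', 'add'), ('load_b', 'add'), ('max0', 'store_z'), ('add', 'max0')]
```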

4) Code Selector: Picks the Best One

Workflow of the Code Selector

The code selector executes the generated kernels with a random input and measures their execution time. It selects the kernel with the best execution time for the subset.

If the corresponding state-of-the-art DNN library function, such as a cuDNN routine, performs better than the selected kernel, the library function is chosen as the best-performing kernel instead. The code selector also updates the total execution time with the best kernel’s execution time.
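
In simplified form, this selection step amounts to timing every candidate plus the library baseline and keeping the fastest, as in the sketch below (CPU/NumPy stand-ins are used here; a real implementation would time GPU kernels with proper synchronization):

```python
import time
import numpy as np

def measure(fn, x, repeats=10):
    """Run `fn` on the input several times and return the best wall-clock
    time. (On a real GPU one would use CUDA events and synchronize.)"""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(x)
        best = min(best, time.perf_counter() - start)
    return best

def select_best(generated_kernels, library_fn, input_shape):
    x = np.random.randn(*input_shape).astype(np.float32)
    timed = [(measure(k, x), name) for name, k in generated_kernels.items()]
    timed.append((measure(library_fn, x), "library (cuDNN-like baseline)"))
    best_time, best_name = min(timed)
    return best_name, best_time

# Toy stand-ins for generated kernels and a vendor library call.
kernels = {
    "fused_bias_relu_v1": lambda x: np.maximum(x + 1.0, 0.0),
    "fused_bias_relu_v2": lambda x: np.clip(x + 1.0, 0.0, None),
}
library = lambda x: np.maximum(x + 1.0, 0.0)
print(select_best(kernels, library, (1024, 1024)))
```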

Experimental Results

In this section, we evaluate DeepCuts by comparing it with cuDNN and other state-of-the-art DNN optimization frameworks: TVM, TensorFlow-XLA, and TensorRT.

System Configuration & Software Versions
Deep Learning Benchmark Applications

To check whether DeepCuts and its performance model work generically across different GPU architectures, we evaluate DeepCuts on two GPU architectures: NVIDIA Volta (V100) and Turing (RTX 2080). The system configuration and the software tools used in the evaluation, together with their versions, are summarized in the table above (left).

We evaluated the performance of DeepCuts on widely used deep learning benchmark models such as ResNet and BERT: 4 CNNs, 2 RNNs, and 2 MLPs in total, as summarized in the table above (right). We measured the performance of both training and inference workloads for each benchmark.

The diagram below shows the speedup over cuDNN for inference and training:

As shown in the diagram above, DeepCuts achieves a 1.13x speedup over the cuDNN-based implementations on average. DeepCuts is the top performer for 23 out of 32 workloads on a V100 GPU, and for 22 workloads on an RTX 2080 GPU.

In other words, it outperforms the state-of-the-art DL optimization frameworks such as Apache TVM, Google TensorFlow-XLA, and NVIDIA TensorRT.

Conclusions

DeepCuts is a DL optimization framework for versatile GPU workloads.

We live in the era of deep learning, in which nearly every kind of technology relies on it. Because so much technology development depends on deep learning, the quality of deep learning optimization frameworks is an extremely important factor in DL development.

DeepCuts is especially notable as one of the few cases in which Korea’s deep learning community has achieved state-of-the-art results over other deep learning software technologies. DeepCuts can also serve as an essential technology for the adoption and commercialization of AI semiconductors, which have become a global trend. We hope DeepCuts will be used by many engineers who want to apply deep learning to their technology, and that it will provide them with valuable insights.

Acknowledgements

We thank Wookeun Jung and the co-authors of the paper “DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads” for their contributions and discussions in preparing this blog. The views and opinions expressed in this blog are solely of the authors.

This post is based on the following paper:

  • DeepCuts: A Deep Learning Optimization Framework for Versatile GPU Workloads. Wookeun Jung, Thanh Tuan Dao, and Jaejin Lee. Programming Language Design and Implementation (PLDI ’21), 2021, link.

This post was originally published on our Notion blog on July 31, 2021.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research at Seoul National University.