[jetson] Running yolov8 classification model in DLA #1 – Why DLA?
Aug 27, 2023
Background
- As a deep learning engineer, you often need to run several tasks on Jetson, such as detection, segmentation, and classification. In addition, in harsh outdoor environments, power efficiency often has to be taken into account.
- For these situations, Jetson Xavier AGX and Orin are equipped with dedicated hardware called the DLA (Deep Learning Accelerator) in addition to the GPU.
- In terms of raw performance, the Jetson Xavier AGX GPU (512 CUDA cores with 64 Tensor Cores) delivers about 11 *TFLOPS, while each of the two DLAs delivers about 2.5 TFLOPS. (*TFLOPS: tera floating-point operations per second)
- Since the DLA does not match the GPU in performance, it is not enough for large models, but I think it is a pretty good option for running small models such as YOLO nano.
- Therefore, in this post, we will look at the advantages of DLA and the points to watch out for when building a DLA engine.
Advantages
- DLA is useful for offloading CNN processing from the iGPU, and is significantly more power-efficient for these workloads.
- According to NVIDIA, the DLA delivers high AI performance in a power-efficient architecture, accelerating the NVIDIA AI software stack with almost 2.5X the power efficiency of the GPU for these workloads. It also delivers high performance per area, making it well suited to low-power, compact embedded and edge AI applications.
- In addition, it can provide an independent execution pipeline in cases where redundancy is important, for example in mission-critical or safety applications.
NOTES
- DLA is not superior to the GPU in terms of TFLOPS or latency, so it is not recommended to run the main task on a DLA engine.
- The layers supported by NVIDIA DLA are limited. Even when a layer is supported, the allowed attribute values are often restricted.
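- Unsupported layers do not have to block the whole model: when the engine is built with GPU fallback enabled, TensorRT places supported segments on the DLA and everything else on the GPU. Below is a minimal sketch of such a build using the TensorRT Python API (TensorRT 8.x assumed; file names are placeholders); the same can be achieved with trtexec using `--useDLACore=0 --allowGPUFallback --fp16`.

```python
import tensorrt as trt

# Minimal sketch: build a TensorRT engine that targets DLA core 0
# with GPU fallback for unsupported layers (TensorRT 8.x Python API assumed).
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:   # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)            # DLA runs in FP16 or INT8 only
config.default_device_type = trt.DeviceType.DLA  # prefer DLA for every layer
config.DLA_core = 0                              # Xavier/Orin expose DLA cores 0 and 1
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)    # unsupported layers fall back to the GPU

engine_bytes = builder.build_serialized_network(network, config)
with open("model_dla.engine", "wb") as f:        # placeholder path
    f.write(engine_bytes)
```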
Supported Layers
Convolution and Fully Connected layers
Deconvolution layer
Pooling layer
Activation layer
Parametric ReLU layer
ElementWise layer
Equal operation
Scale layer
LRN (Local Response Normalization) layer
Concatenation layer
Resize layer
Unary layer
Slice layer
SoftMax layer
Shuffle layer
- Even for supported layers, each one has clear restrictions on its attributes; the detailed description is in the official documentation (see References).
- For example, the pooling layer must have a kernel (window) size between 1 and 8, and the resize layer must use the `nearest` resize mode; a quick check for these two constraints is sketched below.
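- One quick way to spot these two problems before building the engine is to scan the ONNX graph directly. The following is a minimal sketch under assumptions (the model path is a placeholder, and only the pooling-kernel and resize-mode rules mentioned above are checked):

```python
import onnx

# Minimal sketch: flag Pooling and Resize nodes that violate the two DLA
# restrictions mentioned above (kernel size 1-8, resize mode 'nearest').
model = onnx.load("model.onnx")  # placeholder path

for node in model.graph.node:
    attrs = {a.name: a for a in node.attribute}
    if node.op_type in ("MaxPool", "AveragePool"):
        kernel = attrs.get("kernel_shape")
        if kernel is not None and any(k < 1 or k > 8 for k in kernel.ints):
            print(f"{node.name}: pooling kernel {list(kernel.ints)} is outside the DLA range 1-8")
    elif node.op_type == "Resize":
        mode = attrs.get("mode")
        if mode is not None and mode.s.decode() != "nearest":
            print(f"{node.name}: resize mode '{mode.s.decode()}' is not supported on DLA")
```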
Short Result
- If the model is edited to satisfy these cumbersome requirements with an ONNX modification tool such as ONNX GraphSurgeon, the following results can be obtained (a minimal editing sketch follows the logs below).
- (Before) Layers not supported by DLA fall back to the GPU; the number of layers running on the GPU is 15.
[08/10/2023-13:58:10] [W] [TRT] Validation failed for DLA layer: /softmax/Transpose + (Unnamed Layer* 193) [Shuffle]. Switching to GPU fallback.
[08/10/2023-13:58:11] [I] [TRT] ---------- Layers Running on DLA ----------
[08/10/2023-13:58:11] [I] [TRT] [DlaLayer] {ForeignNode[/init_bn/BatchNormalization.../compression3/compression3.0/Conv]}
[08/10/2023-13:58:11] [I] [TRT] [DlaLayer] {ForeignNode[/relu_2/Relu.../layer5/layer5.0/Add]}
[08/10/2023-13:58:11] [I] [TRT] [DlaLayer] {ForeignNode[/spp/scale1/scale1.1/BatchNormalization.../spp/scale4/scale4.3/Conv]}
[08/10/2023-13:58:11] [I] [TRT] [DlaLayer] {ForeignNode[/relu_8/Relu.../spp/Add_4]}
[08/10/2023-13:58:11] [I] [TRT] [DlaLayer] {ForeignNode[/Add_3.../final_layer/conv2/Conv]}
[08/10/2023-13:58:11] [I] [TRT] ---------- Layers Running on GPU ----------
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /Resize
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] POOLING: /spp/scale1/scale1.0/AveragePool
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] POOLING: /spp/scale2/scale2.0/AveragePool
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] POOLING: /spp/scale3/scale3.0/AveragePool
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] POOLING: /spp/scale4/scale4.0/GlobalAveragePool
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /spp/Resize
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /spp/Resize_1
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /spp/Resize_2
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /spp/Resize_3
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /Resize_1
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /Resize_2
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] RESIZE: /Resize_3
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] SHUFFLE: /softmax/Transpose + (Unnamed Layer* 193) [Shuffle]
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] SOFTMAX: /softmax/Softmax
[08/10/2023-13:58:11] [I] [TRT] [GpuLayer] SHUFFLE: (Unnamed Layer* 195) [Shuffle] + /softmax/Transpose_1
- (After) Only the 5 layers still not supported by DLA run on the GPU.
[08/11/2023-16:19:49] [I] [TRT] ---------- Layers Running on DLA ----------
[08/11/2023-16:19:49] [I] [TRT] [DlaLayer] {ForeignNode[/init_bn/BatchNormalization.../spp/Add]}
[08/11/2023-16:19:49] [I] [TRT] [DlaLayer] {ForeignNode[/spp/Resize_3.../spp/Add_4]}
[08/11/2023-16:19:49] [I] [TRT] [DlaLayer] {ForeignNode[/Add_4.../final_layer/conv2/Conv]}
[08/11/2023-16:19:49] [I] [TRT] ---------- Layers Running on GPU ----------
[08/11/2023-16:19:49] [I] [TRT] [GpuLayer] RESIZE: /spp/Resize_1
[08/11/2023-16:19:49] [I] [TRT] [GpuLayer] RESIZE: /Resize_2
[08/11/2023-16:19:49] [I] [TRT] [GpuLayer] SHUFFLE: (Unnamed Layer* 187) [Shuffle]
[08/11/2023-16:19:49] [I] [TRT] [GpuLayer] SOFTMAX: /softmax/Softmax
[08/11/2023-16:19:49] [I] [TRT] [GpuLayer] SHUFFLE: (Unnamed Layer* 189) [Shuffle]
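- As a rough preview of the kind of edit involved, here is a minimal ONNX GraphSurgeon sketch that forces every `Resize` node to the DLA-supported `nearest` mode. This is an illustration under assumptions: file names are placeholders, and changing the interpolation mode alters the model's output, so in practice the model may need to be adjusted or validated accordingly. The full procedure is covered in the next post.

```python
import onnx
import onnx_graphsurgeon as gs

# Minimal sketch: force Resize nodes to the DLA-supported 'nearest' mode.
# Note: this changes interpolation behavior, so validate accuracy afterwards.
graph = gs.import_onnx(onnx.load("model.onnx"))   # placeholder path

for node in graph.nodes:
    if node.op == "Resize":
        node.attrs["mode"] = "nearest"

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_dla_friendly.onnx")  # placeholder path
```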
Conclusion
- We can lower GPU usage by building an engine that runs on DLA.
- To convert layers into forms supported by DLA, some work with ONNX GraphSurgeon is required; this will be covered in the next post.
References
- dla faster than gpu? — Does DLA work faster than GPU in FP16 model?
- DLA SW — GitHub: NVIDIA/Deep-Learning-Accelerator-SW, recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications
- resize dla core — Resize layer not supported in DLA
- trtexec example — TensorRT documentation
- developer page — Deep Learning Accelerator (DLA)
- tutorial — GitHub: NVIDIA-AI-IOT/jetson_dla_tutorial, a tutorial for getting started with the Deep Learning Accelerator (DLA) on NVIDIA Jetson
- official document (DLA core) — TensorRT documentation
About the Author
Hello, I’m Deeper&Cheaper.
- I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
- The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
- With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
- Please don’t hesitate to drop your questions in the comments section.