[Quantization] Go Faster with ReLU! — YoloV8 QAT x2 Speed up on your Jetson Orin Nano #3

DeeperAndCheaper
5 min read · Oct 13, 2023


Additional Tip (Updated Nov 24)

  • If you use nn.ReLU6(), the activation outputs get a well-bounded dynamic range! See below: every head activation calibrates to the fixed (-6.0, 6.0) range, giving a quantization scale of 6/127 ≈ 0.047244.
/model.22/cv2.0/cv2.0.0/act/Clip | /model.22/cv2.0/cv2.0.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv2.0/cv2.0.1/act/Clip | /model.22/cv2.0/cv2.0.1/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv3.0/cv3.0.0/act/Clip | /model.22/cv3.0/cv3.0.0/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv3.0/cv3.0.1/act/Clip | /model.22/cv3.0/cv3.0.1/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv2.1/cv2.1.0/act/Clip | /model.22/cv2.1/cv2.1.0/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv2.1/cv2.1.1/act/Clip | /model.22/cv2.1/cv2.1.1/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv3.1/cv3.1.0/act/Clip | /model.22/cv3.1/cv3.1.0/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
/model.22/cv3.1/cv3.1.1/act/Clip | /model.22/cv3.1/cv3.1.1/act/Clip_output_0 | (-6.0, 6.0) | 0.047244094
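  • To try it, the same one-line yaml change from Section 3 below applies; a minimal sketch, assuming the Ultralytics yaml parser accepts nn.ReLU6() the same way it accepts nn.ReLU():

# in the model config yaml, swap the activation override to ReLU6
activation: nn.ReLU6()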

1. Goal

  • In this post, we introduce a method that sacrifices a little model accuracy but further accelerates inference speed.
  • As you can see below, using ReLU instead of the default SiLU activation improves speed by about 10% while reducing accuracy by only about 1%.
(Table: speed/accuracy comparison from the yolov5 gpu optimization GitHub)

2. How?

  • While TensorRT builds the engine, it automatically merges the convolution, bias, and ReLU layers into a single, simpler operation. This layer fusion optimization does not change the results, so a more optimized engine is obtained for free.
  • NOTE: Since the SiLU operation (Sigmoid + Mul) is inherently more expensive than the ReLU operation, speed improves even without layer fusion (see the short illustration after this list).
  • If convolution, bias, and ReLU are computed separately as shown below, time is wasted reading and writing intermediate results to memory. Layer fusion removes those memory round-trips, which is effective in improving latency.
TensorRT layer fusion example
  • Please refer to the following link for the supported fusion types. (link)
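  • As a rough PyTorch illustration (a sketch, not TensorRT itself): SiLU decomposes into a Sigmoid plus an elementwise Mul in the exported graph, while ReLU is a single elementwise op that TensorRT can fold into the preceding convolution.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# SiLU(x) = x * sigmoid(x): exports as two elementwise nodes (Sigmoid, Mul)
silu_out = x * torch.sigmoid(x)
assert torch.allclose(silu_out, nn.SiLU()(x))

# ReLU(x): a single elementwise node, fusable with the preceding Conv + bias
relu_out = torch.relu(x)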

3. Modify the activation type in YOLOv8! (super easy!!)

  • To train YOLOv8 with only the activation changed, do the following.
  • In the model config yaml file provided by YOLOv8 (github link), just add one line (activation: nn.ReLU()) and start training! A minimal training sketch follows the yaml below.
# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
# [depth, width, max_channels]
n: [0.33, 0.25, 1024] # YOLOv8n summary: 225 layers, 3157200 parameters, 3157184 gradients, 8.9 GFLOPs
s: [0.33, 0.50, 1024] # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients, 28.8 GFLOPs
m: [0.67, 0.75, 768] # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients, 79.3 GFLOPs
l: [1.00, 1.00, 512] # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
x: [1.00, 1.25, 512] # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# ---- add this line ---- #
activation: nn.ReLU()

# YOLOv8.0n backbone
backbone:
# [from, repeats, module, args]
- [-1, 1, Conv, [64, 3, 2]] # 0-P1/2
- [-1, 1, Conv, [128, 3, 2]] # 1-P2/4
- [-1, 3, C2f, [128, True]]
- [-1, 1, Conv, [256, 3, 2]] # 3-P3/8
- [-1, 6, C2f, [256, True]]
- [-1, 1, Conv, [512, 3, 2]] # 5-P4/16
- [-1, 6, C2f, [512, True]]
- [-1, 1, Conv, [1024, 3, 2]] # 7-P5/32
- [-1, 3, C2f, [1024, True]]
- [-1, 1, SPPF, [1024, 5]] # 9

# YOLOv8.0n head
head:
- [-1, 1, nn.Upsample, [None, 2, 'nearest']]
- [[-1, 6], 1, Concat, [1]] # cat backbone P4
- [-1, 3, C2f, [512]] # 12

- [-1, 1, nn.Upsample, [None, 2, 'nearest']]
- [[-1, 4], 1, Concat, [1]] # cat backbone P3
- [-1, 3, C2f, [256]] # 15 (P3/8-small)

- [-1, 1, Conv, [256, 3, 2]]
- [[-1, 12], 1, Concat, [1]] # cat head P4
- [-1, 3, C2f, [512]] # 18 (P4/16-medium)

- [-1, 1, Conv, [512, 3, 2]]
- [[-1, 9], 1, Concat, [1]] # cat head P5
- [-1, 3, C2f, [1024]] # 21 (P5/32-large)

- [[15, 18, 21], 1, Detect, [nc]] # Detect(P3, P4, P5)
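  • With the modified yaml saved (the file name below is just an assumption), training is the standard Ultralytics call; a minimal sketch:

from ultralytics import YOLO

# "yolov8m-relu.yaml" is the config above with the added activation line;
# Ultralytics infers the medium scale from the "m" in the file name
# (dataset and hyperparameters here are only examples)
model = YOLO("yolov8m-relu.yaml")
model.train(data="coco.yaml", epochs=100, imgsz=640)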
  • If the activation is changed to ReLU, the graphsurgeon_model function from the previous post must be changed to find the Relu node instead of the Sigmoid and Mul nodes. Pay attention to this point when implementing; see the sketch after the figure below.
  • If you export the Onnx model and check the graph with netron.app, you can see the modification as follows.
(Figure: activation changed from SiLU to ReLU, viewed in netron.app)
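  • A minimal onnx-graphsurgeon sketch of the matching change (the full graphsurgeon_model function is in the previous post; only the node lookup differs, and the onnx file name is an assumption):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("yolov8m-relu.onnx"))

# before (SiLU): the function had to locate the Sigmoid + Mul node pair
# silu_nodes = [n for n in graph.nodes if n.op in ("Sigmoid", "Mul")]

# after (ReLU): locate the single Relu node instead
relu_nodes = [n for n in graph.nodes if n.op == "Relu"]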

4. TensorRT Graph & Latency Result

  • Here, we look at the TensorRT graph through the Trex (TensorRT Engine Explorer) tool and see how much the speed improves; a loading sketch follows at the end of this section.
(Left) Conv + BN + SiLU, (Right) Conv + BN + ReLU
  • As you can see in the picture above, the latency of the first layer is 4.3 ms for SiLU and 2.7 ms for ReLU, a 4.3 / 2.7 ≈ 1.59x improvement (about 59%) on just this one conv layer.
  • In the SiLU case, layer fusion has not occurred and a PWN (PointWise Node) has been created; in the ReLU case, layer fusion has occurred and the block runs as a single convolution operation.
  • Finally, the latency we obtained for YOLOv8 medium (batch 4) with ReLU activation was 75.172 ms on a Jetson Orin Nano 4GB.
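  • To reproduce the per-layer numbers, the engine's graph and profiling json can be loaded with Trex; a sketch, assuming the json files were generated with Trex's engine-build utilities (file names are placeholders):

from trex import EnginePlan

# graph/profiling json are produced alongside the engine by Trex's utilities
plan = EnginePlan("yolov8m_relu.graph.json", "yolov8m_relu.profile.json")
print(plan.df.head())  # one row per engine layer, including latency columns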

5. Conclusion

  • Not stopping at QAT, we experimented with a way to make YOLOv8 even faster, and we actually made it 14.2% faster!!!
  • Since a speed improvement can come with an accuracy drop, in the next post we will find out how much accuracy decreases under QAT and how to recover it. Stay tuned!!!

Trending Articles

Hit! [yolov8] converting to Batch model engine

Hit! [Quantization] Go Faster with ReLU!

[Quantization] Achieve Accuracy Drop to Near Zero

[Quantization] How to achieve the best QAT performance

[Yolov8/Jetson/Deepstream] Benchmark test

[yolov8] NMS Post Processing implementation using only Numpy

[yolov8] batch inference using TensorRT python api

About Authors

Hello, I’m Deeper&Cheaper.

  • I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
  • The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
  • With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
  • Please don’t hesitate to drop your questions in the comments section.
