Stories by Maro JEON on Medium

What activation is suitable for your edge devices

Maro JEON — Tue, 03 Dec 2024 14:02:18 GMT

[What Activation is Suitable for Your Edge Device?]

In Edge AI, and not only in robotics, a great deal of effort goes into cutting latency by even a few percent while maintaining accuracy. Every model optimization method involves a trade-off between accuracy and latency.

Recently, I shared a LinkedIn post about a failed pruning experiment. A kind expert reached out via DM and suggested I refer to the paper “To Bridge Neural Network Design and Real-World Performance” to better understand the issue. (Thank you once again!)

This paper provided not only insights into the problem I faced but also highlighted an interesting point: “Even the same activation function can exhibit different latency tendencies depending on the hardware platform.” As shown in the figure below, different activations are hardware-friendly for different platforms.

Since I frequently work with mobile GPUs, I focused on finding activations that perform well on GPUs. This is where I first learned about Hardswish. It offers accuracy comparable to SiLU (Swish) while achieving latency similar to ReLU6. (In fact, the Hardswish and SiLU curves are quite similar.)
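To make the "similar curves" remark concrete, here is a minimal sketch in plain Python that uses the standard definitions of the three activations (SiLU as x * sigmoid(x), Hardswish as x * ReLU6(x + 3) / 6). The sampling range of [-6, 6] is my own choice for illustration and does not come from the paper.

```python
import math


def relu6(x: float) -> float:
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def silu(x: float) -> float:
    """SiLU (Swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def hardswish(x: float) -> float:
    """Hardswish: piecewise approximation of SiLU built from ReLU6."""
    return x * relu6(x + 3.0) / 6.0


# Sample both curves on [-6, 6] and measure how far apart they get.
xs = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(silu(x) - hardswish(x)) for x in xs)
print(f"max |SiLU - Hardswish| on [-6, 6]: {max_gap:.3f}")
```

Hardswish needs no exponential, only a clamp, an add, and two multiplies, which is one plausible reason it tends to be cheaper than SiLU on hardware where sigmoid is expensive.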

To test this, I trained a YOLOv11 small model on the COCO dataset and evaluated its latency on the Jetson platform. The results showed an approximately 11.4% improvement in latency compared to SiLU, with only about 1% degradation in accuracy.

If you have your own insights or know-how related to activation functions or model architecture design, please share them! I believe it could lead to valuable exchanges of ideas and knowledge for everyone.

[ Can we develop Large Autonomous Driving Models like ChatGPTs ? ]

Maro JEON — Mon, 02 Dec 2024 23:41:00 GMT

Subtitle: After Reading the UniAD Paper (https://github.com/OpenDriveLab/UniAD)

In the past, language prediction systems were split into multiple modular components, which made them very complex. Similarly, object detection systems were complex and slow, relying on prior knowledge or modularization, such as two-stage pipelines and anchor-based techniques, for localization and classification. With the advent of LLMs, language prediction has been simplified into a single model, and object detection has likewise been simplified with one-stage, anchor-free methods.

Currently, autonomous driving consists of complex modules for perception, prediction, and planning. These are developed independently, so errors accumulate as information passes from one module to the next (according to the UniAD introduction). To overcome these inherent limitations, UniAD attempts to integrate the models that perform perception, prediction, and planning into a single AI model.

Going a step further, I think a large-model approach based on unsupervised learning, as with LLMs, could be the future solution. For this to be possible, the input modality seems important.

UniAD uses a single RGB image as its input modality for autonomous driving. However, to predict what comes next from what came before, the way LLMs predict subsequent words from previous ones, sequential inputs such as video might be necessary. And to capture all spatial information, 360-degree depth images might be more useful than simple RGB or BEV images (considering only technical feasibility and setting cost aside).

These are merely my personal thoughts, and such research trends might already exist. If anyone knows about this, it would be great to receive feedback or knowledge through DMs or comments.

Visual Language Model (VLM) Optimization — Activation-aware Weight Quantization (AWQ)

Maro JEON — Mon, 09 Sep 2024 05:37:22 GMT

Why VLM ?

[Yolov8/Jetson/Deepstream] Benchmark test — Orin Nano 4GB, 8GB, NX, TX2

Maro JEON — Tue, 27 Aug 2024 13:29:10 GMT

Background

[Quantization] YoloV8 QAT x2 Speed up on your Jetson Orin Nano #2 — How to achieve the best QAT…

Maro JEON — Tue, 27 Aug 2024 13:26:32 GMT

Abstract

[Quantization] Achieve Accuracy Drop to Near Zero — YoloV8 QAT x2 Speed up on your Jetson Orin…

Maro JEON — Tue, 27 Aug 2024 13:23:25 GMT

Background Knowledge

[YoloV9][Model Optimization][Knowledge Distillation] #2 — How to implement Feature based KD ?

Maro JEON — Tue, 27 Aug 2024 13:20:06 GMT

Focal and Global Knowledge Distillation for Yolo V9 Object Detector

Quantization Basics

Maro JEON — Mon, 26 Aug 2024 12:00:17 GMT

Introduction

[YoloV9][Model Optimization][Knowledge Distillation] #1 — Why Knowledge Distillation for Object…

Maro JEON — Thu, 09 May 2024 11:37:30 GMT

[YoloV9][Model Optimization][Knowledge Distillation] #1 — Why Knowledge Distillation for Object Detector ?

Why knowledge distillation ?

Model optimization methods are broadly divided into five types:

Parameter Pruning
Parameter Optimization (Quantization)
Low rank matrix Factorization
Transferred/Compact convolutional filters
Knowledge Distillation

Among the five, I previously implemented and tested quantization, a form of parameter optimization (see this post). It delivered good performance with stable results.

I then looked for a method that could be used together with quantization. That is how I became interested in Knowledge Distillation, which optimizes a model in a way that is independent of quantization.

What is knowledge distillation ? What does it consist of?

Knowledge Distillation begins with the assumption that deeper, wider models hold more knowledge than shallower, narrower ones. The former is called the Teacher model and the latter the Student model.

So where is the knowledge hidden in a deep learning model, which is often called a black box? How can we find it?

There is no definitive answer yet, but the academic literature generally divides this knowledge into three types: 1. Response based, 2. Feature based, and 3. Relation based. Briefly:

Response based KD builds knowledge from the final output of the teacher model and passes it to the student.
Feature based KD assumes that knowledge also resides in the features, that is, the outputs of the model's intermediate layers, and transfers those to the student.
Relation based KD transfers relationships, such as those between the teacher's layers and feature maps or between data samples, to the student. The student learns the teacher's structural patterns and the interactions between features rather than simply imitating its outputs. This can improve the student's generalization ability and helps transfer deeper knowledge.
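As a concrete illustration of the response based idea, here is a minimal sketch of the classic soft-label distillation loss for a classifier: the KL divergence between temperature-softened teacher and student outputs. The logits, the temperature of 4.0, and the function names are assumptions made for this example, not values from any of the papers discussed here.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def response_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0.0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [6.0, 2.0, -1.0]
close_student = [5.5, 2.2, -0.8]
far_student = [0.0, 3.0, 1.0]

print(response_kd_loss(close_student, teacher))
print(response_kd_loss(far_student, teacher))
```

The student whose outputs already resemble the teacher's gets the smaller loss, which is what "building knowledge from the output" means in practice.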

Relational Knowledge Distillation

As shown above, this method tries to make the relationships among the student's outputs match the relationships among the teacher's outputs.
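One way to picture this is a distance-based relational loss in the spirit of Relational Knowledge Distillation: compute the pairwise distances between sample embeddings for the teacher and for the student, normalize each set by its mean, and penalize the differences with a Huber loss. The sketch below is a simplified plain-Python version under those assumptions; the embeddings are made up for illustration.

```python
import math
from itertools import combinations


def normalized_pairwise_distances(embeddings):
    """Euclidean distance for every pair of samples, divided by the mean distance.

    Assumes at least two distinct points, so the mean distance is non-zero.
    """
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists]


def huber(x, y, delta=1.0):
    """Huber loss between two scalars."""
    diff = abs(x - y)
    return 0.5 * diff ** 2 if diff <= delta else delta * (diff - 0.5 * delta)


def relation_distance_loss(student_emb, teacher_emb):
    """Mean Huber loss between student and teacher pairwise-distance structures."""
    s = normalized_pairwise_distances(student_emb)
    t = normalized_pairwise_distances(teacher_emb)
    return sum(huber(a, b) for a, b in zip(s, t)) / len(t)


# Hypothetical embeddings of three samples. The teacher is 2-D, the students 1-D.
teacher_emb = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
same_structure = [[0.0], [1.0], [2.0]]   # same relative spacing as the teacher
distorted = [[0.0], [1.0], [9.0]]        # different relative spacing

print(relation_distance_loss(same_structure, teacher_emb))
print(relation_distance_loss(distorted, teacher_emb))
```

Because only the relations between samples are compared, the student and teacher embeddings do not need to have the same dimensionality.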

Then, what’s the knowledge for Object Detectors ?

Classification is a relatively simple task and can be explained as above, but how should knowledge be defined for object detection?

Put simply, object detection finds where an object is (“localization”) and decides what it is (“classification”). The difference from plain classification is the additional, harder task of localization.

Object detectors are categorized along two axes. By the number of stages used to produce the bbox and class results, they are either (1) “one stage” or (2) “two stage”. By whether they use prior knowledge in the form of anchors, they are either (3) “anchor based” or (4) “anchor free”.

The Knowledge Distillation methods that can be applied change accordingly.

In particular, KD for object detection has mainly been studied with response-based and feature-based methods, while relation-based methods have received much less attention.

Let’s look at the latest papers on each.

Response based KD — “Distilling Object Detectors with Fine-grained Feature Imitation”

Distilling Object Detectors with Fine-grained Feature Imitation

In addition to the GT bboxes, the bboxes (predictions) that the teacher produces at anchor locations are treated as knowledge, and the teacher's bboxes around the GT are reflected in the loss term. This method can only be applied to anchor-based object detectors.

In the picture above, the green dots are the center points of the teacher's predictions, and the red dot is the center point of the GT.

The green dots are distributed around the red dot. Compared with the student's predictions, the teacher's will be concentrated more tightly around the GT.

In other words, the student learns from how the teacher makes predictions around the red dot.

Feature based KD — “Focal and Global Knowledge Distillation for Detectors“

Focal and Global Knowledge Distillation for Detectors

Where are the features of an object detector concentrated? Object detectors are largely composed of three parts: (1) Backbone, (2) Neck, and (3) Head. They also use a structure called FPN (Feature Pyramid Network) to combine multi-scale features so that everything from small to large objects can be detected. This fusion happens in the neck, so it is reasonable to infer that the neck holds features that are important for object detection.

So what kind of knowledge does a feature contain? The paper argues that a feature contains two kinds: focal knowledge, meaning the regions and channels the teacher focuses on, and global knowledge, meaning the relational context extracted from the teacher's features. Put simply, if the student's intermediate features (including their spatial and channel attention maps) become similar to the teacher's, the student can approach the teacher's generalization performance.
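The spatial and channel attention maps mentioned above can be sketched roughly as follows: average the absolute activations over channels (spatial) or over pixels (channel), then apply a temperature softmax and rescale. This is a simplified plain-Python illustration of that idea, not the paper's full loss; the tiny 2 x 2 x 2 feature maps, the temperature of 0.5, and the helper names are assumptions for this example.

```python
import math


def softmax(values, temperature):
    """Temperature-scaled softmax over a flat list."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(feature, temperature=0.5):
    """feature is C x H x W nested lists; returns H*W weights that sum to H*W."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    pooled = [
        sum(abs(feature[c][i][j]) for c in range(channels)) / channels
        for i in range(height)
        for j in range(width)
    ]
    return [height * width * w for w in softmax(pooled, temperature)]


def channel_attention(feature, temperature=0.5):
    """Returns C weights that sum to C, one per channel."""
    channels, height, width = len(feature), len(feature[0]), len(feature[0][0])
    pooled = [sum(abs(v) for row in ch for v in row) / (height * width) for ch in feature]
    return [channels * w for w in softmax(pooled, temperature)]


def attention_gap(teacher_feature, student_feature):
    """L1 gap between the teacher's and the student's attention maps."""
    pairs = [
        (spatial_attention(teacher_feature), spatial_attention(student_feature)),
        (channel_attention(teacher_feature), channel_attention(student_feature)),
    ]
    return sum(abs(t - s) for ts, ss in pairs for t, s in zip(ts, ss))


# Hypothetical features: the teacher responds strongly at pixel (0, 1), the student is flat.
teacher = [[[0.1, 2.0], [0.2, 0.1]], [[0.0, 3.0], [0.1, 0.3]]]
student = [[[0.5, 0.6], [0.4, 0.5]], [[0.5, 0.4], [0.6, 0.5]]]

print(spatial_attention(teacher))
print(attention_gap(teacher, student))
```

Driving a gap like this toward zero during training is the intuition behind "making the student's features similar to the teacher's".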

What is the best knowledge distillation strategy for YoloV9 ?

To find the Knowledge Distillation method best suited to YoloV9, we first need to know what kind of object detector it is. YoloV9 is a one-stage, anchor-free detector.

Therefore, the “Distilling Object Detectors with Fine-grained Feature Imitation” method, one of the response based KDs, cannot be used, because it relies on anchor-based predictions.

Features can be extracted from the neck, and the teacher and student will be homogeneous models with the same architecture but different sizes. For these reasons, I judged “Focal and Global Knowledge Distillation for Detectors” to be the most suitable KD method for YoloV9.

YOLOv8 model structure

What can we expect from this KD method ?

YoloV9 currently supports C (large) and E (x-large) models. We aim to increase the generalization performance from 53 to 55.6 by distilling the knowledge of the YoloV9 E model into the YoloV9 C model.

YoloV9 mAP Performance

[Quantization] Go Faster with ReLU! — YoloV8 QAT x2 Speed up on your Jetson Orin Nano #3

Maro JEON — Fri, 13 Oct 2023 13:31:33 GMT

Additional Tip (Updated 24. Nov)

If you use nn.ReLU6(), you get a better (bounded) dynamic range for the activation outputs. See below!

/model.22/cv2.0/cv2.0.0/act/Clip | /model.22/cv2.0/cv2.0.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv2.0/cv2.0.1/act/Clip | /model.22/cv2.0/cv2.0.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv3.0/cv3.0.0/act/Clip | /model.22/cv3.0/cv3.0.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv3.0/cv3.0.1/act/Clip | /model.22/cv3.0/cv3.0.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv2.1/cv2.1.0/act/Clip | /model.22/cv2.1/cv2.1.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv2.1/cv2.1.1/act/Clip | /model.22/cv2.1/cv2.1.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv3.1/cv3.1.0/act/Clip | /model.22/cv3.1/cv3.1.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
/model.22/cv3.1/cv3.1.1/act/Clip | /model.22/cv3.1/cv3.1.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094
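The last column in the listing above looks like an int8 quantization scale. Assuming symmetric int8 quantization (an assumption on my part; the columns are not labeled), a range of (-6.0, 6.0) gives exactly that value, as this small sketch shows. The `amax` of 20.0 used for comparison is a made-up value for an unbounded activation.

```python
def symmetric_int8_scale(amax: float, num_bits: int = 8) -> float:
    """Scale that maps [-amax, amax] onto the signed integer range [-127, 127]."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return amax / qmax


relu6_scale = symmetric_int8_scale(6.0)
print(f"{relu6_scale:.9f}")  # 0.047244094

# Hypothetical comparison: an unbounded activation whose calibrated amax is 20.0
# ends up with a coarser quantization step.
unbounded_scale = symmetric_int8_scale(20.0)
print(f"{unbounded_scale:.9f}")
```

A smaller scale means finer quantization steps, which is one way to read "better dynamic range" here.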

1. Goal

In this post, we introduce a method that may sacrifice a little model accuracy but further accelerates inference.
As you can see below, using ReLU instead of the default SiLU activation improves speed by about 10%, while accuracy drops by only about 1%.

yolov5 gpu optimization github

2. How ?

When TensorRT builds the engine, it automatically merges convolution, bias, and ReLU layers into a single, simpler operation. This optimization is called layer fusion; it does not change the results, but it yields a more optimized engine.
NOTE: Of course, since the SiLU (Sigmoid + Mul) operation is more expensive than ReLU to begin with, speed would improve even without layer fusion.
If convolution, bias, and ReLU are computed separately as shown below, time is wasted reading and writing intermediate results to memory. Layer fusion cuts the time spent on those memory reads and writes, which helps reduce latency.
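The memory argument can be illustrated with a toy 1-D example. This is only a conceptual sketch in plain Python, not how TensorRT implements fusion internally: the unfused version materializes two intermediate buffers, while the fused version produces each output element in a single pass. All names and values are made up for illustration.

```python
def conv1d(x, kernel):
    """Valid 1-D convolution (cross-correlation) without padding."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]


def conv_bias_relu_unfused(x, kernel, bias):
    """Three separate passes, each writing a full intermediate buffer."""
    conv_out = conv1d(x, kernel)              # buffer 1: written, then read back
    biased = [v + bias for v in conv_out]     # buffer 2: written, then read back
    return [max(v, 0.0) for v in biased]      # final output


def conv_bias_relu_fused(x, kernel, bias):
    """One pass: convolution, bias, and ReLU applied per output element."""
    k = len(kernel)
    return [
        max(sum(x[i + j] * kernel[j] for j in range(k)) + bias, 0.0)
        for i in range(len(x) - k + 1)
    ]


x = [1.0, -2.0, 3.0, 0.5]
kernel = [0.5, -1.0]
bias = 0.25

print(conv_bias_relu_unfused(x, kernel, bias))
print(conv_bias_relu_fused(x, kernel, bias))
```

Both functions return the same values; the fused one simply never stores the intermediate results, which is where the latency saving comes from on real hardware.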

TensorRT layer fusion example

Please refer to the following link for supported Fusion types. (link)

3. Modify activation type in yolov8 ! (super easy!!)

Training Yolov8 with only the activation changed works as follows.
In the model config yaml file provided by yolov8 (github link), just add one line (activation: nn.ReLU()) and start training!

# Ultralytics YOLO 🚀, AGPL-3.0 license
# YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect

# Parameters
nc: 80  # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n.yaml' will call yolov8.yaml with scale 'n'
  # [depth, width, max_channels]
  n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs
  s: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs
  m: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs
  l: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs
  x: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs

# ---- add this line ---- #
activation: nn.ReLU()

# YOLOv8.0n backbone
backbone:
  # [from, repeats, module, args]
  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2
  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4
  - [-1, 3, C2f, [128, True]]
  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8
  - [-1, 6, C2f, [256, True]]
  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16
  - [-1, 6, C2f, [512, True]]
  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32
  - [-1, 3, C2f, [1024, True]]
  - [-1, 1, SPPF, [1024, 5]]  # 9

# YOLOv8.0n head
head:
  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4
  - [-1, 3, C2f, [512]]  # 12

  - [-1, 1, nn.Upsample, [None, 2, 'nearest']]
  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3
  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)

  - [-1, 1, Conv, [256, 3, 2]]
  - [[-1, 12], 1, Concat, [1]]  # cat head P4
  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)

  - [-1, 1, Conv, [512, 3, 2]]
  - [[-1, 9], 1, Concat, [1]]  # cat head P5
  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)

  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)

If the activation is changed to ReLU, the graphsurgeon_model function from the previous post has to be changed to look for the Relu node instead of Sigmoid and Mul. Pay attention to this point when implementing it.
If you export the ONNX model and inspect the graph with netron.app, you can see the change as follows.

Activation change SiLU to ReLU
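The node-matching change described above can be sketched as follows. This is a hypothetical illustration in which graph nodes are plain dicts rather than real onnx_graphsurgeon objects, and `find_activation_outputs` is a made-up helper, not the graphsurgeon_model function from the previous post. Only the ONNX op names (Relu, Sigmoid, Mul) are real.

```python
def find_activation_outputs(nodes, activation="relu"):
    """Return the output tensor names of activation nodes in a toy graph.

    For ReLU the activation is a single Relu node. For SiLU it is a Mul node
    that consumes the output of a Sigmoid node.
    """
    if activation == "relu":
        return [n["outputs"][0] for n in nodes if n["op"] == "Relu"]
    sigmoid_outputs = {n["outputs"][0] for n in nodes if n["op"] == "Sigmoid"}
    return [
        n["outputs"][0]
        for n in nodes
        if n["op"] == "Mul" and sigmoid_outputs & set(n["inputs"])
    ]


# Toy graphs: Conv followed by SiLU (Sigmoid + Mul) versus Conv followed by ReLU.
silu_graph = [
    {"op": "Conv", "inputs": ["x"], "outputs": ["conv_out"]},
    {"op": "Sigmoid", "inputs": ["conv_out"], "outputs": ["sig_out"]},
    {"op": "Mul", "inputs": ["conv_out", "sig_out"], "outputs": ["act_out"]},
]
relu_graph = [
    {"op": "Conv", "inputs": ["x"], "outputs": ["conv_out"]},
    {"op": "Relu", "inputs": ["conv_out"], "outputs": ["act_out"]},
]

print(find_activation_outputs(silu_graph, "silu"))
print(find_activation_outputs(relu_graph, "relu"))
```

Code that still searches for the Sigmoid and Mul pattern finds nothing in the ReLU graph, which is why the search has to be updated after the activation change.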

4. TensorRT Graph & Latency Result

Here, we look at the TensorRT graph through the Trex tool and see how much the speed is improved.

(Left) Conv + BN + Silu, (Right) Conv + BN + ReLU

As you can see in the picture above, the latency of the first layer is 4.3 ms with SiLU and 2.7 ms with ReLU. That is roughly a 1.59x speedup (about 37% lower latency) on this one conv layer alone.
With SiLU, layer fusion did not occur and a PWN (PointWiseNode) was created; with ReLU, layer fusion occurred and everything runs as a single convolution operation.
Finally, the latency we obtained for Yolov8 medium (batch 4) with ReLU activation was 75.172 ms on a Jetson Orin Nano 4GB.
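For reference, the arithmetic behind these numbers can be checked with a few lines of Python. The inputs are the figures quoted above; the per-image latency simply divides the batch latency by the batch size of 4.

```python
silu_ms, relu_ms = 4.3, 2.7  # first-layer latencies read from the TensorRT graph

speedup = silu_ms / relu_ms                          # how many times faster ReLU is
latency_reduction = (silu_ms - relu_ms) / silu_ms    # fraction of latency removed

print(f"speedup: {speedup:.2f}x")                    # speedup: 1.59x
print(f"latency reduction: {latency_reduction:.1%}") # latency reduction: 37.2%

batch_latency_ms, batch_size = 75.172, 4
per_image_ms = batch_latency_ms / batch_size
print(f"per-image latency: {per_image_ms:.3f} ms")   # per-image latency: 18.793 ms
```

Note that "59% faster" and "37% lower latency" describe the same measurement: the first is a ratio of speeds, the second a ratio of times.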

https://medium.com/media/1158c2d7f00b7677e723f37a30dd9a88/href

5. Conclusion

We did not stop at QAT: we experimented with a way to make Yolov8 even faster and were actually able to make it 14.2% faster!
Since the speedup may come with a drop in accuracy, the next post will look at how much accuracy drops under QAT and how to recover it. Stay tuned!

—

About Authors

Hello, I’m Deeper&Cheaper.

I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
Please don’t hesitate to drop your questions in the comments section.
