<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Maro JEON on Medium]]></title>
        <description><![CDATA[Stories by Maro JEON on Medium]]></description>
        <link>https://medium.com/@MaroJEON?source=rss-92bcabecb40------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*S7a-PLbsP2vnetFHBmiQbg@2x.jpeg</url>
            <title>Stories by Maro JEON on Medium</title>
            <link>https://medium.com/@MaroJEON?source=rss-92bcabecb40------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 18 May 2026 11:46:20 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@MaroJEON/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[What activation is suitable for your edge devices]]></title>
            <link>https://medium.com/@MaroJEON/what-activation-is-suitable-for-your-edge-devices-c7e1612cbf03?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/c7e1612cbf03</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[model-optimization]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Tue, 03 Dec 2024 14:02:18 GMT</pubDate>
            <atom:updated>2024-12-03T14:02:18.073Z</atom:updated>
            <content:encoded><![CDATA[<h2>[What Activation is Suitable for Your Edge Device?]</h2><figure><img alt="" src="https://cdn-images-1.medium.com/max/563/1*gaL6b13DCZo0QbdhcFHnfw@2x.jpeg" /></figure><p>In Edge AI, and not only in robotics, a great deal of effort goes into reducing latency while maintaining accuracy. Every model optimization method involves a trade-off between the two.</p><p>Recently, I shared a LinkedIn post about a failed pruning experiment. A kind expert reached out via DM and suggested I read the paper “To Bridge Neural Network Design and Real-World Performance” to better understand the issue. (Thank you once again!)</p><p>The paper not only gave me insight into the problem I faced but also made an interesting point: “Even the same activation function can exhibit different latency tendencies depending on the hardware platform.” As shown in the figure below, different activations are hardware-friendly on different platforms.</p><p>Since I frequently work with mobile GPUs, I focused on finding activations that perform well on GPUs. This is where I first learned about Hardswish, which offers accuracy comparable to SiLU (Swish) with latency similar to ReLU6. Hardswish is defined as x · ReLU6(x + 3) / 6, a piecewise approximation of SiLU built only from cheap operations, which is why the two functions have very similar curves.</p><p>To test this, I trained a YOLOv11 small model with Hardswish on the COCO dataset and evaluated its latency on the Jetson platform. Compared with SiLU, latency improved by approximately 11.4%, with only about 1% degradation in accuracy.</p><p>If you have your own insights or know-how on activation functions or model architecture design, please share them! I believe it could lead to a valuable exchange of ideas and knowledge for everyone.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c7e1612cbf03" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[[ Can we develop Large Autonomous Driving Models like ChatGPTs ? ]]]></title>
            <link>https://medium.com/@MaroJEON/can-we-develop-large-autonomous-driving-models-like-chatgpts-1d620774d6d0?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/1d620774d6d0</guid>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Mon, 02 Dec 2024 23:41:00 GMT</pubDate>
            <atom:updated>2024-12-02T23:43:59.950Z</atom:updated>
            <content:encoded><![CDATA[<p>Subtitle: After Reading the UniAD Paper (https://github.com/OpenDriveLab/UniAD)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TBRAIpS6LdpWW5ObJ3uh6w@2x.jpeg" /></figure><p>In the past, language prediction systems were built from many separate components, which made them very complex. Similarly, object detection systems were complex and slow, relying on prior knowledge and modular designs such as two-stage pipelines and anchor-based localization and classification. However, with the advent of LLMs, language prediction has been simplified into a single model, and object detection has likewise been simplified by one-stage, anchor-free methods.</p><p>Today, autonomous driving consists of complex modules for perception, prediction, and planning. Because these are developed independently, errors accumulate as information passes from one module to the next (as the UniAD introduction points out). To overcome this inherent limitation, UniAD integrates the perception, prediction, and planning tasks into a single AI model.</p><p>I went further and wondered whether a large-model approach based on unsupervised learning, as with LLMs, could be the future solution. For that to be possible, the input modality seems to be important.</p><p>UniAD takes RGB camera images as its input. However, to predict what comes next from what came before, the way an LLM predicts each word from the previous ones, sequential inputs such as video might be necessary. And to capture the full spatial context, 360-degree depth images might be more useful than plain RGB or BEV images (considering only technical feasibility, not cost).</p><p>These are merely my personal thoughts, and such research may already exist. If you know of any, I would be glad to receive feedback or pointers via DM or in the comments.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1d620774d6d0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Visual Language Model (VLM) Optimization — Activation-aware Weight Quantization (AWQ)]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@MaroJEON/visual-language-model-vlm-optimization-activation-aware-weight-quantization-awq-90d6c7b44455?source=rss-92bcabecb40------2"><img src="https://cdn-images-1.medium.com/max/685/1*B4vUIRrat6L9NQV5i4gb_w.png" width="685"></a></p><p class="medium-feed-snippet">Why VLM&#xA0;?</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/visual-language-model-vlm-optimization-activation-aware-weight-quantization-awq-90d6c7b44455?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/visual-language-model-vlm-optimization-activation-aware-weight-quantization-awq-90d6c7b44455?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/90d6c7b44455</guid>
            <category><![CDATA[model-optimization]]></category>
            <category><![CDATA[vlm]]></category>
            <category><![CDATA[quantization]]></category>
            <category><![CDATA[model-quantization]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Mon, 09 Sep 2024 05:37:22 GMT</pubDate>
            <atom:updated>2024-09-09T05:41:15.895Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[[Yolov8/Jetson/Deepstream] Benchmark test — Orin Nano 4GB, 8GB, NX, TX2]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-snippet">Backgrounds</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/yolov8-jetson-deepstream-benchmark-test-orin-nano-4gb-8gb-nx-tx2-f3993f9c8d2f?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/yolov8-jetson-deepstream-benchmark-test-orin-nano-4gb-8gb-nx-tx2-f3993f9c8d2f?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/f3993f9c8d2f</guid>
            <category><![CDATA[yolov8]]></category>
            <category><![CDATA[jetson-nx]]></category>
            <category><![CDATA[jetson-orin]]></category>
            <category><![CDATA[jetson-xavier]]></category>
            <category><![CDATA[tensorrt]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Tue, 27 Aug 2024 13:29:10 GMT</pubDate>
            <atom:updated>2024-08-27T13:29:10.710Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[[Quantization] YoloV8 QAT x2 Speed up on your Jetson Orin Nano #2 — How to achieve the best QAT…]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@MaroJEON/quantization-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-2-how-to-achieve-the-best-qat-c6069fb83ab7?source=rss-92bcabecb40------2"><img src="https://cdn-images-1.medium.com/max/600/1*cDe5ofRCamaGtlytr_m6qg.png" width="600"></a></p><p class="medium-feed-snippet">Abstract</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/quantization-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-2-how-to-achieve-the-best-qat-c6069fb83ab7?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/quantization-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-2-how-to-achieve-the-best-qat-c6069fb83ab7?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/c6069fb83ab7</guid>
            <category><![CDATA[model-optimization]]></category>
            <category><![CDATA[ptqs]]></category>
            <category><![CDATA[tensorrt]]></category>
            <category><![CDATA[qat]]></category>
            <category><![CDATA[yolov8]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Tue, 27 Aug 2024 13:26:32 GMT</pubDate>
            <atom:updated>2024-08-27T13:26:32.812Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[[Quantization] Achieve Accuracy Drop to Near Zero — YoloV8 QAT x2 Speed up on your Jetson Orin…]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@MaroJEON/quantization-achieve-accuracy-drop-to-near-zero-yolov8-qat-x2-speed-up-on-your-jetson-orin-2b99819775e4?source=rss-92bcabecb40------2"><img src="https://cdn-images-1.medium.com/max/1025/1*jIDBH2pW-w0AYq16KK0Ilw.png" width="1025"></a></p><p class="medium-feed-snippet">Background Knowledge</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/quantization-achieve-accuracy-drop-to-near-zero-yolov8-qat-x2-speed-up-on-your-jetson-orin-2b99819775e4?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/quantization-achieve-accuracy-drop-to-near-zero-yolov8-qat-x2-speed-up-on-your-jetson-orin-2b99819775e4?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/2b99819775e4</guid>
            <category><![CDATA[qat]]></category>
            <category><![CDATA[jetsons]]></category>
            <category><![CDATA[yolov8]]></category>
            <category><![CDATA[model-optimization]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Tue, 27 Aug 2024 13:23:25 GMT</pubDate>
            <atom:updated>2024-08-27T13:23:25.266Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[[YoloV9][Model Optimization][Knowledge Distillation] #2 — How to implement Feature based KD ?]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@MaroJEON/yolov9-model-optimization-knowledge-distillation-2-how-to-implement-feature-based-kd-5bdae730094d?source=rss-92bcabecb40------2"><img src="https://cdn-images-1.medium.com/max/1024/1*vrceyeKk7_9p70Plxeftlg.png" width="1024"></a></p><p class="medium-feed-snippet">Focal and Global Knowledge Distillation for Yolo V9 Object Detector</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/yolov9-model-optimization-knowledge-distillation-2-how-to-implement-feature-based-kd-5bdae730094d?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/yolov9-model-optimization-knowledge-distillation-2-how-to-implement-feature-based-kd-5bdae730094d?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/5bdae730094d</guid>
            <category><![CDATA[knowledge-distillation]]></category>
            <category><![CDATA[dl]]></category>
            <category><![CDATA[yolov9]]></category>
            <category><![CDATA[object-detection]]></category>
            <category><![CDATA[model-optimization]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Tue, 27 Aug 2024 13:20:06 GMT</pubDate>
            <atom:updated>2024-08-27T13:20:06.659Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Quantization Basics]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@MaroJEON/quantization-basics-2b9b1a49882d?source=rss-92bcabecb40------2"><img src="https://cdn-images-1.medium.com/max/645/1*cZ2zGGGg7YRbDUX7est4uQ.png" width="645"></a></p><p class="medium-feed-snippet">Introduction</p><p class="medium-feed-link"><a href="https://medium.com/@MaroJEON/quantization-basics-2b9b1a49882d?source=rss-92bcabecb40------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@MaroJEON/quantization-basics-2b9b1a49882d?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/2b9b1a49882d</guid>
            <category><![CDATA[quantization]]></category>
            <category><![CDATA[cnn]]></category>
            <category><![CDATA[yolo]]></category>
            <category><![CDATA[model-quantization]]></category>
            <category><![CDATA[model-optimization]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Mon, 26 Aug 2024 12:00:17 GMT</pubDate>
            <atom:updated>2024-08-26T12:00:17.931Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[[YoloV9][Model Optimization][Knowledge Distillation] #1  —  Why Knowledge Distillation for Object…]]></title>
            <link>https://medium.com/@MaroJEON/yolov9-model-optimization-knowledge-distillation-1-why-knowledge-distillation-for-object-08d420499141?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/08d420499141</guid>
            <category><![CDATA[model-optimization]]></category>
            <category><![CDATA[nvidia]]></category>
            <category><![CDATA[tensorrt]]></category>
            <category><![CDATA[yolo]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Thu, 09 May 2024 11:37:30 GMT</pubDate>
            <atom:updated>2024-05-15T13:25:13.303Z</atom:updated>
            <content:encoded><![CDATA[<h2>[YoloV9][Model Optimization][Knowledge Distillation] #1 — Why Knowledge Distillation for Object Detectors?</h2><h3>Why knowledge distillation?</h3><p>Model optimization methods are broadly divided into five types:</p><ul><li>Parameter Pruning</li><li>Parameter Optimization (Quantization)</li><li>Low-Rank Matrix Factorization</li><li>Transferred/Compact Convolutional Filters</li><li>Knowledge Distillation</li></ul><p>Of these five, I previously implemented and tested quantization, a form of parameter optimization (see this post). It performed well and produced stable results.</p><p>Next, I looked for a method that could be used together with quantization. That led me to Knowledge Distillation, which optimizes a model independently of quantization, so the two can be combined.</p><h3>What is knowledge distillation? What does it consist of?</h3><p>Knowledge Distillation starts from the assumption that deeper, wider models hold more knowledge than shallower, narrower ones. The former is called the Teacher model and the latter the Student model.</p><p>So where is the knowledge hidden inside a deep learning model, which is often called a black box, and how can we find it?</p><p>There is no definitive answer yet, but the research literature divides this knowledge into three types: 1. Response based, 2. Feature based, and 3. Relation based. They can be briefly explained as follows.</p><ul><li>Response based KD treats the output of the Teacher model (its final predictions) as knowledge and transfers it to the Student.</li><li>Feature based KD assumes that knowledge also lives in features, the outputs of the Teacher’s intermediate layers, and trains the Student to reproduce them.</li><li>Relation based KD transfers the relationships captured by the Teacher model, for example between data samples or between feature maps, to the Student model. Rather than simply imitating the Teacher’s outputs, the Student learns the Teacher’s structural patterns and the interactions between features. This can improve the Student’s generalization and helps transfer deeper knowledge.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/287/1*3u_6KgcGQjQkZwzLxMxVtw@2x.jpeg" /><figcaption>Relational Knowledge Distillation</figcaption></figure><p>As shown above, this method tries to make the relationships among the Student’s outputs match the relationships among the Teacher’s outputs.</p><h3>Then, what is the knowledge in Object Detectors?</h3><p>For classification, a relatively simple task, knowledge can be explained easily as above. But how should knowledge be defined for the object detection task?</p><p>Put simply, object detection finds where an object is (“localization”) and what it is (“classification”). Compared with classification, it therefore adds the harder task of localization.</p><p>Unlike classifiers, object detectors are also categorized in two ways: by how many stages they use to produce bounding boxes and class labels, into (1) “one stage” and (2) “two stage” detectors; and by whether they use prior knowledge in the form of anchors, into (3) “Anchor based” and (4) “Anchor free” detectors.</p><p>The applicable Knowledge Distillation methods change accordingly.</p><p>For object detection in particular, research has focused mainly on Response based and Feature based methods, with relatively little work on Relation based methods.</p><p>Let’s look at a representative paper for each.</p><ul><li>Response based KD — “Distilling Object Detectors with Fine-grained Feature Imitation”</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/375/1*W98llir25g03NBiaKur9ww@2x.jpeg" /></figure><p>Distilling Object Detectors with Fine-grained Feature Imitation</p><p>In addition to the GT boxes, the boxes (predictions) the teacher produces at anchor locations are defined as knowledge, and the teacher’s boxes around the GT are reflected in the loss term. Because it relies on anchors, this method can only be applied to anchor-based object detectors.</p><p>In the figure above, the green dots mark the center points of the teacher’s predictions, and the red dot marks the center point of the GT.</p><p>The green dots are distributed around the red dot. Compared with the Student, the Teacher’s predictions should be concentrated in a narrower distribution around the GT.</p><p>In other words, the student learns from how the teacher makes predictions around the red dot.</p><ul><li>Feature based KD — “Focal and Global Knowledge Distillation for Detectors”</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/322/1*JigZ1TxL3l2X7eSnljs5Cg@2x.jpeg" /><figcaption>Focal and Global Knowledge Distillation for Detectors</figcaption></figure><p>Where are an object detector’s features concentrated? Object detectors are largely composed of three parts: (1) Backbone, (2) Neck, and (3) Head. They also use FPN (Feature Pyramid Network) to combine multi-scale features so that everything from small to large objects can be detected. 
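</p><p>To make the fusion step concrete, here is a minimal NumPy sketch of the top-down pathway in the original FPN design. Each coarser map is upsampled 2x and added to a 1x1-projected finer map. It illustrates the idea under stated assumptions and is not YoloV9 code. The function names, channel counts, and random weights are hypothetical. YOLO necks such as YOLOv8’s also concatenate features (Concat layers) and add a bottom-up path, rather than only summing top-down.</p><pre>
```python
import numpy as np


def upsample2x_nearest(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def lateral_1x1(weight, x):
    """A bias-free 1x1 convolution is a per-pixel channel projection.

    weight: (D, C) projection matrix; x: (C, H, W) map -> returns (D, H, W).
    """
    return np.einsum("dc,chw->dhw", weight, x)


def fpn_top_down(c3, c4, c5, weights):
    """Classic FPN top-down fusion (element-wise addition variant).

    c3, c4, c5: backbone maps at strides 8, 16, 32, so each level halves H and W.
    weights: hypothetical 1x1 projection matrices keyed "c3", "c4", "c5",
             all projecting to the same channel count D.
    Returns P3, P4, P5, each with D channels.
    """
    p5 = lateral_1x1(weights["c5"], c5)
    # Coarse, semantically strong information flows down to the finer levels.
    p4 = lateral_1x1(weights["c4"], c4) + upsample2x_nearest(p5)
    p3 = lateral_1x1(weights["c3"], c3) + upsample2x_nearest(p4)
    return p3, p4, p5


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 64  # shared output channel count (assumed)
    # Hypothetical backbone outputs for a 640x640 input image.
    c3 = rng.standard_normal((128, 80, 80))
    c4 = rng.standard_normal((256, 40, 40))
    c5 = rng.standard_normal((512, 20, 20))
    weights = {k: rng.standard_normal((d, c.shape[0])) * 0.01
               for k, c in (("c3", c3), ("c4", c4), ("c5", c5))}
    p3, p4, p5 = fpn_top_down(c3, c4, c5, weights)
    print(p3.shape, p4.shape, p5.shape)  # (64, 80, 80) (64, 40, 40) (64, 20, 20)
```
</pre><p>P3 and P4 keep their own resolution but now also carry information from the coarser levels above them. This is why the fused neck features are a natural target for feature-based distillation.</p><p>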
This fusion takes place in the neck, which suggests that the neck contains the features most important for object detection.</p><p>So what kind of knowledge does a feature contain? The FGD paper argues that a feature holds two kinds of knowledge: focal knowledge, meaning the regions and channels the teacher focuses on, and global knowledge, meaning the relations between pixels (the global context) extracted from the teacher’s features. Put simply, if the student’s intermediate features, including their spatial and channel attention maps, are made similar to the teacher’s, the student’s results can approach the teacher’s generalization performance.</p><h3>What is the best knowledge distillation strategy for YoloV9?</h3><p>To find the Knowledge Distillation method best suited to YoloV9, it is very important to know what kind of object detector YoloV9 is. YoloV9 is a one-stage, anchor-free detector.</p><p>Therefore, “Distilling Object Detectors with Fine-grained Feature Imitation”, one of the Response based KD methods, cannot be used, because it relies on anchor-based predictions.</p><p>On the other hand, features can be extracted from the neck, and the teacher and student will be homogeneous models with the same architecture but different sizes. For these reasons, I judged “Focal and Global Knowledge Distillation for Detectors” to be the most suitable KD method for YoloV9.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/499/1*HXjgdSw9K_JaaQaYw-dCow@2x.jpeg" /><figcaption>YOLOv8 model structure</figcaption></figure><h3>What can we expect from this KD method?</h3><p>YoloV9 currently provides C (large) and E (x-large) models. By distilling knowledge from the YoloV9 E model into the YoloV9 C model, we aim to raise the C model’s mAP from 53 to 55.6, the level of the E model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/379/1*L11UGj988-LiECL6vOmOaw@2x.jpeg" /><figcaption>YoloV9 mAP Performance</figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=08d420499141" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[[Quantization] Go Faster with ReLU! — YoloV8 QAT x2 Speed up on your Jetson Orin Nano #3]]></title>
            <link>https://medium.com/@MaroJEON/quantization-go-faster-with-relu-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-3-4d4733c9e435?source=rss-92bcabecb40------2</link>
            <guid isPermaLink="false">https://medium.com/p/4d4733c9e435</guid>
            <category><![CDATA[yolov8]]></category>
            <category><![CDATA[tensorrt]]></category>
            <category><![CDATA[ptqs]]></category>
            <category><![CDATA[qat]]></category>
            <category><![CDATA[model-optimization]]></category>
            <dc:creator><![CDATA[Maro JEON]]></dc:creator>
            <pubDate>Fri, 13 Oct 2023 13:31:33 GMT</pubDate>
            <atom:updated>2023-12-22T01:54:03.745Z</atom:updated>
            <content:encoded><![CDATA[<h3>[Quantization] Go Faster with ReLU! — YoloV8 QAT x2 Speed up on your Jetson Orin Nano #3</h3><h4>Additional Tip (Updated 24. Nov)</h4><ul><li>If you use nn.ReLU6() instead, you get an even better dynamic range for the activation outputs. ReLU6 clips every activation to [0, 6], so each one (exported to ONNX as a Clip node) is reported with the fixed range (-6.0, 6.0) and an INT8 scale of 6 / 127 ≈ 0.047244094. See below!</li></ul><pre>/model.22/cv2.0/cv2.0.0/act/Clip | /model.22/cv2.0/cv2.0.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv2.0/cv2.0.1/act/Clip | /model.22/cv2.0/cv2.0.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv3.0/cv3.0.0/act/Clip | /model.22/cv3.0/cv3.0.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv3.0/cv3.0.1/act/Clip | /model.22/cv3.0/cv3.0.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv2.1/cv2.1.0/act/Clip | /model.22/cv2.1/cv2.1.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv2.1/cv2.1.1/act/Clip | /model.22/cv2.1/cv2.1.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv3.1/cv3.1.0/act/Clip | /model.22/cv3.1/cv3.1.0/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094<br>/model.22/cv3.1/cv3.1.1/act/Clip | /model.22/cv3.1/cv3.1.1/act/Clip_output_0 | (-6.0, 6.0)  |  0.047244094</pre><h3>1. Goal</h3><ul><li>In this post, we introduce a method that may sacrifice a little model accuracy but further accelerates inference.</li><li>As shown below, using ReLU instead of the default SiLU activation improves speed by about 10%, while accuracy drops by only about 1%.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*gGK4jfc2woa43tDiEEYcpg.png" /><figcaption><a href="https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization">yolov5 gpu optimization github</a></figcaption></figure><h3>2. How?</h3><ul><li>When TensorRT builds an engine, it automatically merges convolution, bias, and ReLU layers into a single operation. This optimization, called layer fusion, does not change the results but produces a faster engine.</li><li>NOTE: SiLU (Sigmoid + Mul) is inherently more expensive than ReLU, so switching to ReLU improves speed even without layer fusion.</li><li>If convolution, bias, and ReLU are computed separately, as shown below, every intermediate result must be written to memory and read back, which wastes time. Layer fusion removes these memory round trips, which effectively reduces latency.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*nW9RmX_QUQtp0x8sa_jIAg.png" /><figcaption>TensorRT layer fusion example</figcaption></figure><ul><li>For the supported fusion types, please refer to the TensorRT developer guide (<a href="https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#fusion-types">link</a>).</li></ul><h3>3. Modify the activation type in yolov8! (super easy!!)</h3><ul><li>To train YOLOv8 with only the activation changed, do the following.</li><li>In the model config YAML file provided by YOLOv8 (<a href="https://github.com/ultralytics/ultralytics/blob/main/ultralytics/cfg/models/v8/yolov8.yaml">github link</a>), just add one line (activation: nn.ReLU()) and start training!</li></ul><pre># Ultralytics YOLO 🚀, AGPL-3.0 license<br># YOLOv8 object detection model with P3-P5 outputs. For Usage examples see https://docs.ultralytics.com/tasks/detect<br><br># Parameters<br>nc: 80  # number of classes<br>scales: # model compound scaling constants, i.e. &#39;model=yolov8n.yaml&#39; will call yolov8.yaml with scale &#39;n&#39;<br>  # [depth, width, max_channels]<br>  n: [0.33, 0.25, 1024]  # YOLOv8n summary: 225 layers,  3157200 parameters,  3157184 gradients,   8.9 GFLOPs<br>  s: [0.33, 0.50, 1024]  # YOLOv8s summary: 225 layers, 11166560 parameters, 11166544 gradients,  28.8 GFLOPs<br>  m: [0.67, 0.75, 768]   # YOLOv8m summary: 295 layers, 25902640 parameters, 25902624 gradients,  79.3 GFLOPs<br>  l: [1.00, 1.00, 512]   # YOLOv8l summary: 365 layers, 43691520 parameters, 43691504 gradients, 165.7 GFLOPs<br>  x: [1.00, 1.25, 512]   # YOLOv8x summary: 365 layers, 68229648 parameters, 68229632 gradients, 258.5 GFLOPs<br><br># ---- add this line ---- #<br>activation: nn.ReLU()<br><br># YOLOv8.0n backbone<br>backbone:<br>  # [from, repeats, module, args]<br>  - [-1, 1, Conv, [64, 3, 2]]  # 0-P1/2<br>  - [-1, 1, Conv, [128, 3, 2]]  # 1-P2/4<br>  - [-1, 3, C2f, [128, True]]<br>  - [-1, 1, Conv, [256, 3, 2]]  # 3-P3/8<br>  - [-1, 6, C2f, [256, True]]<br>  - [-1, 1, Conv, [512, 3, 2]]  # 5-P4/16<br>  - [-1, 6, C2f, [512, True]]<br>  - [-1, 1, Conv, [1024, 3, 2]]  # 7-P5/32<br>  - [-1, 3, C2f, [1024, True]]<br>  - [-1, 1, SPPF, [1024, 5]]  # 9<br><br># YOLOv8.0n head<br>head:<br>  - [-1, 1, nn.Upsample, [None, 2, &#39;nearest&#39;]]<br>  - [[-1, 6], 1, Concat, [1]]  # cat backbone P4<br>  - [-1, 3, C2f, [512]]  # 12<br><br>  - [-1, 1, nn.Upsample, [None, 2, &#39;nearest&#39;]]<br>  - [[-1, 4], 1, Concat, [1]]  # cat backbone P3<br>  - [-1, 3, C2f, [256]]  # 15 (P3/8-small)<br><br>  - [-1, 1, Conv, [256, 3, 2]]<br>  - [[-1, 12], 1, Concat, [1]]  # cat head P4<br>  - [-1, 3, C2f, [512]]  # 18 (P4/16-medium)<br><br>  - [-1, 1, Conv, [512, 3, 2]]<br>  - [[-1, 9], 1, Concat, [1]]  # cat head P5<br>  - [-1, 3, C2f, [1024]]  # 21 (P5/32-large)<br><br>  - [[15, 18, 21], 1, Detect, [nc]]  # Detect(P3, P4, P5)</pre><ul><li>If you change the activation to ReLU, the graphsurgeon_model function in <a href="https://medium.com/@smallerNdeeper/quantization-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-2-how-to-achieve-the-best-qat-8077ac0a167b">the previous post</a> must be modified to look for Relu nodes instead of Sigmoid and Mul nodes. Keep this in mind when implementing it.</li><li>If you export the ONNX model and inspect the graph with netron.app, you can see the change as follows.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/674/1*01C_YxYply3V418x0xvUuA.png" /><figcaption>Activation change SiLU to ReLU</figcaption></figure><h3>4. TensorRT Graph &amp; Latency Result</h3><ul><li>Here, we inspect the TensorRT engine graph with the Trex tool to see how much the speed improved.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/1*SLC1WAV9WPh0TbIxwNp3wg.png" /><figcaption>(Left) Conv + BN + Silu, (Right) Conv + BN + ReLU</figcaption></figure><ul><li>As you can see in the figure above, the latency of the first layer is 4.3 ms with SiLU and 2.7 ms with ReLU. This single conv layer alone therefore runs about 59% faster (4.3 / 2.7 ≈ 1.59), which is roughly 37% lower latency.</li><li>With SiLU, layer fusion does not occur and a separate PWN (PointWiseNode) is created; with ReLU, the layers are fused and executed as a single convolution.</li><li>Finally, the latency we obtained for YOLOv8 medium (batch 4) with ReLU activation was <strong>75.172 ms</strong> on Jetson Orin Nano 4GB.</li></ul><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/1158c2d7f00b7677e723f37a30dd9a88/href">https://medium.com/media/1158c2d7f00b7677e723f37a30dd9a88/href</a></iframe><h3>5.
Conclusion</h3><ul><li>Going beyond QAT, we experimented with a way to make YOLOv8 even faster and actually made it 14.2% faster!!!</li><li>Faster inference can come at some cost in accuracy, so in the next post, let’s find out how much accuracy drops with QAT and how to recover it. Stay tuned!!!</li></ul><p>—</p><h3>Trending Articles</h3><h4>Hit! [yolov8] <a href="https://medium.com/@smallerNdeeper/yolov8-batch-inference-implementation-using-tensorrt-2-converting-to-batch-model-engine-e02dc203fc8b">converting to Batch model engine</a></h4><h4>Hit! <a href="https://medium.com/@smallerNdeeper/quantization-go-faster-with-relu-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-3-4d4733c9e435">[Quantization] Go Faster with ReLU!</a></h4><h4><a href="https://medium.com/@smallerNdeeper/quantization-achieve-accuracy-drop-to-near-zero-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-e178c4d8a5e3">[Quantization] Achieve Accuracy Drop to Near Zero</a></h4><h4><a href="https://medium.com/@smallerNdeeper/quantization-yolov8-qat-x2-speed-up-on-your-jetson-orin-nano-2-how-to-achieve-the-best-qat-8077ac0a167b">[Quantization] How to achieve the best QAT performance</a></h4><h4><a href="https://medium.com/@smallerNdeeper/yolov8-jetson-deepstream-benchmark-test-orin-nano-4gb-8gb-nx-tx2-5a2eb3560eeb">[Yolov8/Jetson/Deepstream] Benchmark test</a></h4><h4>[yolov8] <a href="https://medium.com/@smallerNdeeper/yolov8-batch-inference-implementation-using-tensorrt-4-nms-post-processing-implementation-daecfef41b78">NMS Post Processing implementation using only Numpy</a></h4><h4>[yolov8] <a href="https://medium.com/@smallerNdeeper/yolov8-batch-inference-implementation-using-tensorrt-3-batch-inference-using-tensorrt-python-cf30ae10920c">batch inference using TensorRT python api</a></h4><h3>About the Author</h3><p>Hello, I’m Deeper&amp;Cheaper.</p><ul><li>I am a developer and blogger with the goal of integrating AI technology into everyone’s lives, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&amp;Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.</li><li>The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for going deep to achieve high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.</li><li>With expertise in Computer Vision and Software Development, I have knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and Segment Anything. Additionally, I have specialized knowledge in software development and embedded systems.</li><li><strong>Please don’t hesitate to drop your questions in the comments section.</strong></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4d4733c9e435" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>