[Quantization] Achieve Accuracy Drop to Near Zero – YoloV8 QAT x2 Speed up on your Jetson Orin Nano #4

DeeperAndCheaper
6 min read · Nov 22, 2023


Background Knowledge

  • In the QAT process, quantization/dequantization (Q/DQ) layers are added to every layer of the deep learning model that requires quantization.
  • Through these layers, the model obtains the scale factor and zero point used for quantization. Since TensorRT only supports symmetric quantization, the zero point is always zero (refer this).
zero point is always zero for TensorRT quantization
  • Therefore, in our experiments, what fine tuning actually learns is the scale factor.
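
As a quick illustration of what the Q/DQ layers represent, here is a minimal sketch of symmetric int8 fake quantization (an illustrative sketch, not the pytorch_quantization implementation): only a scale factor derived from the calibrated amax is needed, and the zero point stays at zero.

import torch

def fake_quantize_symmetric(x: torch.Tensor, amax: float) -> torch.Tensor:
    """Quantize-dequantize x with symmetric int8; the zero point is fixed at 0."""
    scale = amax / 127.0                                  # the scale factor that QAT learns/calibrates
    q = torch.clamp(torch.round(x / scale), -127, 127)    # map to the int8 range
    return q * scale                                      # dequantize back to float

x = torch.randn(4)
print(fake_quantize_symmetric(x, amax=x.abs().max().item()))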

Goal

  • The accuracy drop after engine conversion depends on how well the scale factors of the Q/DQ nodes are learned and stored.
  • Here, we share methods and practical tips for minimizing that accuracy drop through fine tuning.

NOTE: We used the COCO dataset here. The accuracy of a YOLOv8-medium model trained for 100 epochs was taken as the baseline.

Abstract

  • The QAT workflow covers two cases: fine tuning a model that has already been trained, or running QAT fine tuning immediately after normal training finishes.
  • 1. The initial training hyper-parameter settings for QAT differ from the normal training settings.
  • 2. Before QAT training, add Q/DQ layers to the model, then calibrate the scale factors with a calibration dataset.
  • 3. Export to ONNX and modify the ONNX graph to remove unnecessary latency.
  • 4. Finally, check how much accuracy drop occurred.

1. qat_flag for Distinguishing QAT from Normal Training

if self.epochs >= 0 and self.qat:
    self.qat_flag = True
self._do_train(world_size)
  • Normal training and QAT training should be conducted separately.
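
For context, here is a minimal sketch of how such a flag could be wired in from user code, assuming a custom trainer subclass. The QAT-specific names are hypothetical and not part of the stock ultralytics API.

from ultralytics.yolo.v8.detect import DetectionTrainer  # import path may differ across ultralytics versions

class QATDetectionTrainer(DetectionTrainer):
    """Hypothetical wrapper exposing a qat switch checked by the modified trainer code above."""
    def __init__(self, overrides=None, qat=False):
        super().__init__(overrides=overrides)
        self.qat = qat          # read before _do_train() to set self.qat_flag
        self.qat_flag = False

# Usage (sketch):
# trainer = QATDetectionTrainer(overrides={'model': 'yolov8m.pt', 'data': 'coco.yaml'}, qat=True)
# trainer.train()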

2. Initial Training Setup for QAT

  • No warm-up. Warm-up helps models trained from scratch, but during fine tuning it makes the loss very large and lengthens the time to convergence, so set warmup_epochs to 0.
  • Set epochs to 20. Many reference documents suggest fine tuning for 1/10 of the original training epochs, but in that case the model may not converge enough. Therefore, 20 epochs (or about 20% of the original schedule) is recommended; it depends entirely on how complex your data is.
  • The learning rate is recommended to be 1/100 of the lr0 used for normal training, so I set it to 0.0001 (in my case, lr0 is 0.01).
  • Set AMP (automatic mixed precision) to false. An error occurs if half precision is used after Q/DQ nodes are added.
  • For the scheduler, cosine annealing is recommended; I concluded that it converges faster than the linear scheduler. There is no standard recommendation for T_max; after much experimentation I set it to epochs * 1.0. eta_min is recommended to be 1/100 of the QAT lr0 (lr0 * 0.01).
def _setup_train(self, world_size, qat_epochs=20, qat_lr=0.0001):
    self.args.warmup_epochs = 0  # no warm-up
    self.epochs = qat_epochs
    self.args.lr0 = qat_lr

    # Model
    self.run_callbacks('on_pretrain_routine_start')
    ckpt = self.setup_model()
    self.model = self.model.to(self.device)
    self.set_model_attributes()
    # Check AMP: must stay off, half precision fails once Q/DQ nodes are present
    self.amp = False
    self.scaler = amp.GradScaler(enabled=self.amp)
    if world_size > 1:
        self.model = DDP(self.model, device_ids=[RANK])
    # Check imgsz
    gs = max(int(self.model.stride.max() if hasattr(self.model, 'stride') else 32), 32)  # grid size (max stride)
    self.args.imgsz = check_imgsz(self.args.imgsz, stride=gs, floor=gs, max_dim=1)
    # Batch size
    if self.batch_size == -1:
        if RANK == -1:  # single-GPU only, estimate best batch size
            self.args.batch = self.batch_size = check_train_batch_size(self.model, self.args.imgsz, self.amp)
        else:
            raise SyntaxError('batch=-1 to use AutoBatch is only available in Single-GPU training. '
                              'Please pass a valid batch size value for Multi-GPU DDP training, i.e. batch=16')

    # Dataloaders
    batch_size = self.batch_size // max(world_size, 1)
    self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode='train')
    if RANK in (-1, 0):
        self.test_loader = self.get_dataloader(self.testset, batch_size=batch_size * 2, rank=-1, mode='val')
        self.validator = self.get_validator()
        metric_keys = self.validator.metrics.keys + self.label_loss_items(prefix='val')
        self.metrics = dict(zip(metric_keys, [0] * len(metric_keys)))  # TODO: init metrics for plot_results()?
        self.ema = ModelEMA(self.model)
        if self.args.plots and not self.args.v5loader:
            self.plot_training_labels()

    # Optimizer
    self.accumulate = max(round(self.args.nbs / self.batch_size), 1)  # accumulate loss before optimizing
    weight_decay = self.args.weight_decay * self.batch_size * self.accumulate / self.args.nbs  # scale weight_decay
    iterations = math.ceil(len(self.train_loader.dataset) / max(self.batch_size, self.args.nbs)) * self.epochs
    self.optimizer = self.build_optimizer(model=self.model,
                                          name=self.args.optimizer,
                                          lr=self.args.lr0,
                                          momentum=self.args.momentum,
                                          decay=weight_decay,
                                          iterations=iterations)
    # Scheduler: self.lf is kept from the stock trainer, but the scheduler below supersedes it
    if self.args.cos_lr:
        self.lf = one_cycle(1, self.args.lrf, self.epochs)  # cosine 1->hyp['lrf']
    else:
        self.lf = lambda x: (1 - x / self.epochs) * (1.0 - self.args.lrf) + self.args.lrf  # linear

    # Cosine annealing: T_max = epochs * 1.0, eta_min = QAT lr0 / 100
    self.scheduler = optim.lr_scheduler.CosineAnnealingLR(self.optimizer, T_max=self.epochs * 1.0, eta_min=self.args.lr0 * 0.01)
    self.stopper, self.stop = EarlyStopping(patience=self.args.patience), False
    self.resume_training(ckpt)
    self.scheduler.last_epoch = self.start_epoch - 1  # do not move

    self.run_callbacks('on_pretrain_routine_end')

3. Add Q/DQ Layers & Calibration

  • Refer to posting #2 on my blog to add Q/DQ nodes to the torch model.
  • Next, calibrate with a calibration dataset (refer this).
  • Available calibration methods include histogram, percentile, mse, and entropy; mse turned out to be the best in our experiments.
  • Using a large number of calibration batches had a bigger impact on reducing the fine-tuning loss at the beginning than the choice of calibration method, so I used 1024 batches here.
  • For calibration, the validation dataloader is generally used (as in PTQ), but QAT involves additional training and the goal is to converge the loss as quickly as possible, so the train dataloader was used as the calibration dataset.

import torch
from tqdm import tqdm
from pytorch_quantization import calib
from pytorch_quantization import nn as quant_nn

def cal_model(model, data_loader, device, num_batch=1024):

    def compute_amax(model, **kwargs):
        for name, module in model.named_modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    if isinstance(module._calibrator, calib.MaxCalibrator):
                        module.load_calib_amax(strict=False)
                    else:
                        module.load_calib_amax(**kwargs)

                    module._amax = module._amax.to(device)

    def collect_stats(model, data_loader, device, num_batch=1024):
        """Feed data to the network and collect statistics"""
        # Enable calibrators
        model.eval()
        for name, module in model.named_modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    module.disable_quant()
                    module.enable_calib()
                else:
                    module.disable()

        # Feed data to the network for collecting stats
        with torch.no_grad():
            for i, datas in tqdm(enumerate(data_loader), total=num_batch, desc="Collect stats for calibrating"):
                # imgs = datas[0].to(device, non_blocking=True).float() / 255.0
                imgs = datas['img'].to(device, non_blocking=True).float() / 255.0
                model(imgs)

                if i >= num_batch:
                    break

        # Disable calibrators
        for name, module in model.named_modules():
            if isinstance(module, quant_nn.TensorQuantizer):
                if module._calibrator is not None:
                    module.enable_quant()
                    module.disable_calib()
                else:
                    module.enable()

    collect_stats(model, data_loader, device, num_batch=num_batch)
    compute_amax(model, method="mse")

NOTE: module._amax stores the calibrated amax value from which the scale factor is derived.
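
As a usage sketch, assuming a trainer object like the customized one in section 2 whose model already has Q/DQ nodes inserted (see posting #2), calibration would be invoked roughly like this:

import torch

# Hypothetical invocation; `trainer` is assumed to hold the Q/DQ-wrapped model and the train dataloader
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
cal_model(trainer.model, trainer.train_loader, device, num_batch=1024)  # MSE calibration over 1024 train batches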

4. Export to ONNX & modify ONNX model

  • The method to export a QAT torch model to ONNX is as follows (refer this).
  from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
quant_nn.TensorQuantizer.use_fb_fake_quant = True
quant_modules.initialize()
  • Before exporting, you must apply the settings above.
  • The opset version must be 13 or higher (a minimal export sketch follows after this list).
  • Avoid fusing convolution and batch normalization if possible (I remember that fusion caused an error).
  • The exported model contains redundant Q/DQ nodes; this can be corrected by referring to post #2.
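
Here is a minimal export sketch under the assumptions above; the 640x640 input size, file name, and input/output names are placeholders rather than the exact code used in this post.

import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules

quant_nn.TensorQuantizer.use_fb_fake_quant = True  # export Q/DQ as QuantizeLinear/DequantizeLinear
quant_modules.initialize()

model.eval()  # `model` is the QAT fine-tuned torch model from the previous steps
dummy = torch.zeros(1, 3, 640, 640, device=next(model.parameters()).device)  # assumed 640x640 input
torch.onnx.export(
    model,
    dummy,
    'yolov8m_qat.onnx',                 # placeholder output path
    opset_version=13,                   # opset >= 13 is required for Q/DQ export
    input_names=['images'],             # placeholder input/output names
    output_names=['output'],
    dynamic_axes={'images': {0: 'batch'}, 'output': {0: 'batch'}},
)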

Result

Normal training and QAT fine tuning result
  • This checks whether QAT fine tuning (20 epochs) can reach the target mAP50:95 that normal training achieved over 100 epochs.
  • Normal training for 100 epochs reaches 0.464, and QAT fine tuning for 20 epochs reaches 0.45.
  • The recovery rate is one minus the relative drop, i.e. the QAT value divided by the original value.
  • In this experiment, we achieved a recovery rate of 97%, and we believe we can reach 100% or better by increasing the epochs or adjusting other hyper-parameters.
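
For concreteness, the arithmetic with the numbers above:

baseline, qat = 0.464, 0.450                 # mAP50:95 after 100-epoch training vs. 20-epoch QAT fine tuning
recovery = qat / baseline                    # = 1 - (baseline - qat) / baseline
print(f"recovery rate: {recovery:.1%}")      # -> recovery rate: 97.0%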

Conclusion

  • We finally recovered the accuracy of the model! If this is not enough, adjust the initial settings; it can help improve the results further.

In terms of latency, we get a 92% improvement, i.e. roughly a 2x speed up (144 ms down to 75 ms for batch 4)!

In terms of accuracy, we see only a 3% degradation (0.464 to 0.45 mAP50:95)!

With such good results, shouldn’t we definitely try this?

Trending Articles

Hit! [yolov8] converting to Batch model engine

Hit! [Quantization] Go Faster with ReLU!

[Quantization] Achieve Accuracy Drop to Near Zero

[Quantization] How to achieve the best QAT performance

[Yolov8/Jetson/Deepstream] Benchmark test

[yolov8] NMS Post Processing implementation using only Numpy

[yolov8] batch inference using TensorRT python api

About Authors

Hello, I’m Deeper&Cheaper.

  • I am a developer and blogger with the goal of integrating AI technology into the lives of everyone, pursuing the mission of “Make More People Use AI.” As the founder of the startup Deeper&Cheaper, operating under the slogan “Go Deeper Make Cheaper,” I am dedicated to exploring AI technology more deeply and presenting ways to use it cost-effectively.
  • The name encapsulates the philosophy that “Cheaper” reflects a focus on affordability to make AI accessible to everyone. However, from my perspective, performance is equally crucial, and thus “Deeper” signifies a passion for delving deep with high performance. Under this philosophy, I have accumulated over three years of experience in various AI fields.
  • With expertise in Computer Vision and Software Development, I possess knowledge and skills in diverse computer vision technologies such as object detection, object tracking, pose estimation, object segmentation, and segment anything. Additionally, I have specialized knowledge in software development and embedded systems.
  • Please don’t hesitate to drop your questions in the comments section.
