Efficiency in AI (Part 2)

Andrii Polukhin
7 min read · May 6, 2024


Practical tips for optimizing AI systems

This blog post is inspired by my webinar on YouTube (in Ukrainian), so I decided to make notes on it. Most of the images were taken from my presentation. This material is divided into two parts: first part & second part.

Table Of Contents

  1. Change your model architecture
  2. Knowledge distillation
  3. Model pruning technique
  4. Shrinking the bytes
  5. Practical example: How to run YOLOv8 with a 3x speed-up?
A retro 8-bit style image depicting a small robot embodying artificial intelligence, running swiftly towards a vibrant sunset. Credits: ChatGPT

Change your model architecture

Any neural network is built from a handful of core building blocks: convolutional layers, attention, recurrent units, and dozens of others. These blocks are computationally expensive because they process millions of parameters at once. However, you can apply optimizations to them that make them run faster while sacrificing only a little accuracy.

Figure 1. Ways to Optimize the Model Architecture | Source: polukhin.tech

If you are working with a convolutional neural network, convolution operations usually dominate the computation time. You can replace standard convolutions with depthwise separable convolutions, as used in MobileNet, or with group convolutions from ShuffleNet, among other techniques.
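As a rough illustration, here is a minimal PyTorch sketch comparing a standard convolution with its depthwise separable counterpart (the channel counts and input size are arbitrary):

import torch
import torch.nn as nn

in_ch, out_ch, k = 64, 128, 3

# Standard convolution: one large kernel mixes channels and space at the same time.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

# Depthwise separable convolution: a per-channel spatial filter (groups=in_ch)
# followed by a 1x1 pointwise convolution that mixes channels.
depthwise_separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),
    nn.Conv2d(in_ch, out_ch, kernel_size=1),
)

x = torch.randn(1, in_ch, 56, 56)
assert standard(x).shape == depthwise_separable(x).shape

def count(m):
    return sum(p.numel() for p in m.parameters())

print(count(standard), "vs", count(depthwise_separable))  # roughly 74k vs 9k parameters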

For transformers, and Large Language Models in particular, replacing standard attention with linear attention can reduce the complexity from O(N²) to O(N), which makes a significant difference in speed.
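Below is a minimal PyTorch sketch contrasting the two formulations on random tensors; it assumes the simple elu(x) + 1 feature map from the linear-attention literature and is only meant to show where the N x N matrix disappears:

import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: the (N x N) score matrix makes this O(N^2) in sequence length.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Linear attention: apply a feature map and reassociate the matrix product,
    # building a (d x d) matrix instead of an (N x N) one -> O(N) in sequence length.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = k.transpose(-2, -1) @ v                            # (d x d)
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # normalization term
    return (q @ kv) / z

q = k = v = torch.randn(1, 1024, 64)  # (batch, sequence length N, head dim d)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)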

General optimizations that apply across architectures include Neural Architecture Search (NAS), for example FBNet, which systematically searches through combinations of layers to find the most efficient model structure.

Knowledge distillation

This is a fascinating concept taken from real life. Imagine a teacher with extensive knowledge and a student who knows little. How do you effectively transfer that knowledge? You could either have the student read 100 books to learn everything, or the teacher could pass on the steps they take to solve problems, essentially distilling the knowledge.

Figure 2. The teacher-student framework for knowledge distillation | Source: Arxiv

In terms of neural networks, this means transferring the features learned at each layer from the teacher to the student; this is known as Knowledge Distillation. Model distillation is the process of extracting this critical information and transferring it to the student model. The method can not only compress a large model into a smaller one, but can sometimes even improve on the original results, for example by distilling from several teacher models at once.
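Expressed in code, the classic soft-target formulation reduces to one extra loss term; here is a minimal PyTorch sketch (the temperature, weighting, and random logits are arbitrary placeholders):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Soft loss: KL divergence between temperature-softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1 - alpha) * soft

# Toy usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)  # produced by the frozen teacher
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()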

Model pruning technique

The essence of model pruning is that certain neurons can be removed from a network without significantly compromising the result. As shown in the image below, a neuron is disconnected, and the output no longer depends on it. While this might slightly worsen the result, the pruned network is typically retrained afterwards to recover the lost accuracy.

Figure 3. Pruning: Before and After | Source: Arxiv

Simply speaking, pruning is the process of shrinking a network by eliminating parameters. Formally, a pruned model is obtained by elementwise multiplying the model's parameters by a binary mask that sets selected parameters to 0. In production, the pruned parameters are either set to zero or removed entirely, rather than keeping an explicit mask.
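This mask-and-multiply formulation is exactly what PyTorch's built-in pruning utility implements; here is a minimal sketch on a single linear layer (the layer size and the 50% sparsity target are arbitrary):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 128)

# L1 unstructured pruning: zero out the 50% of weights with the smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The layer now keeps the original weights in `weight_orig` and a binary mask in `weight_mask`;
# the effective `weight` is their elementwise product.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.0%}")

# Make the pruning permanent (drop the mask and keep the zeroed weights).
prune.remove(layer, "weight")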

You can read more about pruning in my dedicated blog post.

The list of GitHub repositories with Python implementations of pruning:

  • Awesome Pruning: A curated list of neural network pruning resources.
  • SparseML: Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models.
  • Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet).

A code example of pruning in Python with Intel Neural Compressor:

# Install the https://github.com/intel/neural-compressor pruning engine
!pip3 install neural-compressor

from neural_compressor.training import prepare_pruning, WeightPruningConfig

config = WeightPruningConfig(configs)      # pruning configuration (target sparsity, schedule, ...)
prepare_pruning(model, config, optimizer)  # attach the pruning logic to the model and optimizer

for epoch in range(num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        model.zero_grad()

Shrinking the bytes

Reducing the precision of numerical representations has emerged as a promising approach in AI optimization, offering significant speed gains while maintaining reasonable accuracy. Let's discuss half precision, model quantization, and 4-bit integers.

Half Precision: Doubling the Speed, Halving the Precision

Imagine a scenario where your machine learning model operates at 70% accuracy with standard 32-bit floating-point numbers (Float32). Now, what if you could potentially double the speed of that model with a minor trade-off in precision? Enter half-precision computation, often referred to as fp16.

Numbers are typically represented using Float32, which consists of a sign bit, an exponent, and a mantissa. By switching to fp16, we halve the memory needed for every value and, on hardware with native fp16 support, can roughly double the throughput, while sacrificing only a small amount of precision. For instance, if your model's accuracy drops from 70% to 69%, the impact may be negligible, yet the speed boost can be substantial.

Figure 4. Comparison of FP16 (half precision floating points) and FP32 (single precision floating points) | Source: ResearchGate

Implementing half-precision is remarkably straightforward across various frameworks.

# PyTorch: convert the model and inputs to fp16
import torch
model = model.half()            # convert model weights to half precision
input_data = input_data.half()  # convert input data to half precision

# TensorFlow 2.x (legacy experimental API; newer releases use
# tf.keras.mixed_precision.set_global_policy('mixed_float16'))
import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

# TensorRT (pre-8.x API; newer releases set the FP16 flag on the builder config)
import tensorrt as trt
builder = trt.Builder(TRT_LOGGER)
builder.fp16_mode = True

Model Quantization: Reducing Granularity, Maximizing Efficiency

While half-precision is a form of quantization, full-model quantization takes the concept a step further by reducing the granularity of the numbers used in a model. The transformation from FP32 to FP16 or even to INT8 decreases the number of bits from 32 to 16 or 8, with INT8 allowing only 256 possible values for weights.

Figure 5. Quantization technique | Source: velog.io

Quantization usually involves a calibration step: representative data is run through the model to measure the range of values each layer typically produces. The observed floating-point range of each tensor (say, -1000 to +1000) is then mapped linearly onto the 8-bit integer range of -127 to +127.
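A minimal NumPy sketch of this symmetric mapping, assuming calibration has already produced the maximum absolute value of the tensor:

import numpy as np

def quantize_int8(x, max_abs):
    # Map the observed float range [-max_abs, +max_abs] onto the int8 range [-127, +127].
    scale = max_abs / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

weights = np.random.uniform(-1000, 1000, size=5).astype(np.float32)
q, scale = quantize_int8(weights, max_abs=1000.0)
print(weights)
print(dequantize(q, scale))  # close to the originals, up to the quantization step (~7.87)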

Deploy Just in 4 Lines with MQBench (Source: Docs):

import torchvision.models as models                          # example model
from mqbench.prepare_by_platform import prepare_by_platform  # insert quantization nodes for a specific backend
from mqbench.prepare_by_platform import BackendType          # enum of supported backends (TensorRT, Academic, ...)
from mqbench.utils.state import enable_calibration           # calibration mode: determine scale, zero_point, etc.
from mqbench.utils.state import enable_quantization          # quantized mode: actually simulate FP32 -> INT8
from mqbench.convert_deploy import convert_deploy            # export the quantized model for deployment

model = models.__dict__["resnet18"](pretrained=True)
model = prepare_by_platform(model, BackendType.Tensorrt)

enable_calibration(model)             # turn on calibration
for i, batch in enumerate(data):      # run forward passes on calibration data
    ...

enable_quantization(model)            # turn on actual quantization
input_shape = {'data': [10, 3, 224, 224]}
convert_deploy(model, BackendType.Tensorrt, input_shape)  # export the model

Int4: Pushing the Limits for Large Language Models

While aggressive quantization may not work well with every model type, the concept of “int4” quantization, which utilizes 4-bit integers, has shown promising results, particularly for Large Language Models (LLMs). By reducing the representational capacity of the weights in a neural network, int4 quantization can significantly speed up inference times, making it an attractive option for deploying models in resource-constrained environments.
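For LLMs in particular, 4-bit loading is already exposed through the Hugging Face transformers + bitsandbytes stack. Below is a minimal sketch; the checkpoint name is only an example, and the exact flags may vary between library versions:

# pip install transformers accelerate bitsandbytes
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"  # example checkpoint

# NF4 4-bit quantization with fp16 compute, as popularized by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization lets us", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))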

Figure 6. MLPerf v0.5 Inference results | Source: nvidia.com

As an example, quantizing a ResNet-50 model to int4 has demonstrated almost a doubling in speed, although it’s likely that there was a small trade-off with precision loss, which is a common occurrence with more aggressive forms of quantization.

Practical example: How to run YOLOv8 with a 3x speed-up?

Here is a case study of speeding up YOLOv8 by a factor of three using half-precision computation and TensorRT optimization. Originally, YOLOv8-M on an Amazon EC2 g5.2xlarge instance with 24 GB of GPU memory, a batch size of 64, and an image size of 1280x1280 pixels processes 56 frames per second (FPS). Converting the model to half precision almost doubles the speed to 91 FPS. However, simply converting to TensorRT alone does not show a significant speed increase.

Figure 7. Schema of the YOLOv8 optimization.

When half precision and TensorRT conversion are used together, the speed triples to 143 FPS. This improvement is achieved with simple Python code that sets the export format to a TensorRT engine, specifies the image size and batch size, and enables half precision.

# Export YOLOv8-M to a TensorRT engine with half precision
import ultralytics

model = ultralytics.YOLO("yolov8m.pt")
model.export(format="engine", imgsz=1280, batch=64, half=True)
# Alternatively, use https://github.com/NVIDIA-AI-IOT/torch2trt
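Once the export finishes, the resulting engine can be loaded back through the same Ultralytics API; a minimal usage sketch (the engine file name and test image are assumptions):

import ultralytics

# Load the exported TensorRT engine and run half-precision inference at the export resolution.
trt_model = ultralytics.YOLO("yolov8m.engine")
results = trt_model.predict("image.jpg", imgsz=1280, half=True)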

This optimization illustrates the benefits of using half-precision and a dedicated inference engine like TensorRT to significantly boost the performance of deep learning models, especially in a production environment where speed is crucial.

Thank you for reading this article on the various optimization techniques for neural networks and machine learning models. It’s fascinating to see how advancements in this field can bring powerful AI capabilities to everyday devices.

If you found it informative and engaging, please connect with me through my social media channels.

If you have any questions or feedback, please feel free to leave a comment below or contact me directly via any of my communication channels. I look forward to sharing more insights and knowledge with you!


Andrii Polukhin

I am a deep learning enthusiast. Currently, I am an ML Engineer at Data Science UA and Samba TV. I write about neural networks and artificial intelligence.