Post-Training Optimization Techniques

EDGENeural.AI
7 min read · Oct 4, 2021


Deep Neural Networks (DNNs) have witnessed great success and have become popular lately. Nonetheless, DNNs require expensive computational resources and enormous storage space, which makes them difficult to deploy on resource-constrained devices such as Internet of Things (IoT) devices, smartphone processors, and embedded controllers in mobile robots. Post-training optimization can improve performance in many ways and is an active research topic in the field of deep learning.

DNNs are easy to train but are usually very large and over-parameterized. With increasing amounts of data available for learning and its use in ever more complex tasks, these neural architectures keep growing. This makes the models slow, hurts inference speed, and hinders the real-world applications in which they are deployed. It therefore becomes necessary to optimize the neural architectures, which helps overcome the challenge of deploying huge neural networks in real-world scenarios. This blog introduces and briefly explains the various post-training optimization techniques at different levels.

The figure above shows the different levels that contribute to post-training optimization for faster inference. At least one of these levels should be used to optimize a model, and by using several levels in conjunction, a significant speedup can be achieved. Let’s take a closer look at each of these levels and how they optimize models.

  1. Hardware Level
    Hardware devices sit at the lowest level of the inference stack, as these are the units that perform the computations. They could be CPUs, GPUs, FPGAs, or specialized ASIC accelerators such as Google’s TPU, to name a few. The idea is simple: stronger computational hardware leads to faster inference. Hardware plays an important role in running production-grade DNNs.
  2. Software Level
    Software-based optimizations refer to any optimization performed without changing the model itself. These are programs that translate a DNN’s computational graph into actual operations on the hardware. They can be further classified into two categories:

a. Target Optimized Libraries:
These include cuDNN, MKL-DNN (oneDNN), and others. They provide highly tuned implementations of standard DNN operations such as convolution, pooling, normalization, and activation functions that fully exploit the parallelism of the target hardware: cuDNN targets NVIDIA GPUs, while MKL-DNN targets Intel CPUs.
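
To make this concrete, here is a minimal, hedged sketch (assuming a PyTorch setup) showing how a framework hands standard layers to these libraries and how the cuDNN autotuner can be switched on so it benchmarks and picks the fastest convolution kernels for fixed input shapes; the tiny model is purely illustrative.

```python
import torch
import torch.nn as nn

# PyTorch dispatches standard ops (convolution, pooling, normalization, ...)
# to cuDNN on NVIDIA GPUs and to oneDNN/MKL-DNN on CPUs.
# The cuDNN autotuner benchmarks candidate convolution algorithms and caches
# the fastest one for the observed input shapes.
torch.backends.cudnn.benchmark = True  # only has an effect when running on a GPU

model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=3), nn.ReLU()).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224, device=device))
print(out.shape)
```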

b. Deep Learning Compilers:
These include TVM, XLA, OpenVINO, and TensorRT, to name a few. Graph compilers optimize the DNN graph and generate optimized code for the target hardware, which accelerates both the training and the deployment of DL models. Each compiler has its own way of getting the task done: NVIDIA’s TensorRT is built on top of CUDA and optimizes inference by providing high throughput and low latency for deep learning inference applications, while TVM introduced a tensor expression language (a DSL) used to construct tensor operators, which are then converted into optimized kernels for the target hardware. These compilers work on an internal intermediate representation of the graph. All graph compilers have a high-level internal representation; where they differ is in the layers of transformation and lowering steps.
Software optimization methods are capable of increasing inference speed by 2-3x. Standard compiler techniques such as common subexpression elimination and constant folding can be applied as transformations to the high-level graph. While compiling the model, many high-level nodes can be simplified by breaking them down into more primitive linear algebra operations. This depends on the target architecture: the transformations made for a GPU will differ from those made for a CPU. For example, with ResNet, different compilers will choose different strategies based on factors such as the size of the filter weights and will require different memory layouts. Once the graph is lowered to the low-level internal representation, it becomes more specific about memory layout, and important optimizations can be made at that level. Finally, depending on the backend architecture, code is generated from the final low-level graph, yielding an optimized form of the model. Software optimization can accelerate inference considerably, though not always.
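
As one concrete, hedged example of this flow, the sketch below compiles an ONNX model with TVM’s Relay front end at opt_level=3, which enables graph-level passes such as constant folding and operator fusion. The model file name, input name, and shapes are assumptions for illustration, and the exact API differs slightly between TVM releases (for example, older versions use graph_runtime and .asnumpy()).

```python
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load a trained model exported to ONNX (hypothetical file and input name).
onnx_model = onnx.load("resnet50.onnx")
shape_dict = {"input": (1, 3, 224, 224)}

# Import the graph into Relay, TVM's high-level intermediate representation.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# opt_level=3 turns on graph transformations such as constant folding
# and operator fusion before lowering and code generation.
target = "llvm"  # generic CPU backend; use "cuda" for NVIDIA GPUs
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run the compiled module on the chosen device.
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
result = module.get_output(0).numpy()
```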

3. Algorithmic Level

This level refers to any changes made to the model architecture itself. Neural architectures contain several redundancies, and algorithmic methods attempt to accelerate inference by removing these redundancies. The two most commonly used methods are:

a. Pruning:
Pruning rests on the observation that most weights in DNNs contribute little, and this is often true. It means that most of the weights in a DNN can be removed with limited effect on the loss. Consider a fully connected network as an example: after pruning, some of the connections are removed and the resulting network is no longer fully connected.

Computer vision models can be pruned to up to 90% sparsity in some cases, such as ResNet-50 on ImageNet, and still reach the baseline accuracy of the whole model. NLP models can also be pruned up to 60% while preserving accuracy. How pruning is applied depends on the desired outcome, such as more speed or reduced storage.

Pruning requires significant decisions to be made: which connections to prune and how much to prune. There is also a lack of support and ease of use across DNN frameworks.

Pruning is an iterative process. There are two broad approaches to pruning a neural network: structured pruning and unstructured pruning. The difference between the two lies in how the weights are removed, either as individual weights or as groups of weights, and this affects both performance and the maximum achievable sparsity.

In unstructured pruning, individual weight connections are removed from the network by setting them to 0. This is like introducing multiplications by 0 into the network, which can be turned into no-ops at prediction time.
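
For illustration, here is a minimal sketch of unstructured magnitude pruning using PyTorch’s torch.nn.utils.prune utilities; the layer size and the 90% sparsity target are arbitrary example values.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small fully connected layer used purely for illustration.
layer = nn.Linear(256, 128)

# L1 unstructured pruning: zero out the 90% of individual weights
# with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.9)

# The mask is applied on the fly: "weight" is now weight_orig * weight_mask.
sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.1%}")

# Make the pruning permanent by removing the re-parametrization.
prune.remove(layer, "weight")
```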

Structured pruning removes groups of weight connections together, for example entire channels or filters. Every system can run a structurally pruned network faster, because this type of pruning changes the shape of the layers and weight matrices. The drawback is that it severely limits the maximum sparsity of the network, which in turn limits the gains in performance and memory. Moreover, once pruning is complete, the model needs to be re-calibrated or retrained to regain the baseline accuracy.
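
A corresponding hedged sketch of structured pruning with the same PyTorch utilities, removing half of the output channels of a convolution ranked by their L2 norm (layer sizes and pruning amount are again just example values):

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 64, kernel_size=3)

# Structured pruning: zero out 50% of entire output channels (dim=0),
# ranked by their L2 norm (n=2), instead of individual weights.
prune.ln_structured(conv, name="weight", amount=0.5, n=2, dim=0)
prune.remove(conv, "weight")

# The zeroed channels can then be physically dropped so that any backend
# sees smaller weight matrices; in practice the network is fine-tuned
# afterwards to recover the baseline accuracy.
```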

b. Network quantization

Network quantization is another form of algorithmic optimization. It is the process of reducing the precision of the weights by replacing floating-point values with lower-precision, more compact representations.

Let’s take matrix multiplication, a basic operation in machine learning models, as a simple example. Consider tensors A and B, with C as the resulting product. Models are usually trained in higher precision such as float32: the operands of each multiplication are float32, the product is float32, and the accumulation is float32 as well.

Quantization reduces the precision used for inference, for example moving from 32-bit floats to 8-bit integers and operating entirely on integers. Shifting from 32-bit float to 8-bit integer reduces the model size by 4x. Integer operations are typically faster to execute and consume less power, and they are a fairly common denominator across different hardware.
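
As a hedged, concrete example, PyTorch’s post-training dynamic quantization (one of several available workflows) stores the weights of selected layer types as int8 and quantizes activations on the fly; the toy model below and the temporary file used for the size comparison are illustrative only.

```python
import os
import torch
import torch.nn as nn

# A toy float32 model; real models would be much larger.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Post-training dynamic quantization: weights of the listed module types
# are stored as 8-bit integers; activations are quantized at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    """Rough on-disk size of a model's parameters, in megabytes."""
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"float32: {size_mb(model):.2f} MB, int8: {size_mb(quantized):.2f} MB")
```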

So how do we convert 32-bit floats to 8-bit integers? There are various ways to quantize, and it is still an active research area, but the most common approach is to take the minimum and maximum values in the tensor and map that range linearly onto the 8-bit integer range. In this scheme a 32-bit float is quantized to an 8-bit integer: the operands of each multiplication are 8-bit integers, their product is a 16-bit integer, and it is accumulated into a 32-bit integer.
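
Below is a hedged numpy sketch of this min/max (affine) scheme, using hypothetical quantize/dequantize helpers and a toy int8-style matrix multiplication that accumulates in int32; it is meant to show the arithmetic, not a production kernel.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Affine (min/max) quantization of a float32 tensor to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = np.round(qmin - x_min / scale).astype(np.int32)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

a = np.random.randn(64, 64).astype(np.float32)
b = np.random.randn(64, 64).astype(np.float32)
qa, scale_a, zp_a = quantize(a)
qb, scale_b, zp_b = quantize(b)
print("max quantization error:", np.abs(a - dequantize(qa, scale_a, zp_a)).max())

# In an integer matmul kernel the 8-bit operands are multiplied into wider
# integers and accumulated in int32 to avoid overflow, then rescaled to float.
acc = (qa.astype(np.int32) - zp_a) @ (qb.astype(np.int32) - zp_b)
c_approx = acc * scale_a * scale_b
print("max matmul error:", np.abs(a @ b - c_approx).max())
```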

What are the implications of all these optimizations? Both the static values (parameters and weights) and the dynamic values (activations) are changed. It seems fairly easy to quantize the weights and get faster inference, but the goals conflict: lower precision makes the model more efficient, yet it can hurt accuracy. There are many trade-offs to consider, and many transformations can be applied to different layers of the model. Quantizing a model should not result in catastrophic errors, so care needs to be taken in selecting which layers to quantize and how. In short, quantization is the process of transforming an ML program into an approximated representation that uses the available lower-precision operations.

This article covered a few post-training optimization techniques. They help deploy large neural networks on edge devices and achieve better performance.

EDGENeural.AI is a deep-tech startup with the vision to decentralize AI and make every device faster, smarter and more secure with a unified, cloud-neutral, hardware-agnostic platform that accelerates and optimizes EDGE AI application development.
Featured as “30 Startups to Watch” by Inc42 Media | Top Indian AI Startups to Watch 2021 by Open Data Science | Featured in Hindustan Times
Our supporters include
NVIDIA Inception Program | NXP Semiconductors & FabCI | Microsoft for Startups | NASSCOM DeepTech Club 2.0 | StartUp India | MSME

