Neural Network Compression Using Quantization

Published in ShareChat TechByte · Dec 18, 2021

Written by Akash Manna, Vikram Gupta, Debdoot Mukherjee

Every day, ShareChat and Moj receive millions of User Generated Content (UGC) pieces. To derive insights from these content pieces and recommend relevant and interesting content to our users, we require accurate, fast and highly scalable machine learning models at all stages of the content pipeline.

In the last decade, deep learning has been instrumental in solving numerous problems that were previously deemed unsolvable, in some cases matching or even surpassing human-level accuracy. It is also widely accepted and empirically established that deeper networks achieve higher accuracy, as shown in Figure 1. This is the reason for the push towards bigger and deeper networks.

Fig 1: Top-1 accuracy of models on ImageNet dataset, as a function of the model complexity represented using GFLOPS. source

However, the pursuit of human-level accuracy using deeper networks comes with its own set of challenges, including:

  • Higher inference times
  • Higher computation requirements
  • Longer training schedules

Long training schedules on large compute budgets can be acceptable, since training is usually done once or at fixed intervals, but deploying these deep models in a high-throughput scenario becomes extremely difficult and expensive.

At ShareChat, getting high inference throughput with minimal latency is not optional, but a necessity!

In such scenarios, model compression techniques become crucial, as they allow us to reduce the footprint of these huge models without compromising accuracy. In this introductory blog, we discuss different techniques that can be used to optimize heavy deep neural network models.

Model Compression Approaches

The following approaches are primarily used in modern-day deep learning for model compression:

  1. Quantization based approaches
    Quantization involves using lower-precision data types for storing model weights and performing computation (e.g., 8-bit integers instead of 32-bit floating point).
  2. Model Pruning
    Model pruning involves removing connections between neurons, or entire neurons, that contribute little to the model's performance. This works because deep neural networks are inherently sparse, as described in the paper The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by Frankle et al.
  3. Knowledge Distillation
    In this approach, a small model is trained to mimic the soft predictions of a larger and more accurate pre-trained model. Quoting from our previous blog, where we discuss how we use Knowledge Distillation at ShareChat in our content moderation pipeline:

Soft-labels allow the student model to generalize well as soft-labels represent a higher level of abstraction and understanding of similarity across different categories instead of peaky one-hot-encoded representation.

In the following sections, we elaborate on Model Quantization, which is the most widely used form of model compression.

What is Quantization?

By definition, quantization is the process of mapping values from a large set to a smaller set, with the objective of losing as little information as possible in the transformation. This process is widely used in various domains, including signal processing, data compression and signal transformations, to name a few.

Fig 2: Quantization being applied to continuous analog signals to convert them to discrete digital signals by sampling and rounding to nearest representable quantized value. source

Floating Point Numeral Representation

The IEEE 754 standard, created in 1985, is the technical standard for the binary representation of floating-point values in modern computers. IEEE 754 defines several precision levels that can be used to represent a floating-point numeral, ranging from 16-bit (half precision) to 256-bit (octuple precision). The representation of a floating-point numeral includes three components: the sign bit, the significand (fraction) and the exponent.

These values in binary form are concatenated to represent the numeral in memory. A representation for a 32-bit precision floating point numeral is described in the below figure:

Fig 3: Representation of 32-bit Floating point numeral

With a discrete integer representation using n bits, we can represent at most 2^n distinct numbers. A floating-point representation also has only 2^n distinct values, yet it covers a much wider range of the number line. This is due to the exponent component, which distributes those 2^n values non-uniformly: representable numbers are densely packed towards the centre of the number line (near zero) and become sparser towards the extremes, as described in the following figure:

Fig 4: Distribution of floating point representation. source
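To see this concretely, we can inspect the gap between a value and the next representable FP32 number at different magnitudes. The snippet below is a minimal NumPy sketch:

```python
import numpy as np

# np.spacing(x) returns the distance from x to the next representable number
# of the same dtype. For FP32, this gap grows with the magnitude of x:
# values are densely packed near zero and sparse towards the extremes.
for x in [1e-3, 1.0, 1e3, 1e30]:
    print(f"spacing at {x:8.0e}: {np.spacing(np.float32(x)):.3e}")
```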

It should also be noted that floating-point formats of different precision allocate different numbers of bits to the exponent and the significand, so the ranges they can represent also vary. The following comparison shows the ranges and smallest positive values representable by the FP32, FP16 and INT8 data types:
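  • FP32: range ±3.4 × 10³⁸, smallest positive normal value ≈ 1.2 × 10⁻³⁸
  • FP16: range ±65,504, smallest positive normal value ≈ 6.1 × 10⁻⁵
  • INT8: range −128 to 127, smallest representable step 1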

The Quantization Operation
The difference in representable range affects how a fixed value is represented in different data types. Take the numerical value of 𝞹 as an example: the stored value changes with the data type used to represent it.
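For instance (a minimal NumPy illustration), storing 𝞹 in each of these data types gives:

```python
import numpy as np

pi = 3.141592653589793

print(f"FP32: {np.float32(pi):.10f}")   # 3.1415927410  (error ~ 1e-7)
print(f"FP16: {np.float16(pi):.10f}")   # 3.1406250000  (error ~ 1e-3)
print(f"INT8: {np.int8(pi)}")           # 3             (error ~ 0.14)
```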

As these values show, directly casting a numeral from a higher-precision data type to a lower-precision one can introduce errors in the representation of the values. These errors are called quantization errors. Such errors in a deep neural network can be catastrophic, and hence direct typecasting of models to a lower precision is not a trivial task. We discuss approaches to minimize such errors in this section.

We can represent at most 2^n distinct values using n bits. Hence, mapping the entire range of a higher-precision data type onto a lower-precision one will inherently introduce quantization errors, by the pigeonhole principle. But what if we know the distribution of the values to be converted a priori? Can that knowledge be used to minimize these errors?

Fig 5: Representative mapping from FP32 to INT8. source

FP32 can represent a range between -3.4 * 10³⁸ and 3.4 * 10³⁸. However, the weights and activations of most deep networks do not vary anywhere near this widely. If we know or can estimate the input ranges beforehand, we can map the actual range of our input data (instead of the entire FP32 range) onto the full range of the lower-precision data type, resulting in a much better-optimized mapping.

Let us assume that the range of the input data is known a priori, and walk through an example of converting floating-point (FP32) values to integers (INT8).

Fig 6: Mapping values from FP32 to INT8
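The mapping in Fig. 6 boils down to an affine transformation defined by a scale and a zero-point. The snippet below is a minimal NumPy sketch of this idea (the function names are ours, not from any particular framework):

```python
import numpy as np

def quantize(x, x_min, x_max, q_min=-128, q_max=127):
    # Map floats assumed to lie in [x_min, x_max] onto the INT8 grid [q_min, q_max].
    scale = (x_max - x_min) / (q_max - q_min)       # float step per integer step
    zero_point = int(round(q_min - x_min / scale))  # integer that represents real 0.0
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max)
    return q.astype(np.int8), scale, zero_point

def dequantize(q, scale, zero_point):
    # Approximate recovery of the original float values.
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-0.8, -0.1, 0.0, 0.3, 0.75], dtype=np.float32)
q, scale, zp = quantize(x, x_min=-1.0, x_max=1.0)
print(q)                          # [-102, -13, 0, 38, 96]
print(dequantize(q, scale, zp))   # close to x, up to quantization error
```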

Quantization in Deep Learning

Let us now understand the application of quantization in the context of deep learning. In deep neural networks, the model parameters are stored as floating-point values, and a forward pass through the model involves a series of floating-point operations. In deep learning, quantization refers to storing and computing both weights and activations in lower-precision data types, as shown in the following figure.

Fig 7: The weights of the model are passed through the quantization layer. A similar process is followed for the activations also. source

Are biases not quantized? In practice, biases are generally quantized from float to INT32 precision rather than all the way down to INT8, since there are far fewer bias parameters than weights; the larger INT32 representation therefore adds negligible overhead to a deep neural network model.

Quantizing Weight Vectors

In practice, this is a straightforward step, since the model weights are fully known and can be used directly as the prior when quantizing a given layer. Fortunately, it has been observed that the values of neural network weights usually lie in a small range centred close to 0. The following figure shows the distribution of weights of some convolutional and fully connected layers from the AlexNet and MobileNet v1 models. This limited range makes the mapping to a lower-precision data type less prone to quantization errors.

Fig 8.1 (left) and Fig 8.2 (right)
Fig. 8.1: Distribution for weights in Alexnet model Source
Fig. 8.2: Distribution for weights in MobileNet v1 model source

Quantizing Activations
Unlike the weights of a model, the activations of a neural network layer vary with the input data fed to the model, so estimating the range of activations requires a representative set of input samples. Quantizing activations is therefore a data-dependent process and requires additional data to calibrate the outputs of each layer.
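A typical way to do this is a calibration pass: run a few representative batches through the model and record the minimum and maximum output of each layer. The sketch below illustrates this in PyTorch with a placeholder model and random calibration data (real frameworks provide observer modules that do this for you):

```python
import torch
import torch.nn as nn

# Placeholder model and calibration batches, purely for illustration.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
calibration_batches = [torch.randn(32, 128) for _ in range(10)]

act_ranges = {}  # layer name -> (min, max) observed over the calibration data

def make_hook(name):
    def hook(module, inputs, output):
        lo, hi = output.min().item(), output.max().item()
        old_lo, old_hi = act_ranges.get(name, (lo, hi))
        act_ranges[name] = (min(lo, old_lo), max(hi, old_hi))
    return hook

handles = [m.register_forward_hook(make_hook(name))
           for name, m in model.named_modules()
           if isinstance(m, (nn.Linear, nn.ReLU))]

with torch.no_grad():
    for batch in calibration_batches:
        model(batch)

for h in handles:
    h.remove()

print(act_ranges)  # these per-layer ranges are used to pick scale and zero-point
```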

There are different ways to identify the scale factor and zero point for the model weights and activations, which we discuss in the next section.

Types of Quantization

Modern deep learning frameworks like PyTorch and TensorFlow support different types of quantization. Broadly, quantization algorithms can be classified into two major buckets:

  • Post-Training Quantization: performed after a model is fully trained
  • Quantization-Aware Training: training is done with quantization constraints

1) Post-Training Quantization
In this approach, quantization is performed after a model has been fully trained. Since the weights are fixed after training, the mapping for the weights is straightforward to compute. However, computing the range of activations post training is challenging, since the activation values of a layer vary with the input tensor passed in. There are two approaches to handle this:

a) Dynamic Post-Training Quantization:
This involves estimating the activation ranges on the fly during inference, based on the data fed to the model at runtime. This approach is the easiest to implement, since no additional calibration step is required. Because no additional data is needed, it is most suitable when generating a representative data distribution is difficult, e.g. for sequence-to-sequence models. However, estimating the range of activations at runtime (typically with exponential moving averages) adds to the model's latency.
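In PyTorch, for instance, dynamic post-training quantization of a model's linear layers is close to a one-liner. The sketch below uses a placeholder model; the set of layer types and the target dtype shown here are typical choices, not the only ones:

```python
import torch
import torch.nn as nn

# Placeholder for a trained FP32 model.
fp32_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Weights of nn.Linear layers are quantized to INT8 ahead of time;
# activation ranges are estimated dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    fp32_model, {nn.Linear}, dtype=torch.qint8
)

print(quantized_model)
```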

b) Static Post-Training Quantization
In this approach, an additional calibration step is involved: a representative dataset is passed through the model to estimate the range of activations. This estimation happens in full precision to minimize errors, after which the activations are scaled down to the lower-precision data type. Since no extra computation is needed at inference time, this approach produces the fastest (lowest-latency) models.
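A sketch of this flow using PyTorch's eager-mode static quantization API is shown below; the model and the random calibration data are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    torch.quantization.QuantStub(),    # marks the FP32 -> INT8 boundary
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
    torch.quantization.DeQuantStub(),  # marks the INT8 -> FP32 boundary
).eval()

model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # x86 backend
torch.quantization.prepare(model, inplace=True)   # insert observers

# Calibration: run representative batches in full precision to record ranges.
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(32, 128))

torch.quantization.convert(model, inplace=True)   # replace modules with INT8 versions
print(model)
```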

The primary advantage of post-training quantization is that 8-bit or 16-bit quantization can be applied to any existing pre-trained model without requiring the large amount of resources needed to retrain on the entire dataset. Modern deep learning frameworks like PyTorch and TensorFlow provide near one-line implementations of these approaches. This strategy, however, results in some (usually minor) loss of accuracy due to quantization errors. These errors can be further mitigated during training by using a smart trick, which we discuss in the next section.

2) Quantization-Aware Training

Quantization-Aware Training (QAT) addresses the accuracy loss caused by quantization errors during model training. In the forward pass, QAT replicates quantized behaviour for the weight and activation computations, while the loss computation and the backward propagation of the loss remain unchanged and are done in higher precision. This idea was suggested by Jacob et al.

Fig 9: QAT training flow diagram and latency vs accuracy tradeoff for quantized model Source

In QAT, all the model weights and activations are “fake quantized” during the forward pass: float values are rounded to mimic lower-precision (usually INT8) values, but all other computations are still done with floating-point numbers. Since the weight adjustments during training are made while “aware” of the fact that the model will ultimately be quantized, this method usually yields higher accuracy than the other methods, and the trained quantized models are nearly lossless compared to their full-precision counterparts.
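A sketch of the QAT flow in PyTorch is shown below; the model, random data and toy training loop are placeholders purely for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    torch.quantization.QuantStub(),
    nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10),
    torch.quantization.DeQuantStub(),
)

model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)   # insert fake-quant modules

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):   # fine-tune with fake quantization in the forward pass
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()    # gradients and weight updates stay in full precision
    optimizer.step()

model.eval()
quantized_model = torch.quantization.convert(model)   # real INT8 model for deployment
```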

Automatic Mixed Precision Quantization
Even with QAT, there are cases where the entire input range of some layers cannot fit into lower-precision quantized buckets without compromising accuracy. In such cases, it is beneficial to keep those layers in higher precision while quantizing the remaining layers wherever possible. This has recently been addressed to a great extent with the introduction of automatic mixed precision training, which determines the precision of individual layers during training based on their activation ranges.
Automatic mixed precision training has virtually no impact on the accuracy of models.
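For reference, here is a minimal sketch of an automatic mixed precision training loop in PyTorch (the FP16/FP32 flavour of mixed precision); it assumes a CUDA device and uses a placeholder model and random data:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()   # scales the loss to avoid FP16 gradient underflow

for _ in range(100):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # ops run in FP16 where safe, FP32 otherwise
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```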

Summary of Types of Quantization
The following points summarize the above discussion in terms of dataset requirements and the tradeoffs involved:
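  • Dynamic post-training quantization: needs no additional data; easiest to apply; activation ranges are estimated at runtime, which adds some latency.
  • Static post-training quantization: needs a small calibration dataset; lowest inference latency; some (usually minor) accuracy loss.
  • Quantization-aware training: needs the training data and a training or fine-tuning run; most expensive to apply, but near-lossless accuracy.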

The discussion can be summarised in the form of a flowchart below:

Fig. 10: Flowchart summary for quantizing models Source

Real Model Benchmarks
In this section, we look at the effect of these quantization methods on various real models. As expected, for all the models, Quantization-Aware Training (QAT) performs better in terms of accuracy and latency than Post-Training Quantization. Most of the models see a small drop in accuracy due to quantization, but the latency improvements typically outweigh this small drop in performance for practical purposes.

1. Benchmarks for CNN Models

Post Training Quantization vs QAT performance for CNN based models on Imagenet data source

2. Quantized BERT Model

DQ (dynamic quantization) vs QAT accuracy comparison for BERT source

3. Speed-up using various types of Quantization

CPU inference speed-up and accuracy change for various techniques of quantization. source

Conclusion

In this blog, we discussed various quantization approaches that can be used to compress deep neural networks with minimal impact on the accuracy of the models. At ShareChat, we use deep neural networks across a wide spectrum of tasks, including recommender systems, computer vision, NLP and speech recognition. Inference from these models needs to scale to millions of UGC pieces per day, for users numbering in the hundreds of millions. We regularly employ various techniques, including quantization, to optimize our models. In upcoming blogs, we will discuss how we have been able to reduce our latencies and computation requirements by orders of magnitude while maintaining similar accuracy.

Stay tuned!!

References

1. A White Paper on Neural Network Quantization by Nagel et al.

2. Neural Network Quantization by Lei Mao

3. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference by Jacob et al.

Cover illustration by Ritesh Waingankar
