Deep Dive Into Multi-Bit Weighted Quantization for CNNs

Matan Weksler
Published in The Startup · Sep 18, 2020

Joint work with Ido Glanz

Reducing neural network complexity and memory consumption has become a broad field of research, aiming both to run complex deep models on edge devices and to enable faster, and potentially more accurate, inference on various new tasks.

As part of this effort, quantization has become a common and effective tool, yet it often requires access to the complete original dataset and re-training of the network, which is not always feasible. Furthermore, a quantization scheme that suits one task may perform poorly on another.

Below we investigate the quantization of neural networks’ weight matrices under a weighted quantization optimization scheme, driven by the assumption that not all weights were created equal: capturing some more accurately than others during quantization could yield compressed models that are more accurate than those produced by vanilla quantization schemes.

Quantization

First, let’s briefly discuss what quantization is all about in the context of model compression.

In recent years, many papers have addressed model compression via quantization, and many of them achieve state-of-the-art results in terms of compression rate, computational complexity, and retained accuracy.

But what is quantization?

The process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers)

3-bit resolution with eight levels of quantization, compared with an analog sine wave. Adapted from

In neural network compression, quantization is used to reduce the memory consumption of the model weights, and sometimes also to simplify the mathematical operations and thus reduce computational complexity. A classic example is running models on edge devices, where limited memory, compute, and power must be taken into account.

In addition, quantization has become popular in combination with hardware-specific deep learning tool-chains such as NVIDIA TensorRT, Xilinx’s stack, and many (many) more, which optimize the matrix multiplications in low-bit form.

Quantization techniques: Alternating Multi-bit Quantization

As mentioned above, quantization has become a broad field of research in recent years and many quantization techniques have been suggested. Here we focus on the work of Chen Xu et al., “Alternating Multi-bit Quantization for Recurrent Neural Networks” [1], which formulates multi-bit quantization as an optimization problem: the binary codes (-1/+1 matrices) and the coefficients are treated separately, and the process alternates between computing the binary codes and computing the coefficients (freezing each in turn) to derive a high-accuracy quantization for the given number of bits.

This type of quantization decomposes each weight matrix into coefficient vectors (stored in 16 or 32 bits) and binary (-1/+1) matrices, which are then multiplied and accumulated to form an approximation of the original weight matrix (i.e., under a 2-bit constraint we would have 2 coefficients and 2 binary matrices; each binary matrix is multiplied by its coefficient and the results are summed). See Figure 1 for a clearer visualization.

Figure 1: Illustration of quantized matrix-vector multiplication

Quantization, as implemented by Chen Xu et al., works as follows: given a weight matrix W and a desired number of bits, the matrix is decomposed so that it is a linear combination of coefficients (kept in high resolution, e.g. 16- or 32-bit floats) and binary -1/+1 matrices of the original shape of W.
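Writing k for the number of bits, the decomposition and the objective it aims to minimize can be written as follows (in the paper this is applied per row/output filter, so each coefficient effectively becomes a vector rather than a scalar):

W \approx \sum_{i=1}^{k} \alpha_i B_i, \qquad B_i \in \{-1,+1\}^{m \times n},\ \alpha_i \in \mathbb{R}, \qquad \min_{\{\alpha_i\},\{B_i\}} \Big\lVert W - \sum_{i=1}^{k} \alpha_i B_i \Big\rVert_F^2 .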

To do so, we first initialize the binary matrices and coefficients with a greedy approximation (as suggested by Guo et al., 2017 [2]). Roughly, we iterate over the binary matrices, initializing each one as the sign of the residual of the weight it should express (the residual starts as the original weight and at each iteration is reduced by that step’s coefficient multiplied by the sign), and each coefficient as the mean absolute value of the residual, as follows:
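The greedy step from [2] can be written as (our notation, with n the number of elements in the weight vector w):

r_0 = w, \qquad B_i = \operatorname{sign}(r_{i-1}), \qquad \alpha_i = \tfrac{1}{n}\lVert r_{i-1} \rVert_1, \qquad r_i = r_{i-1} - \alpha_i B_i, \qquad i = 1, \dots, k .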

The next step refines the greedy approximation using the equation below, then alternates back to recalculating the binary codes using binary search trees (or a closed-form solution when using 2 or fewer bits).
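In the refinement step the binary codes are kept fixed and all coefficients are re-solved jointly by least squares; stacking the (flattened) binary codes into a matrix B = [B_1, …, B_k] and writing w for the flattened weights, the closed-form solution is:

[\alpha_1, \dots, \alpha_k]^{\top} = \bigl(B^{\top} B\bigr)^{-1} B^{\top} w .

Below is a minimal PyTorch sketch of these two steps (greedy initialization plus coefficient refinement). It is our own illustration rather than the authors’ code, and it omits the alternating update of the binary codes:

import torch

def greedy_init(w, n_bits):
    # Greedy multi-bit initialization (Guo et al., 2017): at each step take the
    # sign of the residual as the binary code and the mean absolute residual as
    # its coefficient.
    r = w.flatten().clone()
    codes, coeffs = [], []
    for _ in range(n_bits):
        b = torch.sign(r)
        b[b == 0] = 1.0                      # avoid zeros in the binary code
        alpha = r.abs().mean()
        codes.append(b)
        coeffs.append(alpha)
        r = r - alpha * b
    return torch.stack(codes, dim=1), torch.stack(coeffs)   # [n, k], [k]

def refine_coeffs(w, B):
    # Refinement: with the binary codes fixed, solve the least-squares problem
    # min_alpha ||w - B @ alpha||^2 in closed form.
    return torch.linalg.lstsq(B, w.flatten().unsqueeze(1)).solution.squeeze(1)

# Usage sketch on a random stand-in kernel.
w = torch.randn(64, 3, 3)
B, alpha = greedy_init(w, n_bits=3)
alpha = refine_coeffs(w, B)
w_hat = (B @ alpha).reshape(w.shape)         # multi-bit approximation of w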

Compression estimation

For the sake of brevity, recall that an n-bit quantization means, per output filter, n coefficient vectors (alpha) of the length of the input-filter dimension (64, for example) plus n binary matrices of the original shape (e.g. 64x3x3), the B tensor. For example, for a 3-bit quantization, a CNN kernel of [64, 64, 3, 3] float32 elements would decompose into 3x[64, 64, 3, 3] binary elements plus 3x[64, 64, 1] float32/16 elements. Comparing memory, we get about 42% of the original size.
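As a quick sanity check on that number (assuming the binary codes are packed 1 bit per element and the coefficients stay in float32):

# Memory of a [64, 64, 3, 3] float32 kernel vs. its 3-bit decomposition.
orig_bits = 64 * 64 * 3 * 3 * 32               # 1,179,648 bits
binary_bits = 3 * (64 * 64 * 3 * 3) * 1        # 110,592 bits (1 bit per element)
coeff_bits = 3 * (64 * 64) * 32                # 393,216 bits (float32 coefficients)

ratio = (binary_bits + coeff_bits) / orig_bits
print(f"{ratio:.1%}")                          # ~42.7% of the original size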

Memory comparison for different kernels and varying bit-width

Weighted-Refined greedy approximation

Under the theoretical assumption that not all weights were created equal (in terms of importance), we first need a way to push the quantization process to put more “effort” into the elements of the filter that have more importance in the context of the overall layer activation output, i.e., to acknowledge that each weight has a different importance, so we care less if some are badly captured by the quantization algorithm.

To do so, in addition to the quantization minimization process described before, we incorporate an importance weighting (or highlighting) of the soon-to-be-compressed weight matrix, pushing towards a quantized representation that is biased by the given weights and therefore captures the “important” weights better than others (we will later elaborate on what “important” means).

Let’s define U as a heat-map (importance) matrix of the same shape as W. The quantization objective then becomes a weighted regression over the elements of W.
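One plausible way to write this weighted objective (our notation, with ⊙ denoting element-wise multiplication):

\min_{\{\alpha_i\},\{B_i\}} \Big\lVert\, U \odot \Big( W - \sum_{i=1}^{k} \alpha_i B_i \Big) \Big\rVert_F^2 .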

Thus, if we feed the above algorithm a weighting matrix of the same shape as the original matrix (i.e., a heat-map-like matrix) pointing to the elements it should quantize more accurately (e.g. because they play a greater role in the layer’s activation output), we would theoretically obtain a quantized representation that serves that purpose better.

But who are the important weights?

This is probably the most interesting question: how can we spot the weights that matter more to the model in the sense of its task (e.g. classifying objects or translating text)?

Let’s start with a recap of Han et al.’s [3] work on pruning and quantization. In the pruning part, they observed that if we create a histogram of the weight values, most of the weights lie around zero, and if we zero out some percentage of them, the network’s accuracy barely decreases; those weights thus have less impact on the overall network.

In the quantization case, since the quantization minimization scheme tries to minimize the distance between all the weights before and after quantization (i.e., keep the matrix as close as possible to the original), and since most of the weights lie around zero, the quantization scheme focuses on this region even though it contributes little to the network’s performance.
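A toy illustration of that observation (using random stand-in weights rather than a trained model):

import torch

# Stand-in for a trained conv kernel: values concentrated around zero.
w = torch.randn(64, 64, 3, 3) * 0.05

# Histogram of the weight values: the mass piles up near zero.
hist = torch.histc(w, bins=50, min=-0.2, max=0.2)

# Zero out the 50% smallest-magnitude weights ("pruning"); the layer output
# typically changes very little, hinting that those weights matter less.
k = w.numel() // 2
threshold = w.abs().flatten().kthvalue(k).values
pruned = torch.where(w.abs() > threshold, w, torch.zeros_like(w))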

L1-Norm Weighting

Inspired by the pruning observation above, we try a method of weighting the kernels using an L1 (magnitude) norm on the weight matrix and feeding its output to the quantization module as a weighting term, as sketched below. Under this scheme, the underlying assumption is that larger-valued weights have more importance, so quantizing these values accurately should have higher priority.
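A minimal sketch of how such an L1 heat-map could be plugged into the greedy step (our own illustration; the full pipeline also applies the refinement and alternating passes described earlier):

import torch

def l1_heatmap(w):
    # L1 (magnitude) importance heat-map: larger |w| -> more important.
    # Normalized so the weighting acts as a relative, not absolute, scale.
    u = w.abs()
    return u / u.mean()

def weighted_greedy(w, u, n_bits):
    # Weighted variant of the greedy step: each coefficient solves a weighted
    # least-squares fit, so high-U elements pull the approximation toward them.
    r, u = w.flatten().clone(), u.flatten()
    codes, coeffs = [], []
    for _ in range(n_bits):
        b = torch.sign(r)
        b[b == 0] = 1.0                        # avoid zeros in the binary code
        alpha = (u * r * b).sum() / u.sum()    # weighted closed-form coefficient
        codes.append(b)
        coeffs.append(alpha)
        r = r - alpha * b
    return torch.stack(codes, dim=1), torch.stack(coeffs)

w = torch.randn(64, 64, 3, 3)                  # stand-in for a conv kernel
B, alpha = weighted_greedy(w, l1_heatmap(w), n_bits=3)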

Self-attention Weighting

Another weighting approach, inspired by the rise of attention models (and more specifically self-attention schemes), implements a self-attention module that processes each weight matrix to generate “heat-maps” of the relative importance of the different kernels; these heat-maps then serve as regression weights for the quantization process described above and are trained against the activation outputs of the original network.

But we will cover this optimization method and others in the next article.

Experiments

Like everything in the deep learning field, theory is fine, but no one will believe you without experiments. :)

We will evaluate the weighted quantization scheme on three types of computer vision tasks:

Image classification

To evaluate the quantization scheme for image classification tasks, we trained a ResNet18 and a ResNet50 on the CIFAR10 dataset, reaching a top-1 accuracy of 89% (with more time and a hyperparameter search it is probably possible to obtain better results).

Now that we have trained models, we can apply the multi-bit weighted quantization and compare it to vanilla quantization (i.e., without importance weights) to check whether it makes any difference.

In the tables below we tested our models with 2–4-bit quantization on all the CNN layers.

Resnet18 classification accuracy
Resnet50 classification accuracy

As can be seen, we obtained an increase in performance when using the L1 weighting method.

Super-resolution

Next we evaluate different tasks, namely super-resolution and style transfer. We tested them on the COCO dataset (a large-scale dataset usually used for object detection, segmentation, and captioning) and on the WIKI faces dataset, a publicly available dataset of face images.

To evaluate super-resolution and style transfer networks we cannot measure accuracy (there is no such notion in this domain), so we use the MSE score:

MSE loss for super-resolution evaluation
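For reference, this is the standard per-pixel MSE between a reference image y and the network output ŷ over N pixels:

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat{y}_i \bigr)^2 .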

But evaluating HR pictures with an MSE loss alone is not enough: frames that are still fairly blurry can achieve a small error yet look noticeably bad to the human eye. So we need an additional metric, and together the two give a better evaluation.

The Structural Similarity Index Measure (SSIM) is a perception-oriented comparison used for measuring the similarity between two images.

Structural Similarity Index:

SSIM mathematical definition
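In its standard form, with μ the local means, σ² the variances, σ_xy the covariance, and c₁, c₂ small stabilizing constants:

\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} .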

In the table below we tested our models using the two metrics above with 3-bit quantization on all the CNN layers.

MSE and SSIM for the original network, vanilla quantization, and L1 weighted quantization

Visual results

Visual results for super-resolution L1 weighted quantization

Style transfer

Similarly to super-resolution, we also evaluate the style transfer model.

Visual results for style transfer L1 weighted quantization

As can be seen, for the SR and ST tasks as well, weighted quantization improves the MSE and SSIM relative to vanilla quantization, without any noticeable loss in quality compared to the original model.

Conclusions

The task of model compression and weight quantization is large and challenging, more relevant than ever with the move to edge computing and the use of neural networks in more and more applications, and as such it is being researched by many and is the subject of many recent papers. The above is just the tip of the iceberg of what we believe could be further researched and leveraged to obtain better post-training compression and enable the use of neural networks in ever more applications.

References

[1] C. Xu, J. Yao, Z. Lin, W. Ou, Y. Cao, Z. Wang, and H. Zha, “Alternating multi-bit quantization for recurrent neural networks,” arXiv preprint arXiv:1802.00150, 2018.

[2] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network sketching: Exploiting binary structure in deep cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5955–5963, 2017.

[3] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
