AI Technology

4 Types of AI Compression Methods You Should Know

Semin Cheon
SqueezeBits Team Blog
7 min read · Mar 21, 2024


Our last post discussed why compressing AI models is both advantageous and indispensable. In this post, we dig deeper into AI compression, walking through four types of existing compression methods aimed at building compact and efficient neural networks.

Quantization

source: Nvidia Technical Blog (link)

Quantization converts numerical data from a high-precision representation into a simpler, lighter one. For instance, a 32-bit floating-point (FP32) value may be converted to an 8-bit integer (INT8). In many cases, a model is trained in FP32, 16-bit floating point (FP16), or 16-bit brain floating point (BF16) and then quantized to a lower-precision format for inference. By lowering the precision of weights (and optionally activations), quantization reduces memory usage and speeds up matrix multiplications. While quantization has proven to make models faster and more power-efficient, it comes with the challenge of accuracy degradation, and the target hardware must support lower-precision computation for the gains to materialize.
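
To make the idea concrete, here is a minimal sketch (our own helper functions, not any particular framework's API) of affine INT8 quantization: an FP32 tensor is mapped onto 8-bit integers with a scale and zero-point, then mapped back, and the difference shows the rounding error that quantization introduces.

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Affine (asymmetric) quantization of an FP32 tensor to INT8."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)        # real-value step per integer step
    zero_point = qmin - torch.round(x.min() / scale)   # integer that represents real 0.0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map INT8 values back to approximate FP32 values."""
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(4, 4)                  # pretend these are FP32 weights
q, s, z = quantize_int8(w)
w_hat = dequantize_int8(q, s, z)
print((w - w_hat).abs().max())         # rounding error introduced by INT8 storage
```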

Despite these limitations, quantization has been pushed aggressively to the point where 2-bit (ternary weights) and even 1-bit (binary weights) representations are achievable. Early work on Binarized Neural Networks (Courbariaux, 2016) reported near state-of-the-art results on MNIST, CIFAR-10, and SVHN. More recent developments apply quantization to Large Language Models, as in BitNet and GPTQ.
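
As a rough illustration of how extreme this can get, the sketch below binarizes a weight tensor in the BinaryConnect/XNOR-Net style, replacing each weight with its sign scaled by the mean absolute value; this is a simplified stand-in, not the exact procedure used in BitNet or GPTQ.

```python
import torch

def binarize_weights(w: torch.Tensor) -> torch.Tensor:
    """Replace each weight with +alpha or -alpha, where alpha = mean(|w|).

    During binary-aware training, the full-precision weights are kept and
    gradients flow through a straight-through estimator; only the binarized
    copy is used in the forward pass.
    """
    alpha = w.abs().mean()             # per-tensor scaling factor
    return alpha * torch.sign(w)       # 1 bit of information per weight (+1 / -1)

w = torch.randn(64, 64)
w_bin = binarize_weights(w)
print(torch.unique(torch.sign(w_bin)))  # only two distinct values remain
```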

Quantization can also be classified by when it is applied: quantization-aware training (QAT) and post-training quantization (PTQ). PTQ quantizes a model after training has finished, whereas QAT simulates quantization during training: fake-quantization nodes are inserted into a pre-trained model, which is then fine-tuned for several iterations. Although QAT requires this extra training, it has been shown to preserve accuracy better than PTQ.
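
For a concrete PTQ example, the sketch below applies PyTorch's dynamic post-training quantization to the Linear layers of a placeholder model; no retraining is involved. QAT would instead insert fake-quantization observers into the model and fine-tune it before converting.

```python
import torch
import torch.nn as nn

# A placeholder FP32 model standing in for a real pre-trained network.
model_fp32 = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Post-training (dynamic) quantization: Linear weights are stored as INT8
# and activations are quantized on the fly at inference time.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(model_int8(x).shape)  # same interface, smaller weights, faster CPU matmuls
```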

Pruning

source: TensorFlow Blog (link)

Deep neural network models are ‘superfluous’: they are heavily over-parameterized and carry many redundant weights, which makes them computationally expensive and slow. Just as unwanted tree branches are cut back for better plant growth, neural networks can be pruned for compression. Parameters found to contribute little or nothing during training can be removed afterward, leaving only the most informative connections, and the network is then retrained (Han, 2015). The targets of pruning vary, including weights, neurons, layers, and filters, chosen because they are less salient (of low importance and sensitivity) or of small magnitude. VGG-16, for example, has around 138 million parameters and occupies over 500MB of storage in FP32. The pruning method of Han (2015) reduced this parameter count by nearly 13 times, to 10.3 million. By discarding inessential parameters and thereby reducing computation, a network can be compressed until the inference of a large model fits into a resource-constrained computing environment.
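
As a small illustration, the sketch below uses PyTorch's pruning utilities to zero out the 90% smallest-magnitude weights of a single Linear layer; in the pipeline described by Han (2015), pruning like this is followed by retraining the surviving connections.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(1024, 1024)          # stand-in for one layer of a large network

# Unstructured magnitude pruning: zero out the 90% of weights with the
# smallest absolute value (L1 criterion), keeping only the most salient ones.
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = (layer.weight == 0).float().mean().item()
print(f"sparsity: {sparsity:.2%}")     # ~90% of the weights are now zero

# Make the pruning permanent (removes the mask re-parameterization);
# retraining the remaining weights would normally follow.
prune.remove(layer, "weight")
```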

Pruning induces a sparser network (Cai, 2022), but recent studies have largely overcome the drawbacks of these less dense architectures, showing only ‘marginal loss in accuracy.’ Researchers have even found large-sparse models to ‘consistently outperform small-dense models’ (Zhu, 2017). Note, however, that realizing compression and acceleration through pruning may require specially designed hardware and software to handle the sparsity (Han, 2016).

Knowledge Distillation

source: Gou, 2020 (link)

While pruning and quantization focus on changing how computations are performed, knowledge distillation (KD) changes the structure of the model itself. In this method of compression, a larger and more accurate model, ‘the teacher,’ supervises a smaller and more compact model, ‘the student.’ The teacher is a computationally complex and cumbersome deep neural network whose unwieldiness prevents it from being deployed in resource-limited environments. The pre-trained teacher, having learned as much as it can from a large dataset, transfers its knowledge to the student network, which then takes over inference because it can run on computationally restricted devices. The student ‘mimics’ the teacher in order to approach the teacher's performance at a fraction of the cost.

According to Gou (2020), distilled knowledge can be divided into three categories: response-based, feature-based, and relation-based. Response-based KD has the student learn the teacher's final output layer, i.e. its predictions. Feature-based KD focuses on the outputs of intermediate layers, aiming to match the feature activations of teacher and student. Relation-based KD does not target specific layers; instead, it captures the relationships between different layers, data samples, or feature maps.
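
A minimal sketch of a response-based distillation loss is shown below, in the spirit of Hinton (2015): the student minimizes a weighted sum of ordinary cross-entropy on the ground-truth labels and a temperature-scaled KL divergence to the teacher's softened predictions. The temperature and weighting values here are illustrative, not a recommended recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: match softened teacher outputs + fit hard labels."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                         # standard T^2 gradient scaling
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits standing in for teacher/student outputs.
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```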

AutoML

Manually searching for an optimal model architecture among near-infinite possibilities using hand-tuned heuristics is time-consuming and labor-intensive. Such conventional ‘rule-based policies’ are also considered suboptimal because a policy crafted for one model rarely transfers to another. Instead, the field now turns to automation, namely AutoML, to find neural architecture designs suited for compression. In particular, automated Neural Architecture Search (NAS) is an AutoML technique that has attracted growing attention. In the experiments of He (2018), automated compression with a reinforcement-learning framework called AMC outperformed ‘rule-based’ model compression in both accuracy and compression quality. Automation in this vein extends further to practices such as automated pruning and automated quantization, enabling faster and more efficient compression.

source: He, 2018 (link)
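
AMC itself trains a reinforcement-learning agent to pick per-layer compression ratios, which is more than a few lines of code. As a toy stand-in for the idea of automated search, the sketch below randomly samples per-layer pruning ratios and keeps the best-scoring candidate under a parameter budget; the layer sizes, budget, and proxy score are all made up for illustration.

```python
import random

LAYER_PARAMS = [3e5, 1.2e6, 2.4e6, 4.7e6]  # hypothetical per-layer parameter counts
BUDGET = 4e6                               # target total parameter count

def proxy_accuracy(ratios):
    """Placeholder for a real evaluation (e.g. validation accuracy after pruning)."""
    # Here we simply penalize aggressive pruning; a real search would measure accuracy.
    return 1.0 - 0.3 * (sum(ratios) / len(ratios))

best = None
for _ in range(200):                       # random search instead of an RL agent
    ratios = [random.uniform(0.0, 0.9) for _ in LAYER_PARAMS]  # fraction pruned per layer
    params = sum(p * (1 - r) for p, r in zip(LAYER_PARAMS, ratios))
    if params > BUDGET:
        continue                           # violates the compression constraint
    score = proxy_accuracy(ratios)
    if best is None or score > best[0]:
        best = (score, ratios, params)

print("best per-layer pruning ratios:", [round(r, 2) for r in best[1]])
```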

The techniques described above are designed independently but are not mutually exclusive; they are complementary and can be combined (Cheng, 2023). Each comes with its own set of trade-offs, so choosing which methods to apply jointly requires solid expertise in the field.

Here at SqueezeBits, we are experienced in applying these compression methods under our clients' target hardware constraints. In 2023, we combined quantization, pruning, and knowledge distillation to compress the Stable Diffusion model, generating a 512 x 512 image with an inference latency of less than 7 seconds on a Galaxy S23 and less than a second on an iPhone 14 Pro. We constantly keep up with the latest methodology to find newer, more effective ways to enhance and accelerate your AI model.

For more information and updates, please visit:

References

[1] Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., & Bengio, Y. (2016). Binarized Neural Networks: Training deep neural networks with weights and activations constrained to +1 or -1.

[2] Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., … Wei, F. (2023). BitNet: Scaling 1-bit Transformers for Large Language Models.

[3] Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers.

[4] Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.

[5] Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks.

[6] Cai, H., Lin, J., & Han, S. (2022). Chapter 4: Efficient methods for deep learning. In E. R. Davies & M. A. Turk (Eds.), Advanced Methods and Deep Learning in Computer Vision (pp. 159–190).

[7] Zhu, M., & Gupta, S. (2017). To prune, or not to prune: Exploring the efficacy of pruning for model compression.

[8] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., & Dally, W. J. (2016). EIE: Efficient inference engine on compressed deep neural network.

[9] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network.

[10] Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2020). Knowledge Distillation: A Survey.

[11] Elsken, T., Metzen, J. H., & Hutter, F. (2018). Neural Architecture Search: A Survey.

[12] He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., & Han, S. (2018). AMC: AutoML for Model Compression and acceleration on mobile devices.

[13] Cheng, H., Zhang, M., & Shi, J. Q. (2023). A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations.

[14] Choi, J., Kim, M., Ahn, D., Kim, T., Kim, Y., Jo, D., … Kim, H. (2023). Squeezing large-scale diffusion models for mobile.
