TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Quantizing the AI Colossi

81 min read · Apr 15, 2024


Image by author using DALL-E 3

Quantization

Image by author using DALL-E 3
Figure from the Gholami et al. 2021 survey demonstrates the concept of Quantization-Aware Training (QAT).
Figure from the Gholami et al. 2021 survey shows the distinction between full-precision, simulated quantization, and integer-only quantization.
Flowchart from Olivia Weng’s 2021 quantization survey concisely delineates QAT and PTQ.

Outline

The Mechanics of Quantization

Image by author using DALL-E 3.

Bit Width

Floating-Point, Fixed-Point, and Integer-Only Quantization

Uniform Quantization

Typical quantization function. Image by author.
Formula for calculating the quantization function’s scaling factor (S) based on the clipping range ([α, β]) and desired bit-width (b). Image by author.
Dequantization operation. Image by author.
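To make these three formulas concrete, here is a minimal NumPy sketch of asymmetric uniform quantization: the scaling factor S is derived from the clipping range [α, β] and the bit width b, real values are mapped to integer codes, and dequantization recovers an approximation of the inputs. The function names and the zero-point convention are illustrative choices, since exact conventions vary slightly between papers.

```python
import numpy as np

def quantize(r, alpha, beta, b):
    """Asymmetric uniform quantization of real values r into b-bit integer codes."""
    S = (beta - alpha) / (2 ** b - 1)          # scaling factor from clipping range and bit width
    Z = np.round(-alpha / S)                   # zero-point: the integer code that real value 0 maps to
    q = np.clip(np.round(r / S) + Z, 0, 2 ** b - 1)
    return q.astype(np.int32), S, Z

def dequantize(q, S, Z):
    """Map integer codes back to approximate real values."""
    return S * (q - Z)

x = np.array([-1.0, -0.2, 0.0, 0.4, 1.5])
q, S, Z = quantize(x, alpha=-1.0, beta=1.5, b=8)
x_hat = dequantize(q, S, Z)                    # per-element error is bounded by roughly S/2
```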

Non-Uniform Quantization

Non-Uniform Quantization formula, where Xi are quantization levels, and ∆i are the quantization steps. Image by author.
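Reconstructed from the caption above (assuming the notation of the Gholami et al. survey), the non-uniform quantization rule simply assigns any real value that falls inside the i-th quantization step to the i-th quantization level:

```latex
Q(r) = X_i, \quad \text{if } r \in [\Delta_i, \Delta_{i+1})
```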

Mixed-Precision Quantization

Scalar vs. Vector Quantization
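As a rough sketch of the difference: scalar quantization rounds each weight independently onto a grid, while vector quantization maps whole groups of weights to entries of a learned codebook (the k-means weight sharing in Deep Compression and SqueezeLLM is a close relative). The toy example below uses scikit-learn's KMeans purely for illustration; the shapes, sub-vector length, and codebook size are arbitrary choices, not taken from any paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)   # toy weight matrix

# Vector quantization: split W into length-4 sub-vectors and snap each one to
# its nearest entry in a small learned codebook (store codebook + indices only).
d, k = 4, 32
subvecs = W.reshape(-1, d)
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(subvecs)
W_vq = km.cluster_centers_[km.labels_].reshape(W.shape)

# Scalar quantization: round each weight independently onto a uniform 3-bit grid.
levels = 2 ** 3
S = (W.max() - W.min()) / (levels - 1)
W_sq = np.round((W - W.min()) / S) * S + W.min()

print("VQ reconstruction MSE:", float(np.mean((W - W_vq) ** 2)))
print("Scalar reconstruction MSE:", float(np.mean((W - W_sq) ** 2)))
```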

Compensating for the Effects of Quantization

The History of Neural Network Quantization

Image by author using DALL-E 3.

Early Work in Neural Network Quantization

Quantization in the Post-AlexNet Era

Quantization-Aware Training of CNNs

Image by author using DALL-E 3.
Results from Gupta et al. 2015 show the power of stochastic rounding: 16-bit fixed-point training with 14 bits allocated to the fractional length (FL) nearly matches the floating-point training curve, while the same fixed-point arithmetic using round-to-nearest causes training to diverge.
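The stochastic rounding trick behind this result is easy to sketch: instead of always rounding to the nearest representable value, round down or up with probability proportional to proximity, so small updates survive on average rather than being silently flushed to zero. A minimal NumPy illustration (the fixed-point format and values here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def round_nearest(x, fl):
    """Round to the nearest value representable with `fl` fractional bits."""
    scale = 2.0 ** fl
    return np.round(x * scale) / scale

def round_stochastic(x, fl):
    """Round down or up with probability proportional to proximity, so rounding
    is unbiased in expectation (E[round(x)] == x)."""
    scale = 2.0 ** fl
    scaled = x * scale
    floor = np.floor(scaled)
    return (floor + (rng.random(x.shape) < scaled - floor)) / scale

# Tiny updates below the fixed-point resolution are always lost with nearest
# rounding, but survive on average with stochastic rounding.
g = np.full(10_000, 1e-4)
print(round_nearest(g, fl=10).mean())     # -> 0.0 (resolution is 2**-10 ≈ 1e-3)
print(round_stochastic(g, fl=10).mean())  # -> ≈ 1e-4 in expectation
```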
Flowchart from “Deep Compression” illustrates their three-stage compression technique.
Results from “Deep Compression” show impressive compression rates with no loss in accuracy.
Results from Jacob et al. 2017 compare the latency-vs-accuracy tradeoff of the reference floating-point MobileNets and their 8-bit quantized counterparts on two types of mobile CPUs. Notice that for the Snapdragon 821, which is more optimized for floating-point arithmetic, the advantage of 8-bit quantization is less noticeable.
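The mechanism behind the Jacob et al. approach (and most QAT schemes since) is simulated, or "fake", quantization with a straight-through estimator: weights are quantized and immediately dequantized in the forward pass so the network experiences quantization error during training, while gradients bypass the non-differentiable rounding and update full-precision master weights. A minimal PyTorch sketch of that pattern follows; the class names and the symmetric per-tensor scheme are illustrative, not the exact recipe of any one paper.

```python
import torch

class FakeQuant8(torch.autograd.Function):
    """Simulated (fake) 8-bit quantization: quantize-dequantize in the forward pass,
    pass gradients straight through in the backward pass (STE)."""

    @staticmethod
    def forward(ctx, w):
        qmax = 127                                       # symmetric signed 8-bit range
        scale = w.abs().max().clamp(min=1e-8) / qmax     # per-tensor scale
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                               # STE: ignore rounding in the gradient

class QATLinear(torch.nn.Linear):
    """Linear layer that trains against quantized weights but keeps FP32 master weights."""
    def forward(self, x):
        return torch.nn.functional.linear(x, FakeQuant8.apply(self.weight), self.bias)

layer = QATLinear(16, 4)
loss = layer(torch.randn(8, 16)).pow(2).mean()
loss.backward()          # gradients flow to the full-precision weights via the STE
```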

The Rise of Mixed-Precision Quantization

Image by author using DALL-E 3.
DNAS results show computational-cost compression rates for the top three searched architectures, ranked by accuracy. Note that their method gets “free lunch” computational compression of 33–40x in the arch-1 column.
HAQ results table compares their approach to Deep Compression. Note that both approaches nearly match the full-precision baseline models in ~4-bit settings.
Table of HAWQ-V2 results with ResNet50 on ImageNet.
Results from the HAWQ-V3 paper using ResNet50 on ImageNet. Notice that the all-8-bit quantized HAWQ-V3 beats the full-precision baseline. “Int” means integer-only, “Uni” uniform, and “BL” baseline (baselines vary between publications; the authors choose the strongest available for their study). “Dist” refers to the use of knowledge distillation.
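The common thread in these mixed-precision methods is a two-step recipe: score each layer's sensitivity to quantization (HAWQ uses second-order Hessian information), then allocate bit widths under a global budget so that sensitive layers keep more precision. The sketch below illustrates only the allocation step, with made-up sensitivity scores and a simple greedy rule; it is not the HAWQ, HAQ, or DNAS algorithm.

```python
import numpy as np

# Hypothetical per-layer sensitivity scores and parameter counts. HAWQ-style
# methods estimate sensitivity from Hessian information; these numbers are
# invented purely to show the bit-allocation step.
sensitivity = {"conv1": 8.0, "block1": 2.5, "block2": 0.9, "fc": 0.2}
params      = {"conv1": 0.1e6, "block1": 2.0e6, "block2": 8.0e6, "fc": 2.0e6}

budget_bits = 4.5 * sum(params.values())        # average budget of 4.5 bits per weight
allocation = {name: 2 for name in sensitivity}  # start every layer at 2 bits

# Greedily promote the most sensitive layers (per parameter) to higher precision
# while the total bit budget allows it.
while True:
    spent = sum(allocation[n] * params[n] for n in params)
    candidates = [n for n in allocation
                  if allocation[n] < 8 and spent + 2 * params[n] <= budget_bits]
    if not candidates:
        break
    best = max(candidates, key=lambda n: sensitivity[n] / params[n])
    allocation[best] += 2                        # promote 2 -> 4 -> 6 -> 8 bits

print(allocation)   # more sensitive (and cheaper) layers end up with more bits
```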

Post-Training Quantization of CNNs

Image by author using DALL-E 3.
Results from Krishnamoorthi 2018 show the effects of different W8A8 PTQ schemes across different CNN architectures. “Mv1” and “Mv2” indicate MobileNet v1 and v2, which show catastrophic accuracy loss when using per-layer weight quantization.
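The MobileNet failure mode is easy to reproduce in miniature: depthwise-separable layers have output channels with wildly different weight ranges, so a single per-tensor (per-layer) scale crushes the small-range channels to zero, while per-channel scales preserve them. An illustrative NumPy sketch, with arbitrary shapes and values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two output channels with very different weight ranges, mimicking what is
# observed in depthwise-separable convolutions.
W = np.stack([rng.normal(0, 0.01, 64), rng.normal(0, 1.0, 64)])

def sym_quant(w, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

W_per_tensor  = sym_quant(W)                              # one scale for the whole tensor
W_per_channel = np.stack([sym_quant(row) for row in W])   # one scale per output channel

for name, Wq in [("per-tensor", W_per_tensor), ("per-channel", W_per_channel)]:
    err = np.abs(W - Wq)[0].mean()    # reconstruction error on the small-range channel
    print(f"{name}: mean abs error on the 0.01-std channel = {err:.6f}")
```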
Chart from Banner et al. 2019 shows the relative performance of their 4-bit PTQ method.
Chart from DFQ shows how quickly PTQ performance drops without special techniques being applied, with catastrophic loss below 12 bits. Even the DFQ approach does not hold up below 8 bits.
Results from the ZeroQ paper show the superiority of their method over previous state-of-the-art PTQ methods. “No D” means “No Data” (data-free, aka zero-shot), and “No FT” indicates no fine-tuning required (PTQ). Note that 8-bit ZeroQ comes very close to full-precision baseline performance without any data or retraining.
Chart from AdaRound shows the distribution of performance across the randomly sampled perturbations in comparison to a round-to-nearest scheme, showing there are many better solutions, and that the better solutions correlate strongly with the second-order Taylor series term.
Results from AdaRound demonstrate that this method preserves CNN performance down to W4A8 precision better than previous methods.
Results of AdaQuant show superior ImageNet top-1 accuracy over AdaRound and Quantization-Aware Knowledge Distillation (QAT-KLD) at various calibration dataset scales. Variance is calculated over 5 runs for each configuration.
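The core observation behind AdaRound can be shown with a toy brute-force search: choosing each weight's rounding direction (down or up) to minimize the error of the layer's output, rather than the error of the weights themselves, is never worse than round-to-nearest and is often noticeably better. AdaRound learns this choice with a continuous relaxation and a second-order-motivated objective; the exhaustive search below is only an illustration of the premise, with made-up values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=6)                  # a tiny "layer" with 6 weights
X = rng.normal(size=(256, 6))           # calibration inputs
step = 0.3                              # quantization step size

floor = np.floor(w / step)
nearest = np.round(w / step) * step     # round-to-nearest baseline

# Try every combination of rounding each weight down or up, and keep the one
# that minimizes the error of the layer's *output*, not of the weights.
best_err, best_w = np.inf, None
for choice in itertools.product([0, 1], repeat=len(w)):
    w_q = (floor + np.array(choice)) * step
    err = np.mean((X @ w - X @ w_q) ** 2)
    if err < best_err:
        best_err, best_w = err, w_q

print("output MSE, round-to-nearest:", np.mean((X @ w - X @ nearest) ** 2))
print("output MSE, best up/down choice:", best_err)   # nearest is in the search space, so never worse
```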

Extreme Quantization: Binary and Ternary Networks

Image by author using DALL-E 3.
Results from BinaryConnect show that the binarization of the network weights acts as a form of regularization and actually improves training outcomes in the CNN and task studied.
Results on CIFAR-10 from Lin et al. 2015 show that using QBP leads to results equivalent to BinaryConnect, and that the ternary network achieves slightly better training outcomes. All quantization approaches exceed baseline performance thanks to the regularization effect induced by stochastic quantization.
Results from the BNN paper show that binarized networks achieve validation errors nearly on par with the baseline.
Figure from XNOR-Net shows the accuracy tradeoffs of their two proposed approaches. Note that the Binary Weight Network without binarized inputs matches the baseline performance, but doesn’t achieve the dramatic computation savings of the XNOR-Net configuration.
Results from the XNOR-Net paper show the improvements of XNOR-Net over BNN (in full end-to-end binarization), and of BWN over BinaryConnect (in weight-only binarization), on the complex ImageNet classification benchmark. Note that the XNOR-Net top-1 scores seem to saturate about halfway through training, which could indicate a crippling lack of representational power in the binary signal.
Results from the TWN paper show that the additional expressive power of ternary networks is beneficial, particularly in the more challenging ImageNet and Pascal VOC tasks. Remember that BNN and XNOR-Net binarize the activation signals, whereas the TWN approach, like BinaryConnect and BWN, focuses only on quantization of the weights, which is less challenging.
Results from the LAQ paper show comparisons with LAB, BinaryConnect, and BWN.
Results from Dong et al. 2017 show that stochastic quantization (SQ) leads to marked improvements over baselines.
Results from BENN show that BNN ensembles can offer a step change in the performance of binarized networks. Note that they compare to ABC-Net with only a 1-bit base for weights and activations, which is a curious choice, since they are comparing against ensembles of 3 or 6 BNNs. The complexity of these ensembles without parallelization would be O(3) and O(6), so it would be fairer to compare them to similarly complex ABC-Net configurations, which as we can see above range from 49.1% (3-bit/1-bit) to 54.1% (5-bit/1-bit). These are still lower, but offer a much more informative comparison. Consider also that INQ produces ternary networks which exceed these results with fewer bits, but the concept of parallelized ensembles of binary networks is extremely compelling.
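A compact PyTorch sketch of the recurring recipe in these papers: binarize weights with sign() scaled by their mean absolute value (as in XNOR-Net's BWN), binarize activations with sign() and a clipped straight-through estimator (as in BNN), and keep full-precision latent weights for the update. Class and function names are illustrative, not taken from any one codebase.

```python
import torch

class BinarizeWeights(torch.autograd.Function):
    """Forward: scaled sign of the weights. Backward: straight-through gradients."""
    @staticmethod
    def forward(ctx, w):
        return w.abs().mean() * torch.sign(w)      # per-tensor scale, as in BWN
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                          # treat sign() as identity

class BinarizeActivations(torch.autograd.Function):
    """Forward: sign of the activations. Backward: STE clipped to |x| <= 1, as in BNN."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

class BinaryLinear(torch.nn.Linear):
    """Linear layer that computes with binary weights/inputs but updates FP32 latent weights."""
    def forward(self, x):
        return torch.nn.functional.linear(
            BinarizeActivations.apply(x), BinarizeWeights.apply(self.weight), self.bias)

layer = BinaryLinear(32, 8)
layer(torch.randn(4, 32)).sum().backward()          # gradients land in the latent FP32 weights
```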

Quantization of LLMs

Image by author using DALL-E 3.

Quantization in Early Era of Transformers

Results from the BinaryBERT paper give a comprehensive view of BERT compression techniques and their relative sizes, and show the impressive representation power maintained by their binarization method. Keep in mind that GOBO is a PTQ method, which is a considerable handicap.

Post-Training Quantization (PTQ) of LLMs

Plots from the ZeroQuant paper show the token activation and weight ranges in the attention output matrices across transformer layers in GPT-3 350M.
Results from ZeroQuant of GPT-3 350M show baseline performance is matched by the W8A8 configuration. For lower precisions, the LKD step greatly improves performance, but still does not recover baseline accuracy, even in the sub-billion parameter scale model shown here. Note the sharp drop in performance from 8 to 4 bits in the weights in all settings, even with the higher-precision activations, which shows us how difficult low-bit PTQ of LLMs was in mid-2022.
Results from LLM.int8() show that its separate handling of systematic activation outliers preserves accuracy in transformers at ≥6.7B scale.
Results from GPTQ (aka OPTQ) show better consistency and performance than round-to-nearest (RTN, the scheme used by LLM.int8()) in 4-bit and 3-bit quantization scenarios.
Figure from SmoothQuant demonstrates the offloading of activation outliers into the weights in order to make them more amenable to quantization.
Diagram from SmoothQuant clearly demonstrates the channel-wise regularity of outliers in the activations, which can be absorbed into the weights.
Results from SmoothQuant demonstrate equal performance to LLM.int8() without the need for mixed-precision decomposition of the activation signals.
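The migration the figures describe is simple to sketch in NumPy: divide each activation channel by a smoothing factor s_j and multiply the corresponding weight row by the same factor, which leaves the layer's output mathematically unchanged while flattening the activation outliers into the (easier-to-quantize) weights. The α-balanced form of s follows the SmoothQuant paper; the shapes and the injected outlier channel below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_in, C_out = 128, 64, 64
X = rng.normal(size=(T, C_in))
X[:, 7] *= 50.0                              # a systematic outlier channel, as seen in large LLMs
W = rng.normal(size=(C_in, C_out)) * 0.05

alpha = 0.5                                  # migration strength
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s                             # activations become much easier to quantize...
W_smooth = W * s[:, None]                    # ...and the difficulty is absorbed by the weights

assert np.allclose(X @ W, X_smooth @ W_smooth)             # output is mathematically unchanged
print("max |X| before/after:", np.abs(X).max(), np.abs(X_smooth).max())
```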
Results from ZeroQuant-V2 show perplexity scores of various PTQ approaches on LLMs. Note that the real divergence between ZeroQuant-V2 and GPTQ is seen in the W4A8 quantization of OPT-style models, which we now know contain sensitive outliers thanks to the SmoothQuant paper; it is therefore likely that the improved performance of ZeroQuant-V2 over GPTQ comes from its block-wise granularity better preserving the outliers occurring in the activation channels.
Results from AWQ show improved perplexity scores over RTN (LLM.int8) and GPTQ.
Chart from SqueezeLLM compares their quantized models to FP16 models of the same size, showing that for an equivalent memory footprint, quantized models provide significantly better performance. This figure is a compelling illustration of the indisputable benefits of network quantization.
Results from SqueezeLLM show roughly equivalent instruction-tuning results to AWQ under given bit constraints.
Compute time of HQQ compared with GPTQ and AWQ on the large LLaMA-2-70B model.
Chart from HQQ shows that it provides lower or equal perplexity at given memory budgets to other state-of-the-art approaches, while being much faster. BNB refers to bitsandbytes, aka the data-free LLM.int8() method.
Results from SmoothQuant+ show improved performance over RTN and AWQ, exceeding the FP16 baseline in larger models.
Chart from LUT-GEMM shows the benefit of avoiding the costly dequantization step by using the LUT system.
Results from LUT-GEMM show that their kernel and quantization method outperform AWQ and GPTQ in terms of inference latency.

Quantization-Aware Training (QAT) of LLMs

Image by author using DALL-E 3.
Diagram from LLM-QAT provides an intuitive visual reference of weight, activation, and KV quantization in transformer layers.
Results from LLM-QAT show the dominance of their approach in low-bit settings. Bit values shown in W-A-KV order. Perplexity is considered a stringent metric.
Table from the QLoRA paper demonstrates that the performance lost due to quantization can be fully recovered by applying QLoRA fine-tuning to 4-bit quantized models.
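A hedged sketch of the QLoRA recipe using the Hugging Face stack (transformers + bitsandbytes + peft): the base model is loaded in 4-bit NF4 with double quantization, then trainable LoRA adapters are attached on top of the frozen quantized weights. The model id and LoRA hyperparameters below are placeholders, and the exact argument names may shift between library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 base model with double quantization, as described in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable low-rank adapters on top of the frozen 4-bit weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # only the LoRA adapters are trainable
```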

Extreme Quantization of LLMs

Charts showing that BitNet performance scales similarly to a full-precision transformer. Note that the left chart demonstrates that equal performance can be achieved with an order of magnitude less energy consumption.
Results from BitNet b1.58 show that it closely matches or exceeds FP16 LLaMA models with equivalent parameter counts. Note that they cap their experiments at 3.9B params, likely because their training process does not improve efficiency over FP16 and is therefore very expensive.
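For reference, the weight quantizer described in the BitNet b1.58 paper is an "absmean" function that scales weights by their mean absolute value and then rounds and clips them to the ternary set {-1, 0, +1}. A small PyTorch sketch of that function, as best it can be reconstructed from the paper, is below; the tensor values are arbitrary.

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-8):
    """Ternarize weights to {-1, 0, +1} using absmean scaling, as described in BitNet b1.58."""
    gamma = w.abs().mean()                               # absmean scale
    w_q = torch.clamp(torch.round(w / (gamma + eps)), -1, 1)
    return w_q, gamma                                    # gamma is kept to rescale / dequantize

w = torch.randn(4, 8) * 0.02
w_q, gamma = absmean_ternary(w)
print(w_q.unique())                                      # values drawn from {-1, 0, 1}
```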

Practitioner’s LLM Quantization Guide

Image by author using DALL-E 3.
Results from MLC-LLM show the throughput of 4-bit CodeLlama-34B and Llama2-70B on two NVIDIA RTX 4090 GPUs. We can see that any specialized inference engine offers substantial gains over using HF Transformers, but that for these NVIDIA cards, MLC-LLM consistently outperforms the others.
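For many practitioners the entry point is even simpler than the engines compared above: community checkpoints already quantized with GPTQ or AWQ ship with their quantization config, so Hugging Face transformers can load them directly, provided the matching backend (e.g. optimum/auto-gptq or autoawq) is installed. The model id below is a placeholder example, and the exact dependency set varies with library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder GPTQ checkpoint; the quantization settings travel with the repo's config.
model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Quantization lets this model fit on a single GPU because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```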

LLM Quantization Decision Tree

Conclusion

Future Work



Written by Nate Cibik

Data Scientist with a focus in autonomous navigation
