Boosting AI Accelerator Performance with Quantization

Jeremy Qu
Published in deMISTify
Apr 28, 2024 · 9 min read

This is a continuation of my previous article on hardware parallelism for AI.

It’s no secret that, while the mechanisms of deep neural networks (DNNs) are well understood, their sheer effectiveness at tasks such as image classification and natural language processing is still something of a mystery. Unlike many of the statistical techniques that underpin deep learning, a robust mathematical explanation for why DNNs perform as well as they do remains a prominent gap in our understanding of AI as a whole. Instead, much of the innovation that drives deep learning is rooted in empirical results. The technology works; that much is clear. Hence, existing research focuses on maximizing the practical value we can extract from it.

When the topic of improving the performance of DNNs is raised, the most obvious avenue concerns the models themselves: developing more efficient architectures, optimizing deep learning frameworks, and employing better algorithms for DNN operations. However, another approach has gained significant attention in recent months: hardware acceleration. Several big tech companies, such as NVIDIA and Google, have developed their own in-house chips that are highly specialized for accelerating DNN computation. This article will highlight one of the most important methods used in AI hardware accelerators: quantization.

What is quantization?

In programming, data types are an essential concept, as they define the characteristics of the data we work with. These types include, for instance, integers or floating-point numbers, each with varying levels of precision. An integer, for example, can store whole numbers, while a floating-point number is used for more precise numerical values, capable of representing fractions and numbers with decimal points. The precision of these data types is dictated by their bit-width. For example, a 32-bit floating-point number offers less precision than a 64-bit floating-point number but requires half the amount of memory.

Quantization, in the context of computing, takes this concept of precision adjustment a step further. It is the process of mapping values from a larger set to a smaller one, simplifying the representation of numerical values so that they require less storage space. Moving from a 64-bit floating-point representation to a 32-bit one, for instance, is a simple example of quantization.
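
To make this concrete, here is a minimal sketch in NumPy (not tied to any particular accelerator or library) showing two quantization steps: casting a 64-bit float down to 32 bits, and mapping floats in the range [-1, 1] onto 8-bit integers using a scale factor chosen purely for illustration.

```python
import numpy as np

# A 64-bit value with more precision than 32 bits can hold.
x64 = np.float64(1.0 + 1e-9)
x32 = np.float32(x64)            # quantize: 64-bit float -> 32-bit float
print(x64 - np.float64(x32))     # small but nonzero rounding error (~1e-9)

# A cruder quantization: map floats in [-1, 1] onto 8-bit integers.
values = np.array([-0.73, 0.001, 0.5, 0.999], dtype=np.float64)
scale = 127.0                                  # illustrative scale factor
q = np.round(values * scale).astype(np.int8)   # larger set -> smaller set
dequant = q.astype(np.float64) / scale         # approximate reconstruction
print(q)                          # each entry now takes 1 byte instead of 8
print(dequant - values)           # the quantization error we accept in return
```

In both cases the reconstruction is close but not exact; that gap is the price paid for smaller storage.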

Why is quantization a big deal for AI accelerators?

The most obvious benefit quantization offers is faster processing speed. An AI accelerator is filled with hardware units that execute the operations DNNs rely on most heavily. The precision of the numerical values being operated on has direct physical consequences for the design of that hardware:

  • The higher the precision, the larger the bit-width, and the lower the capacity for parallel processing. For example, a 128-bit wide processing unit can handle just 2 operations on 64-bit numbers at a given time, versus 4 operations on 32-bit numbers. This translates to a significant decrease in throughput. If you are interested in learning more about parallelism in hardware, you can check out my previous article here.
  • Precision directly correlates to hardware complexity. Higher precision means a larger bit-width, and more bits means more handling needed for calculations, requiring more complicated circuitry. Also, a larger bit-width translates quite literally to more physical space needed. Together, these physical considerations mean higher power consumption, more latency and reduced efficiency.
  • AI accelerators execute instructions dispatched by the host CPU. Higher-precision arithmetic means larger operands and more complicated instructions. Not only is this data harder to transfer between devices, but the accelerator must also dedicate more area to internal memory and caches to compensate for the increased pressure on memory bandwidth.

Effective quantization mitigates all of these issues by intentionally lowering the precision of the arithmetic, trading a small amount of numerical accuracy for speed and efficiency.
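
As a rough back-of-the-envelope sketch (plain Python with NumPy, treating the 128-bit register width above as a hypothetical example), lower precision directly buys more parallel lanes and a smaller memory footprint:

```python
import numpy as np

REGISTER_BITS = 128  # hypothetical vector register width from the example above

# How many values fit side by side in one register at each precision.
for dtype in (np.float64, np.float32, np.float16, np.int8):
    lanes = REGISTER_BITS // (np.dtype(dtype).itemsize * 8)
    print(f"{np.dtype(dtype).name:>8}: {lanes} lanes per {REGISTER_BITS}-bit register")

# The same scaling shows up in memory traffic: storage for one million weights.
n = 1_000_000
for dtype in (np.float64, np.float32, np.float16, np.int8):
    print(f"{np.dtype(dtype).name:>8}: {n * np.dtype(dtype).itemsize / 1e6:.0f} MB")
```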

Fixed-point vs. floating-point

Contrary to its name, the integer data type isn’t necessarily limited to representing integer values. Fixed-point arithmetic allows integers to represent decimal values by assigning certain bits to represent the fractional part of a number. Because the number of bits is fixed and divided between the integer and fractional parts, the range of values that can be represented is limited by the number of bits allocated to the integer part. Increasing the range means reducing the precision (bits available for the fractional part), and vice versa.

Figure 1: Diagram of fixed-point representation of a binary number [1]

This loss of precision when a large range is required is often a dealbreaker for DNNs, especially during gradient descent and backpropagation, where values can span many orders of magnitude. Input data values can also vary widely in magnitude, from large image pixel values to tiny feature values in natural language processing tasks. On the flip side, fixed-point is much simpler to implement in hardware, and it runs much faster. This leads to a dilemma, where balancing sufficient range against throughput becomes a significant issue.
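
Here is a minimal sketch of fixed-point encoding, assuming a simple signed 16-bit format in which we choose how many bits to spend on the fractional part (the helper functions are illustrative, not a standard API). It shows exactly the trade-off described above: more fractional bits buy precision but shrink the representable range.

```python
import numpy as np

def to_fixed(x, frac_bits, total_bits=16):
    """Encode a real number as a signed fixed-point integer with `frac_bits`
    fractional bits, saturating at the representable range."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return int(np.clip(round(x * scale), lo, hi))

def from_fixed(q, frac_bits):
    """Decode a fixed-point integer back into a real number."""
    return q / (1 << frac_bits)

# More fractional bits -> finer precision, but a smaller representable range.
for frac_bits in (4, 8, 12):
    q = to_fixed(3.14159, frac_bits)
    max_val = from_fixed((1 << 15) - 1, frac_bits)   # largest positive 16-bit value
    print(f"{frac_bits:>2} fractional bits: 3.14159 -> {from_fixed(q, frac_bits):.5f}, "
          f"max representable ~ {max_val:g}")
```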

Meanwhile, floating-point numbers work very differently. They consist of a mantissa and an exponent, and they function similarly to scientific notation, where a decimal value between 1 and 10 is multiplied by a power of 10. The difference with floating-point is that everything is in binary: instead of a decimal, we have a binary value stored in the mantissa, which is then multiplied by 2 raised to the power of the value stored in the exponent.

Figure 2: Diagram of floating-point representation [2]

This specific floating-point number is 13 bits. Ignoring the sign bit (which determines whether the value is positive or negative), it can represent numerical values from as small as 0.0078125 all the way up to 511. A quirk of floating point is that the precision scales with the value of the exponent: when the exponent is large, each mantissa increment becomes large as well. This is analogous to scientific notation, where, when the power of 10 is large, a small change in the decimal part produces a big change in the number’s overall value.
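
To see how the mantissa and exponent interact, the sketch below decodes a simplified 13-bit float with 1 sign bit, 4 exponent bits (bias 7), and 8 mantissa bits with an implicit leading 1 and no special values. These field widths are my own assumption, chosen so the resulting range (0.0078125 up to 511) matches the numbers quoted above; the figure’s exact layout may differ.

```python
def decode_float(bits, exp_bits=4, man_bits=8, bias=7):
    """Decode an integer holding a packed small float laid out as
    [sign | exponent | mantissa], normalized with an implicit leading 1."""
    man = bits & ((1 << man_bits) - 1)
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    sign = -1 if (bits >> (man_bits + exp_bits)) & 1 else 1
    # value = sign * (1 + mantissa / 2^man_bits) * 2^(exponent - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# The same one-step change in the mantissa is worth more when the exponent is larger.
print(decode_float(0b0_0111_00000000), decode_float(0b0_0111_00000001))  # 1.0 vs ~1.004
print(decode_float(0b0_1110_00000000), decode_float(0b0_1110_00000001))  # 128.0 vs 128.5
```

Near 1, one mantissa step changes the value by about 0.004; near 128, the same step changes it by 0.5. Precision scales with the exponent, exactly as described above.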

The remainder of this article will focus on two of the most promising numerical representations: minifloat and block floating point.

Minifloat

As its name suggests, minifloats are floating-point values with very small bit-widths, such as 8-bit floating-point or even smaller. Their appeal lies in their potential to enable massively faster and more efficient processing units. The downside is that, with lower precision, calculations become more error-prone. Quantization error is the metric used to quantify the difference between a quantized value and the true value.
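
One rough way to get a feel for quantization error is to simulate lower precision by rounding each value’s mantissa to a fixed number of bits. The sketch below is a crude emulation of that idea (it leaves the exponent untouched and does not follow any particular minifloat standard such as FP8):

```python
import numpy as np

def quantize_mantissa(x, mantissa_bits):
    """Crudely emulate a lower-precision float by rounding each value's
    mantissa to `mantissa_bits` bits; the exponent is left untouched."""
    x = np.asarray(x, dtype=np.float64)
    exp = np.floor(np.log2(np.abs(x) + 1e-300))   # exponent of each value
    step = 2.0 ** (exp - mantissa_bits)           # size of one mantissa increment
    return np.round(x / step) * step

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000)                 # stand-in for DNN weights
for bits in (10, 5, 2):
    q = quantize_mantissa(weights, bits)
    err = np.abs(q - weights)
    print(f"{bits}-bit mantissa: mean quantization error = {err.mean():.2e}")
```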

Usually, the minimization of quantization error is a top design priority. This kind of unwanted rounding has led to many high-profile engineering disasters. Perhaps the most notable is the Patriot missile failure in 1991, where the system’s tracking software stored the constant 0.1, used to convert its internal clock ticks into seconds, in a 24-bit fixed-point register. Since 0.1 cannot be represented exactly in binary, a tiny truncation error was introduced on every tick. After roughly 100 hours of continuous operation, the accumulated error had grown to about a third of a second, enough for the system to fail to intercept an incoming missile.

However, when it comes to DNNs, quantization error is surprisingly not as big of an issue. Empirical evidence has shown that DNNs are inherently robust to the noise caused by reduced precision. There are many theories as to why this is the case, from the network learning to compensate for quantization error during training, to redundancy in the large number of parameters, to inherent error tolerance in the statistical underpinnings of DNNs. Regardless, minifloats offer promising benefits for increasing the efficiency of AI accelerators while bypassing many of the traditional issues that come with aggressive quantization.

Block floating point (BFP)

Because of the dynamic nature of the values that floating-point numbers can take, doing mathematical operations with them requires very complicated circuitry. When we add two fixed-point numbers together, it’s as simple as adding corresponding bits and carrying over, just as you would when adding regular numbers by hand. But with floating-point numbers…

Figure 3: Diagram of a floating-point adder circuit [3]

As you can see, this circuit is very complicated. The key reason floating-point necessitates all of this is the exponent. When you add two numbers in scientific notation, you first need to rescale them to the same power of 10 before you can add the decimal parts together. The concept is the same for floating-point, but implementing this shifting process in hardware is, as evidenced above, a big hassle.
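
A toy sketch of that alignment step (ignoring normalization, rounding modes, and the sign handling a real adder also needs) makes the extra work visible:

```python
def align_and_add(man_a, exp_a, man_b, exp_b):
    """Add two toy floating-point numbers (integer mantissa, base-2 exponent)
    by first shifting the smaller-exponent operand to match the larger one."""
    if exp_a < exp_b:
        man_a, exp_a, man_b, exp_b = man_b, exp_b, man_a, exp_a
    shift = exp_a - exp_b
    man_b_aligned = man_b >> shift        # align: shift right, dropping low bits
    return man_a + man_b_aligned, exp_a   # result mantissa and shared exponent

# 3.0 = 12 * 2^-2  and  0.625 = 10 * 2^-4
man, exp = align_and_add(12, -2, 10, -4)
print(man * 2.0 ** exp)   # 3.5, not 3.625: 0.625 lost bits during alignment
```

Even this toy version needs an exponent comparison and a variable shift, and a real adder piles normalization, rounding, and sign logic on top of that, which is what bloats the circuit in Figure 3.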

In the real world, complicated circuitry has physical consequences, leading to lower throughput and increased production cost. We can therefore see the dilemma of choosing between fixed-point and floating-point. The former offers much faster and simpler hardware implementation, whereas the latter has the benefit of a higher range, crucial for many DNN tasks. However, researchers have started to look at numerical representations that try to reach a compromise. Among the most promising is block floating point (BFP).

Figure 4: Diagram of the shared exponent feature in BFP [2]

In BFP, a block of numbers shares a single exponent, and the mantissas are represented in fixed-point format. Because all mantissas within a block share the same exponent, they effectively sit on the same scale. This shared scaling allows BFP to represent numbers of varying magnitudes more compactly than fixed-point representation, without requiring a separate exponent for each number. Hence, when performing operations such as addition within a block, there is no longer any need to align exponents. This means that BFP can be implemented on fixed-point hardware, circumventing the complicated circuitry that comes attached to floating-point representation. And, as mentioned earlier, simpler hardware leads to an increased capacity for parallel processing. Meanwhile, BFP’s range, while somewhat lower than regular floating-point due to the reduced flexibility of the exponent, is still a significant improvement over fixed-point.
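
Here is a minimal sketch of BFP quantization for a single block, assuming the shared exponent is chosen from the block’s largest magnitude (real designs add choices such as block size and rounding mode):

```python
import numpy as np

def to_bfp(block, mantissa_bits=8):
    """Quantize a block of values to block floating point: one shared exponent
    plus fixed-point integer mantissas. Returns (mantissas, shared exponent)."""
    block = np.asarray(block, dtype=np.float64)
    # Shared exponent chosen so the largest value in the block just fits.
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(block)) + 1e-300)))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))   # one bit reserved for sign
    mantissas = np.clip(np.round(block / scale),
                        -(1 << (mantissa_bits - 1)), (1 << (mantissa_bits - 1)) - 1)
    return mantissas.astype(np.int32), shared_exp

def from_bfp(mantissas, shared_exp, mantissa_bits=8):
    """Reconstruct approximate real values from a BFP block."""
    return mantissas * 2.0 ** (shared_exp - (mantissa_bits - 1))

block = np.array([0.02, -1.7, 3.9, 0.4])
m, e = to_bfp(block)
print(m, e)            # integer mantissas plus one shared exponent for the block
print(from_bfp(m, e))  # approximate reconstruction
```

Once every mantissa sits on the same scale, adding two values within a block is plain integer addition, exactly the kind of arithmetic fixed-point hardware is good at. Notice also that the smallest value in the block (0.02) is quantized crudely; that is the price of sharing one exponent with much larger neighbors.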

As it turns out, the concept of BFP is independent of the bit-width of the numbers involved. In other words, BFP can actually be combined with minifloats, forming a hybrid numerical representation dubbed block minifloat (BM) [4]. Research has shown that BM achieves competitive accuracy while possessing much higher computational density, indicating a much higher potential for parallel processing.

Figure 5: Block minifloat accuracy vs. computational density compared to other numerical representations [4]

However, while DNNs are indeed resistant to noise, at a certain point quantization begins to degrade accuracy to the point where the gain in computational efficiency arguably no longer outweighs the loss. In Figure 5, this is seen in the “BM5” representation, which denotes a 5-bit block minifloat.

If you’re interested in learning more about how block minifloat is implemented, you can check out the source article here.

Conclusion

While novel quantization techniques are topics of intensive research, they have yet to be widely adopted in industry practice. Most current software libraries and compilers do not yet support these numerical representations, and fully committing to them in hardware design is a weighty decision for semiconductor companies. However, the gap between current industry practice and the potential offered by novel quantization techniques will only continue to narrow in the near future. Once these new numerical representations are embraced, we can expect a surge in the efficiency and capabilities of AI accelerators, paving the way for exciting innovation and growth in the field.

References:

[1] D. Ngo and B. Kang, “Taylor-Series-Based Reconfigurability of Gamma Correction in Hardware Designs,” Electronics, vol. 10, no. 1959, Aug. 2021, doi: 10.3390/electronics10161959.

[2] D. Elam and C. Iovescu, “A Block Floating Point Implementation for an N-Point FFT on the TMS320C55x DSP,” Texas Instruments, SPRA948, Sep. 2003. [Online]. Available: https://www.ti.com/lit/an/spra948/spra948.pdf

[3] F. Brosser, H. Cheah, and S. Fahmy, “Iterative floating point computation using FPGA DSP blocks,” in Proc. of the 2013 23rd International Conference on Field programmable Logic and Applications, 2013, pp. 1–6, doi: 10.1109/FPL.2013.6645531.

[4] S. Fox, S. Rasoulinezhad, J. Faraone, D. Boland, and P. Leong, “A Block Minifloat Representation for Training Deep Neural Networks,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=6zaTwpNSsQ2
