[EN] FP8 Quantization with OwLite

Changjun Lee
SqueezeBits Team Blog
6 min read · Aug 5, 2024

Over the past few years, AI has made tremendous progress, and applications based on large language models (LLMs), such as ChatGPT, have already helped us in various areas of our lives. This progress can be attributed to the development of deep learning algorithms, the exponential increase in data, and groundbreaking computing power advances. In this context, new data representation formats have emerged, enabling efficient data processing and model deployment. Among them, the FP8 (Floating-point 8-bit) format stands out as a notable innovation in AI hardware architecture.

The latest AI hardware, such as NVIDIA’s GPUs (e.g., Hopper, Ada Lovelace, Blackwell) and Intel’s Gaudi 2 & 3, supports FP8-based operations. Their specification sheets often report FP8 TFLOPS (tera floating-point operations per second) alongside BF16 TFLOPS, and there has been a recent push within frameworks like TensorRT-LLM to leverage the FP8 format to improve the inference efficiency of Transformer models, primarily LLMs. However, even though the hardware supports the FP8 format, it was difficult to use in other applications such as computer vision until recently due to the lack of software support.

At SqueezeBits, we have proactively supported FP8 convolution since before TensorRT officially did. With the recent TensorRT updates adding FP8 convolution support, we can now compare the performance of FP8 quantized models built using OwLite and TensorRT.

In this post, we will briefly explore the FP8 format and examine the performance of TensorRT engines quantized to FP8 format using OwLite.

What is the FP8 (Floating-point 8-bit) Format?

The FP8 format is a floating-point data type with an 8-bit data length. A floating-point number consists of (1) a sign bit, (2) an exponent, and (3) a mantissa. The bit allocations for FP32 and FP16, as defined by the IEEE-754 standard, and for BF16 are as follows.

Bit layouts of FP32, FP16, and BF16 (source: https://www.exxactcorp.com/blog/hpc/what-is-fp64-fp32-fp16)

Due to the absence of a standard for the FP8 format, various combinations of exponent and mantissa sizes have been explored. In 2022, NVIDIA, Arm, and Intel jointly proposed two 8-bit floating-point types, FP8 E5M2 and FP8 E4M3, following IEEE-754 conventions wherever possible [1].

FP8 for Deep Learning, NVIDIA GTC 2023

NVIDIA’s Hopper and Ada Lovelace architectures support both FP8 formats.

E4M3 is primarily used for inference [1]. Unless otherwise noted, the FP8 format in this post refers to E4M3.
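As a concrete illustration of the E4M3 layout (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7), here is a small Python sketch that decodes an 8-bit pattern into its real value, following the conventions described in [1]: no infinities, and a single NaN bit pattern per sign.

```python
def decode_e4m3(byte: int) -> float:
    """Decode an FP8 E4M3 bit pattern into its real value (per [1])."""
    sign = -1.0 if (byte >> 7) & 0x1 else 1.0
    exponent = (byte >> 3) & 0xF  # 4 exponent bits, bias = 7
    mantissa = byte & 0x7         # 3 mantissa bits
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")       # the only NaN pattern; E4M3 has no infinities
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** -6           # subnormal numbers
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

print(decode_e4m3(0b0_1111_110))  # 448.0, the largest finite E4M3 value
print(decode_e4m3(0b0_0000_001))  # 0.001953125, the smallest subnormal (2^-9)
```

The 448 maximum also determines the quantization scale used by the AbsMax calibration described later in this post.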

FP8 Format Support in TensorRT

TensorRT has supported the FP8 data type (E4M3) since version 8.6 and began supporting FP8 operations through explicit quantization in version 9.0. However, since it only supported FP8 GEMM operations, it could not be used for models containing convolution layers, and it was mainly used for large language models (LLMs) via TensorRT-LLM.

In the recently released version 10.2, support for regular FP8 convolution has been added, allowing models with convolution layers to be quantized to the FP8 format. However, some constraints remain that limit its coverage:

  • FP8 convolution is not implemented for layers whose input/output channel counts are not multiples of 16.
  • Grouped convolution and depthwise convolution layers are not supported.

We tested the ResNet18 model, applying Post-Training Quantization (PTQ) to all layers except the first convolution layer, which is excluded by the channel constraint (it has only 3 input channels).
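To make the channel constraint concrete, the short sketch below walks torchvision’s ResNet18 and flags every convolution that falls outside the two conditions above; only the first convolution (3 input channels) is flagged, which is why it is left unquantized here. The eligibility check is our reading of the TensorRT 10.2 constraints, not an official API.

```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18()
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        ok_channels = module.in_channels % 16 == 0 and module.out_channels % 16 == 0
        ok_groups = module.groups == 1  # grouped / depthwise convolutions are unsupported
        if not (ok_channels and ok_groups):
            print(f"{name}: in={module.in_channels}, out={module.out_channels}, "
                  f"groups={module.groups} -> not eligible for FP8")
# Prints only 'conv1' (3 input channels); every other conv in ResNet18 qualifies.
```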

PTQ was done using OwLite.

Testing Environment and Throughput Results:

  • GPU: NVIDIA L40S (Ada Lovelace architecture)
  • Data Types: FP16 (baseline), FP8
  • PTQ Calibration Method: AbsMax, per-tensor quantization for activations and weights (see the sketch after this list)
  • Datasets: ImageNet
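For reference, AbsMax per-tensor calibration for FP8 E4M3 amounts to dividing the observed maximum absolute value by the largest representable E4M3 magnitude (448). The sketch below shows this scale computation and a fake-quantization round trip in PyTorch; it is a generic illustration of the method, not OwLite’s internal implementation, and the float8_e4m3fn cast requires a recent PyTorch version.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def absmax_scale(tensor: torch.Tensor) -> torch.Tensor:
    """Per-tensor AbsMax calibration: a single scale for the whole tensor."""
    return tensor.abs().max() / E4M3_MAX

def fake_quant_e4m3(tensor: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 E4M3 and dequantize back, to simulate the accuracy impact."""
    scale = absmax_scale(tensor)
    q = (tensor / scale).to(torch.float8_e4m3fn)  # needs PyTorch >= 2.1
    return q.to(tensor.dtype) * scale

x = torch.randn(128, 3, 224, 224)
print(absmax_scale(x), (x - fake_quant_e4m3(x)).abs().max())  # scale and worst-case error
```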

Surprisingly, the results show that the engine built with the FP8 format has lower throughput than the one built with the FP16 format. This likely indicates that TensorRT’s FP8 kernels are not yet as well optimized as its FP16 kernels.

When building an engine with the FP8 data type in TensorRT 10.2, the --stronglyTyped flag must be added; otherwise, the FP8 implementation tactics will not be activated. This issue has been shared with the NVIDIA TensorRT team and is expected to be fixed in version 10.3.
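The same option is exposed in the TensorRT Python API as the STRONGLY_TYPED network creation flag. Below is a minimal build sketch assuming an FP8 Q/DQ ONNX file; the file names are placeholders, and the flag name follows the TensorRT 10.x Python API.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Strongly typed network: layer precisions follow the ONNX Q/DQ nodes themselves,
# which is what activates the FP8 tactics in TensorRT 10.2.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED)
)
parser = trt.OnnxParser(network, logger)
with open("resnet18_fp8_qdq.onnx", "rb") as f:   # placeholder file name
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
engine = builder.build_serialized_network(network, config)
with open("resnet18_fp8.engine", "wb") as f:     # placeholder file name
    f.write(engine)
```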

OwLite’s FP8 Format Support

SqueezeBits’ OwLite has supported the FP8 format since before TensorRT 10.2. Utilizing NVIDIA’s CUTLASS library, we implemented FP8 GEMM and convolution operations as TensorRT custom plugins, making it easy to quantize models to the FP8 format and build faster TensorRT engines.

The process is straightforward. First, upload the ResNet18 baseline through the owlite package. Then, create an experiment via the web GUI, selecting fp8_e4m3 as the data type for the operations to be quantized.

operator options in OwLite Web GUI

Once the operations are set, save the configuration, and return to the code to perform calibration and benchmarking, completing all steps.

For detailed instructions, refer to the ResNet18 Tutorial.
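As a minimal sketch of the code-side steps, the snippet below follows the pattern shown in the ResNet18 Tutorial; the project, baseline, and experiment names are placeholders, and exact arguments may differ across owlite versions.

```python
import owlite
import torch
from torchvision.models import resnet18

# Register the baseline and the experiment created in the web GUI (placeholder names).
owl = owlite.init(project="fp8-demo", baseline="resnet18", experiment="fp8_e4m3")

model = resnet18(weights="IMAGENET1K_V1").cuda().eval()
model = owl.convert(model, torch.randn(128, 3, 224, 224).cuda())

# Stand-in for a handful of ImageNet calibration batches.
calib_loader = [(torch.randn(128, 3, 224, 224), None) for _ in range(8)]

with owlite.calibrate(model) as calibrate_model:
    for images, _ in calib_loader:
        calibrate_model(images.cuda())

owl.export(model)   # export the quantized ONNX to the OwLite backend
owl.benchmark()     # build the TensorRT engine and report latency
```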

Here is the extracted ONNX file:

ResNet18 with native FP8 Q/DQ nodes

Building a TensorRT engine with this ONNX file uses TensorRT’s native FP8 implementations. However, as shown above, its performance is lower than that of the engine built with FP16. OwLite therefore provides custom GEMM and convolution operations, implemented as custom plugins, to build a faster engine.

To apply the custom plugins, the ONNX graph must be converted so that the quantized operations are recognized as custom plugin operations rather than native ones. The OwLite backend performs the necessary ONNX graph conversion before building the engine.
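The actual rewrite is handled by the OwLite backend, but as a toy illustration of the kind of transformation involved, the sketch below uses onnx-graphsurgeon to collapse a QuantizeLinear/DequantizeLinear pair feeding a Conv into a single custom node. The op name TRT_FP8Conv, the file names, and the simplified matching pattern are all hypothetical; this is not OwLite’s actual implementation.

```python
import onnx
import onnx_graphsurgeon as gs

# Load the Q/DQ ONNX exported earlier (file names here are placeholders).
graph = gs.import_onnx(onnx.load("resnet18_fp8_qdq.onnx"))

for conv in [node for node in graph.nodes if node.op == "Conv"]:
    dq_nodes = conv.inputs[0].inputs           # producer(s) of the Conv's activation input
    if not dq_nodes or dq_nodes[0].op != "DequantizeLinear":
        continue
    q_nodes = dq_nodes[0].inputs[0].inputs     # producer(s) of the DequantizeLinear input
    if not q_nodes or q_nodes[0].op != "QuantizeLinear":
        continue
    # Replace the QuantizeLinear -> DequantizeLinear -> Conv chain with one custom
    # node, so TensorRT maps it to the plugin instead of its native implementation.
    activation = q_nodes[0].inputs[0]          # tensor feeding the original QuantizeLinear
    weight = conv.inputs[1]
    outputs = list(conv.outputs)
    conv.outputs.clear()                       # detach the original Conv
    graph.nodes.append(gs.Node(op="TRT_FP8Conv",  # hypothetical plugin op name
                               inputs=[activation, weight],
                               outputs=outputs,
                               attrs=dict(conv.attrs)))

graph.cleanup().toposort()                     # drop the now-dangling Q/DQ/Conv nodes
onnx.save(gs.export_onnx(graph), "resnet18_fp8_plugin.onnx")
```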

Here is the converted ONNX graph:

ResNet18 with custom nodes

You may notice the graph appears simpler than the original. As mentioned, for TensorRT to use custom plugins, the graph must be transformed into custom nodes. However, custom nodes prevent TensorRT from applying its graph optimizations. To achieve optimal efficiency, OwLite performs graph optimizations similar to TensorRT’s after ONNX transformation, resulting in the above graph.
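One typical optimization of this kind is folding a BatchNorm into the preceding convolution, which removes a node while preserving the math. The PyTorch sketch below shows the folding formula as a generic illustration; it is not OwLite’s exact optimization pass.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics and affine parameters into the convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # per-channel gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused

conv = nn.Conv2d(64, 64, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(64).eval()
bn.running_mean.normal_(); bn.running_var.uniform_(0.5, 1.5)  # non-trivial statistics
x = torch.randn(1, 64, 56, 56)
print(torch.allclose(bn(conv(x)), fold_bn_into_conv(conv, bn)(x), atol=1e-4))  # True
```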

The transformed ONNX, along with SqueezeBits’ custom plugin library, is built into a TensorRT engine. The performance results of the engines built using the custom plugin operations are shown below.

FP8-native: an engine built with TensorRT’s native FP8 operations, FP8-OwLite: an engine built with custom FP8 operations

These results demonstrate that using OwLite’s custom plugins significantly enhances performance compared to engines built with native FP8 operations.

We’ve reviewed the latency of engines built with PTQ. In PTQ, optimizing the model to reduce latency while maintaining the original model’s accuracy is crucial.

OwLite provides a web GUI to easily compare various metrics measured from each experiment. For example, we can compare the accuracy of models with a batch size of 128, as shown below.

A result of the baseline and experiments in Web GUI

The deployment size of models using custom plugins (fp8_plugin) is larger due to the inclusion of the custom plugin library in the TensorRT engine.

The results above show that FP8 PTQ yields a model with accuracy (69.569) close to that of the baseline (69.788) while making the model lighter.

Closing Thoughts

Like the INT8 format, the FP8 format enhances memory efficiency and computational speed in AI hardware architectures. Recent AI hardware supports the FP8 format and provides dedicated acceleration, maximizing AI model performance and offering new possibilities.

Whether FP8 or INT8 is better depends on the application domain, model characteristics, and hardware specifications. OwLite offers diverse quantization options, helping users find the optimal quantization method for their models. Unsatisfied with your INT8 quantization results? Why not try FP8 quantization?

Try the FP8 Format with OwLite Now!

References

[1] Micikevicius, P., Stosic, D., Judd, P., Kamalu, J., Oberman, S., Shoeybi, M., Siu, M., Wu, H., Burgess, N., Ha, S., Grisenthwaite, R., Mellempudi, N., Cornea, M., Heinecke, A., and Dubey, P., “FP8 Formats for Deep Learning,” arXiv:2209.05433, 2022.
