This post was written by Morgan Funtowicz, Machine Learning Engineer from Hugging Face, and Yufeng Li, Senior Software Engineer from Microsoft.
Transformer models used for natural language processing (NLP) are big. BERT-base-uncased has ~110 million parameters, RoBERTa-base has ~125 million parameters, and GPT-2 has ~117 million parameters. Each parameter is a floating-point number that requires 32 bits (FP32), so BERT-base-uncased alone needs roughly 440 MB for its weights. The model files are correspondingly large, as is the memory they consume, not to mention all the computation that needs to happen on all these bits.
These challenges make it difficult to run transformer models on client devices with limited memory and compute resources. Growing awareness of privacy and data transfer costs makes on-device inferencing appealing. Even in the cloud, latency and cost are very important, and any large-scale application needs to optimize for them.
Quantization and distillation are two techniques commonly used to deal with these size and performance challenges. These techniques are complementary and can be used together. Distillation was covered in a previous blog post by Hugging Face. Here we discuss quantization, which can be applied to your models easily and without retraining. This work builds on the optimized inference with ONNX Runtime we previously shared and can give you an additional performance boost as well as unblock inferencing on client devices.
Quantization approximates floating-point numbers with lower bit width numbers, dramatically reducing memory footprint and accelerating performance. Quantization can introduce accuracy loss since fewer bits limit the precision and range of values. However, researchers have extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring significant loss in accuracy.
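To make the idea concrete, here is a minimal sketch of asymmetric (affine) 8-bit quantization in plain Python. It illustrates the general technique only: the function names are hypothetical, and this is not how ONNX Runtime implements quantization internally, where the work is done on whole tensors with optimized kernels.

```python
def compute_qparams(values, qmin=0, qmax=255):
    """Derive a scale and zero point mapping the value range onto [qmin, qmax]."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax], clamping out-of-range results."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(q - zero_point) * scale for q in qvalues]


weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
scale, zero_point = compute_qparams(weights)
qweights = quantize(weights, scale, zero_point)
restored = dequantize(qweights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_error:.5f}")
```

The round trip loses at most about one quantization step (`scale`) per value, which is the precision loss the paragraph above refers to.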
Compared to FP32, INT8 representation reduces data storage and bandwidth by 4x, which also reduces energy consumed. In terms of inference performance, integer computation is more efficient than floating-point math.
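The 4x figure follows directly from the number of bytes per value. As a rough back-of-the-envelope estimate, using the approximate parameter counts quoted earlier and counting weights only (file format overhead and any layers left in FP32 are ignored):

```python
# Approximate parameter counts quoted earlier in this post.
PARAM_COUNTS = {
    "bert-base-uncased": 110_000_000,
    "roberta-base": 125_000_000,
    "gpt2": 117_000_000,
}

BYTES_PER_VALUE = {"FP32": 4, "INT8": 1}


def weights_size_mb(num_params, dtype):
    """Size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_VALUE[dtype] / 1e6


for name, count in PARAM_COUNTS.items():
    fp32 = weights_size_mb(count, "FP32")
    int8 = weights_size_mb(count, "INT8")
    print(f"{name}: FP32 ~{fp32:.0f} MB, INT8 ~{int8:.0f} MB ({fp32 / int8:.0f}x smaller)")
```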
Performance varies with the input data and the hardware. For online inferencing, a small batch size (number of inputs) is common. The sequence lengths (size of input) vary based on the scenario. In our benchmark, we measured batch sizes of 1 and 4 with sequence lengths ranging from 4 to 512. Modern CPUs support the Advanced Vector Extensions 2 (AVX2) instruction set for high performance computing. The latest Intel CPUs also support AVX512 Vector Neural Network Instructions (AVX512 VNNI), which are designed to accelerate deep learning INT8 inference performance. We benchmarked performance for BERT-base-uncased, RoBERTa-base, and GPT-2 on two machines:
- AVX2: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
- VNNI: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
For PyTorch, we used PyTorch 1.6 with TorchScript. For PyTorch + ONNX Runtime, we used Hugging Face’s convert_graph_to_onnx method and inferenced with ONNX Runtime 1.4.
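If you want to run a similar comparison on your own hardware, a minimal timing harness could look like the sketch below. It is illustrative only and not the script behind the numbers in this post: `measure_latency_ms` and `fake_model` are hypothetical names, and in practice you would pass a callable that runs your PyTorch or ONNX Runtime model on an input of the chosen batch size and sequence length.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=5, runs=50):
    """Return the mean and 95th-percentile latency of run_once() in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up calls are excluded from the measurement
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.mean(samples), p95


def fake_model(batch_size, seq_len):
    """Stand-in workload; replace with a real model call."""
    return sum(i * i for i in range(batch_size * seq_len * 10))


for batch_size in (1, 4):
    for seq_len in (4, 128, 512):
        mean_ms, p95_ms = measure_latency_ms(
            lambda: fake_model(batch_size, seq_len), runs=20
        )
        print(f"batch={batch_size} seq_len={seq_len} mean={mean_ms:.3f} ms p95={p95_ms:.3f} ms")
```

Warm-up iterations matter here because the first few calls often include one-time costs such as memory allocation, which would otherwise skew the averages.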
We saw significant performance gains compared to the original model by using ONNX Runtime’s quantization:
The speedup over the original PyTorch model comes from both quantization and acceleration by ONNX Runtime. Let’s see how this breaks down. Compared with ONNX Runtime FP32, we saw that ONNX Runtime INT8 quantization can accelerate inference performance by up to 6x for all three models on the VNNI machine. We saw smaller, but still significant, speedups on the AVX2 machine.
Using ONNX Runtime INT8 quantization consistently showed performance gains compared to using PyTorch INT8 quantization on both the AVX2 and VNNI machines:
Our detailed data is shared at the end of this post.
After converting the original PyTorch FP32 model to ONNX FP32 format, the model size was almost the same, as expected. Then we applied the respective INT8 quantization process on both models. ONNX Runtime was able to quantize more of the layers and reduced model size by almost 4x, yielding a model about half as large as the quantized PyTorch model.
Don’t forget about accuracy
Smaller and faster is great, but we also need to make sure the model is returning good results. Given that accuracy is task-specific, we took a fine-tuned BERT model for accuracy benchmarking. This model is fine-tuned using the BERT-base-uncased model in Hugging Face Transformers for the Microsoft Research Paraphrase Corpus (MRPC) task in the General Language Understanding Evaluation benchmark (GLUE). MRPC is a common NLP task for sentence pair classification.
Accuracy measures the number of correct predictions as a fraction of all predictions. It’s not a complete measure since it does not work well when the cost of false negatives is high. So we also calculate the F1 score, which takes into account both precision and recall. This is more useful when you care more about the positive class. Compared to PyTorch quantization, even with a smaller model, ONNX Runtime quantization showed the same accuracy and a slightly higher F1 score.
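For reference, both metrics can be computed from predicted and true labels in a few lines. The sketch below uses made-up toy labels rather than MRPC results, and the function name is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and F1 (for the positive class) on binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels for illustration only (1 = paraphrase, 0 = not a paraphrase).
labels = [1, 1, 1, 0, 0, 1]
predictions = [1, 0, 1, 0, 1, 1]
acc, f1 = accuracy_and_f1(labels, predictions)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```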
We hope you are intrigued to try this yourself. Here are the instructions to get started quantizing your Hugging Face models to reduce size and speed up inference.
Step 1: Export your Hugging Face Transformer model to ONNX
The Hugging Face Transformers library includes a tool to easily make use of ONNX Runtime. The convert_graph_to_onnx.py script is located directly at the root of the Transformers repository and takes a few arguments such as the model to be exported and the framework you want to export from (PyTorch or TensorFlow) to generate the associated ONNX graph.
In conjunction with the quantization support in the ONNX Runtime 1.4 release, we also updated the Hugging Face Transformers conversion script and added a new command line argument --quantize to easily export quantized ONNX models directly from Transformers:
python convert_graph_to_onnx.py --framework pt --model bert-base-uncased --quantize bert-base-uncased.onnx
This will output both the full precision ONNX model and the quantized ONNX model.
Note: the --quantize option currently requires the model to be smaller than 2GB. The next ONNX Runtime release will remove this limit.
You can find more information in the Hugging Face documentation.
Step 2: Inference with ONNX Runtime
import onnxruntime
session = onnxruntime.InferenceSession(onnx_model_path)
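A fuller sketch of this step might look like the following. It assumes the `onnxruntime` and `transformers` packages are installed and that a model exported in Step 1 exists at the path you pass in. The helper names (`build_feed`, `run_inference`) and the default tokenizer name are illustrative choices, not part of either library.

```python
def build_feed(encoded, expected_names):
    """Keep only the tokenizer outputs that the ONNX graph declares as inputs."""
    missing = [name for name in expected_names if name not in encoded]
    if missing:
        raise KeyError(f"tokenizer output lacks model inputs: {missing}")
    return {name: encoded[name] for name in expected_names}


def run_inference(onnx_model_path, text, tokenizer_name="bert-base-uncased"):
    """Tokenize text and run it through an exported ONNX model."""
    # Imported lazily so build_feed can be used without these packages installed.
    import onnxruntime
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    session = onnxruntime.InferenceSession(onnx_model_path)

    # return_tensors="np" yields NumPy arrays, which ONNX Runtime accepts directly.
    encoded = dict(tokenizer(text, return_tensors="np"))
    input_names = [node.name for node in session.get_inputs()]

    # Passing None as the first argument returns every model output.
    return session.run(None, build_feed(encoded, input_names))
```

Filtering the tokenizer output against `session.get_inputs()` guards against a mismatch between what the tokenizer produces and what the exported graph expects, since ONNX Runtime rejects input names the graph does not declare.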
ONNX Runtime INT8 quantization shows very promising results for both performance acceleration and model size reduction on Hugging Face transformer models. We’d love to hear any feedback or suggestions as you try it in your production scenarios. You can also participate in our GitHub repos (Hugging Face Transformers library and ONNX Runtime).
So far, we’ve been discussing inference optimizations. In future blogs we’ll cover training optimizations to help you significantly reduce the time it takes to train and fine-tune your NLP models.
Latencies below are measured in milliseconds. PyTorch refers to PyTorch 1.6 with TorchScript. PyTorch + ONNX Runtime refers to PyTorch versions of Hugging Face models exported and inferenced with ONNX Runtime 1.4.