Quantizing a DistilBERT Humor NLP Model
Going from FP32 to INT8 for Faster Inference with Optimum Intel and Intel Neural Compressor
Quantization is a model compression technique that helps reduce inference time. Going from 32-bit floating-point (FP32) to 8-bit integer (INT8) precision improves performance with only a slight drop in accuracy. Generally, the model size shrinks by 4–5x, and inference time can be reduced by more than 3x.
We must first import a few more libraries for quantization, including neural_compressor and optimum.intel:
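A minimal sketch of those imports, assuming the older optimum.intel Neural Compressor integration this post is based on (class names such as IncTrainer and IncQuantizer have been renamed in later optimum-intel releases, so adjust to your installed version; neural_compressor itself is pulled in as a dependency of optimum.intel):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from optimum.intel.neural_compressor import (
    IncOptimizer,            # drives the overall quantization run
    IncQuantizationConfig,   # wraps the settings in quantization.yml
    IncQuantizer,            # Intel Neural Compressor quantizer
    IncTrainer,              # drop-in replacement for the transformers Trainer
)
```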
I’m setting the torch device to cpu:
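For example:

```python
# INT8 inference with Neural Compressor targets Intel CPUs, so everything runs on the CPU.
device = torch.device("cpu")
```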
We can now apply quantization. Take a look at the text classification examples in the optimum.intel GitHub repository to learn more. Let’s load a previously trained model:
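A sketch of loading it, where the checkpoint directory is just a placeholder for wherever the fine-tuned FP32 model was saved earlier:

```python
model_dir = "./distilbert-humor-fp32"   # placeholder path to the fine-tuned FP32 checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir).to(device)
```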
I set up the trainer to use IncTrainer, the Intel® Neural Compressor Trainer class:
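Roughly along these lines; IncTrainer accepts the same arguments as the standard transformers Trainer, and train_dataset/eval_dataset are assumed to be the tokenized humor dataset splits prepared earlier:

```python
training_args = TrainingArguments(output_dir="./inc_output", do_eval=True)  # illustrative arguments

trainer = IncTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # tokenized training split prepared earlier (assumed)
    eval_dataset=eval_dataset,     # tokenized validation split prepared earlier (assumed)
    tokenizer=tokenizer,
)
```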
Everything is now in place to run quantization. Note that I’m using a configuration file called quantization.yml, downloaded from the optimum.intel GitHub repository. You can change the configuration parameters in this file to adjust how the model is quantized. Here, I am applying post-training dynamic quantization, a technique that does not require retraining, but usually sacrifices a little accuracy:
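A sketch of wiring that together with the older IncQuantizationConfig/IncQuantizer API; the local path to quantization.yml and the "eval_accuracy" metric name are assumptions:

```python
# Load the post-training dynamic quantization settings from quantization.yml.
quantization_config = IncQuantizationConfig.from_pretrained(
    ".",                                   # directory holding the downloaded file (assumption)
    config_file_name="quantization.yml",
)

# Neural Compressor calls eval_func to verify the INT8 model meets the accuracy criterion.
def eval_func(model):
    trainer.model = model
    metrics = trainer.evaluate()
    return metrics["eval_accuracy"]        # metric name assumed from the trainer's compute_metrics

quantizer = IncQuantizer(quantization_config, eval_func=eval_func)
optimizer = IncOptimizer(model, quantizer=quantizer)
```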
We can then go ahead and run quantization with optimizer.fit():
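For example:

```python
# Applies post-training dynamic quantization and returns the INT8 model.
quantized_model = optimizer.fit()
```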
Now that we have an optimized model, we can compare it to the baseline model:
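One simple way to compare them is a rough CPU latency check on a single input; this timing helper is purely illustrative and not the post's original benchmark:

```python
import time

def latency_ms(m, n_runs=100):
    """Average per-inference latency in milliseconds on one sample (illustrative only)."""
    sample = tokenizer("Why did the chicken cross the road?", return_tensors="pt")
    m.eval()
    with torch.no_grad():
        start = time.time()
        for _ in range(n_runs):
            m(**sample)
    return (time.time() - start) / n_runs * 1000

fp32_model = AutoModelForSequenceClassification.from_pretrained(model_dir)  # untouched FP32 baseline
print(f"FP32: {latency_ms(fp32_model):.1f} ms | INT8: {latency_ms(quantized_model):.1f} ms")
```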
We can save the quantized model to disk, so that both the FP32 model and the newly quantized INT8 model are available:
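Something along these lines; the exact save/load helpers have changed across optimum-intel releases, so treat the directory name and the reload class as a sketch:

```python
int8_dir = "./distilbert-humor-int8"     # illustrative output directory
optimizer.save_pretrained(int8_dir)      # writes the INT8 weights and quantization config

# The INT8 model can later be reloaded with the INC-aware model class.
from optimum.intel.neural_compressor import IncQuantizedModelForSequenceClassification
int8_model = IncQuantizedModelForSequenceClassification.from_pretrained(int8_dir)
```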
Running evaluation with the FP32 model, we get:
And for INT8:
Accuracy is roughly the same, but the INT8 model is 1.5x faster than the FP32 model.
To learn more about Intel’s AI hardware and software solutions, visit here.