
Quantizing a DistilBERT Humor NLP Model

Going from FP32 to INT8 for Faster Inference with Optimum Intel and Intel Neural Compressor

Benjamin Consolvo
2 min read · Dec 12, 2022


Quantization is a model compression technique that helps reduce inference time. Going from 32-bit floating point (FP32) to 8-bit integer (INT8) improves performance with only a slight drop in accuracy. Generally, the model size shrinks by 4–5x, and inference time can be reduced by more than 3x (source).

We must first import a few more libraries for quantization, including neural_compressor and optimum.intel:
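As a rough sketch, assuming the 2022-era optimum.intel API used in the repository's text-classification example (the exact class names depend on the installed versions of optimum-intel and neural-compressor), the extra imports look something like this:

```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments

# Quantization classes from optimum-intel, which wraps Intel Neural Compressor
from optimum.intel.neural_compressor import (
    IncOptimizer,
    IncQuantizationConfig,
    IncQuantizer,
    IncTrainer,
)
```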

I’m setting the torch device to cpu:
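Quantization with Intel Neural Compressor targets CPU inference here, so the device is simply:

```
# INT8 inference in this workflow runs on the CPU
device = torch.device("cpu")
```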

We can now apply quantization. Take a look at the text classification examples in the optimum.intel GitHub repository to learn more. Let’s load a previously trained model:
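A minimal sketch of loading the fine-tuned FP32 checkpoint with the standard transformers API (the local path below is a placeholder for wherever the trained humor model was saved):

```
# Placeholder path to the fine-tuned FP32 DistilBERT humor classifier
model_name_or_path = "./distilbert-humor-fp32"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(model_name_or_path)
model.to(device)
```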

I set up the trainer using IncTrainer, the Intel® Neural Compressor Trainer class:
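IncTrainer is used as a drop-in replacement for the transformers Trainer. A sketch, assuming the tokenized train/eval datasets and the compute_metrics function come from the earlier fine-tuning step:

```
training_args = TrainingArguments(output_dir="./results", do_eval=True)

# IncTrainer mirrors the transformers Trainer interface;
# train_dataset, eval_dataset, and compute_metrics are assumed
# to exist from the earlier fine-tuning step
trainer = IncTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
```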

Everything is now in place to run quantization. Note that I’m using a configuration file called quantization.yml downloaded from the optimum.intel GitHub repository. You can change the configuration parameters in this file to adjust how the model is quantized. Here, I am applying post-training dynamic quantization, a technique that does not require retraining, but usually sacrifices a little accuracy:
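A sketch of loading that configuration and building the quantizer, assuming quantization.yml sits in the working directory and that the trainer reports its metric under the eval_accuracy key:

```
# Load quantization settings (post-training dynamic quantization)
# from the local quantization.yml file
quantization_config = IncQuantizationConfig.from_pretrained(
    ".", config_file_name="quantization.yml"
)

# The quantizer uses an evaluation function to check accuracy
# against the tolerance defined in the configuration file
def eval_func(model_to_eval):
    trainer.model = model_to_eval
    metrics = trainer.evaluate()
    return metrics["eval_accuracy"]

quantizer = IncQuantizer(quantization_config, eval_func=eval_func)
```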

We can then go ahead and run quantization with optimizer.fit():
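Something like the following, where IncOptimizer wraps the FP32 model together with the quantizer (again a sketch based on that example code):

```
# Wrap the FP32 model and the quantizer, then run quantization
optimizer = IncOptimizer(model, quantizer=quantizer)
quantized_model = optimizer.fit()
```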

Now that we have an optimized model, we can compare it to the baseline model:
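One simple way to compare them is to time an evaluation pass with each model; a sketch reusing the trainer from above (the eval_accuracy key is again an assumption tied to compute_metrics):

```
import time

def timed_eval(eval_model, label):
    # Point the trainer at the model under test and time a full evaluation pass
    trainer.model = eval_model
    start = time.time()
    metrics = trainer.evaluate()
    elapsed = time.time() - start
    print(f"{label}: accuracy={metrics['eval_accuracy']:.4f}, eval time={elapsed:.1f}s")

timed_eval(model, "FP32 baseline")
timed_eval(quantized_model, "INT8 quantized")
```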

We can save the quantized model to disk so that both the FP32 and the newly quantized INT8 models are available:
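A sketch of saving the INT8 model next to the FP32 checkpoint, assuming the optimizer's save_pretrained method from the same example (the output directory name is a placeholder):

```
# Save the quantized INT8 model to its own directory (placeholder path)
optimizer.save_pretrained("./distilbert-humor-int8")
```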

INT8 model saved to disk

Running evaluation with the FP32 model, we get:

And for INT8:

Accuracy is roughly the same, but the INT8 model is 1.5x faster than the FP32 model.

To learn more about Intel’s AI hardware and software solutions, visit here.

