Accelerate Big Transfer (BiT) Model Even More with Quantization using OpenVINO and Neural Network Compression Framework (NNCF)

Published in OpenVINO™ toolkit · 7 min read · Dec 1, 2023

Authors: Pradeep Sakhamoori, Ravi Panchumarthy, Nico Galoppo.

1. Introduction

In the first part of this blog series, we discussed how to use Intel®’s OpenVINO™ toolkit to accelerate inference of the Big Transfer (BiT) model for computer vision tasks. We covered the process of importing the BiT model into the OpenVINO environment, leveraging hardware optimizations, and benchmarking performance. Our results showcased significant performance gains and reduced inference latency for BiT when using OpenVINO compared to the original TensorFlow implementation. With this strong base result in place, there’s still room for further optimization. In this second part, we will further enhance BiT model inference with the help of OpenVINO and Neural Network Compression Framework (NNCF) and low precision (INT8) inference. NNCF provides sophisticated tools for neural network compression through quantization, pruning, and sparsity techniques tailored for deep learning inference. This allows BiT models to become viable for power and memory-constrained environments where the original model size may be prohibitive. The techniques presented will be applicable to many deep learning models beyond BiT.

2. Model Quantization

Model quantization is an optimization technique that reduces the precision of the weights and activations in a neural network. It converts 32-bit floating-point representations (FP32) to lower bit-widths such as 16-bit floats (FP16), 8-bit integers (INT8), or 4-bit integers (INT4). The key benefit is efficiency: smaller model size and faster inference. These improvements not only increase efficiency on server platforms but, more importantly, enable deployment onto resource-constrained edge devices. Quantization transforms models from being restricted to data centers to being deployable even on low-power devices with limited compute or memory, massively expanding the reach of AI to the true edge.
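To make the precision-reduction idea concrete, here is a minimal, self-contained sketch of affine INT8 quantization applied to a toy weight array. The values and the scale/zero-point scheme shown are illustrative only; they are not taken from the BiT model or from NNCF's internals.

```python
import numpy as np

# Toy weight values (hypothetical, not BiT weights).
weights = np.array([-1.2, -0.5, 0.0, 0.7, 1.5], dtype=np.float32)

# Map the observed float range [min, max] onto the INT8 range [-128, 127].
qmin, qmax = -128, 127
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = int(round(qmin - weights.min() / scale))

# Quantize: float -> int8, then dequantize to inspect the rounding error.
q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
deq = (q.astype(np.float32) - zero_point) * scale

print(q)    # compact int8 representation (1 byte per weight instead of 4)
print(deq)  # approximate reconstruction of the original floats
```

Each weight now occupies one byte instead of four, and the dequantized values differ from the originals by at most roughly half the quantization step, which is the trade-off the "Precision Reduction" and "Trade-offs" bullets below describe.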

Below are a few of the key model quantization concepts:

  • Precision Reduction — Decreases the number of bits used to represent weights and activations. Common bit-widths: INT8, FP16. Enables smaller models.
  • Efficiency — Compressed models are smaller and faster, leading to efficient system resource utilization.
  • Trade-offs — Balancing model compression, speed, and accuracy for target hardware. The goal is to optimize across all fronts.
  • Techniques — Post-training quantization and quantization-aware training; the latter bakes resilience to lower precision into the training process itself.
  • Schemes — Quantization strategies like weight, activation, or combined methods strike a balance between compressing models and preserving accuracy.
  • Preserving Accuracy — Fine-tuning, calibration, and retraining help maintain model quality on real-world data.

3. Neural Network Compression Framework (NNCF)

NNCF is a powerful tool for optimizing deep learning models, such as the Big Transfer (BiT) model, to achieve improved performance on various hardware, ranging from edge to data center. It provides a comprehensive set of features and capabilities for model optimization, making it easy for developers to optimize models for low-precision inference. Some of the key capabilities include:

  • Support for a variety of post-training and training-time algorithms with minimal accuracy drop.
  • Seamless combination of pruning, sparsity, and quantization algorithms.
  • Support for a variety of models: NNCF can be used to optimize models from a variety of frameworks, including TensorFlow, PyTorch, ONNX, and OpenVINO.

NNCF provides samples that demonstrate the usage of compression algorithms for different use cases and models, and the compression results achievable with these samples are listed on the Model Zoo page. For more details, refer to the NNCF documentation.

4. BiT Classification Model Optimization with OpenVINO™

Note: Before proceeding with the following steps, ensure you have a conda environment set up. Refer to this blog post for detailed instructions on setting up the conda environment.

4.1. Download the BiT_M_R50x1_1 TensorFlow classification model:

wget "https://tfhub.dev/google/bit/m-r50x1/1?tf-hub-format=compressed" \
  -O bit_m_r50x1_1.tar.gz

mkdir -p bit_m_r50x1_1 && tar -xvf bit_m_r50x1_1.tar.gz -C bit_m_r50x1_1

4.2. OpenVINO™ Model Optimization:

Execute the below command inside the conda environment to generate OpenVINO IR model files (.xml and .bin) for the bit_m_r50x1_1 model. These model files will be used for further optimization and for inference accuracy validation in subsequent sections.

ovc ./bit_m_r50x1_1 --output_model ./bit_m_r50x1_1/ov/fp32/bit_m_r50x1_1 \
  --compress_to_fp16 False

5. Data Preparation

To evaluate the accuracy impact of quantization on our BiT model, we need a suitable dataset. For this, we leverage the ImageNet 2012 validation set which contains 50,000 images across 1000 classes. The ILSVRC2012 validation ground truth is used for cross-referencing model predictions during accuracy measurement.

By testing our compressed models on an established dataset like the ImageNet validation set, we can better understand the real-world utility of our optimizations. Maintaining maximal accuracy while minimizing resource utilization is crucial for edge deployment, and this dataset provides a rigorous, unbiased means to validate those trade-offs.

Note: Accessing and downloading the ImageNet dataset requires registration.
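As a sketch of this preparation step, the helper below pairs each validation image with its ground-truth label. It assumes the standard ILSVRC2012 layout (images named ILSVRC2012_val_00000001.JPEG onward, plus a ground-truth text file with one integer label per line in the same order); the function name and layout are illustrative and not taken from the Appendix A script.

```python
from pathlib import Path

def load_imagenet_val(dataset_dir, gt_labels_file):
    """Pair each ILSVRC2012 validation image with its ground-truth label.

    Assumes the standard layout: images named ILSVRC2012_val_*.JPEG and a
    ground-truth file holding one integer label per line, in the same
    order as the sorted image names (illustrative; adjust to your copy).
    """
    labels = [int(line) for line in Path(gt_labels_file).read_text().split()]
    images = sorted(Path(dataset_dir).glob("*.JPEG"))
    assert len(images) == len(labels), "image/label count mismatch"
    return list(zip(images, labels))
```

Sorting the image paths is what keeps them aligned with the line order of the ground-truth file, since the validation images are numbered sequentially.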

6. Quantization using NNCF

In this section, we will delve into the specific steps involved in quantizing the BiT model using NNCF. The quantization process involves preparing a calibration dataset and applying 8-bit quantization to the model, followed by accuracy evaluation.

6.1. Preparing Calibration Dataset:

In this step, create an instance of the nncf.Dataset class to represent the calibration dataset. nncf.Dataset can wrap the framework dataset object used for model training or validation. Below is a sample code snippet of the nncf.Dataset() call with transformed data samples.

# Split the TF dataset to obtain samples for NNCF calibration
img2012_val_split = get_val_data_split(
    tf_dataset_,
    train_split=0.7,
    val_split=0.3,
    shuffle=True,
    shuffle_size=50000,
)

img2012_val_split = img2012_val_split.map(nncf_transform).batch(BATCH_SIZE)

calibration_dataset = nncf.Dataset(img2012_val_split)

The transformation function takes a sample from the dataset and returns data that can be passed to the model for inference. Below is the code snippet of the data transform.

# Data transform function for NNCF calibration
def nncf_transform(image, label):
    image = tf.io.decode_jpeg(tf.io.read_file(image), channels=3)
    image = tf.image.resize(image, IMG_SIZE)
    return image

6.2. NNCF Quantization (FP32 to INT8):

Once the calibration dataset is prepared and the model object is instantiated, the next step is to apply 8-bit quantization to the model. This is achieved with the nncf.quantize() API, which takes the OpenVINO FP32 model generated in the previous steps along with the calibration dataset to initiate the quantization process. While nncf.quantize() provides numerous advanced configuration knobs, in many cases like this one it works out of the box or with minor adjustments. Below is a sample code snippet of the nncf.quantize() API call.

ov_quantized_model = nncf.quantize(
    ov_model,
    calibration_dataset,
    fast_bias_correction=False,
)

For further details, the official documentation provides a comprehensive guide on the basic quantization flow, including setting up the environment, preparing the calibration dataset, and calling the quantization API to apply 8-bit quantization to the model.

6.3. Accuracy Evaluation

As a result of the NNCF model quantization process, an OpenVINO INT8 quantized model is generated. To evaluate the impact of quantization on model accuracy, we benchmark the original FP32 model against the quantized INT8 model by measuring the accuracy of the BiT model (m-r50x1/1) on the ImageNet 2012 validation dataset. The accuracy evaluation results are shown in Table 1.
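The comparison reduces to computing top-1 accuracy for each model variant over the validation set. A minimal illustrative helper (not the exact code from the Appendix A script) might look like:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Top-1 accuracy: fraction of samples whose argmax matches the label.

    `logits` has shape (N, num_classes); `labels` has shape (N,).
    Illustrative helper mirroring the metric used to compare the FP32
    and INT8 BiT models.
    """
    preds = np.argmax(logits, axis=1)
    return float(np.mean(preds == labels))
```

Running the logits produced by each model variant through such a helper gives directly comparable accuracy numbers for the FP32 and INT8 models.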

Table 1: Classification accuracy of BiT_m_r50x1_1 model on the ImageNet 2012 Validation dataset.

With TensorFlow (FP32) to OpenVINO™ (FP32) model optimization, the classification accuracy remained consistent at 0.70154, confirming that conversion to the OpenVINO™ model representation does not affect accuracy. Furthermore, with NNCF quantization to an 8-bit integer model, accuracy was impacted by less than 0.03%, demonstrating that the quantization process did not compromise the model’s classification abilities.

Refer to Appendix A for the Python script bit_ov_model_quantization.py, which includes data preparation, model optimization, NNCF quantization tasks, and accuracy evaluation.

The usage of the bit_ov_model_quantization.py script is as follows:

$ python bit_ov_model_quantization.py --help
usage: bit_ov_model_quantization.py [-h] [--inp_shape INP_SHAPE] --dataset_dir DATASET_DIR --gt_labels GT_LABELS --bit_m_tf BIT_M_TF --bit_ov_fp32 BIT_OV_FP32
                                    [--bit_ov_int8 BIT_OV_INT8]

BiT Classification model quantization and accuracy measurement

required arguments:
  --dataset_dir DATASET_DIR
                        Directory path to ImageNet2012 validation dataset
  --gt_labels GT_LABELS
                        Path to ImageNet2012 validation ds gt labels file
  --bit_m_tf BIT_M_TF   Path to BiT TF fp32 model file
  --bit_ov_fp32 BIT_OV_FP32
                        Path to BiT OpenVINO fp32 model file

optional arguments:
  -h, --help            show this help message and exit
  --inp_shape INP_SHAPE
                        N,W,H,C
  --bit_ov_int8 BIT_OV_INT8
                        Path to save BiT OpenVINO INT8 model file

7. Conclusion

The results emphasize the efficacy of OpenVINO™ and NNCF in optimizing model efficiency while minimizing computational requirements. The ability to achieve remarkable performance and accuracy retention, particularly when compressing models to INT8 precision, demonstrates the practicality of leveraging OpenVINO™ for deployment in various environments including resource-constrained environments. NNCF proves to be a valuable tool for practitioners seeking to balance model size and computational efficiency without substantial compromise on classification accuracy, opening avenues for enhanced model deployment across diverse hardware configurations.

Notices & Disclaimers:

Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details.

No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Additional Resources:

  1. OpenVINO Getting Started
  2. OpenVINO Model Optimization
  3. OpenVINO Notebooks
  4. NNCF
  5. Part 1: Accelerate Big Transfer (BiT) model inference with Intel® OpenVINO™

Appendix A

· Software Configuration:

· ILSVRC2012 ground truth: ground_truth_ilsvrc2012_val.txt

· See bit_ov_model_quantization.py below for the BiT model quantization pipeline with NNCF described in this blog
